Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ability to customize security group rules #392

Closed
randomvariable opened this issue Nov 20, 2018 · 35 comments · Fixed by #4228
Closed

Ability to customize security group rules #392

randomvariable opened this issue Nov 20, 2018 · 35 comments · Fixed by #4228
Assignees
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

@randomvariable
Copy link
Member

/kind feature

Describe the solution you'd like
#341 adds Calico rules to the security groups we apply to the cluster.
We also additionally include a bunch of "useful" defaults in the bootstrap cloudformation.

How do we manage the lifecycle of AWS resources which are critical to cluster operation but are being delivered by pluggable components?

Should we extend ProviderConfig for security rules?

Anything else you would like to add:
[Miscellaneous information that will assist in solving the issue.]

Also wondering if the bundler @justinsb and others are working on include the concept of cloud resources?

cc @sethp-nr

Environment:

  • Cluster-api-provider-aws version:
  • Kubernetes version: (use kubectl version):
  • OS (e.g. from /etc/os-release):
@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Nov 20, 2018
@sethp-nr
Copy link
Contributor

It's an interesting question, for sure. My interpretation of the landscape was that this project was a "reasonable default" implementation for a kubernetes cluster on AWS, which in my mind implied picking a CNI provider & wiring it up.

The other implementation strategy I considered was modifying the security group rules to allow "all traffic" between and within the controlplane and node SGs. IMO that's a different but still reasonable "default" spot to be, and it would leave the question of how to network largely up to the choices in the addons.yaml file.

@detiber detiber added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Nov 26, 2018
@timothysc timothysc added this to the Next milestone Jan 4, 2019
@fejta-bot
Copy link

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 28, 2019
@vincepri
Copy link
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 28, 2019
@ncdc ncdc changed the title Handling AWS resources for cluster addons Ability to customize security group rules Jul 1, 2019
@ncdc ncdc modified the milestones: Next, v0.4 Jul 1, 2019
@dlipovetsky
Copy link
Contributor

Some notes:
There are two ways to allow end users to customize rules:

  1. Modify rules in CAPA-reconciled security groups
  2. Add security groups

Security groups are just collections of rules. An instance can have multiple security groups; AWS internally merges them all into one group, which is the most permissive of all the groups.

Currently, CAPA defines groups for specific machine roles (bastion host, control plane, node). The groups contain rules for the CNI plugin and stacked etcd. When an end users wants to use a CNI plugin other than Calico, or use external etcd, then those rules should not exist.

If CAPA can define groups for components, instead of machine roles, then it should be possible for the groups to contain only the necessary set of rules. That implies that end users are free to add security groups with the guarantee that their groups will not break cluster functionality because of the way AWS' internal merge works.

I strongly prefer this to asking end users to modify CAPA-reconciled security groups. If end users are allowed to modify CAPA-reconciled security groups, CAPA itself will need to guarantee that end users are not breaking cluster functionality.

@detiber
Copy link
Member

detiber commented Aug 1, 2019

@dlipovetsky One potential downside to separating out everything into separate groups is that we could start running into AWS limits on the number of Security Groups that can be attached to an instance.

Based on the docs, it looks like we are limited to a maximum of 5 security groups per network interface, which means if we would exceed that, then we would have to start allocating multiple ENIs on instance creation as well as managing the security groups properly across the multiple ENIs.

@dlipovetsky
Copy link
Contributor

@detiber Good point, thanks.

@dlipovetsky
Copy link
Contributor

To increase or decrease this limit, contact AWS Support. The maximum is 16. The limit for security groups per network interface multiplied by the limit for rules per security group cannot exceed 1000. For example, if you increase this limit to 10, we decrease the limit for your number of rules per security group to 100.

Per the docs linked, I want to note that while the default limit is 5 security groups per network interface, the limit can be increased on request, though there's no guarantee it would be.

End users should not have to request limit increases just to use CAPA's default security group configuration. But can end users be asked to request limit increases to be able to use CAP with custom security groups?

@dlipovetsky
Copy link
Contributor

dlipovetsky commented Aug 14, 2019

Another user story: Prometheus is a popular solution for monitoring. The Node Exporter is a popular way of collecting host-level metrics for Prometheus. Frequently, Prometheus "scrapes" (queries) Node Exporter endpoints. One Node Exporter runs on every host, and it runs in the host's network namespace. Therefore Node Exporter serves metrics on the host's IP, on some unprivileged port (typically 9100).

With the default security group rules, a Prometheus Pod running on one host cannot reach a Node Exporter running in a different host.

I'm reaching out to the Prometheus folks to understand the consequence of running Node Exporter in the Pod network namespace and will report back here.

Update: I found some evidence in the Prometheus helm chart for running Node Exporter in the Pod network namespace: helm/charts#12747.

@ncdc
Copy link
Contributor

ncdc commented Aug 26, 2019

This will likely require API changes

@fejta-bot
Copy link

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 24, 2019
@vincepri
Copy link
Member

/remove-lifecycle stale
/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 25, 2019
@benmoss
Copy link

benmoss commented Dec 18, 2019

This is a pain for us trying to add Windows support, since Calico only supports Windows in their proprietary version from Tigera. The main OSS CNI with good Windows support is Flannel, so we're stuck documenting how to create SGs manually. This will also likely break cluster deletion, since you won't be able to delete the VPC when security groups still exist inside of it. It feels like if CAPA is going to create any SGs it should be possible to control what they are.

I'd be interested in working on adding support for this but it's not clear what the API would look like.

@bsideup
Copy link

bsideup commented Dec 6, 2021

@sedefsavas thanks for sharing! Are there any short term plans to implement this issue then? e.g. by accepting #2876 ?

@sedefsavas
Copy link
Contributor

sedefsavas commented Mar 17, 2022

Another customization to cover under this:
#3314

Also,
#3271

@richardcase
Copy link
Member

/remove-lifecycle frozen

@k8s-ci-robot k8s-ci-robot removed the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Jul 12, 2022
@richardcase
Copy link
Member

/milestone clear

@k8s-ci-robot k8s-ci-robot removed this from the v1.5.0 milestone Jul 25, 2022
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 23, 2022
@rymancl
Copy link

rymancl commented Oct 23, 2022

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 23, 2022
@richardcase
Copy link
Member

@belgaied2 - i think we can use this issue for our discussion. Would you be able to add your use case here?

@belgaied2
Copy link

Thanks @richardcase ,

Yes, this issue is of interest to us over at cluster-api-provider-rke2 since RKE2 relies on node registration mechanism that uses a custom port on control plane machines.

Principle is as follows:

  • First control plane machine initializes a cluster and open the port 9345.
  • subsequent nodes (both control plane and workers) will connect to that port to register themselves in the cluster.

This means that, if our provider needs to integrate with CAPA, it needs to customize the securityGroup for the controlplane (which is created by CAPA).

The reason why the existing feature of additionalSecurityGroups on AWSMachineTemplate resources does not work is because these securityGroup(s) need to exist beforehand, when the VPC is not yet created by the AWSCluster controller.

So, at best, we need a way to add a securityRule to the securityGroups that created by CAPA.

@richardcase
Copy link
Member

This seems reasonable to me. I can work on this:

/assign
/lifecycle active

@k8s-ci-robot k8s-ci-robot added the lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. label Jan 18, 2023
@cablunar
Copy link
Contributor

cablunar commented Mar 9, 2023

@richardcase is there any ETA on the implementation? 🙏

@alexander-demicev
Copy link
Contributor

/assign

@alexander-demicev
Copy link
Contributor

@cablunar already working on it

@richardcase
Copy link
Member

/unassign

@richardcase richardcase removed the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Apr 19, 2023
@com6056
Copy link

com6056 commented Aug 25, 2023

Should this be closed if #3271 still hasn't been addressed from what I can tell? Of course feel free to correct me if I'm wrong!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
Projects
None yet
Development

Successfully merging a pull request may close this issue.