
Address Excessive API Server calls from CNI Pods #1419

Merged · 6 commits · May 10, 2021

Conversation

@achevuru (Contributor) commented on Apr 7, 2021:

What type of PR is this?
Enhancement

Which issue does this PR fix:
CNI daemonset pods currently run a controller to handle ENIConfig CRDs along with Node and Pod objects. The CNI code base depends on a rather old version of operator-sdk (v0.0.7), and this, combined with dynamic API group discovery, contributes to a high rate of API server calls from each CNI pod (approximately 8,500 calls per hour per node). As the cluster scales, this becomes a clear problem and needs to be addressed.

What does this PR do / Why do we need it:
This PR addresses the above issue with a two-pronged approach, depending on the resource being accessed (a sketch of the resulting client setup follows the list):

  1. For ENIConfig and Node resources, we now use a Kubernetes client with a cache tied to it, following the list+watch approach for both resources. ENIConfig and Node are read during the ENI creation flow when custom networking is enabled, and we don't want to introduce an API server call into that workflow.

  2. For Pods, we make an API server call only on an as-needed basis.
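To make the split concrete, below is a minimal sketch of the two-client pattern, assuming controller-runtime and client-go; the variable names and the example node/pod names are illustrative, not the identifiers used in this PR.

```go
package main

import (
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func main() {
	cfg := ctrl.GetConfigOrDie()

	// Cached client: the manager runs shared informers (list+watch) for every
	// type read through it, so repeated ENIConfig/Node lookups are served from
	// the local cache instead of hitting the API server each time.
	mgr, err := ctrl.NewManager(cfg, ctrl.Options{})
	if err != nil {
		log.Fatalf("failed to create manager: %v", err)
	}
	cachedClient := mgr.GetClient()

	// Uncached client: Pod reads go straight to the API server, on demand only.
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatalf("failed to create clientset: %v", err)
	}

	ctx := ctrl.SetupSignalHandler()
	go func() {
		if err := mgr.Start(ctx); err != nil {
			log.Fatalf("manager exited: %v", err)
		}
	}()
	// Serve cached reads only after the informer caches have synced.
	mgr.GetCache().WaitForCacheSync(ctx)

	// Cached read (no extra API server round trip once the cache is warm).
	var node corev1.Node
	if err := cachedClient.Get(ctx, client.ObjectKey{Name: "my-node"}, &node); err != nil {
		log.Printf("cached node read failed: %v", err)
	}

	// Direct read (exactly one API server call, made only when needed).
	if _, err := clientset.CoreV1().Pods(metav1.NamespaceSystem).Get(ctx, "example-pod", metav1.GetOptions{}); err != nil {
		log.Printf("pod read failed: %v", err)
	}
}
```

The trade-off mirrors the list above: frequently read, low-churn resources come out of the informer cache, while Pod lookups stay uncached so each node does not have to maintain a watch on pods it rarely needs.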

If an issue # is not available please add repro steps and logs from IPAMD/CNI showing the issue:
With this change, the API server call count drops to roughly 30-35 per hour per node, down from approximately 8,500 per hour per node today.

Testing done on this change:
Verified both regular and custom networking modes.

Will this break upgrades or downgrades. Has updating a running cluster been tested?:
No. Yes, upgrade scenario has been tested.

Does this change require updates to the CNI daemonset config files to work?:
No

Does this PR introduce any user-facing change?:
No

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

cmd/aws-k8s-agent/main.go
  }

- ipamContext, err := ipamd.New(kubeClient, eniConfigController)
+ ipamContext, err := ipamd.New(standaloneK8SClient, k8sClient)
Contributor:

Why does ipamd.New need both cached and uncached client? That seems weird, or a sign that we need cache invalidation somewhere.

cmd/cni-metrics-helper/metrics/cni_metrics_test.go
pkg/k8sapi/k8sutils.go
if err != nil {
	log.Errorf("Failed to create client: %v", err)
	// Check API server connectivity
	if k8sapi.CheckAPIServerConnectivity() != nil {
Contributor:

... why do this, rather than just wait until the first actual apiserver call to effectively do it for us?

(In particular, just because the apiserver was available here doesn't mean it's always going to be available during every other part of execution - so we still have to handle intermittent connectivity issues elsewhere.)

Contributor (author):

Sure, a success here doesn't guarantee that a call a while later will also go through. This was mainly done for parity with the current CNI code flow, where we check API server connectivity and essentially crash if the initial check fails. It helps catch and highlight basic connectivity issues to the API server as part of CNI bootstrap.

https://github.com/aws/amazon-vpc-cni-k8s/blob/master/pkg/k8sapi/discovery.go#L87
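For reference, a minimal sketch of what such a bootstrap connectivity check can look like using client-go's discovery client; the actual CheckAPIServerConnectivity in pkg/k8sapi may be implemented differently.

```go
package k8sapi

import (
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// CheckAPIServerConnectivity issues a single lightweight request (the server
// version endpoint) so that basic reachability problems surface during CNI
// bootstrap rather than on the first real workflow call.
func CheckAPIServerConnectivity() error {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return fmt.Errorf("failed to build in-cluster config: %w", err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return fmt.Errorf("failed to create clientset: %w", err)
	}
	if _, err := clientset.Discovery().ServerVersion(); err != nil {
		return fmt.Errorf("unable to reach API server: %w", err)
	}
	return nil
}
```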


// ENIConfigStatus defines the observed state of ENIConfig
type ENIConfigStatus struct {
	// Fill me
Contributor:

Nit: If ENIConfigStatus is not used, can we remove it?

Contributor (author):

Yeah, these are just the standard Spec and Status objects for CRDs/K8S resources. Although we're not using the Status right now for the ENIConfig CRD, it's good to have both in place (see the sketch below).
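For context, the conventional Spec/Status pairing for a CRD looks roughly like the sketch below; the ENIConfigSpec fields shown are illustrative rather than an exact copy of the repository's type definitions.

```go
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// ENIConfigSpec defines the desired state of ENIConfig.
type ENIConfigSpec struct {
	SecurityGroups []string `json:"securityGroups,omitempty"`
	Subnet         string   `json:"subnet"`
}

// ENIConfigStatus defines the observed state of ENIConfig. It is empty today,
// but keeping it follows the standard CRD convention and leaves room to
// report status later without an API change.
type ENIConfigStatus struct{}

// ENIConfig is the Schema for the eniconfigs API.
type ENIConfig struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   ENIConfigSpec   `json:"spec,omitempty"`
	Status ENIConfigStatus `json:"status,omitempty"`
}
```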

cmd/cni-metrics-helper/metrics/pod_watcher.go
	Namespace: metav1.NamespaceSystem,
}

err := d.k8sClient.List(ctx, &podList, &listOptions)
Contributor:

Big 👍 to the new label selector approach. The apiserver thanks you.
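A minimal sketch of the kind of namespace- and label-scoped List being discussed, assuming a controller-runtime client; the label key/value shown here is illustrative of the filter, not necessarily the exact selector the metrics helper uses.

```go
package metrics

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// listCNIPods returns only the CNI pods in kube-system. Scoping the List by
// namespace and label lets the API server (and any cache in front of it)
// return a handful of objects instead of every pod in the cluster.
func listCNIPods(ctx context.Context, c client.Client) (*corev1.PodList, error) {
	var podList corev1.PodList
	err := c.List(ctx, &podList,
		client.InNamespace(metav1.NamespaceSystem),
		client.MatchingLabels{"k8s-app": "aws-node"},
	)
	if err != nil {
		return nil, err
	}
	return &podList, nil
}
```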

@anguslees (Contributor) left a comment:

lgtm!

@jayanthvn (Contributor) left a comment:

Looks good. Can you please fix the conflicts in ipamd.go?

@achevuru (Contributor, author):

> Looks good. Can you please fix the conflicts in ipamd.go?

Done.
