
Address Excessive API Server calls from CNI Pods #1419

Merged · 6 commits · May 10, 2021

Conversation

@achevuru (Contributor) commented on Apr 7, 2021:

What type of PR is this?
Enhancement

Which issue does this PR fix:
CNI daemonset pods currently run a controller to handle ENIConfig CRDs along with Node and Pod objects. The CNI code base depends on a rather old version of operator-sdk (v0.0.7), and this, combined with dynamic API group discovery, contributes to a high rate of API server calls from each CNI pod (approximately 8,500 calls per hour per node). As the cluster scales, this becomes a clear problem and needs to be addressed.

What does this PR do / Why do we need it:
This PR addresses the above issue with a two-pronged approach, depending on the resource being accessed (a sketch of the resulting client setup follows the list):

  1. For ENIConfig and Node resources, we now use a Kubernetes client with a cache tied to it, following the list+watch approach for both resources. ENIConfig and Node are read during the ENI creation flow when custom networking is enabled, and we don't want to introduce an API server call into that workflow.

  2. For Pods, we make an API server call only on an as-needed basis.
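To make the split concrete, below is a minimal sketch of the two-client pattern, assuming controller-runtime and client-go; the variable names and the example node/pod names are illustrative, not the identifiers used in this PR.

```go
package main

import (
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func main() {
	cfg := ctrl.GetConfigOrDie()

	// Cached client: the manager runs shared informers (list+watch) for every
	// type read through it, so repeated ENIConfig/Node lookups are served from
	// the local cache instead of hitting the API server each time.
	mgr, err := ctrl.NewManager(cfg, ctrl.Options{})
	if err != nil {
		log.Fatalf("failed to create manager: %v", err)
	}
	cachedClient := mgr.GetClient()

	// Uncached client: Pod reads go straight to the API server, on demand only.
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatalf("failed to create clientset: %v", err)
	}

	ctx := ctrl.SetupSignalHandler()
	go func() {
		if err := mgr.Start(ctx); err != nil {
			log.Fatalf("manager exited: %v", err)
		}
	}()
	// Serve cached reads only after the informer caches have synced.
	mgr.GetCache().WaitForCacheSync(ctx)

	// Cached read (no extra API server round trip once the cache is warm).
	var node corev1.Node
	if err := cachedClient.Get(ctx, client.ObjectKey{Name: "my-node"}, &node); err != nil {
		log.Printf("cached node read failed: %v", err)
	}

	// Direct read (exactly one API server call, made only when needed).
	if _, err := clientset.CoreV1().Pods(metav1.NamespaceSystem).Get(ctx, "example-pod", metav1.GetOptions{}); err != nil {
		log.Printf("pod read failed: %v", err)
	}
}
```

The trade-off mirrors the list above: frequently read, low-churn resources come out of the informer cache, while Pod lookups stay uncached so each node does not have to maintain a watch on pods it rarely needs.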

If an issue # is not available please add repro steps and logs from IPAMD/CNI showing the issue:
With this change, the API server call count drops to roughly 30-35 per hour per node, down from approximately 8,500 per hour per node today.

Testing done on this change:
Verified both regular and custom networking modes.

Will this break upgrades or downgrades. Has updating a running cluster been tested?:
No. Yes, upgrade scenario has been tested.

Does this change require updates to the CNI daemonset config files to work?:
No

Does this PR introduce any user-facing change?:
No

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

cmd/aws-k8s-agent/main.go
  }

- ipamContext, err := ipamd.New(kubeClient, eniConfigController)
+ ipamContext, err := ipamd.New(standaloneK8SClient, k8sClient)
Contributor:

Why does ipamd.New need both cached and uncached client? That seems weird, or a sign that we need cache invalidation somewhere.

cmd/cni-metrics-helper/metrics/cni_metrics_test.go
pkg/k8sapi/k8sutils.go
if err != nil {
	log.Errorf("Failed to create client: %v", err)
	// Check API server connectivity
	if k8sapi.CheckAPIServerConnectivity() != nil {
Contributor:

... why do this, rather than just wait until the first actual apiserver call to effectively do it for us?

(In particular, just because the apiserver was available here doesn't mean it's always going to be available during every other part of execution - so we still have to handle intermittent connectivity issues elsewhere.)

Contributor (author):

Sure, a success here doesn't guarantee that a call a while later will also go through. This was mainly done for parity with the current CNI code flow, where we check API server connectivity and essentially crash if the initial check fails. It helps catch and highlight basic connectivity issues to the API server as part of CNI bootstrap.

https://github.com/aws/amazon-vpc-cni-k8s/blob/master/pkg/k8sapi/discovery.go#L87
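For reference, a minimal sketch of what such a bootstrap connectivity check can look like using client-go's discovery client; the actual CheckAPIServerConnectivity in pkg/k8sapi may be implemented differently.

```go
package k8sapi

import (
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// CheckAPIServerConnectivity issues a single lightweight request (the server
// version endpoint) so that basic reachability problems surface during CNI
// bootstrap rather than on the first real workflow call.
func CheckAPIServerConnectivity() error {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return fmt.Errorf("failed to build in-cluster config: %w", err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return fmt.Errorf("failed to create clientset: %w", err)
	}
	if _, err := clientset.Discovery().ServerVersion(); err != nil {
		return fmt.Errorf("unable to reach API server: %w", err)
	}
	return nil
}
```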


// ENIConfigStatus defines the observed state of ENIConfig
type ENIConfigStatus struct {
	// Fill me
Contributor:

Nit: If ENIConfigStatus is not used, can we remove it?

Contributor (author):

Yeah, these are just the standard Spec and Status objects for CRDs/K8S resources. Although we're not using the Status right now for the ENIConfig CRD, it's good to have both in place (see the sketch below).
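For context, the conventional Spec/Status pairing for a CRD looks roughly like the sketch below; the ENIConfigSpec fields shown are illustrative rather than an exact copy of the repository's type definitions.

```go
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// ENIConfigSpec defines the desired state of ENIConfig.
type ENIConfigSpec struct {
	SecurityGroups []string `json:"securityGroups,omitempty"`
	Subnet         string   `json:"subnet"`
}

// ENIConfigStatus defines the observed state of ENIConfig. It is empty today,
// but keeping it follows the standard CRD convention and leaves room to
// report status later without an API change.
type ENIConfigStatus struct{}

// ENIConfig is the Schema for the eniconfigs API.
type ENIConfig struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   ENIConfigSpec   `json:"spec,omitempty"`
	Status ENIConfigStatus `json:"status,omitempty"`
}
```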

cmd/cni-metrics-helper/metrics/pod_watcher.go
	Namespace: metav1.NamespaceSystem,
}

err := d.k8sClient.List(ctx, &podList, &listOptions)
Contributor:

Big 👍 to the new label selector approach. The apiserver thanks you.
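A minimal sketch of the kind of namespace- and label-scoped List being discussed, assuming a controller-runtime client; the label key/value shown here is illustrative of the filter, not necessarily the exact selector the metrics helper uses.

```go
package metrics

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// listCNIPods returns only the CNI pods in kube-system. Scoping the List by
// namespace and label lets the API server (and any cache in front of it)
// return a handful of objects instead of every pod in the cluster.
func listCNIPods(ctx context.Context, c client.Client) (*corev1.PodList, error) {
	var podList corev1.PodList
	err := c.List(ctx, &podList,
		client.InNamespace(metav1.NamespaceSystem),
		client.MatchingLabels{"k8s-app": "aws-node"},
	)
	if err != nil {
		return nil, err
	}
	return &podList, nil
}
```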

@anguslees (Contributor) left a comment:

lgtm!

@jayanthvn (Contributor) left a comment:

Looks good. Can you please fix the conflicts in ipamd.go?

@achevuru (Contributor, author):

> Looks good. Can you please fix the conflicts in ipamd.go?

Done.
