ControlPlane node is not ready in scalability tests when run on GCE #29500

wojtek-t · 2023-05-11T09:33:18Z

In scalability tests, the control-plane node never becomes Ready.
We usually don't suffer from this, as almost all our tests run 100+ nodes and we tolerate 1% of nodes not being initialized correctly.
But this is problematic for tests like:
https://testgrid.k8s.io/sig-scalability-experiments#watchlist-off

Looking into the kubelet logs, the reason seems to be:

May 11 09:09:13.886270 bootstrap-e2e-master kubelet[2782]: E0511 09:09:13.886233    2782 kubelet.go:2753] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"

FWIW - it seems to be related to some of our preset settings, as e.g.
https://testgrid.k8s.io/sig-scalability-node#node-containerd-throughput

doesn't suffer from it.

@kubernetes/sig-scalability @mborsz @Argh4k
@p0lyn0mial - FYI


wojtek-t · 2023-05-11T09:48:06Z

The only suspicious one that I see in our preset is this one:

  - name: KUBE_GCE_PRIVATE_CLUSTER
    value: "true"

Argh4k · 2023-05-12T11:43:24Z

containerd logs from master:

May 12 08:43:21.379251 bootstrap-e2e-master containerd[650]: time="2023-05-12T08:43:21.379201176Z" level=error msg="failed to load cni during init, please check CRI plugin status before setting up network for pods" error="cni config load failed: no network config found in /etc/cni/net.d: cni plugin not initialized: failed to load cni config"

On nodes we get the CNI config from a template: NetworkPluginConfTemplate:/home/kubernetes/cni.template
On the master it is empty. In the logs from the master I can see that setup-containerd is called from configure-helper, and it should set the template path. My guess would be that https://github.com/kubernetes/kubernetes/blob/master/cluster/gce/gci/configure-helper.sh#L3181 is executed, but this should not be the case.

Argh4k · 2023-05-12T13:03:28Z

I have SSHed onto the master and it looks like all configuration files regarding CNI are in place. kubectl describe node on the master:

Events:
  Type    Reason                Age                 From             Message
  ----    ------                ----                ----             -------
  Normal  RegisteredNode        21m                 node-controller  Node bootstrap-e2e-master event: Registered Node bootstrap-e2e-master in Controller
  Normal  CIDRAssignmentFailed  26s (x56 over 21m)  cidrAllocator    Node bootstrap-e2e-master status is now: CIDRAssignmentFailed

Argh4k · 2023-05-12T13:17:34Z

Kube controller manager logs:

E0512 13:12:32.119653      11 cloud_cidr_allocator.go:315] "Failed to update the node PodCIDR after multiple attempts" err="failed to patch node CIDR: Node \"bootstrap-e2e-master\" is invalid: spec.podCIDRs: Invalid value: []string{\"10.64.0.0/24\", \"10.40.0.2/32\"}: may specify no more than one CIDR for each IP family" node="bootstrap-e2e-master" cidrStrings=["10.64.0.0/24","10.40.0.2/32"]
E0512 13:12:32.119671      11 cloud_cidr_allocator.go:178] "Error updating CIDR" err="failed to patch node CIDR: Node \"bootstrap-e2e-master\" is invalid: spec.podCIDRs: Invalid value: []string{\"10.64.0.0/24\", \"10.40.0.2/32\"}: may specify no more than one CIDR for each IP family" workItem="bootstrap-e2e-master"
E0512 13:12:32.119682      11 cloud_cidr_allocator.go:187] "Exceeded retry count, dropping from queue" workItem="bootstrap-e2e-master"
I0512 13:12:32.119755      11 event.go:307] "Event occurred" object="bootstrap-e2e-master" fieldPath="" kind="Node" apiVersion="v1" type="Normal" reason="CIDRAssignmentFailed" message="Node bootstrap-e2e-master status is now: CIDRAssignmentFailed"
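To make the rejection easier to see, here is a minimal Python sketch that pulls the cidrStrings field out of a log line like the ones above and groups the CIDRs by IP family. The helper names (extract_cidrs, group_by_family) are made up for illustration and are not part of any Kubernetes tooling, and the sample line is abbreviated (the err= field is left out).

```python
import ipaddress
import json
import re
from collections import defaultdict

# Abbreviated copy of the kube-controller-manager log line (err= field omitted).
LOG_LINE = (
    'E0512 13:12:32.119653      11 cloud_cidr_allocator.go:315] '
    '"Failed to update the node PodCIDR after multiple attempts" '
    'node="bootstrap-e2e-master" cidrStrings=["10.64.0.0/24","10.40.0.2/32"]'
)


def extract_cidrs(line):
    """Return the list in the cidrStrings=[...] field, or [] if it is absent."""
    match = re.search(r'cidrStrings=(\[[^\]]*\])', line)
    if match is None:
        return []
    # The field is printed as a JSON-compatible list of quoted strings.
    return json.loads(match.group(1))


def group_by_family(cidrs):
    """Map IP version (4 or 6) to the CIDRs of that family."""
    families = defaultdict(list)
    for cidr in cidrs:
        network = ipaddress.ip_network(cidr, strict=False)
        families[network.version].append(cidr)
    return dict(families)


if __name__ == "__main__":
    for version, cidrs in group_by_family(extract_cidrs(LOG_LINE)).items():
        print(f"IPv{version}: {cidrs}")
```

Both entries land in the IPv4 bucket, which matches the "may specify no more than one CIDR for each IP family" error in the log.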

Argh4k · 2023-05-12T14:00:40Z

Wojtek's gut feeling was right.

@p0lyn0mial if you want, we can create a PR to add:

- --env=KUBE_GCE_PRIVATE_CLUSTER=false

to the tests, and they should work just fine. In the meantime I will try to understand why KUBE_GCE_PRIVATE_CLUSTER makes the master node get two CIDRs.

BenTheElder · 2023-05-12T15:03:15Z

Does it have cloud NAT enabled?

If not, the private network may be having issues fetching, e.g., from registry.k8s.io, which, unlike GCR, isn't a first-party GCP service.

BenTheElder · 2023-05-12T15:08:21Z

cc @aojea re: GCE cidr allocation :-)

aojea · 2023-05-12T15:29:37Z

E0512 13:12:32.119671 11 cloud_cidr_allocator.go:178] "Error updating CIDR" err="failed to patch node CIDR: Node "bootstrap-e2e-master" is invalid: spec.podCIDRs: Invalid value: []string{"10.64.0.0/24", "10.40.0.2/32"}: may specify no more than one CIDR for each IP family" workItem="bootstrap-e2e-master"

#29500 (comment)
@basantsa1989 we have a bug in the allocator
kubernetes/kubernetes@a013c6a

If we receive multiple CIDRs, we should validate that they are dual-stack before patching.

We have to fix it in k/k and in cloud-provider-gcp: https://github.com/kubernetes/cloud-provider-gcp/blob/67d1fd9f7255629fac3adfc956d0c8b2ac5f50f0/pkg/controller/nodeipam/ipam/cloud_cidr_allocator.go#L341-L344
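As a rough illustration of the kind of pre-patch check described here, a Python sketch follows. It is not the allocator's actual code (which is Go); the function name is hypothetical, and the rule it encodes (a single CIDR, or exactly one IPv4 plus one IPv6) is an assumption based on the API error quoted above.

```python
import ipaddress


def is_patchable_cidr_list(cidrs):
    """Return True for one CIDR, or for exactly one IPv4 plus one IPv6 CIDR.

    Hypothetical helper: it mirrors the "no more than one CIDR for each
    IP family" rule from the API error, checked before any patch is sent.
    """
    if not 1 <= len(cidrs) <= 2:
        return False
    versions = [ipaddress.ip_network(c, strict=False).version for c in cidrs]
    # Two entries are acceptable only if their IP families differ.
    return len(set(versions)) == len(versions)


if __name__ == "__main__":
    print(is_patchable_cidr_list(["10.64.0.0/24", "10.40.0.2/32"]))
    print(is_patchable_cidr_list(["10.64.0.0/24", "fd00::/64"]))
```

With a check like this, the pair from the logs would be rejected up front instead of failing repeatedly at the API server.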

Argh4k · 2023-05-12T15:37:02Z

FYI: https://github.com/kubernetes/kubernetes/blob/master/cluster/gce/util.sh#L3008 is the place where we add the master's internal IP as a second alias if we are using KUBE_GCE_PRIVATE_CLUSTER.

This second IP is then picked up by kcm (https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/legacy-cloud-providers/gce/gce_instances.go#L496). The allocator thinks we have dual stack and tries to apply both CIDRs, which fails because a node can have at most one IPv4 CIDR.

aojea · 2023-05-15T12:43:26Z

Kube controller manager logs:

E0512 13:12:32.119653      11 cloud_cidr_allocator.go:315] "Failed to update the node PodCIDR after multiple attempts" err="failed to patch node CIDR: Node \"bootstrap-e2e-master\" is invalid: spec.podCIDRs: Invalid value: []string{\"10.64.0.0/24\", \"10.40.0.2/32\"}: may specify no more than one CIDR for each IP family" node="bootstrap-e2e-master" cidrStrings=["10.64.0.0/24","10.40.0.2/32"]
E0512 13:12:32.119671      11 cloud_cidr_allocator.go:178] "Error updating CIDR" err="failed to patch node CIDR: Node \"bootstrap-e2e-master\" is invalid: spec.podCIDRs: Invalid value: []string{\"10.64.0.0/24\", \"10.40.0.2/32\"}: may specify no more than one CIDR for each IP family" workItem="bootstrap-e2e-master"
E0512 13:12:32.119682      11 cloud_cidr_allocator.go:187] "Exceeded retry count, dropping from queue" workItem="bootstrap-e2e-master"
I0512 13:12:32.119755      11 event.go:307] "Event occurred" object="bootstrap-e2e-master" fieldPath="" kind="Node" apiVersion="v1" type="Normal" reason="CIDRAssignmentFailed" message="Node bootstrap-e2e-master status is now: CIDRAssignmentFailed"

@Argh4k do you have the entire logs?

Argh4k · 2023-05-15T13:37:17Z

@aojea https://gcsweb.k8s.io/gcs/sig-scalability-logs/ci-kubernetes-e2e-gci-gce-scalability-watch-list-off/1658029086385115136/bootstrap-e2e-master/ has all the logs from the master

wojtek-t · 2023-05-25T12:01:32Z

/sig network

aojea · 2023-05-25T13:05:00Z

Based on @basantsa1989's comment kubernetes/kubernetes#118043 (comment), the allocator is working as expected; the problem is that this configuration is not supported:

https://github.com/kubernetes/kubernetes/blob/8db4d63245a89a78d76ff5916c37439805b11e5f/cluster/gce/util.sh#L3008

Can we configure the cluster in a different way, so that we don't pass two CIDRs?

Argh4k · 2023-05-26T08:19:23Z

I hope we can. Unfortunately I haven't had much time to look into this, and other work was unblocked by running tests in a small public cluster.

p0lyn0mial · 2023-07-27T08:39:27Z

@Argh4k Hey, a friendly reminder to work on this issue :)

It looks like having a private cluster would increase egress traffic.
Having higher egress bandwidth would allow us to generate more test traffic. Currently, we had to reduce the test traffic because latency seems to suffer from throttling caused by the limited egress bandwidth.

See kubernetes/perf-tests#2287

k8s-triage-robot · 2024-01-25T08:18:28Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

p0lyn0mial · 2024-01-25T13:22:28Z

I think that this issue still hasn't been resolved

/remove-lifecycle stale

k8s-triage-robot · 2024-04-24T13:56:43Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

p0lyn0mial · 2024-04-25T07:13:42Z

I think that this issue still hasn't been resolved

/remove-lifecycle stale

k8s-triage-robot · 2024-07-24T07:20:33Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

wojtek-t · 2024-07-24T07:58:58Z

/remove-lifecycle stale

BenTheElder · 2024-07-31T05:09:11Z

@aojea thoughts on this?

k8s-ci-robot added the needs-sig label May 11, 2023

aojea mentioned this issue May 16, 2023

cloud_cidr_allocator: don't assume gce cidrs are validated kubernetes/kubernetes#118043

Closed

Argh4k mentioned this issue May 16, 2023

Run watchlist tests on public cluster #29531

Merged

k8s-ci-robot added the sig/network label and removed the needs-sig label May 25, 2023

aojea mentioned this issue May 27, 2023

[Flaky test] gce-master-scale-correctness kubernetes/kubernetes#118295

Closed

k8s-ci-robot added the lifecycle/stale label Jan 25, 2024

k8s-ci-robot removed the lifecycle/stale label Jan 25, 2024

k8s-ci-robot added the lifecycle/stale label Apr 24, 2024

k8s-ci-robot removed the lifecycle/stale label Apr 25, 2024

k8s-ci-robot added the lifecycle/stale label Jul 24, 2024

k8s-ci-robot removed the lifecycle/stale label Jul 24, 2024

BenTheElder mentioned this issue Jul 31, 2024

[Failing Test or Infra] gce-master-scale-performance kubernetes/kubernetes#126366

Closed
