
New pods failing to start with FailedCreatePodSandBox warning for CNI versions 1.7.x with Cilium #1265

Closed
YesemKebede opened this issue Oct 19, 2020 · 22 comments

@YesemKebede

YesemKebede commented Oct 19, 2020

What happened:

New pods started failing to come up after upgrading the EKS CNI from v1.6.0 to v1.7.0. I was able to upgrade to v1.6.3 without any issue; the errors started when I upgraded to v1.7.0. I also tried other versions (v1.7.2 and v1.7.5) but am seeing the same issue.

Here is the error I am seeing.

 Warning  FailedCreatePodSandBox  28s                 kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "7e3423d27fc6f36276de03aa7f41ef6b6f02121f800b65b64b8073c6a207b696" network for pod "spinnaker-get-resource-type-3fc73e4e3611d9f4-ps4b7": networkPlugin cni failed to set up pod "spinnaker-get-resource-type-3fc73e4e3611d9f4-ps4b7_default" network: invalid character '{' after top-level value

Here is the CNI log

Anything else we need to know?:

  • We have Cilium running in chaining mode (v1.8.4)

Environment:

  • Kubernetes version: v1.17.9-eks-4c6976
  • CNI version: tried several versions (1.7.0, 1.7.2, 1.7.5) and see the same issue with each
  • Kernel: 5.4.58-27.104.amzn2.x86_64
@jayanthvn
Contributor

Hi @YesemKebede

Can you please confirm if you have set AWS_VPC_K8S_PLUGIN_LOG_FILE to stdout?

I checked the IPAMD logs and, at first look, IP allocation seems fine. We will investigate the issue further.

Thanks.

@YesemKebede
Author

@jayanthvn AWS_VPC_K8S_PLUGIN_LOG_FILE is set to /var/log/aws-routed-eni/plugin.log

@jayanthvn
Contributor

Thanks @YesemKebede . We will look into it asap.

@jayanthvn jayanthvn added the priority/P1 Must be staffed and worked currently or soon. Is a candidate for next release label Oct 19, 2020
@jayanthvn
Contributor

Hi @YesemKebede

Can you also please confirm how you upgraded from 1.6.3 to 1.7.x?

Thank you!

@YesemKebede
Author

YesemKebede commented Oct 19, 2020

@jayanthvn I followed this Doc

@sophomeric

I upgraded from 1.6.3 to 1.7.5 and had the same problem: no new pod could be started, and each failed with that same error. I had both AWS_VPC_K8S_CNI_LOG_FILE and AWS_VPC_K8S_PLUGIN_LOG_FILE set to stdout. Removing them so the logs go to files, as per the default config, solved the issue for me.

Google led me here: Azure/azure-container-networking#195 (comment)
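For anyone hitting this stdout variant of the problem, a minimal sketch of how one might inspect and reset those two variables on the aws-node DaemonSet is below. This is not from the thread itself: the plugin.log path matches the one confirmed earlier, while the ipamd.log path is the assumed documented default and may differ in your setup.

# Sketch only: inspect the current log settings on aws-node
kubectl -n kube-system get daemonset aws-node \
  -o jsonpath='{.spec.template.spec.containers[0].env}'

# Point both variables back at files instead of stdout
# (ipamd.log path below is an assumption based on the documented default)
kubectl -n kube-system set env daemonset/aws-node \
  AWS_VPC_K8S_CNI_LOG_FILE=/host/var/log/aws-routed-eni/ipamd.log \
  AWS_VPC_K8S_PLUGIN_LOG_FILE=/var/log/aws-routed-eni/plugin.log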

@jayanthvn
Contributor

@sophomeric Yes, setting AWS_VPC_K8S_PLUGIN_LOG_FILE to stdout will cause a similar issue (#1251), but in this case it wasn't set.

@Aggouri

Aggouri commented Oct 26, 2020

We are experiencing the same issue on newly provisioned clusters with the following difference in versions:

  • Kubernetes version: v1.17.11-eks-cfdc40
  • Cilium v1.9.0-rc2 in chaining mode

If it helps: that same configuration was working last week on a different cluster with the same characteristics, although I am not 100% sure the Kubernetes patch version was identical.

@jayanthvn
Contributor

Hi @Aggouri

Can you please confirm the CNI version for the two clusters?

kubectl describe daemonset aws-node -n kube-system | grep Image | cut -d "/" -f 2

Thanks.

@Aggouri

Aggouri commented Oct 26, 2020

@jayanthvn

Can you please confirm the CNI version for the two clusters?

The cluster was provisioned a few hours ago:

$ kubectl describe daemonset aws-node --namespace kube-system | grep Image | cut -d "/" -f 2

amazon-k8s-cni-init:v1.7.5-eksbuild.1
amazon-k8s-cni:v1.7.5-eksbuild.1

Sadly, I am unable to provide the CNI plugin version of the previous cluster as it was already torn down. If it helps, I know it was provisioned at the beginning of last week and used the defaults EKS came with for Kubernetes version 1.17.x.

@jayanthvn
Contributor

Thanks for confirming, @Aggouri. We are actively looking into the issue and will update asap.

@Arsen-Uulu

Arsen-Uulu commented Oct 27, 2020

@jayanthvn I upgraded from 1.6.3 to 1.7.5 and am having the same problem.

{"level":"error","ts":"2020-10-27T10:44:04.889-0400","caller":"routed-eni-cni-plugin/cni.go:249","msg":"Error received from DelNetwork gRPC call for container ba592f75d2b25963c4bd64f218ae0930917fa39e3efffd2231b313f8eb42d344: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:50051

@jayanthvn
Contributor

jayanthvn commented Oct 27, 2020

Hi,

We have found the root cause. For now, please add pluginLogFile and pluginLogLevel to 05-cilium.conflist as shown below; we will fix this issue in the next release.

cat /etc/cni/net.d/05-cilium.conflist
{
  "cniVersion": "0.3.1",
  "name": "aws-cni",
  "plugins": [
    {
      "name": "aws-cni",
      "type": "aws-cni",
      "vethPrefix": "eni",
      "mtu": "9001",
      "pluginLogFile": "/var/log/aws-routed-eni/plugin.log",
      "pluginLogLevel": "Debug"
    },
    {
       "name": "cilium",
       "type": "cilium-cni",
       "enable-debug": false
    }
  ]
}
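Not part of the original comment: if you have shell access to the nodes, one hedged way to add those two keys in place (assuming jq is available and the file lives at the path shown above) is the sketch below. Note the file may be regenerated when the Cilium agent restarts, so it is worth re-checking afterwards.

# Sketch only: add pluginLogFile/pluginLogLevel to the aws-cni entry of the chained conflist
CONF=/etc/cni/net.d/05-cilium.conflist
sudo jq '.plugins |= map(if .type == "aws-cni"
          then . + {"pluginLogFile": "/var/log/aws-routed-eni/plugin.log", "pluginLogLevel": "Debug"}
          else . end)' "$CONF" > /tmp/05-cilium.conflist && sudo mv /tmp/05-cilium.conflist "$CONF"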

I was able to reproduce the issue, and below is the output after fixing the conflist:

dev-dsk-varavaj-2b-72f02457 % kubectl describe daemonset aws-node -n kube-system | grep 1.7.5
    Image:      602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni-init:v1.7.5
    Image:      602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.7.5

NAME                       READY   STATUS    RESTARTS   AGE   IP               NODE                                           NOMINATED NODE   READINESS GATES
my-nginx-86b7cfc89-jvzvw   1/1     Running   0          18h   192.168.10.206   ip-192-168-0-43.us-west-2.compute.internal     <none>           <none>
my-nginx-86b7cfc89-p4q2t   1/1     Running   0          18m   192.168.67.156   ip-192-168-81-109.us-west-2.compute.internal   <none>           <none>

NAME                               READY   STATUS    RESTARTS   AGE   IP               NODE                                           NOMINATED NODE   READINESS GATES
aws-node-95jtw                     1/1     Running   0          23m   192.168.0.43     ip-192-168-0-43.us-west-2.compute.internal     <none>           <none>
aws-node-cnrkq                     1/1     Running   0          24m   192.168.81.109   ip-192-168-81-109.us-west-2.compute.internal   <none>           <none>
aws-node-j64z5                     1/1     Running   0          23m   192.168.51.208   ip-192-168-51-208.us-west-2.compute.internal   <none>           <none>
cilium-5gr4s                       1/1     Running   0          18h   192.168.51.208   ip-192-168-51-208.us-west-2.compute.internal   <none>           <none>
cilium-d4nff                       1/1     Running   0          18h   192.168.0.43     ip-192-168-0-43.us-west-2.compute.internal     <none>           <none>
cilium-node-init-kwsj6             1/1     Running   0          18h   192.168.0.43     ip-192-168-0-43.us-west-2.compute.internal     <none>           <none>
cilium-node-init-pv4jw             1/1     Running   0          18h   192.168.51.208   ip-192-168-51-208.us-west-2.compute.internal   <none>           <none>
cilium-node-init-pxdfv             1/1     Running   0          18h   192.168.81.109   ip-192-168-81-109.us-west-2.compute.internal   <none>           <none>
cilium-operator-6554b44b9d-f88zj   1/1     Running   0          18h   192.168.51.208   ip-192-168-51-208.us-west-2.compute.internal   <none>           <none>
cilium-operator-6554b44b9d-j8tlb   1/1     Running   0          18h   192.168.0.43     ip-192-168-0-43.us-west-2.compute.internal     <none>           <none>
cilium-qg6tf                       1/1     Running   0          18h   192.168.81.109   ip-192-168-81-109.us-west-2.compute.internal   <none>           <none>
coredns-5c97f79574-9nnkk           1/1     Running   0          18h   192.168.68.203   ip-192-168-81-109.us-west-2.compute.internal   <none>           <none>
coredns-5c97f79574-jnsm2           1/1     Running   0          18h   100.64.95.97     ip-192-168-51-208.us-west-2.compute.internal   <none>           <none>
kube-proxy-bmv86                   1/1     Running   0          18h   192.168.81.109   ip-192-168-81-109.us-west-2.compute.internal   <none>           <none>
kube-proxy-j7c8f                   1/1     Running   0          18h   192.168.0.43     ip-192-168-0-43.us-west-2.compute.internal     <none>           <none>
kube-proxy-ss98z                   1/1     Running   0          18h   192.168.51.208   ip-192-168-51-208.us-west-2.compute.internal   <none>           <none>

Thank you!

@jayanthvn jayanthvn changed the title New pods failing to start with FailedCreatePodSandBox warning for CNI versions 1.7.x New pods failing to start with FailedCreatePodSandBox warning for CNI versions 1.7.x with Cilium Oct 27, 2020
@jayanthvn
Contributor

#1275 is merged so closing this issue.

@bogarcia

Is there any ETA for a new release including this fix?
Thanks!

@part-time-githubber

I tried the workaround suggested in #1265 (comment) 👍

After this, coredns is RUNNING but NOT READY. This is from the two pods:

pankaj.tolani@tolani-mac ~/afterpay/inception/cilium/alpha  kl coredns-74fcbd4cb4-k4dhc
.:53
[INFO] plugin/reload: Running configuration MD5 = 47d57903c0f0ba4ee0626a17181e5d94
CoreDNS-1.7.0
linux/amd64, go1.13.15, f59c03d0
[ERROR] plugin/errors: 2 1687429144305681147.3333962346685544537. HINFO: read udp 10.240.35.84:60370->10.240.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 1687429144305681147.3333962346685544537. HINFO: read udp 10.240.35.84:48293->10.240.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 1687429144305681147.3333962346685544537. HINFO: read udp 10.240.35.84:49938->10.240.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 1687429144305681147.3333962346685544537. HINFO: read udp 10.240.35.84:38861->10.240.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 1687429144305681147.3333962346685544537. HINFO: read udp 10.240.35.84:56928->10.240.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 1687429144305681147.3333962346685544537. HINFO: read udp 10.240.35.84:52537->10.240.0.2:53: i/o timeout

pankaj.tolani@tolani-mac ~/afterpay/inception/cilium/alpha  kl coredns-74fcbd4cb4-x68m8
.:53
[INFO] plugin/reload: Running configuration MD5 = 47d57903c0f0ba4ee0626a17181e5d94
CoreDNS-1.7.0
linux/amd64, go1.13.15, f59c03d0
[ERROR] plugin/errors: 2 5763478806533751487.8973578589187692515. HINFO: read udp 10.240.29.242:47668->10.240.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 5763478806533751487.8973578589187692515. HINFO: read udp 10.240.29.242:48540->10.240.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 5763478806533751487.8973578589187692515. HINFO: read udp 10.240.29.242:57593->10.240.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 5763478806533751487.8973578589187692515. HINFO: read udp 10.240.29.242:37493->10.240.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 5763478806533751487.8973578589187692515. HINFO: read udp 10.240.29.242:42574->10.240.0.2:53: i/o timeout
I0308 03:17:11.104617 1 trace.go:116] Trace[1427131847]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.18.3/tools/cache/reflector.go:125 (started: 2021-03-08 03:16:41.103761797 +0000 UTC m=+0.020357165) (total time: 30.000769225s):
Trace[1427131847]: [30.000769225s] [30.000769225s] END
E0308 03:17:11.104647 1 reflector.go:178] pkg/mod/k8s.io/client-go@v0.18.3/tools/cache/reflector.go:125: Failed to list *v1.Namespace: Get https://172.20.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0: dial tcp 172.20.0.1:443: i/o timeout
I0308 03:17:11.105067 1 trace.go:116] Trace[911902081]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.18.3/tools/cache/reflector.go:125 (started: 2021-03-08 03:16:41.104625732 +0000 UTC m=+0.021221092) (total time: 30.000420935s):
Trace[911902081]: [30.000420935s] [30.000420935s] END
E0308 03:17:11.105080 1 reflector.go:178] pkg/mod/k8s.io/client-go@v0.18.3/tools/cache/reflector.go:125: Failed to list *v1.Endpoints: Get https://172.20.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp 172.20.0.1:443: i/o timeout
I0308 03:17:11.105165 1 trace.go:116] Trace[1474941318]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.18.3/tools/cache/reflector.go:125 (started: 2021-03-08 03:16:41.104545953 +0000 UTC m=+0.021141323) (total time: 30.000607402s):
Trace[1474941318]: [30.000607402s] [30.000607402s] END
E0308 03:17:11.105172 1 reflector.go:178] pkg/mod/k8s.io/client-go@v0.18.3/tools/cache/reflector.go:125: Failed to list *v1.Service: Get https://172.20.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 172.20.0.1:443: i/o timeout

Thoughts?

@jayanthvn
Contributor

Hi @pankajmt

Which image version are you using? Release 1.7.9 has the fix for #1265 - https://github.com/aws/amazon-vpc-cni-k8s/releases/tag/v1.7.9.
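Not part of the original reply, but for anyone following along, here is a hedged sketch of bumping a self-managed aws-node DaemonSet to v1.7.9. The registry is the us-west-2 one seen earlier in this thread (substitute your region's), the init container name is assumed from the stock manifest, and the official upgrade doc for your setup should take precedence.

# Sketch only: check what is currently deployed, then bump the images
kubectl -n kube-system describe daemonset aws-node | grep Image
kubectl -n kube-system set image daemonset/aws-node \
  aws-node=602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.7.9
kubectl -n kube-system set image daemonset/aws-node \
  aws-vpc-cni-init=602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni-init:v1.7.9   # init container name assumed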

@tgraf

tgraf commented Mar 8, 2021

Can we point to a particular set of EKS releases in the Cilium docs somehow? What versions of EKS will ship with 1.7.9?

@part-time-githubber

I am on aws cni 1.7.8.

amazon-k8s-cni-init:v1.7.8 amazon-k8s-cni:v1.7.8

So it looks like there is hope, assuming the EKS version we need is GA in our region. While the docs improve, does someone know which EKS version I should be looking for?

Many thanks,
Pankaj

@jayanthvn
Contributor

Hi,

Yes, it would be great if the Cilium docs could point to EKS CNI versions; if there is a known issue, it would then be easy for customers to fall back or look for newer versions. Currently the default CNI version for new EKS clusters is 1.7.5. We will keep you updated if we plan to make 1.7.9 or a later version the default for EKS.

Thank you!

@part-time-githubber

So it looks like ours is a custom install of the EKS CNI. I will figure out how it was done and how I can upgrade it to 1.7.9.

@part-time-githubber

part-time-githubber commented Mar 9, 2021

Worked well with AWS CNI 1.7.9. Many thanks.
