
IPAM connectivity failed when upgrading from v1.5.5 to 1.6.0 #872

Closed · laghao opened this issue Mar 17, 2020 · 21 comments
Labels: needs investigation, priority/P0 (Highest priority. Someone needs to actively work on this.)

laghao commented Mar 17, 2020

I updated my EKS cluster to 1.15.10 and that worked.
Then I tried to update the CNI from v1.5.5 to v1.6.0 on my two test k8s nodes. As it's a DaemonSet, I had one aws-node pod running and the other failing with the following error:

kubectl logs -f pod/aws-node-cjqwm -nkube-system
starting IPAM daemon in background ... ok.
checking for IPAM connectivity ...  failed.
timed out waiting for IPAM daemon to start.

I deleted the pod, but it keeps failing with the same error:

kubectl get po --all-namespaces

NAMESPACE     NAME                             READY   STATUS    RESTARTS   AGE
kube-system   aws-node-22mnl                   1/1     Running   0          15m
kube-system   aws-node-h6nrx                   0/1     Running   3          3m9s

More details:

kubectl describe po aws-node-h6nrx -nkube-system

Events:
  Type     Reason     Age                    From                                                   Message
  ----     ------     ----                   ----                                                   -------
  Normal   Scheduled  4m49s                  default-scheduler                                      Successfully assigned kube-system/aws-node-h6nrx to ip-10-1-46-183.eu-central-1.compute.internal
  Warning  Unhealthy  3m33s                  kubelet, ip-10-1-46-183.eu-central-1.compute.internal  Readiness probe errored: rpc error: code = Unknown desc = container not running (c542f67fbf22592a6840faa98cd3e9f1c774efeead2a6068319b0488570a903f)
  Warning  Unhealthy  2m39s                  kubelet, ip-10-1-46-183.eu-central-1.compute.internal  Liveness probe failed: timeout: failed to connect service ":50051" within 1s
  Normal   Pulling    2m18s (x4 over 4m48s)  kubelet, ip-10-1-46-183.eu-central-1.compute.internal  Pulling image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.6.0"
  Normal   Pulled     2m17s (x4 over 4m47s)  kubelet, ip-10-1-46-183.eu-central-1.compute.internal  Successfully pulled image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.6.0"
  Normal   Created    2m17s (x4 over 4m47s)  kubelet, ip-10-1-46-183.eu-central-1.compute.internal  Created container aws-node
  Normal   Started    2m17s (x4 over 4m47s)  kubelet, ip-10-1-46-183.eu-central-1.compute.internal  Started container aws-node
  Warning  Unhealthy  100s                   kubelet, ip-10-1-46-183.eu-central-1.compute.internal  Liveness probe errored: rpc error: code = Unknown desc = container not running (a51a934a7d0867d500c7f9533d995ae7605ba7f80ed19186a513dd2fe62b0d88)
  Warning  BackOff    90s (x6 over 3m32s)    kubelet, ip-10-1-46-183.eu-central-1.compute.internal  Back-off restarting failed container
mogren added the needs investigation and priority/P0 labels Mar 18, 2020

mogren (Contributor) commented Mar 18, 2020

@laghao How was the CNI updated? If only the image tag was updated, could it be that the required /var/run/dockershim.sock was not mounted?
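
A quick way to check (a sketch, not an official step) is to list the volumes on the DaemonSet and look for the dockershim hostPath:

kubectl -n kube-system get ds aws-node -o jsonpath='{.spec.template.spec.volumes}'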

mogren (Contributor) commented Mar 18, 2020

In your logs I see

Liveness probe failed: timeout: failed to connect service ":50051" within 1s

What is your initialDelaySeconds setting? The initial startup can take quite a while, since ipamd first tries to talk to the API server and then to the EC2 API. If any throttling or retries happen, this might delay initialization long enough for the liveness probe to fail.
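
For reference, checking and bumping the delay can be done roughly like this (a sketch; pick a value that fits your environment):

kubectl -n kube-system get ds aws-node \
  -o jsonpath='{.spec.template.spec.containers[0].livenessProbe.initialDelaySeconds}'
kubectl -n kube-system patch ds aws-node --type json \
  -p '[{"op":"replace","path":"/spec/template/spec/containers/0/livenessProbe/initialDelaySeconds","value":60}]'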

@nachomillangarcia

Tried increasing initialDelaySeconds, but it didn't work. The container fails with the following status:

    Last State:     Terminated
      Reason:       Error
      Exit Code:    1

So it's not the liveness probe that's failing.

laghao (Author) commented Mar 18, 2020

I updated the CNI using the aws-vpc-cni Helm chart.
In parallel, I spun up another EKS cluster directly with Terraform, using 1.15.10 and CNI v1.6.0, and that worked smoothly.

The in-place upgrade looks broken somehow.

mogren (Contributor) commented Mar 18, 2020

I tried using the helm chart to upgrade from v1.5.5 to v1.6.0 and it took my aws-node pods around 40 to 45 seconds to become ready, no restarts. Will keep trying to reproduce this issue.

jaypipes added a commit to jaypipes/amazon-vpc-cni-k8s that referenced this issue Mar 18, 2020
Adds a configurable timeout to the aws-k8s-agent (ipamd) startup in the
entrypoint.sh script. Increases the default timeout from ~30 seconds to
60 seconds.

Users can set the IPAMD_TIMEOUT_SECONDS environment variable to change
the timeout.

Related: aws#625, aws#865 aws#872
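
If that change lands, overriding the timeout would be a one-liner on the DaemonSet (a sketch based on the commit message above):

kubectl -n kube-system set env ds/aws-node IPAMD_TIMEOUT_SECONDS=90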

@SaranBalaji90 (Contributor)

@laghao what's your kubelet version on the worker nodes? If you are using the EKS AMI to launch your worker nodes, can you give us the AMI ID as well?
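
For anyone gathering the same info: the kubelet version shows up in kubectl get nodes, and the AMI ID can be pulled from EC2 (a sketch; the instance ID below is a placeholder):

kubectl get nodes -o wide
aws ec2 describe-instances --instance-ids i-0123456789abcdef0 \
  --query 'Reservations[].Instances[].ImageId' --output text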

hahasheminejad commented Mar 29, 2020

Hi everyone, I am upgrading from v1.5.5 to v1.6.0 and the CNI pod fails to start:

starting IPAM daemon in background ... ok.
checking for IPAM connectivity ...  failed.
timed out waiting for IPAM daemon to start.
  • The CNI is running on EKS 1.14
  • Applied the CNI upgrade directly from here
  • Confirmed the dockershim mount exists on the pods, and verified /var/run/dockershim.sock exists on the hosts.
  • Rolling back to v1.5.5 resolves the issue.
  • Running kube-proxy:v1.14.9
  • Increasing initialDelaySeconds to 90 didn't help
  • Also verified the dockershim mount with docker inspect on the hosts:
{
    "Type": "bind",
    "Source": "/var/run/dockershim.sock",
    "Destination": "/var/run/dockershim.sock",
    "Mode": "",
    "RW": true,
    "Propagation": "rprivate"
}
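
For anyone reproducing that check, the mount info above can be pulled with something like this (the container ID is a placeholder):

docker ps | grep aws-node
docker inspect --format '{{json .Mounts}}' <container-id>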

jaypipes (Contributor) commented Apr 2, 2020

Hi @hahasheminejad! Is there any chance you might be able to run the aws-cni-support.sh script before and after the upgrade and send the results to one of us? Either mogren@ or jaypipes@ amazon...
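
For reference, the script ships with the CNI onto the worker node; running it usually looks like this (the path may vary by version, so treat this as a sketch):

# on the affected worker node
sudo bash /opt/cni/bin/aws-cni-support.sh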

dylanenabled commented Apr 16, 2020

I got the same problem today after updating using eksctl:

  • Updated from EKS 1.14 to 1.15 (using eksctl update cluster)
  • Added new nodegroup with 1.15 image. Drained old one. Pods moved okay, everything worked.
  • Updated utils
    • eksctl utils update-kube-proxy
    • eksctl utils update-aws-node (updated to 1.6)
    • eksctl utils update-coredns

Suddenly new containers could not start (timeout from the CNI). So I created a new nodegroup. Its nodes would not become Ready, and the AWS CNI was logging the error "timed out waiting for IPAM daemon to start."

Rolling back to aws-node 1.5.7 seems to fix the issue for now.
EDIT:
I can't seem to get any pods running (except for aws-node, kube-proxy and calico-node) because they can no longer be assigned IP addresses on this cluster, even after rolling back to 1.5.5. There aren't any obvious errors in the aws-node logs either.

dylanenabled commented Apr 16, 2020

I figured out my issue; hopefully this will help someone else who finds this via Google. The aws-node serviceaccount was using a service account IAM role to provide access to the ENI EC2 APIs (following https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts-cni-walkthrough.html) instead of giving the node role the AmazonEKS_CNI_Policy.

Upgrading aws-node via eksctl overwrote the serviceaccount definition and removed the role annotation.

I fixed this by removing and re-adding the iamserviceaccount using eksctl:

eksctl delete iamserviceaccount -f eksctl-cluster.yml --include kube-system/aws-node --approve
eksctl create iamserviceaccount -f eksctl-cluster.yml --approve
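
A quick way to check whether the role annotation survived an upgrade (a sketch):

kubectl -n kube-system get sa aws-node -o yaml | grep eks.amazonaws.com/role-arn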

I have reported this to eksctl here

mogren (Contributor) commented Apr 30, 2020

@hahasheminejad I noticed in your logs that your worker node was completely overwhelmed, with pods constantly getting OOM-killed:

grep "killed as a result of limit" messages | wc -l   
   19684

Did you see this issue on other nodes as well?

@gseshagirirao

Hi, we are facing the same issue while upgrading from Kubernetes 1.13 (eks.9) to Kubernetes 1.14 (eks.9), moving from CNI v1.5.5 to v1.6.1, with the dockershim socket mounted.

We tried the following steps:

Removed and recreated the service account (initially the SA was created by eksctl).
Removed the annotation on the service account and re-added it manually.
Restarted the aws-node pods.
Applied the manifest manually with kubectl, going from 1.5.5 to 1.6.1.

Logs:
Starting IPAM daemon in the background ... ok.
Checking for IPAM connectivity ... failed.
Timed out waiting for IPAM daemon to start

Please let us know if there is any workaround, or when a fix is expected.

@njgibbon

Hello, as with the comment above, we are also seeing the same issue updating the VPC CNI from v1.5.5 to v1.6.1.

We have 4 clusters (which are theoretically all configured the same way).

All on v1.15.11-eks-af3caf.
All worker nodes on the same AMI: 1.15.10-20200228.

CoreDNS and kube-proxy versions are up to date across all 4 clusters, in line with the table in the official AWS guide:
https://docs.aws.amazon.com/eks/latest/userguide/update-cluster.html

The VPC CNI plugin has been updated successfully across 3 clusters.

In the last cluster the DaemonSet rolled out successfully to 6/7 nodes.

On the last node the pod crash-looped due to failing health checks. I bounced it and it crash-looped again.
I am consistently getting the same messages in the pod logs that others have pointed to.

Starting IPAM daemon in the background ... ok.
Checking for IPAM connectivity ... failed.
Timed out waiting for IPAM daemon to start:

There are other workloads already scheduled on this node.

This meant I needed to roll back to v1.5.5, but only in this cluster.

I'm looking at resources and attempting to triage, and may raise this with AWS Support separately, but I'm adding it here for visibility on this issue occurring in general and to keep it fresh.

mogren (Contributor) commented May 13, 2020

Thanks for reporting the issue @njgibbon! Did you run the aws-cni-support.sh script on the node to gather the log data? It would be great if we could see why the pod failed to start. The logs should be in /var/log/aws-routed-eni/ on the worker node. We have seen issues related to kube-proxy before.

Also, if rolling back, would v1.5.7 be an option?
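
For a quick look directly on the node before shipping the archive, something like this works (log file names vary slightly between versions):

ls /var/log/aws-routed-eni/
tail -n 50 /var/log/aws-routed-eni/ipamd.log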

JulienDefrance commented May 13, 2020

I've faced similar issues after upgrading to EKS 1.16, upgrading the VPC CNI plugin to 1.6.1, and moving to the latest kube-proxy, 1.16.8.

  • Nodes would remain in a NotReady state
  • Describing them would also highlight: KubeletNotReady runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
  • From the logs, kube-proxy would also remain in a CrashLoopBackOff state.

After troubleshooting this with AWS Support, rolling back to our previous EKS 1.15 configuration (AWS VPC CNI plugin 1.5.7 and kube-proxy 1.15.11) got things working for me on EKS 1.16.

Please note that terminating your existing EC2 instances might (or will?) be needed in order to get back to a running state.

Out of the 1.16 upgrade "prerequisites", the only mandatory one, if you were already on 1.15, is to make sure all your YAML manifests are converted to the new (v1) API versions; no more betas. https://docs.aws.amazon.com/eks/latest/userguide/update-cluster.html#1-16-prequisites

You might want to hold off on any other changes for now, until AWS communicates further on this issue.

mogren (Contributor) commented May 13, 2020

For kube-proxy on 1.16, make sure that --resource-container is not in the spec. See Kubernetes 1.16 for details.
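
A quick way to check for the flag, and to remove it if present (a sketch):

kubectl -n kube-system get ds kube-proxy -o yaml | grep resource-container
kubectl -n kube-system edit ds kube-proxy   # delete the --resource-container= line if found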

brianstorti commented May 16, 2020

This was very hard to track down, but @mogren's comment was what solved it for me. My cluster was created ~2 years ago and kube-proxy was still using the --resource-container flag. After the 1.16 upgrade I started seeing this "cni config uninitialized" error and all the nodes got stuck in the NotReady state.

I tried downgrading the CNI plugin back to 1.5.x, but that didn't solve the problem either. I had to manually edit my kube-proxy daemonset ($ kubectl edit ds kube-proxy -n kube-system) to remove the flag.

I think it'd be great to mention that in the upgrade guide.

gyuho (Contributor) commented May 16, 2020

@brianstorti We've updated the doc in awsdocs/amazon-eks-user-guide#125; it should go live soon.

spacebarley commented May 18, 2020

@mogren Hi, I experienced almost the same issue as @njgibbon.

I am running multiple clusters, but the upgrade failed only on one node in one cluster.
I rolled aws-node back to v1.5.7 after I found the upgrade had failed.

I sent the results of running aws-cni-support.sh on the affected node to mogren's email.

Hope it helps.

mogren (Contributor) commented May 18, 2020

@spacebarley Hi! Thanks for the logs, they made it clear that you ran into another issue:

{
  "level": "error",
  "ts": "2020-05-18T08:52:22.632Z",
  "caller": "aws-k8s-agent/main.go:30",
  "msg": "Initialization failure: failed to allocate one IP addresses on ENI eni-0aaaafcedcb7b0940e,
          err: allocate IP address: failed to allocate a private IP address: 
          InsufficientFreeAddressesInSubnet: The specified subnet does not have enough free addresses to satisfy the request.
          status code: 400, 
          request id: 0xxxxxx-a5e4-4a47-b76a-0360e364d5f1"
}

The subnet is out of IPs. First, since you were running the v1.5.x CNI earlier, check for leaked ENIs in your account. They will be marked as Available (blue dot) in the AWS Console and have a tag, node.k8s.amazonaws.com/instance_id, showing which instance they once belonged to.
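
One way to hunt for leaked ENIs and to check remaining subnet capacity with the AWS CLI (a sketch):

aws ec2 describe-network-interfaces \
  --filters Name=status,Values=available Name=tag-key,Values=node.k8s.amazonaws.com/instance_id \
  --query 'NetworkInterfaces[].NetworkInterfaceId'
aws ec2 describe-subnets --query 'Subnets[].[SubnetId,AvailableIpAddressCount]' --output table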

mogren (Contributor) commented May 18, 2020

Closing this issue since it has turned into a bucket of multiple upgrade issues. The issues we have seen so far:

  • IAM roles for service accounts issue with eksctl
  • kube-proxy on Kubernetes 1.16 no longer supports the --resource-container flag
  • Subnet out of IP addresses

Please open a new issue if you find any new problem.

mogren pushed a commit that referenced this issue Jun 24, 2020
* Remove timeout for ipamd startup (#874): the IPAM readiness check is a local gRPC call, so it now retries every second indefinitely and relies on the liveness probe to restart the pod if ipamd never comes up.

Co-authored-by: Claes Mogren <mogren@amazon.com>
bnapolitan added a commit to bnapolitan/amazon-vpc-cni-k8s that referenced this issue Jul 1, 2020