Upgrading the CNI version broke pod-to-pod communication within the same worker node #641

Closed
rimaulana opened this issue Oct 2, 2019 · 50 comments
Labels
bug priority/P0 Highest priority. Someone needs to actively work on this.

Comments

@rimaulana

rimaulana commented Oct 2, 2019

After upgrading the CNI version from v1.5.1-rc1 to v1.5.4, we are seeing an issue where a pod is unable to communicate with other pods on the same worker node. We have the following layout:

CoreDNS pod on eth0
Kibana pod on eth0
App1 on eth1
App2 on eth2

What we are seeing is that DNS queries from App1 and App2 fail with "no server found" when we try them using the dig command:

dig @CoreDNS-ip amazonaws.com

Meanwhile, executing the same command from the Kibana pod, from the worker node itself, and from pods on a different worker node works as expected.

When collecting the logs using https://github.com/nithu0115/eks-logs-collector, we found that the CoreDNS pod IP did not appear anywhere in the output of the ip rule show command. I would expect each IP address of a pod running on the worker node to have at least this associated rule in the ip rule output:

512: from all to POD_IP lookup main

However, we do not see one for the CoreDNS pod IP. Therefore, we believe the CNI plugin is failing to rebuild the rule after the upgrade. There is an internal issue open for this if you want to get the collected logs.
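
For anyone else debugging this, below is a minimal sketch of how to spot which local pod IPs are missing that rule. It assumes shell access to the affected node, a configured kubectl, and that NODE_NAME matches the Kubernetes node name; host-network pods (for example aws-node and kube-proxy) share the node IP and never get such a rule, so ignore those.

# Flag every pod IP scheduled on this node that has no
# "from all to <ip> lookup main" entry in ip rule.
NODE_NAME=$(hostname)   # placeholder; adjust if your node names differ
for ip in $(kubectl get pods --all-namespaces \
    --field-selector spec.nodeName="$NODE_NAME" \
    -o jsonpath='{.items[*].status.podIP}'); do
  ip rule show | grep -qF "to $ip lookup main" \
    || echo "missing toContainer rule for $ip"
done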

@MartiUK

MartiUK commented Oct 2, 2019

Downgrading to v1.5.3 resolved this issue on EKS (Kubernetes v1.14) with CoreDNS v1.3.1.
It required rebooting the nodes first.

@mogren
Contributor

mogren commented Oct 3, 2019

Glad you found a work-around (rebooting the nodes), but I'll keep trying to reproduce this.

@igor-pinchuk

igor-pinchuk commented Oct 3, 2019

Facing the same issue.
Downgrading to v1.5.3 followed by rebooting the nodes helped.

@ueokande

ueokande commented Oct 3, 2019

We encountered the issue with Kubernetes 1.13 (eks.4) and amazon-vpc-cni-k8s v1.5.4. It affects not only CoreDNS but also inter-pod communication in general.

It occurs immediately after the cluster is created. We repaired it by restarting the pods (releasing and reassigning each pod's IP address):

$ kubectl delete pod --all
$ kubectl delete pod -nkube-system --all

@dmarkey

dmarkey commented Oct 4, 2019

I've been tearing my hair out all day after upgrading a cluster. Please change https://docs.aws.amazon.com/eks/latest/userguide/update-cluster.html to suggest v1.5.3 instead of v1.5.4, so as not to break more clusters until it's verified that this bug is fixed.

@mogren
Contributor

mogren commented Oct 4, 2019

@dmarkey None of the three minor changes between v1.5.3 and v1.5.4 have anything to do with routes, so I suspect there is some other existing issue that we have not been able to reproduce yet. Does rebooting the nodes without downgrading not fix the issue?

We have seen related issues with routes when using Calico, but they are the same on v1.5.3 and v1.5.4. Still investigating this.

@angelichorsey

angelichorsey commented Oct 4, 2019

This is a sysctl fix, no?

net.bridge.bridge-nf-call-ip6tables=1
net.bridge.bridge-nf-call-iptables=1
net.bridge.bridge-nf-call-arptables=1

If you don't have these set, the Docker bridge can't talk back to itself.

https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/network-plugins/#network-plugin-requirements
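
(For reference, a quick way to check and set these on a worker node, assuming shell access and that the br_netfilter module is available; whether this is the root cause of the issue above is a separate question.)

# These sysctls only exist once br_netfilter is loaded.
sudo modprobe br_netfilter
sysctl net.bridge.bridge-nf-call-iptables \
       net.bridge.bridge-nf-call-ip6tables \
       net.bridge.bridge-nf-call-arptables
# Enable any that report 0.
sudo sysctl -w net.bridge.bridge-nf-call-iptables=1 \
               net.bridge.bridge-nf-call-ip6tables=1 \
               net.bridge.bridge-nf-call-arptables=1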

@nithu0115
Contributor

@dmarkey are you seeing a missing rule in the routing policy database? Could you elaborate on the issue you are running into?

@schahal

schahal commented Oct 4, 2019

Can we please update https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/release-1.5/config/v1.5/aws-k8s-cni.yaml to point to 1.5.3 until 1.5.4 is fully vetted? We are running into the same issue and want the default to be the working version.

@dmarkey

dmarkey commented Oct 4, 2019

The main issue was that around 10% of pods were unable to talk to other pods, such as CoreDNS, and therefore couldn't resolve and/or connect to dependent services. They could, however, connect to services on the internet.

I also noticed that, for the problematic pods, their IP was missing from the node's ifconfig output. I assume they would need an interface added that would be visible on the host?

@dmarkey

dmarkey commented Oct 4, 2019

I have powered up the cluster twice from scratch with ~200 pods on 1.5.3 and it has come up flawlessly.

With 1.5.4, about 20% of pods couldn't find their dependencies, either because they could not resolve the address (mostly services in the same namespace) or because they couldn't reach the dependency at all. I must have powered up the ASG about 10 times trying to troubleshoot the situation.

mogren pushed a commit to mogren/amazon-vpc-cni-k8s that referenced this issue Oct 4, 2019
@mogren
Contributor

mogren commented Oct 4, 2019

@dmarkey Thanks for the update, will keep testing this. @schahal I have reverted config/v1.5/aws-k8s-cni.yaml to point to v1.5.3 for now.

@mogren
Contributor

mogren commented Oct 4, 2019

@dmarkey Could you please send me the log output from https://github.com/awslabs/amazon-eks-ami/tree/master/log-collector-script? (Either to mogren at amazon.com or to c.m in the Kubernetes Slack.)

@dmarkey

dmarkey commented Oct 4, 2019

Do you mean with 1.5.3 or 1.5.4? I'm afraid this cluster is in active use (although not classed as "production"), so I can't easily revert without causing at least some disruption. Either way, I don't have access until Monday morning, Irish time.

@mogren
Contributor

mogren commented Oct 4, 2019

@dmarkey Logs from a node where you see the communication issue, so with v1.5.4. If you could get those next week, I'd be very thankful. Sorry to cause bother on a Friday evening! 🙂

@mogren
Contributor

mogren commented Oct 14, 2019

I have still not been able to reproduce this issue, and I have not gotten any logs showing errors in the CNI, but I have seen a lot of errors in the CoreDNS logs. If anyone can reliably reproduce the issue, or find a missing route or iptables rule, I'd be happy to know more.

@ayosec

ayosec commented Oct 15, 2019

We had a similar problem today, with 1.5.4.

Yesterday, we changed the configuration of the deployment to set AWS_VPC_K8S_CNI_LOGLEVEL=INFO, so the aws-node-* pods were restarted. We checked that it was able to assign IP addresses to new pods, and everything was working as expected.

Today, we updated some deployments, and then we started to see 504 Gateway Timeout errors in some requests.

After some investigation, we found that the ingress controller was not able to connect to pods on the same node. The pod (with IP 10.200.254.228) was accessible from ingress controllers on other nodes.

We ruled out a bug in the ingress controller because even a ping was not possible:

# nsenter -t 1558 -n ping -c 2 10.200.254.228
PING 10.200.254.228 (10.200.254.228) 56(84) bytes of data.

--- 10.200.254.228 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1001ms

(1558 is the PID of the ingress controller).

The ping worked from the host network.


After more investigation, we found an issue in the IP rules:

# ip rule show
0:	from all lookup local 
512:	from all to 10.200.211.143 lookup main 
512:	from all to 10.200.204.145 lookup main 
512:	from all to 10.200.212.149 lookup main 
512:	from all to 10.200.206.165 lookup main 
512:	from all to 10.200.236.131 lookup main 
512:	from all to 10.200.202.149 lookup main 
512:	from all to 10.200.220.69 lookup main 
512:	from all to 10.200.223.122 lookup main 
512:	from all to 10.200.212.190 lookup main 
512:	from all to 10.200.206.240 lookup main 
1024:	from all fwmark 0x80/0x80 lookup main 
1536:	from 10.200.222.108 to 10.200.0.0/16 lookup 2 
1536:	from 10.200.254.228 to 10.200.0.0/16 lookup 3 
1536:	from 10.200.221.230 to 10.200.0.0/16 lookup 3 
1536:	from 10.200.211.143 to 10.200.0.0/16 lookup 3 
1536:	from 10.200.204.145 to 10.200.0.0/16 lookup 3 
1536:	from 10.200.212.149 to 10.200.0.0/16 lookup 2 
1536:	from 10.200.206.165 to 10.200.0.0/16 lookup 3 
1536:	from 10.200.236.131 to 10.200.0.0/16 lookup 2 
1536:	from 10.200.202.149 to 10.200.0.0/16 lookup 2 
1536:	from 10.200.220.69 to 10.200.0.0/16 lookup 2 
1536:	from 10.200.223.122 to 10.200.0.0/16 lookup 2 
1536:	from 10.200.206.240 to 10.200.0.0/16 lookup 2 
32766:	from all lookup main 
32767:	from all lookup default 

In the list above, you can see that 10.200.254.228 is missing from the from all to ... rules.

We added it manually:

# ip rule add from all to 10.200.254.228 lookup main

And the issue was fixed.


We checked the logs, and the only error related to 10.200.254.228 is the following (in plugin.log):

2019-10-14T03:55:21.684Z [INFO]	Received CNI del request: ContainerID(ae40e6b983f6f3cb21753559ed9eb10eb7e7a341ce3a9afe975078d65d9002ec) Netns(/proc/23768/ns/net) IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=staging;K8S_POD_NAME=redacted-58948849cf-bjlfb;K8S_POD_INFRA_CONTAINER_ID=ae40e6b983f6f3cb21753559ed9eb10eb7e7a341ce3a9afe975078d65d9002ec) Path(/opt/cni/bin) argsStdinData({"cniVersion":"0.3.1","name":"aws-cni","type":"aws-cni","vethPrefix":"eni"})
2019-10-14T03:55:21.688Z [ERROR]	Failed to delete toContainer rule for 10.200.254.228/32 err no such file or directory
2019-10-14T03:55:21.688Z [INFO]	Delete Rule List By Src [{10.200.254.228 ffffffff}]
2019-10-14T03:55:21.688Z [INFO]	Remove current list [[ip rule 1536: from 10.200.254.228/32 table 3]]
2019-10-14T03:55:21.688Z [INFO]	Delete fromContainer rule for 10.200.254.228/32 in table 3
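
For anyone who needs a stopgap before upgrading or rebooting, the manual fix above can be applied in bulk. The following is only a sketch, run as root on the affected node: POD_IPS is a placeholder for however you collect the local pod IPs (for example, the kubectl loop earlier in this thread), and preference 512 simply matches what the CNI installs for its toContainer rules.

# Re-create the missing "from all to <ip> lookup main" rule for any
# local pod IP that has lost it; IPs that already have one are skipped.
for ip in $POD_IPS; do
  ip rule show | grep -qF "to $ip lookup main" \
    || ip rule add from all to "$ip" lookup main pref 512
done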

@mogren
Contributor

mogren commented Oct 15, 2019

@ayosec Thanks a lot for the helpful details!

mogren added the priority/P0 (Highest priority. Someone needs to actively work on this.) label Oct 15, 2019
@Magizhchi

We are facing the same issue: pod-to-pod communication intermittently goes down, and restarting the pods brings it back up.

We followed the suggestion above to downgrade to 1.5.3 and restart the nodes, which worked for us.

So maybe there is some issue with v1.5.4.

@ueokande

Today, we created a new EKS cluster, and amazon-k8s-cni:v1.5.3 was deployed.
Our cluster is now fine!

@mprenditore

Faced the same issue. Upgrading from 1.5.3 to 1.5.4 started to create problems, including a lot of 504s.
Reverting back to 1.5.3 wasn't enough; we needed to restart all the cluster nodes to get back to full functionality. A full restart on 1.5.4 could probably have worked too, based on what other people said here about there being no huge changes. Even so, the earlier upgrade from 1.2.1 to 1.5.3 didn't cause any issues.

@mogren
Contributor

mogren commented Oct 29, 2019

Please try the v1.5.5 release candidate if you need g4, m5dn, r5dn or Kubernetes 1.16 support.

@daviddelucca

@MartiUK How did you downgrade amazon-k8s-cni? Could you show me the steps, please?

@chadlwilson

@daviddelucca Replacing the region below with whatever is appropriate for you:

kubectl set image daemonset.apps/aws-node \
  -n kube-system \
  aws-node=602401143452.dkr.ecr.ap-southeast-1.amazonaws.com/amazon-k8s-cni:v1.5.3

And then it seems that restarting all pods, at a minimum, is required. Some people seem to have restarted all nodes (which restarts the pods as a side effect), but it's unclear whether that's really required.
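
If you do end up restarting nodes, a less disruptive sequence is to drain and reboot them one at a time. This is only a sketch: NODE is a placeholder for each node name, and the reboot step depends on how you manage the instances (SSM, SSH, or the EC2 console).

# Move workloads off the node, reboot it, then allow scheduling again.
kubectl drain "$NODE" --ignore-daemonsets --delete-local-data
# ...reboot the underlying EC2 instance here and wait for it to report Ready...
kubectl uncordon "$NODE"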

@daviddelucca

@chadlwilson thank you very much

@mogren
Contributor

mogren commented Nov 13, 2019

v1.5.5 is released with a revert of the commit that caused issues. Resolving this issue.

@wadey

wadey commented Nov 26, 2019

Unless I'm misunderstanding, it looks like v1.6.0-rc4 also has the problematic commit. Can we get a v1.6.0-rc5 with the fix there as well?

@eladazary

I've been facing this issue since yesterday with CNI 1.5.5. I've tried to downgrade to 1.5.3 and back to 1.5.5, but with no success.
It looks like the /etc/cni/net.d/10-aws.conflist file only gets created when using CNI v1.5.1.

Errors from ipamd.log:
Starting L-IPAMD v1.5.5 ...
2019-11-26T16:31:57.105Z [INFO] Testing communication with server
2019-11-26T16:32:27.106Z [INFO] Failed to communicate with K8S Server. Please check instance security groups or http proxy setting
2019-11-26T16:32:27.106Z [ERROR] Failed to create client: error communicating with apiserver: Get https://172.20.0.1:443/version?timeout=32s: dial tcp 172.20.0.1:443: i/o timeout

I saw that after I upgraded to CNI 1.5.5 again, the file /etc/cni/10-aws.conflist got created. Maybe it's something to do with the path where kubelet looks for the CNI config file?

Nodes are in Ready status but all pods are in ContainerCreating state.

Do you have any idea why this happens?

@mogren
Contributor

mogren commented Nov 26, 2019

@wadey The issue is not in v1.6.0-rc4; there we solved it in another way, see #688. That is a better solution, since if we return an error when we try to delete a pod that was never created, kubelet will retry 10 times, trying to delete something that doesn't exist, before giving up.

@mogren
Contributor

mogren commented Nov 26, 2019

@eladazary The error you are seeing is unrelated to this issue. Starting with v1.5.3, we don't make the node active until ipamd can talk to the API server. If permissions are not correct and ipamd (the aws-node pods) can't talk to the API server or to the EC2 control plane, it can't attach IPs to the nodes, and then pods will never get IPs or become active.

Make sure that the worker nodes are configured correctly. The logs for ipamd should tell you what the issue is; they can be found in /var/log/aws-routed-eni/ on the node.
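
(A quick way to scan those, assuming shell access to the node; the glob covers the slightly different log file names across CNI versions.)

# Look for errors from ipamd and the CNI plugin binary.
grep -iE 'error|fail' /var/log/aws-routed-eni/ipamd.log* /var/log/aws-routed-eni/plugin.log*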

More about worker nodes: https://docs.aws.amazon.com/eks/latest/userguide/launch-workers.html

@itsLucario

A similar issue came up with 1.7.5 when upgrading from 1.6.1. Only around 10% of the pods are able to communicate with each other; the others are failing.

Even downgrading to 1.6.1 didn't work until we restarted the nodes. Can someone briefly explain the cause and the status of the fix for this?

@jayanthvn
Contributor

Hi @itsLucario

When you upgraded, was it just an image update, or did you reapply the config (https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/v1.7.5/config/v1.7/aws-k8s-cni.yaml)?

@itsLucario

itsLucario commented Jan 13, 2021

@jayanthvn I applied the exact config YAML you shared.
Also, since we are using CNI custom networking, once the daemonset is updated we run:

kubectl set env daemonset aws-node -n kube-system AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG=true

Edit:
If I set AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG=true in the container env while updating the CNI itself, the upgrade happens seamlessly.

I think the docs should be updated to mention that, if custom configuration is in use, the manifests should be updated accordingly before upgrading.
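
For reference, one way to do that is to edit the manifest locally before applying it, so aws-node never rolls out without the custom-networking flag. This is only a sketch; the env var name comes from the comment above, and the edit itself can be done by hand or with any YAML tool.

# Download the release manifest, add AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG=true
# to the aws-node container's env section (hand-edit or a yq/sed one-liner),
# then apply the edited copy instead of the raw URL.
curl -sLO https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/v1.7.5/config/v1.7/aws-k8s-cni.yaml
# (edit aws-k8s-cni.yaml here)
kubectl apply -f aws-k8s-cni.yaml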

@jayanthvn
Contributor

Hi @itsLucario

Yes, that makes sense, and thanks for checking. I suspected that was what was happening, hence I wanted to know how you upgraded. Can you please open an issue for the documentation? I can take care of it.

Thanks.
