
Pod traffic doesn't go through OVS when Antrea agent is in networkPolicyOnly mode #4228

Closed
luolanzone opened this issue Sep 15, 2022 · 7 comments

Comments

@luolanzone
Contributor

luolanzone commented Sep 15, 2022

Describe the bug
I followed this guide to deploy an EKS cluster with Antrea in networkPolicyOnly mode.
After the deployment completed, I tried to curl a Service from a Pod. I didn't see any packet counts increase in the OVS flows when running ovs-ofctl dump-flows br-int. After some troubleshooting, I found that the routes are not as expected: the Pod IP's route still goes via its own interface, e.g. 110.13.37.137 dev eni4f80b44affe scope link, but it should go via antrea-gw0, e.g. 110.13.47.25 dev antrea-gw0 scope link.
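
To check which path the Pod traffic takes, the following can be run on the affected Node (a minimal sketch; the Pod IP is the one from the example above):

# Show which interface the Pod IP is routed through; in the broken state it
# still points at the Pod's own eni* interface instead of antrea-gw0.
ip route | grep 110.13.37.137

# Dump the OVS flows and check whether the packet counters (n_packets)
# increase while curling the Service from the Pod.
ovs-ofctl dump-flows br-int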

From the logs of the antrea-node-init-* Pod, I can see that the existing containers were restarted by it successfully, but it seems they are not handled by Antrea correctly.

Waiting for Antrea CNI conf file
Detecting container runtime (docker / containerd) based on whether /var/run/docker.sock exists
Container runtime: docker

Restarting container with ID: 462a2355d6f5790075a04cc5466dcbe4bc7c1ea6cb2d989f8ad872329c48d2a2
462a2355d6f5790075a04cc5466dcbe4bc7c1ea6cb2d989f8ad872329c48d2a2
Restarting container with ID: 620e2cf084a096e83444f0285ea29555be036701b1c69a4e7cd21c45dc7018f6
620e2cf084a096e83444f0285ea29555be036701b1c69a4e7cd21c45dc7018f6
Restarting container with ID: a3f3e3e476d72f66f071a534bc9ee53fa02497c384486f366355a3f6dfea8766
a3f3e3e476d72f66f071a534bc9ee53fa02497c384486f366355a3f6dfea8766
Node initialization completed
!!! startup-script succeeded!

I don't know if there is a known solution for this kind of issue, but the Pod routes were correct after I restarted the Node. Rebooting does not feel like a good solution, so I am documenting the issue here in case someone familiar with networkPolicyOnly mode can help.
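
A lighter-weight workaround than rebooting the Node might be to delete the affected Pods so that they are recreated and go through the Antrea CNI again (an untested assumption; the namespace and Pod name below are placeholders):

# Recreating the Pod sandbox should invoke the Antrea CNI, which installs
# the antrea-gw0 route for the new Pod.
kubectl -n <namespace> delete pod <pod-name>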

To Reproduce
Follow the guide to deploy an EKS cluster and deploy Antrea in networkPolicyOnly mode. The issue may or may not occur: I created two clusters, and only one of them had the issue.

Expected
The routes for all non-hostNetwork Pods should be updated to go via antrea-gw0, so that OVS can take over the traffic.

Actual behavior

Versions:

  • Latest version of Antrea; the Docker image is:
projects.registry.vmware.com/antrea/antrea-mc-controller           latest               d376f9d14150   29 hours ago    65.7MB
  • Kubernetes version:
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.0", GitCommit:"c2b5237ccd9c0f1d600d3072634ca66cefdf272f", GitTreeState:"clean", BuildDate:"2021-08-04T18:03:20Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"22+", GitVersion:"v1.22.12-eks-6d3986b", GitCommit:"dade57bbf0e318a6492808cf6e276ea3956aecbf", GitTreeState:"clean", BuildDate:"2022-07-20T22:06:30Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}
  • Container runtime: Docker
  • Linux kernel version: 5.4.209-116.363.amzn2.x86_64

Additional context

@luolanzone luolanzone added the kind/bug label Sep 15, 2022
@antoninbas
Contributor

@luolanzone could you share more information on how to reproduce?

I used the following steps:

eksctl create cluster -N 2
kubectl apply -f https://raw.githubusercontent.com/antrea-io/antrea/main/build/yamls/antrea-eks-node-init.yml
kubectl apply -f https://raw.githubusercontent.com/antrea-io/antrea/main/build/yamls/antrea-eks.yml

And I was not able to reproduce (I tried twice, with 2 different clusters).

There are only 2 Pods that need to be restarted (the core-dns Pods, which are the only Pods on the Pod network), and I observe that they are restarted correctly by antrea-eks-node-init:

kube-system   coredns-5db97b446d-6r7ts           1/1     Running   1 (3m58s ago)   14m
kube-system   coredns-5db97b446d-hz7pj           1/1     Running   1 (3m58s ago)   14m

The routes for these Pods are correct:

default via 192.168.32.1 dev eth0
169.254.169.254 dev eth0
192.168.32.0/19 dev eth0 proto kernel scope link src 192.168.43.69
192.168.38.10 dev antrea-gw0 scope link
192.168.45.29 dev antrea-gw0 scope link
  1. Did you create additional workload Pods (besides core-dns) before deploying Antrea to the EKS cluster?
  2. After applying both YAML manifests, do you see that the Pods have indeed been restarted by K8s?
  3. If you manage to recreate, can you capture the antrea-agent logs?

@luolanzone
Contributor Author

@antoninbas yes, I created a Pod before deploying Antrea. I will try it again today and let you know the result.
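
For reference, reproducing that ordering could look roughly like this (a sketch; the nginx Deployment is just an example workload):

# Create a workload Pod on the Pod network before Antrea is deployed.
kubectl create deployment nginx --image=nginx

# Then deploy Antrea in networkPolicyOnly mode as in the EKS guide.
kubectl apply -f https://raw.githubusercontent.com/antrea-io/antrea/main/build/yamls/antrea-eks-node-init.yml
kubectl apply -f https://raw.githubusercontent.com/antrea-io/antrea/main/build/yamls/antrea-eks.yml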

@luolanzone
Contributor Author

I tried with two clusters, and both work this time. The original environment has been cleaned up; I will keep this in mind and see whether it can be reproduced again. But I found that if I delete all Antrea-related deployments and then create a new Service deployment without Antrea, I see the errors below. Is this expected?

Failed to create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "9f19670aa37866a6ed6ce18d6caab59ebe0d2dc69781a978d7a9420cd0de6798" network for pod "nginx-7b8bcb848f-psn6v": networkPlugin cni failed to set up pod "nginx-7b8bcb848f-psn6v_default" network: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /var/run/antrea/cni.sock: connect: connection refused",
failed to clean up sandbox container "9f19670aa37866a6ed6ce18d6caab59ebe0d2dc69781a978d7a9420cd0de6798" network for pod "nginx-7b8bcb848f-psn6v": networkPlugin cni failed to teardown pod "nginx-7b8bcb848f-psn6v_default" network: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /var/run/antrea/cni.sock: connect: connection refused"]

@antoninbas
Contributor

But I found that if I delete all Antrea-related deployments and then create a new Service deployment without Antrea, I see the errors below. Is this expected?

Running kubectl delete -f antrea.yml doesn't delete Antrea completely. You need to manually remove the CNI conf file on every Node (and optionally remove the antrea-cni binary on every Node). This is not specific to EKS clusters.
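
A minimal cleanup sketch, assuming the default Antrea file locations (the exact file names and paths depend on the manifest and may differ):

# On every Node, after deleting the Antrea manifests:
# Remove the Antrea CNI conf file so kubelet stops calling the Antrea CNI.
sudo rm -f /etc/cni/net.d/10-antrea.conflist
# Optionally remove the Antrea CNI binary as well (name/path is an assumption).
sudo rm -f /opt/cni/bin/antrea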

@antoninbas antoninbas added the triage/needs-information and area/provider/aws labels Sep 28, 2022
@luolanzone
Contributor Author

Thanks. I feel the resource cleanup could be done by a PreStop hook; do you know if there is any reason we don't clean them up?

@antoninbas
Contributor

There are some issues with PreStop hooks. You can refer to kubernetes/kubernetes#35183, and more generally for Antrea cleanup you can refer to #181.

@github-actions
Contributor

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment, or this will be closed in 90 days

@github-actions github-actions bot added the lifecycle/stale label Dec 29, 2022