Traffic from outside VPC does not reach pod #1392

Closed · Nuru opened this issue Mar 2, 2021 · 13 comments · Fixed by #1475

@Nuru commented Mar 2, 2021

What happened:

For some reason, UDP traffic from outside the VPC is not reaching the pod. Traffic flow logs show UDP traffic flowing through the load balancer and being sent to the correct internal address. UDP traffic has been confirmed (via tcpdump) to reach the primary IP of the ENI that hosts the pod's IP, but does not reach the pod. UDP traffic from inside the VPC does reach the pod as expected. TCP traffic from both inside and outside the VPC reaches the pod as expected.

AWS Support has confirmed that security groups are properly configured to allow access.

Attach logs

Logs are attached to AWS Support case 7990499551 which I opened 16 days ago but remains unresolved. I feel like the ticket is not getting directed to the right team, which is why I am opening this issue. I think this is a CNI bug.

What you expected to happen:

With a pod listening on a UDP port and a security group allowing access from 0.0.0.0/0, I expect traffic targeted to that UDP port to be delivered, regardless of its source address.

How to reproduce it (as minimally and precisely as possible):

  1. Create an EKS cluster with a node pool in a private subnet using default configuration.
  2. Create a target Pod server on a secondary ENI on one of the nodes. The only way I know to do this is to first create enough pods to fill up the first ENI.
  3. Create a load balancer target group targeting the target pod on its pod IP address (not the ENI or Node IP address). Critical: ensure "Preserve Client IP" is enabled.
  4. Provision a Network load balancer to forward traffic to the target Pod.

At this point, the configuration is complete and you should be able to communicate from the public internet to the Pod using the NLB's DNS name. However, the connection does not work. On the other hand, you can verify that the Pod is serving traffic properly by connecting to it from any host in the private subnet on the same IP as is specified in the target group.
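For step 3, a minimal AWS CLI sketch (the target group name, port, VPC ID, and ARN below are placeholders, not values from this deployment) of creating an IP-type target group and checking the client IP preservation attribute:

$ aws elbv2 create-target-group \
    --name pod-udp-tg --protocol UDP --port 5000 \
    --target-type ip --vpc-id vpc-0123456789abcdef0

$ aws elbv2 describe-target-group-attributes \
    --target-group-arn <target-group-arn> \
    --query 'Attributes[?Key==`preserve_client_ip.enabled`]'

Note that for UDP and TCP_UDP target groups, client IP preservation is always enabled and cannot be turned off; for TCP target groups it can be toggled with modify-target-group-attributes.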

~~Original version~~

I don't know. The entire configuration, from creating the VPC through creating the EKS cluster to deploying the Pod, Kubernetes Service, and load balancer, was provisioned with Terraform and Helm. Almost exactly the same configuration was applied in another account and works as expected. (The differences between the two configurations are things like ID names, DNS names, ARNs, etc. that have to differ between installations; in all relevant respects the configurations are identical.)

As deployed, the only noticeable difference is that the working instance has the Pod deployed on the EC2 instance's primary ENI and the non-working instance has the pod deployed on a secondary ENI.

Anything else we need to know?:

The default configuration leaves aws-node with AWS_VPC_K8S_CNI_EXTERNALSNAT: false. You can check with

$  kubectl -n kube-system describe daemonset aws-node | grep EXTERNALSNAT
      AWS_VPC_K8S_CNI_EXTERNALSNAT:        false

Unfortunately, this is incompatible with allowing Pods to be reached on their private IPs when the source IP is outside the VPC. This issue can be fixed by setting AWS_VPC_K8S_CNI_EXTERNALSNAT=true but there is no AWS API for doing that, thus it is not practical to set in a large organization.
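For reference, a minimal sketch of the usual way to set the flag today, by patching the aws-node daemonset with kubectl (since there is no AWS-level API for it):

$ kubectl -n kube-system set env daemonset aws-node AWS_VPC_K8S_CNI_EXTERNALSNAT=true
$ kubectl -n kube-system rollout status daemonset aws-node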

  1. The CNI should be smart enough to bypass SNAT when the connection is initiated from outside the node on the pod's private IP.
  2. There should be an API, with corresponding support in the AWS console, aws CLI, Terraform, etc., for setting AWS_VPC_K8S_CNI_EXTERNALSNAT=true

Environment:

  • Kubernetes version (use kubectl version): v1.18.9-eks-d1db3c
  • CNI Version: amazon-k8s-cni:v1.7.5-eksbuild.1
  • OS (e.g: cat /etc/os-release): Amazon Linux 2
  • Kernel (e.g. uname -a): 4.14.209-160.339.amzn2.x86_64 #1 SMP Wed Dec 16 22:44:04 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Nuru added the bug label Mar 2, 2021
@jayanthvn (Contributor)

Hi @Nuru

We will look into the support ticket and get in touch with you for further debugging.

Thanks!

@andrewjeffree

This sounds identical to the issue I just spent a chunk of time with support on yesterday. Our case number is 8015490531 if you want more information. I think the only difference in our case is we're using TCP.

@jayanthvn (Contributor)

Thanks @ImperialXT. Yes, it might be a similar issue, since pods behind secondary ENIs are impacted.

@Nuru - Can you please share the logs from sudo bash /opt/cni/bin/aws-cni-support.sh on the node having the issue? I can verify the NAT rules and get back to you.

You can email them to me at varavaj@amazon.com

@jayanthvn (Contributor)

Hi

Sorry for the delayed response. Since this is ingress traffic and you mentioned the impacted pod is behind the secondary ENI, did you capture a tcpdump to verify that the traffic is entering the instance on the secondary ENI?

Thank you!

@Nuru (Author) commented Mar 19, 2021

@jayanthvn wrote:

Sorry for the delayed response. Since this is ingress traffic and you mentioned the impacted pod is behind the secondary ENI, did you capture a tcpdump to verify that the traffic is entering the instance on the secondary ENI?

Yes. tcpdump shows packets on eth2 on the instance but no packets on eth0 in the pod.
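For anyone reproducing this, a sketch of that capture, assuming UDP port 5000 and a pod image that ships tcpdump (otherwise nsenter into the pod's network namespace from the node):

# On the node: packets arrive on the secondary ENI.
tcpdump -ni eth2 udp port 5000

# Inside the pod: nothing shows up on eth0.
kubectl exec -it <pod-name> -- tcpdump -ni eth0 udp port 5000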

@jayanthvn (Contributor)

I have a BPF script which captures pipeline lookup events. We will reach out to you with your case number (7990499551).

/cc @abhipth

Thanks

@Nuru (Author) commented Mar 28, 2021

After much more investigation, I believe this is a mirror of #75, which was fixed by #130. In that issue, traffic arriving on eth0 had a reverse path out a secondary ENI and was therefore being dropped. My case appears to be the mirror of that one: traffic arriving on eth2 has a reverse route via eth0 (since its source is a public internet address, whose return route is the default via eth0) and is dropped.

@liwenwu-amazon predicted this would happen: #130 (review)

This change may become incompatible with other features such as:

  • add Pod IP to NLB/ALB target group and use NLB/ALB to directly send traffic to Pod IP

My scenario is exactly the one @liwenwu-amazon warned about. We have the NLB in IP mode directly sending traffic to the Pod IP via eth2.

Here are the important diagnostics:

  1. If we enable "martian" logging via sysctl -w net.ipv4.conf.all.log_martians=1, the source packets show up in /var/log/messages as "martians" (see the snippet after this list)
  2. If we disable reverse path filtering with
    sysctl -w net.ipv4.conf.eth2.rp_filter=0
    sysctl -w net.ipv4.conf.all.rp_filter=0
    
    (both settings are required), the source packets are no longer logged as martians and they reach the pod as intended
  3. After reaching the pod, the server responds appropriately, but the response packets are lost, presumably due to the same configuration issue that caused them to be marked as martians in the first place.
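A quick way to watch for those log entries while sending test traffic (assuming the default Amazon Linux syslog setup):

# sysctl -w net.ipv4.conf.all.log_martians=1
# tail -f /var/log/messages | grep -i martian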

The destination Pod's IP is 10.101.4.84. If we use ip rule show we see (among other rules)

# ip rule show | magic-filter
512:	from all to 10.101.4.84 lookup main
1024:	from all fwmark 0x80/0x80 lookup main
1536:	from 10.101.4.84 to 10.101.0.0/18 lookup 3
32766:	from all lookup main
32767:	from all lookup default

# ip route show table 3
default via 10.101.0.1 dev eth2 
10.101.0.1 dev eth2 scope link 

We are only interested in 10.101.4.84 right now, since that is our Pod's IP address. The 1536 rule provides a reverse path for VPC traffic out eth2, where it came in, but no reverse path for the internet, so reverse traffic goes out eth0 via the default routes.

If I turn reverse path filtering back on and add a reverse route

ip route add 10.101.0.1 dev eth2 scope link 
ip route add  nn.nn.nn.nn/32 via 10.101.0.1 dev eth2

(where nn.nn.nn.nn is my workstation's IP address) then traffic flows properly in both directions.

Of course, this is not the correct fix; you will have to figure out what is. My guess is that the restriction on destination in the 1536 rule is unnecessary. If I delete my added routes (so we are back to the original configuration) and then add

ip rule add priority 1111 from  10.101.4.84 to all lookup 3

then again traffic flows as expected.

Update

It appears that I am in luck in my specific case, because my EKS cluster is in a private subnet with a NAT Gateway attached. This means I can set the AWS_VPC_K8S_CNI_EXTERNALSNAT flag to true on the CNI (specifically the aws-node node daemonset) as described here and then, indeed, it will set the rule to from 10.101.4.84 to all lookup 3 just as I suggested.

The bad news is that this remains a problem for people who put their EKS clusters in public subnets. They do not have the option of setting AWS_VPC_K8S_CNI_EXTERNALSNAT=true because they do not have an external process performing SNAT. So this is still a bug that needs to be fixed. Now that I understand it, I see that this likely is going to have to follow the plan outlined in #130. As @fasaxc commented at the time:

This PR (#130) doesn't do anything for non-eth0 secondary IPs right now, those packets should be processed as before. I wasn't sure what the desired behaviour of such IPs was. The PR could be extended to use a mark value per ENI but that would require quite a bit of additional work. If a user is already ENI-aware, presumably they'll only be accessing pods attached to that ENI so the routing should just work in that case?

As we have seen, the routing only works if the "user" is in the same VPC, where SNAT is automatically disabled. It does not work where the user is in a different VPC or on the internet, due to reverse path filtering.

At least now I understand that this was not a careless oversight, but a series of difficult decisions that need further refinement.

This is also more than a bit outside my wheelhouse. Nevertheless, I will offer an opinion that the correct thing to do is to use connmark to mark a connection that comes in via any secondary ENI, and then prevent SNAT or NODEPORT from forcing reverse traffic out eth0 if that mark is set. There are already rules that send traffic out the right ENI if it is from within the VPC, so just enhance/duplicate those rules to do the same thing when the connmark is present.
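To make that concrete, here is an illustrative (untested) sketch of the connmark idea, reusing table 3 from the diagnostics above; the 0x100 mark value and rule priority are arbitrary choices for this example, not anything the CNI currently uses:

# Mark new connections arriving on the secondary ENI ...
iptables -t mangle -A PREROUTING -i eth2 -m conntrack --ctstate NEW \
    -j CONNMARK --set-xmark 0x100/0x100
# ... restore that mark onto packets of established connections, including the
# replies forwarded from the pod ...
iptables -t mangle -A PREROUTING -m conntrack --ctstate ESTABLISHED,RELATED \
    -j CONNMARK --restore-mark --nfmask 0x100 --ctmask 0x100
# ... and route marked traffic out via the ENI's route table instead of eth0.
ip rule add priority 1200 fwmark 0x100/0x100 lookup 3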

Nuru changed the title from "UDP traffic from outside VPC does not reach pod" to "Traffic from outside VPC does not reach pod" on Mar 28, 2021
@jayanthvn (Contributor)

Hi @Nuru

Nice debugging. During the debug session we did set sysctl -w net.ipv4.conf.eth2.rp_filter=0 but missed setting "all". We also recommended setting external SNAT during the debug session, but we are still wondering how TCP traffic worked for you.

@Nuru (Author) commented Mar 29, 2021

@jayanthvn TCP worked because, by default, an NLB with IP targets disables source IP preservation for TCP, which means the observed source address was the private IP of the NLB, which is within the VPC, even if the sender was outside the VPC. UDP forces source IP preservation.

Which raises the question: with source IP preservation, how does return traffic know to exit via the NLB (so that the client sees a response from the same address it sent the request to) rather than through the NAT gateway?

@kishorj (Contributor) commented Apr 22, 2021

@Nuru, thank you for sharing the details of your troubleshooting and the analysis. You are right, the traffic flow is not successful due to the routing configuration. An NLB with IP targets and client IP preservation enabled, where the pod is attached to a secondary ENI, exhibits this behavior.

There are two issues

  • First, the reverse path filter on secondary ENIs attached to the Linux instance is set to strict mode (rp_filter = 1). All ingress traffic on secondary interfaces with a source IP outside of the VPC ranges gets dropped without further processing.
  • Second, if the rp_filter is made less restrictive, incoming packets on non-eth0 interfaces do reach the pods. However, the return path for non-VPC traffic is always through eth0. This asymmetry causes the return packets to bypass the NLB, and as a result the affected connections are not successful.

When external SNAT is enabled, traffic ingressing on secondary interfaces has the same return path, since the routing rules don't have the VPC CIDR restriction. External SNAT can only be used for EC2 instances in private subnets.

For TCP connections through NLB-IP, the client IP doesn't get preserved unless enabled via target group attributes. Since the source IP is within the VPC range, the return path is the same as the ingress interface, so the issue is not seen.

The fix involves configuring less restrictive rp_filter settings on the secondary interfaces, plus connmark and ip rules to make the return path the same as the ingress interface. I'm reviewing the corner cases and any other features that could be affected by the proposed changes.
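For illustration only (not the actual fix), the rp_filter part could look like this on a node with the pod behind eth2; loose mode (2) avoids disabling source validation entirely, and because the kernel uses the maximum of the "all" and per-interface values, setting only the interface is enough even while "all" stays at strict (1):

# Loose reverse-path filtering on the secondary interface only.
sysctl -w net.ipv4.conf.eth2.rp_filter=2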

@kishorj (Contributor) commented Apr 22, 2021

Did you also try running your application pods on Fargate? I don't foresee any issues. If you were able to verify, feel free to share your results.

@Nuru (Author) commented Apr 22, 2021

@kishorj I did not try running pods on Fargate and do not have time or resources to try it. You understand the problem well enough now that you should be able to deploy a test case yourself.

@dcarley commented Jan 31, 2022

📝 In case it saves anyone some debugging time, this also affects NLBs when using IP mode and client IP preservation with EKS 1.20 clusters that don't contain the fixed CNI version.
