Stale IMDS entries could cause VPC CNI to incorrectly release in-use IPs #3019

Open
M00nF1sh opened this issue Aug 30, 2024 · 6 comments · Fixed by #3033

@M00nF1sh
Contributor

M00nF1sh commented Aug 30, 2024

What happened:

TL;DR: This is a code defect in VPC CNI, triggered by a defective deployment on the EC2 side where IP allocations/releases were not reflected in the Instance Metadata Service (IMDS). As a result, pod IPs used by existing pods might be released back to EC2 or reused by another pod.

Background:

  1. There are three views of the IP addresses associated with an ENI:
    • view#1: EC2's DescribeNetworkInterfaces API provides the de-facto set of IP addresses associated with an ENI.
    • view#2: IMDS has an endpoint network/interfaces/macs/<eniMac>/local-ipv4s that is supposed to contain the set of IP addresses associated with an ENI, and is expected to become eventually consistent with view#1 within a short period.
    • view#3: VPC CNI has an internal cache that contains the set of IP addresses associated with an ENI. It is updated when VPC CNI makes EC2 calls such as AssignPrivateIpAddresses or UnassignPrivateIpAddresses.
  2. VPC CNI has a periodically executed eniIPPoolReconcile routine that tries to reconcile its view#3 against the real set of IP addresses on the ENI (view#1). However, to save unnecessary API calls, it assumes IMDS (view#2) is the same as EC2 (view#1) if length($view#3) == length($view#2) (this is the code defect that the IMDS bug exposed), and falls back to view#1 only when the lengths mismatch. It then adds/removes IPs in view#3 to match IMDS (view#2) or EC2 (view#1). A minimal sketch of this shortcut follows this list.
    • tip: ignore the +1 in the actual comparison code; it compensates for the primary IP of the ENI, which I deliberately excluded from this discussion & the examples below for simplicity.
    • tip: the actual logic is more complex than this, e.g. those lines, but we can ignore them for this particular incident.
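
Here is a minimal Go sketch of that length-based shortcut (names such as reconcileENI, cachedIPs, imdsIPs, and fetchFromEC2 are illustrative, not the actual ipamd code):

```go
package main

import "fmt"

// reconcileENI decides which view of the ENI's IP set to trust when refreshing
// the internal cache. The defect: when the cached count equals the IMDS count,
// IMDS is assumed to already match EC2 and is used as the source of truth
// without checking its contents.
func reconcileENI(cachedIPs, imdsIPs []string, fetchFromEC2 func() []string) []string {
	// The real code also accounts for the ENI's primary IP (the +1); omitted here.
	if len(cachedIPs) == len(imdsIPs) {
		// Lengths match: trust IMDS even though its contents may be stale.
		return imdsIPs
	}
	// Lengths mismatch: fall back to the authoritative EC2 view (DescribeNetworkInterfaces).
	return fetchFromEC2()
}

func main() {
	cached := []string{"10.0.0.1", "10.0.0.2"}
	staleIMDS := []string{"10.0.0.3", "10.0.0.4"} // same length, completely different contents
	fromEC2 := func() []string { return cached }

	// Prints the stale IMDS view because only the lengths were compared.
	fmt.Println(reconcileENI(cached, staleIMDS, fromEC2))
}
```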

What happened:
Due to a bug on IMDS's end, the set of IP addresses associated with an ENI was no longer eventually consistent with the actual state from the EC2 API. For example:

  • when some IPs were freed (UnassignPrivateIpAddresses) by VPC CNI, they still appeared in IMDS for several hours.
  • when some IPs were allocated (AssignPrivateIpAddresses) by VPC CNI, they still did not show up in IMDS for several hours.

Let's focus on a particular case related to eni-<redacted>.

  1. With the above IMDS issue in place, the set of IP addresses for this ENI in the three views was as follows:

    • // some IPs that are common in all three views // let's say M IPs.
    • [10.56.156.194, 10.56.151.27, 10.56.151.25]: these 3 IP addresses are in VPC CNI's internal cache & EC2, but not in IMDS. (these are newly allocated IPs by VPC CNI)
    • [10.56.150.175, 10.56.147.61, 10.56.159.130, 10.56.151.147]: these 4 IP addresses are in IMDS only. (these were previously freed IPs by VPC CNI)
  2. At 2024-08-29T22:14:11.001Z, the eniIPPoolReconcile routine executes. Since length($view#3) and length($view#2) mismatch (M+3 != M+4), VPC CNI fetches view#1 from EC2 directly and compares it with view#3. Nothing happens, as they match.

  3. At 2024-08-29T22:14:30.579Z, a new pod is created and gets assigned an IP address. Due to the warm pool maintenance algorithm, at 2024-08-29T22:14:33.645Z VPC CNI allocated a new IP 10.56.145.254 to this ENI. Since this new IP did not show up in IMDS either, the three views became as follows:

    • // some IPs that are common in all three views // let's say M IPs.
    • [10.56.156.194, 10.56.151.27, 10.56.151.25, 10.56.145.254]: these 4 IP addresses are in VPC CNI's internal cache & EC2, but not in IMDS. (these are newly allocated IPs by VPC CNI)
    • [10.56.150.175, 10.56.147.61, 10.56.159.130, 10.56.151.147]: these 4 IP addresses are in IMDS only. (these were previously freed IPs by VPC CNI)
  4. At 2024-08-29T22:15:36.566Z, the eniIPPoolReconcile routine executes. Since length($view#3) and length($view#2) now match (M+4 == M+4), VPC CNI trusted IMDS as the source of truth and did the following (see the Go sketch after this list):

    • removed those 4 IPs [10.56.156.194, 10.56.151.27, 10.56.151.25, 10.56.145.254] from its internal cache (view#3).
      • this also cleans up the marker that records whether an IP is being used by a pod. 10.56.156.194 is the IP used by the customer's pod. However, so far the pod network still works; it's just a marker.
    • it did not add [10.56.150.175, 10.56.147.61, 10.56.159.130, 10.56.151.147] to its internal cache (view#3), thanks to those lines of code. (complex and unrelated logic, so I'll skip the explanation of this part)
  5. So now the three views became:

    • // some IPs that are common in all three views // let's say M IPs.
    • [10.56.156.194, 10.56.151.27, 10.56.151.25, 10.56.145.254]: these 4 IP addresses are in EC2 only.
    • [10.56.150.175, 10.56.147.61, 10.56.159.130, 10.56.151.147]: these 4 IP addresses are in IMDS only. (these were previously freed IPs by VPC CNI)
  6. At 2024-08-29T22:15:39.197Z, the warm pool maintenance algorithm decided to add two additional IPs to this ENI (10.56.148.72 & 10.56.154.181). Now the three views became as follows:

    • // some IPs that are common in all three views // let's say M IPs.
    • [10.56.156.194, 10.56.151.27, 10.56.151.25, 10.56.145.254]: these 4 IP addresses are in EC2 only.
    • [10.56.150.175, 10.56.147.61, 10.56.159.130, 10.56.151.147]: these 4 IP addresses are in IMDS only. (these were previously freed IPs by VPC CNI)
    • [10.56.148.72 & 10.56.154.181]: these 2 IP addresses are in VPC CNI's internal cache & EC2, but not in IMDS. (these are newly allocated IPs by VPC CNI)
  7. At 2024-08-29T22:16:42.107Z, the eniIPPoolReconcile routine executes. Since length($view#3) and length($view#2) now mismatch (M+2 != M+4), VPC CNI fetches view#1 from EC2 directly, compares it with view#3, and does the following:

    • adds those 4 IPs [10.56.156.194, 10.56.151.27, 10.56.151.25, 10.56.145.254] back to its internal cache (view#3). However, those IPs are not recorded as being occupied by any pod.
  8. At 2024-08-29T22:16:44.731Z, the warm IP maintenance algorithm runs again, notices that there are a lot of IP addresses not assigned to any pod (including our 10.56.156.194), and thus decides to release those IP addresses back to EC2 via UnassignPrivateIpAddresses. After this, pod networking for the pod with 10.56.156.194 breaks.
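
To illustrate step 4, here is a small Go sketch (the helper name onlyInFirst is illustrative, not the actual ipamd code) of how trusting the stale IMDS view strips every cache-only IP, including 10.56.156.194, which is still in use by a pod:

```go
package main

import "fmt"

// onlyInFirst returns the elements of a that are not present in b.
func onlyInFirst(a, b []string) []string {
	set := make(map[string]struct{}, len(b))
	for _, ip := range b {
		set[ip] = struct{}{}
	}
	var out []string
	for _, ip := range a {
		if _, ok := set[ip]; !ok {
			out = append(out, ip)
		}
	}
	return out
}

func main() {
	// Cache-only IPs (view#3) and IMDS-only IPs (view#2) going into step 4; the M common IPs are omitted.
	cacheOnly := []string{"10.56.156.194", "10.56.151.27", "10.56.151.25", "10.56.145.254"}
	imdsOnly := []string{"10.56.150.175", "10.56.147.61", "10.56.159.130", "10.56.151.147"}

	// With equal totals (M+4 == M+4) IMDS is trusted, so every cache-only IP is
	// dropped from the internal cache, along with the marker that 10.56.156.194
	// is still assigned to a running pod.
	fmt.Println("removed from cache:", onlyInFirst(cacheOnly, imdsOnly))
}
```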

Attach logs

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
  • CNI Version
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
@M00nF1sh added the bug label Aug 30, 2024
@orsenthil
Member

Instead of comparing the lengths, if we compare the contents, will that prevent the bug under discussion?

@M00nF1sh
Contributor Author

M00nF1sh commented Aug 31, 2024

Instead of comparing the lengths, if we compare the contents, will that prevent the bug under discussion?

Yes, comparing contents would be a simple fix in our current code/setup, since we only take IP pool actions once per minute.
We might consider some more in-depth changes, such as tracking when an IP is added/removed and tolerating recently modified IPs during the comparison, to save AWS API calls. We could also track how long it takes IMDS to reach consistency as a metric.
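
Here is a minimal Go sketch of what comparing contents could look like (the function name sameIPSet is illustrative; this is not necessarily the exact change in the eventual fix):

```go
package main

import "fmt"

// sameIPSet reports whether two IP lists contain exactly the same addresses,
// ignoring order. Any mismatch would force a fallback to the EC2 API, even
// when the lengths happen to be equal.
func sameIPSet(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	seen := make(map[string]struct{}, len(a))
	for _, ip := range a {
		seen[ip] = struct{}{}
	}
	for _, ip := range b {
		if _, ok := seen[ip]; !ok {
			return false
		}
	}
	return true
}

func main() {
	cached := []string{"10.0.0.1", "10.0.0.2"}
	staleIMDS := []string{"10.0.0.3", "10.0.0.4"}
	// Same length but different contents: the content check catches the stale IMDS view.
	fmt.Println(sameIPSet(cached, staleIMDS)) // false
}
```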

@BojanZelic-DD

We've seen pod IPs get reused even when they're assigned to another pod because of this issue, and 2 separate eth interfaces end up bound to the same IP (but only one works):

# ip netns exec cni-2c070b18-e7f5-6128-1a15-3aaae50fc1e0c ip addr
...
3: eth0@if3050: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue state UP group default
    link/ether aa:c9:1b:84:ac:bf brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.30.118.6/32 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::a8c9:1bff:fe84:acbf/64 scope link
       valid_lft forever preferred_lft forever

# ip netns exec cni-293a1c74-4d7f-8414-3d56-453a459db1f8 ip addr;
...
3: eth0@if3047: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue state UP group default
    link/ether 7a:3c:a4:6a:ca:d9 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.30.118.6/32 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::783c:a4ff:fe6a:cad9/64 scope link
       valid_lft forever preferred_lft forever

@M00nF1sh
Contributor Author

M00nF1sh commented Sep 4, 2024

@BojanZelic-DD
Thanks for sharing this real-world case with us. We are aware of such cases internally and are working with the EC2 team to retire impacted instances that were used as EKS nodes. We will also work on improving the CNI's logic to prevent similar issues from recurring.

Basically worker nodes affected by this issue could be in one of the following states:

  1. An IP address in use by an existing pod might be released back to EC2.
    1. This could potentially result in that IP being allocated to another node and assigned to a different pod.
  2. An IP address in use by an existing pod might be reassigned to another pod on the same node.
  3. An IP address in use by an existing pod may be marked as unassigned internally, but remains attached to the node.

If state #1 or #2 occurs, the existing pod will lose network connectivity. In state #3, the existing pod may not show immediate symptoms, but it could transition to state #1 or #2 at any time in the future, even after IMDS corrects the discrepancies.


This issue is now closed. Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.

@orsenthil reopened this Sep 18, 2024
@orsenthil
Member

This can be closed only when the release with the fix is available.
