[AWS] Unsafe decommissioning of nodes when ASGs are out of instances #5829
I investigated this some more. Here is what I believe is happening:
I can imagine a number of fixes for this; however, I don't know the code well enough to comment on side effects.
Will create a PR to implement 2).
Would be great to get some insight/review here. It frequently (roughly daily) kills long-running batch jobs related to our model training and optimisation, because it sometimes triggers accidental node scale-down when AWS is out of instances (which is a constant occurrence). This happens regardless of the fact that we set the do-not-evict annotation on these Pods, because, as described above, the AWS out-of-instances situation is not properly factored in.
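For reference, the "do not evict" annotation mentioned above is cluster-autoscaler's cluster-autoscaler.kubernetes.io/safe-to-evict annotation. Below is a minimal sketch of setting it with client-go (the namespace and Pod name are made-up examples); as this thread shows, it only prevents autoscaler-initiated eviction and does not help when the ASG itself terminates the instance.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Assumes the code runs inside the cluster; use clientcmd for out-of-cluster access.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Mark a Pod as not evictable by cluster-autoscaler.
	// The namespace and Pod name are illustrative.
	patch := []byte(`{"metadata":{"annotations":{"cluster-autoscaler.kubernetes.io/safe-to-evict":"false"}}}`)
	pod, err := client.CoreV1().Pods("training").Patch(
		context.TODO(), "model-training-0", types.MergePatchType, patch, metav1.PatchOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("annotated", pod.Name)
}
```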
Hello, hello! We're also being impacted by this issue and it would be nice to get an update on when a fix can be expected. If not, are there any known workarounds to avoid the problem? (We've tried things like the scale-down-after-scale-up delay, but it didn't seem to work.) Our setup has stateful pods, so having the scaling group kill active nodes breaks the sessions. Thank you for the time and effort!
Hi @gicuuu3, there is a PR open with a fix that's working for us; however, I'm not sure whether it's going to get merged, since it might have implications for other use cases.
Understood. We've started using mixed instances and that also helped in our scenario. I guess we'll keep track of this PR, and if we ever start hitting the problem again we can apply it to our own setup as well. Thank you for the help!
In case anyone is still struggling with this issue, another thing that helped us was enabling Auto Scaling group instance scale-in protection: https://docs.aws.amazon.com/autoscaling/ec2/userguide/ec2-auto-scaling-instance-protection.html Since our autoscaler always makes targeted removal requests, this toggle protects us from random removal due to capacity reduction, while still allowing the autoscaler to remove nodes it doesn't need in the cluster by "name".
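For anyone who wants to try the same mitigation, here is a minimal sketch using the aws-sdk-go v1 autoscaling client; the group name and instance ID are placeholders. The first call enables scale-in protection for newly launched instances (existing ones can be covered with SetInstanceProtection), and the second shows the targeted, per-instance termination that, per the comment above, kept working while capacity-driven removals were blocked.

```go
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

func main() {
	svc := autoscaling.New(session.Must(session.NewSession()))

	// Enable scale-in protection for instances launched into the ASG from now on;
	// already-running instances can be protected with SetInstanceProtection.
	if _, err := svc.UpdateAutoScalingGroup(&autoscaling.UpdateAutoScalingGroupInput{
		AutoScalingGroupName:             aws.String("my-node-group"), // placeholder name
		NewInstancesProtectedFromScaleIn: aws.Bool(true),
	}); err != nil {
		panic(err)
	}

	// Targeted removal of one specific instance, decrementing desired capacity
	// at the same time; this is the kind of "by name" removal the comment above
	// says still worked with protection enabled.
	if _, err := svc.TerminateInstanceInAutoScalingGroup(&autoscaling.TerminateInstanceInAutoScalingGroupInput{
		InstanceId:                     aws.String("i-0123456789abcdef0"), // placeholder ID
		ShouldDecrementDesiredCapacity: aws.Bool(true),
	}); err != nil {
		panic(err)
	}

	fmt.Println("scale-in protection enabled and targeted termination requested")
}
```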
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
We've also observed this happening multiple times across different clusters using cluster autoscaler v1.24.3.
I suspect this started happening more frequently because of the feature to abort early on OutOfCapacity errors: #4489. The logs from the intended behavior of that PR match what we see on the cluster:
This means cluster autoscaler doesn't recognize that an instance was successfully launched, because of the early abort on the first FAILED status, and the decreased desiredCount removes a running instance with potential workloads. If that's the case, it seems like the proposed PR above could be one fix. Another may be exposing a flag to respect the backoff duration even when hitting an OutOfCapacity error: #5756
/remove-lifecycle stale
@apy-liu if requested instance(s) do not show up in … For example, if you attempt to scale up an ASG from 3 -> 6, but only 1 of the 3 instances is provisioned before a … If instead …
Hey @jlamillan, thanks for taking a look and providing a detailed explanation. That makes sense to me; in that case, the PR that adds a flag to respect the backoff duration seems like the more stable approach.
Attaching some sample logs to provide some context. This was what led us to that conclusion. CA Logs:
AWS console:
Right before the successful start at 23:43:07, the CA checks the status of the node group, which has a FAILED status at 23:43:00, and cleans up through the early-abort feature. It does look like this causes a race between the ASG finishing the scale-up and the CA cleaning up placeholder instances.
@apy-liu or @theintz - curious whether either of you has looked into using termination policies or scale-in protection, as @gicuuu3 suggested, as an alternative way to resolve this issue?
Yes, we're looking into the suggested mitigations, though we're still hoping to get to a resolution through one of the proposed fixes.
Hi @gjtempleton, this was the issue we raised during the SIG meeting. We'd be interested to hear your feedback on it and on the suggested PR to fix it, or on a different approach, as we're still looking for a long-term solution. Thanks
Hi @gjtempleton, could you please outline the next steps and recommend the optimal approach to address this issue? Additionally, is there any restriction on the range of older versions to which a fix can be backported when requesting a backport? Thanks
Hi @gjtempleton, thank you for providing your feedback at this week's SIG meeting. It sounds like the proposed solution would be adding a check before cleaning up the placeholder instance and updating the internal asgCache to reflect the new instance if it does come up. You had mentioned drafting a PR for the above, so I wanted to check where we could follow along. I also wanted to follow up on the question @kmsarabu posed above about the range of older versions to which backports are supported. Thanks!
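As a rough illustration of that proposal (a sketch of the idea, not the actual PR; it assumes the aws-sdk-go v1 autoscaling client, and the helper name is made up), the autoscaler would look at the most recent scaling activity before shrinking the ASG to clean up placeholders:

```go
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
	"github.com/aws/aws-sdk-go/service/autoscaling/autoscalingiface"
)

// shouldCleanUpPlaceholder reports whether it looks safe to treat outstanding
// placeholder instances as failed and shrink the ASG. It only returns true
// when the most recent scaling activity did not succeed, so an instance that
// launched late is not terminated by a capacity decrease.
func shouldCleanUpPlaceholder(svc autoscalingiface.AutoScalingAPI, asgName string) (bool, error) {
	out, err := svc.DescribeScalingActivities(&autoscaling.DescribeScalingActivitiesInput{
		AutoScalingGroupName: aws.String(asgName),
		MaxRecords:           aws.Int64(1), // assume the first entry is the latest activity
	})
	if err != nil {
		return false, err
	}
	if len(out.Activities) == 0 {
		return false, nil
	}
	status := aws.StringValue(out.Activities[0].StatusCode)
	return status == autoscaling.ScalingActivityStatusCodeFailed ||
		status == autoscaling.ScalingActivityStatusCodeCancelled, nil
}

func main() {
	svc := autoscaling.New(session.Must(session.NewSession()))
	ok, err := shouldCleanUpPlaceholder(svc, "my-node-group") // placeholder name
	if err != nil {
		panic(err)
	}
	fmt.Println("safe to shrink ASG for placeholder cleanup:", ok)
}
```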
@gjtempleton I am uncertain about the proposed fix (checking for recent scaling activity before changing the ASG size). Unfortunately I missed the SIG meeting where this was discussed, but I think it doesn't solve the problem; it just narrows the window in which a race can occur (see my comment on #6818). I think the ideal way to handle this (and how I've seen it done in the past) is to enable scale-in protection on the ASGs (as mentioned by @gicuuu3). This is the only way I know of to safely change the size of an ASG without accidentally terminating running nodes. I will try to join the SIG meeting this Monday to discuss further.
The problem is not just AWS; it also exists on other clouds, for example AliCloud. The root of the problem is that scaling activities on the cloud are asynchronous.
This merge resolves an issue in the Kubernetes Cluster Autoscaler where actual instances within AWS Auto Scaling Groups (ASGs) were incorrectly decommissioned instead of placeholders. The updates ensure that placeholders are exclusively targeted for scaling down under conditions where recent scaling activities have failed. This prevents the accidental termination of active nodes and enhances the reliability of the autoscaler in AWS environments. Fixes kubernetes#5829
This change expands on PR kubernetes#6818, which addresses the same problem: ensuring that placeholders, rather than actual instances, are targeted for scale-down when recent scaling activities have failed, so active nodes are not accidentally terminated. Fixes kubernetes#5829
PR#6911 Backport for 1.28: Fix/aws asg unsafe decommission #5829
PR#6911 Backport for 1.29: Fix/aws asg unsafe decommission #5829
PR#6911 Backport for 1.30: Fix/aws asg unsafe decommission #5829
Which component are you using?: cluster-autoscaler
What version of the component are you using?: cluster-autoscaler 1.25.1
Component version: chart cluster-autoscaler-9.28.0
What k8s version are you using (kubectl version)?:
What environment is this in?: AWS EKS
What did you expect to happen?: We are running EKS on managed nodes using cluster-autoscaler to dynamically scale up and down the sizes of the node groups. Unfortunately, sometimes AWS is unable to fulfil requests for scaling up (Could not launch On-Demand Instances. InsufficientInstanceCapacity). When that happens, cluster-autoscaler gets confused about the size of the ASG and sends unnecessary SetDesiredCapacity requests, which result in unsafe scaling down of the ASG (details below). The expectation is that CA handles these situations transparently and does not scale down the ASG.
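To make the "unsafe" part concrete: SetDesiredCapacity only tells the ASG how many instances it should have, and when that number is lowered the ASG picks the victims itself according to its termination policy, with no awareness of which nodes the autoscaler actually wanted to remove. A minimal sketch of the call, assuming the aws-sdk-go v1 autoscaling client (the group name is illustrative):

```go
package main

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

func main() {
	svc := autoscaling.New(session.Must(session.NewSession()))

	// Lower the desired capacity of the group. AWS then decides on its own
	// which instance(s) to terminate, based on the ASG's termination policy,
	// so a node that is running workloads can be picked even though the
	// autoscaler never chose it for removal.
	if _, err := svc.SetDesiredCapacity(&autoscaling.SetDesiredCapacityInput{
		AutoScalingGroupName: aws.String("eks-managed-node-group"), // placeholder name
		DesiredCapacity:      aws.Int64(3),
		HonorCooldown:        aws.Bool(false),
	}); err != nil {
		panic(err)
	}
}
```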
What happened instead?: We are observing the following behavior, by looking at both the CA logs and the CloudTrail logs of the requests sent. I don't know enough about the internal workings of the CA, so some of this is an assumption.
See the attached image that shows the sequence of events as well.
How to reproduce it (as minimally and precisely as possible): Difficult to reproduce, as it only happens when AWS capacity issues occur. We are regularly observing this behavior though; I am happy to assist with more information.
Anything else we need to know?: