📖 Update spot instances proposal with termination handler design #3528

Closed
wants to merge 1 commit into from

Conversation

alexander-demicev
Contributor

What this PR does / why we need it:
This PR updates the spot instances proposal with the termination handler design.

Thanks to @JoelSpeed because it's mostly copy-paste of his proposal done for OpenShift.

@k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 25, 2020
@k8s-ci-robot
Contributor

Hi @alexander-demichev. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Aug 25, 2020
@neolit123
Member

neolit123 commented Aug 25, 2020

@alexander-demichev could you please link to the PR/issue/comment where this amendment was requested?

@neolit123
Member

/kind documentation

@k8s-ci-robot added the kind/documentation Categorizes issue or PR as related to documentation. label Aug 25, 2020
Member

@vincepri left a comment


/hold

For more information

for termination.

The code for all providers will be mostly similar; the only difference is the logic for polling the termination endpoint and processing the response.
This means that the cloud provider type can be passed as an argument, and the termination handler can be common to all providers and placed in a separate repository, for example: `kubernetes-sigs/cluster-api-termination-handler`.
Member

Given that this doesn't actually exist today, should we remove it?

Contributor

Do you have a POC of what the code would look like?

Contributor Author

Yes, I'll put it all together and demonstrate it, probably next week.


Comment on lines +268 to +302
To enable graceful termination of workloads running on non-guaranteed instances,
a termination handler pod will need to be deployed on each spot instance.
Member

Where would users get this termination handler?

Contributor Author

The plan is to make something similar to what Calico has.

Member

I'm a bit curious as to why a new termination handler rather than trying to leverage/extend existing projects, such as https://github.com/aws/aws-node-termination-handler for this purpose?

One of the things that stands out to me is that the aws-node-termination-handler supports an SQS-based system that would avoid the security concerns mentioned below.

Contributor

Up for discussion as to whether this benefit is worth it or not, but the benefit of having our own termination handler is that it reduces the time to get a new Node.

With a normal handler, which needs to drain the node itself, we have to wait for the Node to go away and for the MHC to remediate.

Using our own, we can trigger Machine deletion as soon as the termination notice is served, causing the MachineSet to create a new Machine.

In some cases this won't be all that useful, but there are a couple of cases I can think of where it is.

In GCP, preemptible instances are only allowed to run for 24 hours and are then served a termination notice, so this would save time in replacing the lost compute capacity. Secondly, while CAPI doesn't support it yet, if a MachineSet supported multiple AZs (which I believe there's discussion of somewhere), the MachineSet may be able to create a new instance in a different AZ if a single AZ is terminating its instances.

As a secondary benefit, by leveraging the machine controller shutdown logic, any extra per-instance logic we ever encode (e.g. pre-stop hooks) would also run before the instance is actually gone (more useful on AWS, with its 2 minute warning, than on other providers, I admit).

Member

@JoelSpeed is that something that we could address by working with related upstream projects to add support for Cluster API rather than having to replicate support for the various different ways to detect spot instance termination across various providers?

Contributor Author

@detiber One more benefit of having our own termination handler is that we would have a single project for all infra providers that support spot instances. This allows us to deliver a single binary to users instead of three different ones, one per cloud.
It's not a full replication of the other projects; it's more like a stripped-down version of them, which should reduce the complexity of integrating termination handler functionality into CAPI.
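
(For illustration: a minimal sketch of the "single binary, provider type passed as an argument" idea. The `Poller` interface, `NewPoller` factory and `--provider` flag are hypothetical names, not code from any existing repository; the AWS endpoint is the documented IMDS spot interruption path, queried IMDSv1-style without a session token.)

```go
// Hypothetical sketch: a provider-agnostic termination poller selected via a flag.
package main

import (
	"context"
	"flag"
	"fmt"
	"log"
	"net/http"
	"time"
)

// Poller checks whether the instance the pod runs on has received a termination notice.
type Poller interface {
	TerminationPending(ctx context.Context) (bool, error)
}

// awsPoller polls the EC2 instance metadata service for a spot interruption notice.
type awsPoller struct{ client *http.Client }

func (p *awsPoller) TerminationPending(ctx context.Context) (bool, error) {
	// AWS serves /latest/meta-data/spot/instance-action with 200 once a spot
	// interruption is scheduled, and 404 otherwise (IMDSv1-style, no token).
	req, err := http.NewRequestWithContext(ctx, http.MethodGet,
		"http://169.254.169.254/latest/meta-data/spot/instance-action", nil)
	if err != nil {
		return false, err
	}
	resp, err := p.client.Do(req)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK, nil
}

// NewPoller returns the poller for the given provider; GCP and Azure would be added alike.
func NewPoller(provider string) (Poller, error) {
	switch provider {
	case "aws":
		return &awsPoller{client: &http.Client{Timeout: 5 * time.Second}}, nil
	default:
		return nil, fmt.Errorf("unsupported provider %q", provider)
	}
}

func main() {
	provider := flag.String("provider", "aws", "cloud provider type (aws, gcp, azure)")
	flag.Parse()

	poller, err := NewPoller(*provider)
	if err != nil {
		log.Fatal(err)
	}
	for {
		pending, err := poller.TerminationPending(context.Background())
		if err != nil {
			log.Printf("polling failed: %v", err)
		} else if pending {
			log.Println("termination notice received; mark the Node as Terminating here")
			return
		}
		time.Sleep(5 * time.Second)
	}
}
```

GCP and Azure would plug in as additional `Poller` implementations; the per-provider differences (extra request headers, response formats) stay isolated behind the interface, which is the point being made above.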

Comment on lines 297 to 348
```yaml
---
apiVersion: cluster.x-k8s.io/v1alpha3
kind: MachineHealthCheck
metadata:
  name: cluster-api-termination-handler
  labels:
    api: clusterapi
    k8s-app: termination-handler
spec:
  selector:
    matchLabels:
      cluster.x-k8s.io/interruptible-instance: "" # This label is automatically applied to all spot instances
  maxUnhealthy: 100% # No reason to block the interruption, it's inevitable
  unhealthyConditions:
  - type: Terminating
    status: "True"
    timeout: 0s # Immediately terminate the instance
```
Member

Shouldn't this be something that the MachinePool controller should handle rather than MachineHealthCheck?

cc @CecileRobertMichon

Contributor

Spot instances are implemented for InfraMachines, not just MachinePools (in fact they aren't implemented for MachinePools yet, see Future work below). This solution only works for Machines. For ASGs/VMSS this would need to be part of the MachinePool controller, but we would need both.

Contributor

In that case, should the Machine controller be responsible for noticing this condition? That way it would be the same in both Machine and MachinePool?

@k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 25, 2020

#### Running the termination handler
The Termination Pod will be part of a DaemonSet, which should be deployed manually by users. It will select Nodes which are labelled as spot instances to ensure the Termination Pod only runs on instances that require termination handlers.
Contributor

"which should be deployed manually by users" -> can ClusterResourceSets be leveraged instead?

Contributor

Are ClusterResourceSets usable at this point/will they be by the time we want to implement this proposal? I believe they would solve the issue well

Contributor

They are usable (we use them in e2e) but are still experimental and turned on by feature flag.

Contributor Author

We can definitely use ClusterResourceSets for this


Once the Node has been marked with the `Terminating` condition, it will be
the MachineHealthCheck controller's responsibility to ensure that the Machine
is deleted, triggering it to be drained and removed from the cluster.
Contributor

Wouldn't MHC cause a new Machine to be created? Since the idea is to replace an unhealthy machine? Is that what we want when a spot VM is terminated for lack of capacity?

Contributor Author

I think yes, we need to delete the Machine and leverage node draining here.

Contributor

Given the Machine is going to be Failed, is there any reason to keep it around, apart from the fact we know a replacement would fail?

I guess when you're running a setup like this you kind of need the cluster autoscaler to notice that there's no capacity and scale up elsewhere, or maybe have the MachineSet bring up a node in an alternate failure domain (if we go down the route of supporting that? We would probably need to recognise somehow that the spot capacity is out and retry a different failure domain).

Member

I think that for a first implementation letting the machine be recreated is reasonable. We can always do smarter things based on usage feedback once we have something in place.


The termination pod will be developed to poll the metadata service for the Node
that it is running on.
We will implement request/response handlers for each of the three cloud providers
Contributor

"each of the three cloud providers" seems a bit restrictive, maybe this can be

Suggested change
We will implement request/response handlers for each of the three cloud providers
We will implement request/response handlers for each infrastructure provider that supports non-guaranteed instances

for termination.

The code for all providers will be mostly similar; the only difference is the logic for polling the termination endpoint and processing the response.
This means that the cloud provider type can be passed as an argument, and the termination handler can be common to all providers and placed in a separate repository, for example: `kubernetes-sigs/cluster-api-termination-handler`.

#### Running the termination handler
The Termination Pod will be part of a DaemonSet, which should be deployed manually by users. It will select Nodes which are labelled as spot instances to ensure the Termination Pod only runs on instances that require termination handlers.

The spot label will be added by the cloud providers as they create instances, provided they support spot instances and the instance is a spot instance. This also requires ability of cloud providers to sync labels between machines and nodes on workload cluster.
Contributor

"This also requires ability of cloud providers to sync labels between machines and nodes on workload cluster." -> this seems like an important limitation, by this do you mean the Cluster API infra provider has to be able to sync labels between CAPI Machines and nodes, Infra Machines and nodes, or something else?

Contributor

I think somehow we need to mark nodes as spot instances (using a well known label) so that the DaemonSet only deploys to spot instances.

Is there anywhere in CAPI already where we can put a label on a Machine and have that copied to the Node once it joins the cluster? If there is, could the Infra provider add a label to that set when it creates a spot instance?

Member

Otherwise the core machine controller could infer from the InfrastructureRef whether this is a non-guaranteed instance and add the label to the node as it adds the nodeRef.

Contributor Author

I created an issue for tracking this, #3504, and a POC PR for AWS: kubernetes-sigs/cluster-api-provider-aws#1876
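
(For illustration: a minimal sketch of the label propagation being discussed, assuming the Node label mirrors the `cluster.x-k8s.io/interruptible-instance` label from the MachineHealthCheck example. The package and helper name are hypothetical, not existing CAPI code.)

```go
// Hypothetical helper: label a workload-cluster Node as interruptible so the
// termination handler DaemonSet schedules onto it.
package spot

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// interruptibleLabel mirrors the label used in the MachineHealthCheck example.
const interruptibleLabel = "cluster.x-k8s.io/interruptible-instance"

// MarkNodeInterruptible patches the named Node in the workload cluster with the
// interruptible label, using a client built for that cluster.
func MarkNodeInterruptible(ctx context.Context, workloadClient client.Client, nodeName string) error {
	node := &corev1.Node{}
	if err := workloadClient.Get(ctx, client.ObjectKey{Name: nodeName}, node); err != nil {
		return err
	}
	patch := client.MergeFrom(node.DeepCopy())
	if node.Labels == nil {
		node.Labels = map[string]string{}
	}
	node.Labels[interruptibleLabel] = ""
	return workloadClient.Patch(ctx, node, patch)
}
```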

The spot label will be added by the cloud providers as they create instances, provided they support spot instances and the instance is a spot instance. This also requires ability of cloud providers to sync labels between machines and nodes on workload cluster.

#### Termination handler security
To be able to perform the actions required by the termination handler, the pod will need to be relatively privileged.
Contributor

"the pod will need to be relatively privileged." seems a bit vague, what access does it need exactly? What are the actions required by the termination handler?

Contributor

Agreed this is a bit vague; I believe it is an introductory sentence for the rest of this section, which explains the privilege level required to run the termination handler. We could probably just drop this whole sentence.

To enable graceful termination of workloads running on non-guaranteed instances,
a termination handler pod will need to be deployed on each spot instance.

The termination pod will be developed to poll the metadata service for the Node
Contributor

I'm not 100% sure this would work for Azure. At the bottom of this proposal we wrote:

Azure uses their Scheduled Events API to notify Spot VMs that they are due to be preempted.
This is a similar service to the AWS metadata service that each machine can poll to see events for itself.
Azure only gives 30 seconds warning for nodes being preempted though.

A Daemonset solution similar to the AWS termination handlers could be implemented to provide graceful shutdown with Azure Spot VMs.
For example see this [existing solution](https://github.com/awesomenix/drainsafe).

cc @awesomenix


We are planning to integrate https://github.com/awesomenix/drainsafe to automatically cordon and drain any terminating instances for MachinePools. We can probably extend it to execute additional termination handlers. Just a reminder that you get at most 30 seconds for a Spot VM on Azure; I am not sure how much you can do in that time.

Contributor Author

For termination handlers, it should be sufficient to use the node draining logic that we already have implemented in CAPI: https://github.com/kubernetes-sigs/cluster-api/blob/master/controllers/machine_controller.go#L276

Contributor

We have a POC for AWS which marks the node with a condition as described in this proposal; the Azure one can be pretty much identical to this, just substituting the logic for how to read the response and adding the required header to the request.

https://github.com/openshift/cluster-api-provider-aws/blob/b4a3478db44ddb554883cf77a9e5f49ffd54fdf4/pkg/termination/handler.go

While I agree 30 seconds isn't very long to actually do much in the drain, I think the benefit would come from replacing nodes quicker. Without the contents of this proposal, the instance gets terminated; if an MHC is present, it removes the Machine after some time of being unhealthy, and then the MachineSet replaces the node.
With this proposal, we get an up-to-30-second warning that allows us to mark the Node, have an MHC notice this mark, delete the Machine, and then have the MachineSet controller create a new Machine. This should pretty much all happen "instantly", so we start creating the new instance before the old instance goes away, rather than a few minutes after it has already shut down.
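
(For illustration: a minimal client-go sketch of what "marking the node with a condition" could look like, assuming the handler has credentials for the workload cluster and that the condition type matches the `Terminating` type used in the MachineHealthCheck example. This is a hypothetical illustration, not the linked POC code.)

```go
// Hypothetical sketch: set a custom "Terminating" condition on the Node this
// handler is responsible for, so that a MachineHealthCheck can react to it.
package termination

import (
	"context"
	"encoding/json"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// MarkNodeTerminating patches the Node status with a Terminating=True condition.
func MarkNodeTerminating(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	now := metav1.Now()
	cond := corev1.NodeCondition{
		Type:               "Terminating",
		Status:             corev1.ConditionTrue,
		Reason:             "TerminationNoticeReceived",
		Message:            "The cloud provider has signalled that this instance will be terminated",
		LastHeartbeatTime:  now,
		LastTransitionTime: now,
	}
	patch, err := json.Marshal(map[string]interface{}{
		"status": map[string]interface{}{
			"conditions": []corev1.NodeCondition{cond},
		},
	})
	if err != nil {
		return fmt.Errorf("building patch: %w", err)
	}
	// Patch the status subresource; strategic merge patch merges conditions by Type.
	_, err = client.CoreV1().Nodes().Patch(ctx, nodeName, types.StrategicMergePatchType, patch, metav1.PatchOptions{}, "status")
	return err
}
```

A strategic merge patch on the status subresource is used so the condition is merged by its `Type` rather than replacing the whole conditions list.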

@enxebre
Member

enxebre commented Sep 9, 2020

Still need to agree on #3528 (comment) but overall this looks good to me.

@cpuguy83

There's a KEP that was discussed in sig-node to hook into systemd's ability to delay shutdown, so that the kubelet can handle shutdown gracefully: kubernetes/enhancements#2001

I do like the idea of hooking this as a separate component as done here.

@alexander-demicev
Contributor Author

alexander-demicev commented Dec 1, 2020

@cpuguy83 Thanks for sharing the link. The logic described in that proposal might not work with spot instances; I am not sure that we can delay the shutdown of a spot instance.
One more thing to add: for spot instances we will be polling a termination endpoint on the cloud provider side, which should give us more time for a graceful shutdown than the approach described in the Kubernetes proposal.
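
(For illustration: a minimal sketch of such a provider-side termination endpoint, using GCP's documented GCE metadata `preempted` flag. The function is hypothetical; a real handler would likely use the metadata server's `wait_for_change=true` long-polling option rather than the one-shot request shown here.)

```go
// Hypothetical sketch: the GCP variant of the provider-side termination poll.
// The required "Metadata-Flavor: Google" header is the kind of per-provider
// difference the proposal text mentions.
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"strings"
	"time"
)

// gcpTerminationPending reports whether the preemptible instance has been preempted.
func gcpTerminationPending(ctx context.Context) (bool, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet,
		"http://metadata.google.internal/computeMetadata/v1/instance/preempted", nil)
	if err != nil {
		return false, err
	}
	req.Header.Set("Metadata-Flavor", "Google") // required by the GCE metadata server
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Do(req)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return false, err
	}
	// The endpoint returns the literal string TRUE once preemption has started.
	return strings.TrimSpace(string(body)) == "TRUE", nil
}

func main() {
	pending, err := gcpTerminationPending(context.Background())
	fmt.Println(pending, err)
}
```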

@k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Dec 1, 2020
@alexander-demicev
Contributor Author

@CecileRobertMichon @vincepri @JoelSpeed This proposal amendment is ready for another round of reviews.

@fabriziopandini
Member

/ok-to-test

@k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Dec 10, 2020
@fabriziopandini
Member

/milestone v0.4.0

@vincepri
Member

vincepri commented Mar 22, 2021

/milestone Next

Folks, what's the status of this proposal?

@k8s-ci-robot removed this from the v0.4.0 milestone Mar 22, 2021
@alexander-demicev
Contributor Author

I think all concerns and comments were addressed; I'm happy to continue with this proposal.

@CecileRobertMichon
Contributor

@alexander-demichev are there any breaking changes in this proposal?

Contributor

@CecileRobertMichon left a comment

@alexander-demichev please update the last-updated date on the proposal

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign timothysc after the PR has been reviewed.
You can assign the PR to them by writing /assign @timothysc in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@alexander-demicev
Contributor Author

@CecileRobertMichon sure, it's done

@alexander-demicev
Contributor Author

@CecileRobertMichon There are no breaking changes.

@k8s-ci-robot
Contributor

k8s-ci-robot commented Aug 11, 2021

@alexander-demichev: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-cluster-api-test-main-mink8s 19ec895 link /test pull-cluster-api-test-main-mink8s

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@randomvariable
Member

randomvariable commented Aug 18, 2021

Couple of things:

  • Proposal history needs updating
  • In last conversations with AWS:
    • they said they were recommending the queue processor mode of the AWS Node Termination handler, and @gab-satchi implemented a bunch of stuff around consumption of EventBridge in CAPA already that can potentially be reused - this would remove the need for running a DaemonSet on the workload cluster.
    • they recommended not using RunInstances at all for spot instances, and moving wholesale to Fleet.

I would be tempted to reduce the proposal here to its essentials - defining the contract, and defer the per provider implementation to CAEPs (or preferred processes) for those providers.

We should also at least mention the existence of https://github.com/aws/aws-node-termination-handler in the alternatives section.


#### Running the termination handler
The Termination Pod will be part of a DaemonSet that can be deployed using [ClusterResourceSet](https://github.com/kubernetes-sigs/cluster-api/blob/master/docs/proposals/20200220-cluster-resource-set.md). The DaemonSet will select Nodes which are labelled as spot instances to ensure the Termination Pod only runs on instances that require termination handlers.


Why? I think this should be enabled for all nodes by default.
Even if the problem is clearly more important for spot instances, every node that is about to be terminated, for whatever reason, should attempt to properly drain.

Contributor

The termination notices that these termination handlers rely upon are only implemented on spot/preemptible instances. E.g. on AWS, if the spot instance is going away, we get a 2 minute warning via the metadata endpoint. There is no equivalent for an on-demand instance.

I appreciate that there are times when cloud providers remove instances for maintenance reasons, and there may be some way to detect this programmatically, but as far as I know they use a different mechanism and are also far less frequent. For now at least, I think they are out of scope for these termination handlers.

Contributor Author

The proposal referenced here, #3528 (comment), describes the situation where the node can be unexpectedly shut down.

@vincepri deleted the branch kubernetes-sigs:master September 22, 2021 19:09
@vincepri closed this Sep 22, 2021
@randomvariable
Member

This got auto closed because of the branch rename to main.
