
✨ Update MHC proposal with new annotation strategy #2920

Merged · 1 commit · Apr 28, 2020

Conversation


@benmoss benmoss commented Apr 15, 2020

What this PR does / why we need it:
Updates the MHC proposal with a new strategy so that MHC can support machines managed by KCP.

Related to #2836

@enxebre @vincepri @JoelSpeed

/kind proposal

@k8s-ci-robot added the cncf-cla: yes label Apr 15, 2020
@k8s-ci-robot added the size/M and kind/proposal labels Apr 15, 2020
@@ -108,6 +109,11 @@ As an operator of a Management Cluster, I want my machines to be self-healing an

### Implementation Details/Notes/Constraints

#### Machine annotation:
```go
const MachineUnhealthyAnnotation = "machine.cluster.x-k8s.io/unhealthy"
```
Member: I'm wondering if this should have a time as its value?

Member: We need to expand this section to give some guarantees. For example, if this is a timestamp, do we keep updating the annotation at every reconciliation?

Member: I think if we track timestamps, we likely need two (unhealthy since, and last health check) to ensure that we are operating against the information the way we expect to. That way we could differentiate between a persistent unhealthy state and lack of reconciliation.

Member: As an aside, to support different external remediation (outside of the owning controller), such as the request from #2846, would it make sense to allow for overriding the annotation key here?

Member: Having different timestamps, while interesting, unfortunately also means that the Machine and everything that watches it is going to be updated every time the check is performed :/

> to support different external remediation (outside of the owning controller), such as the request from #2846, would it make sense to allow for overriding the annotation key here?

I'd expect that at some point (proposal TBD) we'll allow different strategies for remediation.

Maybe to start, we should have a simple annotation with no timestamp on it?

Member: Is it possible to also have an annotation that signals when the machine is not healthy but the timeout has not yet been reached (machine under probation)?

Contributor: I'd say that until we have a use case for it being a timestamp, let's not add that extra complexity.

@beekhof (Apr 26, 2020): Also, you can't trust that the timestamp is not skewed (best case) or actively wrong.
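
For illustration, a minimal sketch of the "simple annotation with no timestamp" option discussed above; only the annotation key comes from the diff, while the stand-in machine type and markUnhealthy helper are hypothetical:

```go
package main

import "fmt"

// Annotation key proposed in the diff above.
const MachineUnhealthyAnnotation = "machine.cluster.x-k8s.io/unhealthy"

// machine is a hypothetical stand-in for the Cluster API Machine object;
// only its annotations matter for this sketch.
type machine struct {
	Annotations map[string]string
}

// markUnhealthy records unhealthiness by presence only (empty value, no
// timestamp), so repeated health checks never need to update the object.
func markUnhealthy(m *machine) {
	if m.Annotations == nil {
		m.Annotations = map[string]string{}
	}
	m.Annotations[MachineUnhealthyAnnotation] = ""
}

func main() {
	m := &machine{}
	markUnhealthy(m)
	_, unhealthy := m.Annotations[MachineUnhealthyAnnotation]
	fmt.Println("unhealthy:", unhealthy) // unhealthy: true
}
```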

@vincepri

/assign @JoelSpeed @enxebre

@vincepri

/milestone v0.3.4

@k8s-ci-robot added this to the v0.3.4 milestone Apr 16, 2020
@fabriziopandini (Member) left a comment:

LGTM from my side.
This work might be relevant for conditions as well.

@JoelSpeed (Contributor) left a comment:

Two thoughts cross my mind while reading this as it currently stands:

  • Do we have it noted down anywhere how remediation will happen for any of the controllers? Should this be noted down somewhere?
  • What does the annotation mean: are we expecting it to always be present on an unhealthy Machine? Are we expecting other controllers to remove it? Does this really matter right now / can we change this later if we find the model needs changing?


@vincepri

> Do we have it noted down anywhere how remediation will happen for any of the controllers? Should this be noted down somewhere?

Let's take a little bit more time to think about the annotation naming and key-value definition. If the default strategy is "controller owner needs to delete", we should communicate it.

> What does the annotation mean: are we expecting it to always be present on an unhealthy Machine? Are we expecting other controllers to remove it? Does this really matter right now / can we change this later if we find the model needs changing?

Related to the above, the expectation for this iteration should be that the Machine is deleted after we set the annotation; it doesn't need to be removed.

What do you all think?

@enxebre

enxebre commented Apr 20, 2020

As it looks at the moment, this update would effectively enable any "remediation" controller to do anything, with no clear expectations or contract with the core MHC. It would leave a lot of unresolved ambiguity that might result in consumer confusion and antipatterns, e.g. what happens if the remediator chooses not to delete the machine but to do something else, or what happens if two remediators race against the annotation, etc. We should not enable this programmatically until there's a clear, particular proposal and plan for it.

For the scope of this PR I'd suggest we keep it to solving the particular problem you are trying to overcome here, i.e. remediating machines owned by the KCP.

To that end, I'd suggest we enforce this programmatically in the MHC by checking whether a machine is a control plane one, and only setting the annotation in that case.
I'd suggest we update the remediation section https://github.com/kubernetes-sigs/cluster-api/blob/62b7a1d777fd19a5af1884960eb59378844f90d6/docs/proposals/20191030-machine-health-checking.md#remediation so that it remains as it is, plus add a subsection for control plane machines.

Then, if we want to generalise the annotation mechanism, I think that deserves its own fleshed-out proposal covering all the scenarios mentioned above, with clear expectations and contracts between components.

@fmuyassarov

As discussed in the last community meeting and in #2846, we are working on a proposal to allow using a provider-specific custom remediation strategy, which would be essential where deleting the Machine as the remediation operation isn't sufficient. We should be submitting the proposal in a couple of days.

@vincepri

vincepri commented Apr 20, 2020

@enxebre @JoelSpeed

> As it looks at the moment, this update would effectively enable any "remediation" controller to do anything, with no clear expectations or contract with the core MHC. It would leave a lot of unresolved ambiguity that might result in consumer confusion and antipatterns, e.g. what happens if the remediator chooses not to delete the machine but to do something else, or what happens if two remediators race against the annotation, etc. We should not enable this programmatically until there's a clear, particular proposal and plan for it.

We need to call out that the expectation is that with this strategy the owner deletes the Machine.

> For the scope of this PR I'd suggest we keep it to solving the particular problem you are trying to overcome here, i.e. remediating machines owned by the KCP.

During our discussion we agreed to rework the proposal in a way that touches any Machine with a controller owner, not just KCP-owned Machines.

/cc @ncdc
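
As a rough sketch of that "any Machine with a controller owner" condition (not code from this PR), a health checker could gate the annotation on the presence of a controller owner reference; metav1.GetControllerOf returns nil when there is none. The hasControllerOwner helper and the sample owner reference are hypothetical:

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// hasControllerOwner reports whether an object (e.g. a Machine's metadata) is
// controlled by another object, such as a MachineSet or KubeadmControlPlane.
func hasControllerOwner(obj metav1.Object) bool {
	return metav1.GetControllerOf(obj) != nil
}

func main() {
	isController := true
	meta := &metav1.ObjectMeta{
		Name: "machine-0",
		OwnerReferences: []metav1.OwnerReference{{
			APIVersion: "cluster.x-k8s.io/v1alpha3",
			Kind:       "MachineSet",
			Name:       "ms-0",
			Controller: &isController,
		}},
	}
	// Only annotate machines that some controller owns and can act on.
	fmt.Println("has controller owner:", hasControllerOwner(meta)) // true
}
```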

@vincepri

@fmuyassarov Thanks for the reminder, we won't be tackling external machine remediation in this amendment to the MHC proposal.

@benmoss

benmoss commented Apr 20, 2020

@enxebre Is it sufficiently explicit now that no controller that works with MHC is to do anything but handle deletion? We can generalize it and relax the contract once we have that subsequent proposal, but for now can't it be understood that the two implementing controllers are MachineSet and KubeadmControlPlane?

@JoelSpeed (Contributor) left a comment:

Added a few comments that might improve the clarity of the changes, but otherwise I'm happy with the changes described.

@enxebre

enxebre commented Apr 21, 2020

> We can generalize it and relax the contract once we have that subsequent proposal, but for now can't it be understood that the two implementing controllers are MachineSet and KubeadmControlPlane?

Yeah, that's exactly my point: it does not need to be understood, it's enforced by code already. You can just enable annotations for KCP now and generalise afterwards based on the upcoming proposal.

Anyway, don't take my feedback as a blocker, I'm fine with whatever path you choose to go :)

@vincepri

Thanks @enxebre, I think it's great feedback and I'll make sure to broadcast it when we get there in the external remediation proposal.

The main reason I want to avoid building something just for KCP is that it sets a precedent of having controller-specific behavior tied to another controller. In general, Cluster API should strive (whenever possible) to be agnostic and provide generic interfaces.

- The owning controller e.g machineSet controller reconciles to meet number of replicas and start the process to bring up a new machine/node.
- The owning controller observes the annotation and is responsible to remediate the machine.
- The owning controller performs any pre-deletion tasks required.
- The owning controller MUST delete the machine.
Contributor: Why must the machine be deleted?

Member: For the purpose of this effort, which is mainly to separate the responsibilities between controllers, we want to make sure the expected outcomes are spelled out.

In the future, when new proposals come out and have new strategies, we'll redesign this.
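
To make the quoted contract concrete, here is a minimal hypothetical sketch of the owning-controller side: observe the annotation, run any pre-deletion tasks, then delete. All names besides the annotation key are illustrative, not code from this PR:

```go
package main

import "fmt"

const MachineUnhealthyAnnotation = "machine.cluster.x-k8s.io/unhealthy"

// machine is a hypothetical stand-in for the Cluster API Machine object.
type machine struct {
	Name        string
	Annotations map[string]string
}

// remediate sketches what an owning controller (MachineSet or
// KubeadmControlPlane) might do on reconcile: if MHC marked the machine
// unhealthy, perform pre-deletion tasks and then delete the machine.
func remediate(m *machine, preDelete, deleteMachine func(*machine) error) error {
	if _, unhealthy := m.Annotations[MachineUnhealthyAnnotation]; !unhealthy {
		return nil // healthy machines are left alone
	}
	if err := preDelete(m); err != nil {
		return err
	}
	// The annotation is never removed; the machine itself goes away.
	return deleteMachine(m)
}

func main() {
	m := &machine{
		Name:        "machine-0",
		Annotations: map[string]string{MachineUnhealthyAnnotation: ""},
	}
	err := remediate(m,
		func(m *machine) error { fmt.Println("pre-deletion tasks for", m.Name); return nil },
		func(m *machine) error { fmt.Println("deleting", m.Name); return nil },
	)
	fmt.Println("err:", err)
}
```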

@JoelSpeed (Contributor) left a comment:

I'm happy that the changes are sufficiently descriptive now, so "owning controllers" know what they should be doing.

/lgtm

@k8s-ci-robot added the lgtm label Apr 22, 2020
@benmoss

benmoss commented Apr 22, 2020

/test pull-cluster-api-capd-e2e

@fmuyassarov (Member) left a comment:

Just a tiny nit.

@k8s-ci-robot removed the lgtm label Apr 27, 2020
@vincepri (Member) left a comment:

/approve

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: benmoss, vincepri

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label Apr 27, 2020
@vincepri

@JoelSpeed @enxebre
over to you for final lgtm

@JoelSpeed

/hold

Should we squash this down before merging? If not, cancel the hold; otherwise I'm happy with the content.

@k8s-ci-robot added the do-not-merge/hold label Apr 27, 2020
Apply suggestions from code review

Co-Authored-By: Vince Prignano <vince@vincepri.com>

More edits

Add actual implementation history
Rewrite latest alternative to represent that it was actually implemented

Apply suggestions from code review

Co-Authored-By: Jason DeTiberus <detiberusj@vmware.com>

Make deletion explicit

Apply suggestions from code review

Co-Authored-By: Joel Speed <Joel.speed@hotmail.co.uk>

Reorder downstream controller actions

Add missing word
@enxebre

enxebre commented Apr 28, 2020

I hope this unblocks KCP and enables exploration, but I really hope that eventually MachineSet does not need to know anything about an MHC annotation. Happy to leave it to @JoelSpeed to unhold as appropriate.

@JoelSpeed

> I really hope that eventually MachineSet does not need to know anything about an MHC annotation

I echo this sentiment

/hold cancel
/lgtm

@k8s-ci-robot added the lgtm label and removed the do-not-merge/hold label Apr 28, 2020
@vincepri

@enxebre @JoelSpeed The unhealthy annotation is a placeholder until we have a formalized condition workflow coming in v1alpha3 (as informational only) and with extended use in v1alpha4.

Labels
approved · cncf-cla: yes · kind/proposal · lgtm · size/M
10 participants