Add Managing Control Plane machines proposal #292

Closed
wants to merge 10 commits into from

Conversation

enxebre
Member

@enxebre enxebre commented Apr 23, 2020

This proposal outlines a solution for managing the Control Plane compute resources with a controlPlane resource that manages MachineSets and a MachineHealthCheck (MHC).

cc @abhinavdahiya, @hexfusion, @JoelSpeed, @jeremyeder @smarterclayton @derekwaynecarr @deads2k @markmc @michaelgugino

Follow up to supersede #278

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 23, 2020
@enxebre enxebre force-pushed the control-plane-machineSet branch 3 times, most recently from e00ad19 to 3b653f4 on April 27, 2020 09:50
- The Cluster etcd Operator manages certificates, reports healthiness and adds etcd members as "Master" nodes join the cluster.
- The Kubelet is tied to the OS and therefore is managed by the Machine Config Operator.
- To integrate with any Control Plane components topology, e.g. a remote, pod-based hosted Control Plane.
- Automated disaster recovery of a cluster that has lost quorum.
Contributor

I think what we do intend to support should be moved to goals, which is "simplify disaster recovery of a cluster that has lost quorum" (and then later describe what we do offer).

Member

I would like to see this in-scope as well. This is the critical scenario when we have lost etcd and the cluster is broken. We hit this with an OSD cluster last weekend, 2 masters offline and quorum lost. Guarding against this with other goals noted in this enhancement is an improvement but without the ability to recover from lost quorum there is still a huge risk carried in supporting clusters.

Member Author

@enxebre enxebre May 4, 2020

This only affects scaling and ensuring compute resources. Any operational aspect of etcd recovery should be handled by the etcd operator orthogonally.

Currently, during a regular IPI bootstrapping process, the installer uses Terraform to create a bootstrap instance and 3 master instances. Then it creates Machine resources to "adopt" the existing master instances.

Additionally:
- It will create MachineSets to adopt those machines by looking up known labels (adoption behaviour already exists in the MachineSet logic); see the sketch below.
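
As a rough illustration of that adoption step, a master MachineSet could select the existing Machines through role labels the installer already applies. This is a sketch only; the label keys assume current machine-api conventions and the infrastructure ID is a placeholder.

package main

import (
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// masterMachineSelector builds the label selector a master MachineSet could use
// to adopt the installer-created master Machines.
func masterMachineSelector(infraID string) *metav1.LabelSelector {
    return &metav1.LabelSelector{
        MatchLabels: map[string]string{
            "machine.openshift.io/cluster-api-cluster":      infraID,
            "machine.openshift.io/cluster-api-machine-role": "master",
            "machine.openshift.io/cluster-api-machine-type": "master",
        },
    }
}

func main() {
    // "mycluster-abc12" stands in for the real infrastructure ID.
    fmt.Println(masterMachineSelector("mycluster-abc12"))
}
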
Contributor

I think this should be automatic by the platform instead of the installer - the installer creates instances, machine sets are created by a more core operator that respects infrastructure.spec.managedMasters / masters. That supports customers upgrading.

Member Author

I think it’d be very complex to materialise an opinionated expected state for something like infra.spec.managedMasters in existing clusters.

There are no strong guarantees about cluster topology, or safe assumptions we can make about already existing clusters/environments, for a flag to instantiate and manage these objects. I'm concerned this would create more rough edges than it would solve.

I think we should treat this no differently than workers. I don't think we should have a semantic for managedMasters, the same way we don't have one for managedWorkers. MachineSets are building blocks for compute resources.

Before, there was a limitation in etcd that prevented master compute resources from being managed by MachineSets. Now that's fixed at the right layer and shipped on upgrades (etcd operator).

So we just want all compute resources to be backed by MachineSets out of the box. And as for existing clusters, as an admin I now know that creating MachineSets for masters is safe, so it's my recommended choice, just as for any other machine.

The installer is just defaulting the initial HA topology for all the machines in a cluster; a MachineSet is nothing more than that. It could actually supersede the current master Machine objects and the Terraform code in the installer, just like we do for workers (but we want to keep that step separate from this proposal to maintain scope).

Member Author

To mitigate these concerns and alleviate the upgrade burden on the user, I introduced a controlPlane CRD and controller managed by the Machine API Operator. Opt-in for existing clusters requires no more than instantiating this object.
This also lets us limit the underlying MachineSet capabilities to prevent an undesirably high degree of flexibility, while reserving the ability to relax that later.

See 5ef75fd

PTAL @smarterclayton
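
For discussion, a minimal sketch of what such a controlPlane CRD spec could look like; the field names and types here are assumptions for illustration, not the API in 5ef75fd.

package v1alpha1

import (
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/runtime"
)

// ControlPlaneSpec drives the managed MachineSets and the managed
// MachineHealthCheck described in this proposal.
type ControlPlaneSpec struct {
    // Replicas is the desired number of Control Plane Machines.
    Replicas *int32 `json:"replicas,omitempty"`
    // ProviderSpec is the provider-specific configuration stamped into the
    // Machines backing the Control Plane.
    ProviderSpec runtime.RawExtension `json:"providerSpec,omitempty"`
}

// ControlPlane is a singleton resource; instantiating it is the opt-in step
// for existing clusters.
type ControlPlane struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`

    Spec ControlPlaneSpec `json:"spec,omitempty"`
}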

@gregsheremeta
Contributor

cc @dgoodwin @staebler

- Scale out.
- Scale in.
- Vertical scaling forcing machine recreation.
- Simulate unhealthy nodes and remediation.
Contributor

I think it would be good to detail a reasonably complete test plan for each of these (for instance, for unhealthy node emulation alay already owns a test, describe the scenario here so we can review it).

Member Author

Done

Contributor

@smarterclayton smarterclayton left a comment

Structure is ok, not happy with the installer creating machine sets.


The Control Plane is the most critical and sensitive entity of a running cluster. Today OCP Control Plane instances are "pets" and therefore fragile. There are multiple scenarios where adjusting the compute capacity which is backing the Control Plane components might be desirable either for resizing or repairing.

Currently there is nothing that automates or eases this task. The steps for the Control Plane to be resized in any manner or to recover from a tolerable failure are completely manual. Different teams are following different "Standard Operating Procedure" documents scattered around with manual steps resulting in loss of information, confusion and extra efforts for engineers and users.
Member

For the purposes of this document, what is "a tolerable failure"?

Member Author

etcd quorum is not lost. Clarified.

enhancements/machine-api/control-plane-machinesets.md (outdated thread, resolved)

- Exhaustive unit testing.
- Exhaustive integration testing via [envTest](https://book.kubebuilder.io/reference/testing/envtest.html).
- E2e testing on the [machine API](https://github.com/openshift/cluster-api-actuator-pkg/tree/master/pkg) and origin e2e test suite:
Member

Would like to see some tests when bad configuration is supplied. I.e.

  • scale in to less than 3 masters
  • instance type too small to handle master workload

Member Author

At best we can test that scale-in is blocked on the etcd guard PDB. For any further automation that requires some form of compute auto scaling, this proposal is a necessary first step.
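
For context, a sketch of the guard check such a test would exercise; the PDB name and namespace below are assumptions used for illustration.

package etcdguard

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// scaleInAllowed reports whether removing a Control Plane machine would
// currently be permitted by the etcd quorum guard PodDisruptionBudget.
func scaleInAllowed(ctx context.Context, c kubernetes.Interface) (bool, error) {
    pdb, err := c.PolicyV1().PodDisruptionBudgets("openshift-etcd").
        Get(ctx, "etcd-quorum-guard", metav1.GetOptions{})
    if err != nil {
        return false, fmt.Errorf("reading etcd guard PDB: %w", err)
    }
    // Only allow scale in while the budget still permits a disruption.
    return pdb.Status.DisruptionsAllowed > 0, nil
}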

enhancements/machine-api/control-plane-machinesets.md (outdated thread, resolved)
enhancements/machine-api/control-plane-machinesets.md (outdated thread, resolved)
@michaelgugino
Contributor

@jeremyeder My preference is for some component to create the control plane resource for existing clusters by default. Which component TBD.

Having the user create them definitely presents lots of problems, and will be quite error prone. If the resource in question ends up being something like the cluster autoscaler (basically a singleton CRD to deploy something else that does the actual work) with minimal user-driven configuration, that might be suitable as well.

I think this enhancement is in need of some additional revision.

@enxebre
Member Author

enxebre commented Jun 25, 2020

@enxebre how are we looking on this one? Any outstanding issues before it can merge?

I'm all good merging this as it is, as provisional, then moving forward and reviewing during the next release cycle. @michaelgugino is already hands-on working on a prototype either way.

If the resource in question ends up being something like the cluster autoscaler (basically a singleton CRD to deploy something else that does the actual work) with minimal user-driven configuration, that might be suitable as well.

That's literally what this proposal is describing at the moment. I think we can iterate on more details during the next release cycle, sync on the state of the etcd operator and get a tangible feel from the prototype you are working on.

@jeremyeder I'd like to get an ack from @smarterclayton before merging though, as he requested changes in the first place.

### Goals

- To have a declarative mechanism to ensure that existing Control Plane Machines are recreated on a deletion request at any time.
- To auto repair unhealthy Control Plane Nodes.
Contributor

These are great goals

- As an SRE, I want to have consumable API primitives in place to resize Control Plane compute resources so I can develop upper-level automation tooling on top of them. E.g. automate Story 3 to support a severe growth peak in the number of worker nodes.

#### Story 5
- As an operator, I want faulty nodes to be remediated automatically. This includes having self-healing Control Plane machines.
Contributor

I assume this is covered, but want to call it out here. Faulty control plane machines must be remediated in a way that does not break quorum.

Member Author

@enxebre enxebre Jun 26, 2020

Yeah, I clarified here 5e56429#diff-dcd7423e408e8b740b1b9b14e53eeda1R201
Any deletion operation must block on the PDB and so honour the etcd-guard.
Additionally, for auto-repairing nodes, the managed MHC will keep a non-disruptive maxUnhealthy value for etcd quorum, i.e. 1 out of 3, which will limit the number of deletion operations.
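
Spelling out the arithmetic behind that value (a sketch, not text from the proposal): quorum for 3 members is 2, so at most 1 machine can be remediated at a time.

package quorum

// SafeMaxUnhealthy returns how many Control Plane machines can be remediated
// concurrently without risking etcd quorum: replicas - (replicas/2 + 1).
// For 3 replicas this yields 1, matching the managed MHC described above.
func SafeMaxUnhealthy(replicas int) int {
    return replicas - (replicas/2 + 1)
}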

@smarterclayton
Contributor

smarterclayton commented Jun 25, 2020 via email

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enxebre

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@enxebre
Member Author

enxebre commented Jun 26, 2020

One question remaining.

thanks @smarterclayton PTAL -> #292 (comment)

Comment on lines 222 to 224
// Defaults to true.
// This can be disabled for complex scenarios where manual introspection is needed.
EnableNodeAutorepair bool `json:"enableautorepair,omitempty"`
Contributor

Minor, but to be able to default this to true it should be a pointer to a bool, otherwise we won't know if the user explicitly set it to false or just omitted it. Alternatively, negate it so that default is false by calling it something like DisableNodeAutoRepair?
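
A small sketch of the pointer form described above (illustrative only, not necessarily the exact change that landed):

package sketch

// A nil pointer means the user did not set the field, so it can safely default
// to true; an explicit false is preserved.
type controlPlaneSpec struct {
    EnableNodeAutorepair *bool `json:"enableNodeAutorepair,omitempty"`
}

func autorepairEnabled(s controlPlaneSpec) bool {
    if s.EnableNodeAutorepair == nil {
        return true // default when omitted
    }
    return *s.EnableNodeAutorepair
}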

Member Author

@enxebre enxebre Jun 26, 2020

Addressed by 2aa9977

enxebre added a commit to enxebre/enhancements that referenced this pull request Jun 26, 2020
@JoelSpeed
Contributor

LGTM, I'll leave the label for @smarterclayton

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 23, 2020
@wking
Member

wking commented Oct 24, 2020

We definitely want this.

/lifecycle frozen

@openshift-ci-robot openshift-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 24, 2020
- The ControlPlane controller creates and manages a MachineSet to back each Control Plane Machine that is found at any time.
- The ControlPlane controller creates and manages a Machine Health Check resource to monitor the Control Plane Machines.

This proposal assumes that all etcd operational aspects are managed by the cluster-etcd-operator orthogonally in a safe manner while manipulating the compute resources.
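
To make those mechanics concrete, a stand-in sketch of the reconcile flow; the types and names are illustrative, not the real machine-api packages.

package main

import "fmt"

type machine struct{ name string }

// reconcileControlPlane ensures one MachineSet per Control Plane Machine it
// finds, plus a single MachineHealthCheck covering all of them.
func reconcileControlPlane(masters []machine) {
    for _, m := range masters {
        // Each MachineSet has replicas=1 and adopts the existing Machine.
        fmt.Printf("ensure MachineSet backing %s\n", m.name)
    }
    fmt.Printf("ensure MachineHealthCheck covering %d machines\n", len(masters))
}

func main() {
    reconcileControlPlane([]machine{{"master-0"}, {"master-1"}, {"master-2"}})
}
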
Member

In addition to the etcd member dance, we may need something to manipulate load balancer targets for api and api-int, at least in clusters where those were provisioned by the installer. Maybe that looks like "grow an in-cluster component to optionally manage those load balancers", because that could be more convenient than writing something specific to the control-plane-recovery case.

Contributor

The machine-api adds and removes control plane hosts from the load balancers today, at least in AWS and GCP, and probably Azure. I think there's also something similar for SLBs, not sure how they work.

On platforms where this can't be automated, users will need to add/remove load balancers. The nice thing about the machineset model is we're not going to add/remove machines without you telling us to, so you'll be able to do this when you see fit.

Member

The machine-api adds and removes control plane hosts from the load balancers today, at least in AWS and GCP, and probably Azure.

How does it discover which LBs to adjust?

Contributor

That's part of the machine object spec.
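
For reference, a sketch of that shape, modeled loosely on the AWS provider config; the field names, types and load balancer names here are placeholders, not the exact API.

package main

import "fmt"

// loadBalancerReference names a load balancer the machine controller should
// attach the instance to.
type loadBalancerReference struct {
    Name string // e.g. "<infra-id>-int" or "<infra-id>-ext"
    Type string // e.g. "classic" or "network"
}

type providerSpecSketch struct {
    LoadBalancers []loadBalancerReference
}

func main() {
    spec := providerSpecSketch{LoadBalancers: []loadBalancerReference{
        {Name: "mycluster-abc12-int", Type: "network"},
        {Name: "mycluster-abc12-ext", Type: "network"},
    }}
    fmt.Printf("%+v\n", spec)
}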

@dhellmann
Contributor

This proposal is over a year old. As part of recent efforts to clean up old pull requests, I am removing the life-cycle/frozen label to allow it to age out and be closed. If the proposal is still active, please restore the label.

/remove-lifecycle frozen

@openshift-ci openshift-ci bot removed the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Sep 21, 2021
@openshift-bot

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 19, 2021
@openshift-bot

Stale enhancement proposals rot after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Rotten proposals close after an additional 7d of inactivity.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 26, 2021
@openshift-bot

Rotten enhancement proposals close after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Reopen the proposal by commenting /reopen.
Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Exclude this proposal from closing again by commenting /lifecycle frozen.

/close

@openshift-ci openshift-ci bot closed this Nov 2, 2021
@openshift-ci
Contributor

openshift-ci bot commented Nov 2, 2021

@openshift-bot: Closed this PR.

In response to this:

Rotten enhancement proposals close after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Reopen the proposal by commenting /reopen.
Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Exclude this proposal from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
