Add Managing Control Plane machines proposal #292

Closed
wants to merge 10 commits into from

Conversation

enxebre
Member

@enxebre enxebre commented Apr 23, 2020

This proposal outlines a solution for managing the Control Plane compute resources with a controlPlane resource that manages MachineSets and a MachineHealthCheck (MHC).

cc @abhinavdahiya, @hexfusion, @JoelSpeed, @jeremyeder @smarterclayton @derekwaynecarr @deads2k @markmc @michaelgugino

Follow up to supersede #278

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 23, 2020
@enxebre enxebre force-pushed the control-plane-machineSet branch 3 times, most recently from e00ad19 to 3b653f4 on April 27, 2020 09:50
- The Cluster etcd Operator manages certificates, reports healthiness and adds etcd members as "Master" nodes join the cluster.
- The Kubelet is tied to the OS and therefore is managed by the Machine Config Operator.
- To integrate with any Control Plane components topology, e.g. a remote, pod-based hosted Control Plane.
- Automated disaster recovery of a cluster that has lost quorum.
Contributor

I think what we do intend to support should be moved to goals, which is "simplify disaster recovery of a cluster that has lost quorum" (and then later describe what we do offer).

Member

I would like to see this in-scope as well. This is the critical scenario when we have lost etcd and the cluster is broken. We hit this with an OSD cluster last weekend, 2 masters offline and quorum lost. Guarding against this with other goals noted in this enhancement is an improvement but without the ability to recover from lost quorum there is still a huge risk carried in supporting clusters.

Member Author

@enxebre enxebre May 4, 2020

This only affects scaling and ensuring compute resources. Any operational aspect of etcd recovery should be handled by the etcd operator orthogonally.

Currently, during a regular IPI bootstrapping process, the installer uses Terraform to create a bootstrap instance and 3 master instances. Then it creates Machine resources to "adopt" the existing master instances.

Additionally:
- It will create MachineSets to adopt those machines by looking up known labels (adoption behaviour already exists in the MachineSet logic); see the sketch below.
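
As a rough illustration of that adoption step, a master MachineSet could select the existing Machines through role labels the installer already applies. This is a sketch only; the label keys assume current machine-api conventions and the infrastructure ID is a placeholder.

package main

import (
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// masterMachineSelector builds the label selector a master MachineSet could use
// to adopt the installer-created master Machines.
func masterMachineSelector(infraID string) *metav1.LabelSelector {
    return &metav1.LabelSelector{
        MatchLabels: map[string]string{
            "machine.openshift.io/cluster-api-cluster":      infraID,
            "machine.openshift.io/cluster-api-machine-role": "master",
            "machine.openshift.io/cluster-api-machine-type": "master",
        },
    }
}

func main() {
    // "mycluster-abc12" stands in for the real infrastructure ID.
    fmt.Println(masterMachineSelector("mycluster-abc12"))
}
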
Contributor

I think this should be automatic by the platform instead of the installer - the installer creates instances, machine sets are created by a more core operator that respects infrastructure.spec.managedMasters / masters. That supports customers upgrading.

Member Author

I think it’d be very complex to materialise an opinionated expected state for something like infra.spec.managedMasters in existing clusters.

There are no strong guarantees about cluster topology, or safe assumptions we can make about already existing clusters/environments, for a flag to instantiate and manage these objects. I'm concerned this would create more rough edges than it would solve.

I think we should treat this no differently than workers. I don't think we should have a semantic for managedMasters, the same way we don't have one for managedWorkers. MachineSets are building blocks for compute resources.

Before, there was a limitation in etcd that prevented master compute resources from being managed by MachineSets. Now that's fixed at the right layer and shipped on upgrades (etcd operator).

So we just want all compute resources to be backed by MachineSets out of the box. And as for existing clusters, as an admin I now know that creating MachineSets for masters is safe, so it's my recommended choice, just as for any other machine.

The installer is just defaulting the initial HA topology for all the machines in a cluster; a MachineSet is nothing more than that. It could actually supersede the current master Machine objects and the Terraform code in the installer, just like we do for workers (but we want to keep that step separate from this proposal to maintain scope).

Member Author

To mitigate these concerns and alleviate the upgrade burden on the user, I introduced a controlPlane CRD and controller managed by the Machine API Operator. Opt-in for existing clusters requires no more than instantiating this object.
This also lets us limit the underlying MachineSet capabilities to prevent an undesirably high degree of flexibility, while reserving the ability to relax that later.

See 5ef75fd

PTAL @smarterclayton
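
For discussion, a minimal sketch of what such a controlPlane CRD spec could look like; the field names and types here are assumptions for illustration, not the API in 5ef75fd.

package v1alpha1

import (
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/runtime"
)

// ControlPlaneSpec drives the managed MachineSets and the managed
// MachineHealthCheck described in this proposal.
type ControlPlaneSpec struct {
    // Replicas is the desired number of Control Plane Machines.
    Replicas *int32 `json:"replicas,omitempty"`
    // ProviderSpec is the provider-specific configuration stamped into the
    // Machines backing the Control Plane.
    ProviderSpec runtime.RawExtension `json:"providerSpec,omitempty"`
}

// ControlPlane is a singleton resource; instantiating it is the opt-in step
// for existing clusters.
type ControlPlane struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`

    Spec ControlPlaneSpec `json:"spec,omitempty"`
}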

@gregsheremeta
Contributor

cc @dgoodwin @staebler

- Scale out.
- Scale in.
- Vertical scaling forcing machine recreation.
- Simulate unhealthy nodes and remediation.
Contributor

I think it would be good to detail a reasonably complete test plan for each of these (for instance, for unhealthy node emulation alay already owns a test, describe the scenario here so we can review it).

Member Author

Done

Contributor

@smarterclayton smarterclayton left a comment

Structure is ok, not happy with the installer creating machine sets.


The Control Plane is the most critical and sensitive entity of a running cluster. Today OCP Control Plane instances are "pets" and therefore fragile. There are multiple scenarios where adjusting the compute capacity which is backing the Control Plane components might be desirable either for resizing or repairing.

Currently there is nothing that automates or eases this task. The steps for the Control Plane to be resized in any manner or to recover from a tolerable failure are completely manual. Different teams are following different "Standard Operating Procedure" documents scattered around with manual steps resulting in loss of information, confusion and extra efforts for engineers and users.
Member

For the purposes of this document, what is "a tolerable failure"?

Member Author

etcd quorum is not lost. Clarified.

enhancements/machine-api/control-plane-machinesets.md (outdated thread, resolved)

- Exhaustive unit testing.
- Exhaustive integration testing via [envTest](https://book.kubebuilder.io/reference/testing/envtest.html).
- E2e testing on the [machine API](https://github.com/openshift/cluster-api-actuator-pkg/tree/master/pkg) and origin e2e test suite:
Member

Would like to see some tests when bad configuration is supplied. I.e.

  • scale in to less than 3 masters
  • instance type too small to handle master workload

Member Author

At best we can test that scale-in is blocked on the etcd guard PDB. For any further automation that requires some form of compute auto scaling, this proposal is a necessary first step.
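
For context, a sketch of the guard check such a test would exercise; the PDB name and namespace below are assumptions used for illustration.

package etcdguard

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// scaleInAllowed reports whether removing a Control Plane machine would
// currently be permitted by the etcd quorum guard PodDisruptionBudget.
func scaleInAllowed(ctx context.Context, c kubernetes.Interface) (bool, error) {
    pdb, err := c.PolicyV1().PodDisruptionBudgets("openshift-etcd").
        Get(ctx, "etcd-quorum-guard", metav1.GetOptions{})
    if err != nil {
        return false, fmt.Errorf("reading etcd guard PDB: %w", err)
    }
    // Only allow scale in while the budget still permits a disruption.
    return pdb.Status.DisruptionsAllowed > 0, nil
}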

enhancements/machine-api/control-plane-machinesets.md (outdated thread, resolved)
enhancements/machine-api/control-plane-machinesets.md (outdated thread, resolved)
@michaelgugino
Contributor

@jeremyeder My preference is for some component to create the control plane resource for existing clusters by default. Which component TBD.

Having the user create them definitely presents lots of problems, and will be quite error prone. If the resource in question ends up being something like the cluster autoscaler (basically a singleton CRD to deploy something else that does the actual work) with minimal user-driven configuration, that might be suitable as well.

I think this enhancement is in need of some additional revision.

@enxebre
Member Author

enxebre commented Jun 25, 2020

@enxebre how are we looking on this one? Any outstanding issues before it can merge?

I'm all good merging this as it is, as provisional, then moving forward and reviewing during the next release cycle. @michaelgugino is already hands-on working on a prototype either way.

If the resource in question ends up being something like the cluster autoscaler (basically a singleton CRD to deploy something else that does the actual work) with minimal user-driven configuration, that might be suitable as well.

That's literally what this proposal is describing at the moment. I think we can iterate on more details during the next release cycle, sync on the state of the etcd operator and get a tangible feel from the prototype you are working on.

@jeremyeder I'd like to get an ack from @smarterclayton before merging though, as he requested changes in the first place.

### Goals

- To have a declarative mechanism to ensure that existing Control Plane Machines are recreated on a deletion request at any time.
- To auto repair unhealthy Control Plane Nodes.
Contributor

These are great goals

- As an SRE, I want to have consumable API primitives in place to resize Control Plane compute resources so I can develop upper-level automation tooling on top of them. E.g. automate Story 3 to support a severe growth peak in the number of worker nodes.

#### Story 5
- As an operator, I want faulty nodes to be remediated automatically. This includes having self-healing Control Plane machines.
Contributor

I assume this is covered, but want to call it out here. Faulty control plane machines must be remediated in a way that does not break quorum.

Member Author

@enxebre enxebre Jun 26, 2020

Yeah, I clarified here 5e56429#diff-dcd7423e408e8b740b1b9b14e53eeda1R201
Any deletion operation must block on the PDB and so honour the etcd-guard.
Additionally, for auto-repairing nodes, the managed MHC will keep a non-disruptive maxUnhealthy value for etcd quorum, i.e. 1 out of 3, which will limit the number of deletion operations.
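
Spelling out the arithmetic behind that value (a sketch, not text from the proposal): quorum for 3 members is 2, so at most 1 machine can be remediated at a time.

package quorum

// SafeMaxUnhealthy returns how many Control Plane machines can be remediated
// concurrently without risking etcd quorum: replicas - (replicas/2 + 1).
// For 3 replicas this yields 1, matching the managed MHC described above.
func SafeMaxUnhealthy(replicas int) int {
    return replicas - (replicas/2 + 1)
}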

@smarterclayton
Contributor

smarterclayton commented Jun 25, 2020 via email

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enxebre

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@enxebre
Member Author

enxebre commented Jun 26, 2020

One question remaining.

thanks @smarterclayton PTAL -> #292 (comment)

Comment on lines 222 to 224
// Defaults to true.
// This can be disabled for complex scenarios where manual introspection is needed.
EnableNodeAutorepair bool `json:"enableautorepair,omitempty"`
Contributor

Minor, but to be able to default this to true it should be a pointer to a bool, otherwise we won't know if the user explicitly set it to false or just omitted it. Alternatively, negate it so that default is false by calling it something like DisableNodeAutoRepair?
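
A small sketch of the pointer form described above (illustrative only, not necessarily the exact change that landed):

package sketch

// A nil pointer means the user did not set the field, so it can safely default
// to true; an explicit false is preserved.
type controlPlaneSpec struct {
    EnableNodeAutorepair *bool `json:"enableNodeAutorepair,omitempty"`
}

func autorepairEnabled(s controlPlaneSpec) bool {
    if s.EnableNodeAutorepair == nil {
        return true // default when omitted
    }
    return *s.EnableNodeAutorepair
}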

Member Author

@enxebre enxebre Jun 26, 2020

Addressed by 2aa9977

enxebre added a commit to enxebre/enhancements that referenced this pull request Jun 26, 2020
@JoelSpeed
Contributor

LGTM, I'll leave the label for @smarterclayton

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 23, 2020
@wking
Member

wking commented Oct 24, 2020

We definitely want this.

/lifecycle frozen

@openshift-ci-robot openshift-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 24, 2020
- The ControlPlane controller creates and manages a MachineSet to back each Control Plane Machine that is found at any time.
- The ControlPlane controller creates and manages a Machine Health Check resource to monitor the Control Plane Machines.

This proposal assumes that all etcd operational aspects are managed by the cluster-etcd-operator orthogonally in a safe manner while manipulating the compute resources.
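
To make those mechanics concrete, a stand-in sketch of the reconcile flow; the types and names are illustrative, not the real machine-api packages.

package main

import "fmt"

type machine struct{ name string }

// reconcileControlPlane ensures one MachineSet per Control Plane Machine it
// finds, plus a single MachineHealthCheck covering all of them.
func reconcileControlPlane(masters []machine) {
    for _, m := range masters {
        // Each MachineSet has replicas=1 and adopts the existing Machine.
        fmt.Printf("ensure MachineSet backing %s\n", m.name)
    }
    fmt.Printf("ensure MachineHealthCheck covering %d machines\n", len(masters))
}

func main() {
    reconcileControlPlane([]machine{{"master-0"}, {"master-1"}, {"master-2"}})
}
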
Member

In addition to the etcd member dance, we may need something to manipulate load balancer targets for api and api-int, at least in clusters where those were provisioned by the installer. Maybe that looks like "grow an in-cluster component to optionally manage those load balancers", because that could be more convenient than writing something specific to the control-plane-recovery case.

Contributor

The machine-api adds and removes control plane hosts from the load balancers today, at least in AWS and GCP, and probably Azure. I think there's also something similar for SLBs, not sure how they work.

On platforms where this can't be automated, users will need to add/remove load balancers. The nice thing about the machineset model is we're not going to add/remove machines without you telling us to, so you'll be able to do this when you see fit.

Member

The machine-api adds and removes control plane hosts from the load balancers today, at least in AWS and GCP, and probably Azure.

How does it discover which LBs to adjust?

Contributor

That's part of the machine object spec.
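
For reference, a sketch of that shape, modeled loosely on the AWS provider config; the field names, types and load balancer names here are placeholders, not the exact API.

package main

import "fmt"

// loadBalancerReference names a load balancer the machine controller should
// attach the instance to.
type loadBalancerReference struct {
    Name string // e.g. "<infra-id>-int" or "<infra-id>-ext"
    Type string // e.g. "classic" or "network"
}

type providerSpecSketch struct {
    LoadBalancers []loadBalancerReference
}

func main() {
    spec := providerSpecSketch{LoadBalancers: []loadBalancerReference{
        {Name: "mycluster-abc12-int", Type: "network"},
        {Name: "mycluster-abc12-ext", Type: "network"},
    }}
    fmt.Printf("%+v\n", spec)
}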

@dhellmann
Contributor

This proposal is over a year old. As part of recent efforts to clean up old pull requests, I am removing the life-cycle/frozen label to allow it to age out and be closed. If the proposal is still active, please restore the label.

/remove-lifecycle frozen

@openshift-ci openshift-ci bot removed the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Sep 21, 2021
@openshift-bot

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 19, 2021
@openshift-bot

Stale enhancement proposals rot after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Rotten proposals close after an additional 7d of inactivity.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 26, 2021
@openshift-bot

Rotten enhancement proposals close after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Reopen the proposal by commenting /reopen.
Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Exclude this proposal from closing again by commenting /lifecycle frozen.

/close

@openshift-ci openshift-ci bot closed this Nov 2, 2021
@openshift-ci
Contributor

openshift-ci bot commented Nov 2, 2021

@openshift-bot: Closed this PR.

In response to this:

Rotten enhancement proposals close after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Reopen the proposal by commenting /reopen.
Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Exclude this proposal from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
