Add Managing Control Plane machines proposal #292

Closed. 10 commits; 238 additions and 0 deletions in `enhancements/machine-api/control-plane-machinesets.md`.
---
title: Managing Control Plane machines
authors:
- enxebre
reviewers:
- hexfusion
- jeremyeder
- abhinavdahiya
- joelspeed
- smarterclayton
- derekwaynecarr
approvers:
- TBD

creation-date: 2020-04-02
last-updated: yyyy-mm-dd
status: implementable
see-also:
- https://github.com/kubernetes-sigs/cluster-api/blob/master/docs/proposals/20191017-kubeadm-based-control-plane.md
- https://github.com/openshift/enhancements/blob/master/enhancements/etcd/cluster-etcd-operator.md
- https://github.com/openshift/enhancements/blob/master/enhancements/etcd/disaster-recovery-with-ceo.md
- https://github.com/openshift/enhancements/blob/master/enhancements/kube-apiserver/auto-cert-recovery.md
- https://github.com/openshift/machine-config-operator/blob/master/docs/etcd-quorum-guard.md
- https://github.com/openshift/cluster-kube-scheduler-operator
- https://github.com/openshift/cluster-kube-controller-manager-operator
- https://github.com/openshift/cluster-openshift-controller-manager-operator
- https://github.com/openshift/cluster-kube-apiserver-operator
- https://github.com/openshift/cluster-openshift-apiserver-operator
replaces:
superseded-by:
---

# Managing Control Plane machines

## Release Signoff Checklist

- [ ] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)

## Summary

For clarity, in this document we define "Control Plane" as "the collection of stateless and stateful processes which enable a Kubernetes cluster to meet minimum operational requirements". This includes the kube-apiserver, kube-controller-manager, kube-scheduler, kubelet, and etcd.

This proposal outlines a solution for presenting the Control Plane compute as a single scalable resource (i.e. a MachineSet) per failure domain. Specifically, it:
- Proposes to let the installer create a MachineSet for each master Machine.
- Proposes to let the installer create a Machine Health Check resource to monitor the Control Plane Machines.

**Note:**
- This will make it possible to later create a new CRD that manages the Control Plane as a single scalable resource on top of MachineSets, with autoscaling and fully automated rolling upgrades if desired. That is out of the scope of this particular proposal.

- A separate proposal is created to let the installer create Machine Health Checking resources to monitor the worker Machines. This is out of the scope for this particular proposal.

## Motivation

The Control Plane is the most critical and sensitive entity of a running cluster. Today OCP Control Plane instances are "pets" and therefore fragile. There are multiple scenarios where adjusting the compute capacity backing the Control Plane components is desirable, either for resizing or for repair.

Currently nothing automates or eases this task. The steps to resize the Control Plane in any manner, or to recover from a tolerable failure (one where etcd quorum is not lost), are completely manual. Different teams follow different "Standard Operating Procedure" documents scattered around with manual steps, resulting in loss of information, confusion, and extra effort for engineers and users.

### Goals

- To have a declarative mechanism to scale Control Plane compute resources.
- To have a declarative mechanism to ensure a given number of Control Plane compute resources is available at a time.
- To support default declarative self-healing and replacement of Control Plane compute resources.

### Non-Goals / Future work

- To integrate with any existing etcd topology, e.g. external clusters. Stacked etcd with dynamic member identities, local storage, and the Cluster etcd Operator is an assumed invariant.
- To manage individual Control Plane components. Self-hosted Control Plane components managed by their own operators are an assumed invariant:
  - Rolling OS upgrades and config changes at the software layer are managed by the Machine Config Operator.
  - The etcd quorum guard, which provides a PDB to honour quorum, is managed by the Machine Config Operator.
  - Each individual Control Plane component, i.e. the Kube API Server, Scheduler, and Controller Manager, is self-hosted and managed by its own operator.
  - The Cluster etcd Operator manages certificates, reports health, and adds etcd members as "Master" nodes join the cluster.
  - The Kubelet is tied to the OS and is therefore managed by the Machine Config Operator.
- To integrate with any other Control Plane component topology, e.g. a remote, pod-based hosted Control Plane.
- Automated disaster recovery of a cluster that has lost quorum.
  > **Review comment (Contributor):** I think what we do intend to support should be moved to goals, which is "simplify disaster recovery of a cluster that has lost quorum" (and then later describe what we do offer).
  >
  > **Review comment (Member):** I would like to see this in scope as well. This is the critical scenario when we have lost etcd and the cluster is broken. We hit this with an OSD cluster last weekend: 2 masters offline and quorum lost. Guarding against this with the other goals noted in this enhancement is an improvement, but without the ability to recover from lost quorum there is still a huge risk carried in supporting clusters.
  >
  > **@enxebre (Member, Author), May 4, 2020:** This only affects scaling and ensuring compute resources. Any operational aspect of etcd recovery should be handled by the etcd operator, orthogonally.

- To manage OS upgrades. This is managed by the Machine Config Operator.
- To manage configuration changes at the software layer. This is managed by the Machine Config Operator.
- To manage the life cycle of Control Plane components.
- To automate the provisioning and decommissioning of the bootstrapping instance managed by the installer.
- To provide autoscaling support for the Control Plane.
  - This proposal is a necessary first step for Control Plane autoscaling. It focuses on settling on the primitives to abstract away the Control Plane as scalable resources. In a follow-up RFE we will discuss how and when to auto-scale the MachineSet resources as a single entity based on relevant cluster metrics, e.g. number of workers, number of etcd objects, etc.
- To support fully automated rolling upgrades for the Control Plane compute resources. Same reason as above.

## Proposal

Currently the installer chooses failure domains based on a particular provider's availability and creates a Control Plane Machine resource for each of them.

This proposal adds two steps:
- To let the installer create MachineSets for each master Machine to be adopted.
- To let the installer create a Machine Health Check resource to monitor the Control Plane Machines.

The lifecycle of the compute resources remains decoupled from and orthogonal to the lifecycle and management of the Control Plane components hosted on those compute resources. All of these components, including etcd, are expected to keep managing themselves as the cluster shrinks and expands the Control Plane compute resources.

### User Stories [optional]

#### Story 1
- As an operator running Installer Provisioned Infrastructure (IPI), I want the flexibility to run [large or small clusters](https://kubernetes.io/docs/setup/best-practices/cluster-large/#size-of-master-and-master-components), so I need the ability to resize the control plane in a declarative, automated and seamless manner.

#### Story 2
- As an operator running User Provisioned Infrastructure (UPI), I want to expose my non-Machine-API machines and offer them to the Control Plane controller, so I can resize the control plane in a declarative, automated and seamless manner.

#### Story 3
- As an operator of an OCP Dedicated Managed Platform, I want to give users the flexibility to add as many worker nodes as they want, or to enable autoscaling on worker nodes, so I need the ability to resize the control plane instances in a declarative and seamless manner to react quickly to cluster growth.

#### Story 4
- As an SRE, I want consumable API primitives in place to resize Control Plane compute resources so I can develop higher-level automation tooling on top, e.g. Control Plane auto-resizing to support a severe growth peak in the number of worker nodes.

#### Story 5
- As an operator, I want my Machine API cluster infrastructure to be self-healing, so I want faulty nodes to be remediated automatically in scenarios where it is safe to do so. This includes having self-healing Control Plane machines.

#### Story 6
- As a multi-cluster operator, I want a uniform user experience for managing the Control Plane in a declarative manner across any cloud provider, bare metal, and any flavour of the product that shares the topology assumed in this doc.

#### Story 7
- As a developer, I want to be able to deploy the smallest possible cluster to save costs, i.e. a single-instance Control Plane, and resize to more instances as I see fit.

### Implementation Details/Notes/Constraints [optional]

To satisfy the goals, motivation, and stories above, this proposal lets the installer create MachineSets to own the Machine objects for each failure domain and a MachineHealthCheck to monitor the Control Plane Machines.

#### Bootstrapping
Currently during a regular IPI bootstrapping process the installer uses Terraform to create a bootstrapping instance and 3 master instances. Then it creates Machine resources to "adopt" the existing master instances.

Additionally:
- It will create MachineSets to adopt those machines by looking up known labels (adoption behaviour already exists in the MachineSet logic); a sketch of such a MachineSet follows the label list below.
  > **Review comment (Contributor):** I think this should be automatic by the platform instead of the install: the installer creates instances, and MachineSets are created by a more core operator that respects infrastructure.spec.managedMasters / masters. That supports customers upgrading.
  >
  > **@enxebre (Member, Author):** I think it'd be very complex to materialise an opinionated expected state for something like infra.spec.managedMasters in existing clusters.
  >
  > There are no strong guarantees about cluster topology or safe assumptions we can make on already existing clusters/environments for a flag to instantiate and manage these objects. I'm concerned this would create more rough edges than it would solve.
  >
  > I think we should make no distinction from workers. I don't think we should have a semantic for managedMasters, the same way we don't have one for managedWorkers. MachineSets are the building blocks for compute resources.
  >
  > Before, there was a limitation in etcd that prevented master compute resources from being MachineSets. Now that's fixed at the right layer and shipped on upgrades (etcd operator).
  >
  > So we just want all compute resources to be MachineSets out of the box. And as for existing clusters, as an admin I now know that creating MachineSets for masters is safe, so it's my recommended choice as for any other machine.
  >
  > The installer is just defaulting the initial HA topology for all the machines in a cluster; a MachineSet is nothing more. It could actually supersede the current master Machine objects and Terraform code in the installer, just like we do for workers (but we want to keep that step separate from this proposal to maintain scope).
  >
  > **@enxebre (Member, Author):** To mitigate these concerns and alleviate the upgrade burden on the user, I introduced a controlPlane CRD and controller managed by the Machine API Operator. Opt-in for existing clusters requires no more than instantiating this object.
  > This also lets us limit the underlying MachineSet capabilities, to prevent an undesired, too-high degree of flexibility, and reserve the ability to relax it later in time.
  >
  > See 5ef75fd.
  >
  > PTAL @smarterclayton

  - `machine.openshift.io/cluster-api-machineset: <cluster_name>-<label>-<zone>-controlplane`
  - `machine.openshift.io/cluster-api-cluster: clusterID`
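
For illustration only, a hedged sketch of the kind of MachineSet the installer might create for one failure domain. The cluster name, zone, and provider-specific fields below are hypothetical and will differ per platform; only the labels come from this proposal.

```yaml
# Hypothetical sketch: a control-plane MachineSet for one failure domain (here us-east-1a).
# Exact names and providerSpec contents are platform- and cluster-specific.
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  name: mycluster-abc12-master-us-east-1a-controlplane
  namespace: openshift-machine-api
  labels:
    machine.openshift.io/cluster-api-cluster: mycluster-abc12
spec:
  replicas: 1
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: mycluster-abc12
      machine.openshift.io/cluster-api-machineset: mycluster-abc12-master-us-east-1a-controlplane
  template:
    metadata:
      labels:
        machine.openshift.io/cluster-api-cluster: mycluster-abc12
        machine.openshift.io/cluster-api-machine-role: master
        machine.openshift.io/cluster-api-machine-type: master
        machine.openshift.io/cluster-api-machineset: mycluster-abc12-master-us-east-1a-controlplane
    spec:
      providerSpec:
        value: {}  # provider-specific instance definition (instance type, subnet, image, etc.)
```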

#### Declarative horizontal scaling
- Each MachineSet can be scaled in/out by changing the replica count or via the "scale" subresource (see the sketch after this list).
- Out of band, the Cluster etcd Operator is responsible for watching the existing nodes and taking care of the etcd operational aspects.
- During scale-in operations, any machine deletion will always honour Pod Disruption Budgets (PDB). This gives the etcd quorum guard the chance to block a deletion that might be considered disruptive.
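
A hedged illustration of what a scale operation amounts to (the MachineSet name is hypothetical): whether the resource is edited directly or scaled through the scale subresource, only `spec.replicas` changes.

```yaml
# Hypothetical example: scale the us-east-1a control-plane MachineSet out from 1 to 2 replicas.
# Roughly equivalent to:
#   oc scale machineset mycluster-abc12-master-us-east-1a-controlplane --replicas=2 -n openshift-machine-api
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  name: mycluster-abc12-master-us-east-1a-controlplane
  namespace: openshift-machine-api
spec:
  replicas: 2  # previously 1; the Cluster etcd Operator grows the etcd membership as the new node joins
```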

#### Declarative vertical scaling
- With MachineSets in place this is semi-automated:
  - A particular provider property can be changed by any consumer in the MachineSet spec.
  - This won't affect existing Machines, but it will apply to upcoming Machines.
  - Machine creation with the new spec property can be forced by scaling out or by deleting existing machines, e.g. `kubectl delete machine <name>` (see the sketch after this list).
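
As an illustration, a hedged sketch of a vertical-scaling change. The providerSpec contents below (an AWS-style `instanceType` field) are an assumption for this example and differ on other platforms; only the relevant fields are shown.

```yaml
# Hypothetical fragment: bump the instance size used for newly created control-plane Machines.
# Existing Machines are unaffected until they are deleted and replaced by the MachineSet.
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  name: mycluster-abc12-master-us-east-1a-controlplane
  namespace: openshift-machine-api
spec:
  replicas: 1
  template:
    spec:
      providerSpec:
        value:
          # AWS-style example; field names vary per provider.
          instanceType: m5.2xlarge  # previously m5.xlarge
```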

#### Self healing (Machine Health Checking)
- Specific MHC details can be found [here](https://github.com/openshift/enhancements/blob/master/enhancements/machine-api/machine-health-checking.md)

- The installer will create a MachineHealthCheck resource targeting the Control Plane Machines, for example:

```yaml
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: controlPlane
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-role: master
      machine.openshift.io/cluster-api-machine-type: master
  unhealthyConditions:
  - type: "Ready"
    timeout: "300s"
    status: "False"
  - type: "Ready"
    timeout: "300s"
    status: "Unknown"
  maxUnhealthy: "34%"
```

- Any machine deletion will always honour Pod Disruption Budgets (PDB). This gives the etcd quorum guard the chance to block a deletion that it considers disruptive.

TODO: Currently multiple MHCs can target the same Machine. We should revisit a stronger mechanism to prevent undesired MHCs from targeting Control Plane Machines.

### Risks and Mitigations

During horizontal scaling operations there are multiple sensitive scenarios, such as scaling from 1 to 2 members. As soon as the etcd API is notified of the new member, the cluster loses quorum until that new member starts and joins the cluster. This, as well as any other operational aspect of etcd health, must still be handled by the [Cluster etcd Operator](https://github.com/openshift/enhancements/blob/master/enhancements/etcd/cluster-etcd-operator.md#motivation).
The contract for the Machine API is to honour Pod Disruption Budgets (PDB), including the one provided by the etcd quorum guard; a hedged sketch of such a PDB follows.
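
For reference, a sketch of the kind of PDB the etcd quorum guard relies on. The name, namespace, selector, and values below are illustrative assumptions for this document, not the shipped manifest.

```yaml
# Illustrative only: a PDB that keeps at least 2 of 3 quorum-guard pods (and hence etcd members)
# available, so a Machine drain/deletion that would break quorum is blocked.
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: etcd-quorum-guard
  namespace: openshift-machine-config-operator
spec:
  minAvailable: 2
  selector:
    matchLabels:
      name: etcd-quorum-guard
```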

## Design Details

### Test Plan

- Exhaustive unit testing.
- Exhaustive integration testing via [envTest](https://book.kubebuilder.io/reference/testing/envtest.html).
- E2e testing in the [machine API](https://github.com/openshift/cluster-api-actuator-pkg/tree/master/pkg) and origin e2e test suites. Given a running cluster:
  - Scale out.
    - Loop over the MachineSets for masters and increase replicas.
    - Wait for the new nodes to go ready.
  - Scale in.
    - Loop over the MachineSets for masters and decrease replicas.
    - Wait for the corresponding nodes to go away. Check that the existing ones remain ready.
  - Vertical scaling forcing machine recreation.
    - Loop over the MachineSets for masters.
    - Change a providerSpec property which defines the instance capacity.
    - Request all master machines to be deleted.
    - Wait for new machines to come up with the new capacity. Wait for the existing machines to go away while the cluster remains healthy.
  - Simulate unhealthy nodes and remediation.
    - Force 2 master nodes out of 3 to go unhealthy, e.g. by killing the kubelet.
    - Wait for the machines to be requested for deletion by the MHC.
    - Wait for new nodes to come up healthy.
    - Wait for the old machines to go away.

> **Review comment (Contributor):** I think it would be good to detail a reasonably complete test plan for each of these (for instance, for unhealthy node emulation alay already owns a test; describe the scenario here so we can review it).
>
> **@enxebre (Member, Author):** Done.

### Graduation Criteria

All the resources leveraged in this proposal, i.e. MachineSet and MachineHealthCheck, are already GA.
This proposal will be released in 4.N as long as:
- All the testing above is in place.
- The cluster etcd operator manages all the etcd operational aspects including scaling down.
- There is considerable internal usage and manual disruptive QE testing feedback.
- This is beta-tested pre-release for a considerable period of time, either internally or by an early-adopter user through the dev preview builds.

### Upgrade / Downgrade Strategy

New IPI clusters deployed after the targeted release will run the MachineSets and MHC deployed by the installer out of the box.

For UPI clusters and existing IPI clusters this is opt-in. We should pursue a homogeneous cluster topology. To that end we should provide the exact steps, and encourage users in any user-facing documentation, to create the MachineSets and MHC for the Control Plane Machines.


### Version Skew Strategy

## Implementation History
- "https://github.com/kubernetes-sigs/cluster-api/blob/master/docs/proposals/20191017-kubeadm-based-control-plane.md"
- "https://github.com/openshift/enhancements/blob/master/enhancements/etcd/cluster-etcd-operator.md"
- "https://github.com/openshift/machine-config-operator/blob/master/docs/etcd-quorum-guard.md"

## Drawbacks

## Alternatives

1. Higher-level CRD and controller: the Control Plane scaling capabilities could be managed by a single scalable resource built atop MachineSets, just like a Deployment uses ReplicaSets. This is explored in https://github.com/openshift/enhancements/pull/278. Instead, this proposal focuses on the first step and the primitives that will enable any further abstraction.

2. The Cluster etcd Operator could develop the capabilities to manage and scale machines. However, this would break multiple design boundaries:

   - The Cluster etcd Operator has the ability to manage etcd members at the Pod level without necessarily being tied to the lifecycle of Nodes or Machines.
   - There are UPI environments where the Machine API is not running the Master instances. Nevertheless, we want the Cluster etcd Operator to manage the etcd membership consistently there.

3. Hive could develop the capabilities to manage and scale the Control Plane Machines.

   However, this would leave multiple scenarios uncovered in which the Control Plane Machines would remain "pets", and it would not satisfy several of the stories mentioned above. Instead, this proposal provides a reasonable abstraction that guarantees a certain level of consistent behaviour, which Hive can still leverage as it sees fit.