
Add Declarative Single Entity Control Plane proposal #278

Closed
wants to merge 23 commits

Conversation

enxebre
Member

@enxebre enxebre commented Apr 3, 2020

This proposal outlines a solution for declaratively managing, as a single entity, the compute resources that host the OCP Control Plane components.

cc @abhinavdahiya for Installer bootstrapping, @hexfusion for Etcd, @dgoodwin for Hive, @JoelSpeed for machine API, @jeremyeder @smarterclayton @derekwaynecarr @deads2k

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enxebre

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Apr 3, 2020
@enxebre
Member Author

enxebre commented Apr 3, 2020

Not sure why this automatically gets the approved label; holding to make sure automation does not merge it by accident.
/hold

@openshift-ci-robot added the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command.) on Apr 3, 2020
@enxebre force-pushed the control-plane branch 4 times, most recently from 90cfeff to 760416d on April 3, 2020 at 11:11
#### Declarative horizontal scaling
##### Scale out
1 - The controller always reconciles towards expected number of replicas. This must be an odd number.
2 - Fetch all existing control plane Machine resources by ownerRef. Adopt any other machine having a targeted label.
Contributor

Just a quick note on the owner references, will we block deletion of the ControlPlane CRD entirely? I wouldn't want kube garbage collecting these machines in any scenario.

As an alternative would just relying on labels be sufficient?

Contributor

Do we currently block the CRD for Machines from being deleted? Deleting this would have a similar effect right? If this was already considered there, there may be a mechanism that could be reused

@jim-minter

@enxebre thanks for this! What about onboarding existing clusters? Say this lands in 4.N. Is there an adoption path from a cluster of version 4.(N-1)?

### Goals

- To have a declarative mechanism to manage the Control Plane as a single entity.
- To support declarative safe horizontal scaling towards an odd number replicas for the control plane compute resources.
Contributor

Suggested change
- To support declarative safe horizontal scaling towards an odd number replicas for the control plane compute resources.
- To support declarative safe horizontal scaling towards an odd number of replicas for the control plane compute resources.

Contributor

Having an even number of replicas is not inherently bad.
You don't achieve any additional failure tolerance by adding the last node, but maybe there is a performance benefit.
So perhaps it is not beneficial to enforce in code.

Contributor

Is it fair to say etcd docs argue strongly that an even number of replicas is inherently bad?

Although adding a node to an odd-sized cluster appears better since there are more machines, the fault tolerance is worse since exactly the same number of nodes may fail without losing quorum but there are more nodes that can fail.

during a network partition, an odd number of members guarantees that there will always be a majority partition that can continue to operate and be the source of truth when the partition ends

Contributor

@beekhof beekhof Apr 7, 2020

I'm sure they do; if one knows nothing about a given setup, "odd" is the most reliable advice to give.

However, it is possible to construct scenarios where 2N nodes are less vulnerable to outages than 2N + 1, by varying how the machines are organised/connected (i.e. the number of racks).

So advise an odd number of nodes: absolutely
Enforce an odd number of nodes: my vote is no
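For reference on the fault-tolerance trade-off being debated in this thread, here is the usual quorum arithmetic for etcd voting members as a small, self-contained illustration (just the math, not anything from the proposal):

```go
package main

import "fmt"

func main() {
	// For n voting members: quorum = floor(n/2) + 1, tolerated failures = n - quorum.
	for _, n := range []int{3, 4, 5} {
		quorum := n/2 + 1
		fmt.Printf("members=%d quorum=%d tolerated failures=%d\n", n, quorum, n-quorum)
	}
	// Prints 3 -> 1, 4 -> 1, 5 -> 2: going from 3 to 4 members adds no extra
	// failure tolerance, which is the etcd docs' argument; the counterpoint
	// above is that rack/network layout can still change the practical picture.
}
```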


#### Declarative horizontal scaling
##### Scale out
1 - The controller always reconciles towards expected number of replicas. This must be an odd number.
Contributor

Can the odd number be enforced somehow?

Contributor

I would argue that it shouldn't be. Odd is just an availability-driven convention, but there may be other reasons (capacity) to add an extra node (making the total even).

This is effectively a rolling upgrade with maxUnavailable 0 and maxSurge 1.

#### Self healing (MHC + reconciling)
1 - The Controller should always reconcile by removing Etcd members for voluntary Machine disruptions, i.e machine deletion.
Contributor

MHC will delete the Machine, causing draining to happen; does this affect this at all?

Contributor

I think etcd is a static pod and therefore immune to drains

Contributor

I was not very clear about what I was thinking here. I was thinking more along the lines of: do we need to remove the etcd member from the etcd cluster somehow when doing this, or is there something else that will notice the member has gone and remove it from the cluster?

Status ControlPlaneStatus `json:"status,omitempty"`
}

type ControlPlaneSpec struct {
Contributor

Should this use a MachineTemplateSpec in the same way a MachineSet does so that they look similar + get objectmeta label copying?

Member Author

Why would we expose anything more than we really need in the API? Couldn't anything other than the providerSpec be generated by the controller?

Contributor

My thought was that users may want to add labels to their control-plane machines; that's all we use the objectmeta for on the MachineTemplateSpec within MachineSets, right?
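To make the alternative being discussed concrete, here is a rough sketch of the MachineSet-style shape; reusing the machine API's MachineTemplateSpec type and these exact field names are assumptions for illustration, not part of the proposal:

```go
// ControlPlaneSpec reshaped to embed a machine template the way MachineSetSpec
// does, so labels set on Template.ObjectMeta get copied onto each generated
// control plane Machine.
type ControlPlaneSpec struct {
	// Replicas is the number of desired control plane machines.
	Replicas *int32 `json:"replicas,omitempty"`

	// Template describes the Machines that will be created.
	Template MachineTemplateSpec `json:"template"`
}
```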

### Version Skew Strategy

## Implementation History
- "https://github.com/kubernetes-sigs/cluster-api/blob/master/docs/proposals/20191017-kubeadm-based-control-plane.md"
Contributor

How similar is this to upstream? Given we want to eventually converge somewhat with upstream, does this fall in line with the assumptions for a Control Plane upstream?

Member Author

This proposal is fairly aligned with the CAPI kubeadm control plane one. The main difference is that here we don't need to manage the lifecycle of etcd via kubeadm; we just want to check that it's healthy as a gate for proceeding with scale operations.
See also kubernetes-sigs/cluster-api#2836

## Alternatives

- Same approach but different details:
The Control Plane controller scaling logic could be built atop machineSets just like a deployment uses replica sets. The scenarios to be supported in this proposal are very specific, and the flexibility that managing machineSets would provide is not needed. The Control Plane is critical and we want to favour control over flexibility here, therefore this proposes ad-hoc controller logic to avoid unnecessary layers of complexity.
Contributor

line break every 120 or every sentence (whichever is shorter) to make commenting easier.

Can you expand on why we would not re-use this existing metaphor with some admission webhook validators to restrict how far the adjustments can go? This seems to be the "standard" API that exists for scaling up and down, and I would have expected us to double down on the existing metaphor instead of creating something new and special.

Contributor

I would prefer we layer this on top of machinesets as it allows:

  1. the installer to define the failure domains
  2. no new API for the installer to configure the placement
  3. the new control-plane controller to become almost similar to the autoscaler, with added niceness

Contributor

Aside from auto-scaling - which it seems isn't the goal of the proposal as it stands, although it makes sense to me to add it - what would be the main function of this controller if it was built on machinesets?

Say for scale-up, would this come down to choose-which-control-plane-machineset-to-scale-up and an (all Etcd members are healthy AND all owned machines have a backed ready node) safety wrapper?

Contributor

Say for scale-up, would this come down to choose-which-control-plane-machineset-to-scale-up and an (all Etcd members are healthy AND all owned machines have a backed ready node) safety wrapper?

For this scale-up case, could we add additional extra layers of safety around cluster-etcd-operator's decision to add a new etcd member, such that it's safe to add new control plane compute capacity and have cluster-etcd-operator worry about the etcd operational aspects?

Member Author

@enxebre enxebre Apr 8, 2020

If we could drop that safety check and assume scaling compute in/out is always completely safe, this controller would come down to:

  • Ensuring an even spread of compute resources across failure domains / machineSets.
  • Rolling upgrades for vertical resizing.
  • Managing the underlying MHC for safety, with the maxUnhealthy value based on replicas.

I'd love that, but I'm not sure that would ever be feasible for all scenarios. Say, e.g., for scaling in when 1 etcd member out of 3 is unhealthy, a "cattle" scale down can easily break quorum. I tried to elaborate on the etcd scenarios that concern me in the alternatives and proposal sections.

Member Author

cc @hexfusion for feedback on the etcd scenarios and how to possibly decouple more compute from etcd healthiness

Member Author

If we can make sure any scenario of concern, as in https://github.com/openshift/enhancements/pull/278/files#diff-dcd7423e408e8b740b1b9b14e53eeda1R293-R316, would be handled gracefully by the etcd operator, then we can drop any etcd "pet" logic from this proposal and treat this as regular compute scaling operations.

Comment on lines 117 to 118
7 - Choose a failure domain.
8 - Create new Machine object with a templated spec. Go to 1.
Contributor

1- the doc is missing how this is going to be calculated.

2- I think we should layer this on top of machinesets instead of machine objects, because each machineset object already defines a failure domain.

Contributor

I like the idea of using machinesets, but are they compatible with UPI?

Member Author

like the idea of using machinesets, but are they compatible with UPI?

Any machine API resource is tied to the API and controllers being operational, regardless of how you bootstrap the product (UPI/IPI).

Member Author

@enxebre enxebre Apr 8, 2020

I think we should layer this on top of machinesets instead of machine objects, because each machineset object already defines a failure domain.

Regardless of whether we layer on machines or machineSets, we need something on top to decide which failure domain, out of a given list, to pick for the machine/machineSet, so we are able to scale out evenly. Nothing does that for you today unless we assume the machineSets would always be pre-created by something external to the controller, i.e. the installer.

Member Author

@abhinavdahiya let me think about this a bit.

Contributor

Nothing does that for you today unless we assume the machineSets would be always pre-created by something external to the controller i.e the installer.

the installer today creates workers and masters across failure domains..

for workers we have a machineset in each failure domain, and we set replicas so that they are evenly distributed.

expecting that each machineset defines a failure domain is probably a good idea imo...
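To make the "scale out evenly" requirement concrete, here is a minimal sketch of what the layer on top could do, assuming it is handed the list of failure domains (or per-domain machineSets) and the domain of each existing control plane machine; the helper name is hypothetical, not from the proposal:

```go
package controlplane

// leastPopulatedDomain returns the failure domain with the fewest existing
// control plane machines; ties resolve to the first domain in the list.
// domains must be non-empty.
func leastPopulatedDomain(domains []string, machineDomains []string) string {
	counts := make(map[string]int, len(domains))
	for _, d := range machineDomains {
		counts[d]++
	}
	best := domains[0]
	for _, d := range domains[1:] {
		if counts[d] < counts[best] {
			best = d
		}
	}
	return best
}
```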

Comment on lines +164 to +166
// Number of desired machines. Defaults to 1.
// Only odd numbers are permitted.
Replicas *int32 `json:"replicas,omitempty"`
Contributor

How and where will the installer define the failure domains?

EnableAutorepair bool `json:"enableautorepair,omitempty"`

// ProviderSpec details Provider-specific configuration to use during node creation.
ProviderSpec ProviderSpec `json:"providerSpec"`
Contributor

How is the installer supposed to define things like placement (availability zone, subnet, etc.), since these are defined per Machine object today?

Contributor

+1.

What is providerSpec in this context?

My thoughts were: rather than trying to dynamically build Machines from templates and several data sources that differ per provider, the installer can populate a list of templates for each supported AZ at creation time. Basically a machineset, but the controller chooses 1 of N templates rather than just the one.
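One hypothetical shape for that idea, purely as a sketch (type and field names are illustrative, not from the proposal): the installer fills in one provider template per failure domain at creation time and the controller chooses among them.

```go
// FailureDomainTemplate pairs a failure domain (e.g. an availability zone)
// with the provider-specific configuration the installer generated for it.
type FailureDomainTemplate struct {
	// Name of the failure domain.
	Name string `json:"name"`

	// ProviderSpec carries the per-domain placement details (zone, subnet,
	// instance profile, ...) to use when creating a Machine there.
	ProviderSpec ProviderSpec `json:"providerSpec"`
}

// In ControlPlaneSpec, the single ProviderSpec field would then become:
//   FailureDomains []FailureDomainTemplate `json:"failureDomains"`
// and the controller would pick the least populated entry for each new Machine.
```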

Comment on lines +86 to +88

#### Story 2
- As an operator running User Provisioned Infrastructure (UPI), I want to expose my non-machine API machines and offer them to the Control Plane controller so I can have the ability to resize the control plane in a declarative, automated and seamless manner.
Contributor

the doc doesn't contain information on how this will be achieved.

Comment on lines 259 to 261
During the cluster upgrade for the targeted release the Machine API Operator (MAO) will let the CVO instantiate the new CRD `controlPlane` and it will run the backing controller, making this functionality opt-in for existing clusters. The user can create an instance of the new CRD if they choose to do so.

New IPI clusters deployed after the targeted release will run the `controlPlane` instance deployed by the installer out of the box.
Contributor

So I think this is going to become difficult to maintain as documentation, because every document would have to say:

## if you installed the cluster before 4.x

## if you installed the cluster on or after 4.x

We should at least make/provide a migration step as part of some binary to perform this migration so that most users can move to this.
Or I think we should try the migration for ALL platforms and warn users to perform manual actions to correctly set up as part of the upgrade.

The current install-only approach will be difficult in the long run.

I'm open to adding this migration to the installer binary if that suits best.

### Goals

- To have a declarative mechanism to manage the Control Plane as a single entity.
- To support declarative safe horizontal scaling towards an odd number replicas for the control plane compute resources.
Contributor

Having an even number of replicas is not inherently bad.
You don't achieve any additional failure tolerance by adding the last node, but maybe there is a performance benefit.
So perhaps it is not beneficial to enforce in code.


#### Declarative horizontal scaling
##### Scale out
1 - The controller always reconciles towards expected number of replicas. This must be an odd number.
Contributor

I would argue that it shouldn't be. Odd is just an availability-driven convention, but there may be other reasons (capacity) to add an extra node (making the total even).

3 - Compare with expected replicas number. If expected is higher than current then:
4 - Check all owned machines have a backed ready node.
5 - Check all Etcd members for all owned machines are healthy via Cluster Etcd Operator status signalling.
6 - If (NOT all Etcd members are healthy OR NOT all owned machines have a backed ready node) then controller short circuits here, log, update status and requeue. Else:
Contributor

With the assumption that MHC will eventually fix that situation?
Might be worth calling out
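As a minimal sketch of the short-circuit described in steps 4-6 (not the proposal's code; it assumes the caller has already resolved, per owned Machine, whether its Node is Ready and whether the cluster-etcd-operator reports its etcd member healthy):

```go
package controlplane

import "time"

type memberState struct {
	NodeReady   bool // step 4: the owned Machine has a backing Ready node
	EtcdHealthy bool // step 5: cluster-etcd-operator reports the member healthy
}

// scaleOutAllowed captures step 6: proceed only when every owned machine has a
// Ready node and a healthy etcd member; otherwise short circuit and requeue,
// relying on MHC or manual intervention to eventually fix the bad member.
func scaleOutAllowed(members []memberState) (allowed bool, requeueAfter time.Duration) {
	for _, m := range members {
		if !m.NodeReady || !m.EtcdHealthy {
			return false, 30 * time.Second
		}
	}
	return true, 0
}
```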

Comment on lines 117 to 118
7 - Choose a failure domain.
8 - Create new Machine object with a templated spec. Go to 1.
Contributor

I like the idea of using machinesets, but are they compatible with UPI?

This is effectively a rolling upgrade with maxUnavailable 0 and maxSurge 1.

#### Self healing (MHC + reconciling)
1 - The Controller should always reconcile by removing Etcd members for voluntary Machine disruptions, i.e machine deletion.
Contributor

I think etcd is a static pod and therefore immune to drains

2 - Fetch all existing control plane Machine resources by ownerRef. Adopt any other machine having a targeted label.
3 - Compare with expected replicas number. If expected is higher than current then:
4 - Check all owned machines have a backed ready node.
5 - Check all Etcd members for all owned machines are healthy via Cluster Etcd Operator status signalling.
Contributor

It would be worth calling out the scenario when etcd has lost quorum and all API calls start failing.

Comment on lines +240 to +241
- Scale from 3 to 5.
- Scale from 5 to 3.
Contributor

Are we assuming 1 to 3 is covered by existing installation tests?


For clarity in this doc we set the definition for "Control Plane" as "The collection of stateless and stateful processes which enable a Kubernetes cluster to meet minimum operational requirements". This includes: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet and Etcd.

This proposal outlines a solution for declaratively managing, as a single entity, the compute resources that host the OCP Control Plane components. It introduces scaling and self-healing capabilities for these compute resources while honouring inviolable Etcd expectations and without disrupting the lifecycle of Control Plane components.
Contributor

A little more specifics on the proposal here would avoid "burying the lede", e.g. "proposes a ControlPlane CR and controller whose primary function is to manage the number of control plane machines".


Would that CR allow alerts and remediation as well? (Assuming no for the most critical cases, which imply at best we have a read-only etcd.)

## Alternatives

- Same approach but different details:
The Control Plane controller scaling logic could be built atop machineSets just like a deployment uses replica sets. The scenarios to be supported in this proposal are very specific, and the flexibility that managing machineSets would provide is not needed. The Control Plane is critical and we want to favour control over flexibility here, therefore this proposes ad-hoc controller logic to avoid unnecessary layers of complexity.
Contributor

Aside from auto-scaling - which it seems isn't the goal of the proposal as it stands, although it makes sense to me to add it - what would be the main function of this controller if it was built on machinesets?

Say for scale-up, would this come down to choose-which-control-plane-machineset-to-scale-up and an (all Etcd members are healthy AND all owned machines have a backed ready node) safety wrapper?

@enxebre force-pushed the control-plane branch 2 times, most recently from 0aca4f1 to e8b7321 on April 8, 2020 at 12:26
// This will autorepair faulty machine/nodes in scenarios where quorum would not be violated.
// e.g 1 unhealthy out of 3.
// Defaults to true.
EnableAutorepair bool `json:"enableautorepair,omitempty"`
Contributor

Why would anyone ever set this to false?

Member Author

Good point. I'm thinking of unpredictable environments/scenarios where you want manual troubleshooting and therefore opt-out might be useful.

Contributor

Maybe the flag should be DisableAutorepair then, so the default of false gives the desired behavior?
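A sketch of that suggestion (the field name follows this thread, not the proposal's current API), so the Go zero value matches the desired default of auto repair staying on:

```go
// ControlPlaneSpec excerpt with the flag inverted, so the Go zero value
// (false) keeps auto repair enabled by default.
type ControlPlaneSpec struct {
	// DisableAutorepair turns off automatic remediation of faulty machines/nodes
	// in scenarios where quorum would not be violated (e.g. 1 unhealthy out of 3).
	// Defaults to false, i.e. auto repair is enabled.
	DisableAutorepair bool `json:"disableAutorepair,omitempty"`

	// ...remaining fields unchanged...
}
```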


- With only machineSet + MHC:
- New machines start to come up.
- As soon as the etcd API is notified of a new member( total 4, healthy 2) the cluster
Contributor

I don't think anyone other than the etcd-operator should be doing this decision making. If the etcd-operator thinks adding a new member would cause failure, it shouldn't add those members.

This new controller should be doing this decision making imo.

- New machines start to come up.
- As soon as the etcd API is notified of a new member( total 4, healthy 2) the cluster
loses quorum until that new member starts and joins the cluster.
- There's no automation mechanism for Machines to spread evenly across failure domains.
Contributor

Why is the control plane the only one with this problem?
Currently we have one machineset for each failure domain for compute in the cluster; autoscaling compute should also benefit from taking failure domains into account.

Member Author

@enxebre enxebre Apr 13, 2020

For regular nodes, the autoscaler has the ability to ensure an even spread via the balance-similar-node-groups flag.

If we ever put something on top of machineSets, or have a different complementary scalable resource abstraction, it could include the ability to choose failure domains dynamically for manual scaling operations as a feature.

Since we are figuring out a scalable resource for the control plane here, we can try to have functionality similar to balance-similar-node-groups.

Comment on lines +307 to +309
- Machines start to get deleted.
- etcd peer membership is not removed.
- etcd guard blocks on drain before losing quorum. etcd remains degraded.
Contributor

Since machine-api always drains, and the etcd operator is the one that decides the disruption.
Here:
The PDB is 3 out of 5.
One member is already down.

machine-api is asked to go down from 5 -> 3.
One machine can actually go down.
The second machine cannot, because the PDB makes the drain fail.
So etcd is not degraded, because it still has 3 healthy etcd members?
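To spell out the PDB arithmetic in this scenario (purely illustrative; the real quorum guard uses a PodDisruptionBudget with minAvailable):

```go
package main

import "fmt"

func main() {
	// 5 etcd members, quorum guard PDB with minAvailable 3, one member already down.
	healthy, minAvailable := 4, 3
	disruptionsAllowed := healthy - minAvailable
	fmt.Println("voluntary disruptions allowed:", disruptionsAllowed) // 1
	// The first drain is allowed; the second is blocked until a replacement
	// member is healthy again, so etcd keeps at least 3 healthy members.
}
```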

5. Check all etcd members for all owned machines are healthy via Cluster etcd Operator status signalling.
6. If (NOT all etcd members are healthy OR NOT all owned machines have a backed ready node) then controller short circuits here, log, update status and requeue. Else:
7. Pick oldest machine in more populated failure domain out of a candidates to be deleted list (by default all owned Machines).
8. Remove etcd member.


Isn't there a 'remove-node' API? So the etcd member removal will be done by the CEO rather than this controller?

Contributor

scaling etcd is owned by cluster-etcd-operator.
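For step 7 of scale in ("pick oldest machine in more populated failure domain"), here is a self-contained sketch of one way to do the selection; the candidate type is a simplified stand-in for a Machine, not the proposal's API:

```go
package controlplane

import (
	"sort"
	"time"
)

type candidate struct {
	Name          string
	FailureDomain string
	Created       time.Time // Machine creation timestamp
}

// pickForDeletion returns the oldest candidate within the most populated
// failure domain (ties between equally populated domains go to the domain
// holding the oldest machine). It sorts the slice in place.
func pickForDeletion(cands []candidate) *candidate {
	if len(cands) == 0 {
		return nil
	}
	counts := map[string]int{}
	for _, c := range cands {
		counts[c.FailureDomain]++
	}
	sort.Slice(cands, func(i, j int) bool { return cands[i].Created.Before(cands[j].Created) })
	best := &cands[0]
	for i := range cands {
		if counts[cands[i].FailureDomain] > counts[best.FailureDomain] {
			best = &cands[i]
		}
	}
	return best
}
```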

Contributor

@michaelgugino michaelgugino left a comment

Some of the basic items like "What happens when I delete all the machines" aren't really called out here. Is there any kind of race condition possible that will attempt to remove too many etcd members too quickly?

There's a lot of functionality that needs to be defined still. My preference is to create this object without scalability first, and after it matures, add scalability later if appropriate.

Some things I think we should add to this particular proposal are user stories for monitoring, metrics, and alerting; UX process for replacing a control plane machine; UX for vertically sizing existing machines.

### User Stories [optional]

#### Story 1
- As an operator running Installer Provisioned Infrastructure (IPI), I want flexibility to run [large or small clusters](https://kubernetes.io/docs/setup/best-practices/cluster-large/#size-of-master-and-master-components) so I need to have the ability to resize the control plane in a declarative, automated and seamless manner.
Contributor

Resizing doesn't make a lot of sense out of the box. Oftentimes, there are AZs we're not utilizing for the CP. Should we attempt to utilize all AZs or just existing AZs? Do we support resizing to 1?

- As an operator running User Provisioned Infrastructure (UPI), I want to expose my non-machine API machines and offer them to the Control Plane controller so I can have the ability to resize the control plane in a declarative, automated and seamless manner.

#### Story 3
- As an operator of an OCP Dedicated Managed Platform, I want to give users flexibility to add as many worker nodes as they want or to enable autoscaling on worker nodes, so I need to have the ability to resize the control plane instances in a declarative, automated and seamless manner to react quickly to cluster growth.
Contributor

Does this actually align with the long term needs of Dedicated? Is there an RFE or Bug to support this story?

#### Declarative horizontal scaling
##### Scale out
1. The controller always reconciles towards expected number of replicas. Validation enforce this to be an odd number.
2. Fetch all existing control plane Machine resources by ownerRef. Adopt any other machine having a targeted label e.g `node-role.kubernetes.io/master`.
Contributor

What happens if we misconfigure targeted labels or users add labels to workers?

3. Compare with expected replicas number. If expected is higher than current then:
4. Check all owned machines have a backed ready node.
5. Check all etcd members for all owned machines are healthy via Cluster etcd Operator status signalling.
6. If (NOT all etcd members are healthy OR NOT all owned machines have a backed ready node) then controller short circuits here, log, update status and requeue. Else:
Contributor

What happens when we're scaling from 3 to 5, the 4th comes up and the 5th doesn't? Are we relying on the MHC to kick in at some point? Any alerting for a specific set of conditions here?

##### Scale out
1. The controller always reconciles towards expected number of replicas. Validation enforce this to be an odd number.
2. Fetch all existing control plane Machine resources by ownerRef. Adopt any other machine having a targeted label e.g `node-role.kubernetes.io/master`.
3. Compare with expected replicas number. If expected is higher than current then:
Contributor

What are we considering 'replicas' in this context? Machines?

7. Trigger scale out workflow (starting in 4). Set "replaced" annotation. Requeue.
This is effectively a rolling upgrade with maxUnavailable 0 and maxSurge 1.

#### Self healing (MHC + reconciling)
Contributor

Is MHC always mandatory if we introduce this component?

This is effectively a rolling upgrade with maxUnavailable 0 and maxSurge 1.

#### Self healing (MHC + reconciling)
1. The Controller should always reconcile by removing etcd members for voluntary Machine disruptions, i.e machine deletion.
Contributor

What if a user deletes the machine intentionally? What is this controller supposed to do?

For MachineSets, a new machine is created instantly, and as soon as a machine is drained, it's deleted.

This behavior is undesirable for control planes. If you are deleting a machine, you probably want the new machine to come online and join the cluster before actually deleting the instance in the cloud provider (for worst-case DR scenarios).

We need to work out the particular steps for the scenario where a machine has been deleted and then the control plane is scaled.

Contributor

To this end, how do we ensure that a user gets a replacement machine in the same AZ as the machine being deleted?

EnableAutorepair bool `json:"enableautorepair,omitempty"`

// ProviderSpec details Provider-specific configuration to use during node creation.
ProviderSpec ProviderSpec `json:"providerSpec"`
Contributor

+1.

What is providerSpec in this context?

My thoughts were: rather than trying to dynamically build Machines from templates and several data sources that differ per provider, the installer can populate a list of templates for each supported AZ at creation time. Basically a machineset, but the controller chooses 1 of N templates rather than just the one.

}

// ProviderSpec defines the configuration to use during node creation.
type ProviderSpec struct {
Contributor

Are we declaring this to not be reconciled similar to MachineSets?

This proposes an ad-hoc CRD and controller for declaratively managing as a single entity the compute resources that host the OCP Control Plane components.

This controller differs from a regular machineSet in that:
- It ensures that scaling operations are non disruptive for etcd:
Contributor

So would this replace the need for the etcd quorum-guard to enforce the PDB, or will that still be required?

This controller differs from a regular machineSet in that:
- It ensures that scaling operations are non disruptive for etcd:
- It scales one resource at a time.
- It lets scaling operations proceed only when all etcd members are healthy and all the owned machines have a backed ready node.
Contributor

I think we can let cluster-etcd-operator conclude when scaling etcd is appropriate. Meaning, if we have a 3-node cluster and are going to replace node N, cluster-etcd-operator will watch for changes to your $resource and then scale up or down based on those observations. You can't roll the next node to upgrade until etcd comes back up, as per the PDB. We already watch Nodes, so how will this improve our observability of change?

- It scales one resource at a time.
- It lets scaling operations proceed only when all etcd members are healthy and all the owned machines have a backed ready node.
- It ensures an even spread of compute resources across failure domains.
- It removes etcd membership for voluntary machine disruptions (Question: can rather etcd operator somehow handle this?).
Contributor

yes cluster-etcd-operator owns etcd scaling

5. Check all etcd members for all owned machines are healthy via Cluster etcd Operator status signalling.
6. If (NOT all etcd members are healthy OR NOT all owned machines have a backed ready node) then controller short circuits here, log, update status and requeue. Else:
7. Pick oldest machine in more populated failure domain out of a candidates to be deleted list (by default all owned Machines).
8. Remove etcd member.
Contributor

scaling etcd is owned by cluster-etcd-operator.


### Risks and Mitigations

During horizontal scaling operations there are sensitive scenarios like scaling from 1 to 2. As soon as the etcd API is notified of the new member, the cluster loses quorum until that new member starts and joins the cluster. This must still be handled by the [Cluster etcd Operator](https://github.com/openshift/enhancements/blob/master/enhancements/etcd/cluster-etcd-operator.md#motivation), while the Control Plane controller should honour this and short-circuit when it meets the etcd unhealthiness criteria described in the workflows above.
Contributor

Currently scaling from 1 to 2 only happens during bootstrap or disaster recovery. Can you outline a scenario where we would go from 1 to 2 nodes?

Contributor

My guess is scaling from 1 to 3 entails a stop, however brief, at 2.

@enxebre
Member Author

enxebre commented Apr 15, 2020

Based on etcd-team feedback, it should be safe to add/remove (remove coming soon) new control plane compute capacity while honouring the PDB, and to have cluster-etcd-operator exclusively handle the etcd operational aspects orthogonally. I'll update the proposal to drop the etcd "pet" logic and possibly build atop machineSets.

@enxebre
Member Author

enxebre commented Apr 23, 2020

Based on etcd-team feedback, it should be safe to add/remove (remove coming soon) new control plane compute capacity while honouring the PDB, and to have cluster-etcd-operator exclusively handle the etcd operational aspects orthogonally. I'll update the proposal to drop the etcd "pet" logic and possibly build atop machineSets.

I revamped this proposal with a new PR that proposes just adding machineSets and an MHC for masters, while the etcd operator handles all of its operational aspects gracefully underneath: #292

@enxebre enxebre closed this Apr 23, 2020
Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
do-not-merge/hold: Indicates that a PR should not merge because someone has issued a /hold command.