
Operator: add support for downscaling #5019

Merged: 18 commits, Jun 21, 2022

Conversation

@nicolaferraro (Member) commented Jun 3, 2022:

Cover letter

This adds support for downscaling clusters in Kubernetes. Downscaling is done in coordination with the cluster, using the /decommission endpoint. The StatefulSet is downscaled only if the cluster allows it (e.g. having a topic with 3 replicas prevents the cluster from scaling to fewer than 3 nodes).

When the cluster prevents the StatefulSet from scaling down, replicas can still be reverted to the previous value, and the operator will automatically trigger a /recommission of the broker.

This also removes the need to use an empty seed_server when node 0 starts, because fresh clusters now start with a single replica, then are upscaled.

The whole PR has been created with nodeID possibly diverging from StatefulSet ordinal (a PR to do that is expected after this).

It also deals with many edge cases (see controller tests).
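To make the flow concrete, here is a minimal Go sketch of the scaling decision described above. All names (`nextAction`, `ActionDecommission`, etc.) are illustrative, not the operator's actual API:

```go
package main

import "fmt"

// Action represents what the operator should do next. All names here are
// illustrative, not the PR's actual types.
type Action string

const (
	ActionNone         Action = "none"
	ActionUpscale      Action = "upscale"
	ActionDecommission Action = "decommission" // call the /decommission endpoint
	ActionRecommission Action = "recommission" // call the /recommission endpoint
)

// nextAction sketches the scaling decision: downscaling goes through
// decommission, and if the user reverts spec.replicas while a node is
// still draining, the node is recommissioned instead.
func nextAction(specReplicas, currentReplicas int32, decommissioning *int32) Action {
	switch {
	case decommissioning != nil && specReplicas > *decommissioning:
		return ActionRecommission // user reverted the downscale
	case decommissioning != nil:
		return ActionDecommission // keep waiting for the drain to complete
	case specReplicas < currentReplicas:
		return ActionDecommission
	case specReplicas > currentReplicas:
		return ActionUpscale
	default:
		return ActionNone
	}
}

func main() {
	node := int32(2)
	fmt.Println(nextAction(3, 3, nil))   // none
	fmt.Println(nextAction(2, 3, nil))   // decommission
	fmt.Println(nextAction(3, 2, &node)) // recommission
}
```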

Release notes

Features

  • operator: clusters can now be downscaled (alpha feature, enabled using the controller startup flag --allow-downscaling)
  • operator: multi-replica clusters are initialized with one replica, then upscaled to the desired number, without the need to create an empty seed-server

@nicolaferraro (Member, author) commented:

Code ready for review.

It needs #4964 to work properly, because when scaling up from one node, node 0 is restarted and sometimes the cluster fails to get leadership.

@nicolaferraro (Member, author) commented:

The problem with node 0 not being able to form the initial cluster after restart was caused by the maintenance mode hooks enabled on it: node 0 was losing leadership and was no longer leader for anything after restart. I've disabled maintenance mode when the cluster is starting up. That forces a restart during the first creation of a multi-node cluster (it will be fixed by #4907).

@alenkacz (Contributor) left a comment:

Finished first pass 😅 thanks for the clear commit structure, that helped a lot 👏

}

// isRunningReplicas checks if the statefulset is configured to run the given amount of replicas and that the pods also match expectations
func (r *StatefulSetResource) isRunningReplicas(
Contributor:

verifyRunningCount ? 🤔

@nicolaferraro force-pushed the decommission branch 4 times, most recently from ad66435 to 8f3b0bd on June 10, 2022
@nicolaferraro (Member, author) commented:

This is ready for a second pass. Other than applying the suggestions, I've also found and fixed other issues in this new version:

  • When upgrading the operator with an existing cluster running, it's likely that status.desiredReplicas of the CR is not initialized (unlike the old status.replicas), so I took care of initializing it properly for existing clusters. Only fresh clusters start the initialization from 1 node (then scale up), while existing clusters are initialized at the current number of replicas.
  • When a node is decommissioned and shut down, its maintenance mode hooks can put the cluster in an inconsistent state (the node is deleted from the cluster, but at the same time holds the lock on maintenance mode, preventing other nodes from entering maintenance mode and shutting down properly). See: Enabling maintenance mode on a decommissioned node leaves the cluster in an inconsistent state #4999. I've added a workaround to force releasing the maintenance mode lock.
  • I've added a hack to still remove the seed servers when they contain a single entry. Normally clusters are able to form when the seed server list contains a single entry pointing to the node itself, except when the Kafka API is secured with mutual TLS (this is why the helm test was failing). So I empty the seed server list when it contains a single entry: it should be equivalent, and it still prevents the issues we're facing with the logic based on ordinal 0.
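The single-entry seed-server workaround from the last bullet could look roughly like this (a hypothetical helper, not the PR's actual code):

```go
package main

import "fmt"

// computeSeedServers sketches the workaround: when the seed list would
// contain a single entry pointing at the node itself, return an empty list
// so the node bootstraps a fresh cluster instead of trying to contact
// itself (which fails when the Kafka API uses mutual TLS).
func computeSeedServers(allHosts []string, self string) []string {
	if len(allHosts) == 1 && allHosts[0] == self {
		return []string{}
	}
	return allHosts
}

func main() {
	fmt.Println(computeSeedServers([]string{"rp-0"}, "rp-0"))         // []
	fmt.Println(computeSeedServers([]string{"rp-0", "rp-1"}, "rp-0")) // [rp-0 rp-1]
}
```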

cc: @alenkacz , @pvsune, @RafalKorepta

@alenkacz (Contributor) left a comment:

I think the API is the only blocker for me, otherwise looks REALLY good, thank you 👏

// ComputeInitialDesiredReplicas calculates the initial value for status.desiredReplicas.
//
// It needs to consider the following cases:
// - Fresh cluster: we start from 1 replica, then upscale if needed
Contributor:

Why is desired 1 on a fresh cluster? To me, desired seems like it should be equal to the replicas in spec 🤔

Contributor:

Checking the HPA docs, I wonder whether this field shouldn't rather be named currentReplicas? I feel like desired is the target, final value 🤔 https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/

but feel free to correct me, I just want to get the API right so it's not confusing since it's hard to change

Member Author:

Yeah, I took the name from HPA, but I agree it's confusing, especially in the case where the user "desires" multiple replicas but the controller keeps them at a lower value. currentReplicas may be better.

Member Author:

The reason for 1 replica is to prevent the problem with seed_servers: to have them empty only at cluster creation and never again.

// In case of a single seed server, the list should contain the current node itself.
// Normally the cluster is able to recognize it's talking to itself, except when the cluster is
// configured to use mutual TLS.
// So, we clear the list of seeds to help Redpanda.
Contributor:

Do you know why that is? Does the node not talk to itself over mTLS? Is this a bug in Redpanda that they are looking to fix at some point?

Member Author:

Yeah, I'm going to add a comment about this in the seed servers issue.


return admin.NodeConfig{}, nil
}

// nolint:goerr113 // test code
Contributor:

we need to disable this linter completely 🙄

@@ -166,8 +175,21 @@ func (r *StatefulSetResource) Ensure(ctx context.Context) error {
return fmt.Errorf("error while fetching StatefulSet resource: %w", err)
}
r.LastObservedState = &sts

// Hack for: https://github.com/redpanda-data/redpanda/issues/4999
Contributor:

uff :)

}

r.logger.Info("Running scale handler", "resource name", r.Key().Name)
return r.handleScaling(ctx)
Contributor:

Do we have a test for this? Can we maybe change one of the update tests to also scale, to see that both can be done at the same time?

@@ -366,6 +367,10 @@ func (m *mockAdminAPI) SetUnavailable(unavailable bool) {
m.unavailable = unavailable
}

func (m *mockAdminAPI) GetNodeConfig() (admin.NodeConfig, error) {
Contributor:

Missing context argument.

Comment on lines +370 to +387
func (m *mockAdminAPI) GetNodeConfig(
_ context.Context,
) (admin.NodeConfig, error) {
Contributor:

NIT: This change should belong to the "operator: allow scoping internal admin API to specific nodes" commit

3e77ca2#diff-8dc5beefd6b5f6302ad354811063b60d63ef4062149258baa3f5e3284502a963R370

}

r.logger.Info("Running scale handler", "resource name", r.Key().Name)
return r.handleScaling(ctx)
Contributor:

Can I ask you to add more context to the commit message, with details about the new order of operations in the StatefulSet resource reconciliation function? You could add the caveats around edge cases.

@nicolaferraro (Member, author) commented:

Fixed, and added more context to the commit messages.

@RafalKorepta (Contributor) left a comment:

Thank you for this amazing set of features! I left a few comments.

)

const (
DecommissionRequeueDuration = time.Second * 10
Contributor:

NIT: Should this timeout have some back-off mechanism? If the cluster is big enough (in number of partitions or data on disk), it could be wise to ask every minute; for a small cluster it would be good to ask more frequently (less than every second).

@@ -841,6 +841,16 @@ func (r *Cluster) IsUsingMaintenanceModeHooks() bool {
return true
}

// ClusterSpec

// GetReplicas returns the replicas field if present, or 1
Contributor:

Why 1? If the user edits the cluster CR and removes replicas, will it be downscaled to 1 node?

Contributor:

I guess so but I think it's a sane default... it's either 1 or 0 🤷‍♀️

Contributor:

Disallow empty replicas

Log: ctrl.Log.WithName("controllers").WithName("redpanda").WithName("Cluster"),
Scheme: mgr.GetScheme(),
AdminAPIClientFactory: adminutils.NewInternalAdminAPI,
DecommissionWaitInterval: 10 * time.Second,
Contributor:

I'm not sure that 10 seconds is a good hardcoded interval. This PR is big enough already, but maybe we could expose a new flag so the 10-second default could be changed?

ordinal := *r.pandaCluster.Status.DecommissioningNode
targetReplicas := ordinal

scaledDown, err := r.verifyRunningCount(ctx, targetReplicas)
Contributor:

NIT: verifyRunningCount doesn't check whether targetReplicas (aka the decommissioning node ID) is decreasing, by looking at the StatefulSet definition

Suggested change
scaledDown, err := r.verifyRunningCount(ctx, targetReplicas)
differentCount, err := r.verifyRunningCount(ctx, targetReplicas)

// preventing other pods clean shutdown.
//
// See: https://github.com/redpanda-data/redpanda/issues/4999
func (r *StatefulSetResource) disableMaintenanceModeOnDecommissionedNodes(
Contributor:

Any unit test/Ginkgo test?

alenkacz previously approved these changes Jun 15, 2022
@alenkacz (Contributor) left a comment:

LGTM

RafalKorepta previously approved these changes Jun 15, 2022
@nicolaferraro (Member, author) commented:

Done with the latest changes we agreed on:

  • Removed the possibility to set nil replicas (that was never my intention). Now it's enforced in the webhook; previously the cluster just failed on nil replicas.
  • Made the decommission wait interval configurable (plus jitter; no backoff, as I'd have needed to track it per CR).

Now the status fields have the following behaviour:
- replicas: reflects the StatefulSet's status.replicas (no longer readyReplicas)
- readyReplicas: reflects the StatefulSet's status.readyReplicas
- currentReplicas: managed by the operator to dynamically change the current number of replicas, gradually matching user expectations
When decommissioning a node, the decommissioningNode field is populated with the ordinal number of the node being decommissioned. In case of recommission, it also indicates the node currently being recommissioned.
The replicas field can freely change, and the controller will make sure that nodes are properly decommissioned. The only remaining restriction is that replicas cannot be 0 or nil.
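As a rough Go sketch, the status fields described above might be shaped as follows (the field set and types are assumptions; the real CRD type carries more fields and kubebuilder markers):

```go
package main

import "fmt"

// ClusterStatus sketches the status fields discussed above.
type ClusterStatus struct {
	Replicas            int32  // mirrors the StatefulSet's status.replicas
	ReadyReplicas       int32  // mirrors the StatefulSet's status.readyReplicas
	CurrentReplicas     int32  // operator-managed, gradually converges to spec.replicas
	DecommissioningNode *int32 // ordinal being decommissioned (or recommissioned), nil otherwise
}

func main() {
	node := int32(2)
	s := ClusterStatus{Replicas: 3, ReadyReplicas: 2, CurrentReplicas: 3, DecommissioningNode: &node}
	fmt.Println(*s.DecommissioningNode) // 2
}
```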
This allows the pkg/resources package to use the admin API internal interface.
This allows getting local information from brokers, such as the local configuration.
This produced a stacktrace in the logs, while waiting for a condition.
… nodes

This adds a handler that correctly manages upscaling and downscaling the cluster, decommissioning nodes when needed.

The handler uses `status.currentReplicas` to signal the amount of replicas that all subcontrollers should materialize.
When a cluster is downscaled, the handler first tries to decommission the last node via admin API, then decreases the value of `status.currentReplicas`, to remove the node only when the cluster allows it.

In case the cluster refuses to decommission a node (e.g. min replicas on a topic higher than the desired number of nodes), the user can increase `spec.replicas` to trigger a recommission of the node.
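The decommission-then-shrink ordering described above can be sketched as follows; `brokerAPI` and `downscaleStep` are hypothetical names, not the operator's real interfaces:

```go
package main

import "fmt"

// brokerAPI captures just the two calls this sketch needs.
type brokerAPI interface {
	Decommission(nodeID int) error
	BrokerGone(nodeID int) bool
}

// downscaleStep sketches the ordering: request decommission of the last node
// via the admin API, and only lower currentReplicas (which shrinks the
// StatefulSet) once the broker has actually left the cluster.
func downscaleStep(api brokerAPI, currentReplicas int32) (newReplicas int32, done bool, err error) {
	last := int(currentReplicas - 1)
	if err := api.Decommission(last); err != nil {
		return currentReplicas, false, err
	}
	if !api.BrokerGone(last) {
		// Partitions are still moving off the node: keep replicas and requeue.
		return currentReplicas, false, nil
	}
	return currentReplicas - 1, true, nil
}

// fakeAPI simulates a cluster that finishes draining after one check.
type fakeAPI struct{ checks int }

func (f *fakeAPI) Decommission(int) error { return nil }
func (f *fakeAPI) BrokerGone(int) bool    { f.checks++; return f.checks > 1 }

func main() {
	api := &fakeAPI{}
	n, done, _ := downscaleStep(api, 3)
	fmt.Println(n, done) // 3 false (still draining)
	n, done, _ = downscaleStep(api, 3)
	fmt.Println(n, done) // 2 true
}
```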
…nitial raft group

This tries to solve the problem with empty seed_servers on node 0. With this change, all fresh clusters will be initially set to 1 replica (via `status.currentReplicas`), until a cluster is created and the operator can verify it via admin API. Then the cluster is scaled to the number of instances desired by the user.

After the cluster is initialized, and for the entire lifetime of the cluster, the `seed_servers` property will be populated with the full list of available servers, in every node of the cluster.

This overcomes redpanda-data#333. Previously, node 0 was always forced to have an empty seed_servers property, but this caused problems when it lost the data dir, as it tried to create a brand-new cluster. With this change, even if node 0 loses the data dir, the seed_servers property will always point to other nodes, so it will try to join the existing cluster.
Since nodes auto-drain as part of their shutdown hooks, it can happen that, when maintenance mode is activated for a decommissioned node, no process is really started. We just exit if that is the case.
…fresh cluster

This allows a predictable initial cluster formation. When the cluster is first created, it's composed of a single node. On single-node clusters we should not activate maintenance mode because, otherwise, a restart of the node would make it drain leadership and the cluster would not form.

On the flip side, enabling maintenance mode when the cluster scales to multiple instances currently causes a restart of node 0. This will be solved when implementing dynamic hooks.
…rkaround for redpanda-data#4999)

When a node is shut down after decommission, the maintenance mode hooks will trigger. While the process has no visible effect on partitions, it leaves the cluster in an inconsistent state, such that other nodes cannot enter maintenance mode. We force-reset the flag with this change.
We should enable downscaling as a feature gate once the issue with reusable node IDs is fixed.
@nicolaferraro (Member, author) commented:

Added an --allow-downscaling flag, so the feature is disabled by default.

I couldn't find a better way to pass config to the webhook 😞.

@RafalKorepta (Contributor) left a comment:

I left non-blocking comments; they are only stylistic NITs or about documentation.

Is there planned work on documentation?

cc @Feediver1

Comment on lines +78 to +82
if r.pandaCluster.Status.CurrentReplicas == 0 {
// Initialize the currentReplicas field, so that it can be later controlled
r.pandaCluster.Status.CurrentReplicas = r.pandaCluster.ComputeInitialCurrentReplicasField()
return r.Status().Update(ctx, r.pandaCluster)
}
Contributor:

NIT: Should this go to default webhook?

}

if r.pandaCluster.Status.DecommissioningNode == nil || r.pandaCluster.Status.CurrentReplicas > *r.pandaCluster.Status.DecommissioningNode {
// Only if actually in a decommissioning phase
Contributor:

NIT: This comment is misleading, as it describes the opposite: when this branch is entered it means "do nothing", because no decommissioning is in progress or the cluster is scaling up.

This comment would be more accurate 2 lines down.

Contributor:

Suggested change
// Only if actually in a decommissioning phase
// Only if not in decommissioning phase

Comment on lines +46 to +48
// AllowDownscalingInWebhook controls the downscaling alpha feature in the Cluster custom resource.
// Downscaling is not stable since nodeIDs are currently not reusable, so adding to a cluster a node
// that has previously been decommissioned can cause issues.
Contributor:

It would be good to describe, in the commit message or in this comment, what consequences might occur if someone downscales the cluster while Kafka clients are still connected.

cc @jcsp @mmaslankaprv

@nicolaferraro nicolaferraro merged commit 035ed04 into redpanda-data:dev Jun 21, 2022