
Operator: add support for downscaling #5019

Merged: 18 commits, Jun 21, 2022

Conversation

@nicolaferraro (Member) commented Jun 3, 2022:

Cover letter

This adds support for downscaling clusters in Kubernetes. Downscaling is done in coordination with the cluster, using the /decommission endpoint. The StatefulSet is downscaled only if the cluster allows it (e.g. having a topic with 3 replicas prevents the cluster from scaling to fewer than 3 nodes).

When the cluster prevents the StatefulSet from scaling down, replicas can still be reverted to the previous value, and the operator will automatically trigger a /recommission of the broker.

This also removes the need to use an empty seed_server when node 0 starts, because fresh clusters now start with a single replica, then are upscaled.

The whole PR has been created with nodeID possibly diverging from StatefulSet ordinal (a PR to do that is expected after this).

It also deals with many edge cases (see controller tests).
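To make the flow concrete, here is a minimal Go sketch of the scaling decision described above. All names (`nextAction`, `ActionDecommission`, etc.) are illustrative, not the operator's actual API:

```go
package main

import "fmt"

// Action represents what the operator should do next. All names here are
// illustrative, not the PR's actual types.
type Action string

const (
	ActionNone         Action = "none"
	ActionUpscale      Action = "upscale"
	ActionDecommission Action = "decommission" // call the /decommission endpoint
	ActionRecommission Action = "recommission" // call the /recommission endpoint
)

// nextAction sketches the scaling decision: downscaling goes through
// decommission, and if the user reverts spec.replicas while a node is
// still draining, the node is recommissioned instead.
func nextAction(specReplicas, currentReplicas int32, decommissioning *int32) Action {
	switch {
	case decommissioning != nil && specReplicas > *decommissioning:
		return ActionRecommission // user reverted the downscale
	case decommissioning != nil:
		return ActionDecommission // keep waiting for the drain to complete
	case specReplicas < currentReplicas:
		return ActionDecommission
	case specReplicas > currentReplicas:
		return ActionUpscale
	default:
		return ActionNone
	}
}

func main() {
	node := int32(2)
	fmt.Println(nextAction(3, 3, nil))   // none
	fmt.Println(nextAction(2, 3, nil))   // decommission
	fmt.Println(nextAction(3, 2, &node)) // recommission
}
```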

Release notes

Features

  • operator: clusters can now be downscaled (alpha feature, enabled using the controller startup flag --allow-downscaling)
  • operator: multi-replica clusters are initialized with one replica, then upscaled to the desired number, without the need to create an empty seed-server

@nicolaferraro (Member, author) commented:

Code ready for review.

It needs #4964 to work properly, because when scaling up from one node, node 0 is restarted and sometimes the cluster fails to get leadership.

@nicolaferraro (Member, author) commented:

The problem with node 0 not being able to form the initial cluster after restart was caused by the maintenance mode hooks enabled on it: node 0 was losing leadership and was no longer leader for anything after restart. I've disabled maintenance mode when the cluster is starting up. That forces a restart during the first creation of a multi-node cluster (it will be fixed by #4907).

@alenkacz (Contributor) left a comment:

Finished first pass 😅 thanks for the clear commit structure, that helped a lot 👏

}

// isRunningReplicas checks if the statefulset is configured to run the given amount of replicas and that the pods also match expectations
func (r *StatefulSetResource) isRunningReplicas(
Contributor:

verifyRunningCount ? 🤔

@nicolaferraro force-pushed the decommission branch 4 times, most recently from ad66435 to 8f3b0bd on June 10, 2022
@nicolaferraro (Member, author) commented:

This is ready for a second pass. Other than applying the suggestions, I've also found and fixed other issues in this new version:

  • When upgrading the operator with an existing cluster running, it's likely that status.desiredReplicas of the CR is not initialized (unlike the old status.replicas), so I took care of initializing it properly for existing clusters. Only fresh clusters start the initialization from 1 node (then scale up), while existing clusters are initialized at the current number of replicas.
  • When a node is decommissioned and shut down, its maintenance mode hooks can put the cluster in an inconsistent state (the node is deleted from the cluster, but at the same time holds the lock on maintenance mode, preventing other nodes from entering maintenance mode and shutting down properly). See: Enabling maintenance mode on a decommissioned node leaves the cluster in an inconsistent state #4999. I've added a workaround to force releasing the maintenance mode lock.
  • I've added a hack to still remove the seed servers when they contain a single entry. Normally clusters are able to form when the seed server list contains a single entry pointing to the node itself, except when the Kafka API is secured with mutual TLS (this is why the helm test was failing). So I empty the seed server list when it contains a single entry: it should be equivalent, and it still prevents the issues we're facing with the logic based on ordinal 0.
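The single-entry seed-server workaround from the last bullet could look roughly like this (a hypothetical helper, not the PR's actual code):

```go
package main

import "fmt"

// computeSeedServers sketches the workaround: when the seed list would
// contain a single entry pointing at the node itself, return an empty list
// so the node bootstraps a fresh cluster instead of trying to contact
// itself (which fails when the Kafka API uses mutual TLS).
func computeSeedServers(allHosts []string, self string) []string {
	if len(allHosts) == 1 && allHosts[0] == self {
		return []string{}
	}
	return allHosts
}

func main() {
	fmt.Println(computeSeedServers([]string{"rp-0"}, "rp-0"))         // []
	fmt.Println(computeSeedServers([]string{"rp-0", "rp-1"}, "rp-0")) // [rp-0 rp-1]
}
```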

cc: @alenkacz , @pvsune, @RafalKorepta

@alenkacz (Contributor) left a comment:

I think the API is the only blocker for me, otherwise looks REALLY good, thank you 👏

// ComputeInitialDesiredReplicas calculates the initial value for status.desiredReplicas.
//
// It needs to consider the following cases:
// - Fresh cluster: we start from 1 replica, then upscale if needed
Contributor:

Why is desired 1 on a fresh cluster? To me, desired seems like it should be equal to the replicas in spec 🤔

Contributor:

Checking the HPA docs, I wonder whether this field shouldn't rather be named currentReplicas? I feel like desired is the target, final value 🤔 https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/

but feel free to correct me, I just want to get the API right so it's not confusing since it's hard to change

Member Author:

Yeah, I took the name from HPA, but I agree it's confusing, especially in the case where the user "desires" multiple replicas but the controller keeps them at a lower value. currentReplicas may be better.

Member Author:

The reason for 1 replica is to prevent the problem with seed_servers: to have them empty only at cluster creation and never again.

// In case of a single seed server, the list should contain the current node itself.
// Normally the cluster is able to recognize it's talking to itself, except when the cluster is
// configured to use mutual TLS.
// So, we clear the list of seeds to help Redpanda.
Contributor:

Do you know why that is? Does the node not talk to itself over mTLS? Is this a bug in Redpanda that they are looking to fix at some point?

Member Author:

Yeah, I'm going to add a comment about this in the seed servers issue.


return admin.NodeConfig{}, nil
}

// nolint:goerr113 // test code
Contributor:

we need to disable this linter completely 🙄

@@ -166,8 +175,21 @@ func (r *StatefulSetResource) Ensure(ctx context.Context) error {
return fmt.Errorf("error while fetching StatefulSet resource: %w", err)
}
r.LastObservedState = &sts

// Hack for: https://github.com/redpanda-data/redpanda/issues/4999
Contributor:

uff :)

}

r.logger.Info("Running scale handler", "resource name", r.Key().Name)
return r.handleScaling(ctx)
Contributor:

Do we have a test for this? Can we maybe change one of the update tests to also scale, to see that both can be done at the same time?

@@ -366,6 +367,10 @@ func (m *mockAdminAPI) SetUnavailable(unavailable bool) {
m.unavailable = unavailable
}

func (m *mockAdminAPI) GetNodeConfig() (admin.NodeConfig, error) {
Contributor:

Missing context argument.

Comment on lines +370 to +387
func (m *mockAdminAPI) GetNodeConfig(
_ context.Context,
) (admin.NodeConfig, error) {
Contributor:

NIT: This change should belong to the "operator: allow scoping internal admin API to specific nodes" commit

3e77ca2#diff-8dc5beefd6b5f6302ad354811063b60d63ef4062149258baa3f5e3284502a963R370

}

r.logger.Info("Running scale handler", "resource name", r.Key().Name)
return r.handleScaling(ctx)
Contributor:

Can I ask you to add more context to the commit message, with details about the new order of operations in the StatefulSet resource reconciliation function? You could add the caveats around edge cases.

@nicolaferraro (Member, author) commented:

Fixed, and added more context to the commit messages.

@RafalKorepta (Contributor) left a comment:

Thank you for this amazing set of features! I left a few comments.

)

const (
DecommissionRequeueDuration = time.Second * 10
Contributor:

NIT: Should this timeout have some back-off mechanism? If the cluster is big enough (in number of partitions or data on disk), it could be wise to ask every minute; for a small cluster it would be good to ask more frequently (less than every second).

@@ -841,6 +841,16 @@ func (r *Cluster) IsUsingMaintenanceModeHooks() bool {
return true
}

// ClusterSpec

// GetReplicas returns the replicas field if present, or 1
Contributor:

Why 1? If the user edits the cluster CR and removes replicas, will it be downscaled to 1 node?

Contributor:

I guess so but I think it's a sane default... it's either 1 or 0 🤷‍♀️

Contributor:

Disallow empty replicas

Log: ctrl.Log.WithName("controllers").WithName("redpanda").WithName("Cluster"),
Scheme: mgr.GetScheme(),
AdminAPIClientFactory: adminutils.NewInternalAdminAPI,
DecommissionWaitInterval: 10 * time.Second,
Contributor:

I'm not sure that 10 seconds is a good hardcoded interval. This PR is big enough already, but maybe we could expose a new flag so the 10-second default could be changed?

ordinal := *r.pandaCluster.Status.DecommissioningNode
targetReplicas := ordinal

scaledDown, err := r.verifyRunningCount(ctx, targetReplicas)
Contributor:

NIT: verifyRunningCount doesn't check whether targetReplicas (aka the decommissioning node ID) is decreasing, by looking at the StatefulSet definition

Suggested change
scaledDown, err := r.verifyRunningCount(ctx, targetReplicas)
differentCount, err := r.verifyRunningCount(ctx, targetReplicas)

// preventing other pods clean shutdown.
//
// See: https://github.com/redpanda-data/redpanda/issues/4999
func (r *StatefulSetResource) disableMaintenanceModeOnDecommissionedNodes(
Contributor:

Any unit test/Ginkgo test?

alenkacz previously approved these changes Jun 15, 2022
@alenkacz (Contributor) left a comment:

LGTM

RafalKorepta previously approved these changes Jun 15, 2022
@nicolaferraro (Member, author) commented:

Done with the latest changes we agreed on:

  • Removed the possibility to set nil replicas (that was never my intention). Now it's enforced in the webhook; previously the cluster just failed on nil replicas.
  • Made the decommission wait interval configurable (plus jitter; no backoff, as I'd have needed to track it per CR).

Now the status fields have the following behaviour:
- replicas: reflects the StatefulSet's status.replicas (no longer readyReplicas)
- readyReplicas: reflects the StatefulSet's status.readyReplicas
- currentReplicas: managed by the operator to dynamically change the current number of replicas, gradually matching user expectations
When decommissioning a node, the decommissioningNode field is populated with the ordinal number of the node being decommissioned. In case of recommission, it also indicates the node currently being recommissioned.
The replicas field can freely change, and the controller will make sure that nodes are properly decommissioned. The only remaining restriction is that replicas cannot be 0 or nil.
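As a rough Go sketch, the status fields described above might be shaped as follows (the field set and types are assumptions; the real CRD type carries more fields and kubebuilder markers):

```go
package main

import "fmt"

// ClusterStatus sketches the status fields discussed above.
type ClusterStatus struct {
	Replicas            int32  // mirrors the StatefulSet's status.replicas
	ReadyReplicas       int32  // mirrors the StatefulSet's status.readyReplicas
	CurrentReplicas     int32  // operator-managed, gradually converges to spec.replicas
	DecommissioningNode *int32 // ordinal being decommissioned (or recommissioned), nil otherwise
}

func main() {
	node := int32(2)
	s := ClusterStatus{Replicas: 3, ReadyReplicas: 2, CurrentReplicas: 3, DecommissioningNode: &node}
	fmt.Println(*s.DecommissioningNode) // 2
}
```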
This allows the pkg/resources package to use the admin API internal interface.
This allows getting local information from brokers, such as the local configuration.
This produced a stacktrace in the logs, while waiting for a condition.
… nodes

This adds a handler that correctly manages upscaling and downscaling the cluster, decommissioning nodes when needed.

The handler uses `status.currentReplicas` to signal the amount of replicas that all subcontrollers should materialize.
When a cluster is downscaled, the handler first tries to decommission the last node via admin API, then decreases the value of `status.currentReplicas`, to remove the node only when the cluster allows it.

In case the cluster refuses to decommission a node (e.g. min replicas on a topic higher than the desired number of nodes), the user can increase `spec.replicas` to trigger a recommission of the node.
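The decommission-then-shrink ordering described above can be sketched as follows; `brokerAPI` and `downscaleStep` are hypothetical names, not the operator's real interfaces:

```go
package main

import "fmt"

// brokerAPI captures just the two calls this sketch needs.
type brokerAPI interface {
	Decommission(nodeID int) error
	BrokerGone(nodeID int) bool
}

// downscaleStep sketches the ordering: request decommission of the last node
// via the admin API, and only lower currentReplicas (which shrinks the
// StatefulSet) once the broker has actually left the cluster.
func downscaleStep(api brokerAPI, currentReplicas int32) (newReplicas int32, done bool, err error) {
	last := int(currentReplicas - 1)
	if err := api.Decommission(last); err != nil {
		return currentReplicas, false, err
	}
	if !api.BrokerGone(last) {
		// Partitions are still moving off the node: keep replicas and requeue.
		return currentReplicas, false, nil
	}
	return currentReplicas - 1, true, nil
}

// fakeAPI simulates a cluster that finishes draining after one check.
type fakeAPI struct{ checks int }

func (f *fakeAPI) Decommission(int) error { return nil }
func (f *fakeAPI) BrokerGone(int) bool    { f.checks++; return f.checks > 1 }

func main() {
	api := &fakeAPI{}
	n, done, _ := downscaleStep(api, 3)
	fmt.Println(n, done) // 3 false (still draining)
	n, done, _ = downscaleStep(api, 3)
	fmt.Println(n, done) // 2 true
}
```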
…nitial raft group

This tries to solve the problem with empty seed_servers on node 0. With this change, all fresh clusters will be initially set to 1 replica (via `status.currentReplicas`), until a cluster is created and the operator can verify it via admin API. Then the cluster is scaled to the number of instances desired by the user.

After the cluster is initialized, and for the entire lifetime of the cluster, the `seed_servers` property will be populated with the full list of available servers, in every node of the cluster.

This overcomes redpanda-data#333. Previously, node 0 was always forced to have an empty seed_servers property, but this caused problems when it lost the data dir, as it tried to create a brand-new cluster. With this change, even if node 0 loses the data dir, the seed_servers property will always point to other nodes, so it will try to join the existing cluster.
Since nodes auto-drain as part of their shutdown hooks, it can happen that, when maintenance mode is activated for a decommissioned node, no process is really started. We just exit if that is the case.
…fresh cluster

This allows a predictable initial cluster formation. When the cluster is first created, it's composed of a single node. On single-node clusters we should not activate maintenance mode because, otherwise, a restart of the node would make it drain leadership and the cluster would not form.

On the flip side, enabling maintenance mode when the cluster scales to multiple instances currently causes a restart of node 0. This will be solved when implementing dynamic hooks.
…rkaround for redpanda-data#4999)

When a node is shut down after decommission, the maintenance mode hooks will trigger. While the process has no visible effect on partitions, it leaves the cluster in an inconsistent state, such that other nodes cannot enter maintenance mode. We force-reset the flag with this change.
We should enable downscaling as a feature gate once the issue with reusable node IDs is fixed.
@nicolaferraro (Member, author) commented:

Added an --allow-downscaling flag, so the feature is disabled by default.

I couldn't find a better way to pass config to the webhook 😞.

@RafalKorepta (Contributor) left a comment:

I left non-blocking comments; they are only stylistic NITs or about documentation.

Is there planned work on documentation?

cc @Feediver1

Comment on lines +78 to +82
if r.pandaCluster.Status.CurrentReplicas == 0 {
// Initialize the currentReplicas field, so that it can be later controlled
r.pandaCluster.Status.CurrentReplicas = r.pandaCluster.ComputeInitialCurrentReplicasField()
return r.Status().Update(ctx, r.pandaCluster)
}
Contributor:

NIT: Should this go to default webhook?

}

if r.pandaCluster.Status.DecommissioningNode == nil || r.pandaCluster.Status.CurrentReplicas > *r.pandaCluster.Status.DecommissioningNode {
// Only if actually in a decommissioning phase
Contributor:

NIT: This comment is misleading, as it describes the opposite: when this branch is entered it means "do nothing", because no decommissioning is in progress or the cluster is scaling up.

This comment would be more accurate 2 lines down.

Contributor:

Suggested change
// Only if actually in a decommissioning phase
// Only if not in decommissioning phase

Comment on lines +46 to +48
// AllowDownscalingInWebhook controls the downscaling alpha feature in the Cluster custom resource.
// Downscaling is not stable since nodeIDs are currently not reusable, so adding to a cluster a node
// that has previously been decommissioned can cause issues.
Contributor:

It would be good to describe, in the commit message or in this comment, what consequences might occur if someone downscales the cluster while Kafka clients are still connected.

cc @jcsp @mmaslankaprv

@nicolaferraro nicolaferraro merged commit 035ed04 into redpanda-data:dev Jun 21, 2022