Operator: add support for downscaling #5019
Conversation
Force-pushed from 74268ba to 9052ac1.
Code ready for review. It needs #4964 to work properly, because when scaling up from one node, node 0 is restarted and the cluster sometimes fails to acquire leadership.
The problem with node 0 not being able to form the initial cluster after restart was caused by the maintenance mode hooks enabled on it: node 0 was losing leadership and was no longer leader for anything after restart. I've disabled maintenance mode while the cluster is starting up. That forces a restart during the first creation of a multi-node cluster (it will be fixed by #4907).
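The rule described above (suppress maintenance mode hooks until the cluster has actually formed) can be sketched roughly as follows. All names here are illustrative stand-ins, not the operator's real API:

```go
package main

import "fmt"

// clusterState is a hypothetical, minimal view of the Cluster resource;
// the field names are assumptions made for this sketch.
type clusterState struct {
	currentReplicas int32
	formed          bool // true once the operator has verified the cluster via admin API
}

// useMaintenanceModeHooks sketches the fix: maintenance mode hooks are
// disabled while the cluster is still forming, so a restart of node 0
// cannot strand leadership before the cluster exists.
func useMaintenanceModeHooks(c clusterState) bool {
	if !c.formed {
		return false
	}
	// Single-node clusters also skip the hooks: draining the only node
	// would leave nowhere to move leadership to.
	return c.currentReplicas > 1
}

func main() {
	fmt.Println(useMaintenanceModeHooks(clusterState{currentReplicas: 1, formed: false})) // false
	fmt.Println(useMaintenanceModeHooks(clusterState{currentReplicas: 3, formed: true}))  // true
}
```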
Finished first pass 😅 thanks for the clear commit structure, that helped a lot 👏
}

// isRunningReplicas checks whether the statefulset is configured to run the given number of replicas and whether the pods match expectations
func (r *StatefulSetResource) isRunningReplicas(
verifyRunningCount? 🤔
Force-pushed from ad66435 to 8f3b0bd.
This is ready for a second pass. Besides the suggestions, I've also found and fixed other issues in this new version:
cc: @alenkacz, @pvsune, @RafalKorepta
I think the API is the only blocker for me, otherwise looks REALLY good, thank you 👏
// ComputeInitialDesiredReplicas calculates the initial value for status.desiredReplicas.
//
// It needs to consider the following cases:
//   - Fresh cluster: we start from 1 replica, then upscale if needed
Why is desired 1 on a fresh cluster? Desired seems to me like it should equal the replicas in spec 🤔
Checking the HPA docs, I wonder whether this field should rather be named currentReplicas? I feel like desired is the target and final value 🤔 https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/
But feel free to correct me; I just want to get the API right so it's not confusing, since it's hard to change.
Yeah, I took the name from HPA, but I agree it's confusing, especially in the case where the user "desires" multiple replicas but the controller keeps them at a lower value. currentReplicas may be better.
The reason for 1 replica is to prevent the problem with seed_servers: we want the list to be empty only at cluster creation and never again.
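The "start from 1, then upscale" rationale can be sketched as a tiny decision function. The names here are hypothetical, chosen for this sketch rather than taken from the operator's code:

```go
package main

import "fmt"

// computeInitialCurrentReplicas sketches the rule discussed above: a fresh
// cluster always boots a single replica, so seed_servers is empty exactly
// once (at creation) and never again. An already-formed cluster keeps the
// replica count the user asked for.
func computeInitialCurrentReplicas(specReplicas int32, clusterExists bool) int32 {
	if !clusterExists {
		// Fresh cluster: boot one node, upscale once the cluster has formed.
		return 1
	}
	return specReplicas
}

func main() {
	fmt.Println(computeInitialCurrentReplicas(3, false)) // 1: fresh cluster
	fmt.Println(computeInitialCurrentReplicas(3, true))  // 3: already formed
}
```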
// In case of a single seed server, the list should contain the current node itself.
// Normally the cluster is able to recognize it's talking to itself, except when the cluster is
// configured to use mutual TLS.
// So, we clear the list of seeds to help Redpanda.
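In code, the workaround amounts to clearing the seed list whenever it would contain only the node itself. This is a hypothetical helper illustrating the idea, not the operator's real function:

```go
package main

import "fmt"

// seedServersFor sketches the workaround described in the comment above:
// with a single replica the seed list would contain only the node itself,
// which Redpanda may fail to recognize as "self" under mutual TLS, so the
// list is cleared and the node bootstraps a fresh cluster.
func seedServersFor(allNodes []string, replicas int) []string {
	if replicas == 1 {
		// Empty seed_servers: the single node forms the initial cluster.
		return nil
	}
	return allNodes
}

func main() {
	nodes := []string{"rp-0.rp.svc", "rp-1.rp.svc"}
	fmt.Println(len(seedServersFor(nodes[:1], 1))) // 0: cleared for a single node
	fmt.Println(len(seedServersFor(nodes, 2)))     // 2: full list otherwise
}
```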
Do you know why that is? A node does not talk to itself via mTLS. Is this a bug in Redpanda that they are looking to fix at some point?
Yeah, I'm going to add a comment about this in the seeds server issue
return admin.NodeConfig{}, nil
}

// nolint:goerr113 // test code
we need to disable this linter completely 🙄
@@ -166,8 +175,21 @@ func (r *StatefulSetResource) Ensure(ctx context.Context) error {
	return fmt.Errorf("error while fetching StatefulSet resource: %w", err)
}
r.LastObservedState = &sts

// Hack for: https://github.com/redpanda-data/redpanda/issues/4999
uff :)
}

r.logger.Info("Running scale handler", "resource name", r.Key().Name)
return r.handleScaling(ctx)
Do we have a test for this? Can we maybe change one of the update tests to also scale, to see that both can be done at the same time?
Force-pushed from 8f3b0bd to 15adb01.
@@ -366,6 +367,10 @@ func (m *mockAdminAPI) SetUnavailable(unavailable bool) {
	m.unavailable = unavailable
}

func (m *mockAdminAPI) GetNodeConfig() (admin.NodeConfig, error) {
Missing context argument.
func (m *mockAdminAPI) GetNodeConfig(
	_ context.Context,
) (admin.NodeConfig, error) {
NIT: This change should belong to the "operator: allow scoping internal admin API to specific nodes" commit: 3e77ca2#diff-8dc5beefd6b5f6302ad354811063b60d63ef4062149258baa3f5e3284502a963R370
}

r.logger.Info("Running scale handler", "resource name", r.Key().Name)
return r.handleScaling(ctx)
Can I ask you to add more context to the commit message, with details about the new order of operations in the StatefulSet resource reconciliation function? You could also add the caveats around edge cases.
Force-pushed from 15adb01 to a99e310.
Fixed, and added more context to the commit messages.
Thank you for this amazing set of features! I left a few comments.
)

const (
	DecommissionRequeueDuration = time.Second * 10
NIT: Should this timeout have some back-off mechanism? If the cluster is big enough (in number of partitions or data on disk), it could be wise to ask only every minute; for a small cluster it would be good to ask more frequently (less than every second).
@@ -841,6 +841,16 @@ func (r *Cluster) IsUsingMaintenanceModeHooks() bool {
	return true
}

// ClusterSpec

// GetReplicas returns the replicas field if present, or 1
Why 1? If a user edits the cluster CR and removes replicas, will it be downscaled to 1 node?
I guess so but I think it's a sane default... it's either 1 or 0 🤷♀️
Disallow empty replicas
src/go/k8s/main.go (outdated)

Log: ctrl.Log.WithName("controllers").WithName("redpanda").WithName("Cluster"),
Scheme: mgr.GetScheme(),
AdminAPIClientFactory: adminutils.NewInternalAdminAPI,
DecommissionWaitInterval: 10 * time.Second,
I'm not sure 10 seconds is a good hardcoded interval. This PR is big enough already, but maybe we could expose a new flag so the 10-second default can be changed?
ordinal := *r.pandaCluster.Status.DecommissioningNode
targetReplicas := ordinal

scaledDown, err := r.verifyRunningCount(ctx, targetReplicas)
NIT: verifyRunningCount doesn't check that targetReplicas (a.k.a. the decommissioning node ID) is decreasing by looking at the statefulset definition.

Suggested change:
- scaledDown, err := r.verifyRunningCount(ctx, targetReplicas)
+ differentCount, err := r.verifyRunningCount(ctx, targetReplicas)
// preventing other pods' clean shutdown.
//
// See: https://github.com/redpanda-data/redpanda/issues/4999
func (r *StatefulSetResource) disableMaintenanceModeOnDecommissionedNodes(
Any unit test / Ginkgo test?
LGTM
Force-pushed from a99e310 to bef2ba2.
Done with the latest changes we agreed on.

The status fields now have the following behaviour:
- replicas: reflects the StatefulSet status replicas (no longer readyReplicas)
- readyReplicas: reflects the StatefulSet status readyReplicas
- currentReplicas: managed by the operator to dynamically change the current number of replicas, gradually matching user expectations

When decommissioning a node, the field is populated with the ordinal number of the node being decommissioned. In case of recommission, it also indicates the node currently being recommissioned.

The replicas field can change freely and the controller will make sure that nodes are properly decommissioned. The only remaining restriction is that replicas cannot be 0 or nil.
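The "replicas cannot be 0 or nil" restriction is the kind of check a validating webhook would enforce. A minimal sketch, with a hypothetical helper name rather than the operator's actual webhook code:

```go
package main

import (
	"errors"
	"fmt"
)

// validateReplicas sketches the rule above: spec.replicas may change
// freely, but must never be nil or less than 1.
func validateReplicas(replicas *int32) error {
	if replicas == nil {
		return errors.New("spec.replicas must be set")
	}
	if *replicas < 1 {
		return errors.New("spec.replicas must be at least 1")
	}
	return nil
}

func main() {
	three := int32(3)
	zero := int32(0)
	fmt.Println(validateReplicas(&three)) // nil: valid
	fmt.Println(validateReplicas(&zero))  // error: too low
	fmt.Println(validateReplicas(nil))    // error: unset
}
```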
This allows the pkg/resources package to use the admin API internal interface.
This allows getting local information from brokers, such as the local configuration.
This produced a stacktrace in the logs, while waiting for a condition.
… nodes

This adds a handler that correctly manages upscaling and downscaling the cluster, decommissioning nodes when needed. The handler uses `status.currentReplicas` to signal the number of replicas that all subcontrollers should materialize. When a cluster is downscaled, the handler first tries to decommission the last node via the admin API, then decreases the value of `status.currentReplicas`, so that the node is removed only when the cluster allows it. In case the cluster refuses to decommission a node (e.g. min replicas on a topic higher than the desired number of nodes), the user can increase `spec.replicas` to trigger a recommission of the node.
…nitial raft group

This tries to solve the problem with empty seed_servers on node 0. With this change, all fresh clusters will be initially set to 1 replica (via `status.currentReplicas`), until a cluster is created and the operator can verify it via admin API. Then the cluster is scaled to the number of instances desired by the user. After the cluster is initialized, and for the entire lifetime of the cluster, the `seed_servers` property will be populated with the full list of available servers, in every node of the cluster. This overcomes redpanda-data#333. Previously, node 0 was always forced to have an empty seed_servers property, but this caused problems when it lost the data dir, as it tried to create a brand-new cluster. With this change, even if node 0 loses the data dir, the seed_servers property will always point to other nodes, so it will try to join the existing cluster.
Since nodes are auto-draining as part of their shutdown hooks, it happens that when maintenance mode is activated for a decommissioned node, no process is really started. We just exit if that is the case.
…fresh cluster

This allows predictable initial cluster formation. When the cluster is first created, it's composed of a single node. On single-node clusters, we should not activate maintenance mode because, otherwise, a restart of the node will make it drain leadership and the cluster will not form. On the flip side, enabling maintenance mode when the cluster scales to multiple instances currently causes a restart of node 0. This will be solved when dynamic hooks are implemented.
…rkaround for redpanda-data#4999)

When a node is shut down after decommission, the maintenance mode hooks will trigger. While the process has no visible effect on partitions, it leaves the cluster in an inconsistent state, so that other nodes cannot enter maintenance mode. We force-reset the flag with this change.
We should enable downscaling as a feature gate when the issue with reusable node IDs is fixed.
Force-pushed from bef2ba2 to 5bd42df.
Added. I couldn't find a better way to pass config to the webhook 😞.
I left non-blocking comments, as they are only stylistic NITs or documentation. Is there planned work on documentation?
cc @Feediver1
if r.pandaCluster.Status.CurrentReplicas == 0 {
	// Initialize the currentReplicas field, so that it can be later controlled
	r.pandaCluster.Status.CurrentReplicas = r.pandaCluster.ComputeInitialCurrentReplicasField()
	return r.Status().Update(ctx, r.pandaCluster)
}
NIT: Should this go into the defaulting webhook?
}

if r.pandaCluster.Status.DecommissioningNode == nil || r.pandaCluster.Status.CurrentReplicas > *r.pandaCluster.Status.DecommissioningNode {
	// Only if actually in a decommissioning phase
NIT: This comment misleads, as it is the opposite: when the branch is entered, it means do nothing, because decommissioning is not executing or the cluster is scaling up. The comment would be more accurate 2 lines down.
Suggested change:
- // Only if actually in a decommissioning phase
+ // Only if not in decommissioning phase
// AllowDownscalingInWebhook controls the downscaling alpha feature in the Cluster custom resource.
// Downscaling is not stable since nodeIDs are currently not reusable, so adding a node that has
// previously been decommissioned back to a cluster can cause issues.
It would be good to describe, in the commit message or in this comment, what consequences might occur if someone downscales the cluster while Kafka clients are still connected.
Cover letter

This adds support for downscaling clusters in Kubernetes. Downscaling is done in coordination with the cluster, using the /decommission endpoint. The StatefulSet is downscaled only if the cluster allows it (e.g. having a topic with 3 replicas prevents the cluster from scaling to fewer than 3 nodes). When the cluster prevents the StatefulSet from scaling, replicas can still be reverted to the previous state and the operator will automatically trigger a /recommission of the broker.

This also removes the need to use an empty seed_servers list when node 0 starts, because fresh clusters now start with a single replica, then are upscaled.

The whole PR has been created with nodeID possibly diverging from the StatefulSet ordinal (a PR to do that is expected after this). It also deals with many edge cases (see controller tests).
Release notes

Features
- Support for downscaling clusters, behind a new feature gate flag (--allow-downscaling)