Support centralized configuration in operator #3978
Conversation
This is a great change, thanks!
On a related topic, this is a great (very big) PR 😄 The first commit took me a very long time to get through, so I couldn't review the whole thing in the time I allocated for it. In the future, please consider splitting changes into smaller commits so that it's easier to focus on specific areas. GitHub doesn't really help either, since it just sorts the files alphabetically, leaving it up to the reader to find the place to start.
I didn't get very far through this yet (67 files!), but I'm publishing a couple of comments rather than leaving them hidden in a draft review.
```go
} else if config != nil {
	if config.ClusterConfiguration == nil {
		// Upgrade scenario: current cluster is not using centralized configuration.
		// We need to extract bootstrap properties from the old format
```
Bootstrap is only needed on first startup of a cluster: if a cluster is being upgraded from an earlier version, then you can skip it entirely (it's fine for the file to simply not exist).
Right, the comment was not precise. The state indicates a specific upgrade from a version that was not using centralized configuration, e.g. 21.11.1 to 22.1.1, so I think the bootstrap file will be used in this case by the new version.
I preferred to always write the bootstrap file to the ConfigMap anyway, both to track the configuration more easily and to detect configuration drift without having to call the admin API (which may not be available during some reconciliation loops).
I've changed this computation a bit in a subsequent commit to simplify it.
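As a sketch of the layout being discussed, the bootstrap properties could sit in the ConfigMap next to the node configuration. The resource name and property keys below are illustrative placeholders, not the operator's actual keys:

```yaml
# Hypothetical ConfigMap layout (illustrative only): .bootstrap.yaml holds the
# cluster-wide properties applied on first startup, redpanda.yaml the node ones.
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-sample-base
data:
  redpanda.yaml: |
    redpanda:
      data_directory: /var/lib/redpanda/data
  .bootstrap.yaml: |
    auto_create_topics_enabled: false
```

Keeping the bootstrap content in the ConfigMap this way means the last-submitted cluster properties can be compared against the CR without an admin API round trip.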
Keeping it up to date in the ConfigMap makes sense, because newly added nodes will need to get it when starting Redpanda for the first time.
I'm not sure how that helps with detecting drift, though: if the admin API is unavailable then you cannot know whether the config has drifted, and it's correct that reconciliation cannot proceed while the API is unavailable.
I should definitely change the term "drift", also in the code. The operator is not checking whether the cluster is still configured as expected (as "drift" may imply); it's checking whether the expected cluster configuration differs from the last-submitted one (stored in the ConfigMap). Only after a "change" has been detected is the admin API checked and any drift fixed.
But if, e.g., I set the additional config `redpanda.segment_appender_flush_timeout_ms=2000` on the CR and someone manually changes that value to `3000` via the admin API, the operator will not set it back to `2000` until another change in the CR config section triggers the merge process. We said that users must be aware that they should not change the same configuration property in multiple places.
I did this verification against the ConfigMap instead of the cluster so that we're sure that, even in corner cases, the reconcile loop is not affected by the cluster state, i.e. it continues to work correctly while the cluster is starting/restarting.
If detecting real drift against the cluster is a desired feature, it's still possible to add a periodic check. Wdyt @RafalKorepta?
Agreed, a periodic check should be added. The operator's Cluster custom resource is the source of truth, and the operator should maintain the state of the cluster configuration.
It should be added in the next PR.
I should have addressed all the comments. I also tried to rewrite the history to be easier to understand (as far as possible). When I enabled the centralized configuration feature also on the … It would be great if you could have a second look: @0x5d, @RafalKorepta, @jcsp
It looks good; I left a few comments.
```go
}

// ClusterConditionType is a valid value for ClusterCondition.Type
// +enum
```
I couldn't find any reference to the `+enum` comment in https://book.kubebuilder.io/introduction.html. Can you help me understand where it is used and how?
Indeed, I took inspiration from kube structs when creating the condition, but the `+enum` tag is only known by kube-openapi. I'll replace it with the Kubebuilder equivalent.
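For illustration, a minimal sketch of what the Kubebuilder equivalent could look like. The marker placement and the condition name below are assumptions for this example, not the PR's actual code; `controller-gen` picks up `+kubebuilder:validation:Enum` markers on the type, while the kube-openapi `+enum` marker is ignored by it:

```go
package main

import "fmt"

// ClusterConditionType is a valid value for ClusterCondition.Type.
// Sketch: the kube-openapi "+enum" marker is ignored by controller-gen;
// the Kubebuilder equivalent is a validation marker on the type.
// +kubebuilder:validation:Enum=ClusterConfigured
type ClusterConditionType string

const (
	// ClusterConfiguredConditionType reports whether the cluster
	// configuration has been fully applied (hypothetical name).
	ClusterConfiguredConditionType ClusterConditionType = "ClusterConfigured"
)

func main() {
	fmt.Println(ClusterConfiguredConditionType)
}
```

With the marker in place, the generated CRD schema rejects any other string value for the field at admission time.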
```go
// Package utils contains useful functions for the operator
package utils
```
Thank you for that utility package. It would be awesome to add more banzaicloud object matchers in a follow-up PR.
The future of this class is to be used only when modifying `redpanda.yaml` files, which are now only for node configuration properties. Cluster configuration changes will flow through the new `rpk cluster config ...` commands. For the moment, leave the code that expected `DeveloperMode` in place, setting the property via `Other['developer_mode']` instead.
This set of properties is complete as of the intended state at the 22.1.1 release. Fleshing this out helps k8s operator developers by making it obvious which properties are node properties: if it's not a first-class member of `RedpandaConfig`, then it's not a node property.
Apply suggestions from code review. Co-authored-by: Rafal Korepta <rafal.korepta@gmail.com>
LGTM
Sorry, I just realised that some new files don't have copyright headers.
Can you follow what was done in this PR #4056
/*
* Copyright 2022 Vectorized, Inc.
*
* Licensed as a Redpanda Enterprise file under the Redpanda Community
* License (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* https://github.com/redpanda-data/redpanda/blob/master/licenses/rcl.md
*/
The copyrights will be addressed in the next follow-up PR.
cc @andrewhsu
Cover letter
This adds support for centralized configuration in the operator. Users can change the CR to update the cluster configuration and have it synchronized to the cluster, without triggering a restart when not needed.
The central piece introduced here is an object called GlobalConfiguration, which the operator uses to read/write both node/local and cluster-wide configuration properties. The GlobalConfiguration has different strategies (modes) to deal with old (pre v22) clusters, where all properties go to `redpanda.yaml`, and new clusters, where configuration is split between `redpanda.yaml` and `.bootstrap.yaml`. It also contains a transitioning (mixed, default) strategy that uses both files, to simplify rolling upgrades from older versions.

The management of the configuration aspect is delegated to a pseudo-controller, meaning that it's implemented like a secondary controller, but its code is executed after the main controller. The reason is that we probably need to switch to server-side apply and start patching resources instead of fully rewriting them before we can introduce a real secondary controller; otherwise there will be too many concurrent-modification conflicts.

To manage particular fields of the resources, the main reconcile loop is instructed not to touch some fields when reconciling, while the configuration controller modifies them directly (mimicking what server-side apply does naturally).
There are now two digests computed by the controller: one for node properties and one for central properties requiring a restart. When either changes, a restart of the cluster is automatically triggered. Centralized properties that don't require a restart are simply updated via the admin API.
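The digest idea can be sketched as follows. This is not the operator's actual code; it's a minimal illustration of hashing a property set deterministically (keys sorted, so map iteration order doesn't matter) and triggering a restart when the digest changes:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// configDigest returns a stable hash over a set of configuration properties.
// Keys are sorted first so the result is independent of map iteration order.
// Hypothetical sketch, not the operator's real implementation.
func configDigest(props map[string]string) string {
	keys := make([]string, 0, len(props))
	for k := range props {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	h := sha256.New()
	for _, k := range keys {
		fmt.Fprintf(h, "%s=%s\n", k, props[k])
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	before := configDigest(map[string]string{"a": "1", "b": "2"})
	after := configDigest(map[string]string{"a": "1", "b": "3"})
	// A changed restart-requiring property yields a different digest,
	// which is the signal to roll the cluster.
	fmt.Println(before != after)
}
```

Storing such a digest on the StatefulSet (e.g. as an annotation) is a common way to make Kubernetes roll the pods exactly when the relevant configuration changes.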
Patches sent to the admin API are computed using a three- (or four-) way merge, so that only properties changed in the CR are changed by the operator, leaving other workflows that can change configuration (rpk, Kafka API) unaffected. The last-applied configuration is tracked in the ConfigMap for this reason.
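The merge described above can be sketched like this. The function below is an illustrative simplification (the real code is richer), showing why last-applied state is needed: comparing desired against last-applied tells us what the CR changed, while keys the CR never managed are left untouched on the cluster:

```go
package main

import "fmt"

// mergePatch computes the admin-API patch from the last-applied configuration
// and the desired configuration from the CR: properties that are new or
// changed in the CR are set, properties removed from the CR are cleared,
// and anything the CR never managed is left alone.
// Hypothetical sketch, not the operator's actual merge code.
func mergePatch(lastApplied, desired map[string]string) (set map[string]string, clear []string) {
	set = map[string]string{}
	for k, v := range desired {
		if lastApplied[k] != v {
			set[k] = v // new or changed in the CR
		}
	}
	for k := range lastApplied {
		if _, ok := desired[k]; !ok {
			clear = append(clear, k) // removed from the CR: reset on the cluster
		}
	}
	return set, clear
}

func main() {
	last := map[string]string{"a": "1", "b": "2", "d": "9"}
	want := map[string]string{"a": "1", "b": "3", "c": "4"}
	set, clear := mergePatch(last, want)
	fmt.Println(set, clear)
}
```

Note how a property changed out-of-band (present on the cluster but identical in `lastApplied` and `desired`) produces no patch entry, which matches the behavior discussed in the drift thread above.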
The state of the cluster configuration is fully mapped in a condition.
I've found many issues with serialization and tried to fix them. The main problem is that most of the centralized configuration is now unstructured, and when e.g. retrieving the config from the admin API, Go's JSON deserialization defaults to `float64` for all numeric values in unknown properties. I've switched to `json.Number` where possible to avoid losing information, but there are still issues with YAML serialization, which does not understand the type. I needed to address these issues because neither detection of configuration changes nor computation of patches should be affected by the way we represent numbers. More tests and refactoring are needed for these aspects.

I've included a long `envtest` test for checking the behavior, plus some `kuttl` e2e tests.

Main remaining things:
Release notes