Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support centralized configuration in operator #3978

Merged
merged 16 commits into from
Mar 24, 2022

Conversation

nicolaferraro
Copy link
Member

@nicolaferraro nicolaferraro commented Mar 10, 2022

Cover letter

This adds support for centralized configuration in the operator. Users can change the CR to update the cluster configuration and have it synchronized to the cluster, without triggering a restart if not needed.

The central piece introduced here is an object called GlobalConfiguration which the operator uses to read/write both node/local and clusterwide configuration properties. The GlobalConfiguration has different strategies (modes) to deal with old (pre v22) clusters, where all properties go to redpanda.yaml, new clusters where configuration is split between redpanda.yaml and .bootstrap.yaml and contains also a transitioning (mixed, default) strategy that uses both files, to simplify rolling upgrades from older versions.

The management of the configuration aspect is delegated to a pseudo-controller, meaning that it's implemented like a secondary controller, but its code is executed after the main controller. The reason is that we probably need to switch to server side apply and start doing patches to resources instead of full rewrites before we can introduce a real secondary controller, otherwise there'll be too many concurrent modification conflicts.
To manage particular fields of the resources, the main reconcile loop is instructed to not touch some fields when reconciling them, while the configuration controller modify them directly (mimics what server side apply does naturally).

There are now two digest computed by the controller, one for node properties, one for central properties requiring restart. When those change, a restart of the cluster is automatically triggered. Centralized properties that don't require a restart are just updated via the admin API.

Patches sent to the admin API are computed using a three (or four) way merge, so that only properties changed in the CR are changed by the operator, leaving other workflows (rpk, kafka api) that can change configuration unaffected. The last-applied-configuration is tracked in the configmap for this reason.

The state of the cluster configuration is fully mapped in a condition.

I've found many issues with serialization and tried to fix them. The main problem is that most of the centralized configuration is now unstructured and when e.g. retrieving the config from the admin API the Go json deserialization defaults to using float64 for all numeric values in unknown properties. I've switched to use json.Number when possible to avoid losing information, but still there are issues with yaml serialization that does not understand the type. I needed to address these issues because both detection of changes in configuration and also computation of patches should not be affected by the way we represent numbers. More tests and refactoring are needed for these aspects.

I've included a long envtest test for checking the behavior, plus some kuttl e2e tests.

Main remaining things:

  • Add more tests for floating point properties
  • Support admin API TLS and basic authentication (maybe in another PR)

Release notes

  • Added support for centralized configuration in the operator

Copy link
Contributor

@0x5d 0x5d left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great change, thanks!
On a related topic, this is a great (very big) PR 😄 The first commit took me very long to get through, so I couldn't review the whole thing in the time I allocated for it. In the future please consider splitting changes in smaller commits, so that it's easier to focus on specific areas. Github doesn't really help either, since it just sorts the files alphabetically, leaving it up to the reader to find the place to start.

Copy link
Contributor

@jcsp jcsp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't get very far through this yet (67 files!) but publishing a couple of comments rather than leaving them hidden in a draft review

src/go/k8s/config/manager/kustomization.yaml Outdated Show resolved Hide resolved
src/go/k8s/pkg/resources/configuration/configuration.go Outdated Show resolved Hide resolved
} else if config != nil {
if config.ClusterConfiguration == nil {
// Upgrade scenario: current cluster is not using centralized configuration.
// We need to extract bootstrap properties from the old format
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bootstrap is only needed on first startup of a cluster: if a cluster is being upgraded from an earlier version, then you can skip it entirely (it's fine for the file to simply not exist).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, the comment was not precise. The state indicates a specific upgrade from a version that was not using centralized configuration e.g. 21.11.1 to 22.1.1, so I think the bootstrap file will be used in this case by the new version.

I preferred anyway to always write the bootstrap file in the configmap, both to track the configuration more easily, but also to detect configuration drifts without having to call the admin API (which may not be available during some reconciliation loops).

I've changed this computation a bit in a subsequent commit to simplify it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keeping it up to date in the config map makes sense because newly added nodes will need to get it when starting redpanda for the first time.

I'm not sure how that helps with detecting drifts though: if the admin API is unavailable then you cannot know whether the config has drifted, and it's correct that while the API is unavailable the reconcilation cannot proceed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should definitely change the term "drift", also in the code. The operator is not checking if the cluster is still configured as expected (as the term "drift" may imply), it's checking if the expected cluster configuration is different from the last-submitted one (stored in the configmap). Only after a "change" has been detected, the admin API is checked and any drift is fixed.

But e.g. if I set on the CR additional config redpanda.segment_appender_flush_timeout_ms=2000 and someone changes that value manually to 3000 in the admin API, the operator will not set it back to 2000 until there's another change in the CR config section that triggers the merge process. We said that the users must be aware that they should not change the same configuration property in multiple places.

I did this verification on the CM instead of the cluster so that we're sure that, even in corner cases, the reconcile loop is not affected by the cluster state, i.e. it continues to work correctly when cluster is starting/restarting.

If detecting real drifts against the cluster is a desired feature, it's still possible to add a periodic check. Wdyt @RafalKorepta ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree that periodic check should be added. Operator cluster custom resource is source of truth and it should maintain the state of the cluster configuration.

It should be added in the next PR.

@nicolaferraro nicolaferraro force-pushed the centralized-config-upd branch 2 times, most recently from e0d9abf to 7192b56 Compare March 16, 2022 09:52
@nicolaferraro nicolaferraro force-pushed the centralized-config-upd branch 3 times, most recently from 893aee7 to e3822ed Compare March 23, 2022 10:02
@nicolaferraro
Copy link
Member Author

I should have addressed all the comments. I also tried to rewrite the history to be easier to understand (for what is possible).

When I enabled the centralized configuration feature also on the dev tag, the CI found some regressions that I had to fix changing some logic a bit.
E.g. the pre-check (to detect configuration changes) that was done before the main reconcile loop changed the resources, has been moved inside the configmap reconciliation, because in some cases the computation of the current configuration relies on secrets being previously stored in the namespace.

It would be great if you have a second look: @0x5d, @RafalKorepta, @jcsp

Copy link
Contributor

@RafalKorepta RafalKorepta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good, left few comments

}

// ClusterConditionType is a valid value for ClusterCondition.Type
// +enum
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't find any reference to +enum comment in https://book.kubebuilder.io/introduction.html.

Can you help me understand where it is used and how?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, I took inspiration from kube structs when creating the condition, but the +enum tag is only known by kube-openapi. I'll replace it with the Kubebuilder equivalent.

src/go/k8s/pkg/resources/configuration/configuration.go Outdated Show resolved Hide resolved
Comment on lines +1 to +2
// Package utils contains useful functions for the operator
package utils
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for that utility package. It would be awesome to add more banzaicloud object matchers as a follow up PR:

src/go/k8s/pkg/resources/configmap.go Show resolved Hide resolved
src/go/k8s/pkg/resources/configmap.go Outdated Show resolved Hide resolved
jcsp and others added 16 commits March 24, 2022 12:43
The future of this class is to be only used when
modifying redpanda.yaml files, which are now only
for node configuration properties.

Cluster configuration changes will flow through
the new `rpk cluster config ...` commands.

For the moment, leave code that expected DeveloperMode
in place, setting the property via Other['developer_mode']
instead.
This set of properties is complete as of intended
state at 22.1.1 release.  Fleshing this out helps
k8s operator developers by making it obvious
which properties are node properties: if it's not
a first class member of RedpandaConfig, then
it's not a node property.
Apply suggestions from code review

Co-authored-by: Rafal Korepta <rafal.korepta@gmail.com>
Copy link
Contributor

@RafalKorepta RafalKorepta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@RafalKorepta RafalKorepta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I just realised that some new files doesn't have copyrights.

Can you follow what was done in this PR #4056

/*
 * Copyright 2022 Vectorized, Inc.
 *
 * Licensed as a Redpanda Enterprise file under the Redpanda Community
 * License (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 * https://github.com/redpanda-data/redpanda/blob/master/licenses/rcl.md
 */

Copy link
Contributor

@RafalKorepta RafalKorepta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The copyrights will be addressed in the next follow up PR.

cc @andrewhsu

This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants