Support centralized configuration in operator #3978

nicolaferraro · 2022-03-10T10:40:27Z

Cover letter

This adds support for centralized configuration in the operator. Users can change the CR to update the cluster configuration and have it synchronized to the cluster; a restart is triggered only when one is needed.

The central piece introduced here is an object called GlobalConfiguration, which the operator uses to read and write both node-local and cluster-wide configuration properties. GlobalConfiguration has different strategies (modes): one for old (pre-v22) clusters, where all properties go to redpanda.yaml; one for new clusters, where configuration is split between redpanda.yaml and .bootstrap.yaml; and a transitional (mixed, default) strategy that uses both files, to simplify rolling upgrades from older versions.
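As a rough illustration of the three modes, here is a minimal sketch. The names `configMode` and `splitProperties` are hypothetical and are not the types used in this PR, and the behaviour of the mixed mode (writing cluster properties to both files) is an assumption based on the description above.

```go
package main

import (
	"fmt"
	"sort"
)

// configMode is a hypothetical stand-in for the GlobalConfiguration strategies.
type configMode int

const (
	modeClassic     configMode = iota // pre-v22: everything goes to redpanda.yaml
	modeCentralized                   // v22+: cluster properties go to .bootstrap.yaml
	modeMixed                         // transitional: cluster properties go to both files (assumption)
)

// splitProperties decides which file each property is written to.
// Flat string maps are used for brevity; the real configuration is nested.
func splitProperties(mode configMode, nodeProps, clusterProps map[string]string) (redpandaYAML, bootstrapYAML map[string]string) {
	redpandaYAML = map[string]string{}
	bootstrapYAML = map[string]string{}
	// Node properties always live in redpanda.yaml.
	for k, v := range nodeProps {
		redpandaYAML[k] = v
	}
	for k, v := range clusterProps {
		if mode == modeClassic || mode == modeMixed {
			redpandaYAML[k] = v
		}
		if mode == modeCentralized || mode == modeMixed {
			bootstrapYAML[k] = v
		}
	}
	return redpandaYAML, bootstrapYAML
}

// sortedKeys returns the keys of m in a stable order for printing.
func sortedKeys(m map[string]string) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

func main() {
	node := map[string]string{"data_directory": "/var/lib/redpanda/data"}
	cluster := map[string]string{"segment_appender_flush_timeout_ms": "2000"}
	for _, mode := range []configMode{modeClassic, modeCentralized, modeMixed} {
		r, b := splitProperties(mode, node, cluster)
		fmt.Println(mode, sortedKeys(r), sortedKeys(b))
	}
}
```

In this sketch the mixed mode lets an old binary keep reading redpanda.yaml while a new binary can already pick up .bootstrap.yaml, which is the kind of overlap a rolling upgrade needs.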

The management of the configuration aspect is delegated to a pseudo-controller: it is implemented like a secondary controller, but its code is executed after the main controller. The reason is that we probably need to switch to server-side apply and start patching resources instead of fully rewriting them before we can introduce a real secondary controller; otherwise there will be too many concurrent modification conflicts.
To manage particular fields of the resources, the main reconcile loop is instructed not to touch some fields when reconciling them, while the configuration controller modifies them directly (this mimics what server-side apply does naturally).
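One way to picture "not touching some fields" is a comparison that skips the fields owned by the configuration controller. The sketch below does this for annotations only; `equalIgnoring` and the annotation key are hypothetical and are not the utility added in this PR.

```go
package main

import "fmt"

// equalIgnoring reports whether two annotation maps are equal once the
// ignored keys are left out of the comparison.
func equalIgnoring(a, b map[string]string, ignored ...string) bool {
	skip := make(map[string]bool, len(ignored))
	for _, k := range ignored {
		skip[k] = true
	}
	// filtered returns a copy of m without the ignored keys.
	filtered := func(m map[string]string) map[string]string {
		out := map[string]string{}
		for k, v := range m {
			if !skip[k] {
				out[k] = v
			}
		}
		return out
	}
	fa, fb := filtered(a), filtered(b)
	if len(fa) != len(fb) {
		return false
	}
	for k, v := range fa {
		if other, ok := fb[k]; !ok || other != v {
			return false
		}
	}
	return true
}

func main() {
	const hashKey = "example.com/config-hash" // hypothetical annotation key
	desired := map[string]string{"app": "redpanda"}
	live := map[string]string{"app": "redpanda", hashKey: "abc123"}

	fmt.Println(equalIgnoring(desired, live, hashKey)) // true: the owned key is skipped
	fmt.Println(equalIgnoring(desired, live))          // false: nothing is skipped
}
```

With such a comparison the main loop sees no difference and does not rewrite the resource, so the value written by the configuration controller survives.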

There are now two digests computed by the controller: one for node properties and one for centralized properties that require a restart. When either of them changes, a restart of the cluster is automatically triggered. Centralized properties that don't require a restart are simply updated via the admin API.
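A minimal sketch of the digest idea follows. The hash function (SHA-256), the flat map representation and the names `digest` and `needsRestart` are assumptions for illustration; the PR does not state them here.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// digest returns a stable hash of a flat property map. Keys are sorted so
// that Go's randomized map iteration order does not change the result.
func digest(props map[string]string) string {
	keys := make([]string, 0, len(props))
	for k := range props {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	h := sha256.New()
	for _, k := range keys {
		// %q quotes keys and values so that separators cannot be ambiguous.
		fmt.Fprintf(h, "%q=%q\n", k, props[k])
	}
	return hex.EncodeToString(h.Sum(nil))
}

// needsRestart compares the previously recorded digests with fresh ones.
// A change in either digest is treated as requiring a restart.
func needsRestart(oldNode, oldCentral, newNode, newCentral string) bool {
	return oldNode != newNode || oldCentral != newCentral
}

func main() {
	node := map[string]string{"data_directory": "/var/lib/redpanda/data"}
	central := map[string]string{"some_restart_property": "1"} // hypothetical property

	oldNode, oldCentral := digest(node), digest(central)
	central["some_restart_property"] = "2"

	fmt.Println(needsRestart(oldNode, oldCentral, digest(node), digest(central)))
}
```

Properties that do not require a restart would simply be left out of the second map, so changing them never alters either digest.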

Patches sent to the admin API are computed using a three (or four) way merge, so that only properties changed in the CR are changed by the operator, leaving other workflows (rpk, kafka api) that can change configuration unaffected. The last-applied-configuration is tracked in the configmap for this reason.
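The merge can be sketched as follows. This is a simplified three-way version over flat maps, with hypothetical names (`patch`, `threeWayMerge`); it does not reproduce the exact algorithm or the four-way variant used in the PR.

```go
package main

import (
	"fmt"
	"sort"
)

// patch lists the changes to send to the admin API: upsert sets values,
// remove resets properties that were dropped from the CR.
type patch struct {
	upsert map[string]string
	remove []string
}

// threeWayMerge compares the last-applied configuration, the desired
// configuration from the CR and the current cluster state.
func threeWayMerge(lastApplied, desired, current map[string]string) patch {
	p := patch{upsert: map[string]string{}}
	// Set every desired property whose value differs in the cluster.
	for k, want := range desired {
		if have, ok := current[k]; !ok || have != want {
			p.upsert[k] = want
		}
	}
	// Remove only properties that the operator itself applied earlier and
	// that are no longer in the CR. Properties set through other workflows
	// never appear in lastApplied, so they are left alone.
	for k := range lastApplied {
		if _, still := desired[k]; still {
			continue
		}
		if _, present := current[k]; present {
			p.remove = append(p.remove, k)
		}
	}
	sort.Strings(p.remove)
	return p
}

func main() {
	lastApplied := map[string]string{"a": "1", "b": "2"}
	desired := map[string]string{"a": "1", "c": "3"}
	current := map[string]string{"a": "1", "b": "2", "set_via_rpk": "9"}

	p := threeWayMerge(lastApplied, desired, current)
	fmt.Println(p.upsert, p.remove)
}
```

Tracking the last-applied configuration is what lets the operator tell "removed from the CR" apart from "never managed by the CR".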

The state of the cluster configuration is fully mapped in a condition.

I've found many issues with serialization and tried to fix them. The main problem is that most of the centralized configuration is now unstructured, and when e.g. retrieving the config from the admin API, Go's JSON deserialization defaults to float64 for all numeric values in unknown properties. ~~I've switched to use json.Number when possible to avoid losing information, but still there are issues with yaml serialization that does not understand the type.~~ I needed to address these issues because neither the detection of configuration changes nor the computation of patches should be affected by the way we represent numbers. More tests and refactoring are needed for these aspects.

I've included a long envtest test for checking the behavior, plus some kuttl e2e tests.

Main remaining things:

Add more tests for floating point properties
Support admin API TLS and basic authentication (maybe in another PR)

Release notes

Added support for centralized configuration in the operator

0x5d

This is a great change, thanks!
On a related topic, this is a great (very big) PR 😄 The first commit took me a long time to get through, so I couldn't review the whole thing in the time I had allocated for it. In the future, please consider splitting changes into smaller commits, so that it's easier to focus on specific areas. GitHub doesn't really help either, since it just sorts the files alphabetically, leaving it up to the reader to find the place to start.

src/go/k8s/controllers/redpanda/cluster_controller.go

src/go/k8s/controllers/redpanda/cluster_controller_configuration.go

src/go/k8s/pkg/resources/configmap.go

src/go/k8s/pkg/resources/configuration/configuration.go

src/go/k8s/pkg/resources/configuration/configuration_modes.go

src/go/k8s/tests/e2e/centralized-configuration/03-redpanda-cluster-change.yaml

jcsp

I didn't get very far through this yet (67 files!), but I'm publishing a couple of comments rather than leaving them hidden in a draft review.

src/go/k8s/config/manager/kustomization.yaml

src/go/k8s/pkg/resources/configuration/configuration.go

jcsp · 2022-03-14T22:18:07Z

src/go/k8s/controllers/redpanda/cluster_controller_configuration.go

+		} else if config != nil {
+			if config.ClusterConfiguration == nil {
+				// Upgrade scenario: current cluster is not using centralized configuration.
+				// We need to extract bootstrap properties from the old format


Bootstrap is only needed on first startup of a cluster: if a cluster is being upgraded from an earlier version, then you can skip it entirely (it's fine for the file to simply not exist).

Right, the comment was not precise. The state indicates a specific upgrade from a version that was not using centralized configuration e.g. 21.11.1 to 22.1.1, so I think the bootstrap file will be used in this case by the new version.

I preferred anyway to always write the bootstrap file in the configmap, both to track the configuration more easily and to detect configuration drifts without having to call the admin API (which may not be available during some reconciliation loops).

I've changed this computation a bit in a subsequent commit to simplify it.

Keeping it up to date in the config map makes sense because newly added nodes will need to get it when starting redpanda for the first time.

I'm not sure how that helps with detecting drifts though: if the admin API is unavailable then you cannot know whether the config has drifted, and it's correct that while the API is unavailable the reconciliation cannot proceed.

I should definitely change the term "drift", also in the code. The operator is not checking if the cluster is still configured as expected (as the term "drift" may imply), it's checking if the expected cluster configuration is different from the last-submitted one (stored in the configmap). Only after a "change" has been detected, the admin API is checked and any drift is fixed.

But e.g. if I set on the CR additional config redpanda.segment_appender_flush_timeout_ms=2000 and someone changes that value manually to 3000 in the admin API, the operator will not set it back to 2000 until there's another change in the CR config section that triggers the merge process. We said that the users must be aware that they should not change the same configuration property in multiple places.

I did this verification on the CM instead of the cluster so that we're sure that, even in corner cases, the reconcile loop is not affected by the cluster state, i.e. it continues to work correctly when the cluster is starting or restarting.
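The flow described in this reply can be sketched as a gate in front of the admin API call. The names `sameProps` and `syncIfChanged` are hypothetical, and the callback stands in for whatever fetches the cluster state and sends the patch.

```go
package main

import "fmt"

// sameProps reports whether two flat property maps hold the same entries.
func sameProps(a, b map[string]string) bool {
	if len(a) != len(b) {
		return false
	}
	for k, v := range a {
		if w, ok := b[k]; !ok || w != v {
			return false
		}
	}
	return true
}

// syncIfChanged calls fetchAndPatch only when the desired configuration from
// the CR differs from the last-applied one stored in the configmap. When the
// two match, the cluster is not contacted at all.
func syncIfChanged(desired, lastApplied map[string]string, fetchAndPatch func() error) (bool, error) {
	if sameProps(desired, lastApplied) {
		return false, nil
	}
	if err := fetchAndPatch(); err != nil {
		return false, err
	}
	return true, nil
}

func main() {
	lastApplied := map[string]string{"segment_appender_flush_timeout_ms": "2000"}
	desired := map[string]string{"segment_appender_flush_timeout_ms": "2000"}

	calls := 0
	fetchAndPatch := func() error { calls++; return nil }

	changed, _ := syncIfChanged(desired, lastApplied, fetchAndPatch)
	fmt.Println(changed, calls)

	desired["segment_appender_flush_timeout_ms"] = "3000"
	changed, _ = syncIfChanged(desired, lastApplied, fetchAndPatch)
	fmt.Println(changed, calls)
}
```

A manual change made through the admin API leaves both inputs of the gate unchanged, so in this sketch it goes unnoticed until the CR changes, which matches the limitation described above.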

If detecting real drifts against the cluster is a desired feature, it's still possible to add a periodic check. Wdyt @RafalKorepta ?

Agreed, a periodic check should be added. The Cluster custom resource is the source of truth, and the operator should maintain the state of the cluster configuration accordingly.

It should be added in the next PR.

src/go/rpk/pkg/cli/cmd/redpanda/mode_test.go

src/go/k8s/pkg/resources/configmap.go

src/go/k8s/pkg/resources/configuration/patch.go

nicolaferraro · 2022-03-23T11:46:17Z

I should have addressed all the comments. I also tried to rewrite the history to make it easier to understand (as far as possible).

When I enabled the centralized configuration feature on the dev tag as well, the CI found some regressions, which I fixed by changing some of the logic.
E.g. the pre-check (to detect configuration changes), which used to run before the main reconcile loop changed the resources, has been moved inside the configmap reconciliation, because in some cases computing the current configuration relies on secrets having been stored in the namespace beforehand.

It would be great if you could have a second look: @0x5d, @RafalKorepta, @jcsp

RafalKorepta

It looks good; I left a few comments.

RafalKorepta · 2022-03-23T13:19:50Z

src/go/k8s/apis/redpanda/v1alpha1/cluster_types.go

+}
+
+// ClusterConditionType is a valid value for ClusterCondition.Type
+// +enum


I couldn't find any reference to the +enum comment in https://book.kubebuilder.io/introduction.html.

Can you help me understand where it is used and how?

Indeed, I took inspiration from kube structs when creating the condition, but the +enum tag is only known by kube-openapi. I'll replace it with the Kubebuilder equivalent.
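For reference, Kubebuilder expresses enums with the `+kubebuilder:validation:Enum` marker. The sketch below shows how it could be attached to the condition type; the constant name is hypothetical, and the single value is taken from the "ClusterConfigured condition" commit title, so the actual list in the PR may differ.

```go
package main

import "fmt"

// ClusterConditionType is a valid value for ClusterCondition.Type
// +kubebuilder:validation:Enum=ClusterConfigured
type ClusterConditionType string

// ClusterConfiguredConditionType is a hypothetical constant name for the
// condition that reports the state of the cluster configuration.
const ClusterConfiguredConditionType ClusterConditionType = "ClusterConfigured"

func main() {
	fmt.Println(ClusterConfiguredConditionType)
}
```

The marker is only read by controller-gen when generating the CRD schema; it has no effect on the Go code at runtime.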

src/go/k8s/pkg/resources/configuration/configuration.go

src/go/k8s/pkg/resources/configuration/configuration_modes.go

src/go/k8s/pkg/resources/configuration/configuration_test.go

RafalKorepta · 2022-03-23T14:47:04Z

src/go/k8s/pkg/utils/package.go

+// Package utils contains useful functions for the operator
+package utils


Thank you for that utility package. It would be awesome to add more banzaicloud object matchers as a follow up PR:

ignoreKubernetesTokenVolumeMounts

deleteKubernetesTokenVolumeMounts

ignoreDefaultToleration

deleteDefaultToleration

ignoreExistingVolumes

deleteExistingVolumes

src/go/k8s/pkg/resources/configmap.go

src/go/k8s/controllers/redpanda/cluster_controller_configuration.go

src/go/k8s/pkg/resources/configmap.go

The future of this class is to be only used when modifying redpanda.yaml files, which are now only for node configuration properties. Cluster configuration changes will flow through the new `rpk cluster config ...` commands. For the moment, leave code that expected DeveloperMode in place, setting the property via Other['developer_mode'] instead.

This set of properties is complete as of intended state at 22.1.1 release. Fleshing this out helps k8s operator developers by making it obvious which properties are node properties: if it's not a first class member of RedpandaConfig, then it's not a node property.

Apply suggestions from code review Co-authored-by: Rafal Korepta <rafal.korepta@gmail.com>

RafalKorepta

LGTM

RafalKorepta

Sorry, I just realised that some new files don't have copyright headers.

Can you follow what was done in PR #4056?

/*
 * Copyright 2022 Vectorized, Inc.
 *
 * Licensed as a Redpanda Enterprise file under the Redpanda Community
 * License (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 * https://github.com/redpanda-data/redpanda/blob/master/licenses/rcl.md
 */

RafalKorepta

The copyrights will be addressed in the next follow up PR.

cc @andrewhsu

github-actions bot added area/k8s area/rpk labels Mar 10, 2022

nicolaferraro force-pushed the centralized-config-upd branch from 9c4a787 to 5bf5626 Compare March 14, 2022 09:43

nicolaferraro marked this pull request as ready for review March 14, 2022 12:20

nicolaferraro requested review from twmb, 0x5d and LenaAn as code owners March 14, 2022 12:20

nicolaferraro force-pushed the centralized-config-upd branch from 2897340 to dafa538 Compare March 14, 2022 14:04

nicolaferraro requested review from dotnwat, NyaliaLui, mmaslankaprv, ztlpn, jcsp, VadimPlh, rystsov, graphcareful and ivotron as code owners March 14, 2022 14:04

github-actions bot added area/build area/redpanda labels Mar 14, 2022

nicolaferraro force-pushed the centralized-config-upd branch from dafa538 to 04b68bf Compare March 14, 2022 14:05

github-actions bot removed area/redpanda area/build labels Mar 14, 2022

0x5d requested changes Mar 14, 2022

jcsp reviewed Mar 14, 2022

src/go/k8s/config/manager/kustomization.yaml (outdated)

src/go/k8s/pkg/resources/configuration/configuration.go (outdated)

jcsp reviewed Mar 14, 2022

nicolaferraro force-pushed the centralized-config-upd branch 2 times, most recently from e0d9abf to 7192b56 Compare March 16, 2022 09:52

jcsp reviewed Mar 16, 2022

src/go/rpk/pkg/cli/cmd/redpanda/mode_test.go (outdated)

jcsp reviewed Mar 16, 2022

src/go/k8s/pkg/resources/configmap.go (outdated)

jcsp reviewed Mar 16, 2022

src/go/k8s/pkg/resources/configuration/patch.go (outdated)

nicolaferraro force-pushed the centralized-config-upd branch 3 times, most recently from 893aee7 to e3822ed Compare March 23, 2022 10:02

RafalKorepta reviewed Mar 23, 2022

nicolaferraro force-pushed the centralized-config-upd branch from b65fc07 to 1786d69 Compare March 24, 2022 11:43

jcsp and others added 16 commits March 24, 2022 12:43

rpk: restore developer mode as node configuration option

1a04465

rpk: fix typo found by linter

fcd3ca8

operator: fix typo in test

01e0000

operator: add support for conditions and ClusterConfigured condition

2d6ebdf

operator: add centralized configuration feature gate

9167c95

operator: add global configuration object

3003bac

Apply suggestions from code review Co-authored-by: Rafal Korepta <rafal.korepta@gmail.com>

operator: add utility to ignore annotations during resource comparison

adc5252

operator: add support for centralized configuration

ddcee62

operator: add support for TLS in centralized configuration

1f83586

operator: inject .bootstrap.yaml file into pods

4c2293e

operator: fix mount path in kuttl tests using centralized configuration

2fe79e1

operator: add kuttl tests for centralized configuration

7f65a5f

operator: add kuttl TLS tests for centralized configuration

3b1b686

operator: add cluster configuration metrics

1b0f684

nicolaferraro force-pushed the centralized-config-upd branch from 1786d69 to 1b0f684 Compare March 24, 2022 11:43

RafalKorepta approved these changes Mar 24, 2022

RafalKorepta requested changes Mar 24, 2022

RafalKorepta approved these changes Mar 24, 2022

0x5d approved these changes Mar 24, 2022

nicolaferraro merged commit 9ebb20d into redpanda-data:dev Mar 24, 2022

jcsp mentioned this pull request Mar 28, 2022

Failure in kuttl/harness/centralized-configuration-tls #4122

Closed

RafalKorepta mentioned this pull request Jul 25, 2022

Watch Cluster CR explicitly instead of watching owner pvsune/redpanda#2

Closed

This pull request was closed.

jcsp Mar 14, 2022

nicolaferraro Mar 15, 2022

jcsp Mar 16, 2022

nicolaferraro Mar 16, 2022

RafalKorepta Mar 17, 2022

nicolaferraro commented Mar 23, 2022

RafalKorepta left a comment

RafalKorepta Mar 23, 2022

nicolaferraro Mar 24, 2022

RafalKorepta Mar 23, 2022
