cluster: implement "Feature manager" #2938

jcsp · 2021-11-11T15:03:40Z

Cover letter

This implements a subset of the feature manager design described at https://docs.google.com/document/d/1QvHcyIK-aQLILLVAlOE0S1qA1s68ufkZxw3PmIJtGYg/edit#. It does not enable manual toggling of features or store individual feature states. It only stores and updates the overall cluster logical version, and uses an internal mapping from version to available features.

There are broadly 3 pieces to this:

Creating the feature manager (which is a fancy name for a single integer that gets updated when all nodes have been upgraded to a higher cluster_version)
Using the new cluster logical version to reject node joins from older redpanda versions
Using the new cluster logical version to act as a gate for enabling centralized config.
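As a rough illustration of the first piece, the sketch below assumes the cluster logical version is the lowest version reported by any node and is only ever raised. It also shows one way a version could map to available features. The names `maybe_advance`, `available_features`, `feature_schema` and the `central_config` entry (with its version number) are hypothetical and are not taken from this PR.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

using cluster_version = int64_t; // hypothetical alias
using node_id = int32_t;         // hypothetical alias

// Hypothetical mapping from a feature name to the first version providing it.
struct feature_spec {
    const char* name;
    cluster_version required_version;
};
static const feature_spec feature_schema[] = {
  {"central_config", 1},
};

// Returns the version the cluster may run at, given the logical version each
// node last reported. The result never drops below `current`.
inline cluster_version maybe_advance(
  cluster_version current,
  const std::map<node_id, cluster_version>& node_versions) {
    if (node_versions.empty()) {
        return current; // nothing known yet, stay where we are
    }
    auto lowest = std::min_element(
      node_versions.begin(),
      node_versions.end(),
      [](const auto& a, const auto& b) { return a.second < b.second; });
    return std::max(current, lowest->second);
}

// Lists the features whose required version is satisfied by `v`.
inline std::vector<std::string> available_features(cluster_version v) {
    std::vector<std::string> out;
    for (const auto& spec : feature_schema) {
        if (v >= spec.required_version) {
            out.emplace_back(spec.name);
        }
    }
    return out;
}
```

With one node still on the old version the result stays put. Once every node reports the higher version, the cluster version advances and the gated feature becomes available.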

Fixes: #3704

Features

The /v1/features admin API endpoint is added. Automation scripts can use it to query the internal cluster logical version and the feature flags for newly added functionality.

Improvements

Redpanda upgrades are made more robust by tracking all node versions, such that new features can wait until all nodes are up to date before activating.

jcsp · 2021-12-14T12:00:05Z

@mmaslankaprv would appreciate your thoughts on the structure of the controller bits.

I've ended up with a table/backend/frontend separation which feels a bit heavyweight for what this is actually doing, but maybe it's the cost of doing business.

emaxerrno · 2021-12-14T18:37:35Z

just wanted to say this is awesome.

dotnwat · 2021-12-20T21:13:02Z

src/v/cluster/feature_table.h

+
+    // Bitmask only used at runtime: if we run out of bits for features
+    // just use a bigger one.  Do not serialize this as a bitmask anywhere.
+    uint64_t _active_features_mask{0};


std::bitset is reasonable too if you want to avoid the manual masking effort
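For comparison, a minimal sketch of what the `std::bitset` variant could look like. The `feature` enum, its members and the `active_features` class are hypothetical and only illustrate the suggestion; they are not the code in `feature_table.h`.

```cpp
#include <bitset>
#include <cstddef>

// Hypothetical feature identifiers, used here as bit positions.
enum class feature : std::size_t {
    central_config = 0,
    example_feature = 1,
};

class active_features {
public:
    void set_active(feature f) { _bits.set(static_cast<std::size_t>(f)); }

    bool is_active(feature f) const {
        return _bits.test(static_cast<std::size_t>(f));
    }

private:
    // Runtime-only state, as in the original comment: widen it by changing
    // the template argument, and do not serialize it as a bitmask.
    std::bitset<64> _bits;
};
```

`set` and `test` replace the manual shift-and-mask expressions, and `test` throws `std::out_of_range` for a bit position beyond the bitset size rather than silently misbehaving.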

CLAassistant · 2022-02-01T12:14:54Z

All committers have signed the CLA.

Tests use wait_for_controller_leadership as a utility to wait for all the cluster setup to be done before proceeding. New infrastructure in config_manager and feature_manager leads to some async execution of config writes to the controller log in the background, which confuses some tests that expect all controller writes to be done before they start. Extend wait_for_controller_leadership to wait for these writes to raft0 before proceeding.

Signed-off-by: John Spray <jcs@vectorized.io>

This replaces `join`, provides clean encoding versioning, and carries a cluster_version to enable servers to refuse join requests from incompatible versions. Signed-off-by: John Spray <jcs@vectorized.io>

The only use of the old one is now in handling incoming RPCs from old versions. This means that new-versioned redpanda will only be able to join new-versioned clusters. That would only impact someone who tried to join an old version to a newer cluster, or someone trying to join an old version to a cluster in the middle of a rolling upgrade. Signed-off-by: John Spray <jcs@vectorized.io>

jcsp · 2022-02-22T13:39:57Z

Retrying CI on a failure of nodes_decommissioning_test (#3878)

github-actions bot added the area/redpanda label Nov 11, 2021

jcsp force-pushed the feature-manager branch from c881268 to 0f109f6 Compare November 11, 2021 15:11

ajfabbri reviewed Nov 19, 2021

src/v/cluster/health_monitor_types.h

jcsp force-pushed the feature-manager branch 3 times, most recently from 6bde190 to 5aa1aa1 Compare December 9, 2021 13:27

jcsp changed the title ~~cluster: implement "Feature manager" (initially just tracking the cluster logical version)~~ cluster: implement "Feature manager" Dec 9, 2021

jcsp requested a review from mmaslankaprv December 14, 2021 11:52

jcsp marked this pull request as ready for review December 14, 2021 11:52

jcsp requested review from dotnwat, ivotron, NyaliaLui, VadimPlh and ztlpn as code owners December 14, 2021 11:52

jcsp removed request for dotnwat, ivotron, ztlpn, NyaliaLui and VadimPlh December 14, 2021 11:52

jcsp force-pushed the feature-manager branch from 5aa1aa1 to 9900b37 Compare December 14, 2021 11:59

mmaslankaprv reviewed Dec 14, 2021

tests/rptest/tests/cluster_features_test.py

mmaslankaprv reviewed Dec 14, 2021

src/v/cluster/feature_frontend.cc

mmaslankaprv reviewed Dec 14, 2021

src/v/cluster/feature_frontend.cc

mmaslankaprv reviewed Dec 14, 2021

src/v/cluster/feature_frontend.cc

dotnwat reviewed Dec 20, 2021

jcsp force-pushed the feature-manager branch from 9900b37 to 78ebc59 Compare February 1, 2022 12:14

jcsp and others added 22 commits February 22, 2022 09:39

cluster: add feature_manager types

710e1a1

Signed-off-by: John Spray <jcs@vectorized.io>

cluster: add callback to health_monitor

fa5e4ab

For services (like feature manager) that would like to peek at node health reports as they come in. Signed-off-by: John Spray <jcs@vectorized.io>

cluster: add logical version to health report struct

c4aadbb

Signed-off-by: John Spray <jcs@vectorized.io>

cluster: create feature_table

7118eb4

Signed-off-by: John Spray <jcs@vectorized.io>

cluster: create feature_backend

da99376

cluster: create feature_manager

a69f4ae

Signed-off-by: John Spray <jcs@vectorized.io>

cluster: wire in feature manager/frontend/table

4399881

Signed-off-by: John Spray <jcs@vectorized.io>

cluster: create new join_node RPC

c2e4cf1

This replaces `join`, provides clean encoding versioning, and carries a cluster_version to enable servers to refuse join requests from incompatible versions. Signed-off-by: John Spray <jcs@vectorized.io>

cluster: implement feature_table::await_feature

e15730e

When a subsystem wants to check for a feature during startup, it is convenient to do so via a future, to avoid awkward races between initialization of the feature table via raft0 replay, and initialization of other subsystems.
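The race described above can be illustrated with a small stand-in. The real code uses a future. To stay self-contained, this sketch uses a plain callback in place of the continuation. The class `feature_table_sketch`, the callback-based signature and the string feature names are all assumptions made for illustration.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

class feature_table_sketch {
public:
    // Runs `on_active` once `name` is active: immediately if it already is,
    // otherwise when set_active(name) is called later.
    void await_feature(const std::string& name, std::function<void()> on_active) {
        auto& e = _entries[name];
        if (e.active) {
            on_active();
        } else {
            e.waiters.push_back(std::move(on_active));
        }
    }

    // Called when log replay (or a later update) activates a feature.
    void set_active(const std::string& name) {
        auto& e = _entries[name];
        if (e.active) {
            return; // already active, waiters were notified before
        }
        e.active = true;
        auto waiters = std::move(e.waiters);
        e.waiters.clear();
        for (auto& w : waiters) {
            w();
        }
    }

private:
    struct entry {
        bool active{false};
        std::vector<std::function<void()>> waiters;
    };
    std::map<std::string, entry> _entries;
};
```

A subsystem that registers before the table has been populated is notified later, and one that registers afterwards is notified at once, so startup ordering between the two no longer matters.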

cluster: consume feature_table from config_manager

8198538

...to only enable central config if the feature is active. Signed-off-by: John Spray <jcs@vectorized.io>

cluster: validate version in join_node_request

27cd964

Where the feature table specifies a cluster-wide logical version, do not permit older nodes to join. Where it does not, do not permit nodes older than the current node to join.
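The rule above reduces to a single comparison against a floor version. A minimal sketch, with the hypothetical function name `permit_join` and an empty `std::optional` standing for "the feature table does not yet specify a cluster-wide version":

```cpp
#include <cstdint>
#include <optional>

using cluster_version = int64_t; // hypothetical alias

// `active_version` is empty until a cluster-wide logical version is known.
inline bool permit_join(
  std::optional<cluster_version> active_version,
  cluster_version local_node_version,
  cluster_version joining_node_version) {
    // Use the cluster-wide version when known, else this node's own version.
    const cluster_version floor = active_version.value_or(local_node_version);
    return joining_node_version >= floor;
}
```

For example, with a cluster-wide version of 2, a joining node at version 2 is accepted and one at version 1 is refused. With no cluster-wide version and a local node at version 3, only nodes at version 3 or higher are accepted.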

admin: add /v1/features endpoint

ec337f8

Signed-off-by: John Spray <jcs@vectorized.io>

admin: consume feature for config API enablement

5dd910c

Signed-off-by: John Spray <jcs@vectorized.io>

redpanda: always do central config load on start

8634ea7

This will happen before we are in a position to check features, but that's okay. If the cache of cluster configuration settings doesn't exist, we fall back to redpanda.yml. Signed-off-by: John Spray <jcs@vectorized.io>

kafka: respect config feature flag for alter_configs

e7e8094

cluster: enable env override of logical version

89722ee

This is an integration testing hook. It is more invasive than I would like, but it is pretty simple, and it should hopefully be obvious to anyone encountering it what is going on. This is NOT for use in the field, and is intentionally undocumented.
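A hook of this kind might look roughly like the sketch below, which reads the `__REDPANDA_LOGICAL_VERSION` environment variable used by the tests. The function name `effective_logical_version` and the behaviour on an unparsable value (falling back to the built-in version) are assumptions, not a description of the actual implementation.

```cpp
#include <cstdint>
#include <cstdlib>
#include <exception>
#include <string>

using cluster_version = int64_t; // hypothetical alias

// Returns the override from the environment if one is set and parses as an
// integer, otherwise the version compiled into the binary.
inline cluster_version effective_logical_version(cluster_version compiled_in) {
    const char* raw = std::getenv("__REDPANDA_LOGICAL_VERSION");
    if (raw == nullptr || *raw == '\0') {
        return compiled_in;
    }
    try {
        return std::stoll(raw);
    } catch (const std::exception&) {
        return compiled_in; // assumption: ignore values that do not parse
    }
}
```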

tests/redpanda: enable passing environment variables into redpanda

61a6f73

This is used for driving the __REDPANDA_LOGICAL_VERSION testing hook for the feature manager.

tests: add test for feature manager API endpoint

921a9a3

Signed-off-by: John Spray <jcs@vectorized.io>

cluster: hide logical_version behind encoding version

72e2da2

Old clusters use encoding version 0, new clusters use encoding version 1 and include the logical version.
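To make the versioning scheme concrete, here is a sketch of a message whose extra field only exists at encoding version 1. The struct `report_sketch`, the byte layout and the helper names are invented for illustration and do not reflect redpanda's actual serialization.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <stdexcept>
#include <vector>

struct report_sketch {
    int32_t node_id{0};
    std::optional<int64_t> logical_version; // absent when sent by an old node
};

template<typename T>
void put(std::vector<uint8_t>& buf, T v) {
    const auto* p = reinterpret_cast<const uint8_t*>(&v);
    buf.insert(buf.end(), p, p + sizeof(T));
}

template<typename T>
T take(const std::vector<uint8_t>& buf, std::size_t& pos) {
    if (pos + sizeof(T) > buf.size()) {
        throw std::runtime_error("short buffer");
    }
    T v;
    std::memcpy(&v, buf.data() + pos, sizeof(T));
    pos += sizeof(T);
    return v;
}

// Hypothetical layout: [u8 encoding version][i32 node_id][i64 logical_version, v1 only]
inline std::vector<uint8_t> encode(const report_sketch& r) {
    std::vector<uint8_t> buf;
    put<uint8_t>(buf, r.logical_version ? 1 : 0);
    put<int32_t>(buf, r.node_id);
    if (r.logical_version) {
        put<int64_t>(buf, *r.logical_version);
    }
    return buf;
}

inline report_sketch decode(const std::vector<uint8_t>& buf) {
    std::size_t pos = 0;
    const auto version = take<uint8_t>(buf, pos);
    if (version > 1) {
        throw std::runtime_error("unsupported encoding version");
    }
    report_sketch r;
    r.node_id = take<int32_t>(buf, pos);
    if (version >= 1) {
        r.logical_version = take<int64_t>(buf, pos);
    }
    return r;
}
```

A version 0 payload decodes with `logical_version` left empty, so a reader can tell a node that predates the field apart from one that reports a value.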

cluster: fix exception logging from config_manager

4d3c5df

This was trying to log current_exception() as if we were in a catch{} block, but it's a future handler.
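The underlying C++ behaviour can be shown without any future library: `std::current_exception()` returns a null pointer unless it is called while an exception is being handled, so a handler that runs later has to use the exception object it was handed. The helpers `capture_failure` and `describe` are hypothetical stand-ins for a failed operation and its handler.

```cpp
#include <exception>
#include <stdexcept>
#include <string>

// Stand-in for a failed asynchronous operation: the exception is captured
// at the point of failure and carried along as an exception_ptr.
inline std::exception_ptr capture_failure() {
    try {
        throw std::runtime_error("config write failed");
    } catch (...) {
        return std::current_exception(); // valid here: inside a catch block
    }
}

// Stand-in for a handler that runs later, outside any catch block. It must
// inspect the pointer it was given, not std::current_exception().
inline std::string describe(std::exception_ptr eptr) {
    if (!eptr) {
        return "no exception";
    }
    try {
        std::rethrow_exception(eptr);
    } catch (const std::exception& e) {
        return e.what();
    } catch (...) {
        return "unknown exception";
    }
}
```

Passing the captured pointer to `describe` yields the original message, whereas passing `std::current_exception()` from outside a catch block yields "no exception".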

jcsp force-pushed the feature-manager branch from 85785bb to 4d3c5df Compare February 22, 2022 09:46

dotnwat approved these changes Feb 22, 2022

jcsp merged commit d1491de into redpanda-data:dev Feb 22, 2022

jcsp deleted the feature-manager branch February 22, 2022 18:32

jcsp mentioned this pull request Feb 23, 2022

Feature manager #3704

Closed

jcsp restored the feature-manager branch March 30, 2022 21:36

jcsp deleted the feature-manager branch March 30, 2022 21:37

This pull request was closed.
