cluster: implement "Feature manager" #2938

Merged
merged 24 commits into from
Feb 22, 2022

jcsp
Contributor

@jcsp jcsp commented Nov 11, 2021

Cover letter

This is a subset of the feature manager design (https://docs.google.com/document/d/1QvHcyIK-aQLILLVAlOE0S1qA1s68ufkZxw3PmIJtGYg/edit#). It does not enable manual toggling of features or store individual feature states; it only stores and updates the overall cluster logical version and uses an internal mapping of version to available features.

There are broadly 3 pieces to this:

  • Creating the feature manager (which is a fancy name for a single integer that gets updated when all nodes have been upgraded to a higher cluster_version)
  • Using the new cluster logical version to reject node joins from older redpanda versions
  • Using the new cluster logical version to act as a gate for enabling centralized config.
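
As a rough illustration of the first piece, here is a minimal sketch of how a single cluster-wide integer could be derived from per-node versions. It assumes the cluster version is the lowest version reported by any node and that it never moves backwards; the aliases `node_id` and `cluster_version` and the function `compute_cluster_version` are hypothetical names for this example, not identifiers from the PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical aliases for illustration; not the types used in the PR.
using node_id = int32_t;
using cluster_version = int64_t;

// Returns the version the cluster may advance to: the lowest version
// reported by any node, but never lower than the current version.
// Assumption: every cluster member has an entry in `node_versions`.
inline cluster_version compute_cluster_version(
  cluster_version current,
  const std::map<node_id, cluster_version>& node_versions) {
    if (node_versions.empty()) {
        return current;
    }
    auto lowest = std::min_element(
      node_versions.begin(),
      node_versions.end(),
      [](const auto& a, const auto& b) { return a.second < b.second; });
    return std::max(current, lowest->second);
}
```

With nodes reporting versions {2, 1, 2} the result stays at 1; once the lagging node reports 2, the result advances to 2 and stays there even if a later report is lower.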

Fixes: #3704

Features

  • The v1/features admin API endpoint is added, which can be used by automation scripts to query an internal logical cluster version, and feature flags for newly added functionality.

Improvements

  • Redpanda upgrades are made more robust by tracking all node versions, such that new features can wait until all nodes are up to date before activating.

@jcsp jcsp force-pushed the feature-manager branch 3 times, most recently from 6bde190 to 5aa1aa1 Compare December 9, 2021 13:27
@jcsp jcsp changed the title cluster: implement "Feature manager" (initially just tracking the cluster logical version) cluster: implement "Feature manager" Dec 9, 2021
@jcsp jcsp marked this pull request as ready for review December 14, 2021 11:52
@jcsp
Contributor Author

jcsp commented Dec 14, 2021

@mmaslankaprv would appreciate your thoughts on the structure of the controller bits.

I've ended up with a table/backend/frontend separation which feels a bit heavyweight for what this is actually doing, but maybe it's the cost of doing business.

@emaxerrno
Contributor

just wanted to say this is awesome.

src/v/cluster/types.cc (resolved)
src/v/cluster/feature_frontend.cc (outdated, resolved)
src/v/cluster/feature_frontend.cc (outdated, resolved)
src/v/cluster/feature_frontend.cc (outdated, resolved)
src/v/cluster/feature_frontend.cc (outdated, resolved)
src/v/cluster/feature_table.cc (outdated, resolved)

// Bitmask only used at runtime: if we run out of bits for features
// just use a bigger one. Do not serialize this as a bitmask anywhere.
uint64_t _active_features_mask{0};
Member

std::bitset is reasonable too if you want to avoid the manual masking effort
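
For comparison, a minimal sketch of what the `std::bitset` alternative might look like. The `feature` enum values and the `feature_set` wrapper are hypothetical names for this example; only `std::bitset` itself is standard.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>

// Hypothetical feature identifiers, used as bit positions.
enum class feature : std::size_t {
    central_config = 0,
    example_feature = 1,
    count // number of features, must stay last
};

// Runtime-only set of active features. As with the raw mask, the bit
// layout is an in-memory detail and should not be serialized.
class feature_set {
public:
    void activate(feature f) { _bits.set(static_cast<std::size_t>(f)); }

    bool is_active(feature f) const {
        return _bits.test(static_cast<std::size_t>(f));
    }

private:
    std::bitset<static_cast<std::size_t>(feature::count)> _bits;
};
```

The bitset grows with the enum, so running out of bits in a fixed-width integer stops being a concern.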

src/v/cluster/controller.cc (outdated, resolved)
src/v/cluster/health_monitor_backend.cc (outdated, resolved)
src/v/cluster/types.cc (resolved)
@CLAassistant

CLAassistant commented Feb 1, 2022

CLA assistant check
All committers have signed the CLA.

jcsp and others added 22 commits February 22, 2022 09:39
Tests use wait_for_controller_leadership as a
utility to wait for all the cluster setup to
be done before proceeding.

New infrastructure in config_manager and feature_manager
leads to some async execution of config writes to the
controller log in the background, which confuses
some tests that expect all controller writes
to be done before they start.

Extend wait_for_controller_leadership to wait
for these writes to raft0 before proceeding.
Signed-off-by: John Spray <jcs@vectorized.io>
For services (like feature manager) that would like
to peek at node health reports as they come in.

Signed-off-by: John Spray <jcs@vectorized.io>
Signed-off-by: John Spray <jcs@vectorized.io>
Signed-off-by: John Spray <jcs@vectorized.io>
Signed-off-by: John Spray <jcs@vectorized.io>
Signed-off-by: John Spray <jcs@vectorized.io>
This replaces `join`, provides clean
encoding versioning, and carries a cluster_version
to enable servers to refuse join requests from
incompatible versions.

Signed-off-by: John Spray <jcs@vectorized.io>
Only use of old one is now in handling incoming RPCs
from old versions.

This means that new-versioned redpanda will only
be able to join new-versioned clusters.  That
would only impact someone who tried to join
an old version to a newer cluster, or someone
trying to join an old version to a cluster in
the middle of a rolling upgrade.

Signed-off-by: John Spray <jcs@vectorized.io>
When a subsystem wants to check for a feature during startup,
it is convenient to do so via a future, to avoid awkward
races between initialization of the feature table via
raft0 replay, and initialization of other subsystems.
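
A minimal sketch of the idea, using standard-library `std::promise` and `std::shared_future` as a stand-in for whatever future type the PR actually uses. The class `feature_gate` and its methods are hypothetical, and the sketch is single-threaded for simplicity.

```cpp
#include <cassert>
#include <chrono>
#include <future>

// Hypothetical single-feature gate: a promise is fulfilled once when the
// feature becomes active, and any number of waiters share the future.
// Not thread-safe; a real implementation would need synchronization.
class feature_gate {
public:
    feature_gate()
      : _future(_promise.get_future().share()) {}

    // Called once the feature becomes active (e.g. after log replay).
    // Repeated calls are ignored.
    void activate() {
        if (!_active) {
            _active = true;
            _promise.set_value();
        }
    }

    // Waiters keep a copy of the shared future and wait or poll on it.
    std::shared_future<void> await_active() const { return _future; }

private:
    std::promise<void> _promise;
    std::shared_future<void> _future;
    bool _active{false};
};

// Helper: true if the future has already been resolved.
inline bool is_ready(const std::shared_future<void>& f) {
    return f.wait_for(std::chrono::seconds(0)) == std::future_status::ready;
}
```

A subsystem that starts before the feature table is populated can take the future early and only proceed once it resolves, instead of polling a flag that may not have been initialized yet.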
...to only enable central config if the feature
is active.

Signed-off-by: John Spray <jcs@vectorized.io>
Where the feature table specifies a cluster-wide
logical version, do not permit older nodes to
join.  Where it does not, do not permit nodes
older than the current node to join.
Signed-off-by: John Spray <jcs@vectorized.io>
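
The join rule described in this commit message can be sketched as a single comparison. The alias `cluster_version` and the function `may_join` are hypothetical names for this example; an empty optional stands for a feature table that does not yet specify a cluster-wide version.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

using cluster_version = int64_t; // hypothetical alias

// Decide whether a joining node may be admitted.
// - If the feature table has a cluster-wide version, the joiner must be
//   at least that version.
// - Otherwise, the joiner must be at least as new as the local node.
inline bool may_join(
  std::optional<cluster_version> cluster_wide,
  cluster_version local_node,
  cluster_version joiner) {
    const cluster_version floor = cluster_wide.value_or(local_node);
    return joiner >= floor;
}
```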
Signed-off-by: John Spray <jcs@vectorized.io>
This will happen before we are in a position
to check features, but that's okay.  If the
cache of cluster configuration settings doesn't
exist, we fall back to redpanda.yaml.

Signed-off-by: John Spray <jcs@vectorized.io>
This is an integration testing hook.  It is more
invasive than I would like, but pretty simple
and hopefully obvious to anyone encountering
this what is going on.

This is NOT for use in the field, and is intentionally
undocumented.
This is used for driving the __REDPANDA_LOGICAL_VERSION testing
hook for the feature manager.
Signed-off-by: John Spray <jcs@vectorized.io>
Old clusters use encoding version 0, new clusters
use encoding version 1 and include the logical version.
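
To illustrate the versioned encoding, here is a minimal sketch with an invented wire layout. The `message` struct, the byte layout, and the `encode`/`decode` functions are assumptions for this example and do not reflect the actual serialization in the PR.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Invented wire layout for illustration only:
//   byte 0      encoding version (0 or 1)
//   byte 1      payload (a single byte standing in for the real fields)
//   bytes 2..9  logical version, little-endian, present only in v1
struct message {
    uint8_t payload{0};
    std::optional<int64_t> logical_version; // empty when decoded from v0
};

inline std::vector<uint8_t> encode(const message& m) {
    std::vector<uint8_t> out;
    out.push_back(m.logical_version ? 1 : 0);
    out.push_back(m.payload);
    if (m.logical_version) {
        const auto v = static_cast<uint64_t>(*m.logical_version);
        for (int i = 0; i < 8; ++i) {
            out.push_back(static_cast<uint8_t>((v >> (8 * i)) & 0xff));
        }
    }
    return out;
}

// Returns nullopt for truncated input or an unknown encoding version.
inline std::optional<message> decode(const std::vector<uint8_t>& in) {
    if (in.size() < 2 || in[0] > 1) {
        return std::nullopt;
    }
    message m;
    m.payload = in[1];
    if (in[0] == 1) {
        if (in.size() < 10) {
            return std::nullopt;
        }
        uint64_t v = 0;
        for (int i = 0; i < 8; ++i) {
            v |= static_cast<uint64_t>(in[2 + i]) << (8 * i);
        }
        m.logical_version = static_cast<int64_t>(v);
    }
    return m;
}
```

A reader that understands version 1 can still decode version 0 input; it simply sees no logical version and can treat the sender as an old node.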
This was trying to log current_exception() as if we were
in a catch{} block, but it's a future handler.
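
The pitfall can be shown with the standard library alone: `std::current_exception()` only returns something useful inside a catch block, so a handler that receives a failure as a `std::exception_ptr` has to inspect that pointer instead. The helper names `has_current_exception` and `describe` are hypothetical.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>
#include <string>

// Outside a catch block there is no "current" exception, so
// std::current_exception() returns a null std::exception_ptr.
inline bool has_current_exception() {
    return std::current_exception() != nullptr;
}

// Describe an error that was handed over as a std::exception_ptr,
// rather than consulting std::current_exception().
inline std::string describe(std::exception_ptr e) {
    if (!e) {
        return "no exception";
    }
    try {
        std::rethrow_exception(e);
    } catch (const std::exception& ex) {
        return ex.what();
    } catch (...) {
        return "unknown exception";
    }
}
```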
@jcsp
Contributor Author

jcsp commented Feb 22, 2022

Retrying CI on a failure of nodes_decommissioning_test (#3878)

@jcsp jcsp merged commit d1491de into redpanda-data:dev Feb 22, 2022
@jcsp jcsp deleted the feature-manager branch February 22, 2022 18:32
@jcsp jcsp mentioned this pull request Feb 23, 2022
@jcsp jcsp restored the feature-manager branch March 30, 2022 21:36
@jcsp jcsp deleted the feature-manager branch March 30, 2022 21:37
This pull request was closed.
Development

Successfully merging this pull request may close these issues.

Feature manager
7 participants