
Support node maintenance mode #3932

Merged: 18 commits merged into redpanda-data:dev on Mar 18, 2022

Conversation

dotnwat (Member) commented Mar 3, 2022

Cover letter

Adds three administrative interfaces:

   Start maintenance:  PUT    /v1/brokers/ID/maintenance
   Stop maintenance:   DELETE /v1/brokers/ID/maintenance
   Drain status:       GET    /v1/brokers/ID/maintenance

which allow a node to be placed into a maintenance mode whose primary action is to relinquish all leadership from the node. If a node is taken out of the maintenance state, then leadership is allowed to return. This is a building block for rolling upgrades, in which a target node to upgrade is drained to minimize disruption to client traffic.
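For illustration, a minimal sketch of driving these endpoints from Python with the requests library (the admin address and broker ID here are assumptions, not part of this PR):

    import requests

    ADMIN = "http://localhost:9644"  # assumed admin API address
    BROKER_ID = 1                    # assumed target broker ID
    URL = f"{ADMIN}/v1/brokers/{BROKER_ID}/maintenance"

    # Place the broker into maintenance mode (begins draining leadership).
    requests.put(URL).raise_for_status()

    # Poll the drain status.
    print(requests.get(URL).json())

    # Take the broker back out of maintenance mode; leadership may return.
    requests.delete(URL).raise_for_status()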

Maintenance mode is persistent and stored in the controller. Currently, only one node at a time may be in maintenance mode.

Reviewers may notice some obvious failure scenarios. For example, if leadership cannot be transferred off of a node, then draining can never complete. This is by design: initially, the design assumes that an external process (e.g. a k8s operator) will monitor cluster health and draining progress. It may choose to proceed with the upgrade despite some leadership failing to transfer, or it may take the node out of the draining state to deal with the underlying issue.

This feature and the heuristics for dealing with cluster situations are sure to expand. This initial set of functionality should be sufficient to allow the k8s operator work to proceed.

Fixes: #3706
Fixes: #3705

Release notes

  • Support for placing a node into a draining state in which all leadership is relinquished.

dotnwat (Member, Author) commented Mar 3, 2022

@simon0191 up in the cover letter are the administration control URLs we discussed yesterday in the context of rolling upgrades.

mmaslankaprv (Member):

This looks great. Only two minor nits.

jcsp (Contributor) commented Mar 3, 2022

High level thoughts:

  • Is draining == maintenance mode? The latter is probably a more idiomatic name for the state. Draining is also prone to misunderstanding because it sounds like we might be draining the data from the node, as opposed to just leadership.
  • We might have more modes in future (maybe to distinguish a node which is draining for restart vs. a node draining for decom?), so maybe the API should be a bit more extensible; for example, rather than cancelling drain mode with a DELETE to /v1/drain, it could be a PUT {"status": "active"} to /v1/node/.
  • To be really safe, the state should be persistent -- if it's ephemeral, then the operator has no guarantee that the node won't restart and take up leaderships while it is supposed to be in the draining state.
  • In future, it would be ideal for some central point to know which nodes are in the draining state. That could be useful for situational awareness but also for restricting when a node can be put into the mode -- a central point of control could enforce rules preventing too many nodes being in maintenance at once.

I guess a couple of those points would be covered if we drove the mode via a state in members_table -- that's a bit more invasive of course (needs new structs/encoding & a feature flag) but probably where we want to be longer term?

Review thread: src/v/raft/consensus.cc (resolved)
BenPope (Member) commented Mar 3, 2022

The simplest way to integrate this with K8s is to call the drain API as a pre-stop hook, which will always be called on shutdown (see the sketch after the questions below).

How does this interact with decommissioning?

  • Is draining a part of decommission? Should it be?
  • What happens if the drain API is called on a decommissioned node?
  • Would it make sense to wire this into application::shutdown?
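For illustration, a minimal sketch of the pre-stop hook idea above (the admin address, broker ID, and the "finished" status field are assumptions):

    import time
    import requests

    ADMIN = "http://localhost:9644"  # assumed local admin API address
    BROKER_ID = 1                    # assumed ID of this node

    def pre_stop(timeout_s=300.0):
        """Hypothetical pre-stop hook: drain, then wait for completion or a deadline."""
        url = f"{ADMIN}/v1/brokers/{BROKER_ID}/maintenance"
        requests.put(url).raise_for_status()
        deadline = time.monotonic() + timeout_s
        while time.monotonic() < deadline:
            status = requests.get(url).json()
            # "finished" is a stand-in for whatever field the drain status
            # actually reports on completion.
            if status.get("finished"):
                return
            time.sleep(2)
        # Deadline reached: shut down anyway; per the cover letter, the
        # external process decides whether to proceed.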

Review comment on the following excerpt:

     * synchronization with the draining manager.
     */
    if (_block_new_leadership) {
        p->block_new_leadership();

(Contributor): Should this have a timeout to unblock, say 30 min or 1 hr, in case of a cluster split-brain after draining has started and it can't make progress?

emaxerrno (Contributor):

@dotnwat should draining have a timeout as part of the API, like logging, where -1 is infinite? I.e., the node should drain in 3 hrs or something terrible has happened, kinda thing. Just a thought.

BenPope (Member) left a comment:

Very nice.

Review threads (resolved): src/v/raft/consensus.cc, src/v/cluster/partition_manager.cc, src/v/redpanda/application.cc, src/v/redpanda/drain_manager.cc (×4), src/v/redpanda/admin_server.h
dotnwat (Member, Author) commented Mar 9, 2022

How does this interact with decommissioning?
Is draining a part of decommission? Should it be?

I'm under the impression that decommissioning effectively drains, but instead of draining leadership, it actually moves partitions off the node.

What happens if the drain API is called on a decommissioned node?

Mm, I don't think anything would happen. If it's decommissioned it can be removed.

Would it make sense to wire this into application::shutdown?

Yes! The plan was to do this in a follow up PR after the basic drain mechanism is robust.

@dotnwat dotnwat changed the title Support draining all leadership from a node Support node maintenance mode Mar 10, 2022
dotnwat (Member, Author) commented Mar 10, 2022

Notable changes in latest iteration (cc @jcsp @mmaslankaprv @BenPope)

  1. Rename from drain -> maintenance mode. Internally we still use a drain manager for dealing with draining leadership, but otherwise the modes and external interfaces are expressed in terms of a maintenance mode.

  2. Maintenance mode is now persistent and stored in the members_table / raft0. The current policy is to only allow one node at a time to be in maintenance mode. This should allow us to build some tooling for bare-metal, too, via rpk.

Feedback from @jcsp that wasn't done:

  1. I agree with making the interface more extensible by generalizing the representation of node status. I got stuck a bit in design-space land with this. Like, should we unify several things like decommission state, etc.? I'm not sure, but it seems like something we can clean up at some point.

One annoying thing that is probably OK for now:

  1. Clients are expected to set maintenance mode via the normal interface, which gets redirected to the controller and made persistent. But clients should follow up after this by querying the target node to learn when maintenance mode has been fully entered (see the sketch after this list).
  2. This decision made the protocol much simpler by not having to build a state machine between the controller and each node to track the phases of maintenance mode.
  3. This can probably be deprecated at some point and fixed up transparently in rpk and other tools if we decide to track the life cycle explicitly in the controller.
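For illustration, a sketch of that two-step flow (node addresses, broker ID, and the "finished" status field are assumptions):

    import time
    import requests

    ANY_NODE = "http://node-a:9644"  # request is redirected to the controller
    TARGET = "http://node-b:9644"    # the node entering maintenance mode
    BROKER_ID = 2                    # assumed ID of node-b

    # Step 1: set maintenance mode; the controller makes it persistent.
    requests.put(f"{ANY_NODE}/v1/brokers/{BROKER_ID}/maintenance").raise_for_status()

    # Step 2: poll the target node until it has fully entered maintenance mode.
    while not requests.get(f"{TARGET}/v1/maintenance").json().get("finished"):
        time.sleep(1)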

What should come in follow-up PRs:

  1. Blocking traffic that is interfering with progress on leadership transfer. This should in principle be straightforward to implement, assuming a fairly simple heuristic.

emaxerrno (Contributor):

Maintenance mode is persistent and stored in the controller. Currently, only one node at a time may be in maintenance mode.

Without reading the code, trying to understand the cover letter:

What happens if you PUT/DELETE/PUT/DELETE in rapid succession? Does the controller put the node in an irreversible state until it finishes, or can it be "undone" at any time? If so, I assume the PUT/DELETE/PUT/DELETE ops are quorum writes.

(not sure if we have a test for a series of rapid succession, testing the locking mechanism)

dotnwat (Member, Author) commented Mar 16, 2022

what happens if you PUT/DELETE/PUT/DELETE in rapid succession

Last state will win. We will end up deduplicating these states to avoid flapping, but the service will tolerate it well with fast replay.
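For illustration, a sketch of the kind of rapid-succession test mentioned above, as plain HTTP rather than the project's test fixtures (the admin address and the "draining" field are assumptions):

    import requests

    ADMIN = "http://localhost:9644"  # assumed admin API address
    URL = f"{ADMIN}/v1/brokers/1/maintenance"

    # Toggle maintenance mode rapidly; each op is a quorum write, and the
    # last state should win.
    for _ in range(10):
        requests.put(URL).raise_for_status()
        requests.delete(URL).raise_for_status()

    # The node should settle out of maintenance mode, since DELETE was last.
    assert requests.get(URL).json().get("draining") is not True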

emaxerrno (Contributor):

@coral-waters ^ great discussion to add to the docs.

BenPope previously approved these changes Mar 16, 2022
Review thread: src/v/cluster/members_table.cc (resolved)
BenPope (Member) left a comment:

Aside from my confusion, this looks good.

I think a second review from somebody who knows the code a bit better would help.

coral-waters (Contributor):

I created docs issue redpanda-data/docs#266

jcsp (Contributor) left a comment:

This is very fly, and I can see us using that central maintenance mode status for other things in future.

Minor comments only.

Review threads (resolved): src/v/raft/consensus.h, src/v/cluster/drain_manager.cc, src/v/cluster/members_table.cc (×2), src/v/cluster/types.h, tests/rptest/services/admin.py, src/v/redpanda/admin/api-doc/partition.json, tests/rptest/tests/maintenance_test.py
Commits

The node draining manager is responsible for draining (and undraining)
leadership from a node. A fiber on each core reacts to the requested
states and tracks the status of the process.

Draining proceeds by blocking all new leadership, and then proceeds to
transfer leadership for all partitions away from the current node.
Transfers are done in batches, and statistics are maintained which can
be retrieved via the status API.

Each batch for leadership transfer is randomly selected. This has the
natural effect of backing off transfers for partitions that previously
encountered errors, as well as making progress across the total set of
partitions even if some are stuck. This is useful for policies in which
we want to drain within a certain time bound, but will ultimately force
a node to shut down.
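For illustration, a Python sketch of the random batch selection described above (the real implementation is C++; all names here are made up):

    import random

    def drain(partitions, transfer_leadership, batch_size=8, max_rounds=1000):
        """Drain leadership in randomly selected batches (illustrative only)."""
        remaining = set(partitions)
        for _ in range(max_rounds):
            if not remaining:
                return True  # fully drained
            batch = random.sample(list(remaining), min(batch_size, len(remaining)))
            for p in batch:
                if transfer_leadership(p):  # assumed to return True on success
                    remaining.discard(p)
        # Some partitions are stuck; an external policy (e.g. forced
        # shutdown after a time bound) decides what happens next.
        return False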

Signed-off-by: Noah Watkins <noah@redpanda.com>
Signed-off-by: Noah Watkins <noah@redpanda.com>
Signed-off-by: Noah Watkins <noah@redpanda.com>
The redundant use of `_brokers.insert_or_assign` and its comment made
the method hard to interpret, in terms of both its semantics and the
effect on iterator stability that the method depended on.

Clarified the semantics by directly updating the index entry value, and
updated the comments.

Signed-off-by: Noah Watkins <noah@redpanda.com>
Signed-off-by: Noah Watkins <noah@redpanda.com>
The current values of the two constants are the same so this is safe.

Signed-off-by: Noah Watkins <noah@redpanda.com>
Adds the following endpoints:

 Persistent maintenance mode:

   Start maintenance:  PUT    /v1/brokers/ID/maintenance
   Stop maintenance:   DELETE /v1/brokers/ID/maintenance

 Operates on the local host:

   Start maintenance:  PUT    /v1/maintenance
   Stop maintenance:   DELETE /v1/maintenance
   Maintenance status: GET    /v1/maintenance

Signed-off-by: Noah Watkins <noah@redpanda.com>
Signed-off-by: Noah Watkins <noah@redpanda.com>
Signed-off-by: Noah Watkins <noah@redpanda.com>
Signed-off-by: Noah Watkins <noah@redpanda.com>
Signed-off-by: Noah Watkins <noah@redpanda.com>
This adds a feature and bumps the cluster version, because there is a
new command type added to the controller log as well as a new RPC
endpoint exposed.

Signed-off-by: Noah Watkins <noah@redpanda.com>
Signed-off-by: Noah Watkins <noah@redpanda.com>
Signed-off-by: Noah Watkins <noah@redpanda.com>
dotnwat (Member, Author) commented Mar 17, 2022

The failure was #3924, fixed with the integer overflow fix. Restarting that, which should pick up the change in the CI rebase.

jcsp (Contributor) left a comment:

No further questions from me!

BenPope (Member) left a comment:

LGTM

@dotnwat dotnwat merged commit 2bf8159 into redpanda-data:dev Mar 18, 2022