
Support node maintenance mode #3932

Merged: 18 commits merged into redpanda-data:dev on Mar 18, 2022

Conversation

dotnwat (Member) commented Mar 3, 2022

Cover letter

Adds three administrative interfaces:

   Start maintenance:  PUT    /v1/brokers/ID/maintenance
   Stop maintenance:   DELETE /v1/brokers/ID/maintenance
   Drain status:       GET    /v1/brokers/ID/maintenance

which allow a node to be placed into a maintenance mode whose primary action is to relinquish all leadership from the node. If a node is taken out of the maintenance state, then leadership is allowed to return. This is a building block for rolling upgrades, in which a target node to upgrade is drained to minimize disruption to client traffic.
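For illustration, a minimal sketch of driving these endpoints from Python with the requests library (the admin address and broker ID here are assumptions, not part of this PR):

    import requests

    ADMIN = "http://localhost:9644"  # assumed admin API address
    BROKER_ID = 1                    # assumed target broker ID
    URL = f"{ADMIN}/v1/brokers/{BROKER_ID}/maintenance"

    # Place the broker into maintenance mode (begins draining leadership).
    requests.put(URL).raise_for_status()

    # Poll the drain status.
    print(requests.get(URL).json())

    # Take the broker back out of maintenance mode; leadership may return.
    requests.delete(URL).raise_for_status()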

Maintenance mode is persistent and stored in the controller. Currently, only one node at a time may be in maintenance mode.

Reviewers may notice some obvious failure scenarios. For example, if leadership cannot be transferred off of a node, then draining can never complete. This is by design: initially, the design assumes that an external process (e.g. a k8s operator) will monitor cluster health and draining progress. It may choose to proceed with the upgrade despite some leadership failing to transfer, or it may take the node out of the draining state to deal with the underlying issue.

This feature and the heuristics for dealing with cluster situations are sure to expand. This initial set of functionality should be sufficient to allow the k8s operator work to proceed.

Fixes: #3706
Fixes: #3705

Release notes

  • Support for placing a node into a draining state in which all leadership is relinquished.

dotnwat (Member, Author) commented Mar 3, 2022

@simon0191 up in the cover letter are the administration control URLs we discussed yesterday in the context of rolling upgrades.

mmaslankaprv (Member):

This looks great. Only two minor nits.

jcsp (Contributor) commented Mar 3, 2022

High level thoughts:

  • Is draining == maintenance mode? The latter is probably a more idiomatic name for the state. Draining is also prone to misunderstanding because it sounds like we might be draining the data from the node, as opposed to just leadership.
  • We might have more modes in future (maybe to distinguish a node which is draining for restart vs. a node draining for decom?), so maybe the API should be a bit more extensible; for example, rather than cancelling drain mode with a DELETE to /v1/drain, it could be a PUT {"status": "active"} to /v1/node/.
  • To be really safe, the state should be persistent -- if it's ephemeral, then the operator has no guarantee that the node won't restart and take up leaderships while it is supposed to be in the draining state.
  • In future, it would be ideal for some central point to know which nodes are in the draining state. That could be useful for situational awareness but also for restricting when a node can be put into the mode -- a central point of control could enforce rules preventing too many nodes being in maintenance at once.

I guess a couple of those points would be covered if we drove the mode via a state in members_table -- that's a bit more invasive of course (needs new structs/encoding & a feature flag) but probably where we want to be longer term?

Review thread: src/v/raft/consensus.cc (resolved)
BenPope (Member) commented Mar 3, 2022

The simplest way to integrate this with K8s is to call the drain API as a pre-stop hook, which will always be called on shutdown (see the sketch after the questions below).

How does this interact with decommissioning?

  • Is draining a part of decommission? Should it be?
  • What happens if the drain API is called on a decommissioned node?
  • Would it make sense to wire this into application::shutdown?
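For illustration, a minimal sketch of the pre-stop hook idea above (the admin address, broker ID, and the "finished" status field are assumptions):

    import time
    import requests

    ADMIN = "http://localhost:9644"  # assumed local admin API address
    BROKER_ID = 1                    # assumed ID of this node

    def pre_stop(timeout_s=300.0):
        """Hypothetical pre-stop hook: drain, then wait for completion or a deadline."""
        url = f"{ADMIN}/v1/brokers/{BROKER_ID}/maintenance"
        requests.put(url).raise_for_status()
        deadline = time.monotonic() + timeout_s
        while time.monotonic() < deadline:
            status = requests.get(url).json()
            # "finished" is a stand-in for whatever field the drain status
            # actually reports on completion.
            if status.get("finished"):
                return
            time.sleep(2)
        # Deadline reached: shut down anyway; per the cover letter, the
        # external process decides whether to proceed.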

Review comment on the following excerpt:

     * synchronization with the draining manager.
     */
    if (_block_new_leadership) {
        p->block_new_leadership();

(Contributor): Should this have a timeout to unblock, say 30 min or 1 hr, in case of a cluster split-brain after draining has started and it can't make progress?

emaxerrno (Contributor):

@dotnwat should draining have a timeout as part of the API, like logging, where -1 is infinite? I.e., the node should drain in 3 hrs or something terrible has happened, kinda thing. Just a thought.

BenPope (Member) left a comment:

Very nice.

Review threads (resolved): src/v/raft/consensus.cc, src/v/cluster/partition_manager.cc, src/v/redpanda/application.cc, src/v/redpanda/drain_manager.cc (×4), src/v/redpanda/admin_server.h
dotnwat (Member, Author) commented Mar 9, 2022

How does this interact with decommissioning?
Is draining a part of decommission? Should it be?

I'm under the impression that decommissioning effectively drains, but instead of draining leadership, it actually moves partitions off the node.

What happens if the drain API is called on a decommissioned node?

Mm, I don't think anything would happen. If it's decommissioned it can be removed.

Would it make sense to wire this into application::shutdown?

Yes! The plan was to do this in a follow up PR after the basic drain mechanism is robust.

@dotnwat dotnwat changed the title Support draining all leadership from a node Support node maintenance mode Mar 10, 2022
dotnwat (Member, Author) commented Mar 10, 2022

Notable changes in latest iteration (cc @jcsp @mmaslankaprv @BenPope)

  1. Rename from drain -> maintenance mode. Internally we still use a drain manager for dealing with draining leadership, but otherwise the modes and external interfaces are expressed in terms of a maintenance mode.

  2. Maintenance mode is now persistent and stored in the members_table / raft0. The current policy is to only allow one node at a time to be in maintenance mode. This should allow us to build some tooling for bare-metal, too, via rpk.

Feedback from @jcsp that wasn't done:

  1. I agree with making the interface more extensible by generalizing the representation of node status. I got stuck a bit in design-space land with this. Like, should we unify several things like decommission state, etc.? I'm not sure, but it seems like something we can clean up at some point.

One annoying thing that is probably OK for now:

  1. Clients are expected to set maintenance mode via the normal interface, which gets redirected to the controller and made persistent. But clients should follow up after this by querying the target node to learn when maintenance mode has been fully entered (see the sketch after this list).
  2. This decision made the protocol much simpler by not having to build a state machine between the controller and each node to track the phases of maintenance mode.
  3. This can probably be deprecated at some point and fixed up transparently in rpk and other tools if we decide to track the life cycle explicitly in the controller.
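For illustration, a sketch of that two-step flow (node addresses, broker ID, and the "finished" status field are assumptions):

    import time
    import requests

    ANY_NODE = "http://node-a:9644"  # request is redirected to the controller
    TARGET = "http://node-b:9644"    # the node entering maintenance mode
    BROKER_ID = 2                    # assumed ID of node-b

    # Step 1: set maintenance mode; the controller makes it persistent.
    requests.put(f"{ANY_NODE}/v1/brokers/{BROKER_ID}/maintenance").raise_for_status()

    # Step 2: poll the target node until it has fully entered maintenance mode.
    while not requests.get(f"{TARGET}/v1/maintenance").json().get("finished"):
        time.sleep(1)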

What should come in follow-up PRs:

  1. Blocking traffic that is interfering with progress on leadership transfer. This should in principle be straightforward to implement, assuming a fairly simple heuristic.

emaxerrno (Contributor):

Maintenance mode is persistent and stored in the controller. Currently, only one node at a time may be in maintenance mode.

Without reading the code, trying to understand the cover letter:

What happens if you PUT/DELETE/PUT/DELETE in rapid succession? Does the controller put the node in an irreversible state until it finishes, or can it be "undone" at any time? If so, I assume the PUT/DELETE/PUT/DELETE ops are quorum writes.

(not sure if we have a test for a series of rapid succession, testing the locking mechanism)

dotnwat (Member, Author) commented Mar 16, 2022

what happens if you PUT/DELETE/PUT/DELETE in rapid succession

Last state will win. We will end up deduplicating these states to avoid flapping, but the service will tolerate it well with fast replay.
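For illustration, a sketch of the kind of rapid-succession test mentioned above, as plain HTTP rather than the project's test fixtures (the admin address and the "draining" field are assumptions):

    import requests

    ADMIN = "http://localhost:9644"  # assumed admin API address
    URL = f"{ADMIN}/v1/brokers/1/maintenance"

    # Toggle maintenance mode rapidly; each op is a quorum write, and the
    # last state should win.
    for _ in range(10):
        requests.put(URL).raise_for_status()
        requests.delete(URL).raise_for_status()

    # The node should settle out of maintenance mode, since DELETE was last.
    assert requests.get(URL).json().get("draining") is not True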

emaxerrno (Contributor):

@coral-waters ^ great discussion to add to the docs.

BenPope previously approved these changes Mar 16, 2022
Review thread: src/v/cluster/members_table.cc (resolved)
BenPope (Member) left a comment:

Aside from my confusion, this looks good.

I think a second review from somebody who knows the code a bit better would help.

coral-waters (Contributor):

I created docs issue redpanda-data/docs#266

jcsp (Contributor) left a comment:

This is very fly, and I can see us using that central maintenance mode status for other things in future.

Minor comments only.

Review threads (resolved): src/v/raft/consensus.h, src/v/cluster/drain_manager.cc, src/v/cluster/members_table.cc (×2), src/v/cluster/types.h, tests/rptest/services/admin.py, src/v/redpanda/admin/api-doc/partition.json, tests/rptest/tests/maintenance_test.py
Commits

The node draining manager is responsible for draining (and undraining)
leadership from a node. A fiber on each core reacts to the requested
states and tracks the status of the process.

Draining proceeds by blocking all new leadership, and then proceeds to
transfer leadership for all partitions away from the current node.
Transfers are done in batches, and statistics are maintained which can
be retrieved via the status API.

Each batch for leadership transfer is randomly selected. This has the
natural effect of backing off transfers for partitions that previously
encountered errors, as well as making progress across the total set of
partitions even if some are stuck. This is useful for policies in which
we want to drain within a certain time bound, but will ultimately force
a node to shut down.
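For illustration, a Python sketch of the random batch selection described above (the real implementation is C++; all names here are made up):

    import random

    def drain(partitions, transfer_leadership, batch_size=8, max_rounds=1000):
        """Drain leadership in randomly selected batches (illustrative only)."""
        remaining = set(partitions)
        for _ in range(max_rounds):
            if not remaining:
                return True  # fully drained
            batch = random.sample(list(remaining), min(batch_size, len(remaining)))
            for p in batch:
                if transfer_leadership(p):  # assumed to return True on success
                    remaining.discard(p)
        # Some partitions are stuck; an external policy (e.g. forced
        # shutdown after a time bound) decides what happens next.
        return False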

Signed-off-by: Noah Watkins <noah@redpanda.com>
Signed-off-by: Noah Watkins <noah@redpanda.com>
Signed-off-by: Noah Watkins <noah@redpanda.com>
The redundant use of `_brokers.insert_or_assign` and its comment made
the method hard to interpret, in terms of both its semantics and the
effect on iterator stability that the method depended on.

Clarified the semantics by directly updating the index entry value, and
updated the comments.

Signed-off-by: Noah Watkins <noah@redpanda.com>
Signed-off-by: Noah Watkins <noah@redpanda.com>
The current values of the two constants are the same so this is safe.

Signed-off-by: Noah Watkins <noah@redpanda.com>
Adds the following endpoints:

 Persistent maintenance mode:

   Start maintenance:  PUT    /v1/brokers/ID/maintenance
   Stop maintenance:   DELETE /v1/brokers/ID/maintenance

 Operates on the local host:

   Start maintenance:  PUT    /v1/maintenance
   Stop maintenance:   DELETE /v1/maintenance
   Maintenance status: GET    /v1/maintenance

Signed-off-by: Noah Watkins <noah@redpanda.com>
Signed-off-by: Noah Watkins <noah@redpanda.com>
Signed-off-by: Noah Watkins <noah@redpanda.com>
Signed-off-by: Noah Watkins <noah@redpanda.com>
Signed-off-by: Noah Watkins <noah@redpanda.com>
This adds a feature and bumps the cluster version, because there is a
new command type added to the controller log as well as a new RPC
endpoint exposed.

Signed-off-by: Noah Watkins <noah@redpanda.com>
Signed-off-by: Noah Watkins <noah@redpanda.com>
Signed-off-by: Noah Watkins <noah@redpanda.com>
dotnwat (Member, Author) commented Mar 17, 2022

The failure was #3924, fixed with the integer overflow fix. Restarting that, which should pick up the change in the CI rebase.

jcsp (Contributor) left a comment:

No further questions from me!

BenPope (Member) left a comment:

LGTM

@dotnwat dotnwat merged commit 2bf8159 into redpanda-data:dev Mar 18, 2022