Support node maintenance mode #3932
Conversation
@simon0191 up in the cover letter are the administration control URLs we discussed yesterday in the context of rolling upgrades.
This looks great. Only two minor nits.
High level thoughts:
I guess a couple of those points would be covered if we drove the mode via a state in members_table -- that's a bit more invasive of course (needs new structs/encoding & a feature flag) but probably where we want to be longer term?
The simplest way to integrate this with K8s is to call the drain API as a pre-stop hook, which will always be called on shutdown. How does this interact with decommissioning?
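To make that integration concrete, a hedged sketch: assuming the local drain endpoint is `PUT /v1/maintenance` on the admin port (the port and path here are assumptions taken from the endpoint list later in this PR, not confirmed for any particular deployment), a Kubernetes pre-stop hook might look roughly like:

```yaml
# Hypothetical pod spec fragment: drain leadership before shutdown.
# Port, path, and the use of curl are illustrative assumptions.
lifecycle:
  preStop:
    exec:
      command: ["curl", "-X", "PUT", "http://localhost:9644/v1/maintenance"]
```

How this should interact with decommissioning is exactly the open question raised above.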
 * synchronization with the draining manager.
 */
if (_block_new_leadership) {
    p->block_new_leadership();
Should this have a timeout to unblock, say 30 mins or 1 hr or something, in case of a cluster split-brain after the draining started and it can't make progress?
@dotnwat should draining have a timeout like logging where
Very nice.
I'm under the impression that decommissioning effectively drains, but instead of draining leadership it actually moves partitions off the node.
Mm, I don't think anything would happen. If it's decommissioned it can be removed.
Yes! The plan was to do this in a follow up PR after the basic drain mechanism is robust.
Notable changes in latest iteration (cc @jcsp @mmaslankaprv @BenPope)
Feedback from @jcsp that wasn't done:
One annoying thing, that is probably ok for now:
What should come in follow up PRs
w/out reading the code, trying to understand the cover letter: what happens if you PUT/DELETE/PUT/DELETE in rapid succession? Does the controller put the node in an irreversible state until it finishes, or can it be "undone" at any time? If so, I assume it is a quorum write to do the PUT/DELETE/PUT/DELETE ops. (Not sure if we have a test for a series in rapid succession, testing the locking mechanism.)
Last state will win. We will end up deduplicating these states to avoid flapping, but the service will tolerate it well with fast replay.
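To make the last-state-wins semantics concrete, here is a minimal Python sketch (not Redpanda's actual implementation; names are illustrative) of collapsing a flapping sequence of maintenance commands to the effective final state per node:

```python
def dedup_maintenance_ops(ops):
    """ops: ordered (node_id, command) pairs, command is 'enable' or 'disable'.

    Because each later command fully supersedes earlier ones for the same
    node, replay only needs the final entry per node -- intermediate
    PUT/DELETE flips have no lasting effect.
    """
    final = {}
    for node_id, cmd in ops:
        final[node_id] = cmd  # last state wins
    return final


# Rapid PUT/DELETE/PUT/DELETE on node 1 collapses to a single 'disable'.
flapping = [(1, "enable"), (1, "disable"), (1, "enable"), (1, "disable")]
print(dedup_maintenance_ops(flapping))  # {1: 'disable'}
```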
@coral-waters ^ great discussion to add to the docs.
Aside from my confusion, this looks good.
I think a second review from somebody who knows the code a bit better would help.
I created docs issue redpanda-data/docs#266
This is very fly, and I can see us using that central maintenance mode status for other things in future.
Minor comments only.
The node draining manager is responsible for draining (and undraining) leadership from a node. A fiber on each core reacts to these requested states and tracks the status of the process. Draining proceeds by blocking all new leadership, and then transfers leadership for all partitions away from the current node. Transfers are done in batches, and statistics are maintained which can be retrieved via the status API. Each batch for leadership transfer is randomly selected. This has the natural effect of backing off transfers for partitions that previously encountered errors, while still making progress across the total set of partitions even if some are stuck. This is useful for policies in which we want to drain within a certain time bound but will ultimately force a node to shut down.

Signed-off-by: Noah Watkins <noah@redpanda.com>
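The batched, randomized transfer strategy described above can be sketched in a few lines of Python (function names, batch size, and the round limit are assumptions for illustration, not the real C++ implementation):

```python
import random


def drain_leadership(partitions, try_transfer, batch_size=3, max_rounds=100):
    """partitions: partition ids this node currently leads.
    try_transfer(p) -> bool attempts one leadership transfer.

    Each round picks a random batch from the remaining partitions, so a
    stuck partition is naturally retried later instead of blocking the
    whole drain, and progress continues across the rest of the set.
    """
    stats = {"transferred": 0, "failed": 0}
    remaining = set(partitions)
    rounds = 0
    while remaining and rounds < max_rounds:
        rounds += 1
        batch = random.sample(sorted(remaining), min(batch_size, len(remaining)))
        for p in batch:
            if try_transfer(p):
                remaining.discard(p)
                stats["transferred"] += 1
            else:
                stats["failed"] += 1  # stays in the pool for a later batch
    return stats, remaining
```

For example, if one partition can never transfer, every other partition still drains; the stuck one simply remains in the status report for an external policy (or operator) to act on.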
Signed-off-by: Noah Watkins <noah@redpanda.com>
Signed-off-by: Noah Watkins <noah@redpanda.com>
The redundant use of `_brokers.insert_or_assign` and its comment made the method hard to interpret, both in terms of semantics and in terms of the effect on iterator stability that the method depended on. Clarified the semantics by directly updating the index entry value, and updated the comments.

Signed-off-by: Noah Watkins <noah@redpanda.com>
Signed-off-by: Noah Watkins <noah@redpanda.com>
The current values of the two constants are the same so this is safe. Signed-off-by: Noah Watkins <noah@redpanda.com>
Adds the following endpoints:

Persistent maintenance mode:
- Start maintenance: PUT /v1/brokers/ID/maintenance
- Stop maintenance: DELETE /v1/brokers/ID/maintenance

Operates on local host:
- Start maintenance: PUT /v1/maintenance
- Stop maintenance: DELETE /v1/maintenance
- Maintenance status: GET /v1/maintenance

Signed-off-by: Noah Watkins <noah@redpanda.com>
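As a usage sketch of the endpoints above, a small hypothetical helper (not part of the PR; the base URL and port are placeholder assumptions) maps maintenance actions to the HTTP method/URL pairs:

```python
def maintenance_request(action, broker_id=None, base="http://localhost:9644"):
    """Map a maintenance action to a (method, url) pair for the admin API.

    broker_id=None targets the local-host endpoint variants; otherwise
    the persistent per-broker variants are used. Note that status is
    only listed for the local-host endpoint in this PR.
    """
    if broker_id is None:
        methods = {"start": "PUT", "stop": "DELETE", "status": "GET"}
        path = "/v1/maintenance"
    else:
        methods = {"start": "PUT", "stop": "DELETE"}
        path = f"/v1/brokers/{broker_id}/maintenance"
    return methods[action], base + path
```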
Signed-off-by: Noah Watkins <noah@redpanda.com>
Signed-off-by: Noah Watkins <noah@redpanda.com>
Signed-off-by: Noah Watkins <noah@redpanda.com>
Signed-off-by: Noah Watkins <noah@redpanda.com>
This adds a feature and bumps the cluster version because there is a new command type that is added into the controller log as well as rpc endpoint exposed. Signed-off-by: Noah Watkins <noah@redpanda.com>
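The reason for gating, sketched in Python (function and version names are assumptions, not the real feature-flag machinery): a new controller command type must not be emitted until every active node runs a version that can decode it.

```python
def can_enable_maintenance_mode(active_node_versions, required_version):
    """Only emit the new controller-log command once the whole cluster
    has upgraded to a version that understands it; otherwise an older
    node would fail to replay the log."""
    return all(v >= required_version for v in active_node_versions)
```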
Signed-off-by: Noah Watkins <noah@redpanda.com>
Signed-off-by: Noah Watkins <noah@redpanda.com>
Failure was #3924, fixed with the integer overflow fix. Restarting that run, which should pick up the change in the CI rebase.
No further questions from me!
LGTM
Cover letter
Adds three administrative interfaces:
which allow a node to be placed into a maintenance mode in which the primary action is to relinquish all leadership from the node. If a node is taken out of the maintenance state, leadership is allowed to return. This is a building block for rolling upgrades, in which a target node is drained before upgrade to minimize disruption to client traffic.
Maintenance mode is persistent and stored in the controller. Currently, only one node at a time may be in maintenance mode.
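A minimal sketch of the one-node-at-a-time rule (assumed semantics for illustration; the real check lives in the controller, not in this class):

```python
class MaintenanceState:
    """Tracks which node, if any, is currently in maintenance mode."""

    def __init__(self):
        self._node = None

    def enable(self, node_id):
        if self._node is not None and self._node != node_id:
            raise RuntimeError(f"node {self._node} already in maintenance mode")
        self._node = node_id  # re-enabling the same node is idempotent

    def disable(self, node_id):
        if self._node == node_id:
            self._node = None  # disabling any other node is a no-op
```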
Reviewers may notice some obvious failure scenarios. For example, if leadership cannot be transferred off of a node, then draining can never complete. This is by design: initially, the design assumes that an external process (e.g. a k8s operator) will monitor cluster health and draining progress. It may choose to proceed with an upgrade despite some leadership failing to transfer, or it may take a node out of the draining state to deal with the underlying issue.
This feature and the heuristics for dealing with cluster situations are sure to expand. This initial set of functionality should be sufficient to allow the k8s operator work to proceed.
Fixes: #3706
Fixes: #3705
Release notes