admin: reject maintenance mode req on 1 node cluster #4921

Merged
dotnwat merged 2 commits into redpanda-data:dev on May 31, 2022

@jcsp jcsp commented May 25, 2022

Cover letter

admin: reject maintenance mode req on 1 node cluster

If a single node cluster puts its only node in maintenance
mode, then there is no node eligible to become controller
leader, and all further progress is stopped.

Fixes #4338

Release notes

Improvements

  • The Admin API will now refuse to place a node in maintenance mode if it is the only node in the cluster
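
A minimal C++ sketch of the guard described in this cover letter, assuming a heavily simplified members table. The names `members_table_sketch`, `sketch_errc`, and `apply_maintenance_mode` are hypothetical and exist only for illustration; they are not Redpanda's actual types. Only the "fewer than two brokers" condition is taken from this PR.

```cpp
#include <cassert> // only needed when exercising the sketch with assertions
#include <set>

// Hypothetical error codes for this sketch (not Redpanda's errc enum).
enum class sketch_errc { success, node_not_found, single_node_cluster };

// Hypothetical, heavily simplified stand-in for the cluster members table.
struct members_table_sketch {
    std::set<int> brokers;        // ids of all registered brokers
    std::set<int> in_maintenance; // ids of brokers in maintenance mode

    // Apply a maintenance mode command for `node_id`.
    sketch_errc apply_maintenance_mode(int node_id, bool enabled) {
        if (brokers.count(node_id) == 0) {
            return sketch_errc::node_not_found;
        }
        if (!enabled) {
            // In this sketch, disabling maintenance mode is always allowed.
            in_maintenance.erase(node_id);
            return sketch_errc::success;
        }
        // Reject the request when there are fewer than two brokers:
        // the only node would no longer be eligible to lead the controller.
        if (brokers.size() < 2) {
            return sketch_errc::single_node_cluster;
        }
        in_maintenance.insert(node_id);
        return sketch_errc::success;
    }
};
```

With a single broker, `apply_maintenance_mode(1, true)` returns `sketch_errc::single_node_cluster` and leaves `in_maintenance` empty. With three brokers the same call succeeds.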

jcsp added 2 commits May 25, 2022 10:13
If a single node cluster puts its only node in maintenance
mode, then there is no node eligible to become controller
leader, and all further progress is stopped.

Fixes redpanda-data#4338

If a system got into the bad state of issue redpanda-data#4338, then
its cluster is broken until we replay the controller
log _without_ putting the node into maintenance mode.

Related redpanda-data#4338

jcsp commented May 25, 2022

@dotnwat there is potentially also an issue with decoms if someone is shrinking a cluster from >=3 nodes to <3 nodes, although I think that is already blocked by #4462, so I think we can defer addressing that until we eventually get to the general task of fixing and testing those kinds of shrinks

@@ -159,6 +159,17 @@ members_table::apply(model::offset version, maintenance_mode_cmd cmd) {
return errc::success;
}

if (_brokers.size() < 2) {

Contributor

Unrelated to this change per se, but do we also need a similar check when applying a decommission command? Maybe an even heavier-handed one, e.g. don't decommission to below the default replication factor, or don't decommission if there are single-replica partitions hosted on the affected node?

Member

A decommission will only finish if there are enough nodes in the cluster to maintain the requested replication factor of the topics. For a single node cluster, the decommission will only finish once another node is added to the cluster.
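
As a rough illustration of that rule, here is a C++ sketch. The function `decommission_can_finish` is hypothetical, not a Redpanda API, and it assumes the only constraint is node count versus the highest requested replication factor (it ignores disk capacity, rack placement, and everything else a real allocator considers). It also assumes at least one node must remain.

```cpp
#include <algorithm>
#include <cassert> // only needed when exercising the sketch with assertions
#include <cstddef>
#include <vector>

// Hypothetical helper: can a decommission of one node complete, given the
// current node count and the replication factor requested by each topic?
bool decommission_can_finish(
  std::size_t node_count,
  const std::vector<std::size_t>& topic_replication_factors) {
    if (node_count == 0) {
        return false; // nothing to decommission
    }
    const std::size_t remaining = node_count - 1;

    // Highest replication factor any topic asks for (0 if there are no topics).
    std::size_t max_rf = 0;
    if (!topic_replication_factors.empty()) {
        max_rf = *std::max_element(
          topic_replication_factors.begin(), topic_replication_factors.end());
    }

    // Assumption: at least one node must remain, and the remaining nodes
    // must be able to hold every replica of the most replicated topic.
    return remaining >= std::max<std::size_t>(max_rf, 1);
}
```

Under these assumptions a single node cluster never satisfies the condition until a second node joins, which matches the behavior described in the comment above.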

Contributor Author

@andrwng yes, see the comment above "there is potentially an issue with decoms...". Basically shrinking below 3 is already broken, and needs handling as a separate job.

Member

@dotnwat dotnwat left a comment

Nice catch.

For more context: in a one node cluster, if the node is put into maintenance mode then nothing bad should happen at first. Leadership transfers won't occur, because they can't (there is nowhere to transfer the leadership to). However, if you are trying to reproduce this issue, the problem will occur after a restart. Maintenance mode prevents a node from becoming leader after starting up too, so things like the controller or other Raft groups won't elect themselves leader.
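
The restart behavior described above can be illustrated with a toy election model in C++. This is a sketch under strong assumptions: `node_sketch` and `elect_controller_leader` are hypothetical names, and real Raft elections involve votes, terms, and quorums that are ignored here. The only rule modelled is that a node in maintenance mode does not become leader after startup.

```cpp
#include <cassert> // only needed when exercising the sketch with assertions
#include <optional>
#include <vector>

// Hypothetical, simplified view of a node after a restart.
struct node_sketch {
    int id;
    bool in_maintenance;
};

// Toy election: return the id of the first node that is allowed to become
// leader, or std::nullopt if every node is in maintenance mode.
std::optional<int>
elect_controller_leader(const std::vector<node_sketch>& nodes) {
    for (const auto& n : nodes) {
        if (!n.in_maintenance) {
            return n.id;
        }
    }
    return std::nullopt;
}
```

In this model a one node cluster whose only node is in maintenance mode yields no leader at all, so no further command (including one that disables maintenance mode) can be committed.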

@dotnwat dotnwat merged commit 0a7cbd0 into redpanda-data:dev May 31, 2022
@jcsp jcsp deleted the issue-4388-1node-maintenance branch June 1, 2022 12:25

jcsp commented Jun 1, 2022

/backport v22.1.x

@vbotbuildovich
Collaborator

Failed to run cherry-pick command. I executed the below command:

git cherry-pick -x b970b2d61037ea34aed3607cbf48cbc071254262 20f3597b4fbeb95b0a958f74940e51fe9247bb99

Workflow run logs.

Development

Successfully merging this pull request may close these issues.

Cannot disable maintenance mode after restart
5 participants