admin: reject maintenance mode req on 1 node cluster #4921

jcsp · 2022-05-25T09:41:48Z

Cover letter

admin: reject maintenance mode req on 1 node cluster

If a single-node cluster puts its only node in maintenance
mode, then there is no node eligible to become controller
leader, and all further progress is stopped.

Fixes #4338

Release notes

Improvements

The Admin API will now refuse to place a node in maintenance mode if it is the only node in the cluster

If a single-node cluster puts its only node in maintenance mode, then there is no node eligible to become controller leader, and all further progress is stopped. Fixes redpanda-data#4338

If a system got into the bad state of issue redpanda-data#4338, then their cluster is broken until we replay the controller log _without_ putting the node into maintenance mode. Related redpanda-data#4338
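To make the two commits concrete, here is a minimal sketch of the two layers they describe: validation before a request is accepted, and a second guard when a maintenance mode command is applied from the controller log. Everything in it (`toy_errc`, `toy_members`, `validate_request`, and the choice to return `rejected` from `apply`) is a hypothetical stand-in for illustration, not Redpanda's actual types or return codes. Only the `size() < 2` condition is taken from the diff in this PR.

```cpp
#include <cassert>
#include <map>

// Hypothetical stand-ins for illustration; not Redpanda's real types.
enum class toy_errc { success, node_not_found, rejected };

struct toy_members {
    // node id -> true if the node is in maintenance mode
    std::map<int, bool> brokers;

    // Admin API side: validate a request before it is written anywhere.
    toy_errc validate_request(int node_id, bool enable) const {
        if (brokers.count(node_id) == 0) {
            return toy_errc::node_not_found;
        }
        // Refuse to put the only node of the cluster into maintenance mode.
        if (enable && brokers.size() < 2) {
            return toy_errc::rejected;
        }
        return toy_errc::success;
    }

    // Replay side: a command that is already in the log (for example one
    // written before the validation existed) is dropped on a 1 node cluster.
    // Assumption: dropping is modelled as "return rejected, change nothing".
    toy_errc apply(int node_id, bool enable) {
        auto it = brokers.find(node_id);
        if (it == brokers.end()) {
            return toy_errc::node_not_found;
        }
        if (enable && brokers.size() < 2) {
            return toy_errc::rejected;
        }
        it->second = enable;
        return toy_errc::success;
    }
};

// Small walk-through of both layers on a 1 node and a 3 node cluster.
inline void demo_single_node_guard() {
    toy_members one;
    one.brokers[0] = false;
    assert(one.validate_request(0, true) == toy_errc::rejected);
    assert(one.apply(0, true) == toy_errc::rejected);
    assert(!one.brokers.at(0)); // state untouched

    toy_members three;
    three.brokers[0] = false;
    three.brokers[1] = false;
    three.brokers[2] = false;
    assert(three.validate_request(1, true) == toy_errc::success);
    assert(three.apply(1, true) == toy_errc::success);
    assert(three.brokers.at(1));
}
```

The second guard matters for clusters that already have the command in their controller log: without it, replaying the log would put the node straight back into maintenance mode.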

jcsp · 2022-05-25T09:43:28Z

@dotnwat there is potentially also an issue with decoms if someone is shrinking a cluster from >=3 nodes to <3 nodes, although I think that is already blocked by #4462, so I think we can defer addressing it until we eventually get to the general task of fixing and testing those kinds of shrinks.

andrwng · 2022-05-26T18:14:43Z

src/v/cluster/members_table.cc

@@ -159,6 +159,17 @@ members_table::apply(model::offset version, maintenance_mode_cmd cmd) {
 return errc::success;
 }

+ if (_brokers.size() < 2) {


Unrelated to this change per se, but do we also need a similar check when applying a decommission command? Maybe an even heavier-handed one, e.g. don't decommission to below the default replication factor, or don't decommission if there are single-replica partitions hosted on the affected node?

mmaslankaprv · 2022-05-27

A decommission will only finish if there are enough nodes left in the cluster to maintain the requested replication factor of its topics. For a single-node cluster, the decommission will only finish once another node has been added to the cluster.

jcsp · 2022-05-27

@andrwng yes, see the comment above ("there is potentially also an issue with decoms..."). Basically, shrinking below 3 nodes is already broken and needs handling as a separate job.
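The condition described in this thread (a decommission only finishes when the remaining nodes can still hold the requested replication factor) can be written down as a tiny predicate. This is a rough sketch under stated assumptions: `toy_decommission_can_finish` and its parameters are hypothetical names, and it deliberately ignores partition placement, rack awareness, and the draining bug tracked in #4462.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: can the decommission of one node run to completion?
// cluster_size is the node count before the decommission, and
// max_replication_factor is the largest replication factor of any topic.
inline bool toy_decommission_can_finish(
  std::size_t cluster_size, std::size_t max_replication_factor) {
    if (cluster_size == 0) {
        return false; // nothing to decommission
    }
    const std::size_t remaining = cluster_size - 1;
    // Every replica needs its own node, so the remaining nodes must be at
    // least as many as the largest replication factor.
    return remaining >= max_replication_factor;
}

inline void demo_decommission_check() {
    // Single node cluster: stuck until another node joins.
    assert(!toy_decommission_can_finish(1, 1));
    assert(toy_decommission_can_finish(2, 1));
    // Three nodes with replication factor 3: cannot shrink to two.
    assert(!toy_decommission_can_finish(3, 3));
    assert(toy_decommission_can_finish(4, 3));
}
```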

dotnwat

Nice catch.

For more context: in a one-node cluster, if the node is put into maintenance mode then nothing bad should happen at first. Leadership transfers won't occur, because they can't (there is nowhere to transfer the leadership to). However, if you are trying to reproduce this issue, the problem will occur after a restart. Maintenance mode also prevents a node from becoming leader after starting up, so things like the controller or other raft groups won't elect themselves.
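The sequence described here (fine until restart, stuck afterwards) can be reproduced in a toy model. The sketch below is an assumption-laden simplification: `toy_node` and `toy_cluster` are hypothetical, there is no raft quorum or voting, and "election" just picks the first node that is not in maintenance mode. It only illustrates why a restart is what exposes the problem.

```cpp
#include <cassert>
#include <vector>

// Hypothetical toy model; real leader election involves raft voting.
struct toy_node {
    int id;
    bool maintenance;
};

struct toy_cluster {
    std::vector<toy_node> nodes;
    int leader = -1; // -1 means there is no controller leader

    // A node in maintenance mode does not stand for election.
    void elect() {
        leader = -1;
        for (const auto& n : nodes) {
            if (!n.maintenance) {
                leader = n.id;
                return;
            }
        }
    }

    void set_maintenance(int id, bool on) {
        for (auto& n : nodes) {
            if (n.id == id) {
                n.maintenance = on;
            }
        }
        if (on && leader == id) {
            // Drain: hand leadership to another eligible node, if any.
            for (const auto& n : nodes) {
                if (n.id != id && !n.maintenance) {
                    leader = n.id;
                    return;
                }
            }
            // Nowhere to transfer to: the current leader simply stays.
        }
    }

    // After a restart leadership is lost and has to be re-elected.
    void restart_all() { elect(); }
};

inline void demo_restart_problem() {
    toy_cluster one{{{0, false}}};
    one.elect();
    assert(one.leader == 0);
    one.set_maintenance(0, true);
    assert(one.leader == 0);  // nothing bad yet
    one.restart_all();
    assert(one.leader == -1); // no eligible node: cluster is stuck

    toy_cluster two{{{0, false}, {1, false}}};
    two.elect();
    two.set_maintenance(0, true);
    assert(two.leader == 1);  // leadership drained to the other node
    two.restart_all();
    assert(two.leader == 1);  // still electable after restart
}
```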

jcsp · 2022-06-01T12:25:59Z

/backport v22.1.x

vbotbuildovich · 2022-06-01T12:27:03Z

Failed to run cherry-pick command. I executed the below command:

git cherry-pick -x b970b2d61037ea34aed3607cbf48cbc071254262 20f3597b4fbeb95b0a958f74940e51fe9247bb99

Workflow run logs.

jcsp added 2 commits May 25, 2022 10:13

admin: reject maintenance mode req on 1 node cluster

b970b2d

If a single-node cluster puts its only node in maintenance mode, then there is no node eligible to become controller leader, and all further progress is stopped. Fixes redpanda-data#4338

cluster: drop maintenance mode messages on n=1 cluster

20f3597

If a system got into the bad state of issue redpanda-data#4338, then their cluster is broken until we replay the controller log _without_ putting the node into maintenance mode. Related redpanda-data#4338

github-actions bot added the area/redpanda label May 25, 2022

jcsp mentioned this pull request May 25, 2022

Shrinking a cluster from 3 nodes to <3 nodes causes brokers to stick in 'draining' state #4462

Open

jcsp marked this pull request as ready for review May 25, 2022 21:02

jcsp requested review from dotnwat, mmaslankaprv, ztlpn and VadimPlh as code owners May 25, 2022 21:02

andrwng reviewed May 26, 2022

dotnwat approved these changes May 31, 2022

dotnwat merged commit 0a7cbd0 into redpanda-data:dev May 31, 2022

jcsp deleted the issue-4388-1node-maintenance branch June 1, 2022 12:25

jcsp mentioned this pull request Jun 1, 2022

[v22.1.x] admin: reject maintenance mode req on 1 node cluster #4986

Merged
