Shrinking a cluster from 3 nodes to <3 nodes causes brokers to stick in 'draining' state #4462

Open · jcsp opened this issue on Apr 27, 2022 · 4 comments

@jcsp (Contributor) commented on Apr 27, 2022

  • Create 3 brokers (0, 1, 2)
  • Create a topic
  • Consume from the topic using a consumer group (this prompts creation of the consumer offsets topic)
  • Decommission broker 2 via "rpk brokers decommission"
  • Observe /v1/brokers -- broker 2 remains in state 'draining' indefinitely (a small polling sketch for observing this follows below).

This scenario does work if you have never done anything that prompted an internal topic creation, e.g. if you never did that consumer group operation.
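
Not part of the original report, just a minimal polling sketch for observing the stuck state. It assumes the admin API listens on localhost:9644 and that entries returned by /v1/brokers carry node_id and membership_status fields; both are assumptions, so adjust to the actual response shape.

```python
# Hedged repro helper: poll /v1/brokers and report if the decommissioned
# broker never leaves 'draining'. The admin address (localhost:9644) and the
# field names ('node_id', 'membership_status') are assumptions, not confirmed
# by this issue.
import json
import time
import urllib.request

ADMIN_URL = "http://localhost:9644/v1/brokers"  # assumed default admin address
BROKER_ID = 2
TIMEOUT_S = 300


def broker_status(broker_id: int) -> str:
    with urllib.request.urlopen(ADMIN_URL) as resp:
        brokers = json.load(resp)
    for b in brokers:
        if b.get("node_id") == broker_id:
            return b.get("membership_status", "unknown")
    return "absent"  # broker no longer listed: decommission finished


deadline = time.time() + TIMEOUT_S
while time.time() < deadline:
    status = broker_status(BROKER_ID)
    print(f"broker {BROKER_ID}: {status}")
    if status != "draining":
        break
    time.sleep(5)
else:
    print(f"broker {BROKER_ID} still 'draining' after {TIMEOUT_S}s")
```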

Suggested fix:

  • At the admin API level: if node count is 3, and any partitions exist with replicas=3 (other than the controller topic), and internal_topic_replication_factor >= 3, then refuse any decommission request (sketched below).
  • If internal_topic_replication_factor == 1 and any internal topics are at replicas=3, then when we receive such a decommission request, decrease the replication factor of those topics before attempting the decommission. This provides a path for users who genuinely want to shrink their cluster to a point that puts data at risk. The default for internal_topic_replication_factor is 3, so only users who manually adjust it will be able to do this -- default to data safety.
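
Not from the issue itself, just a rough sketch of what the first check could look like; ClusterView, Partition, and validate_decommission are illustrative names, not Redpanda's actual types or config accessors.

```python
# Illustrative sketch of the proposed decommission guard; every name here is
# hypothetical, not Redpanda's real types, config keys, or admin API code.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Partition:
    topic: str
    internal: bool            # e.g. the consumer offsets topic
    replication_factor: int


@dataclass
class ClusterView:
    broker_count: int
    internal_topic_replication_factor: int   # cluster config value
    partitions: List[Partition]


def validate_decommission(cluster: ClusterView) -> Optional[str]:
    """Return a refusal reason, or None if the decommission may proceed."""
    if cluster.broker_count != 3:
        return None  # this guard only covers the 3 -> 2 shrink case
    if cluster.internal_topic_replication_factor < 3:
        # Second bullet: with internal_topic_replication_factor == 1 the
        # caller would first shrink internal topics still at rf=3, then
        # proceed with the decommission.
        return None
    for p in cluster.partitions:
        if p.topic == "controller":
            continue  # controller topic is explicitly excluded
        if p.replication_factor >= 3:
            return (
                f"refusing decommission: topic '{p.topic}' has "
                f"replication factor {p.replication_factor} and "
                f"internal_topic_replication_factor >= 3"
            )
    return None
```

With the default internal_topic_replication_factor of 3 this refuses the shrink in the repro above; only operators who have deliberately lowered that setting would reach the replication-factor-decrease path.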

JIRA Link: CORE-896

@jcsp added the kind/bug and area/controller labels on Apr 27, 2022
@dotnwat (Member) commented on Apr 27, 2022

@jcsp was broker 2 ever put into maintenance mode? Also, maybe we are talking about 'draining' in a different context, like draining the partitions as part of the decommission process?

@jcsp (Contributor, Author) commented on Apr 28, 2022

@dotnwat this is the membership_state::draining, rather than the maintenance-mode draining. In retrospect we probably should have picked a different noun for one of these.

@jcsp (Contributor, Author) commented on May 25, 2022

When we address this and test shrinks from >=3 to <3 nodes, we should also check for maintenance mode and refuse to shrink to <3 nodes if any of the nodes are currently in maintenance mode (see #4921).
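
Purely as a sketch of how that extra check could sit alongside the guard sketched earlier; BrokerStatus and the function name are hypothetical, not existing code.

```python
# Hypothetical addition to the guard sketched above: refuse to shrink below
# 3 nodes while any broker is in maintenance mode (see #4921).
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BrokerStatus:
    node_id: int
    in_maintenance: bool


def check_maintenance_before_shrink(brokers: List[BrokerStatus]) -> Optional[str]:
    if len(brokers) > 3:
        return None  # removing one broker would not drop the cluster below 3
    for b in brokers:
        if b.in_maintenance:
            return (
                f"refusing decommission: broker {b.node_id} is in "
                f"maintenance mode; shrinking below 3 nodes is not allowed"
            )
    return None
```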

@dotnwat (Member) commented on May 25, 2022

> @dotnwat this is the membership_state::draining, rather than the maintenance mode draining. In retrospect we probably should have picked a different noun for one of these

I'll see what I can do to rename the maintenance mode draining state.
