Enabling maintenance mode on a decommissioned node leaves the cluster in an inconsistent state #4999

Open
nicolaferraro opened this issue Jun 1, 2022 · 0 comments
Labels: area/redpanda, kind/bug (Something isn't working)

nicolaferraro commented Jun 1, 2022

What went wrong?

I'm trying out the maintenance mode hooks in combination with decommissioning and I'm seeing some strange behavior.

The problem is that maintenance mode can also be enabled for (fully) decommissioned nodes (the static hooks do this automatically in Kubernetes).
This causes two issues:

  • Waiting for maintenance mode to finish never terminates: the finished=true state is never reached, because draining never actually starts (draining=false) — see the illustrative status output after this list
  • Other nodes are unable to enter maintenance mode, because the cluster believes another node is already in maintenance mode, even though that node has been decommissioned
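For illustration, the contrast looks roughly like this (simplified; based only on the draining and finished fields mentioned in this report, the real response may contain additional fields):

# healthy node that entered maintenance mode and finished draining
{"draining": true, "finished": true}

# decommissioned node that was asked to enter maintenance mode
{"draining": false}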

I think it's fine for the cluster to reply 200 OK to a request to enter maintenance mode on a decommissioned node, but the internal state then apparently remains inconsistent.
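As a possible mitigation on the hook side (a sketch only, not what the operator currently does; the NODE_ID variable and the guard itself are hypothetical), the static hook could skip the maintenance request when the broker is no longer a cluster member:

$ # Hypothetical guard (sketch): only request maintenance mode if this broker id
$ # is still listed as a cluster member
$ NODE_ID=2   # assumed: the hook knows its own broker id
$ if curl --silent http://localhost:9644/v1/brokers/ | jq -e ".[] | select(.node_id == $NODE_ID)" > /dev/null; then
>   curl --silent -i -X PUT "http://localhost:9644/v1/brokers/$NODE_ID/maintenance"
> fi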

cc @dotnwat

Version & Environment

Redpanda v22.1.3, latest operator

How to reproduce the issue?

The following steps were executed on a 3-node example cluster:

$ # Starting with a 3-node cluster (brokers 0, 1, 2), decommission broker 2
$ kubectl exec -c redpanda example-2 -- curl --silent -i -X PUT http://localhost:9644/v1/brokers/2/decommission
HTTP/1.1 200 OK
Content-Length: 0
Content-Type: application/json
Date: Wed, 01 Jun 2022 17:18:12 GMT
Server: Seastar httpd

$ # Wait for broker 2 to disappear
$ kubectl exec -c redpanda example-0 -- curl --silent http://localhost:9644/v1/brokers/ | jq .[].node_id | sort
0
1

$ # Node 2 enters maintenance mode (this happens in practice because of static hooks currently set on pods)
$ kubectl exec -c redpanda example-2 -- curl --silent -i -X PUT http://localhost:9644/v1/brokers/2/maintenance
HTTP/1.1 200 OK
Content-Length: 0
Content-Type: application/json
Date: Wed, 01 Jun 2022 17:21:22 GMT
Server: Seastar httpd

$ # The maintenance status reports draining=false, while on a healthy node it eventually reports draining=true and finished=true
$ kubectl exec -c redpanda example-2 -- curl --silent http://localhost:9644/v1/maintenance
{"draining": false}

$ # Entering maintenance mode on another node fails
$ kubectl exec -c redpanda example-1 -- curl --silent -i -X PUT http://localhost:9644/v1/brokers/1/maintenance
HTTP/1.1 400 Bad Request
Content-Length: 93
Server: Seastar httpd
Date: Wed, 01 Jun 2022 17:23:51 GMT
Content-Type: application/json

{"message": "can not update broker 1 state, invalid state transition requested", "code": 400}

JIRA Link: CORE-933

nicolaferraro added the kind/bug label on Jun 1, 2022
nicolaferraro added commits to nicolaferraro/redpanda referencing this issue on Jun 10, Jun 13, Jun 14, Jun 15, and Jun 16, 2022. The commit message (title truncated in the timeline) reads:

…rkaround for redpanda-data#4999)

When a node is shut down after decommission, the maintenance mode hooks will trigger. While the process has no visible effect on partitions, it leaves the cluster in an inconsistent state, so that other nodes cannot enter maintenance mode. We force reset the flag with this change.
joejulian pushed commits with the same change to joejulian/redpanda referencing this issue on Mar 10 and Mar 24, 2023.