Enabling maintenance mode on a decommissioned node leaves the cluster in an inconsistent state #4999

Open
nicolaferraro opened this issue Jun 1, 2022 · 0 comments
Labels: area/redpanda, kind/bug (Something isn't working)

nicolaferraro commented Jun 1, 2022

What went wrong?

I'm trying out the maintenance mode hooks in combination with decommissioning and I'm seeing some strange behavior.

The problem is that maintenance mode can also be enabled for (fully) decommissioned nodes (the static hooks do this automatically in Kubernetes).
This causes two issues:

  • Waiting for maintenance mode to finish never terminates: the finished=true state is never reached, because draining never actually starts (draining=false) — see the illustrative status output after this list
  • Other nodes are unable to enter maintenance mode, because the cluster believes another node is already in maintenance mode, even though that node has been decommissioned
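For illustration, the contrast looks roughly like this (simplified; based only on the draining and finished fields mentioned in this report, the real response may contain additional fields):

# healthy node that entered maintenance mode and finished draining
{"draining": true, "finished": true}

# decommissioned node that was asked to enter maintenance mode
{"draining": false}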

I think it's fine for the cluster to reply 200 OK to a request to enter maintenance mode on a decommissioned node, but the internal state then apparently remains inconsistent.
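As a possible mitigation on the hook side (a sketch only, not what the operator currently does; the NODE_ID variable and the guard itself are hypothetical), the static hook could skip the maintenance request when the broker is no longer a cluster member:

$ # Hypothetical guard (sketch): only request maintenance mode if this broker id
$ # is still listed as a cluster member
$ NODE_ID=2   # assumed: the hook knows its own broker id
$ if curl --silent http://localhost:9644/v1/brokers/ | jq -e ".[] | select(.node_id == $NODE_ID)" > /dev/null; then
>   curl --silent -i -X PUT "http://localhost:9644/v1/brokers/$NODE_ID/maintenance"
> fi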

cc @dotnwat

Version & Environment

Redpanda v22.1.3, latest operator

How to reproduce the issue?

The following steps were executed on a 3-node example cluster:

$ # Starting with a 3-node cluster (brokers 0, 1, 2), decommission broker 2
$ kubectl exec -c redpanda example-2 -- curl --silent -i -X PUT http://localhost:9644/v1/brokers/2/decommission
HTTP/1.1 200 OK
Content-Length: 0
Content-Type: application/json
Date: Wed, 01 Jun 2022 17:18:12 GMT
Server: Seastar httpd

$ # Wait for broker 2 to disappear
$ kubectl exec -c redpanda example-0 -- curl --silent http://localhost:9644/v1/brokers/ | jq .[].node_id | sort
0
1

$ # Node 2 enters maintenance mode (this happens in practice because of static hooks currently set on pods)
$ kubectl exec -c redpanda example-2 -- curl --silent -i -X PUT http://localhost:9644/v1/brokers/2/maintenance
HTTP/1.1 200 OK
Content-Length: 0
Content-Type: application/json
Date: Wed, 01 Jun 2022 17:21:22 GMT
Server: Seastar httpd

$ # The maintenance status reports draining=false, while on a healthy node it eventually reports draining=true and finished=true
$ kubectl exec -c redpanda example-2 -- curl --silent http://localhost:9644/v1/maintenance
{"draining": false}

$ # Entering maintenance mode on another node fails
$ kubectl exec -c redpanda example-1 -- curl --silent -i -X PUT http://localhost:9644/v1/brokers/1/maintenance
HTTP/1.1 400 Bad Request
Content-Length: 93
Server: Seastar httpd
Date: Wed, 01 Jun 2022 17:23:51 GMT
Content-Type: application/json

{"message": "can not update broker 1 state, invalid state transition requested", "code": 400}

JIRA Link: CORE-933

nicolaferraro added the kind/bug label on Jun 1, 2022
nicolaferraro added commits to nicolaferraro/redpanda referencing this issue on Jun 10, Jun 13, Jun 14, Jun 15, and Jun 16, 2022. The commit message (title truncated in the timeline) reads:

…rkaround for redpanda-data#4999)

When a node is shut down after decommission, the maintenance mode hooks will trigger. While the process has no visible effect on partitions, it leaves the cluster in an inconsistent state, so that other nodes cannot enter maintenance mode. We force reset the flag with this change.
joejulian pushed commits with the same change to joejulian/redpanda referencing this issue on Mar 10 and Mar 24, 2023.