
Cannot disable maintenance mode after restart #4338

Closed
nicolaferraro opened this issue Apr 20, 2022 · 2 comments · Fixed by #4921
Labels
kind/bug Something isn't working

Comments

@nicolaferraro
Member

Issue description

There's an issue with disabling maintenance mode after a pod is restarted while maintenance mode is on. I think this was introduced recently, as the e2e tests were still passing in #4125.

This is the error I get from rpk:

error disabling maintenance mode: request   failed: Service Unavailable, body: "{\"message\": \"Not ready (Currently there is no leader controller elected in the cluster)\", \"code\": 503}"
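
For reference, the node's maintenance state can also be inspected before trying to disable it; a minimal diagnostic, assuming the rpk maintenance status subcommand is available in the build under test:

kubectl exec example-0 -- rpk cluster maintenance status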

I extracted a reproducer that can be run on Kubernetes. It happens on a single-node cluster, but may also happen with multiple instances.

Base resource:

apiVersion: redpanda.vectorized.io/v1alpha1
kind: Cluster
metadata:
  name: example
spec:
  image: "localhost/redpanda"
  version: "dev"
  replicas: 1
  resources:
    requests:
      cpu: 100m
      memory: 100Mi
    limits:
      cpu: 1
      memory: 500Mi
  configuration:
    rpcServer:
      port: 33145
    kafkaApi:
    - port: 9092
    adminApi:
    - port: 9644
    pandaproxyApi:
    - port: 8082
    developerMode: true

How to reproduce the issue?

  1. kubectl apply -f example.yaml
  2. Wait for the pod to be running
  3. kubectl exec example-0 -- rpk cluster maintenance enable 0
  4. kubectl delete pod example-0
  5. Wait for the pod to be running again (or even longer)
  6. kubectl exec example-0 -- rpk cluster maintenance disable 0

Then the error is returned.
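
The failure can also be reproduced without rpk by calling the admin API directly; a sketch, assuming this version's maintenance endpoint is DELETE /v1/brokers/<node-id>/maintenance:

kubectl exec example-0 -- curl -s -X DELETE http://localhost:9644/v1/brokers/0/maintenance

This should return the same 503 body, which points at the broker itself rather than at rpk.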

I believe it's due to some recent changes to the management of leadership groups, but I'm not sure. cc: @dotnwat

The cluster node is reporting errors as well:

example-0 redpanda DEBUG 2022-04-20 14:00:03,209 [shard 0] cluster - health_monitor_backend.cc:272 - unable to refresh health metadata, no leader controller
example-0 redpanda INFO  2022-04-20 14:00:03,209 [shard 0] cluster - health_monitor_backend.cc:403 - error refreshing cluster health state - Currently there is no leader controller elected in the cluster
example-0 redpanda INFO  2022-04-20 14:00:03,209 [shard 0] cluster - metadata_dissemination_service.cc:357 - unable to retrieve cluster health report - Currently there is no leader controller elected in the cluster
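
The missing controller leader can be confirmed from the controller partition's metadata; a hedged check, assuming the admin API exposes it at /v1/partitions/redpanda/controller/0:

kubectl exec example-0 -- curl -s http://localhost:9644/v1/partitions/redpanda/controller/0

If the node is refusing controller leadership, the reported leader_id would be expected to stay at -1.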
nicolaferraro added the kind/bug label Apr 20, 2022
@jcsp
Contributor

jcsp commented Apr 20, 2022

@dotnwat is it possible that maintenance mode is causing the node to refuse all leaderships, including controller leadership? If so, we probably need a special case for single-node clusters.

@dotnwat
Member

dotnwat commented May 20, 2022

@jcsp yes, maintenance mode tries to prevent all leadership, including the controller.

We can certainly special-case the controller. I didn't do that up front because I couldn't imagine a scenario where it would be problematic, and moving controller leadership off the node is beneficial for any clients in the midst of doing administrative work (or for internal communication with the controller leader).

kubectl exec example-0 -- rpk cluster maintenance disable 0

Hmm, I'll take a look. We may need some retries, unless I'm missing something else here in the ticket.
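
On a multi-node cluster, enabling maintenance on one node should simply move controller leadership to another node; a quick way to observe this on a hypothetical 3-replica variant of the example cluster (same assumed endpoints as above):

kubectl exec example-0 -- rpk cluster maintenance enable 0
kubectl exec example-1 -- curl -s http://localhost:9644/v1/partitions/redpanda/controller/0

If that holds, leader_id should settle on node 1 or 2, which is what makes the single-node case special.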

dotnwat self-assigned this May 20, 2022
jcsp added a commit to jcsp/redpanda that referenced this issue May 25, 2022
If a single node cluster puts its only node in maintenance
mode, then there is no node eligible to become controller
leader, and all further progress is stopped.

Fixes redpanda-data#4338
jcsp added a commit to jcsp/redpanda that referenced this issue May 25, 2022
If a system got into the bad state of issue redpanda-data#4338, then
the cluster is broken until we replay the controller
log _without_ putting the node into maintenance mode.

Related redpanda-data#4338
jcsp assigned jcsp and unassigned dotnwat May 25, 2022
jcsp added a commit to jcsp/redpanda that referenced this issue Jun 1, 2022
If a single node cluster puts its only node in maintenance
mode, then there is no node eligible to become controller
leader, and all further progress is stopped.

Fixes redpanda-data#4338

(cherry picked from commit b970b2d)
jcsp added a commit to jcsp/redpanda that referenced this issue Jun 1, 2022
If a system got into the bad state of issue redpanda-data#4338, then
the cluster is broken until we replay the controller
log _without_ putting the node into maintenance mode.

Related redpanda-data#4338

(cherry picked from commit 20f3597)