
Cannot disable maintenance mode after restart #4338

Closed
nicolaferraro opened this issue Apr 20, 2022 · 2 comments · Fixed by #4921
Labels
kind/bug Something isn't working

Comments

@nicolaferraro
Member

Issue description

There's an issue with disabling maintenance mode after a pod is restarted while maintenance mode is on. I think this was introduced recently, as the e2e tests were still passing in #4125.

This is the error I get from rpk:

error disabling maintenance mode: request   failed: Service Unavailable, body: "{\"message\": \"Not ready (Currently there is no leader controller elected in the cluster)\", \"code\": 503}"
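
For reference, the node's maintenance state can also be inspected before trying to disable it; a minimal diagnostic, assuming the rpk maintenance status subcommand is available in the build under test:

kubectl exec example-0 -- rpk cluster maintenance status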

I extracted a reproducer that can be run on Kubernetes. It happens on a single-node cluster, but may also happen with multiple instances.

Base resource:

apiVersion: redpanda.vectorized.io/v1alpha1
kind: Cluster
metadata:
  name: example
spec:
  image: "localhost/redpanda"
  version: "dev"
  replicas: 1
  resources:
    requests:
      cpu: 100m
      memory: 100Mi
    limits:
      cpu: 1
      memory: 500Mi
  configuration:
    rpcServer:
      port: 33145
    kafkaApi:
    - port: 9092
    adminApi:
    - port: 9644
    pandaproxyApi:
    - port: 8082
    developerMode: true

How to reproduce the issue?

  1. kubectl apply -f example.yaml
  2. Wait for the pod to be running
  3. kubectl exec example-0 -- rpk cluster maintenance enable 0
  4. kubectl delete pod example-0
  5. Wait for the pod to be running again (or even longer)
  6. kubectl exec example-0 -- rpk cluster maintenance disable 0

Then the error is returned.
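
The failure can also be reproduced without rpk by calling the admin API directly; a sketch, assuming this version's maintenance endpoint is DELETE /v1/brokers/<node-id>/maintenance:

kubectl exec example-0 -- curl -s -X DELETE http://localhost:9644/v1/brokers/0/maintenance

This should return the same 503 body, which points at the broker itself rather than at rpk.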

I believe it's due to some recent changes to the management of leadership groups, but I'm not sure. cc: @dotnwat

The cluster node is reporting errors as well:

example-0 redpanda DEBUG 2022-04-20 14:00:03,209 [shard 0] cluster - health_monitor_backend.cc:272 - unable to refresh health metadata, no leader controller
example-0 redpanda INFO  2022-04-20 14:00:03,209 [shard 0] cluster - health_monitor_backend.cc:403 - error refreshing cluster health state - Currently there is no leader controller elected in the cluster
example-0 redpanda INFO  2022-04-20 14:00:03,209 [shard 0] cluster - metadata_dissemination_service.cc:357 - unable to retrieve cluster health report - Currently there is no leader controller elected in the cluster
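
The missing controller leader can be confirmed from the controller partition's metadata; a hedged check, assuming the admin API exposes it at /v1/partitions/redpanda/controller/0:

kubectl exec example-0 -- curl -s http://localhost:9644/v1/partitions/redpanda/controller/0

If the node is refusing controller leadership, the reported leader_id would be expected to stay at -1.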
nicolaferraro added the kind/bug label Apr 20, 2022
@jcsp
Contributor

jcsp commented Apr 20, 2022

@dotnwat is it possible that maintenance mode is causing the node to refuse all leaderships, including controller leadership? If so, we probably need a special case for single-node clusters.

@dotnwat
Member

dotnwat commented May 20, 2022

@jcsp yes, maintenance mode tries to prevent all leadership, including the controller.

We can certainly special-case the controller. I didn't do that up front because I couldn't imagine a scenario where it would be problematic, and moving controller leadership off the node is beneficial for any clients in the midst of doing administrative work (or for internal communication with the controller leader).

kubectl exec example-0 -- rpk cluster maintenance disable 0

Hmm, I'll take a look. We may need some retries, unless I'm missing something else here in the ticket.
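
On a multi-node cluster, enabling maintenance on one node should simply move controller leadership to another node; a quick way to observe this on a hypothetical 3-replica variant of the example cluster (same assumed endpoints as above):

kubectl exec example-0 -- rpk cluster maintenance enable 0
kubectl exec example-1 -- curl -s http://localhost:9644/v1/partitions/redpanda/controller/0

If that holds, leader_id should settle on node 1 or 2, which is what makes the single-node case special.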

dotnwat self-assigned this May 20, 2022
jcsp added a commit to jcsp/redpanda that referenced this issue May 25, 2022
If a single node cluster puts its only node in maintenance
mode, then there is no node eligible to become controller
leader, and all further progress is stopped.

Fixes redpanda-data#4338
jcsp added a commit to jcsp/redpanda that referenced this issue May 25, 2022
If a system got into the bad state of issue redpanda-data#4338, then
the cluster is broken until we replay the controller
log _without_ putting the node into maintenance mode.

Related redpanda-data#4338
jcsp assigned jcsp and unassigned dotnwat May 25, 2022
jcsp added a commit to jcsp/redpanda that referenced this issue Jun 1, 2022
If a single node cluster puts its only node in maintenance
mode, then there is no node eligible to become controller
leader, and all further progress is stopped.

Fixes redpanda-data#4338

(cherry picked from commit b970b2d)
jcsp added a commit to jcsp/redpanda that referenced this issue Jun 1, 2022
If a system got into the bad state of issue redpanda-data#4338, then
the cluster is broken until we replay the controller
log _without_ putting the node into maintenance mode.

Related redpanda-data#4338

(cherry picked from commit 20f3597)