
Operator: Scaling up a cluster triggers rolling restart #7313

Closed
0x5d opened this issue Nov 16, 2022 · 2 comments

Labels
area/k8s kind/bug Something isn't working

Comments

0x5d (Contributor) commented Nov 16, 2022

Version & Environment

Redpanda version (from rpk version): v22.3.1-rc4

What went wrong?

Increasing the cluster's replica count triggers a rolling restart, during which new Redpanda pods get scheduled on existing pods' nodes.
E.g. in an N-node cluster scaled up to M nodes:

  • The operator triggers a rolling restart
  • Pod 0 is deleted (restarted)
  • The new pod (pod N+1) is scheduled on Node 0
  • Pod 0, which has an affinity for Node 0 via its persistent volume, becomes unschedulable (see the sketch after this list)
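For context, here is a minimal sketch of why pod 0 stays pinned to Node 0: with node-local (or otherwise topology-constrained) storage, pod 0's PersistentVolume carries a required node affinity, so once another pod has taken that node's resources, pod 0 cannot be rescheduled anywhere else. The volume name, capacity, path, and node name below are hypothetical, not taken from this cluster.

apiVersion: v1
kind: PersistentVolume
metadata:
  name: datadir-rp-juan-1111-0        # hypothetical PV backing pod 0's data directory
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /var/lib/redpanda/data      # local disk path on the node
  nodeAffinity:                       # this is what ties pod 0 to a single node
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - node-0              # hypothetical node name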

What should have happened instead?

Existing pods shouldn't be restarted, and new pods should be scheduled on available nodes.

How to reproduce the issue?

  1. Deploy an N-broker Redpanda cluster on an M-node k8s cluster (N < M) using the operator.
  2. Edit the cluster CR, increasing the replicas from N to M (see the example commands after this list).
  3. Monitor the pods in the redpanda namespace (kubectl get pods -n redpanda -w).
  4. Watch a rolling restart be attempted, with a new pod being scheduled on node 0 and pod 0 then becoming unschedulable due to a persistent volume conflict.
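For example (the resource kind, CR name, and replica counts below are assumptions for illustration, not taken from this report), scaling a 3-broker cluster to 4 replicas might look like:

# Edit the cluster CR and change spec.replicas from N (e.g. 3) to M (e.g. 4);
# the CR name "rp-juan-1111" is inferred from the pod names in the log below -
# adjust it to your deployment.
kubectl -n redpanda edit cluster rp-juan-1111

# Or patch the replica count directly:
kubectl -n redpanda patch cluster rp-juan-1111 --type merge -p '{"spec":{"replicas":4}}'

# Then watch the pods in the redpanda namespace:
kubectl -n redpanda get pods -w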

Additional information

Deleting the pod scheduled on pod 0's node allows the rolling restart to continue, but of course a pod inevitably becomes unschedulable in the end:

redpanda@ip-172-16-1-162:~$ kubectl get po -n redpanda -w
NAME                                       READY   STATUS        RESTARTS   AGE
rp-juan-1111-0                             0/1     Pending       0          22m
rp-juan-1111-1                             1/1     Running       0          93m
rp-juan-1111-2                             1/1     Running       0          93m
rp-juan-1111-3                             1/1     Terminating   0          79m
sasl-user-creation-first-superuser-pvwv7   0/1     Completed     0          93m
rp-juan-1111-3                             0/1     Terminating   0          79m
rp-juan-1111-3                             0/1     Terminating   0          79m
rp-juan-1111-3                             0/1     Terminating   0          79m
rp-juan-1111-3                             0/1     Pending       0          0s
rp-juan-1111-3                             0/1     Pending       0          0s
rp-juan-1111-0                             0/1     Pending       0          22m
rp-juan-1111-0                             0/1     Init:0/1      0          22m
rp-juan-1111-0                             0/1     PodInitializing   0          22m
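The workaround mentioned above is, in terms of the pod names in this log (which pod landed on pod 0's node is an assumption based on the sequence above), roughly:

# Delete the new pod that was scheduled onto pod 0's node so pod 0 can bind
# its node-local volume and start; here that appears to be rp-juan-1111-3.
kubectl -n redpanda delete pod rp-juan-1111-3

This only moves the problem along: whichever pod is left without a free node at the end of the roll still becomes unschedulable.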
0x5d added the kind/bug and area/k8s labels on Nov 16, 2022
0x5d (Contributor, Author) commented Nov 16, 2022

There's the beginning of a fix here: #4964

joejulian (Contributor) commented:

This doesn't happen now.
