k8s: implement safe rolling upgrade logic in operator #3023

dotnwat · 2021-11-19T04:24:46Z

The following is the generic upgrade procedure assumed in this document, and is executable manually or automatically by a process such as a k8s operator:

1. Wait for healthy cluster state via the health monitor service
2. Select a non-upgraded node and place it into maintenance mode
   - This may take some time to complete
   - If a cluster issue occurs
     - Revert maintenance mode
     - Goto (1)
3. Once a node is in maintenance mode it may be shut down
4. Execute node upgrade process
5. Restart node
6. Wait for healthy cluster state via the health monitor service
7. Take node out of maintenance mode
8. Goto (1)
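
The procedure above can be sketched as a loop. This is only an illustration, not the operator's code: `is_healthy`, `enter_maintenance`, `exit_maintenance` and `upgrade_and_restart` are hypothetical callables standing in for the health monitor and the node operations, and the bounded retry count is an assumption added so the loop cannot spin forever.

```python
import time
from typing import Callable, Iterable, List


class UpgradeAborted(Exception):
    """Raised when the cluster stays unhealthy or a node cannot be drained."""


def wait_healthy(is_healthy: Callable[[], bool], attempts: int,
                 poll_interval_s: float) -> None:
    # Poll the (hypothetical) health check a bounded number of times.
    for attempt in range(attempts):
        if is_healthy():
            return
        if attempt + 1 < attempts:
            time.sleep(poll_interval_s)
    raise UpgradeAborted("cluster did not become healthy")


def rolling_upgrade(nodes: Iterable[str],
                    is_healthy: Callable[[], bool],
                    enter_maintenance: Callable[[str], bool],
                    exit_maintenance: Callable[[str], None],
                    upgrade_and_restart: Callable[[str], None],
                    attempts: int = 3,
                    poll_interval_s: float = 1.0) -> List[str]:
    pending = list(nodes)
    upgraded: List[str] = []
    drain_failures = 0
    while pending:
        wait_healthy(is_healthy, attempts, poll_interval_s)   # step 1
        node = pending[0]                                     # step 2
        if not enter_maintenance(node):
            exit_maintenance(node)                            # revert
            drain_failures += 1
            if drain_failures >= attempts:
                raise UpgradeAborted(f"could not drain {node}")
            continue                                          # goto (1)
        drain_failures = 0
        upgrade_and_restart(node)                             # steps 3-5
        wait_healthy(is_healthy, attempts, poll_interval_s)   # step 6
        exit_maintenance(node)                                # step 7
        upgraded.append(pending.pop(0))                       # step 8: loop
    return upgraded
```

Passing the node operations in as callables keeps the ordering logic testable without a cluster.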

Additional notes

jcsp · 2022-11-03T15:54:24Z

@joejulian @dotnwat is this ticket still relevant?

dotnwat · 2022-11-16T23:22:30Z

@joejulian is this covered now in the operator?

joejulian · 2022-11-17T03:53:23Z

It doesn't exactly follow those steps. It doesn't wait for anything external before taking it out of maintenance mode. If "Wait for healthy cluster state" could be that it can call its own admin api, then it's probably close enough.

jcsp · 2022-11-21T08:46:15Z

It sounds like there is still work to do here: checking cluster health before proceeding with upgrades is important for robustness.

Scenario A: unexpected bug

A hypothetical Redpanda version has a bug that causes it to send RPCs to peers that make the peers fall over. The upgrade procedure upgrades one node, the upgraded node comes up quite happily, but other nodes start crashing. That should be the signal for the operator to stop the upgrade and roll back.

Scenario B: data recovery

The cluster is under write load. When upgrading (restarting) node 1, node 1 naturally falls behind on writes. Nodes 2+3 are still able to service writes. Then node 2 gets restarted while node 1 is still behind. While node 2 is offline, nodes 1+3 can form a quorum, but cannot service acks=-1 writes yet, because node 1 is behind: it can't service new writes at the tip of the log until it is back online. This manifests as a timeout to producers during the upgrade.
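
The arithmetic behind scenario B can be written down in a few lines. This is a deliberately simplified model, not Redpanda's replication logic: it assumes an acks=-1 write needs a majority of replicas that are both online and caught up, and the `Replica` type and function names are made up for the illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Replica:
    online: bool
    caught_up: bool


def majority(replica_count: int) -> int:
    # Smallest number of replicas that forms a majority.
    return replica_count // 2 + 1


def can_ack_all_writes(replicas: List[Replica]) -> bool:
    # Simplified model: only online, caught-up replicas count toward the majority.
    ready = sum(1 for r in replicas if r.online and r.caught_up)
    return ready >= majority(len(replicas))


# Node 1 restarted and still behind, node 2 offline, node 3 fine.
scenario_b = [
    Replica(online=True, caught_up=False),
    Replica(online=False, caught_up=False),
    Replica(online=True, caught_up=True),
]
# Same cluster if the operator had waited for node 1 before touching node 2.
waited = [
    Replica(online=True, caught_up=True),
    Replica(online=False, caught_up=False),
    Replica(online=True, caught_up=True),
]
```

In this model `can_ack_all_writes(scenario_b)` is false while `can_ack_all_writes(waited)` is true, which is the case for waiting on replication before moving to the next node.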

joejulian · 2022-11-22T00:54:43Z

A: rollback

This would have to roll back the state of the Cluster resource.

- I'm not sure the previous state is actually saved, so this would need to be added.
- When the Cluster gets reverted to a last-known-good state, how does the reconciler know to override the cluster-health check? (Cluster condition?)
- What do we do if the previous configuration doesn't fix it? We should probably throw an event and trigger an alert from that event.

A: Press-on!

What if we didn't revert and, instead, pressed on if the first pod to roll came up healthy but the rest of the cluster fell down? Could we just panic flip all the rest of the pods?

joejulian · 2022-11-22T01:10:09Z

For B: What signal should we be looking for? When node 1 is rolled, comes back, takes itself out of maintenance mode - what signal does Redpanda give that the operator should be checking before moving on to node 2? It does check v1/status/ready. Is that sufficient?

jcsp · 2022-11-22T09:36:18Z

> For B: What signal should we be looking for? When node 1 is rolled, comes back, takes itself out of maintenance mode - what signal does Redpanda give that the operator should be checking before moving on to node 2? It does check v1/status/ready. Is that sufficient?

The node readiness endpoint is not sufficient. /v1/status/ready only tells you that the node you touched is up (internally it's just a bool that gets set after the node opens its Kafka listener). For a safe upgrade, the essential check is that the overall cluster health is good. That includes things like:

- Are the other nodes up? (i.e. did something fall over as in scenario A)
- Are any partitions behind on replication? (i.e. do we need to wait to avoid scenario B)

/v1/cluster/health_overview is what gives you that cluster-wide status. It's not perfect (it's always possible for something to go wrong between the health GET and the actual upgrade), but gives an excellent chance of backing off if something has gone dramatically wrong. Currently the main things it reports on are whether any nodes are down and whether any partitions are leaderless, but it will be the place in future that we can extend to give that strong "scenario B" check that data replication is up to date.
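
As a rough sketch of how a caller might gate an upgrade step on that response, the function below inspects an already-parsed health overview. The field names `is_healthy`, `nodes_down` and `leaderless_partitions` are assumptions about the JSON shape and should be checked against the admin API of the Redpanda version in use; the function itself is hypothetical and does no HTTP.

```python
from typing import Any, List, Mapping


def upgrade_blockers(overview: Mapping[str, Any]) -> List[str]:
    """Return human-readable reasons not to proceed; empty means go ahead."""
    blockers: List[str] = []
    nodes_down = overview.get("nodes_down", [])              # assumed field name
    leaderless = overview.get("leaderless_partitions", [])   # assumed field name
    if nodes_down:
        blockers.append(f"nodes down: {sorted(nodes_down)}")
    if leaderless:
        blockers.append(f"{len(leaderless)} leaderless partition(s)")
    # Fail closed: a missing or false is_healthy blocks even without details.
    if not overview.get("is_healthy", False) and not blockers:
        blockers.append("cluster reports unhealthy")
    return blockers
```

Returning reasons rather than a bare boolean lets the caller surface them in an event or a status condition. As noted above, the check is still racy: the state can change between the GET and the next upgrade action.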

@dotnwat please keep me honest: does this line up with recent discussions on local disk storage etc?

jcsp · 2022-11-22T09:37:33Z

> A: Press-on!

I don't recommend this. On a major version upgrade, if you give up on a rolling upgrade and flash forward to updating all the nodes, then new feature flags will activate, and the cluster will start writing new-format data to disk. At this point the door slams shut for rolling back.

dotnwat · 2022-11-23T05:27:47Z

> @dotnwat please keep me honest: does this line up with recent discussions on local disk storage etc?

@jcsp yes. /v1/status/ready is certainly not sufficient. In fact, we didn't have a single endpoint that would be sufficient for the ephemeral disk scenario, so the proposal was a function of a couple of endpoints until core could enhance the existing health endpoint to be sufficient. This was written down somewhere in the context of the tt-local-disk channel on Slack. I can't seem to find it right now, but I will look in the AM.

As per redpanda-data#3023 the cluster should be healthy before starting to put a node into maintenance mode and after the pod is restarted.

During a rolling update, before this change, the Redpanda operator calculated the difference between the running pod specification and the StatefulSet pod template. If the specification did not match, the pod was deleted. From release v22.1.1 the operator configures each broker with pod lifecycle hooks. In the PreStop hook the script tries for up to 120 seconds to put the broker into maintenance mode before the pod is terminated. Redpanda could not finish putting one broker into maintenance mode within 120 seconds. This PR improves the situation by enabling maintenance mode before the pod is deleted. The `EnableMaintenanceMode` function is called repeatedly until the `Broker` function returns the correct status. The assumption is that the REST admin API maintenance mode endpoint is idempotent. Once the pod is successfully deleted, the StatefulSet reschedules it with the correct pod specification. redpanda-data#4125 redpanda-data#3023
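
The retry pattern described in that commit message can be sketched as follows. This is an illustration under stated assumptions, not the operator's code: `enable` and `in_maintenance` are hypothetical callables standing in for the enable call and the broker status check named above, and the timeout, polling interval and injectable clock are additions made for the example.

```python
import time
from typing import Callable


def ensure_maintenance_mode(enable: Callable[[], None],
                            in_maintenance: Callable[[], bool],
                            timeout_s: float,
                            poll_interval_s: float = 1.0,
                            clock: Callable[[], float] = time.monotonic,
                            sleep: Callable[[float], None] = time.sleep) -> bool:
    """Re-issue the enable request until the broker reports maintenance mode.

    Repeating the request is only safe if the endpoint is idempotent,
    which is the assumption stated in the commit message.
    """
    deadline = clock() + timeout_s
    while True:
        enable()
        if in_maintenance():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)
```

A caller would delete the pod only after this returns true, so the drain is no longer bounded by the PreStop hook's 120 seconds.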

RafalKorepta · 2023-01-08T18:13:01Z

It's done by:

k8s: Put brokers in maintenance mode before deleting orphan's pod #7530
k8s: Wait for restarted broker to catch up #7594
Get cluster health before an update #7528
and the previous implementation of the rolling update/upgrade

dotnwat added the area/k8s label Nov 19, 2021

dotnwat added this to the Rolling upgrade safety milestone Nov 19, 2021

flokli mentioned this issue Dec 3, 2021

updating the redpanda operator shouldn't restart/upgrade statefulsets for clusters with pinned versions #3150

Closed

ivotron modified the milestones: Rolling upgrade safety, v22.1.1 Feb 23, 2022

dotnwat removed this from the v22.1.1 (Stale) milestone Apr 26, 2022

dotnwat assigned nicolaferraro Apr 26, 2022

RafalKorepta self-assigned this Nov 25, 2022

RafalKorepta pushed a commit to RafalKorepta/redpanda that referenced this issue Nov 26, 2022

Get cluster health before an update

cf9d1db

RafalKorepta mentioned this issue Nov 26, 2022

Get cluster health before an update #7528

Merged

6 tasks

RafalKorepta mentioned this issue Nov 27, 2022

k8s: Put brokers in maintenance mode before deleting orphan's pod #7530

Merged

6 tasks

RafalKorepta pushed a commit to RafalKorepta/redpanda that referenced this issue Nov 28, 2022

Get cluster health before an update

feca796

RafalKorepta pushed a commit to RafalKorepta/redpanda that referenced this issue Nov 28, 2022

Get cluster health before an update

dc393e8

RafalKorepta pushed a commit to RafalKorepta/redpanda that referenced this issue Nov 29, 2022

Get cluster health before an update

8e4916d

RafalKorepta pushed a commit to RafalKorepta/redpanda that referenced this issue Nov 29, 2022

Get cluster health before an update

8fa6e62

RafalKorepta pushed a commit to RafalKorepta/redpanda that referenced this issue Dec 1, 2022

Get cluster health before an update

e0491d1

RafalKorepta mentioned this issue Dec 1, 2022

k8s: Wait for restarted broker to catch up #7594

Merged

6 tasks

RafalKorepta added a commit that referenced this issue Dec 5, 2022

Merge pull request #7528 from RafalKorepta/rk/gh-3023/wait-for-healthy-cluster

26dcdf6

Get cluster health before an update

jcsp added the kind/enhance New feature or request label Dec 12, 2022

RafalKorepta added a commit that referenced this issue Jan 5, 2023

Merge pull request #7530 from RafalKorepta/rk/gh-3023/put-in-maintanance-mode

c178778

k8s: Put brokers in maintenance mode before deleting orphan's pod

RafalKorepta added a commit that referenced this issue Jan 6, 2023

Merge pull request #7594 from RafalKorepta/rk/gh-3023/check-under-replicated-partitions-in-upgrade-procedure

5edf3ad

RafalKorepta closed this as completed Jan 8, 2023

joejulian pushed a commit to joejulian/redpanda that referenced this issue Mar 10, 2023

(split) Get cluster health before an update

6690ef3

joejulian pushed a commit to joejulian/redpanda that referenced this issue Mar 10, 2023

(split) Merge pull request redpanda-data#7594 from RafalKorepta/rk/redpanda-datagh-3023/check-under-replicated-partitions-in-upgrade-procedure

745e6f5

joejulian pushed a commit to joejulian/redpanda that referenced this issue Mar 24, 2023

operator: Get cluster health before an update

5862edd

joejulian pushed a commit to joejulian/redpanda that referenced this issue Mar 24, 2023

operator: Merge pull request redpanda-data#7594 from RafalKorepta/rk/redpanda-datagh-3023/check-under-replicated-partitions-in-upgrade-procedure

6e58377
