k8s: implement safe rolling upgrade logic in operator #3023
Comments
@joejulian @dotnwat is this ticket still relevant?
@joejulian is this covered now in the operator?
It doesn't exactly follow those steps. It doesn't wait for anything external before taking it out of maintenance mode. If "Wait for healthy cluster state" could be that it can call its own admin api, then it's probably close enough.
It sounds like there is still work to do here: checking cluster health before proceeding with upgrades is important for robustness.

Scenario A: unexpected bug
A hypothetical Redpanda version has a bug that causes it to send RPCs to peers that cause the peers to fall over. The upgrade procedure upgrades one node, the upgraded node comes up quite happily, but other nodes start crashing. That should be the signal for the operator to stop the upgrade and roll back.

Scenario B: data recovery
The cluster is under write load. When upgrading (restarting) node 1, node 1 naturally falls behind on writes. Nodes 2+3 are still able to service writes. Then node 2 gets restarted while node 1 is still behind. While node 2 is offline, nodes 1+3 can form a quorum, but cannot service acks=-1 writes yet, because node 1 is behind: it can't service new writes at the tip of the log until it is back online. This manifests as a timeout to producers during the upgrade.
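
Scenario B can be illustrated with a deliberately simplified majority-commit model. This is a sketch under stated assumptions, not Redpanda's replication code: the function name `committed_offset`, the node names, and the offsets are all made up for illustration, and each replica is reduced to the highest log offset it holds.

```python
from typing import Mapping


def committed_offset(replica_offsets: Mapping[str, int]) -> int:
    """Highest offset that is present on a majority of replicas.

    Simplified model: a write counts as acknowledged (acks=-1) only once
    a majority of replicas hold it.
    """
    ordered = sorted(replica_offsets.values(), reverse=True)
    majority = len(ordered) // 2 + 1
    return ordered[majority - 1]


# Node 1 has just restarted and is still catching up; nodes 2 and 3 are current.
before = {"node1": 400, "node2": 1000, "node3": 1000}

# Node 2 is restarted too early and stays frozen at 1000 while offline.
# A new write lands on node 3 at offset 1001, but node 1 is still behind.
during = {"node1": 400, "node2": 1000, "node3": 1001}

# Only after node 1 catches up does the new write reach a majority.
after = {"node1": 1001, "node2": 1000, "node3": 1001}

print(committed_offset(before))  # 1000
print(committed_offset(during))  # 1000: offset 1001 is not yet acknowledged
print(committed_offset(after))   # 1001
```

In this model the producer's write at offset 1001 stays unacknowledged for as long as node 1 is catching up, which is the timeout described above. Waiting for node 1 to catch up before restarting node 2 avoids that window.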
A: rollback
This would have to roll back the state of the Cluster resource.

A: Press-on!
What if we didn't revert and, instead, pressed on if the first pod to roll came up healthy but the rest of the cluster fell down? Could we just panic flip all the rest of the pods?
For B: What signal should we be looking for? When node 1 is rolled, comes back, takes itself out of maintenance mode - what signal does Redpanda give that the operator should be checking before moving on to node 2? It does check |
The node readiness endpoint is not sufficient. /v1/status/ready is only telling you that the node you touched is up (it's internally just a bool that gets set after the node opens its kafka listener). For a safe upgrade, the essential check is that the overall cluster health is good: this includes things like:
@dotnwat please keep me honest: does this line up with recent discussions on local disk storage etc?
I don't recommend this. On a major version upgrade, if you give up on a rolling upgrade and flash forward to updating all the nodes, then new feature flags will activate, and the cluster will start writing new-format data to disk. At this point the door slams shut for rolling back.
@jcsp yes. /v1/status/ready is certainly not sufficient. in fact, we didn't have a single endpoint that would be sufficient for the ephemeral disk scenario, so the proposal was a function of a couple of endpoints until core could enhance the existing health endpoint to be sufficient. this was written down somewhere in the context of the tt-local-disk channel on slack. i can't seem to find it right now, but I will look in the AM.
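
The distinction drawn in this thread between "the restarted node is ready" and "the cluster is healthy" could be sketched as a gate like the one below. This is only a sketch: the dictionary keys (`is_healthy`, `nodes_down`, `leaderless_partitions`, `under_replicated_partitions`) and both function names are assumptions chosen for illustration, not a confirmed response schema of any admin API endpoint.

```python
from typing import Any, List, Mapping


def health_problems(overview: Mapping[str, Any]) -> List[str]:
    """Return the reasons a (hypothetical) cluster health overview is not clean."""
    problems: List[str] = []
    if not overview.get("is_healthy", False):
        problems.append("cluster does not report itself healthy")
    nodes_down = overview.get("nodes_down") or []
    if nodes_down:
        problems.append(f"nodes down: {sorted(nodes_down)}")
    if overview.get("leaderless_partitions"):
        problems.append("some partitions have no leader")
    if overview.get("under_replicated_partitions"):
        problems.append("some partitions are under-replicated")
    return problems


def safe_to_restart_next(node_ready: bool, overview: Mapping[str, Any]) -> bool:
    """Node readiness alone is not enough; the whole cluster must look clean too."""
    return node_ready and not health_problems(overview)
```

With a gate like this, a node that answers its readiness check while a peer is down or partitions are under-replicated would still block the next restart.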
As per redpanda-data#3023, the cluster should be healthy before a node is put into maintenance mode and after the pod is restarted.
During a rolling update, before this change, the Redpanda operator calculated the difference between the running pod specification and the stateful set pod template. If the specification did not match, the pod was deleted. From release v22.1.1, the operator configures each broker with pod lifecycle hooks. In the PreStop hook, the script tries to put the broker into maintenance mode for 120 seconds before the pod is terminated. Redpanda could not finish putting one broker into maintenance mode within 120 seconds. This PR improves the situation by enabling maintenance mode before the pod is deleted. The `EnableMaintanaceMode` function is called repeatedly until the `Broker` function returns the correct status. The assumption is that the REST admin API maintenance mode endpoint is idempotent. When the pod is successfully deleted, the stateful set reschedules the pod with the correct pod specification. redpanda-data#4125 redpanda-data#3023
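
The retry behaviour described in this commit message (call the enable function repeatedly until the broker status confirms it) might look roughly like the sketch below. It is not the operator's code: `drain_before_delete`, the injected callables, and the `maintenance_status` / `draining` / `finished` keys are hypothetical stand-ins, and the loop relies on the stated assumption that enabling maintenance mode is idempotent.

```python
import time
from typing import Any, Callable, Mapping


def drain_before_delete(
    node_id: int,
    enable_maintenance_mode: Callable[[int], None],
    get_broker: Callable[[int], Mapping[str, Any]],
    attempts: int = 10,
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once the broker reports that draining has finished.

    The enable call is repeated on every attempt, which is only safe
    if the endpoint behind it is idempotent (an assumption).
    """
    for attempt in range(attempts):
        enable_maintenance_mode(node_id)
        status = get_broker(node_id).get("maintenance_status", {})
        if status.get("draining") and status.get("finished"):
            return True  # safe to delete the pod now
        if attempt + 1 < attempts:
            sleep(delay_seconds)
    return False  # caller should not delete the pod yet
```

Passing the two API calls in as parameters keeps the sketch independent of any particular client library; a caller would delete the pod only when the function returns True.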
…y-cluster Get cluster health before an update
…nce-mode k8s: Put brokers in maintenance mode before deleting orphan's pod
…licated-partitions-in-upgrade-procedure
It's done by:
As per redpanda-data#3023, the cluster should be healthy before a node is put into maintenance mode and after the pod is restarted. (cherry picked from commit e0491d1)
During rolling update, before this change, Redpanda operator was calculating the difference between running pod specification and stateful set pod template. If the specification did not match the pod was deleted. From release v22.1.1 operator is configuring each broker with pod lifecycle hooks. In the PreStop hook the script will try to put broker into maintenance mode for 120 seconds before POD is terminated. Redpanda could not finish within 120 seconds to put one broker into maintenance mode. This PR improves the situation by putting maintenance mode before POD is deleted. The `EnableMaintanaceMode` function is called multiple times until `Broker` function returns correct status. The assumption is that REST admin API maintenance mode endpoint is idempotent. When pod is successfully deleted statefulset would reschedule the pod with correct pod specification. redpanda-data#4125 redpanda-data#3023 (cherry picked from commit 3c34855)
The following is the generic upgrade procedure assumed in this document, and is executable manually or automatically by a process such as a k8s operator:
Additional notes