
k8s: implement safe rolling upgrade logic in operator #3023

Closed
dotnwat opened this issue Nov 19, 2021 · 10 comments
Labels: area/k8s, kind/enhance (New feature or request)

@dotnwat
Member

dotnwat commented Nov 19, 2021

The following is the generic upgrade procedure assumed in this document; it can be executed manually or automatically by a process such as a k8s operator (a sketch of the loop follows the list):

  1. Wait for healthy cluster state via the health monitor service
  2. Select a non-upgraded node and place it into maintenance mode
    • This may take some time to complete
    • If a cluster issue occurs
      • Revert maintenance mode
      • Goto (1)
  3. Once a node is in maintenance mode it may be shut down
  4. Execute node upgrade process
  5. Restart node
  6. Wait for healthy cluster state via the health monitor service
  7. Take node out of maintenance mode
  8. Goto (1)
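
A minimal sketch of how this loop might look in operator code (Go). All names here (`adminAPI`, `ClusterHealthy`, `SetMaintenanceMode`, the `upgrade` callback) are hypothetical placeholders, not the operator's actual API; only the control flow mirrors the numbered steps.

```go
package operator

import (
	"context"
	"fmt"
	"time"
)

// adminAPI abstracts the two calls the loop needs; the method set is a
// hypothetical stand-in for the operator's real admin client.
type adminAPI interface {
	ClusterHealthy(ctx context.Context) (bool, error)
	SetMaintenanceMode(ctx context.Context, node string, enabled bool) error
}

// rollingUpgrade applies steps 1-8 to every node that still needs an upgrade.
// The upgrade callback stands in for steps 3-5 (shutdown, upgrade, restart),
// e.g. deleting the pod so the StatefulSet recreates it from the new template.
func rollingUpgrade(ctx context.Context, api adminAPI, nodes []string,
	upgrade func(ctx context.Context, node string) error) error {
	// waitHealthy blocks until the health monitor reports a healthy cluster.
	waitHealthy := func() error {
		for {
			ok, err := api.ClusterHealthy(ctx)
			if err != nil {
				return err
			}
			if ok {
				return nil
			}
			select {
			case <-ctx.Done():
				return ctx.Err()
			case <-time.After(5 * time.Second):
			}
		}
	}

	for _, node := range nodes {
		if err := waitHealthy(); err != nil { // step 1
			return err
		}
		if err := api.SetMaintenanceMode(ctx, node, true); err != nil { // step 2
			// On a cluster issue, revert maintenance mode; the caller retries from step 1.
			_ = api.SetMaintenanceMode(ctx, node, false)
			return fmt.Errorf("draining %s: %w", node, err)
		}
		if err := upgrade(ctx, node); err != nil { // steps 3-5
			return err
		}
		if err := waitHealthy(); err != nil { // step 6
			return err
		}
		if err := api.SetMaintenanceMode(ctx, node, false); err != nil { // step 7
			return err
		}
	} // step 8: continue with the next non-upgraded node
	return nil
}
```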

Additional notes

@dotnwat dotnwat added this to the Rolling upgrade safety milestone Nov 19, 2021
@ivotron ivotron modified the milestones: Rolling upgrade safety, v22.1.1 Feb 23, 2022
@dotnwat dotnwat removed this from the v22.1.1 (Stale) milestone Apr 26, 2022
@jcsp
Contributor

jcsp commented Nov 3, 2022

@joejulian @dotnwat is this ticket still relevant?

@dotnwat
Member Author

dotnwat commented Nov 16, 2022

@joejulian is this covered now in the operator?

@joejulian
Contributor

It doesn't exactly follow those steps. It doesn't wait for anything external before taking it out of maintenance mode. If "Wait for healthy cluster state" can mean calling its own admin API, then it's probably close enough.

@jcsp
Contributor

jcsp commented Nov 21, 2022

It sounds like there is still work to do here: checking cluster health before proceeding with upgrades is important for robustness.

Scenario A: unexpected bug

Hypothetical Redpanda version has a bug that causes it to send RPCs to peers that cause the peers to fall over. The upgrade procedure upgrades one node, the upgraded node comes up quite happily, but other nodes start crashing. That should be the signal for the operator to stop the upgrade and roll back.

Scenario B: data recovery

The cluster is under write load. When upgrading (restarting) node 1, node 1 naturally falls behind on writes. Nodes 2+3 are still able to service writes. Then node 2 gets restarted while node 1 is still behind. While node 2 is offline, nodes 1+3 can form a quorum, but cannot service acks=-1 writes yet, because node 1 is behind: it can't service new writes at the tip of the log until it is back online. This manifests as a timeout to producers during the upgrade.

@joejulian
Contributor

A: rollback

This would have to roll back the state of the Cluster resource.

  • I'm not sure the previous state is actually saved, so this would need to be added.
  • When the Cluster gets reverted to a last-known-good state, how does the reconciler know to override the cluster-health check? (Cluster condition?)
  • What do we do if the previous configuration doesn't fix it? We should probably throw an event and trigger an alert from that event.

A: Press-on!

What if we didn't revert and, instead, pressed on if the first pod to roll came up healthy but the rest of the cluster fell down? Could we just panic flip all the rest of the pods?

@joejulian
Contributor

For B: What signal should we be looking for? When node 1 is rolled, comes back, takes itself out of maintenance mode - what signal does Redpanda give that the operator should be checking before moving on to node 2? It does check v1/status/ready. Is that sufficient?

@jcsp
Contributor

jcsp commented Nov 22, 2022

For B: What signal should we be looking for? When node 1 is rolled, comes back, takes itself out of maintenance mode - what signal does Redpanda give that the operator should be checking before moving on to node 2? It does check v1/status/ready. Is that sufficient?

The node readiness endpoint is not sufficient. /v1/status/ready only tells you that the node you touched is up (internally it's just a bool that gets set after the node opens its Kafka listener). For a safe upgrade, the essential check is that the overall cluster health is good, which includes things like:

  • Are the other nodes up? (i.e. did something fall over as in scenario A)
  • Are any partitions behind on replication? (i.e. do we need to wait to avoid scenario B)

/v1/cluster/health_overview is what gives you that cluster-wide status. It's not perfect (it's always possible for something to go wrong between the health GET and the actual upgrade), but it gives an excellent chance of backing off if something has gone dramatically wrong. Currently the main things it reports are whether any nodes are down and whether any partitions are leaderless, but it is the place we can extend in future to give that strong "scenario B" check that data replication is up to date.
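
A minimal sketch of gating the next restart on this endpoint instead of /v1/status/ready. The struct fields (`is_healthy`, `nodes_down`, `leaderless_partitions`) are assumptions about the response shape based on this discussion, not a verified schema:

```go
package operator

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
)

// healthOverview lists only the fields this gate relies on; the JSON names
// are assumptions about /v1/cluster/health_overview, not a verified schema.
type healthOverview struct {
	IsHealthy            bool              `json:"is_healthy"`
	NodesDown            []int             `json:"nodes_down"`
	LeaderlessPartitions []json.RawMessage `json:"leaderless_partitions"`
}

// clusterReadyForNextRestart refuses to proceed if any node is down
// (scenario A) or any partition is leaderless (the current proxy for scenario B).
func clusterReadyForNextRestart(ctx context.Context, adminURL string) (bool, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet,
		adminURL+"/v1/cluster/health_overview", nil)
	if err != nil {
		return false, err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return false, fmt.Errorf("health_overview returned %s", resp.Status)
	}
	var h healthOverview
	if err := json.NewDecoder(resp.Body).Decode(&h); err != nil {
		return false, err
	}
	return h.IsHealthy && len(h.NodesDown) == 0 && len(h.LeaderlessPartitions) == 0, nil
}
```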

@dotnwat please keep me honest: does this line up with recent discussions on local disk storage etc?

@jcsp
Contributor

jcsp commented Nov 22, 2022

A: Press-on!

I don't recommend this. On a major version upgrade, if you give up on a rolling upgrade and flash forward to updating all the nodes, then new feature flags will activate, and the cluster will start writing new-format data to disk. At this point the door slams shut for rolling back.

@dotnwat
Member Author

dotnwat commented Nov 23, 2022

@dotnwat please keep me honest: does this line up with recent discussions on local disk storage etc?

@jcsp yes. /v1/status/ready is certainly not sufficient. in fact, we didn't have a single endpoint that would be sufficient for the ephemeral disk scenario, so the proposal was a function of a couple of endpoints until core could enhance the existing health endpoint to be sufficient. this was written down somewhere in the context of the tt-local-disk channel on slack. i can't seem to find it right now, but I will look in the AM.

@RafalKorepta RafalKorepta self-assigned this Nov 25, 2022
RafalKorepta pushed a commit to RafalKorepta/redpanda that referenced this issue Nov 26, 2022
As per redpanda-data#3023 the cluster should
be healthy before starting to put a node into maintenance mode and after the pod is
restarted.
RafalKorepta pushed a commit to RafalKorepta/redpanda that referenced this issue Nov 27, 2022
During a rolling update, before this change, the Redpanda operator was calculating
the difference between the running pod specification and the stateful set pod template.
If the specifications did not match, the pod was deleted. From release v22.1.1 the
operator configures each broker with pod lifecycle hooks. In the PreStop hook the
script tries to put the broker into maintenance mode for up to 120 seconds before
the pod is terminated. Redpanda could not always finish putting one broker into
maintenance mode within those 120 seconds.

This PR improves the situation by putting the broker into maintenance mode before the
pod is deleted. The `EnableMaintanaceMode` function is called multiple times until the
`Broker` function returns the correct status. The assumption is that the REST admin API
maintenance mode endpoint is idempotent.

When the pod is successfully deleted, the stateful set reschedules it with the
correct pod specification.

redpanda-data#4125
redpanda-data#3023
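
A minimal sketch of the retry pattern described in that message, assuming a hypothetical admin client interface (`EnableMaintenanceMode`/`Broker` and the status fields below are stand-ins for the operator's real ones); it leans on the stated idempotency of the maintenance-mode endpoint rather than a fixed 120-second PreStop window:

```go
package operator

import (
	"context"
	"time"
)

// brokerStatus is a hypothetical view of what the client's Broker call
// returns; only the fields the loop checks are listed.
type brokerStatus struct {
	Draining bool
	Finished bool
}

// maintenanceClient is a hypothetical stand-in for the operator's admin client.
type maintenanceClient interface {
	EnableMaintenanceMode(ctx context.Context, nodeID int) error
	Broker(ctx context.Context, nodeID int) (brokerStatus, error)
}

// drainBroker keeps re-enabling maintenance mode, which is safe only because
// the admin API endpoint is assumed idempotent, until the broker reports that
// draining has finished. Only then does the caller delete the pod.
func drainBroker(ctx context.Context, c maintenanceClient, nodeID int) error {
	for {
		if err := c.EnableMaintenanceMode(ctx, nodeID); err != nil {
			return err
		}
		st, err := c.Broker(ctx, nodeID)
		if err != nil {
			return err
		}
		if st.Draining && st.Finished {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(2 * time.Second):
		}
	}
}
```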
RafalKorepta added a commit that referenced this issue Dec 5, 2022
…y-cluster

Get cluster health before an update
@jcsp jcsp added the kind/enhance New feature or request label Dec 12, 2022
RafalKorepta added a commit that referenced this issue Jan 5, 2023
…nce-mode

k8s: Put brokers in maintenance mode before deleting orphan's pod
RafalKorepta added a commit that referenced this issue Jan 6, 2023