
Get cluster health before an update #7528

Conversation

@RafalKorepta RafalKorepta commented Nov 26, 2022

As per #3023, the cluster should be healthy before a node is put into
maintenance mode and after the Pod is restarted.

Backports Required

  • none - not a bug fix
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v22.3.x
  • v22.2.x
  • v22.1.x

UX Changes

Internal update logic is improved so that an update is not considered if the
cluster does not report a healthy status via the Admin API.

There is one regression that will be addressed later: if someone turns off mTLS
on the internal Admin API, the new logic does not know how to handle the
correct client certificate.
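
For illustration only, a minimal sketch of such a health gate, assuming the Admin API client exposes a GetHealthOverview call with an IsHealthy field (as rpk's admin package does); the helper and type names, and the RequeueAfter field, are hypothetical stand-ins rather than this PR's actual code:

package healthsketch

import (
	"context"
	"fmt"
	"time"
)

// ClusterHealthOverview is a trimmed-down stand-in for the Admin API response.
type ClusterHealthOverview struct {
	IsHealthy bool
}

// HealthOverviewer is the subset of the Admin API client this check needs.
type HealthOverviewer interface {
	GetHealthOverview(ctx context.Context) (ClusterHealthOverview, error)
}

// RequeueAfterError mirrors the error type visible in this PR's diff; the
// RequeueAfter field is an assumption made for this sketch.
type RequeueAfterError struct {
	RequeueAfter time.Duration
	Msg          string
}

func (e *RequeueAfterError) Error() string { return fmt.Sprintf("RequeueAfterError %s", e.Msg) }

// EnsureClusterHealthy returns an error when the cluster does not report a
// healthy status, so the caller can stop the update and requeue.
func EnsureClusterHealthy(ctx context.Context, client HealthOverviewer) error {
	overview, err := client.GetHealthOverview(ctx)
	if err != nil {
		return fmt.Errorf("getting cluster health overview: %w", err)
	}
	if !overview.IsHealthy {
		return &RequeueAfterError{RequeueAfter: 10 * time.Second, Msg: "wait for cluster to become healthy"}
	}
	return nil
}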

Release Notes

Improvements

  • Redpanda updates inside a Kubernetes environment are safer, as the operator
    no longer proceeds with an update if the cluster does not report a healthy
    status.

REF

#3023

@RafalKorepta RafalKorepta requested a review from a team as a code owner November 26, 2022 23:15
@RafalKorepta RafalKorepta force-pushed the rk/gh-3023/wait-for-healthy-cluster branch 4 times, most recently from 86f4565 to 7beaf40 on November 29, 2022 23:16
Rafal Korepta added 5 commits December 1, 2022 02:02
  • As per redpanda-data#3023, the cluster should be healthy before a node is
    put into maintenance mode and after the Pod is restarted.
  • In the statefulset unit test the Admin API needs to be mocked, as the
    cluster health should be available.
  • When the cluster is unhealthy, the upgrade/restart procedure should not be
    executed.
  • Before 22.X the cluster health overview is not available, so the tests can
    no longer upgrade from 21.X because the operator cannot validate the health
    status.
  • In the centralized configuration e2e test the cluster health cannot be
    retrieved if the required client authorization is removed from the Admin
    API. Nodes running with an mTLS configuration do not respond to the
    operator's get-health-overview call. If the first out of N brokers is
    restarted and stops serving its Admin API with the mTLS configuration, the
    rpk adminAPI implementation still sends HTTP requests to all brokers in
    sequence to get the health overview. The problem lies in the HTTP client
    and its TLS configuration, as one broker out of N no longer needs a client
    certificate (see the sketch after this list).
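
The TLS constraint in the last commit comes down to the fact that a Go HTTP client carries a single tls.Config, so the same client-certificate setup is presented to every broker it dials. Below is a minimal sketch of that constraint only, not the operator's or rpk's actual code; the function name and parameters are hypothetical:

package tlssketch

import (
	"crypto/tls"
	"crypto/x509"
	"net/http"
)

// newAdminHTTPClient builds a single HTTP client that would be reused for
// every broker's Admin API endpoint. The tls.Config, and therefore the client
// certificate, is fixed when the client is built, so during a rolling restart
// where brokers temporarily disagree on whether client authentication is
// required, one client configuration cannot match all of them.
func newAdminHTTPClient(clientCert tls.Certificate, caPool *x509.CertPool) *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{
				Certificates: []tls.Certificate{clientCert}, // same certificate offered to all N brokers
				RootCAs:      caPool,
				MinVersion:   tls.VersionTLS12,
			},
		},
	}
}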
@RafalKorepta RafalKorepta force-pushed the rk/gh-3023/wait-for-healthy-cluster branch from 7beaf40 to 2d775e6 on December 1, 2022 01:15

@alenkacz alenkacz (Contributor) left a comment

LGTM

@@ -420,6 +457,10 @@ func (e *RequeueAfterError) Error() string {
	return fmt.Sprintf("RequeueAfterError %s", e.Msg)
}

func (e *RequeueAfterError) Is(target error) bool {
	return e.Error() == target.Error()

Contributor:

do you really want to use == and not rather errors.Is?

Contributor Author:

Hmm. I implemented the Is function because I was not able to unit test that error otherwise. I can check on the side whether errors.Is would work, but it should work in the first place with a direct call.
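
For what it's worth, errors.Is does consult a custom Is method once it has unwrapped the chain, so the comparison above is exactly what errors.Is would invoke. A small self-contained sketch (the error type mirrors the diff above; the rest is illustrative):

package main

import (
	"errors"
	"fmt"
)

// RequeueAfterError mirrors the error type from the diff above.
type RequeueAfterError struct {
	Msg string
}

func (e *RequeueAfterError) Error() string {
	return fmt.Sprintf("RequeueAfterError %s", e.Msg)
}

// Is lets errors.Is match two RequeueAfterErrors by their message.
func (e *RequeueAfterError) Is(target error) bool {
	return e.Error() == target.Error()
}

func main() {
	err := fmt.Errorf("reconcile: %w", &RequeueAfterError{Msg: "wait for cluster to become healthy"})

	// errors.Is unwraps err and then calls the custom Is method above, so this
	// matches even though the target is a different instance.
	fmt.Println(errors.Is(err, &RequeueAfterError{Msg: "wait for cluster to become healthy"})) // true
}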

@@ -22,7 +22,6 @@ spec:
- port: 9644
tls:
enabled: true

Contributor:

just curious: why this change?

Contributor Author:

There is one regression that will be addressed later: if someone turns off mTLS
on the internal Admin API, the new logic does not know how to handle the
correct client certificate.

:(

@@ -4,7 +4,7 @@ metadata:
   name: centralized-configuration-upgrade
 spec:
   image: "vectorized/redpanda"
-  version: "v21.11.16"
+  version: "v22.1.10"

Contributor:

I assume this test now runs with the new feature gate enabled, right?

Contributor Author:

Yes, as we will sunset 21.11.X soon.

@RafalKorepta

/ci-repeat

	if err = r.updateStatefulSet(ctx, current, modified); err != nil {
		return err
	}

	if err = r.isClusterHealthy(ctx); err != nil {

Contributor:

I'm not too familiar with this part of the controller and I'm tired from a flight 😅 so please forgive me if this is a stupid question, but I thought we implement our own rolling update process. Should this check be after that?

Contributor Author:

It is inside our rolling update process. The runUpdate doc comment describes it:

// runUpdate handles image changes and additional storage in the redpanda cluster
// CR by removing the statefulset while orphaning its Pods. The statefulset is then
// recreated and all Pods are restarted according to their ordinal number.
//
// The process maintains a Restarting bool status that is set to true once the
// generated statefulset differs from the actual state. It is set back to
// false when all pods are verified.
//
// The steps are as follows: 1) check the Restarting status, or whether the statefulset
// differs from the currently stored statefulset definition 2) if true,
// set the Restarting status to true and remove the statefulset, orphaning its Pods
// 3) perform a rolling update by removing Pods according to their ordinal
// number 4) requeue until the pod is in a ready state 5) prior to a pod update,
// verify the previously updated pod and requeue as necessary. Currently, the
// verification checks that the pod has started listening on its HTTP Admin API port, and may be
// extended.
func (r *StatefulSetResource) runUpdate(
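
To connect that doc comment with this PR, here is a rough sketch of where the health gate sits in such a loop; every name here is a hypothetical stand-in for the operator's real methods, not its actual implementation:

package updatesketch

import (
	"context"
	"errors"
	"fmt"
)

// errRequeue stands in for the RequeueAfterError used by the operator.
var errRequeue = errors.New("requeue")

// updater abstracts the steps from the runUpdate doc comment above.
type updater interface {
	NeedsRestart(ctx context.Context) (bool, error)
	IsClusterHealthy(ctx context.Context) error // the gate added in this PR
	RollNextPod(ctx context.Context) (done bool, err error)
}

// runUpdate sketches the loop: only touch a Pod while the cluster reports
// healthy, otherwise requeue and try again on the next reconciliation.
func runUpdate(ctx context.Context, u updater) error {
	restarting, err := u.NeedsRestart(ctx)
	if err != nil || !restarting {
		return err
	}

	// Health gate before any operation on a Pod (and again after a restart,
	// because the next reconciliation re-enters this function).
	if err := u.IsClusterHealthy(ctx); err != nil {
		return fmt.Errorf("cluster not healthy: %w", errRequeue)
	}

	done, err := u.RollNextPod(ctx)
	if err != nil {
		return err
	}
	if !done {
		return errRequeue // come back for the next ordinal
	}
	return nil
}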

@@ -94,6 +100,37 @@ func (r *StatefulSetResource) runUpdate(
	return nil
}

func (r *StatefulSetResource) isClusterHealthy(ctx context.Context) error {

Contributor:

afaiu in #3023, at this point the cluster should be in maintenance mode. should we enforce it and add a check here?

Contributor Author (@RafalKorepta) commented Dec 5, 2022:

Points 1 and 6 of #3023 check the health of the Redpanda cluster.

In our current rolling update we cannot easily implement the logic described in #3023. That's why, before any operation on Pods (a single broker), I try to check the health of the cluster, as it should be a blocker.

	return nil
}

	adminAPIClient, err := r.getAdminAPIClient(ctx)

Contributor:

nit: not needed for this PR, but the admin API client is used everywhere and each resource has its own implementation (console, cluster, sts). I think we should put this into a util or find some other way so we stay DRY.

Contributor Author:

Yes, but there is a problem with commonality. I'm happy to be wrong, but I don't see it yet.
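
Purely to illustrate the reviewer's suggestion (and acknowledging the commonality problem mentioned above), a shared constructor could look roughly like the sketch below; every name is hypothetical and the real resources likely need a wider surface:

package adminsketch

import (
	"context"
	"crypto/tls"
)

// ClusterHealthOverview is a trimmed stand-in for the Admin API response type.
type ClusterHealthOverview struct {
	IsHealthy bool
}

// AdminAPI is the minimal surface the console, cluster and statefulset
// resources would share in this sketch.
type AdminAPI interface {
	GetHealthOverview(ctx context.Context) (ClusterHealthOverview, error)
}

// AdminAPIFactory centralizes client construction (broker URLs, TLS, client
// certificates) so each resource stops re-implementing it.
type AdminAPIFactory func(ctx context.Context, urls []string, tlsConfig *tls.Config) (AdminAPI, error)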

@pvsune pvsune (Contributor) left a comment

did initial review, mostly questions
