
Get cluster health before an update #7528

Conversation

@RafalKorepta RafalKorepta commented Nov 26, 2022

As per #3023, the cluster should be healthy before a node is put into
maintenance mode and after the Pod is restarted.

Backports Required

  • none - not a bug fix
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v22.3.x
  • v22.2.x
  • v22.1.x

UX Changes

Internal update logic is improved so that an update is not considered if the
cluster does not report a healthy status via the Admin API.

There is one regression that will be addressed later: if someone turns off mTLS
on the internal Admin API, the new logic does not know how to handle the
correct client certificate.
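
For illustration only, a minimal sketch of such a health gate, assuming the Admin API client exposes a GetHealthOverview call with an IsHealthy field (as rpk's admin package does); the helper and type names, and the RequeueAfter field, are hypothetical stand-ins rather than this PR's actual code:

package healthsketch

import (
	"context"
	"fmt"
	"time"
)

// ClusterHealthOverview is a trimmed-down stand-in for the Admin API response.
type ClusterHealthOverview struct {
	IsHealthy bool
}

// HealthOverviewer is the subset of the Admin API client this check needs.
type HealthOverviewer interface {
	GetHealthOverview(ctx context.Context) (ClusterHealthOverview, error)
}

// RequeueAfterError mirrors the error type visible in this PR's diff; the
// RequeueAfter field is an assumption made for this sketch.
type RequeueAfterError struct {
	RequeueAfter time.Duration
	Msg          string
}

func (e *RequeueAfterError) Error() string { return fmt.Sprintf("RequeueAfterError %s", e.Msg) }

// EnsureClusterHealthy returns an error when the cluster does not report a
// healthy status, so the caller can stop the update and requeue.
func EnsureClusterHealthy(ctx context.Context, client HealthOverviewer) error {
	overview, err := client.GetHealthOverview(ctx)
	if err != nil {
		return fmt.Errorf("getting cluster health overview: %w", err)
	}
	if !overview.IsHealthy {
		return &RequeueAfterError{RequeueAfter: 10 * time.Second, Msg: "wait for cluster to become healthy"}
	}
	return nil
}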

Release Notes

Improvements

  • Redpanda updates inside a Kubernetes environment are safer, as the operator
    no longer proceeds with an update if the cluster does not report a healthy
    status.

REF

#3023

@RafalKorepta RafalKorepta requested a review from a team as a code owner November 26, 2022 23:15
@RafalKorepta RafalKorepta force-pushed the rk/gh-3023/wait-for-healthy-cluster branch 4 times, most recently from 86f4565 to 7beaf40 on November 29, 2022 23:16
Rafal Korepta added 5 commits December 1, 2022 02:02
  • As per redpanda-data#3023, the cluster should be healthy before a node is
    put into maintenance mode and after the Pod is restarted.
  • In the statefulset unit test the Admin API needs to be mocked, as the
    cluster health should be available.
  • When the cluster is unhealthy, the upgrade/restart procedure should not be
    executed.
  • Before 22.X the cluster health overview is not available, so the tests can
    no longer upgrade from 21.X because the operator cannot validate the health
    status.
  • In the centralized configuration e2e test the cluster health cannot be
    retrieved if the required client authorization is removed from the Admin
    API. Nodes running with an mTLS configuration do not respond to the
    operator's get-health-overview call. If the first out of N brokers is
    restarted and stops serving its Admin API with the mTLS configuration, the
    rpk adminAPI implementation still sends HTTP requests to all brokers in
    sequence to get the health overview. The problem lies in the HTTP client
    and its TLS configuration, as one broker out of N no longer needs a client
    certificate (see the sketch after this list).
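
The TLS constraint in the last commit comes down to the fact that a Go HTTP client carries a single tls.Config, so the same client-certificate setup is presented to every broker it dials. Below is a minimal sketch of that constraint only, not the operator's or rpk's actual code; the function name and parameters are hypothetical:

package tlssketch

import (
	"crypto/tls"
	"crypto/x509"
	"net/http"
)

// newAdminHTTPClient builds a single HTTP client that would be reused for
// every broker's Admin API endpoint. The tls.Config, and therefore the client
// certificate, is fixed when the client is built, so during a rolling restart
// where brokers temporarily disagree on whether client authentication is
// required, one client configuration cannot match all of them.
func newAdminHTTPClient(clientCert tls.Certificate, caPool *x509.CertPool) *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{
				Certificates: []tls.Certificate{clientCert}, // same certificate offered to all N brokers
				RootCAs:      caPool,
				MinVersion:   tls.VersionTLS12,
			},
		},
	}
}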
@RafalKorepta RafalKorepta force-pushed the rk/gh-3023/wait-for-healthy-cluster branch from 7beaf40 to 2d775e6 on December 1, 2022 01:15

@alenkacz alenkacz (Contributor) left a comment

LGTM

@@ -420,6 +457,10 @@ func (e *RequeueAfterError) Error() string {
	return fmt.Sprintf("RequeueAfterError %s", e.Msg)
}

func (e *RequeueAfterError) Is(target error) bool {
	return e.Error() == target.Error()

Contributor:

do you really want to use == and not rather errors.Is?

Contributor Author:

Hmm. I implemented the Is function because I was not able to unit test that error otherwise. I can check on the side whether errors.Is would work, but it should work in the first place with a direct call.
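
For what it's worth, errors.Is does consult a custom Is method once it has unwrapped the chain, so the comparison above is exactly what errors.Is would invoke. A small self-contained sketch (the error type mirrors the diff above; the rest is illustrative):

package main

import (
	"errors"
	"fmt"
)

// RequeueAfterError mirrors the error type from the diff above.
type RequeueAfterError struct {
	Msg string
}

func (e *RequeueAfterError) Error() string {
	return fmt.Sprintf("RequeueAfterError %s", e.Msg)
}

// Is lets errors.Is match two RequeueAfterErrors by their message.
func (e *RequeueAfterError) Is(target error) bool {
	return e.Error() == target.Error()
}

func main() {
	err := fmt.Errorf("reconcile: %w", &RequeueAfterError{Msg: "wait for cluster to become healthy"})

	// errors.Is unwraps err and then calls the custom Is method above, so this
	// matches even though the target is a different instance.
	fmt.Println(errors.Is(err, &RequeueAfterError{Msg: "wait for cluster to become healthy"})) // true
}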

@@ -22,7 +22,6 @@ spec:
- port: 9644
tls:
enabled: true

Contributor:

just curious: why this change?

Contributor Author:

There is one regression that will be addressed later: if someone turns off mTLS
on the internal Admin API, the new logic does not know how to handle the
correct client certificate.

:(

@@ -4,7 +4,7 @@ metadata:
   name: centralized-configuration-upgrade
 spec:
   image: "vectorized/redpanda"
-  version: "v21.11.16"
+  version: "v22.1.10"

Contributor:

I assume this test now runs with the new feature gate enabled, right?

Contributor Author:

Yes, as we will sunset 21.11.X soon.

@RafalKorepta

/ci-repeat

	if err = r.updateStatefulSet(ctx, current, modified); err != nil {
		return err
	}

	if err = r.isClusterHealthy(ctx); err != nil {

Contributor:

I'm not too familiar with this part of the controller and I'm tired from a flight 😅 so please forgive me if this is a stupid question, but I thought we implement our own rolling update process. Should this check be after that?

Contributor Author:

It is inside our rolling update process. The runUpdate doc comment describes it:

// runUpdate handles image changes and additional storage in the redpanda cluster
// CR by removing the statefulset while orphaning its Pods. The statefulset is then
// recreated and all Pods are restarted according to their ordinal number.
//
// The process maintains a Restarting bool status that is set to true once the
// generated statefulset differs from the actual state. It is set back to
// false when all pods are verified.
//
// The steps are as follows: 1) check the Restarting status, or whether the statefulset
// differs from the currently stored statefulset definition 2) if true,
// set the Restarting status to true and remove the statefulset, orphaning its Pods
// 3) perform a rolling update by removing Pods according to their ordinal
// number 4) requeue until the pod is in a ready state 5) prior to a pod update,
// verify the previously updated pod and requeue as necessary. Currently, the
// verification checks that the pod has started listening on its HTTP Admin API port, and may be
// extended.
func (r *StatefulSetResource) runUpdate(
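
To connect that doc comment with this PR, here is a rough sketch of where the health gate sits in such a loop; every name here is a hypothetical stand-in for the operator's real methods, not its actual implementation:

package updatesketch

import (
	"context"
	"errors"
	"fmt"
)

// errRequeue stands in for the RequeueAfterError used by the operator.
var errRequeue = errors.New("requeue")

// updater abstracts the steps from the runUpdate doc comment above.
type updater interface {
	NeedsRestart(ctx context.Context) (bool, error)
	IsClusterHealthy(ctx context.Context) error // the gate added in this PR
	RollNextPod(ctx context.Context) (done bool, err error)
}

// runUpdate sketches the loop: only touch a Pod while the cluster reports
// healthy, otherwise requeue and try again on the next reconciliation.
func runUpdate(ctx context.Context, u updater) error {
	restarting, err := u.NeedsRestart(ctx)
	if err != nil || !restarting {
		return err
	}

	// Health gate before any operation on a Pod (and again after a restart,
	// because the next reconciliation re-enters this function).
	if err := u.IsClusterHealthy(ctx); err != nil {
		return fmt.Errorf("cluster not healthy: %w", errRequeue)
	}

	done, err := u.RollNextPod(ctx)
	if err != nil {
		return err
	}
	if !done {
		return errRequeue // come back for the next ordinal
	}
	return nil
}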

@@ -94,6 +100,37 @@ func (r *StatefulSetResource) runUpdate(
	return nil
}

func (r *StatefulSetResource) isClusterHealthy(ctx context.Context) error {

Contributor:

afaiu in #3023, at this point the cluster should be in maintenance mode. should we enforce it and add a check here?

Contributor Author (@RafalKorepta) commented Dec 5, 2022:

Points 1 and 6 of #3023 check the health of the Redpanda cluster.

In our current rolling update we cannot easily implement the logic described in #3023. That's why, before any operation on Pods (a single broker), I try to check the health of the cluster, as it should be a blocker.

	return nil
}

	adminAPIClient, err := r.getAdminAPIClient(ctx)

Contributor:

nit: not needed for this PR, but the admin API client is used everywhere and each resource has its own implementation (console, cluster, sts). I think we should put this into a util or find some other way so we stay DRY.

Contributor Author:

Yes, but there is a problem with commonality. I'm happy to be wrong, but I don't see it yet.
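
Purely to illustrate the reviewer's suggestion (and acknowledging the commonality problem mentioned above), a shared constructor could look roughly like the sketch below; every name is hypothetical and the real resources likely need a wider surface:

package adminsketch

import (
	"context"
	"crypto/tls"
)

// ClusterHealthOverview is a trimmed stand-in for the Admin API response type.
type ClusterHealthOverview struct {
	IsHealthy bool
}

// AdminAPI is the minimal surface the console, cluster and statefulset
// resources would share in this sketch.
type AdminAPI interface {
	GetHealthOverview(ctx context.Context) (ClusterHealthOverview, error)
}

// AdminAPIFactory centralizes client construction (broker URLs, TLS, client
// certificates) so each resource stops re-implementing it.
type AdminAPIFactory func(ctx context.Context, urls []string, tlsConfig *tls.Config) (AdminAPI, error)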

@pvsune pvsune (Contributor) left a comment

did initial review, mostly questions
