Get cluster health before an update #7528
Conversation
Force-pushed from 86f4565 to 7beaf40
As per redpanda-data#3023, the cluster should be healthy before a node is put into maintenance mode and after a POD is restarted.
In the statefulset unit test the admin API needs to be mocked, as the cluster health must be available.
When the cluster is unhealthy, the upgrade/restart procedure should not be executed.
Before 22.X the cluster health overview is not available, so the tests could not upgrade from 21.X, because the operator could not validate the health status.
In the centralized configuration e2e test the cluster health cannot be retrieved if the required client authorization is removed from the Admin API. Nodes running with an mTLS configuration do not respond to the operator's get-health-overview request. If the first of N brokers is restarted and stops serving its Admin API with the mTLS configuration, the rpk adminAPI implementation still sends the get-health-overview HTTP request to all brokers in sequence. The problem lies in the HTTP client and its TLS configuration, as one broker out of N no longer needs a client certificate.
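To make that failure mode concrete, here is a minimal standalone illustration, not operator code: the hostnames, certificate paths, and the /v1/cluster/health_overview path are assumptions. It shows why a single shared TLS configuration cannot serve a fleet where only some brokers still require mTLS.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
)

func main() {
	// Hypothetical paths; in the operator these would come from mounted secrets.
	clientCert, err := tls.LoadX509KeyPair("admin-client.crt", "admin-client.key")
	if err != nil {
		fmt.Println("load client cert:", err)
		return
	}

	// One shared client means one TLS configuration for all brokers.
	shared := &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{
				Certificates: []tls.Certificate{clientCert}, // presented to every broker
			},
		},
	}

	// Hypothetical in-cluster Admin API addresses.
	brokers := []string{
		"https://redpanda-0:9644", // restarted first; may no longer require mTLS
		"https://redpanda-1:9644",
		"https://redpanda-2:9644",
	}

	// The health overview is fetched from brokers in sequence with the same
	// client, which is where one-config-for-N-brokers breaks down.
	for _, b := range brokers {
		resp, err := shared.Get(b + "/v1/cluster/health_overview")
		if err != nil {
			fmt.Println(b, "error:", err)
			continue
		}
		fmt.Println(b, "status:", resp.Status)
		resp.Body.Close()
	}
}
```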
Force-pushed from 7beaf40 to 2d775e6
LGTM
@@ -420,6 +457,10 @@ func (e *RequeueAfterError) Error() string {
 	return fmt.Sprintf("RequeueAfterError %s", e.Msg)
 }
+
+func (e *RequeueAfterError) Is(target error) bool {
+	return e.Error() == target.Error()
do you really want to use == and not rather errors.Is?
Hmm. I implemented the Is function because I was not able to unit test that error otherwise. I can check on the side whether errors.Is would work, but it should work in the first place in a direct call.
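For context, here is a minimal standalone sketch, independent of the PR, of how errors.Is consults a custom Is method; this variant matches by message via errors.As rather than comparing Error() strings with ==, which is one way the two suggestions combine.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

type RequeueAfterError struct {
	RequeueAfter time.Duration
	Msg          string
}

func (e *RequeueAfterError) Error() string {
	return fmt.Sprintf("RequeueAfterError %s", e.Msg)
}

// Is lets errors.Is treat two RequeueAfterError values with the same message
// as equal, instead of comparing pointer identity.
func (e *RequeueAfterError) Is(target error) bool {
	var other *RequeueAfterError
	if !errors.As(target, &other) {
		return false
	}
	return e.Msg == other.Msg
}

func main() {
	wrapped := fmt.Errorf("reconcile failed: %w", &RequeueAfterError{Msg: "cluster not healthy"})

	// errors.Is unwraps the chain and calls the custom Is method it finds,
	// which is what makes this error assertable in a unit test even when wrapped.
	fmt.Println(errors.Is(wrapped, &RequeueAfterError{Msg: "cluster not healthy"})) // true
	fmt.Println(errors.Is(wrapped, &RequeueAfterError{Msg: "other"}))               // false
}
```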
@@ -22,7 +22,6 @@ spec:
   - port: 9644
     tls:
       enabled: true
just curious: why this change?
There is one regression that will be addressed later: if someone wants to turn off mTLS on the internal Admin API, the new logic does not know how to handle the client certificate correctly.
:(
@@ -4,7 +4,7 @@ metadata:
   name: centralized-configuration-upgrade
 spec:
   image: "vectorized/redpanda"
-  version: "v21.11.16"
+  version: "v22.1.10"
I assume this test now runs with the new feature gate enabled, right?
Yes, as we will sunset 21.11.X soon.
/ci-repeat
 	if err = r.updateStatefulSet(ctx, current, modified); err != nil {
 		return err
 	}
+
+	if err = r.isClusterHealthy(ctx); err != nil {
i'm not too familiar with this part of the controller and i'm tired from the flight 😅 please forgive me if this is a stupid q, but i thought we implement our own rolling update process. should this check be after that?
It is inside our rolling update process. See redpanda/src/go/k8s/pkg/resources/statefulset_update.go, lines 43 to 59 in 2d775e6:
// runUpdate handles image changes and additional storage in the redpanda cluster
// CR by removing the statefulset while orphaning its Pods. The stateful set is
// then recreated and all Pods are restarted according to their ordinal number.
//
// The process maintains a Restarting bool status that is set to true once the
// generated stateful set differs from the actual state. It is set back to
// false when all pods are verified.
//
// The steps are as follows: 1) check the Restarting status, or whether the
// statefulset differs from the currently stored statefulset definition; 2) if so,
// set the Restarting status to true and remove the statefulset, orphaning its
// Pods; 3) perform the rolling update by removing Pods according to their
// ordinal number; 4) requeue until the pod is in a ready state; 5) prior to a
// pod update, verify the previously updated pod and requeue as necessary.
// Currently, the verification checks that the pod has started listening on its
// http Admin API port, and may be extended.
func (r *StatefulSetResource) runUpdate(
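For orientation, here is a loose sketch of the step order that comment describes, with the health gate this PR adds in place. updateStatefulSet and isClusterHealthy appear in this PR's diff; rollingUpdate and the parameter types are assumed names for illustration, not necessarily the file's actual ones.

```go
// A loose sketch of runUpdate's flow, not the repository's actual code.
func (r *StatefulSetResource) runUpdateSketch(
	ctx context.Context, current, modified *appsv1.StatefulSet,
) error {
	// Steps 1-2: detect drift, set the Restarting status, and recreate the
	// statefulset while leaving the existing Pods orphaned.
	if err := r.updateStatefulSet(ctx, current, modified); err != nil {
		return err
	}

	// New in this PR: refuse to touch any Pod while the cluster is unhealthy.
	if err := r.isClusterHealthy(ctx); err != nil {
		return err
	}

	// Steps 3-5: delete Pods by ordinal, requeueing until each replacement is
	// ready and answers on its Admin API port before moving to the next.
	return r.rollingUpdate(ctx, &modified.Spec.Template) // assumed helper name
}
```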
@@ -94,6 +100,37 @@ func (r *StatefulSetResource) runUpdate(
 	return nil
 }
+
+func (r *StatefulSetResource) isClusterHealthy(ctx context.Context) error {
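The function body is truncated in the hunk above. Purely as a hedged sketch of what such a gate might look like: getAdminAPIClient is shown elsewhere in this PR, GetHealthOverview is the rpk admin client's health-overview call, and defaultRequeueTimeout is an assumed constant.

```go
// A hedged sketch, not the PR's actual body: gate the update on the Admin API
// cluster health overview.
func (r *StatefulSetResource) isClusterHealthy(ctx context.Context) error {
	adminAPIClient, err := r.getAdminAPIClient(ctx)
	if err != nil {
		return fmt.Errorf("creating admin API client: %w", err)
	}

	health, err := adminAPIClient.GetHealthOverview(ctx)
	if err != nil {
		return fmt.Errorf("getting cluster health overview: %w", err)
	}

	if !health.IsHealthy {
		// Requeue instead of failing hard; the cluster may recover on its own
		// and the update should simply wait for it.
		return &RequeueAfterError{
			RequeueAfter: defaultRequeueTimeout, // assumed constant
			Msg:          "wait for cluster to become healthy before continuing the update",
		}
	}
	return nil
}
```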
afaiu in #3023, at this point the cluster should be in maintenance mode. should we enforce it and add a check here?
		return nil
	}

	adminAPIClient, err := r.getAdminAPIClient(ctx)
nit: not needed for this PR, but the admin api client is used everywhere and each resource has its own implementation (console, cluster, sts). i think we should put this in a util or some other shared place so we DRY
Yes, but there is a problem with commonality. I'm happy to be wrong, but I don't see it yet.
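For what the suggestion could look like, here is a hypothetical sketch of a single shared constructor that the console, cluster, and statefulset resources could all call. The package name, the TLSProvider interface, and the exact shape of rpk's admin.NewAdminAPI call are all assumptions, not the operator's actual API.

```go
// Hypothetical DRY-ed helper; every name here is illustrative.
package adminutils

import (
	"context"
	"crypto/tls"

	"github.com/redpanda-data/redpanda/src/go/rpk/pkg/api/admin"
)

// TLSProvider abstracts however each resource currently resolves its Admin API
// TLS configuration (cert-manager secrets, operator-managed certs, or none).
type TLSProvider interface {
	AdminAPITLS(ctx context.Context) (*tls.Config, error)
}

// NewAdminAPI would replace the per-resource copies of client construction.
// The admin.NewAdminAPI signature (URLs, auth, TLS config) is assumed here.
func NewAdminAPI(ctx context.Context, urls []string, provider TLSProvider) (*admin.AdminAPI, error) {
	tlsCfg, err := provider.AdminAPITLS(ctx)
	if err != nil {
		return nil, err
	}
	return admin.NewAdminAPI(urls, new(admin.NopAuth), tlsCfg)
}
```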
did initial review, mostly questions
As per #3023, the cluster should be healthy before a node is put into maintenance mode and after a POD is restarted.
Backports Required
UX Changes
Internal update logic is improved so that an update is not considered if the cluster is not reporting a healthy status via the Admin API.
There is one regression that will be addressed later: if someone wants to turn off mTLS on the internal Admin API, the new logic does not know how to handle the client certificate correctly.
Release Notes
Improvements
Cluster updates are safer, as the operator now doesn't consider updates if the cluster does not report a healthy status.
REF
#3023