-
Notifications
You must be signed in to change notification settings - Fork 577
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
tests: fix RpkTool.group_describe failure #5610
Conversation
If coordinator isn't available, we should retry, not throw. Fixes redpanda-data#5079
CI failures are all one of:
|
The other red-crossed checks are all the things that shouldn't run against a PR anyway, i think that's a side effect of how our CI works combined with me cancelling the original job (to schedule the repeats) at an inopportune moment. This should be good to go. |
return None | ||
else: | ||
raise | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this retry be better implemented at a lower level?
Currently rpk
(via franz-go
) retries the MetadataRequest
call once if its reply is missing the coordinator. Would it be better if it retried more than once, with probably an rpk
parameter providing a timeout value for retrying?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I would prefer rpk to be a bit more friendly do own some level of retries when it comes to this case: currently rpk generally passes through transient leaderless-ness or results of requests to a node that was no longer the leader -- this comes up in e.g. 'rpk topic describe' output as well as consumer groups, and it's increasingly common now that we have the leader balancer, data balancer etc.
Not necessarily actionable right now, but perhaps something for @r-vasquez to contemplate. Current behaviour is not wrong exactly, but we might want to broaden the range of situations that we classify as retryable.
except RpkException as e: | ||
if "COORDINATOR_NOT_AVAILABLE" in e.msg: | ||
# Transient, return None to retry | ||
return None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can some logging be useful here to record what's going on?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we already have the raw RPK output in debug logs -- in the event of a failure we can see what happened.
@dlex any outstanding concerns? Would be good to get this merged before next nightly runs. |
Cover letter
If coordinator isn't available, we should retry, not throw.
Fixes #5079
UX changes
None
Release notes