Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tests: fix RpkTool.group_describe failure #5610

Merged
merged 1 commit into from
Jul 26, 2022
Merged

Conversation

jcsp
Copy link
Contributor

@jcsp jcsp commented Jul 25, 2022

Cover letter

If coordinator isn't available, we should retry, not throw.

Fixes #5079

UX changes

None

Release notes

  • none

If coordinator isn't available, we should retry, not throw.

Fixes redpanda-data#5079
@jcsp jcsp added kind/bug Something isn't working area/tests labels Jul 25, 2022
@jcsp jcsp added the ci-repeat-5 repeat tests 5x concurrently to check for flakey tests; self-cancelling label Jul 25, 2022
@jcsp jcsp marked this pull request as ready for review July 25, 2022 10:34
@vbotbuildovich vbotbuildovich removed the ci-repeat-5 repeat tests 5x concurrently to check for flakey tests; self-cancelling label Jul 25, 2022
@jcsp
Copy link
Contributor Author

jcsp commented Jul 25, 2022

CI failures are all one of:

@jcsp
Copy link
Contributor Author

jcsp commented Jul 25, 2022

The other red-crossed checks are all the things that shouldn't run against a PR anyway, i think that's a side effect of how our CI works combined with me cancelling the original job (to schedule the repeats) at an inopportune moment.

This should be good to go.

return None
else:
raise

Copy link
Contributor

@dlex dlex Jul 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this retry be better implemented at a lower level?

Currently rpk (via franz-go) retries the MetadataRequest call once if its reply is missing the coordinator. Would it be better if it retried more than once, with probably an rpk parameter providing a timeout value for retrying?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I would prefer rpk to be a bit more friendly do own some level of retries when it comes to this case: currently rpk generally passes through transient leaderless-ness or results of requests to a node that was no longer the leader -- this comes up in e.g. 'rpk topic describe' output as well as consumer groups, and it's increasingly common now that we have the leader balancer, data balancer etc.

Not necessarily actionable right now, but perhaps something for @r-vasquez to contemplate. Current behaviour is not wrong exactly, but we might want to broaden the range of situations that we classify as retryable.

except RpkException as e:
if "COORDINATOR_NOT_AVAILABLE" in e.msg:
# Transient, return None to retry
return None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can some logging be useful here to record what's going on?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we already have the raw RPK output in debug logs -- in the event of a failure we can see what happened.

@jcsp
Copy link
Contributor Author

jcsp commented Jul 26, 2022

@dlex any outstanding concerns? Would be good to get this merged before next nightly runs.

@jcsp jcsp merged commit d73597f into redpanda-data:dev Jul 26, 2022
@jcsp jcsp deleted the issue-5079 branch July 26, 2022 18:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DescribeGroups failed in ConsumerGroupTest.test_dead_group_recovery.static_members=True
4 participants