tests: fix RpkTool.group_describe failure #5610

jcsp · 2022-07-25T10:31:30Z

Cover letter

If coordinator isn't available, we should retry, not throw.

Fixes #5079

UX changes

None

Release notes

none
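
For readers without the diff to hand, the retry-not-throw idea can be sketched roughly as below. This is a minimal, self-contained illustration, not the code in `rpk.py`: `RpkException` here is a stand-in that only assumes a `.msg` attribute, and `poll_until_result`, `describe_with_retry`, and the `describe` callable are hypothetical names.

```python
import time


class RpkException(Exception):
    """Stand-in for the test suite's exception type; only `.msg` is assumed."""

    def __init__(self, msg):
        super().__init__(msg)
        self.msg = msg


def poll_until_result(attempt, timeout_sec=30.0, backoff_sec=1.0):
    """Call `attempt` until it returns a non-None value or the timeout expires."""
    deadline = time.monotonic() + timeout_sec
    while True:
        result = attempt()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("no result before timeout")
        time.sleep(backoff_sec)


def describe_with_retry(describe, timeout_sec=30.0, backoff_sec=1.0):
    """`describe` is a hypothetical zero-argument callable that runs the rpk command."""

    def attempt():
        try:
            return describe()
        except RpkException as e:
            if "COORDINATOR_NOT_AVAILABLE" in e.msg:
                # Transient: returning None asks the poller to try again.
                return None
            # Anything else is a real failure and should surface.
            raise

    return poll_until_result(attempt, timeout_sec, backoff_sec)
```

The inner function returns `None` only for the one error string treated as transient, so any other failure still propagates to the test immediately.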


jcsp · 2022-07-25T14:16:20Z

CI failures are all one of:

EndToEndTopicRecovery.test_restore (Not all data is uploaded to S3 bucket in EndToEndTopicRecovery.test_restore #5474)
PartitionBalancerTest.test_movement_cancellations (Failure in PartitionBalancerTest.test_movement_cancellations #5531)
PartitionBalancerTest.test_unavailable_nodes (Failure in PartitionBalancerTest.test_unavailable_nodes #5471)

jcsp · 2022-07-25T14:18:18Z

The other red-crossed checks are all things that shouldn't run against a PR anyway. I think that's a side effect of how our CI works, combined with me cancelling the original job (to schedule the repeats) at an inopportune moment.

This should be good to go.

dlex · 2022-07-25T16:35:41Z

tests/rptest/clients/rpk.py

+                    return None
+                else:
+                    raise
+


Would this retry be better implemented at a lower level?

Currently rpk (via franz-go) retries the MetadataRequest call once if its reply is missing the coordinator. Would it be better if it retried more than once, perhaps with an rpk parameter that sets a timeout for retrying?

Yes, I would prefer rpk to be a bit more friendly and own some level of retries in this case. Currently rpk generally passes through transient leaderless-ness, or the results of requests to a node that is no longer the leader. This comes up in e.g. 'rpk topic describe' output as well as consumer groups, and it's increasingly common now that we have the leader balancer, data balancer etc.

Not necessarily actionable right now, but perhaps something for @r-vasquez to contemplate. Current behaviour is not wrong exactly, but we might want to broaden the range of situations that we classify as retryable.
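
As a rough illustration of what "broadening the range of situations we classify as retryable" could mean on the test side, one option is a small set of error names checked against the command output. The names below are Kafka protocol error codes, but which of them should count as transient is purely an assumption of this sketch, and `RETRYABLE_ERRORS` and `is_retryable` are hypothetical helpers, not part of rpk or the test suite.

```python
# Kafka protocol error names; treating these as transient is this sketch's assumption.
RETRYABLE_ERRORS = frozenset({
    "COORDINATOR_NOT_AVAILABLE",
    "NOT_COORDINATOR",
    "LEADER_NOT_AVAILABLE",
    "NOT_LEADER_FOR_PARTITION",
})


def is_retryable(output: str) -> bool:
    """Return True if the command output mentions one of the transient errors."""
    return any(name in output for name in RETRYABLE_ERRORS)
```

A substring check like this mirrors the `in e.msg` test in the diff; keeping the names in one place would make it cheap to widen or narrow the retryable set later.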

dlex · 2022-07-25T16:39:04Z

tests/rptest/clients/rpk.py

+            except RpkException as e:
+                if "COORDINATOR_NOT_AVAILABLE" in e.msg:
+                    # Transient, return None to retry
+                    return None


Would some logging be useful here to record what's going on?

I think we already have the raw RPK output in debug logs -- in the event of a failure we can see what happened.
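
If explicit logging were wanted in addition to the raw output, it might look something like the sketch below. It uses only the standard `logging` module; the logger name and the `classify_failure` helper are hypothetical and do not exist in `rpk.py`.

```python
import logging

logger = logging.getLogger("rpk_retry_sketch")  # hypothetical logger name


def classify_failure(msg: str, attempt: int) -> bool:
    """Return True if the failure is transient; log the decision either way."""
    if "COORDINATOR_NOT_AVAILABLE" in msg:
        # Transient case: record it quietly so retries are visible in debug logs.
        logger.debug("attempt %d: coordinator not available, will retry", attempt)
        return True
    # Non-transient case: record the message that is about to be re-raised.
    logger.error("attempt %d: non-retryable failure: %s", attempt, msg)
    return False
```

Logging the transient case at debug level keeps normal test output unchanged while still leaving a trace of how many retries were needed.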

jcsp · 2022-07-26T13:31:04Z

@dlex any outstanding concerns? Would be good to get this merged before next nightly runs.

tests: fix RpkTool.group_describe failure

c4747e2

If coordinator isn't available, we should retry, not throw. Fixes redpanda-data#5079

jcsp added the kind/bug and area/tests labels Jul 25, 2022

jcsp mentioned this pull request Jul 25, 2022

DescribeGroups failed in ConsumerGroupTest.test_dead_group_recovery.static_members=True #5079

Closed

jcsp added the ci-repeat-5 label (repeat tests 5x concurrently to check for flakey tests; self-cancelling) Jul 25, 2022

jcsp marked this pull request as ready for review July 25, 2022 10:34

jcsp requested review from dotnwat and NyaliaLui as code owners July 25, 2022 10:34

vbotbuildovich removed the ci-repeat-5 label Jul 25, 2022

dlex reviewed Jul 25, 2022


mmedenjak added the ci-failure label Jul 26, 2022

dlex approved these changes Jul 26, 2022


jcsp merged commit d73597f into redpanda-data:dev Jul 26, 2022

jcsp deleted the issue-5079 branch July 26, 2022 18:33
