More robust waiting for the quiescent state in partition balancer tests #6007

Merged: 6 commits, Aug 15, 2022

@ztlpn ztlpn commented Aug 13, 2022

Cover letter

Because the unavailability timer resets every time the controller leader changes, robustly waiting for the timer to elapse in the test context is hard. Instead, we simply wait until the unavailable node appears in the "violations" status field.

Backport Required

  • not a bug fix
  • papercut/not impactful enough to backport
  • v22.2.x
  • v22.1.x
  • v21.11.x

UX changes

none

Release notes

  • none

Because the unavailability timer resets every time the controller leader
changes, robustly waiting for the timer to elapse is hard. Instead, we
simply wait until the unavailable node appears in the "violations"
status field.
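
A minimal sketch of this kind of wait, not the code from this PR. It polls a status callable until a given node shows up under "violations". The function name, the `get_status` callable, the `unavailable_nodes` key, and the default timings are all assumptions made for illustration; only the "violations" field itself is named in the commit message.

```python
import time
from typing import Any, Callable, Dict


def wait_for_node_in_violations(
    get_status: Callable[[], Dict[str, Any]],  # hypothetical: returns the balancer status as a dict
    node_id: int,
    timeout_sec: float = 120.0,  # assumed default, not taken from the PR
    backoff_sec: float = 2.0,    # assumed default, not taken from the PR
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, Any]:
    """Poll until node_id is reported as unavailable, then return that status."""
    deadline = clock() + timeout_sec
    while True:
        status = get_status()
        # "violations" comes from the PR text; "unavailable_nodes" is an assumed key name.
        unavailable = status.get("violations", {}).get("unavailable_nodes", [])
        if node_id in unavailable:
            return status
        if clock() >= deadline:
            raise TimeoutError(
                f"node {node_id} did not appear in violations within {timeout_sec}s"
            )
        sleep(backoff_sec)
```

The point of the design is that the wait depends on observable state rather than on wall-clock timing, so a controller leader change that resets the timer only delays the condition instead of invalidating the wait. Errors raised by `get_status` are left to propagate here; a real test would likely rely on retries in the API wrapper.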
ztlpn commented Aug 13, 2022

/ci-repeat 10

@ivotron ivotron added the ci-repeat-10 repeat tests 10x concurrently to check for flakey tests; self-cancelling label Aug 13, 2022
@vbotbuildovich vbotbuildovich removed the ci-repeat-10 repeat tests 10x concurrently to check for flakey tests; self-cancelling label Aug 13, 2022
Previously, when the controller leader node was suspended during the
test, all status requests would fail with a timeout error.
This was true for all nodes, not just the suspended one (because we
proxy the status request to the controller leader), so internal retries
in the admin API wrapper didn't help. We increase the timeout and add
504 to the retriable status codes so that internal retries can handle
this situation.
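
A rough sketch of the retry idea described in this commit message, under stated assumptions. It is not the admin API wrapper from the repository: `HttpStatusError`, `request_with_retries`, the `send` callable, and the retry limits are hypothetical, and the presence of 503 in the set is an assumption (the commit message only says that 504 was added).

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# 504 is the code named in the commit message; 503 is an assumed pre-existing entry.
RETRIABLE_STATUS_CODES = frozenset({503, 504})


class HttpStatusError(Exception):
    """Hypothetical error type carrying the HTTP status code of a failed request."""

    def __init__(self, status_code: int) -> None:
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


def request_with_retries(
    send: Callable[[], T],       # hypothetical: performs one request or raises HttpStatusError
    max_attempts: int = 5,       # assumed default
    backoff_sec: float = 1.0,    # assumed default
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call send() until it succeeds, retrying only on retriable status codes."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except HttpStatusError as err:
            # Give up on non-retriable codes, or once the attempts are exhausted.
            if err.status_code not in RETRIABLE_STATUS_CODES or attempt == max_attempts:
                raise
        sleep(backoff_sec)
    raise AssertionError("unreachable")
```

With 504 treated as retriable, a proxied request that times out while the controller leader is suspended is retried instead of failing the test immediately, which matches the behavior the commit message describes.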
@ztlpn ztlpn added the ci-repeat-10 repeat tests 10x concurrently to check for flakey tests; self-cancelling label Aug 14, 2022
@vbotbuildovich vbotbuildovich removed the ci-repeat-10 repeat tests 10x concurrently to check for flakey tests; self-cancelling label Aug 14, 2022

ztlpn commented Aug 14, 2022

/ci-repeat 10

@ztlpn ztlpn added the ci-repeat-10 repeat tests 10x concurrently to check for flakey tests; self-cancelling label Aug 14, 2022
@vbotbuildovich vbotbuildovich removed the ci-repeat-10 repeat tests 10x concurrently to check for flakey tests; self-cancelling label Aug 14, 2022
@mmedenjak mmedenjak added kind/enhance New feature or request area/tests labels Aug 15, 2022
@ztlpn ztlpn marked this pull request as ready for review August 15, 2022 10:26

ztlpn commented Aug 15, 2022

The results of the x10 run are pretty good: the only failures are #5980, #5324, and a test_maintenance_mode failure due to an rpk issue.

@ztlpn ztlpn merged commit 34680c2 into redpanda-data:dev Aug 15, 2022

ztlpn commented Aug 15, 2022

/backport v22.2.x


5 participants