Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: account for metadata dissemination delay #6322

Merged
merged 1 commit into from
Sep 7, 2022

Conversation

dotnwat
Copy link
Member

@dotnwat dotnwat commented Sep 6, 2022

Cover letter

after shutting down a node we asserted that that node was not a leader
for any partition. however, it may take some time for elections and
metadata dissemination to propogate changes. we move the check below a
sleep to give the system time for all the changes to be propogated. if
after 1 minute the changes haven't been propogated then it would be a
worthy thing to investigate.

We can see some indirect evidence of this cause for the assertion
failure where back-to-back metadata queries hit two different
originating brokers where the one query was used to indicate that we'd
reached a desired state in the test, while the other resulted in the
assertion failure from what is apparent yet-to-be-updated metadata.

[DEBUG - 2022-08-19 19:04:18,966 - kafka_cat - _cmd_raw - lineno:54]:
{"originating_broker":{"id":3,"name":"docker-rp-2:9092/3"},"query":{"topic":"*"},"controllerid":1,"brokers":[{"id":3,"name":"docker-rp-2:9092"},{"id":2,"name":"docker-rp-15:9092"},{"id":1,"name":"docker-rp-3:9092"}],"topics":[{"topic":"topic-yhvslmrfme","partitions":[{"partition":0,"leader":2,"replicas":[{"id"

[DEBUG - 2022-08-19 19:04:18,975 - kafka_cat - _cmd_raw - lineno:54]:
{"originating_broker":{"id":2,"name":"docker-rp-15:9092/2"},"query":{"topic":"*"},"controllerid":1,"brokers":[{"id":3,"name":"docker-rp-2:9092"},{"id":2,"name":"docker-rp-15:9092"},{"id":1,"name":"docker-rp-3:9092"}],"topics":[{"topic":"topic-yhvslmrfme","partitions":[{"partition":0,"leader":2,"replicas":[{"id"

Fixes: #6120

Backport Required

  • not a bug fix
  • issue does not exist in previous branches
  • papercut/not impactful enough to backport
  • v22.2.x
  • v22.1.x
  • v21.11.x

Release notes

  • None

after shutting down a node we asserted that that node was not a leader
for any partition. however, it may take some time for elections and
metadata dissemination to propogate changes. we move the check below a
sleep to give the system time for all the changes to be propogated. if
after 1 minute the changes haven't been propogated then it would be a
worthy thing to investigate.

We can see some indirect evidence of this cause for the assertion
failure where back-to-back metadata queries hit two different
originating brokers where the one query was used to indicate that we'd
reached a desired state in the test, while the other resulted in the
assertion failure from what is apparent yet-to-be-updated metadata.

[DEBUG - 2022-08-19 19:04:18,966 - kafka_cat - _cmd_raw - lineno:54]:
{"originating_broker":{"id":3,"name":"docker-rp-2:9092/3"},"query":{"topic":"*"},"controllerid":1,"brokers":[{"id":3,"name":"docker-rp-2:9092"},{"id":2,"name":"docker-rp-15:9092"},{"id":1,"name":"docker-rp-3:9092"}],"topics":[{"topic":"topic-yhvslmrfme","partitions":[{"partition":0,"leader":2,"replicas":[{"id"

[DEBUG - 2022-08-19 19:04:18,975 - kafka_cat - _cmd_raw - lineno:54]:
{"originating_broker":{"id":2,"name":"docker-rp-15:9092/2"},"query":{"topic":"*"},"controllerid":1,"brokers":[{"id":3,"name":"docker-rp-2:9092"},{"id":2,"name":"docker-rp-15:9092"},{"id":1,"name":"docker-rp-3:9092"}],"topics":[{"topic":"topic-yhvslmrfme","partitions":[{"partition":0,"leader":2,"replicas":[{"id"

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
@mmedenjak mmedenjak added kind/bug Something isn't working area/redpanda ci-failure pr-blocker CI failures blocking a PR from being merged labels Sep 7, 2022
Copy link
Contributor

@VadimPlh VadimPlh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. But I thought we should put all kind of check like this inside wait_until

@rystsov rystsov mentioned this pull request Sep 7, 2022
5 tasks
@rystsov
Copy link
Contributor

rystsov commented Sep 7, 2022

I've rebased it to #6003 to test the fix, will get data in several hours

@dotnwat
Copy link
Member Author

dotnwat commented Sep 7, 2022

Looks good. But I thought we should put all kind of check like this inside wait_until

tbh, i think we could have just removed the two lines i moved below the sleep, they seem to only be a defensive assertion. i'd have added it to a wait_until if the sleep weren't so long, but if it still failed after 1 minute, we'd need to revisit the issue again.

Copy link
Contributor

@rystsov rystsov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to go, the tests pass

@rystsov rystsov merged commit d59314e into redpanda-data:dev Sep 7, 2022
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/redpanda ci-failure kind/bug Something isn't working pr-blocker CI failures blocking a PR from being merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CI Failure in AutomaticLeadershipBalancingTest.test_automatic_rebalance
4 participants