-
Notifications
You must be signed in to change notification settings - Fork 577
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
tests: Deflake test_migrating_consume_offsets #5074
tests: Deflake test_migrating_consume_offsets #5074
Conversation
@mmaslankaprv you added this test, you might have some context here, FYI. |
ab7297b
to
68013af
Compare
# or restarts. We only want to do one of those at a time to avoid conflicts. | ||
busy_nodes = set() | ||
# synchronize access to busy_nodes set. | ||
busy_nodes_lock = threading.RLock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixes synchronization around failure injection and
node restarts.
could you add a little context into the commit message about what was wrong and how it was fixed?
Looped through the test 50 times locally and didn't
run into any issues
this is great context to add to the PR cover letter, but doesn't add much value in the commit message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated the commit message and did a force push.
Fixes synchronization around failure injection and node restarts. The test is structured as follows. 1. start a cluster without consumer_offsets support 2. Run an async failure injection loop that can potentially do a SIGKILL on a random redpanda process. 3. Do a consumer groups test on a sample topic. 4. Restart the cluster to enable the new __consumer_offsets topic feature. 5. Verify that __consumer_offsets functionality works as expected. It turns out that (2) and (4) are conflicting with each other due to poor synchronization. Due to some racy code these operations are interleaved resulting in failure injector attempting to kill a non-existent PID. This patch makes sure that only one of these two operations run at any given time. A given node is marked 'busy' in a thread-safe way if either of the operations is running and the other thread has to wait.
68013af
to
1fb9dec
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test failure seems related to #2501
# or restarts. We only want to do one of those at a time to avoid conflicts. | ||
busy_nodes = set() | ||
# synchronize access to busy_nodes set. | ||
busy_nodes_lock = threading.RLock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated the commit message and did a force push.
thanks! it's excellent ✏️ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
code looks good to me. started a 10x repeat on CI for a quick check. should wait to see how that turns out
looks great |
1/10 runs failed with
I think this is good to go. |
/backport v22.1.x |
Cover letter
Fixes synchronization around failure injection and node restarts.
Looped through the test 50 times locally and didn't
run into any issues
Fixes ##5034
Release notes
force-push Amends the commit message to add more detail.