Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tests: Deflake test_migrating_consume_offsets #5074

Merged
merged 1 commit into from
Jun 11, 2022

Conversation

bharathv
Copy link
Contributor

@bharathv bharathv commented Jun 8, 2022

Cover letter

Fixes synchronization around failure injection and node restarts.
Looped through the test 50 times locally and didn't
run into any issues

Fixes ##5034

Release notes

  • none

force-push Amends the commit message to add more detail.

@bharathv
Copy link
Contributor Author

bharathv commented Jun 8, 2022

@mmaslankaprv you added this test, you might have some context here, FYI.

@bharathv bharathv requested review from rystsov and ztlpn June 8, 2022 19:39
@bharathv bharathv force-pushed the test_migrating_consume_offsets branch from ab7297b to 68013af Compare June 8, 2022 19:44
# or restarts. We only want to do one of those at a time to avoid conflicts.
busy_nodes = set()
# synchronize access to busy_nodes set.
busy_nodes_lock = threading.RLock()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixes synchronization around failure injection and
node restarts.

could you add a little context into the commit message about what was wrong and how it was fixed?

Looped through the test 50 times locally and didn't
run into any issues

this is great context to add to the PR cover letter, but doesn't add much value in the commit message.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the commit message and did a force push.

Fixes synchronization around failure injection and
node restarts.

The test is structured as follows.

1. start a cluster without consumer_offsets support
2. Run an async failure injection loop that can potentially do a SIGKILL
on a random redpanda process.
3. Do a consumer groups test on a sample topic.
4. Restart the cluster to enable the new __consumer_offsets topic
feature.
5. Verify that __consumer_offsets functionality works as expected.

It turns out that (2) and (4) are conflicting with each other due to
poor synchronization. Due to some racy code these operations are
interleaved resulting in failure injector attempting to kill a
non-existent PID.

This patch makes sure that only one of these two operations run at any
given time. A given node is marked 'busy' in a thread-safe way if either
of the operations is running and the other thread has to wait.
@bharathv bharathv force-pushed the test_migrating_consume_offsets branch from 68013af to 1fb9dec Compare June 8, 2022 22:36
Copy link
Contributor Author

@bharathv bharathv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test failure seems related to #2501

# or restarts. We only want to do one of those at a time to avoid conflicts.
busy_nodes = set()
# synchronize access to busy_nodes set.
busy_nodes_lock = threading.RLock()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the commit message and did a force push.

@dotnwat
Copy link
Member

dotnwat commented Jun 8, 2022

I updated the commit message and did a force push.

thanks! it's excellent ✏️

@dotnwat dotnwat added the ci-repeat-10 repeat tests 10x concurrently to check for flakey tests; self-cancelling label Jun 8, 2022
@vbotbuildovich vbotbuildovich removed the ci-repeat-10 repeat tests 10x concurrently to check for flakey tests; self-cancelling label Jun 8, 2022
Copy link
Member

@dotnwat dotnwat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code looks good to me. started a 10x repeat on CI for a quick check. should wait to see how that turns out

@mmaslankaprv
Copy link
Member

looks great

@bharathv bharathv added the ci-repeat-10 repeat tests 10x concurrently to check for flakey tests; self-cancelling label Jun 9, 2022
@vbotbuildovich vbotbuildovich removed the ci-repeat-10 repeat tests 10x concurrently to check for flakey tests; self-cancelling label Jun 9, 2022
@bharathv
Copy link
Contributor Author

bharathv commented Jun 9, 2022

1/10 runs failed with ci-repeat-10 and the failure is in an unrelated test. tracked in #5079

<testcase name="test_dead_group_recovery.static_members=False" classname="ConsumerGroupTest" time="46.33625674247742" status="fail" assertions="">

I think this is good to go.

@bharathv bharathv merged commit e52ce8d into redpanda-data:dev Jun 11, 2022
@bharathv
Copy link
Contributor Author

/backport v22.1.x

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants