test: use snapshots for detecting segment removal #5812
Conversation
Store the name of the respective node in the NodeStorage test util class. This enables a future commit to produce a cluster-level segment snapshot.
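As a rough illustration of why the node name matters (class and field names below are assumptions for illustration, not the actual redpanda ducktape code), keeping the node's name on the storage object is what lets a later helper build a node-to-segments mapping:

```python
# Illustrative sketch only; the real NodeStorage lives in redpanda's
# ducktape test utilities and carries more state than shown here.
class NodeStorage:
    def __init__(self, name, data_dir):
        self.name = name          # name of the redpanda node this storage belongs to
        self.data_dir = data_dir
        self.segments = set()     # segment file names observed on this node


def cluster_segment_snapshot(node_storages):
    """Map each node's name to the set of its segment file names."""
    return {ns.name: set(ns.segments) for ns in node_storages}
```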
Looks good after cross-referencing the suggestions from the linked issue.
I added the ci-repeat-5 label since this is intended to fix a CI failure.
I'm not sure what that was supposed to do (probably run the CI 5 times), but the bot removed the label. One thing to note is that this fixes a specific failure mode of the test, so we'll have to go through the failures (if any) manually.
yes, it runs CI 5 times
Looks good! One small change and a question here
```python
for node, node_segments in original_snapshot.items():
    assert len(
        node_segments
    ) == 10, f"Expected 10 segments, but got {len(node_segments)} on {node}"
```
we can't assert that there are exactly 10 segments, we should have >= 10 segments
Yeah. One of the failures is due to asserting equality here. Why though? Is this due to jitter in the segment size or is there some other reason?
produce_until_segments checks for p >= count, and we don't stop the producer after that.
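The relaxed check discussed above might look like this (the wrapper function is hypothetical and only for illustration; the `>=` comparison is what the review asks for, since produce_until_segments only guarantees a lower bound):

```python
def check_min_segments(original_snapshot, count=10):
    # produce_until_segments only guarantees at least `count` segments,
    # and the producer keeps running afterwards, so assert a lower bound
    # rather than exact equality.
    for node, node_segments in original_snapshot.items():
        assert len(node_segments) >= count, \
            f"Expected at least {count} segments, but got {len(node_segments)} on {node}"
```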
```python
                       partition_idx=0,
                       count=6)

wait_for_removal_of_n_segments(redpanda=self.redpanda,
```
So the key difference here is that we are waiting for the removal of segments that were already present, right? So we are protecting against the following scenario:
- We produce 10 segments.
- Before checking for removal, we produce another 5 segments.
- We delete the first 6 segments.
- It looks like we deleted only one segment, so we are not successful while waiting for 6 segments to be removed.

Right?
Precisely. We take a snapshot of the segments at step 1 and wait until the specified number of segments present in that snapshot have been deleted.
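A minimal sketch of that snapshot-based wait (the helper name comes from this PR, but the signature and internals below are assumptions; the real ducktape utility takes redpanda/topic arguments rather than a callback):

```python
import time

def wait_for_removal_of_n_segments(original_snapshot, get_current_segments, n,
                                   timeout_sec=120, backoff_sec=5):
    # Only membership of the snapshotted segment names is checked, so
    # segments created after the snapshot (e.g. by rolls caused by node
    # restarts) cannot mask deletions of the original segments.
    deadline = time.time() + timeout_sec
    removed = []
    while time.time() < deadline:
        current = set(get_current_segments())
        removed = [s for s in original_snapshot if s not in current]
        if len(removed) >= n:
            return removed
        time.sleep(backoff_sec)
    raise TimeoutError(
        f"only {len(removed)} of {n} snapshotted segments were removed")
```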
I looked at a couple of CI failures and they were not related to this PR, but please look at them to see if any are related to this PR.
Just went through them. There are two failures of the updated tests:
Let's have the CI do a few more runs to see if the failure mode from #5390 occurs.
/ci-repeat 5
I can't get ci-repeat to work. I've triggered another run manually.
Changes in force-push: Change segment count assertion to greater than.
I triggered 5 parallel CI runs. If the failure mode that this PR is trying to fix doesn't occur, I'd say it's fine to merge.
There was only one failure in the 5 runs: #6054.
|
This commit introduces a new test utility for waiting for the removal of a partition's segments, wait_for_removal_of_n_segments. It periodically requests a snapshot of the segments associated with a given partition and compares it with the provided original snapshot. As opposed to wait_for_removal_of_segments, its result is not impacted by newly created segments. This means that it can be safely used in contexts that produce simulated failures (each failure causes the current segment to roll).
Previously, the shadow indexing end-to-end test asserted against the current number of segments when checking for segment removal. This approach has the downside that a restart/failure of a redpanda node causes a segment roll, which makes the assertion unreliable in a context with simulated failures. This commit changes the assertion to use the wait_for_removal_of_n_segments helper method, which uses segment snapshots to determine how many segments were removed. The change deflakes the test against a high number of injected node failures.
Changes in force push: increased the timeout as mentioned in this comment.
/ci-repeat 10
CI is happy now: https://buildkite.com/redpanda/redpanda/builds/14882.
/backport v22.1.x
Branch name "v22.2.x" not found.
/backport v22.2.x
Branch name "v22.2.x" not found.
Cover letter
Previously, the shadow indexing end-to-end test asserted against the current
number of segments when checking for segment removal. This approach has
the downside that a restart/failure of a redpanda node causes a segment
roll, which makes the assertion unreliable in a context with simulated
failures. See #5390 for more context.
This PR introduces a new utility for waiting for segment removal
which uses snapshots to determine what was removed. The change
deflakes the test against a high number of injected node failures.
Fixes #5390
Backport Required
UX changes
Release notes