
test: use snapshots for detecting segment removal #5812

Merged
4 commits merged into redpanda-data:dev on Aug 31, 2022

Conversation

VladLazar
Contributor

@VladLazar VladLazar commented Aug 3, 2022

Cover letter

Previously, the shadow indexing end-to-end test asserted against the current
number of segments when checking for segment removal. This approach has
the downside that a restart/failure of a redpanda node causes a segment
roll, which makes the assertion unreliable in a context with simulated
failures. See #5390 for more context.

This PR introduces a new utility for waiting for segment removal
which uses snapshots to determine what was removed. The change
deflakes the test against a high number of injected node failures.

Fixes #5390
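The snapshot-based wait can be sketched as follows (a minimal illustration, not redpanda's actual ducktape helper; the `get_segment_snapshot` callable and the `{node: set_of_segment_names}` snapshot shape are assumptions):

```python
import time

def wait_for_removal_of_n_segments(get_segment_snapshot, original_snapshot,
                                   n, timeout_sec=60, backoff_sec=2):
    """Wait until at least `n` segments from `original_snapshot` are gone.

    Comparing against the original snapshot means segments rolled after
    the snapshot was taken (e.g. by node restarts) cannot mask removals.
    """
    deadline = time.time() + timeout_sec
    while time.time() < deadline:
        current = get_segment_snapshot()  # {node: set of segment names}
        removed = sum(
            len(originals - current.get(node, set()))
            for node, originals in original_snapshot.items())
        if removed >= n:
            return removed
        time.sleep(backoff_sec)
    raise TimeoutError(f"fewer than {n} of the original segments were removed")
```

Because only segments present in the original snapshot are counted, newly rolled segments neither inflate nor deflate the removal count.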

Backport Required

  • not a bug fix
  • papercut/not impactful enough to backport
  • v22.2.x
  • v22.1.x
  • v21.11.x

UX changes

  • none

Release notes

  • none

Store the name of the respective node in the NodeStorage test util
class. This enables a future commit to produce a cluster-level segment
snapshot.
@VladLazar VladLazar requested a review from LenaAn August 3, 2022 17:12
@NyaliaLui NyaliaLui added the ci-repeat-5 repeat tests 5x concurrently to check for flakey tests; self-cancelling label Aug 10, 2022
@vbotbuildovich vbotbuildovich removed the ci-repeat-5 repeat tests 5x concurrently to check for flakey tests; self-cancelling label Aug 10, 2022
Contributor

@NyaliaLui NyaliaLui left a comment


Looks good after cross referencing the suggestions from the linked issue.
I added ci-repeat-5 label since this is intended to fix a CI failure.

@VladLazar
Contributor Author

> Looks good after cross referencing the suggestions from the linked issue. I added ci-repeat-5 label since this is intended to fix a CI failure.

I'm not sure what that was supposed to do (probably run the CI 5 times), but the bot removed the label. One thing to note is that this fixes a specific failure mode of the test, so we'll have to go through the failures (if any) manually.

@LenaAn
Contributor

LenaAn commented Aug 10, 2022

Yes, it runs CI 5 times.
The bot removed the tag when it started the 5 jobs (19 out of 20 failed 😃 )

Contributor

@LenaAn LenaAn left a comment


Looks good! One small change and a question here

for node, node_segments in original_snapshot.items():
    assert len(node_segments) == 10, \
        f"Expected 10 segments, but got {len(node_segments)} on {node}"
Contributor


We can't assert that there are exactly 10 segments; we should expect >= 10 segments.

Contributor Author


Yeah. One of the failures is due to asserting equality here. Why though? Is this due to jitter in the segment size or is there some other reason?

Contributor


produce_until_segments checks for p >= count, and we don't stop the producer after that.
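Since production can continue past the target, the assertion needs to check a lower bound rather than equality (the change applied in a later force-push). A sketch, with a hypothetical snapshot shape of `{node_name: list_of_segment_names}`:

```python
# Hypothetical data: one node that has rolled 12 segments, two more
# than the produce_until_segments target of 10.
original_snapshot = {"node-1": [f"seg-{i}" for i in range(12)]}

for node, node_segments in original_snapshot.items():
    # Lower-bound check: extra segments rolled by the still-running
    # producer no longer fail the assertion.
    assert len(node_segments) >= 10, \
        f"Expected at least 10 segments, but got {len(node_segments)} on {node}"
```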

partition_idx=0,
count=6)

wait_for_removal_of_n_segments(redpanda=self.redpanda,
Contributor


so the key difference here is that we are waiting for a removal of segments that were already here, right? So we are protecting against the following scenario:

  1. We produced 10 segments
  2. Before checking for removal, we produce another 5 segments
  3. We delete first 6 segments
  4. It looks like we deleted only one segment, so the wait for 6 removed segments never succeeds.

Right?

Contributor Author


Precisely. We take a snapshot of the segments at step 1 and wait until the specified number of segments present in that snapshot have been deleted.
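The scenario above can be checked with a few lines of arithmetic (illustrative segment names only):

```python
# Step 1: the topic reaches 10 segments; the snapshot is taken here.
original = {f"seg-{i}" for i in range(10)}

# Step 2: five more segments roll before removal is checked.
# Step 3: retention deletes the first six original segments.
current = (original - {f"seg-{i}" for i in range(6)}) \
          | {f"seg-{i}" for i in range(10, 15)}

count_based = len(original) - len(current)  # 10 - 9 = 1: looks like 1 removal
snapshot_based = len(original - current)    # 6: the actual removals

assert count_based == 1 and snapshot_based == 6
```

Counting only disappearances from the snapshot reports 6 removals, while the naive segment-count delta reports just 1.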

@LenaAn
Contributor

LenaAn commented Aug 10, 2022

I looked at a couple of CI failures and they were not related to this PR, but please look at them to see if we have some related to this PR

@VladLazar
Contributor Author

> I looked at a couple of CI failures and they were not related to this PR, but please look at them to see if we have some related to this PR

Just went through them. There are two failures of the updated tests:

Let's have the CI do a few more runs to see if the failure mode from #5390 occurs.

@VladLazar VladLazar added ci-repeat-5 repeat tests 5x concurrently to check for flakey tests; self-cancelling and removed ci-repeat-5 repeat tests 5x concurrently to check for flakey tests; self-cancelling labels Aug 10, 2022
@VladLazar
Contributor Author

/ci-repeat 5

@VladLazar
Contributor Author

I can't get ci-repeat to work. I've triggered another run manually.

@NyaliaLui NyaliaLui added the ci-repeat-5 repeat tests 5x concurrently to check for flakey tests; self-cancelling label Aug 11, 2022
@vbotbuildovich vbotbuildovich removed the ci-repeat-5 repeat tests 5x concurrently to check for flakey tests; self-cancelling label Aug 11, 2022
@VladLazar
Contributor Author

Changes in force-push: Change segment count assertion to greater than.

@VladLazar VladLazar added the ci-repeat-3 repeat tests 3x concurrently to check for flakey tests; self-cancelling label Aug 12, 2022
@vbotbuildovich vbotbuildovich removed the ci-repeat-3 repeat tests 3x concurrently to check for flakey tests; self-cancelling label Aug 12, 2022
@VladLazar VladLazar added the ci-repeat-5 repeat tests 5x concurrently to check for flakey tests; self-cancelling label Aug 16, 2022
@vbotbuildovich vbotbuildovich removed the ci-repeat-5 repeat tests 5x concurrently to check for flakey tests; self-cancelling label Aug 16, 2022
@VladLazar
Contributor Author

I triggered 5 parallel CI runs. If the failure mode that this PR is trying to fix doesn't occur, I'd say it's fine to merge.

@VladLazar
Contributor Author

There was only one failure in the 5 runs: #6054.
It's new, but I doubt that this PR has anything to do with it.

@VladLazar VladLazar requested a review from LenaAn August 23, 2022 14:06
LenaAn
LenaAn previously approved these changes Aug 23, 2022
@VladLazar VladLazar added the ci-repeat-5 repeat tests 5x concurrently to check for flakey tests; self-cancelling label Aug 24, 2022
@vbotbuildovich vbotbuildovich removed the ci-repeat-5 repeat tests 5x concurrently to check for flakey tests; self-cancelling label Aug 24, 2022
@mmedenjak mmedenjak added kind/bug Something isn't working ci-failure area/cloud-storage Shadow indexing subsystem area/tests labels Aug 29, 2022
@VladLazar
Contributor Author

test_write_with_node_failures failed on one of the runs, but I think it's a legitimate timeout. The node that failed to remove its segments was stopped three times in a row and didn't get a chance to breach the retention policy and remove the last segment. Increasing the timeout decreases the likelihood of this scenario. I'll do that and run the CI again.

Vlad Lazar added 2 commits August 30, 2022 14:20
This commit introduces a new test utility for waiting for the removal
of a partition's segments, wait_for_removal_of_n_segments. It
periodically requests a snapshot of the segments associated with a given
partition and compares it with the provided original snapshot.

As opposed to wait_for_removal_of_segments, its result is not impacted
by newly created segments. This means that it can be safely used in
contexts that produce simulated failures (each failure causes the
current segment to roll).
Previously, the shadow indexing end-to-end test asserted against the current
number of segments when checking for segment removal. This approach has
the downside that a restart/failure of a redpanda node causes a segment
roll, which makes the assertion unreliable in a context with simulated
failures.

This commit changes the assertion to use the
wait_for_removal_of_n_segments helper method which uses segment
snapshots to determine how many segments were removed. The change
deflakes the test against a high number of injected node failures.
@VladLazar VladLazar added the ci-repeat-5 repeat tests 5x concurrently to check for flakey tests; self-cancelling label Aug 30, 2022
@VladLazar
Contributor Author

Changes in force push: increased the timeout as mentioned in this comment.

@vbotbuildovich vbotbuildovich removed the ci-repeat-5 repeat tests 5x concurrently to check for flakey tests; self-cancelling label Aug 30, 2022
@VladLazar
Contributor Author

/ci-repeat 10

@VladLazar
Contributor Author

CI is happy now: https://buildkite.com/redpanda/redpanda/builds/14882.
@LenaAn could you please re-approve if you're still happy with the change?

@VladLazar VladLazar requested a review from LenaAn August 30, 2022 16:37
@VladLazar VladLazar merged commit befdb01 into redpanda-data:dev Aug 31, 2022
@VladLazar
Contributor Author

/backport v22.1.x

@vbotbuildovich
Collaborator

Branch name "v22.2.x" not found.

Workflow run logs.

@VladLazar
Contributor Author

/backport v22.2.x

@vbotbuildovich
Collaborator

Branch name "v22.2.x" not found.

Workflow run logs.

@VladLazar
Contributor Author

/backport v22.2.x

@vbotbuildovich
Collaborator

Branch name "v22.2.x" not found.

Workflow run logs.

Closes: Timeout failure in EndToEndShadowIndexingTestWithDisruptions.test_write_with_node_failures