
test: use snapshots for detecting segment removal #5812

Merged
4 commits merged into redpanda-data:dev on Aug 31, 2022

Conversation

VladLazar
Contributor

@VladLazar VladLazar commented Aug 3, 2022

Cover letter

Previously, the shadow indexing end-to-end test asserted against the current
number of segments when checking for segment removal. This approach has
the downside that a restart/failure of a redpanda node causes a segment
roll, which makes the assertion unreliable in a context with simulated
failures. See #5390 for more context.

This PR introduces a new utility for waiting for segment removal
which uses snapshots to determine what was removed. The change
deflakes the test against a high number of injected node failures.

Fixes #5390
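The snapshot-based wait can be sketched as follows (a minimal illustration, not redpanda's actual ducktape helper; the `get_segment_snapshot` callable and the `{node: set_of_segment_names}` snapshot shape are assumptions):

```python
import time

def wait_for_removal_of_n_segments(get_segment_snapshot, original_snapshot,
                                   n, timeout_sec=60, backoff_sec=2):
    """Wait until at least `n` segments from `original_snapshot` are gone.

    Comparing against the original snapshot means segments rolled after
    the snapshot was taken (e.g. by node restarts) cannot mask removals.
    """
    deadline = time.time() + timeout_sec
    while time.time() < deadline:
        current = get_segment_snapshot()  # {node: set of segment names}
        removed = sum(
            len(originals - current.get(node, set()))
            for node, originals in original_snapshot.items())
        if removed >= n:
            return removed
        time.sleep(backoff_sec)
    raise TimeoutError(f"fewer than {n} of the original segments were removed")
```

Because only segments present in the original snapshot are counted, newly rolled segments neither inflate nor deflate the removal count.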

Backport Required

  • not a bug fix
  • papercut/not impactful enough to backport
  • v22.2.x
  • v22.1.x
  • v21.11.x

UX changes

  • none

Release notes

  • none

Store the name of the respective node in the NodeStorage test util
class. This enables a future commit to produce a cluster-level segment
snapshot.
@VladLazar VladLazar requested a review from LenaAn August 3, 2022 17:12
@NyaliaLui NyaliaLui added the ci-repeat-5 repeat tests 5x concurrently to check for flakey tests; self-cancelling label Aug 10, 2022
@vbotbuildovich vbotbuildovich removed the ci-repeat-5 repeat tests 5x concurrently to check for flakey tests; self-cancelling label Aug 10, 2022
Contributor

@NyaliaLui NyaliaLui left a comment


Looks good after cross referencing the suggestions from the linked issue.
I added ci-repeat-5 label since this is intended to fix a CI failure.

@VladLazar
Contributor Author

> Looks good after cross referencing the suggestions from the linked issue. I added ci-repeat-5 label since this is intended to fix a CI failure.

I'm not sure what that was supposed to do (probably run the CI 5 times), but the bot removed the label. One thing to note is that this fixes a specific failure mode of the test, so we'll have to go through the failures (if any) manually.

@LenaAn
Contributor

LenaAn commented Aug 10, 2022

Yes, it runs CI 5 times.
The bot removed the tag when it started the 5 jobs (19 out of 20 failed 😃 )

Contributor

@LenaAn LenaAn left a comment


Looks good! One small change and a question here

for node, node_segments in original_snapshot.items():
    assert len(node_segments) == 10, \
        f"Expected 10 segments, but got {len(node_segments)} on {node}"
Contributor


We can't assert that there are exactly 10 segments; we should expect >= 10 segments.

Contributor Author


Yeah. One of the failures is due to asserting equality here. Why though? Is this due to jitter in the segment size or is there some other reason?

Contributor


produce_until_segments checks for p >= count, and we don't stop the producer after that.
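Since production can continue past the target, the assertion needs to check a lower bound rather than equality (the change applied in a later force-push). A sketch, with a hypothetical snapshot shape of `{node_name: list_of_segment_names}`:

```python
# Hypothetical data: one node that has rolled 12 segments, two more
# than the produce_until_segments target of 10.
original_snapshot = {"node-1": [f"seg-{i}" for i in range(12)]}

for node, node_segments in original_snapshot.items():
    # Lower-bound check: extra segments rolled by the still-running
    # producer no longer fail the assertion.
    assert len(node_segments) >= 10, \
        f"Expected at least 10 segments, but got {len(node_segments)} on {node}"
```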

partition_idx=0,
count=6)

wait_for_removal_of_n_segments(redpanda=self.redpanda,
Contributor


so the key difference here is that we are waiting for a removal of segments that were already here, right? So we are protecting against the following scenario:

  1. We produced 10 segments
  2. Before checking for removal, we produce another 5 segments
  3. We delete first 6 segments
  4. It looks like we deleted only one segment, so the wait for 6 removed segments never succeeds.

Right?

Contributor Author


Precisely. We take a snapshot of the segments at step 1 and wait until the specified number of segments present in that snapshot have been deleted.
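The scenario above can be checked with a few lines of arithmetic (illustrative segment names only):

```python
# Step 1: the topic reaches 10 segments; the snapshot is taken here.
original = {f"seg-{i}" for i in range(10)}

# Step 2: five more segments roll before removal is checked.
# Step 3: retention deletes the first six original segments.
current = (original - {f"seg-{i}" for i in range(6)}) \
          | {f"seg-{i}" for i in range(10, 15)}

count_based = len(original) - len(current)  # 10 - 9 = 1: looks like 1 removal
snapshot_based = len(original - current)    # 6: the actual removals

assert count_based == 1 and snapshot_based == 6
```

Counting only disappearances from the snapshot reports 6 removals, while the naive segment-count delta reports just 1.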

@LenaAn
Contributor

LenaAn commented Aug 10, 2022

I looked at a couple of CI failures and they were not related to this PR, but please look at them to see if we have some related to this PR

@VladLazar
Contributor Author

> I looked at a couple of CI failures and they were not related to this PR, but please look at them to see if we have some related to this PR

Just went through them. There are two failures of the updated tests:

Let's have the CI do a few more runs to see if the failure mode from #5390 occurs.

@VladLazar VladLazar added ci-repeat-5 repeat tests 5x concurrently to check for flakey tests; self-cancelling and removed ci-repeat-5 repeat tests 5x concurrently to check for flakey tests; self-cancelling labels Aug 10, 2022
@VladLazar
Contributor Author

/ci-repeat 5

@VladLazar
Contributor Author

I can't get ci-repeat to work. I've triggered another run manually.

@NyaliaLui NyaliaLui added the ci-repeat-5 repeat tests 5x concurrently to check for flakey tests; self-cancelling label Aug 11, 2022
@vbotbuildovich vbotbuildovich removed the ci-repeat-5 repeat tests 5x concurrently to check for flakey tests; self-cancelling label Aug 11, 2022
@VladLazar
Contributor Author

Changes in force-push: Change segment count assertion to greater than.

@VladLazar VladLazar added the ci-repeat-3 repeat tests 3x concurrently to check for flakey tests; self-cancelling label Aug 12, 2022
@vbotbuildovich vbotbuildovich removed the ci-repeat-3 repeat tests 3x concurrently to check for flakey tests; self-cancelling label Aug 12, 2022
@VladLazar VladLazar added the ci-repeat-5 repeat tests 5x concurrently to check for flakey tests; self-cancelling label Aug 16, 2022
@vbotbuildovich vbotbuildovich removed the ci-repeat-5 repeat tests 5x concurrently to check for flakey tests; self-cancelling label Aug 16, 2022
@VladLazar
Contributor Author

I triggered 5 parallel CI runs. If the failure mode that this PR is trying to fix doesn't occur, I'd say it's fine to merge.

@VladLazar
Contributor Author

There was only one failure in the 5 runs: #6054.
It's new, but I doubt that this PR has anything to do with it.

@VladLazar VladLazar requested a review from LenaAn August 23, 2022 14:06
LenaAn
LenaAn previously approved these changes Aug 23, 2022
@VladLazar VladLazar added the ci-repeat-5 repeat tests 5x concurrently to check for flakey tests; self-cancelling label Aug 24, 2022
@vbotbuildovich vbotbuildovich removed the ci-repeat-5 repeat tests 5x concurrently to check for flakey tests; self-cancelling label Aug 24, 2022
@mmedenjak mmedenjak added kind/bug Something isn't working ci-failure area/cloud-storage Shadow indexing subsystem area/tests labels Aug 29, 2022
@VladLazar
Contributor Author

test_write_with_node_failures failed on one of the runs, but I think it's a legitimate timeout. The node that failed to remove its segments was stopped three times in a row and didn't get a chance to breach the retention policy and remove the last segment. Increasing the timeout decreases the likelihood of this scenario. I'll do that and run the CI again.

Vlad Lazar added 2 commits August 30, 2022 14:20
This commit introduces a new test utility for waiting for the removal
of a partition's segments, wait_for_removal_of_n_segments. It
periodically requests a snapshot of the segments associated with a given
partition and compares it with the provided original snapshot.

As opposed to wait_for_removal_of_segments, its result is not impacted
by newly created segments. This means that it can be safely used in
contexts that produce simulated failures (each failure causes the
current segment to roll).
Previously, the shadow indexing end-to-end test asserted against the current
number of segments when checking for segment removal. This approach has
the downside that a restart/failure of a redpanda node causes a segment
roll, which makes the assertion unreliable in a context with simulated
failures.

This commit changes the assertion to use the
wait_for_removal_of_n_segments helper method which uses segment
snapshots to determine how many segments were removed. The change
deflakes the test against a high number of injected node failures.
@VladLazar VladLazar added the ci-repeat-5 repeat tests 5x concurrently to check for flakey tests; self-cancelling label Aug 30, 2022
@VladLazar
Contributor Author

Changes in force push: increased the timeout as mentioned in this comment.

@vbotbuildovich vbotbuildovich removed the ci-repeat-5 repeat tests 5x concurrently to check for flakey tests; self-cancelling label Aug 30, 2022
@VladLazar
Contributor Author

/ci-repeat 10

@VladLazar
Contributor Author

CI is happy now: https://buildkite.com/redpanda/redpanda/builds/14882.
@LenaAn could you please re-approve if you're still happy with the change?

@VladLazar VladLazar requested a review from LenaAn August 30, 2022 16:37
@VladLazar VladLazar merged commit befdb01 into redpanda-data:dev Aug 31, 2022
@VladLazar
Contributor Author

/backport v22.1.x

@vbotbuildovich
Collaborator

Branch name "v22.2.x" not found.

Workflow run logs.

@VladLazar
Contributor Author

/backport v22.2.x

@vbotbuildovich
Collaborator

Branch name "v22.2.x" not found.

Workflow run logs.

@VladLazar
Contributor Author

/backport v22.2.x

@vbotbuildovich
Collaborator

Branch name "v22.2.x" not found.

Workflow run logs.

Closes: Timeout failure in EndToEndShadowIndexingTestWithDisruptions.test_write_with_node_failures