fix(bucket-retriever): Fix flaky tests by removing dependency on sleeping in tests #7223

mdaudali · 2024-08-06T11:10:04Z

General

Before this PR:
A test flaked because its timing requirement was too tight: https://app.circleci.com/pipelines/github/palantir/atlasdb/18418/workflows/867c6fe8-a6e6-4de0-b87e-2fec4f2ee56c/jobs/114507/tests

After this PR:
Removes the tests' dependency on code that really calls `Thread.sleep`, by injecting the sleeping behaviour as a dependency.

I didn't want to widen the bound on the existing await code, because that would make the test practically useless (a bound of 100ms to >1 second!)
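A minimal sketch of the injection idea, not the PR's actual code: `Sleeper` and `BackoffTask` are hypothetical names used only for illustration (the PR injects a `RunnableCheckedException<InterruptedException>` into `ShardedSweepableBucketRetriever`). Production wiring sleeps for real, while a test can pass a no-op and never block.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the injected dependency (assumption, not the PR's type).
@FunctionalInterface
interface Sleeper {
    void sleep() throws InterruptedException;
}

// Hypothetical class that backs off before doing its work.
final class BackoffTask {
    private final Sleeper sleeper;
    private final AtomicInteger runs = new AtomicInteger();

    BackoffTask(Sleeper sleeper) {
        this.sleeper = sleeper;
    }

    // Production wiring: really block the calling thread.
    static BackoffTask withRealSleep(long millis) {
        return new BackoffTask(() -> Thread.sleep(millis));
    }

    // Backs off via the injected sleeper, then returns how many times it has run.
    int run() throws InterruptedException {
        sleeper.sleep(); // the only place that can block
        return runs.incrementAndGet();
    }
}
```

In a test, `new BackoffTask(() -> {})` exercises the same `run()` path with no timing dependence at all.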

==COMMIT_MSG==
==COMMIT_MSG==

Priority: P2

Concerns / possible downsides (what feedback would you like?):
It's a little jank injecting the function that sleeps - we don't do it elsewhere (but I guess it is a dependency!)
Naming - I went with sleeper, open to anything else.
`wasSleepCalled` in tests - it's only read once in the tests. I didn't want to mock unnecessarily, but figured I might as well inject it for all tests rather than have two separate functions. I can just switch the default for tests to be `() -> {}` and recreate the retriever as needed in the new test.

`defaultSleeperCanBeInterrupted` will involve randomness once again, as we're not pinning the backoff value (it's back to being controlled by TLR) - and so it could spuriously pass. That's roughly a 0.17% chance ((1000 / 600000) * 100) - I can reduce the odds even further by bumping the maxBackoff, but in cases where it fails legitimately, CI will just spin.
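The estimate above can be reproduced with a quick calculation. This sketch assumes the backoff is drawn uniformly from 0 to 10 minutes (600,000 ms) and that the test waits at most 1 second; the class and method names are illustrative only.

```java
// Illustrative arithmetic only; not part of the PR.
final class SpuriousPassOdds {
    // Percentage chance that a backoff drawn uniformly from [0, maxBackoffMillis)
    // is shorter than the await bound, in which case the test would pass even if
    // interruption were broken.
    static double percentChance(long awaitBoundMillis, long maxBackoffMillis) {
        return 100.0 * awaitBoundMillis / maxBackoffMillis;
    }

    public static void main(String[] args) {
        System.out.println(percentChance(1_000, 600_000));
    }
}
```

Under the same assumption, doubling the maximum backoff halves the chance of a spurious pass.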

Testing and Correctness

What, if any, assumptions are made about the current state of the world? If they change over time, how will we find out?:
That flakes are bad
What was existing testing like? What have you done to improve it?:
Flaky, now less flaky!

Development Process

Where should we start reviewing?:
SSBR (`ShardedSweepableBucketRetriever`)

changelog-app · 2024-08-06T11:10:08Z

Generate changelog in `changelog/@unreleased`

What do the change types mean?

feature: A new feature of the service.
improvement: An incremental improvement in the functionality or operation of the service.
fix: Remedies the incorrect behaviour of a component of the service in a backwards-compatible way.
break: Has the potential to break consumers of this service's API, inclusive of both Palantir services and external consumers of the service's API (e.g. customer-written software or integrations).
deprecation: Advertises the intention to remove service functionality without any change to the operation of the service itself.
manualTask: Requires the possibility of manual intervention (running a script, eyeballing configuration, performing database surgery, ...) at the time of upgrade for it to succeed.
migration: A fully automatic upgrade migration task with no engineer input required.

Note: only one type should be chosen.

How are new versions calculated?

❗The break and manual task changelog types will result in a major release!
🐛 The fix changelog type will result in a minor release in most cases, and a patch release version for patch branches. This behaviour is configurable in autorelease.
✨ All others will result in a minor version release.

Type

Description

fix(bucket-retriever): Fix flaky tests by removing dependency on sleeping in tests

mdaudali · 2024-08-06T11:11:01Z

...pl-shared/src/main/java/com/palantir/atlasdb/sweep/asts/ShardedSweepableBucketRetriever.java

-    // Exists to facilitate testing in unit tests, rather than needing to mock out ThreadLocalRandom.
-    private final Supplier<Long> backoffMillisGenerator;
+    // Exists to facilitate testing in unit tests, rather than needing to mock out ThreadLocalRandom and Thread#sleep.
+    private final RunnableCheckedException<InterruptedException> sleeper;


open to another name
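For illustration, a default sleeper along these lines could bundle the random backoff and the sleep into one injectable unit. This is a sketch under assumptions: `CheckedSleeper` is a local stand-in for `RunnableCheckedException<InterruptedException>`, and `randomBackoffSleeper` and its bound are hypothetical, not the code in this PR.

```java
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

// Local stand-in for RunnableCheckedException<InterruptedException> (assumption).
@FunctionalInterface
interface CheckedSleeper {
    void run() throws InterruptedException;
}

final class DefaultSleeperSketch {
    // Hypothetical factory: picks a fresh random backoff in [0, maxBackoff]
    // on every call, then sleeps for that long.
    static CheckedSleeper randomBackoffSleeper(Duration maxBackoff) {
        return () -> {
            long millis = ThreadLocalRandom.current().nextLong(maxBackoff.toMillis() + 1);
            Thread.sleep(millis);
        };
    }
}
```

Because both the randomness and the blocking call live behind one interface, a test only needs to replace a single dependency.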

mdaudali · 2024-08-06T11:11:28Z

...hared/src/test/java/com/palantir/atlasdb/sweep/asts/ShardedSweepableBucketRetrieverTest.java

    private final TestShardedRetrievalStrategy strategy = new TestShardedRetrievalStrategy();
    private final TestParallelTaskExecutor parallelTaskExecutor = new TestParallelTaskExecutor();
    private final ExecutorService executorService = PTExecutors.newSingleThreadScheduledExecutor();
+    private final AtomicBoolean wasSleepCalled = new AtomicBoolean(false);


Copying concern:
`wasSleepCalled` in tests - it's only read once in the tests. I didn't want to mock unnecessarily, but figured I might as well inject it for all tests rather than have two separate functions. If you feel strongly, I can just switch the default for tests to be `() -> {}` and recreate the retriever as needed in the new test.
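A rough sketch of the recording approach described above, assuming a hypothetical `TestSleeper` interface in place of `RunnableCheckedException<InterruptedException>`; the wrapper class is illustrative and not the test class in this PR.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Local stand-in for the injected sleeper type (assumption).
@FunctionalInterface
interface TestSleeper {
    void run() throws InterruptedException;
}

final class RecordingSleeperSketch {
    private final AtomicBoolean wasSleepCalled = new AtomicBoolean(false);

    // Test double: records the call and returns immediately instead of blocking.
    TestSleeper sleeper() {
        return () -> wasSleepCalled.set(true);
    }

    boolean wasSleepCalled() {
        return wasSleepCalled.get();
    }
}
```

The alternative mentioned in the comment, a plain `() -> {}` default, drops the flag entirely and leaves only the one test that cares to build its own recording sleeper.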

mdaudali · 2024-08-06T11:11:41Z

...hared/src/test/java/com/palantir/atlasdb/sweep/asts/ShardedSweepableBucketRetrieverTest.java

-        // Even though the backoff in 10s, interrupting the task should make it finish much faster.
-        Awaitility.await().atMost(Duration.ofMillis(200)).untilAsserted(() -> assertThat(
+        // Even though the backoff is up to 10 minutes, interrupting the task should make it finish much faster.
+        Awaitility.await().atMost(Duration.ofSeconds(1)).untilAsserted(() -> assertThat(


`defaultSleeperCanBeInterrupted` will involve randomness once again, as we're not pinning the backoff value (it's back to being controlled by TLR) - and so it could spuriously pass. That's roughly a 0.17% chance ((1000 / 600000) * 100) - I can reduce the odds even further by bumping the maxBackoff, but in cases where it fails legitimately, CI will just spin.
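The property being tested can be sketched without Awaitility, using only the JDK. `InterruptSketch` and its method are hypothetical names, and unlike the test in this PR the sleep duration here is fixed rather than random, so there is no chance of a spurious pass.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

final class InterruptSketch {
    // Starts a thread that sleeps for sleepMillis, interrupts it, and reports
    // whether it observed the interrupt and exited within waitMillis.
    static boolean sleepIsInterruptedWithin(long sleepMillis, long waitMillis)
            throws InterruptedException {
        CountDownLatch started = new CountDownLatch(1);
        AtomicBoolean interrupted = new AtomicBoolean(false);
        Thread worker = new Thread(() -> {
            started.countDown();
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                interrupted.set(true); // the sleep was cut short
            }
        });
        worker.start();
        started.await();         // make sure the worker is running
        worker.interrupt();      // also works if the sleep has not begun yet
        worker.join(waitMillis); // bounded wait, like the await in the test
        return interrupted.get() && !worker.isAlive();
    }
}
```

For example, `InterruptSketch.sleepIsInterruptedWithin(600_000, 1_000)` mirrors the 10-minute maximum backoff and 1-second bound discussed in this comment.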

jeremyk-91

👍 Makes sense. The concerns are fair though I don't really have a strong opinion on these ones, honestly.

fix(bucket-retriever): Fix flaky tests by removing dependency on sleeping in tests

b9dedb3

mdaudali added the 🤖 fix nits label Aug 6, 2024

mdaudali requested a review from jeremyk-91 August 6, 2024 11:14

jeremyk-91 approved these changes Aug 6, 2024

mdaudali added the merge when ready label Aug 6, 2024

bulldozer-bot bot merged commit 47cc06a into develop Aug 6, 2024
21 of 22 checks passed

bulldozer-bot bot deleted the mdaudali/08-06-fix_bucket-retriever_fix_flaky_tests_by_removing_dependency_on_sleeping_in_tests branch August 6, 2024 13:03
