Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(bucket-retriever): Fix flaky tests by removing dependency on sleeping in tests #7223

Conversation

mdaudali
Copy link
Contributor

@mdaudali mdaudali commented Aug 6, 2024

General

Before this PR:
Test flaked, because the timing requirement was too tight: https://app.circleci.com/pipelines/github/palantir/atlasdb/18418/workflows/867c6fe8-a6e6-4de0-b87e-2fec4f2ee56c/jobs/114507/tests

After this PR:
Removes dependency on testing the code that uses thread.sleep by injecting a dependency that sleeps.

I didn't want to expand the bound on the existing await code because that would make test practically useless (a bound of 100ms to >1 second!)

==COMMIT_MSG==
==COMMIT_MSG==

Priority: P2

Concerns / possible downsides (what feedback would you like?):
It's a little jank injecting the function that sleeps - we don't do it elsewhere (but I guess it is a dependency!)
Naming - I went with sleeper, open to anything else.
wasSleepCalled in tests - it's only read once in the tests, didn't want to mock unnecessarily, but figured I might as well inject it for all rather than have two separate functions. Can just switch the default for tests to be () -> {} and recreate the retriever as needed in the new test.

defaultSleeperCanBeInterrupted will involve randomness once again, as we're not pinning the backoff value (it's back to being controlled by TLR) - and so it could spuriously pass. It's a 0.16% chance for that to happen ((1000 / 600000) * 100) - I can reduce the odds even further by bumping the maxBackoff, but in cases where it fails legitimately, CI will just
spin.

Testing and Correctness

What, if any, assumptions are made about the current state of the world? If they change over time, how will we find out?:
That flakes are bad
What was existing testing like? What have you done to improve it?:
Flaky, now less flaky!

Development Process

Where should we start reviewing?:
SSBR

@changelog-app
Copy link

changelog-app bot commented Aug 6, 2024

Generate changelog in changelog/@unreleased

What do the change types mean?
  • feature: A new feature of the service.
  • improvement: An incremental improvement in the functionality or operation of the service.
  • fix: Remedies the incorrect behaviour of a component of the service in a backwards-compatible way.
  • break: Has the potential to break consumers of this service's API, inclusive of both Palantir services
    and external consumers of the service's API (e.g. customer-written software or integrations).
  • deprecation: Advertises the intention to remove service functionality without any change to the
    operation of the service itself.
  • manualTask: Requires the possibility of manual intervention (running a script, eyeballing configuration,
    performing database surgery, ...) at the time of upgrade for it to succeed.
  • migration: A fully automatic upgrade migration task with no engineer input required.

Note: only one type should be chosen.

How are new versions calculated?
  • ❗The break and manual task changelog types will result in a major release!
  • 🐛 The fix changelog type will result in a minor release in most cases, and a patch release version for patch branches. This behaviour is configurable in autorelease.
  • ✨ All others will result in a minor version release.

Type

  • Feature
  • Improvement
  • Fix
  • Break
  • Deprecation
  • Manual task
  • Migration

Description

fix(bucket-retriever): Fix flaky tests by removing dependency on sleeping in tests

Check the box to generate changelog(s)

  • Generate changelog entry

// Exists to facilitate testing in unit tests, rather than needing to mock out ThreadLocalRandom.
private final Supplier<Long> backoffMillisGenerator;
// Exists to facilitate testing in unit tests, rather than needing to mock out ThreadLocalRandom and Thread#sleep.
private final RunnableCheckedException<InterruptedException> sleeper;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

open to another name

private final TestShardedRetrievalStrategy strategy = new TestShardedRetrievalStrategy();
private final TestParallelTaskExecutor parallelTaskExecutor = new TestParallelTaskExecutor();
private final ExecutorService executorService = PTExecutors.newSingleThreadScheduledExecutor();
private final AtomicBoolean wasSleepCalled = new AtomicBoolean(false);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copying concern:
wasSleepCalled in tests - it's only read once in the tests, didn't want to mock unnecessarily, but figured I might as well inject it for all rather than have two separate functions. If you feel strongly, I can just switch the default for tests to be () -> {} and recreate the retriever as needed in the new test.

// Even though the backoff in 10s, interrupting the task should make it finish much faster.
Awaitility.await().atMost(Duration.ofMillis(200)).untilAsserted(() -> assertThat(
// Even though the backoff is up to 10 minutes, interrupting the task should make it finish much faster.
Awaitility.await().atMost(Duration.ofSeconds(1)).untilAsserted(() -> assertThat(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

defaultSleeperCanBeInterrupted will involve randomness once again, as we're not pinning the backoff value (it's back to being controlled by TLR) - and so it could spuriously pass. It's a 0.16% chance for that to happen ((1000 / 600000) * 100) - I can reduce the odds even further by bumping the maxBackoff, but in cases where it fails legitimately, CI will just spin.

Copy link
Contributor

@jeremyk-91 jeremyk-91 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Makes sense. The concerns are fair though I don't really have a strong opinion on these ones, honestly.

@bulldozer-bot bulldozer-bot bot merged commit 47cc06a into develop Aug 6, 2024
21 of 22 checks passed
@bulldozer-bot bulldozer-bot bot deleted the mdaudali/08-06-fix_bucket-retriever_fix_flaky_tests_by_removing_dependency_on_sleeping_in_tests branch August 6, 2024 13:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants