Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cloud_storage: use larger cache for scale test #5915

Merged
merged 1 commit into from
Aug 10, 2022

Conversation

abhijat
Copy link
Contributor

@abhijat abhijat commented Aug 9, 2022

The cloud storage cache used to be cleared in regular intervals. This allowed the franz go verifier test to grow the cache to larger than limit in short bursts between cleanup, and pass because the consumers could progress.

With the new more strict/realtime cache eviction, this does not happen and the test fails for multiple reasons. This change allows the cache size to be a multiple of segment size so that there is not a continuous cycle of hydrate current segment -> read from segment -> evict old segment in the code.

partially fixes #5753 . still need to investigate the memory usage growing in the pathological case

Backport Required

  • not a bug fix
  • papercut/not impactful enough to backport
  • v22.2.x
  • v22.1.x
  • v21.11.x

UX changes

None

Release notes

None

@abhijat abhijat force-pushed the adjust-cache-size-for-scale-test branch 4 times, most recently from 748b274 to b9b7f08 Compare August 9, 2022 12:00
@abhijat abhijat requested review from Lazin and LenaAn August 9, 2022 13:29
@mmedenjak mmedenjak added kind/bug Something isn't working ci-failure area/cloud-storage Shadow indexing subsystem labels Aug 9, 2022
VladLazar
VladLazar previously approved these changes Aug 9, 2022
Copy link
Contributor

@VladLazar VladLazar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Is this a good heuristic for the minimum SI cache size: parallel_consumers_count * segment_size? If so, should we document it anywhere?

tests/rptest/scale_tests/franz_go_verifiable_test.py Outdated Show resolved Hide resolved
@abhijat
Copy link
Contributor Author

abhijat commented Aug 9, 2022

Looks good to me. Is this a good heuristic for the minimum SI cache size: parallel_consumers_count * segment_size? If so, should we document it anywhere?

My reasoning behind this size was that in the worst case where each parallel read is reading from a different segment, we should have roughly enough space for these reads to continue without evicting each other, although we could probably get away with something smaller, like 0.75 x parallel_reads x segment size, but it would be borderline and probably fail sometimes.

This failure (if we set cache to the smaller 0.75x ration) in my tests happens especially with smaller segment size, I guess with larger segments the random reads often find their offset in the large existing segments on cache, with smaller segments the probability of that offset being in cache decreases.

On the other hand if the cache is much larger we won't exercise eviction often enough.

The official recommendation is 20GiB https://docs.redpanda.com/docs/data-management/tiered-storage/#caching.

Please let me know if it makes sense, we can tune it further if this seems too relaxed of a setting

cc @Lazin

The cloud storage cache used to be cleared in regular intervals. This
allowed the test to actually grow the cache to larger than limit in
short bursts between cleanup, and pass because the consumers could
progress.

With the new more strict/realtime cache eviction, this does not happen
and the test fails for multiple reasons. This change allows the cache
size to be a multiple of segment size so that there is not a continuous
cycle of hydrate current segment -> read from segment -> evict old
segment in the code.
@VladLazar
Copy link
Contributor

Makes sense. Thanks for the extra context.

Copy link
Contributor

@LenaAn LenaAn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@Lazin Lazin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@abhijat abhijat merged commit 8ad74a2 into redpanda-data:dev Aug 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/cloud-storage Shadow indexing subsystem ci-failure kind/bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Crash in KgoVerifierWithSiTestLargeSegments.test_si_without_timeboxed
5 participants