
storage: per-shard limit on memory for spill_key_index #5722

Merged
merged 7 commits into redpanda-data:dev
Aug 5, 2022

Conversation

jcsp
Contributor

@jcsp jcsp commented Jul 29, 2022

Cover letter

Previously, every compacted partition was allowed to use up to 512 KiB of memory for its spill_key_index. For high partition counts, this was an unacceptably large overhead.

Now, there is an additional shard-wide limit on memory used for compaction indices. On systems with large numbers of compacted partitions, the indices will spill earlier when this limit is exceeded.
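As a rough illustration of the shard-wide limit, here is a minimal sketch of a byte budget that always grants the request (so accounting stays exact) but signals the caller to spill once the shard-wide pool is exhausted. The class and member names here are hypothetical stand-ins, not the actual storage_resources API:

```cpp
#include <cassert>
#include <cstddef>

// Illustrative sketch only: names are hypothetical, not the PR's code.
struct take_result {
    bool checkpoint_hint; // true => the caller's index should spill now
};

class compaction_index_budget {
    size_t _available;

public:
    explicit compaction_index_budget(size_t total)
      : _available(total) {}

    // Deduct bytes from the shard-wide budget. The request is always
    // granted so the books stay balanced, but when the budget is
    // exhausted we hint that the index should spill and free memory.
    take_result take_bytes(size_t n) {
        bool exhausted = n > _available;
        _available -= exhausted ? _available : n;
        return take_result{exhausted};
    }

    // Return bytes to the pool, e.g. after an index spills to disk.
    void release_bytes(size_t n) { _available += n; }

    size_t available() const { return _available; }
};
```

The key property is that indices never fail an allocation outright; they just spill earlier under shard-wide pressure, trading memory for extra index-write I/O.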

Fixes #4645

Backport Required

  • not a bug fix
  • papercut/not impactful enough to backport
  • v22.2.x
  • v22.1.x
  • v21.11.x

UX changes

None

Release notes

Improvements

  • Compacted topics automatically use less memory for their indices when the partition count is high, improving stability on large scale systems.

@jcsp jcsp added kind/enhance New feature or request area/storage labels Jul 29, 2022
@jcsp jcsp marked this pull request as ready for review July 29, 2022 17:07
Member

@dotnwat dotnwat left a comment


generally looks good. one question inline about semaphore accounting. also:

Previously, every compacted partition was allowed to use up to 512 KiB of memory for its spill_key_index. For high partition counts, this was an unacceptably large overhead.

i suppose this is a consequence of the design in which all of the head segments on compacted partitions have a spill key index that accumulates as appends occur? we've discussed it in the past and it might be an attractive alternative to this PR: let the normal compaction process build the spill key index on demand instead of tracking it per active segment.

src/v/storage/spill_key_index.cc
@@ -88,14 +88,34 @@ ss::future<> spill_key_index::add_key(compaction_key b, value_type v) {
auto const key_size = b.size();
auto const expected_size = idx_mem_usage() + _keys_mem_usage + key_size;

// TODO call into storage_resources
auto take_result = _resources.compaction_index_take_bytes(key_size);
Member


are we only accounting for the keys stored in the index? should we also account at least for the values associated with them?

Contributor Author


I was following the existing accounting's behavior of ignoring value sizes, but including the value size in the calculation is pretty simple, so we might as well do it -- I've added a commit.

Member


we do account for the values in idx_mem_usage()

Contributor Author


It looked to me like AllocatedByteSize on node_hash_map just returns the space used for the table's slots, and that the key and value are both allocated outside of the slot?
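For illustration, here is a hedged sketch of that distinction: tracking the key/value payload bytes in a separate tally, since container introspection typically reports only the table's own slot memory. std::unordered_map stands in for absl's node_hash_map, and the names are illustrative, not the PR's code:

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Illustrative sketch: the heap-allocated key/value payload is tallied
// by hand, separately from whatever the container reports about its
// own slot array (which is roughly what AllocatedByteSize-style
// introspection covers).
class accounted_index {
    std::unordered_map<std::string, long> _map;
    size_t _payload_bytes = 0; // key bytes + value bytes, our own tally

public:
    void add(std::string key, long value) {
        // Account before moving the key into the map.
        _payload_bytes += key.size() + sizeof(value);
        _map.emplace(std::move(key), value);
    }

    size_t payload_bytes() const { return _payload_bytes; }
};
```

Note this still undercounts real footprint (allocator rounding, node headers), but it captures the payload the container's slot accounting misses.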

@jcsp
Contributor Author

jcsp commented Aug 2, 2022

i suppose this is a consequence of the design in which all of the head segments on compacted partitions have a spill key index that accumulates as appends occur? we've discussed it in the past and it might be an attractive alternative to this PR: let the normal compaction process build the spill key index on demand instead of tracking it per active segment.

Yeah, compaction overall would benefit from deferring work, although regenerating the index later is in principle more I/O intensive than generating it while the data is passing through on the way in.

@jcsp jcsp force-pushed the issue-4645-spill-key-index branch from fa4d230 to 5ffb976 Compare August 2, 2022 15:35
@jcsp
Contributor Author

jcsp commented Aug 2, 2022

Updated this with refinements for review suggestions

Member

@dotnwat dotnwat left a comment


lgtm. looks like there is a linter error

src/v/storage/spill_key_index.cc
@jcsp jcsp force-pushed the issue-4645-spill-key-index branch from 5ffb976 to c69d869 Compare August 3, 2022 08:07
dotnwat
dotnwat previously approved these changes Aug 4, 2022
Member

@dotnwat dotnwat left a comment


Yeah, compaction overall would benefit from deferring work, although regenerating the index later is in principle more I/O intensive than generating it while the data is passing through on the way in.

IIUC the number of writes to the spill key index will increase as pressure on the memory limit set in this PR increases (more frequent spills). So we can increase the limit manually at runtime to decrease I/O if needed, and may indeed need to, since spills effectively become inline with append ops.

OTOH background compaction (including generating indexes when needed), disregarding any sort of qos, can effectively run as best effort I/O.

Is that an accurate summary?

@jcsp
Contributor Author

jcsp commented Aug 5, 2022

IIUC the number of writes to the spill key index will increase as pressure on the memory limit set in this PR increases (more frequent spills). So we can increase the limit manually at runtime to decrease I/O if needed, and may indeed need too since spills effectively become inline with append ops.

Right.

OTOH background compaction (including generating indexes when needed), disregarding any sort of qos, can effectively run as best effort I/O.

Is that an accurate summary?

Yeah, it would get to run at lower priority. That wouldn't help you if the system was long-term saturated and filling the retention limits, so that compaction needed to keep up with production, but that's where things get subjective about how real-world systems use it vs. how a worst-case brute-force benchmark would drive it.

tldr let's redo compaction at some point :-)

This is an additional bound, on top of the existing _max_mem
in spill_key_index: it will now also avoid using more memory
in total per shard.

This commit uses a static total of 128 MiB, which enables up to 256
compacted partitions to use the same 512 KiB per-partition allowance
that they were using before. As the partition count grows beyond that,
the limit starts throttling back, although each partition always gets
at least 32 KiB of memory, to avoid a pathological case where it
spills on every key add.
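The sizing described in that commit message could be sketched as a clamped fair share. The constants come from the message itself; the dividing function is an illustrative guess at the behavior, not the actual implementation:

```cpp
#include <algorithm>
#include <cstddef>

// Constants from the commit message; function shape is a sketch.
constexpr size_t total_budget = 128ull * 1024 * 1024; // 128 MiB per shard
constexpr size_t per_partition_max = 512 * 1024;      // historical cap
constexpr size_t per_partition_min = 32 * 1024;       // avoid spill-per-key

size_t per_partition_allowance(size_t compacted_partitions) {
    if (compacted_partitions == 0) {
        return per_partition_max;
    }
    // Divide the shard budget evenly, but never exceed the old 512 KiB
    // cap and never drop below the 32 KiB floor.
    size_t fair_share = total_budget / compacted_partitions;
    return std::clamp(fair_share, per_partition_min, per_partition_max);
}
```

At exactly 256 partitions the fair share is 512 KiB, matching the old behavior; beyond roughly 4096 partitions every index sits at the 32 KiB floor and relies on earlier spilling.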
This property controls the new bound on per-shard
memory used for compaction indices at scale.
This will probably be rarely changed in practice, but
it mitigates the risk that we have people using compaction
at high scale and experiencing performance issues from
index spills happening more often than they hoped.
This enables other code to accurately account for memory use.
Previously, we ignored:
- the inline buffer in `bytes`, which affects the actual memory utilization
- the map value.

Actual memory footprint still depends on how these sizes
get rounded up to allocator boundaries, but this is an improvement.
While in general the accounting is robust, it is a little
fragile in error paths (or if the code is changed in future),
because it is an exception-generating condition to release more
units from a sem_units than you took: a double release
could perhaps occur if we released before spill(), then
an exception in spill() caused us to iterate again.
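The hazard described there can be illustrated with a minimal RAII-style units holder in the spirit of Seastar's semaphore_units (illustrative, not the Seastar API): returning more units than are currently held throws, which is exactly what a double release in a retried error path would trigger.

```cpp
#include <cstddef>
#include <stdexcept>

// Illustrative sketch, not the Seastar semaphore_units API.
class sem_units {
    size_t _count;

public:
    explicit sem_units(size_t n) : _count(n) {}

    // Returning more units than are held indicates a double release,
    // so treat it as a logic error rather than silently over-crediting
    // the semaphore.
    void return_units(size_t n) {
        if (n > _count) {
            throw std::logic_error("double release of semaphore units");
        }
        _count -= n;
    }

    size_t count() const { return _count; }
};
```

This is why the ordering matters: release units only after spill() has succeeded, so a retry after an exception cannot release the same units twice.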
@jcsp
Contributor Author

jcsp commented Aug 5, 2022

This needed a rebase for named semaphore changes

@dotnwat
Member

dotnwat commented Aug 5, 2022

retention limits so that compaction needed to keep up with production

ahh great point. i guess that's why we have the adaptive I/O priority bits that michal wrote for compaction.

@dotnwat
Member

dotnwat commented Aug 5, 2022

failure is #5868 which is unrelated and getting fixed.

@dotnwat dotnwat merged commit d3bb917 into redpanda-data:dev Aug 5, 2022
@jcsp jcsp deleted the issue-4645-spill-key-index branch August 8, 2022 08:32
Development

Successfully merging this pull request may close these issues.

storage: Compaction index uses hardcoded 512kB memory limit
3 participants