storage: per-shard limit on memory for spill_key_index #5722
Conversation
generally looks good. one question inline about semaphore accounting. also:
> Previously, every compacted partition was allowed to use up to 512KiB of memory for its spill_key_index. For high partition counts, this was an unacceptably large overhead.
i suppose this is a consequence of the design in which all of the head segments on compacted partitions have a spill key index that accumulates as appends occur? we've discussed this in the past, and an attractive alternative to this PR might be to let the normal compaction process build the spill key index on demand instead of tracking it per active segment.
src/v/storage/spill_key_index.cc (Outdated)

```diff
@@ -88,14 +88,34 @@ ss::future<> spill_key_index::add_key(compaction_key b, value_type v) {
     auto const key_size = b.size();
     auto const expected_size = idx_mem_usage() + _keys_mem_usage + key_size;

     // TODO call into storage_resources
     auto take_result = _resources.compaction_index_take_bytes(key_size);
```
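The hunk doesn't show what `compaction_index_take_bytes` returns. A minimal sketch of a take-style byte budget follows; the names mirror the diff, but the class internals and the `checkpoint_hint` field are illustrative assumptions, not Redpanda's actual `storage_resources` API:

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical shard-wide byte budget for compaction indices.
// Sketch only; the real accounting is semaphore-based.
struct take_result {
    size_t taken;         // bytes actually reserved against the budget
    bool checkpoint_hint; // suggest the caller spill soon
};

class storage_resources {
public:
    explicit storage_resources(size_t budget)
      : _budget(budget), _available(budget) {}

    take_result compaction_index_take_bytes(size_t bytes) {
        // Grant what fits; never go negative.
        size_t granted = std::min(bytes, _available);
        _available -= granted;
        // Hint a spill once less than 10% of the budget remains.
        return {granted, _available < _budget / 10};
    }

    void compaction_index_release_bytes(size_t bytes) {
        _available = std::min(_available + bytes, _budget);
    }

    size_t available() const { return _available; }

private:
    size_t _budget;
    size_t _available;
};
```

The point of returning a struct rather than a bool is that the caller can learn both how much was actually reserved and whether it should proactively spill to relieve pressure.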
are we only accounting for the keys that are stored in the index? should we also account at least for the values associated with them?
I was following the existing accounting's behavior of ignoring value sizes, but including the value size in the calculation is pretty simple, so we might as well do it -- I've added a commit.
we do account for the values in idx_mem_usage()
It looked to me like AllocatedByteSize on node_hash_map just returns the space used for the table's slots, and that the key and value are both allocated outside of the slot?
Yeah, compaction overall would benefit from deferring work, although regenerating the index later is in principle more I/O intensive than generating it while the data is passing through on the way in.
Force-pushed from fa4d230 to 5ffb976.
Updated this with refinements from review suggestions.
lgtm. looks like there is a linter error
Force-pushed from 5ffb976 to c69d869.
> Yeah, compaction overall would benefit from deferring work, although regenerating the index later is in principle more I/O intensive than generating it while the data is passing through on the way in.
IIUC, the number of writes to the spill key index will increase as pressure on the memory limit set in this PR increases (more frequent spills). So we can increase the limit manually at runtime to decrease I/O if needed, and may indeed need to, since spills effectively become inline with append ops.
OTOH, background compaction (including generating indexes when needed), disregarding any sort of QoS, can effectively run as best-effort I/O.
Is that an accurate summary?
Right.
Yeah, it would get to run at lower priority. That wouldn't help you if the system were long-term saturated and filling the retention limits, such that compaction needed to keep up with production, but that's where things get subjective about how real-world systems use it vs. how a worst-case brute-force benchmark would drive it. tldr: let's redo compaction at some point :-)
This is an additional bound on top of the existing _max_mem in spill_key_index: it will now also avoid using more memory in total per shard. This commit uses a static total of 128MB, which enables up to 256 compacted partitions to use the same 512KiB per-partition allowance that they were using before. As the partition count gets higher, it starts throttling back, although each partition always gets to use at least 32KiB of memory, to avoid a pathological case where it spills on every key add.
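The arithmetic above can be sketched as a fair-share computation clamped between the floor and the legacy per-partition cap. This is illustrative only, not the actual Redpanda code (the real accounting is semaphore-based), and it takes the 128MB total as 128MiB, consistent with 256 × 512KiB:

```cpp
#include <algorithm>
#include <cstddef>

// Illustrative allowance arithmetic for the limits described above.
constexpr size_t shard_total = size_t(128) * 1024 * 1024; // 128MiB shard-wide
constexpr size_t per_partition_cap = 512 * 1024;          // legacy 512KiB
constexpr size_t per_partition_min = 32 * 1024;           // 32KiB floor

size_t partition_allowance(size_t compacted_partitions) {
    if (compacted_partitions == 0) {
        return per_partition_cap;
    }
    // Split the shard budget evenly, but never exceed the old cap and
    // never drop below the floor that avoids spilling on every key add.
    size_t fair_share = shard_total / compacted_partitions;
    return std::clamp(fair_share, per_partition_min, per_partition_cap);
}
```

Up to 256 partitions, each still gets the full 512KiB; beyond 4096 partitions, the 32KiB floor means the shard-wide budget can be exceeded in aggregate, which is the accepted trade-off against pathological spill frequency.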
This property controls the new bound on per-shard memory used for compaction indices at scale. It will probably be changed rarely in practice, but it mitigates the risk that people using compaction at high scale experience performance issues from index spills happening more often than they would like.
This enables other code to accurately account for memory use.
Previously, we ignored:
- the inline buffer in `bytes`, which affects the actual memory utilization
- the map value

The actual memory footprint still depends on how these sizes get rounded up to allocator boundaries, but this is an improvement.
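As a rough model of the fuller accounting: count the key object, any heap buffer behind it, and the map value. The `value_type` layout and the SSO heuristic here are assumptions for illustration (and, as noted, the real footprint also depends on allocator rounding):

```cpp
#include <cstddef>
#include <string>

// Stand-in for the index's map value; not the real type.
struct value_type {
    long offset;
    int delta;
};

size_t entry_mem_usage(const std::string& key, const value_type& v) {
    // Short keys sit in the string's inline (SSO) buffer, which is already
    // counted by sizeof(key); longer keys add a heap allocation.
    size_t inline_cap = std::string().capacity();
    size_t heap_bytes = key.capacity() > inline_cap ? key.capacity() + 1 : 0;
    return sizeof(key) + heap_bytes + sizeof(v);
}
```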
While the accounting is generally robust, it is a little fragile in error paths (or when the code is changed in future), since releasing more units from a sem_units than you took is an exception-generating condition: a double release could occur if we released before spill() and then an exception in spill() caused us to iterate again.
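A small model of the hazard, using a hypothetical `units` type loosely modeled on semaphore units (not Seastar's actual API): releasing only after a successful spill keeps the release single-shot even if spill() fails and the caller retries.

```cpp
#include <cassert>
#include <cstddef>

// Minimal stand-in for semaphore units: releasing more than held is a bug.
class units {
public:
    explicit units(size_t n) : _count(n) {}
    void release(size_t n) {
        assert(n <= _count); // a real semaphore_units would raise here
        _count -= n;
    }
    size_t count() const { return _count; }
private:
    size_t _count;
};

// Release after the spill succeeds: if the spill fails and the caller
// loops, the units are still held, so there is no double release on retry.
bool spill_then_release(units& u, size_t n, bool spill_ok) {
    if (!spill_ok) {
        return false; // keep holding the units; safe to retry
    }
    u.release(n);
    return true;
}
```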
Force-pushed from c69d869 to 5481557.
This needed a rebase for named semaphore changes.
ahh great point. i guess that's why we have the adaptive I/O priority bits that michal wrote for compaction.
The failure is #5868, which is unrelated and is getting fixed.
Cover letter
Previously, every compacted partition was allowed to use up to 512KiB of memory for its spill_key_index. For high partition counts, this was an unacceptably large overhead.
Now, there is an additional shard-wide limit on memory used for compaction indices. On systems with large numbers of compacted partitions, the indices will spill earlier when this limit is exceeded.
Fixes #4645
Backport Required
UX changes
None
Release notes
Improvements