
storage: per-shard limit on memory for spill_key_index #5722

Merged
merged 7 commits into redpanda-data:dev
Aug 5, 2022

Conversation

jcsp
Contributor

@jcsp jcsp commented Jul 29, 2022

Cover letter

Previously, every compacted partition was allowed to use up to 512 KiB of memory for its spill_key_index. For high partition counts, this was an unacceptably large overhead.

Now, there is an additional shard-wide limit on memory used for compaction indices. On systems with large numbers of compacted partitions, the indices will spill earlier when this limit is exceeded.
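As a rough illustration of the shard-wide limit, here is a minimal sketch of a byte budget that always grants the request (so accounting stays exact) but signals the caller to spill once the shard-wide pool is exhausted. The class and member names here are hypothetical stand-ins, not the actual storage_resources API:

```cpp
#include <cassert>
#include <cstddef>

// Illustrative sketch only: names are hypothetical, not the PR's code.
struct take_result {
    bool checkpoint_hint; // true => the caller's index should spill now
};

class compaction_index_budget {
    size_t _available;

public:
    explicit compaction_index_budget(size_t total)
      : _available(total) {}

    // Deduct bytes from the shard-wide budget. The request is always
    // granted so the books stay balanced, but when the budget is
    // exhausted we hint that the index should spill and free memory.
    take_result take_bytes(size_t n) {
        bool exhausted = n > _available;
        _available -= exhausted ? _available : n;
        return take_result{exhausted};
    }

    // Return bytes to the pool, e.g. after an index spills to disk.
    void release_bytes(size_t n) { _available += n; }

    size_t available() const { return _available; }
};
```

The key property is that indices never fail an allocation outright; they just spill earlier under shard-wide pressure, trading memory for extra index-write I/O.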

Fixes #4645

Backport Required

  • not a bug fix
  • papercut/not impactful enough to backport
  • v22.2.x
  • v22.1.x
  • v21.11.x

UX changes

None

Release notes

Improvements

  • Compacted topics automatically use less memory for their indices when the partition count is high, improving stability on large scale systems.

@jcsp jcsp added kind/enhance New feature or request area/storage labels Jul 29, 2022
@jcsp jcsp marked this pull request as ready for review July 29, 2022 17:07
Member

@dotnwat dotnwat left a comment


generally looks good. one question inline about semaphore accounting. also:

Previously, every compacted partition was allowed to use up to 512 KiB of memory for its spill_key_index. For high partition counts, this was an unacceptably large overhead.

i suppose this is a consequence of the design in which all of the head segments on compacted partitions have a spill key index that accumulates as appends occur? we've discussed it in the past and it might be an attractive alternative to this PR: let the normal compaction process build the spill key index on demand instead of tracking it per active segment.

src/v/storage/spill_key_index.cc
@@ -88,14 +88,34 @@ ss::future<> spill_key_index::add_key(compaction_key b, value_type v) {
auto const key_size = b.size();
auto const expected_size = idx_mem_usage() + _keys_mem_usage + key_size;

// TODO call into storage_resources
auto take_result = _resources.compaction_index_take_bytes(key_size);
Member


are we only accounting for the keys stored in the index? should we also account at least for the values associated with them?

Contributor Author


I was following the existing accounting's behavior of ignoring value sizes, but including the value size in the calculation is pretty simple, so we might as well do it -- I've added a commit.

Member


we do account for the values in idx_mem_usage()

Contributor Author


It looked to me like AllocatedByteSize on node_hash_map just returns the space used for the table's slots, and that the key and value are both allocated outside of the slot?
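For illustration, here is a hedged sketch of that distinction: tracking the key/value payload bytes in a separate tally, since container introspection typically reports only the table's own slot memory. std::unordered_map stands in for absl's node_hash_map, and the names are illustrative, not the PR's code:

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Illustrative sketch: the heap-allocated key/value payload is tallied
// by hand, separately from whatever the container reports about its
// own slot array (which is roughly what AllocatedByteSize-style
// introspection covers).
class accounted_index {
    std::unordered_map<std::string, long> _map;
    size_t _payload_bytes = 0; // key bytes + value bytes, our own tally

public:
    void add(std::string key, long value) {
        // Account before moving the key into the map.
        _payload_bytes += key.size() + sizeof(value);
        _map.emplace(std::move(key), value);
    }

    size_t payload_bytes() const { return _payload_bytes; }
};
```

Note this still undercounts real footprint (allocator rounding, node headers), but it captures the payload the container's slot accounting misses.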

@jcsp
Contributor Author

jcsp commented Aug 2, 2022

i suppose this is a consequence of the design in which all of the head segments on compacted partitions have a spill key index that accumulates as appends occur? we've discussed it in the past and it might be an attractive alternative to this PR: let the normal compaction process build the spill key index on demand instead of tracking it per active segment.

Yeah, compaction overall would benefit from deferring work, although regenerating the index later is in principle more I/O intensive than generating it while the data is passing through on the way in.

@jcsp jcsp force-pushed the issue-4645-spill-key-index branch from fa4d230 to 5ffb976 Compare August 2, 2022 15:35
@jcsp
Contributor Author

jcsp commented Aug 2, 2022

Updated this with refinements for review suggestions

Member

@dotnwat dotnwat left a comment


lgtm. looks like there is a linter error

src/v/storage/spill_key_index.cc
@jcsp jcsp force-pushed the issue-4645-spill-key-index branch from 5ffb976 to c69d869 Compare August 3, 2022 08:07
dotnwat
dotnwat previously approved these changes Aug 4, 2022
Member

@dotnwat dotnwat left a comment


Yeah, compaction overall would benefit from deferring work, although regenerating the index later is in principle more I/O intensive than generating it while the data is passing through on the way in.

IIUC the number of writes to the spill key index will increase as pressure on the memory limit set in this PR increases (more frequent spills). So we can increase the limit manually at runtime to decrease I/O if needed, and may indeed need to, since spills effectively become inline with append ops.

OTOH background compaction (including generating indexes when needed), disregarding any sort of qos, can effectively run as best effort I/O.

Is that an accurate summary?

@jcsp
Contributor Author

jcsp commented Aug 5, 2022

IIUC the number of writes to the spill key index will increase as pressure on the memory limit set in this PR increases (more frequent spills). So we can increase the limit manually at runtime to decrease I/O if needed, and may indeed need too since spills effectively become inline with append ops.

Right.

OTOH background compaction (including generating indexes when needed), disregarding any sort of qos, can effectively run as best effort I/O.

Is that an accurate summary?

Yeah, it would get to run at lower priority. That wouldn't help you if the system was long-term saturated and filling the retention limits, so that compaction needed to keep up with production, but that's where things get subjective about how real-world systems use it vs. how a worst-case brute-force benchmark would drive it.

tldr let's redo compaction at some point :-)

This is an additional bound, on top of the existing _max_mem
in spill_key_index: it will now also avoid using more memory
in total per shard.

This commit uses a static total of 128 MiB, which enables up to 256
compacted partitions to use the same 512 KiB per-partition allowance
that they were using before. As the partition count grows beyond that,
the limit starts throttling back, although each partition always gets
at least 32 KiB of memory, to avoid a pathological case where it
spills on every key add.
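The sizing described in that commit message could be sketched as a clamped fair share. The constants come from the message itself; the dividing function is an illustrative guess at the behavior, not the actual implementation:

```cpp
#include <algorithm>
#include <cstddef>

// Constants from the commit message; function shape is a sketch.
constexpr size_t total_budget = 128ull * 1024 * 1024; // 128 MiB per shard
constexpr size_t per_partition_max = 512 * 1024;      // historical cap
constexpr size_t per_partition_min = 32 * 1024;       // avoid spill-per-key

size_t per_partition_allowance(size_t compacted_partitions) {
    if (compacted_partitions == 0) {
        return per_partition_max;
    }
    // Divide the shard budget evenly, but never exceed the old 512 KiB
    // cap and never drop below the 32 KiB floor.
    size_t fair_share = total_budget / compacted_partitions;
    return std::clamp(fair_share, per_partition_min, per_partition_max);
}
```

At exactly 256 partitions the fair share is 512 KiB, matching the old behavior; beyond roughly 4096 partitions every index sits at the 32 KiB floor and relies on earlier spilling.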
This property controls the new bound on per-shard
memory used for compaction indices at scale.
This will probably be rarely changed in practice, but
it mitigates the risk that we have people using compaction
at high scale and experiencing performance issues from
index spills happening more often than they hoped.
This enables other code to accurately account for memory use.
Previously, we ignored:
- the inline buffer in `bytes`, which affects the actual memory utilization
- the map value.

Actual memory footprint still depends on how these sizes
get rounded up to allocator boundaries, but this is an improvement.
While in general the accounting is robust, it is a little
fragile in error paths (or if the code is changed in future),
because it is an exception-generating condition to release more
units from a sem_units than you took: a double release
could perhaps occur if we released before spill(), then
an exception in spill() caused us to iterate again.
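The hazard described there can be illustrated with a minimal RAII-style units holder in the spirit of Seastar's semaphore_units (illustrative, not the Seastar API): returning more units than are currently held throws, which is exactly what a double release in a retried error path would trigger.

```cpp
#include <cstddef>
#include <stdexcept>

// Illustrative sketch, not the Seastar semaphore_units API.
class sem_units {
    size_t _count;

public:
    explicit sem_units(size_t n) : _count(n) {}

    // Returning more units than are held indicates a double release,
    // so treat it as a logic error rather than silently over-crediting
    // the semaphore.
    void return_units(size_t n) {
        if (n > _count) {
            throw std::logic_error("double release of semaphore units");
        }
        _count -= n;
    }

    size_t count() const { return _count; }
};
```

This is why the ordering matters: release units only after spill() has succeeded, so a retry after an exception cannot release the same units twice.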
@jcsp
Contributor Author

jcsp commented Aug 5, 2022

This needed a rebase for named semaphore changes

@dotnwat
Member

dotnwat commented Aug 5, 2022

retention limits so that compaction needed to keep up with production

ahh great point. i guess that's why we have the adaptive I/O priority bits that michal wrote for compaction.

@dotnwat
Member

dotnwat commented Aug 5, 2022

failure is #5868 which is unrelated and getting fixed.

@dotnwat dotnwat merged commit d3bb917 into redpanda-data:dev Aug 5, 2022
@jcsp jcsp deleted the issue-4645-spill-key-index branch August 8, 2022 08:32
Development

Successfully merging this pull request may close these issues.

storage: Compaction index uses hardcoded 512kB memory limit
3 participants