Allow rm_stm to control max compaction offset #5318
Conversation
- Still needs a hook into compaction, and an integration test to validate that the compaction offset limiting works as expected.
src/v/storage/log_manager.cc
Outdated
@@ -122,6 +122,7 @@ log_manager::housekeeping_scan(model::timestamp collection_threshold) {
      co_await current_log.handle.compact(compaction_config(
        collection_threshold,
        _config.retention_bytes(),
        current_log.handle.stm_manager()->max_compactible_offset(),
it kind of seems like rm_stm should update storage layer (as opposed to storage querying rm_stm). rm_stm is in a position to know when an offset changes, and having information flow downward there is less complex coupling. wdyt?
@mmaslankaprv suggested using polling; it seems like a more straightforward solution. Also, in the next part of the compaction / transaction integration, the compaction logic will need to query whether a given record is part of an aborted or committed transaction. Using an object to query, instead of passing a list of aborted records, gives more space for memory optimization.
@mmaslankaprv suggested using polling, it seems like a more straightforward solution
I'm not sure I agree it is more straightforward when we consider the lifetime of all the new objects involved, namely that the callback to storage is going to have a captured reference back to rm_stm or some other thing. It's not a serious issue and shouldn't block work on a fix.
I do agree for querying and memory optimization of aborted records it doesn't make sense to push that down. It seems like this suggests that either part of transaction management should be in storage or some of compaction should be lifted out. I think it's worth discussing this sort of thing even if we go with something quick so we stay aware of how much tech debt we might be accumulating.
I agree that the inverted dependencies are annoying, and it's a good idea to lift compaction logic above raft and let the stm directly control it, but that's out of scope for the compaction / transactions integration, and we can't condition the transactions release on a fundamental refactoring of the compaction logic.
Right now we compact adjacent segments, but when transactions are involved, in order to decide whether a record `b` overwrites a record `a` (even when the records are from the same segment) we may need to look ahead an unbounded number of segments to search for the corresponding abort or commit txn markers. So we either need to change compaction from working with adjacent segments to working with an unbounded number of segments, or rely on an external oracle who happens to know the future to guide the compaction logic. The oracle requires fewer changes.
But there is a silver lining - we are not increasing the tech debt balance. The storage layer already depends on stm_manager which contains a reference to rm_stm so things are not getting worse.
I think I agree with Noah here; it would be cleaner to reverse the dependency, but I also get Denis's point about the already-existing stm_manager.
Yep. I was thinking similar things as I wrote this PR; would like to chat about layering here.
But there is a silver lining - we are not increasing the tech debt balance. The storage layer already depends on stm_manager which contains a reference to rm_stm so things are not getting worse.
thanks for mentioning this observation
Looks good, but I suggest postponing merging until we start using max_compactable_offset to actually limit compaction, so we get the whole picture of the change.
src/v/storage/types.h
Outdated
@@ -340,11 +340,14 @@ struct compaction_config {
    explicit compaction_config(
      model::timestamp upper,
      std::optional<size_t> max_bytes_in_log,
      std::optional<model::offset> max_compact_offset,
we already use model::offset::max() as an unset value (in stm_manager), so it's better to stick with it and not introduce optional. Making a type optional adds another dimension to the value space which we should keep in mind when we use the type.
This particular API, compaction_config, uses optional, so I thought it was more consistent. I will change it to ...max().
we already use model::offset::max()

the intention of default initialization model::offset{} (aka ::min) is to indicate an unset state. unless using ::max is convenient because of a calculation that is based on min(a, b), it is probably worth sticking to default initialization? I think that is what most of storage:: uses?
😆 you guys are making an advertisement for std::optional. Joking aside, stm_manager does depend on being able to use min() to reduce multiple values.
Absolutely. I am pro-std::optional and would very much like to see model::offset{} as an unset value go away. But in the meantime I guess we should be consistent :(
src/v/storage/types.h
Outdated
std::vector<ss::shared_ptr<snapshotable_stm>> _stms;
ss::shared_ptr<compactable_stm> _comp_stm;
do we need a special case for this stm type? would it be enough to use max_collectible_offset as a compaction boundary?
This was one of my first ideas. @rystsov indicated that this compaction codepath should be distinct from the snapshot/eviction code. We should chat about the definition of "max collectible" and all the use cases.
I was saying that just setting max_collectible_offset to the LSO isn't enough, because it isn't used in compaction. We may update the compaction layer to use max_collectible_offset to postpone compaction, but the current approach has merit too, e.g. we may use max_compactable_offset & max_collectible_offset to make sure we never do log eviction before we compact. Thoughts?
Thoughts?
I'm not too sure what you are proposing/asking, @rystsov.
IIUC, he's saying the value of having them separate is that it allows stm to control compaction and eviction separately, and asking if we think it is worthwhile.
i would vote for using max_collectible_offset in compaction; there is no point in compacting a log if it is about to be deleted either way, and having a special case for a compaction stm makes the whole thing more complicated
As much as I'd love to get this merged, I tend to agree with @mmaslankaprv. At least this is what I wanted to try first, because we can always add separate compaction control in the future if needed. I didn't see a need for this currently, but maybe in the future we introduce a new compaction algorithm or something that needs it?
Unless there are any objections, I'll whip up a new version which eliminates the separate compactable offset plumbing, so we can see the difference in approaches.
src/v/storage/types.h
Outdated
template<
  typename T,
  typename = std::enable_if_t<std::is_base_of_v<snapshotable_stm, T>>>
void add_stm(ss::shared_ptr<T> stm) {
It might be a clearer api if you added an add_compactable_stm interface, since we already have a run-time requirement that two compactable stms cannot be added.
Agreed.
I ended up not doing this (separately-named method) because of the awkwardness of the class hierarchy; a compactable_stm is not a snapshotable_stm. Ended up using a dynamic cast. LMK what you think.
Force push:
Marking draft--will also include integration with compaction and a test in this PR. Stay tuned.
Added: integration test and hooking into actual compaction logic. Still a draft: rewriting the test code.
293f985 to ab3a858
Force-push:
(Sorry for the multiple pushes--I will debounce better in the future 😅)
- Use model::offset::max() as a default value; this means you may delete or compact up to any offset.
- Compaction now takes max_collectible_offset into account. No effect yet, since it is still set to max() in the compaction config.
Code is moved as-is, except where I cheated and moved a constant up to file scope.
This is a `generator` in the computer science sense; it generates a sequence, like an iterator. The set of values this generator will output is limited to a maximum size (cardinality). The distribution of values is uniform. They are a fixed length, and the simple algorithm values memory efficiency over actual entropy. Useful for randomly generating keys that can be compacted.
For testing compaction, we want a variant of append_random_batches() which limits the total number of distinct keys used. Unlike fully-random keys, this ensures that compaction can actually do some work.
Introduce a utility for compaction testing which helps detect which offset ranges were affected by compaction. This utility makes a single pass over all of the message offsets in a given log instance, and returns the total number of offset gaps and the min and max offsets that were missing.
Last Stable Offset (LSO) is the largest offset such that no messages, up to and including LSO, are part of an unresolved transaction. Use LSO to determine the maximum offset that may be affected by compaction. This ensures that messages for open transactions--which may be aborted or committed in the future--cannot affect compaction. For example, an aborted message cannot replace a previous message with the same key, during compaction.
Three integration tests that verify behavior of max_compactible_offset: - Randomized batch / message generation with a limited number of keys. Verify that compaction only deletes messages less than this max. - Single key #1: fixed message test that checks that the boundaries of the compacted area are precisely as expected. - Single key #2: same as above, but with max_compactible_offset unset; we validate that compaction works as before.
Thanks @dotnwat. My next force push should address all of these, except the question about dirty offset, and maybe the one on changing … For the compactable_stm interface stuff, the thread is here, and the bottom of the PR description covers the change. That commit was deleted--for now at least.
7902cf3 to ce45069
Force-push: (the compare link should show as empty since the end result is the same)
Current CI failures are unrelated: #6237.
good point, i didn't notice it was in this namespace!
lgtm.
we should have @rystsov (the author of the feature request), and/or @mmaslankaprv approve too.
@@ -26,7 +26,6 @@
#include <seastar/core/sstring.hh>
#include <seastar/util/bool_class.hh>

#include <optional>
nit: later in the file:
std::optional<size_t> max_partition_retention_size);
@@ -36,6 +36,9 @@ struct record_batch_spec {
    std::optional<model::timestamp> timestamp;
};

// Makes a random alphanumeric string, encoded in an iobuf.
iobuf make_iobuf(size_t);
nit: it looks to me like the fully qualified name is model::test::make_iobuf. i'd move it to random generators or bytes test helpers. can be moved later.
// If adjacent segment compaction worked in order from oldest to newest, we
// could use this assert.
#if 0
this seems reasonable, but using a multi-line comment instead of an if 0 block would be more idiomatic in the code base. this particular case seems very clear and the if 0/endif kind of highlights the aspect of it being code that the comment is referring to
great change! just small nits
@@ -251,6 +251,7 @@ rm_stm::rm_stm(
}

bool rm_stm::check_tx_permitted() {
    // TODO support compaction
can you please open a github issue with context on what exactly needs to be done, and put the link here?
It already exists: #4824
can you please open a github issue with context on what exactly needs to be done, and put the link here?

thanks @LenaAn. it is nice to have the link to the github issue in the code comment itself; it's usually a big lift to reverse index from a code comment back to a github review comment.
    return s;
}

} // namespace random_generators
nit: no newline at the end of the file
 * minimum and maximum offsets that were missing.
 * This is useful for observing which offsets were affected by compaction.
 */
struct log_gap_analysis {
nit: another useful thing would be the analysis of records read and unique keys read. For example, we produce 10000 records with 200 unique keys and consume 200 records with 200 unique keys. So we see how much we compact data.
Looks good! The cover letter, the commit messages and the comments make it so easy to understand.
ss::sstring gen_alphanum_string(size_t n);

static constexpr size_t alphanum_max_distinct_strlen = 32;
nit: as i understand, gen_alphanum_max_distinct returns a random string with fixed length equal to alphanum_max_distinct_strlen; because the string length is fixed, it's better to rename it to alphanum_distinct_strlen
// If adjacent segment compaction worked in order from oldest to newest, we
// could use this assert.
#if 0
#if false feels like a merge-broken version check or something; agree with Noah, it's better to use a multi-line comment
Thanks for the reviews! I will address the remaining nits (change the preprocessor conditional to a comment, etc.) in a followup PR coming shortly.
Allow rm_stm to limit the maximum offset that compaction may affect, so it can prevent unresolved transactional messages from affecting other records. Fixes #4823
Highlights:
storage: add max_collectible_offset to compaction_config
storage/tests: add key-limited random batch generator
For testing compaction, we want a variant of append_random_batches() which limits the total number of distinct keys used. Unlike fully-random keys, this ensures that compaction can actually do some work.
s/test/util: basic log gap analysis for compaction tests
Introduce a utility for compaction testing which helps detect which offset ranges were affected by compaction. This utility makes a single pass over all of the message offsets in a given log instance, and returns the total number of offset gaps and the min and max offsets that were missing.
storage/test: integration test for max compaction (max_collectible) offset
Three integration tests that verify desired behavior:
Randomized batch / message generation with a limited number of keys. Verify that compaction only deletes messages less than this max.
Single key with max compact offset (MCO): fixed message test that checks that the boundaries of the compacted area are precisely as expected.
Single key w/o MCO: same as above, but with max_collectible_offset unset; we validate that compaction works as before.
rm_stm: use last stable offset to limit compaction
Last Stable Offset (LSO) is the largest offset such that no messages, up to and including LSO, are part of an unresolved transaction. Use LSO to determine the maximum offset that may be affected by compaction. This ensures that messages for open transactions--which may be aborted or committed in the future--cannot affect compaction. For example, an aborted message cannot replace a previous message with the same key, during compaction.
v1 PR
included these changes to support separate control for deletion/eviction versus compaction, which we agreed can be eliminated for now:
storage: stm_manager support for compactable_stm
Make stm_manager aware of a new type of stm: compactable_stm. This allows STM implementations to control compaction policy without leaking details into the main compaction logic. For now, this just exposes max_compactable_offset(), which allows rm_stm to prevent unresolved transactional messages from affecting compaction. We also plan to use this plumbing for a per-message predicate which will allow compaction to query the STM, and thus handle things like aborted records properly.
Release notes