
Add shuffle-sharding for the compactor #4433

Merged: 17 commits into cortexproject:master from add-sharding, Apr 7, 2022

Conversation

@ac1214 (Contributor) commented Aug 19, 2021

What this PR does:

Adds shuffle-sharding for the plans generated by the grouper. Depends on #4432; once #4432 is merged, the diff for this PR will be reduced.

Implements the Parallel Compaction by Time Interval proposal.

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@bboreham (Contributor) left a comment

Seems plausible. Will review again after the preceding PRs are merged.

pkg/compactor/compactor.go (review thread, outdated, resolved)
@edma2 (Contributor) commented Nov 15, 2021

@ac1214 Any plans to resume progress on this amazing work?
We observed great improvement to compaction speed after running this branch. Our biggest tenant is ~ 40-50M time series and we were able to fully compact it down to 24h blocks in a few days.

@ac1214 (Contributor, Author) commented Nov 15, 2021

> @ac1214 Any plans to resume progress on this amazing work? We observed great improvement to compaction speed after running this branch. Our biggest tenant is ~ 40-50M time series and we were able to fully compact it down to 24h blocks in a few days.

Thanks for checking/testing out these changes @edma2!

I haven't had much time to work on these changes, but I believe that @roystchiang might be taking over this work.

-	if !c.compactorCfg.ShardingEnabled {
+	// Always owned if sharding is disabled or if using shuffle-sharding as shard ownership
+	// is determined by the shuffle sharding grouper.
+	if !c.compactorCfg.ShardingEnabled || c.compactorCfg.ShardingStrategy == util.ShardingStrategyShuffle {
 		return true, nil
@edma2 (Contributor) commented Nov 16, 2021

Every Compactor now "owns" all the users, even if it doesn't actually compact any blocks for most users. One particular impact I saw is a huge growth of metadata syncs. Instead, maybe we can compute ownUser based on the shuffle shard of a tenant.
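
A minimal sketch of what that ownership check could look like, assuming the Cortex ring exposes ShuffleShard and HasInstance (as pkg/ring's ReadRing does) and that the shard size comes from the per-tenant compactor_tenant_shard_size limit; the function name and signature are illustrative, not the code that was eventually merged:

import "github.com/cortexproject/cortex/pkg/ring"

// ownUserForCompaction reports whether this compactor instance should compact
// the given tenant under shuffle-sharding: build the tenant's shuffle shard of
// the compactor ring and check whether our own instance ID belongs to it.
// shardSize would come from the tenant's compactor_tenant_shard_size limit;
// a non-positive size typically means the full ring, i.e. every compactor
// owns the tenant, as before.
func ownUserForCompaction(r ring.ReadRing, instanceID, userID string, shardSize int) bool {
	subRing := r.ShuffleShard(userID, shardSize)
	return subRing.HasInstance(instanceID)
}

With ownership scoped this way, a compactor can also skip metadata syncs for tenants outside its shards, which addresses the growth in syncs described above.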

@edma2 (Contributor) commented
Quick fix that seems to work: b0c2c2a

@roystchiang (Contributor) commented
hey @edma2 thanks for taking a look at the compactor PR. We are actively testing this branch in our beta environment, and I'm glad to hear that it is working for you.

We have a similar fix to your commit, and I can confirm that we also saw an improvement in the metadata syncs.

However, on the tenant clean-up side, we were running into issues where the deleted tenant directory was left dangling. While compactor-A is deleting the deletion markers, compactor-B is also trying to sync the data, and re-uploads the block index. We currently provide an override for the tenant shard size on the clean-up path so that only one compactor owns the clean-up for a given tenant.
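
One way to express that override, sketched under the same ring assumptions as the ownership check above (the constant and function name are hypothetical): force the shard size to 1 on the clean-up path so that exactly one instance owns deletion handling for a tenant.

// ownUserForCleanup ensures a single compactor owns the clean-up of a deleted
// tenant. Forcing the shuffle shard size to 1 avoids the race described above,
// where one compactor deletes the deletion markers while another re-syncs the
// tenant and re-uploads the bucket index.
func ownUserForCleanup(r ring.ReadRing, instanceID, userID string) bool {
	const cleanupShardSize = 1 // hypothetical override: ignore the tenant's configured shard size
	return r.ShuffleShard(userID, cleanupShardSize).HasInstance(instanceID)
}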

@edma2 (Contributor) commented
thanks for the tip on the cleanup side. Makes sense that you don't want multiple compactors repeating the same work and potentially conflicting.

@edma2 (Contributor) commented Dec 9, 2021

@roystchiang with this PR have you started seeing any errors that look like overlapping sources detected for plan? I usually delete one of the culprit blocks to resolve the issue, but I was wondering if this is something you also noticed on your end, and if so would support the theory that this is related to parallel compaction.

@roystchiang (Contributor) commented

> @roystchiang with this PR have you started seeing any errors that look like overlapping sources detected for plan? I usually delete one of the culprit blocks to resolve the issue, but I was wondering if this is something you also noticed on your end, and if so would support the theory that this is related to parallel compaction.

We've been running this change for a bigger workspace, but we have not seen this error yet.

Are you able to provide more detail? Is it a level-1 -> level-2 compaction, and what level are these blocks?

@edma2 (Contributor) commented Dec 11, 2021

Currently, my theory is that this happens in the following situation:

1. For a vertical compaction group, e.g. (10:00 - 12:00), one Ingester has not yet uploaded its block (for some reason).
2. L1 -> L2 compaction proceeds anyway and compactor A outputs A1 as an L2 block.
3. Compactor B then starts L2 -> L3 compaction (e.g. 00:00 - 12:00) which includes block A1.
4. The late Ingester uploads the missing block M (for 10:00 - 12:00).
5. Compactor A compacts again because there is now more than one block for that (10:00 - 12:00) compaction group: [A1, M].
6. Compactor B finishes compaction and outputs block B1 (00:00 - 12:00), which includes A1 as a source.
7. Compactor A finishes compaction and outputs block A2 (10:00 - 12:00), which includes A1 as a source.

Compactor B tries to compact [B1, A2] within the (00:00 - 12:00) compaction group and fails since both include A1 as the source. I don't think this issue would happen if only one compactor was running for a tenant.
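
To make the failure concrete: both B1 and A2 would carry A1 in their meta.json compaction sources, and a planner that checks for shared sources has to reject such a group. A simplified sketch of that kind of check, using Thanos block metadata types (the function name is illustrative, not the actual planner code):

import (
	"fmt"

	"github.com/oklog/ulid"
	"github.com/thanos-io/thanos/pkg/block/metadata"
)

// findOverlappingSources returns an error if two blocks in a compaction plan
// share a source block. In the scenario above, B1 (00:00 - 12:00) and
// A2 (10:00 - 12:00) both list A1 in Compaction.Sources, so compacting them
// together would duplicate A1's samples and the plan must be rejected.
func findOverlappingSources(plan []*metadata.Meta) error {
	seen := map[ulid.ULID]ulid.ULID{} // source block -> plan block that contains it
	for _, m := range plan {
		for _, src := range m.Compaction.Sources {
			if other, ok := seen[src]; ok {
				return fmt.Errorf("overlapping sources detected for plan: %s is a source of both %s and %s", src, other, m.ULID)
			}
			seen[src] = m.ULID
		}
	}
	return nil
}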

@wilfriedroset commented

I've been successfully testing this PR on my compactors, on a cluster with a single tenant that has ~200M active series.

I'm running 6 compactors with the following configuration.

compactor:
  block_deletion_marks_migration_enabled: false
  block_sync_concurrency: 10
  compaction_interval: 30m0s
  data_dir: /var/lib/cortex/data
  meta_sync_concurrency: 10
  sharding_enabled: true
  sharding_ring:
    instance_interface_names:
    - eth1
    kvstore:
      prefix: cortex/collectors/
  sharding_strategy: shuffle-sharding
limits:
  compactor_tenant_shard_size: 6
target: compactor

Here are a couple of messages from the logs which could raise concern.
The first one has been discussed above:

  • Dec 13 13:51:02 cortex-compactor-6 cortex[8188]: ts=2021-12-13T13:51:02.855219667Z caller=log.go:168 component=compactor level=info msg="Found overlapping blocks during compaction"
    I'm not sure how bad this one is as the retry seems to be working
  • Dec 13 13:53:38 cortex-compactor-6 cortex[8188]: level=debug ts=2021-12-13T13:53:38.781769674Z caller=client.go:191 msg="error CASing, trying again" key=cortex/collectors/compactor index=20677798

The last one is more concerning; I've seen it 20 times, and only on 1 of my 6 compactors:
Dec 11 04:53:42 cortex-compactor-6 cortex[8188]: level=warn ts=2021-12-11T04:53:42.954993545Z caller=block.go:191 component=compactor org_id=fake groupKey=29059490200@5679675083797525161 rangeStart="2021-11-27 10:00:00 +0000 UTC" rangeEnd="2021-11-27 12:00:00 +0000 UTC" externalLabels="{__org_id__=\"fake\"}" downsampleResolution=0 msg="requested to mark for deletion, but file already exists; this should not happen; investigate" err="file 01FNGQS0QF8RXE7YK043ZNPH3J/deletion-mark.json already exists in bucket"

@ac1214 mentioned this pull request on Dec 17, 2021
@roystchiang (Contributor) commented

@edma2, what you suggest makes sense, and it can definitely happen. How often does this issue happen for you?

Let me see if there's a way to avoid this. With a coordinator for compactors, we can block the compaction of A2 until B1 is done. However, we don't have this right now. Thanos played around with this idea in thanos-io/thanos#4458, and it would be nice to unify this logic for both Thanos and Cortex.

@wilfriedroset, regarding "requested to mark for deletion, but file already exists; this should not happen; investigate" err="file 01FNGQS0QF8RXE7YK043ZNPH3J/deletion-mark.json already exists in bucket": is this part of the user retention period cleanup, or regular compaction?

@wilfriedroset commented

The error message I got was during a regular compaction.

@roystchiang force-pushed the add-sharding branch 2 times, most recently from 9c428f2 to f9a93f7 on January 18, 2022 at 00:22
Review threads (outdated, resolved): pkg/compactor/compactor.go (4 threads), pkg/compactor/shuffle_sharding_grouper.go (1 thread)
@roystchiang force-pushed the add-sharding branch 2 times, most recently from a04eb07 to 74d182c on January 18, 2022 at 23:33
@edma2 (Contributor) commented Jan 19, 2022

> What you suggest makes sense, and it can definitely happen. How often does this issue happen for you?

not very often, I'd say a corrupt block is created less than once a week

@alanprot (Member) commented Jan 25, 2022

> Compactor B then starts L2 -> L3 compaction (e.g. 00:00 - 12:00) which includes block A1.

Could we lock block A1 at this point, so it would not be used as a source for any other compaction?
If we do that, we would not have the following problem anymore, right?

> Compactor B tries to compact [B1, A2] within the (00:00 - 12:00) compaction group and fails since both include A1 as the source. I don't think this issue would happen if only one compactor was running for a tenant.

The only problem with locks, of course, is what to do if the compactor dies: we would need some kind of lock timeout, and the compactor would have to keep updating the lock's validUntil to now+10m, or something like that.

@edma2 Could you confirm that that's what happened in your case? Maybe by looking at the timestamps of when the source blocks got updated on S3?
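
For reference, a rough and purely hypothetical sketch of the lease idea (nothing like this exists in the PR): the owning compactor records a validUntil timestamp for the block and keeps extending it while the compaction runs, so a crashed compactor's lock simply expires.

import (
	"context"
	"time"
)

// blockLease is a hypothetical lock a compactor could write next to a block
// (e.g. a small JSON object in the bucket) before using it as a compaction source.
type blockLease struct {
	Owner      string    `json:"owner"`
	ValidUntil time.Time `json:"valid_until"`
}

// keepLeaseAlive periodically extends the lease while compaction is running.
// If the compactor dies, the lease stops being renewed and other compactors
// may treat the block as unlocked once ValidUntil has passed.
func keepLeaseAlive(ctx context.Context, leaseDuration time.Duration, renew func(until time.Time) error) {
	ticker := time.NewTicker(leaseDuration / 2)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := renew(time.Now().Add(leaseDuration)); err != nil {
				// Renewal failed: the caller should abort compactions that use this block.
				return
			}
		}
	}
}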

@edma2 (Contributor) commented Feb 3, 2022

My original theory was based on what I saw in the compactor logs. I'll look at this again and see if I can find any recent compactor logs that tell this story.

@nschad (Contributor) commented Feb 16, 2022

We tested a few blocks starting from level-1 compaction (~1TB) on a new bucket. To test it, I essentially ran the same config as @wilfriedroset. Couldn't find any major issues; everything seems to have compacted just fine.

Extra: took this PR, rebased it against current master, and then built the image.

Update: compacted 10 TiB this way, zero issues.

ac1214 and others added 14 commits April 6, 2022 15:59
Signed-off-by: Albert <ac1214@users.noreply.github.com>
Signed-off-by: Albert <ac1214@users.noreply.github.com>
Signed-off-by: Albert <ac1214@users.noreply.github.com>
Signed-off-by: Albert <ac1214@users.noreply.github.com>
Signed-off-by: Albert <ac1214@users.noreply.github.com>
Signed-off-by: Alvin Lin <alvinlin@amazon.com>
Signed-off-by: Alvin Lin <alvinlin@amazon.com>
Signed-off-by: Alvin Lin <alvinlin@amazon.com>
Signed-off-by: Alvin Lin <alvinlin@amazon.com>
…ctor via ring, instead of returning true if shuffle-sharding is enabled

Signed-off-by: Roy Chiang <roychi@amazon.com>
…nt at once, which results in dangling bucket index

Signed-off-by: Roy Chiang <roychi@amazon.com>
… it as plans get generated

Signed-off-by: Roy Chiang <roychi@amazon.com>
Signed-off-by: Roy Chiang <roychi@amazon.com>
Signed-off-by: Roy Chiang <roychi@amazon.com>
Signed-off-by: Roy Chiang <roychi@amazon.com>
roystchiang and others added 2 commits April 6, 2022 18:28
Signed-off-by: Roy Chiang <roychi@amazon.com>
@alanprot (Member) commented Apr 7, 2022

I've been running this for some time and we did not see the problem happen. We could skip the problematic block using #4707.

We could also propose a change in Thanos to skip those blocks, similar to what was introduced in thanos-io/thanos#4469.

@alanprot merged commit 1e229ce into cortexproject:master on Apr 7, 2022