
Add shuffle-sharding for the compactor #4433

Merged: 17 commits into cortexproject:master from add-sharding, Apr 7, 2022

Conversation

@ac1214 (Contributor) commented Aug 19, 2021

What this PR does:

Adds shuffle-sharding for the plans generated by the grouper. Depends on #4432; once #4432 is merged, the diff for this PR will be reduced.

Implements the Parallel Compaction by Time Interval proposal.

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@bboreham (Contributor) left a comment

Seems plausible. Will review again after the preceding PRs are merged.

pkg/compactor/compactor.go (review thread, outdated, resolved)
@edma2 (Contributor) commented Nov 15, 2021

@ac1214 Any plans to resume progress on this amazing work?
We observed great improvement to compaction speed after running this branch. Our biggest tenant is ~ 40-50M time series and we were able to fully compact it down to 24h blocks in a few days.

@ac1214 (Contributor, Author) commented Nov 15, 2021

> @ac1214 Any plans to resume progress on this amazing work? We observed great improvement to compaction speed after running this branch. Our biggest tenant is ~ 40-50M time series and we were able to fully compact it down to 24h blocks in a few days.

Thanks for checking/testing out these changes @edma2!

I haven't had much time to work on these changes, but I believe that @roystchiang might be taking over this work.

-	if !c.compactorCfg.ShardingEnabled {
+	// Always owned if sharding is disabled or if using shuffle-sharding as shard ownership
+	// is determined by the shuffle sharding grouper.
+	if !c.compactorCfg.ShardingEnabled || c.compactorCfg.ShardingStrategy == util.ShardingStrategyShuffle {
 		return true, nil
@edma2 (Contributor) commented Nov 16, 2021

Every Compactor now "owns" all the users, even if it doesn't actually compact any blocks for most users. One particular impact I saw is a huge growth of metadata syncs. Instead, maybe we can compute ownUser based on the shuffle shard of a tenant.
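
A minimal sketch of what that ownership check could look like, assuming the Cortex ring exposes ShuffleShard and HasInstance (as pkg/ring's ReadRing does) and that the shard size comes from the per-tenant compactor_tenant_shard_size limit; the function name and signature are illustrative, not the code that was eventually merged:

import "github.com/cortexproject/cortex/pkg/ring"

// ownUserForCompaction reports whether this compactor instance should compact
// the given tenant under shuffle-sharding: build the tenant's shuffle shard of
// the compactor ring and check whether our own instance ID belongs to it.
// shardSize would come from the tenant's compactor_tenant_shard_size limit;
// a non-positive size typically means the full ring, i.e. every compactor
// owns the tenant, as before.
func ownUserForCompaction(r ring.ReadRing, instanceID, userID string, shardSize int) bool {
	subRing := r.ShuffleShard(userID, shardSize)
	return subRing.HasInstance(instanceID)
}

With ownership scoped this way, a compactor can also skip metadata syncs for tenants outside its shards, which addresses the growth in syncs described above.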

@edma2 (Contributor) commented
Quick fix that seems to work: b0c2c2a

@roystchiang (Contributor) commented
hey @edma2 thanks for taking a look at the compactor PR. We are actively testing this branch in our beta environment, and I'm glad to hear that it is working for you.

We have a similar fix to your commit, and I can confirm that we also saw an improvement in the metadata syncs.

However, on the tenant clean-up side, we were running into issues where the deleted tenant directory was left dangling. While compactor-A is deleting the deletion markers, compactor-B is also trying to sync the data, and re-uploads the block index. We currently provide an override for the tenant shard size on the clean-up path so that only one compactor owns the clean-up for a given tenant.
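
One way to express that override, sketched under the same ring assumptions as the ownership check above (the constant and function name are hypothetical): force the shard size to 1 on the clean-up path so that exactly one instance owns deletion handling for a tenant.

// ownUserForCleanup ensures a single compactor owns the clean-up of a deleted
// tenant. Forcing the shuffle shard size to 1 avoids the race described above,
// where one compactor deletes the deletion markers while another re-syncs the
// tenant and re-uploads the bucket index.
func ownUserForCleanup(r ring.ReadRing, instanceID, userID string) bool {
	const cleanupShardSize = 1 // hypothetical override: ignore the tenant's configured shard size
	return r.ShuffleShard(userID, cleanupShardSize).HasInstance(instanceID)
}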

@edma2 (Contributor) commented
thanks for the tip on the cleanup side. Makes sense that you don't want multiple compactors repeating the same work and potentially conflicting.

@edma2 (Contributor) commented Dec 9, 2021

@roystchiang with this PR have you started seeing any errors that look like overlapping sources detected for plan? I usually delete one of the culprit blocks to resolve the issue, but I was wondering if this is something you also noticed on your end, and if so would support the theory that this is related to parallel compaction.

@roystchiang (Contributor) commented

> @roystchiang with this PR have you started seeing any errors that look like overlapping sources detected for plan? I usually delete one of the culprit blocks to resolve the issue, but I was wondering if this is something you also noticed on your end, and if so would support the theory that this is related to parallel compaction.

We've been running this change for a bigger workspace, but we have not seen this error yet.

Are you able to provide more detail? Is it a level-1 -> level-2 compaction, and what level are these blocks?

@edma2 (Contributor) commented Dec 11, 2021

Currently, my theory is that this happens in the following situation:

1. For a vertical compaction group, e.g. (10:00 - 12:00), one Ingester has not yet uploaded its block (for some reason).
2. L1 -> L2 compaction proceeds anyway and compactor A outputs A1 as an L2 block.
3. Compactor B then starts L2 -> L3 compaction (e.g. 00:00 - 12:00) which includes block A1.
4. The late Ingester uploads the missing block M (for 10:00 - 12:00).
5. Compactor A compacts again because there is now more than one block for that (10:00 - 12:00) compaction group: [A1, M].
6. Compactor B finishes compaction and outputs block B1 (00:00 - 12:00), which includes A1 as a source.
7. Compactor A finishes compaction and outputs block A2 (10:00 - 12:00), which includes A1 as a source.

Compactor B tries to compact [B1, A2] within the (00:00 - 12:00) compaction group and fails since both include A1 as the source. I don't think this issue would happen if only one compactor was running for a tenant.
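
To make the failure concrete: both B1 and A2 would carry A1 in their meta.json compaction sources, and a planner that checks for shared sources has to reject such a group. A simplified sketch of that kind of check, using Thanos block metadata types (the function name is illustrative, not the actual planner code):

import (
	"fmt"

	"github.com/oklog/ulid"
	"github.com/thanos-io/thanos/pkg/block/metadata"
)

// findOverlappingSources returns an error if two blocks in a compaction plan
// share a source block. In the scenario above, B1 (00:00 - 12:00) and
// A2 (10:00 - 12:00) both list A1 in Compaction.Sources, so compacting them
// together would duplicate A1's samples and the plan must be rejected.
func findOverlappingSources(plan []*metadata.Meta) error {
	seen := map[ulid.ULID]ulid.ULID{} // source block -> plan block that contains it
	for _, m := range plan {
		for _, src := range m.Compaction.Sources {
			if other, ok := seen[src]; ok {
				return fmt.Errorf("overlapping sources detected for plan: %s is a source of both %s and %s", src, other, m.ULID)
			}
			seen[src] = m.ULID
		}
	}
	return nil
}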

@wilfriedroset commented

I've been successfully testing this PR on my compactors, on a cluster with a single tenant that has ~200M active series.

I'm running 6 compactors with the following configuration.

compactor:
  block_deletion_marks_migration_enabled: false
  block_sync_concurrency: 10
  compaction_interval: 30m0s
  data_dir: /var/lib/cortex/data
  meta_sync_concurrency: 10
  sharding_enabled: true
  sharding_ring:
    instance_interface_names:
    - eth1
    kvstore:
      prefix: cortex/collectors/
  sharding_strategy: shuffle-sharding
limits:
  compactor_tenant_shard_size: 6
target: compactor

Here are a couple of messages from the logs which could raise concern.
The first one has been discussed above:

  • Dec 13 13:51:02 cortex-compactor-6 cortex[8188]: ts=2021-12-13T13:51:02.855219667Z caller=log.go:168 component=compactor level=info msg="Found overlapping blocks during compaction"
    I'm not sure how bad this one is as the retry seems to be working
  • Dec 13 13:53:38 cortex-compactor-6 cortex[8188]: level=debug ts=2021-12-13T13:53:38.781769674Z caller=client.go:191 msg="error CASing, trying again" key=cortex/collectors/compactor index=20677798

The last one is more concerning; I've seen it 20 times, and only on 1 of my 6 compactors:
Dec 11 04:53:42 cortex-compactor-6 cortex[8188]: level=warn ts=2021-12-11T04:53:42.954993545Z caller=block.go:191 component=compactor org_id=fake groupKey=29059490200@5679675083797525161 rangeStart="2021-11-27 10:00:00 +0000 UTC" rangeEnd="2021-11-27 12:00:00 +0000 UTC" externalLabels="{__org_id__=\"fake\"}" downsampleResolution=0 msg="requested to mark for deletion, but file already exists; this should not happen; investigate" err="file 01FNGQS0QF8RXE7YK043ZNPH3J/deletion-mark.json already exists in bucket"

@ac1214 mentioned this pull request on Dec 17, 2021
@roystchiang (Contributor) commented

@edma2, what you suggest makes sense, and it can definitely happen. How often does this issue happen for you?

Let me see if there's a way to avoid this. With a coordinator for compactors, we can block the compaction of A2 until B1 is done. However, we don't have this right now. Thanos played around with this idea in thanos-io/thanos#4458, and it would be nice to unify this logic for both Thanos and Cortex.

@wilfriedroset, regarding "requested to mark for deletion, but file already exists; this should not happen; investigate" err="file 01FNGQS0QF8RXE7YK043ZNPH3J/deletion-mark.json already exists in bucket": is this part of the user retention period cleanup, or regular compaction?

@wilfriedroset commented

The error message I got was during a regular compaction.

@roystchiang force-pushed the add-sharding branch 2 times, most recently from 9c428f2 to f9a93f7 on January 18, 2022 at 00:22
Review threads (outdated, resolved): pkg/compactor/compactor.go (4 threads), pkg/compactor/shuffle_sharding_grouper.go (1 thread)
@roystchiang force-pushed the add-sharding branch 2 times, most recently from a04eb07 to 74d182c on January 18, 2022 at 23:33
@edma2 (Contributor) commented Jan 19, 2022

> What you suggest makes sense, and it can definitely happen. How often does this issue happen for you?

not very often, I'd say a corrupt block is created less than once a week

@alanprot (Member) commented Jan 25, 2022

> Compactor B then starts L2 -> L3 compaction (e.g. 00:00 - 12:00) which includes block A1.

Could we lock block A1 at this point, so it would not be used as a source for any other compaction?
If we do that, we would not have the following problem anymore, right?

> Compactor B tries to compact [B1, A2] within the (00:00 - 12:00) compaction group and fails since both include A1 as the source. I don't think this issue would happen if only one compactor was running for a tenant.

The only problem with locks, of course, is what to do if the compactor dies: we would need some kind of lock timeout, and the compactor would have to keep updating the lock's validUntil to now+10m, or something like that.

@edma2 Could you confirm that that's what happened in your case? Maybe by looking at the timestamps of when the source blocks got updated on S3?
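
For reference, a rough and purely hypothetical sketch of the lease idea (nothing like this exists in the PR): the owning compactor records a validUntil timestamp for the block and keeps extending it while the compaction runs, so a crashed compactor's lock simply expires.

import (
	"context"
	"time"
)

// blockLease is a hypothetical lock a compactor could write next to a block
// (e.g. a small JSON object in the bucket) before using it as a compaction source.
type blockLease struct {
	Owner      string    `json:"owner"`
	ValidUntil time.Time `json:"valid_until"`
}

// keepLeaseAlive periodically extends the lease while compaction is running.
// If the compactor dies, the lease stops being renewed and other compactors
// may treat the block as unlocked once ValidUntil has passed.
func keepLeaseAlive(ctx context.Context, leaseDuration time.Duration, renew func(until time.Time) error) {
	ticker := time.NewTicker(leaseDuration / 2)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := renew(time.Now().Add(leaseDuration)); err != nil {
				// Renewal failed: the caller should abort compactions that use this block.
				return
			}
		}
	}
}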

@edma2 (Contributor) commented Feb 3, 2022

My original theory was based on what I saw in the compactor logs. I'll look at this again and see if I can find any recent compactor logs that tell this story.

@nschad (Contributor) commented Feb 16, 2022

We tested a few blocks starting from level-1 compaction (~1TB) on a new bucket. To test it, I essentially ran the same config as @wilfriedroset. Couldn't find any major issues; everything seems to have compacted just fine.

Extra: took this PR, rebased it against current master, and then built the image.

Update: compacted 10 TiB this way, zero issues.

ac1214 and others added 14 commits April 6, 2022 15:59
Signed-off-by: Albert <ac1214@users.noreply.github.com>
Signed-off-by: Albert <ac1214@users.noreply.github.com>
Signed-off-by: Albert <ac1214@users.noreply.github.com>
Signed-off-by: Albert <ac1214@users.noreply.github.com>
Signed-off-by: Albert <ac1214@users.noreply.github.com>
Signed-off-by: Alvin Lin <alvinlin@amazon.com>
Signed-off-by: Alvin Lin <alvinlin@amazon.com>
Signed-off-by: Alvin Lin <alvinlin@amazon.com>
Signed-off-by: Alvin Lin <alvinlin@amazon.com>
…ctor via ring, instead of returning true if shuffle-sharding is enabled

Signed-off-by: Roy Chiang <roychi@amazon.com>
…nt at once, which results in dangling bucket index

Signed-off-by: Roy Chiang <roychi@amazon.com>
… it as plans get generated

Signed-off-by: Roy Chiang <roychi@amazon.com>
Signed-off-by: Roy Chiang <roychi@amazon.com>
Signed-off-by: Roy Chiang <roychi@amazon.com>
Signed-off-by: Roy Chiang <roychi@amazon.com>
roystchiang and others added 2 commits April 6, 2022 18:28
Signed-off-by: Roy Chiang <roychi@amazon.com>
@alanprot (Member) commented Apr 7, 2022

I've been running this for some time and we did not see the problem happen. We could skip the problematic block using #4707.

We could also propose a change in Thanos to skip those blocks, similar to what was introduced in thanos-io/thanos#4469.

@alanprot merged commit 1e229ce into cortexproject:master on Apr 7, 2022