Thanos Receive can create overlapping blocks #5461

Open
fpetkovski opened this issue Jul 2, 2022 · 5 comments

Comments

@fpetkovski
Contributor

Thanos Receive can fail head compaction due to overlapping blocks. We have seen this happen in production, where it caused a Receive instance to completely halt compaction and accumulate memory indefinitely.

The root cause could be a race condition bug in TSDB itself: prometheus/prometheus#8055

It is possible to allow overlapping blocks in TSDB, and maybe this should be enabled by default since Thanos Query can deal with this situation. The option for enabling vertical blocks was added here: #3792

Cortex seems to solve the issue by preventing ingestion during head compaction: cortexproject/cortex#3422
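The Cortex approach can be illustrated with a small concurrency sketch: appends and head compaction take the same lock, so no sample can land in the middle of a compaction. This is a hypothetical Python illustration of the idea described in cortexproject/cortex#3422, not Cortex's actual code (the `Head`, `append`, and `compact` names are made up):

```python
import threading

class Head:
    """Toy ingester: appends are blocked while head compaction runs."""

    def __init__(self):
        self._mtx = threading.Lock()
        self.samples = []   # in-memory head
        self.blocks = []    # compacted blocks

    def append(self, ts, value):
        # Taking the same lock as compact() means an append lands either
        # fully before or fully after a compaction, never during it.
        with self._mtx:
            self.samples.append((ts, value))

    def compact(self):
        with self._mtx:
            # Cut the in-memory head into a block atomically; no sample
            # can slip in between reading and truncating the head.
            self.blocks.append(list(self.samples))
            self.samples.clear()

head = Head()
head.append(1, 0.5)
head.append(2, 0.7)
head.compact()
head.append(3, 0.9)
print(len(head.blocks), head.samples)
```

With this scheme a compaction always cuts a consistent snapshot of the head, at the cost of briefly blocking ingestion while the block is cut.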

Thanos version used: v0.26

What happened:
Receiver failed to compact the head. The graph below shows that memory accumulation started at 01:00:00, right after compaction was triggered.

[image: memory usage graph showing accumulation starting at 01:00:00]

What you expected to happen:
Receiver successfully compacts the head and frees up memory.

How to reproduce it (as minimally and precisely as possible):
Haven't been able to reproduce it locally yet.

Full logs of relevant components:

log: level=error ts=2022-07-02T10:33:04.167671636Z caller=db.go:824 component=receive component=multi-tsdb tenant=[redacted] msg="compaction failed" err="compact head: reloadBlocks blocks: invalid block sequence: block time ranges overlap: [mint: 1656689707030, maxt: 1656689715305, range: 8s, blocks: 2]: <ulid: 01G6X6DTHR6A73K6M3HR6DYRSQ, mint: 1656688461617, maxt: 1656689715305, range: 20m53s>, <ulid: 01G6Z7GZ7P6T8YVFAGAJJX2PQ6, mint: 1656689707030, maxt: 1656691200000, range: 24m52s>"
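The reported overlap can be checked directly from the block time ranges in the log. A minimal Python sketch, using the mint/maxt values (Unix milliseconds) copied from the error message above:

```python
# Block time ranges taken from the "compaction failed" log line.
block_a = (1656688461617, 1656689715305)  # ulid 01G6X6DTHR6A73K6M3HR6DYRSQ
block_b = (1656689707030, 1656691200000)  # ulid 01G6Z7GZ7P6T8YVFAGAJJX2PQ6

# Two blocks overlap when the later mint precedes the earlier maxt.
overlap_mint = max(block_a[0], block_b[0])
overlap_maxt = min(block_a[1], block_b[1])
overlap_ms = overlap_maxt - overlap_mint

print(overlap_mint, overlap_maxt, overlap_ms)  # 8275 ms ≈ 8s, matching the log
```

The 8275 ms overlap window is exactly the "[mint: 1656689707030, maxt: 1656689715305, range: 8s]" interval reported by TSDB's block-sequence validation.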

@stale

stale bot commented Sep 21, 2022

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Sep 21, 2022
@fpetkovski fpetkovski removed the stale label Sep 21, 2022
@Ygshvr

Ygshvr commented Dec 15, 2022

@fpetkovski are you seeing an increase in persistent volume utilization as well for this particular Receiver instance/pod?

@fpetkovski
Contributor Author

Unfortunately I don't have the data anymore, and we resolved the problem by enabling overlapping blocks.

@ahurtaud
Contributor

It looks like overlapping blocks in Receive are enabled by default with the Prometheus 2.39+ dependencies:
561a113#diff-3d45f9beb5b670e94fe2715e21477a3fb8f5db7bf97f01cddf6e108bb98761a6

I think it would be worth cleaning up the flag now.

BTW, I am experiencing some compaction issues with blocks uploaded to buckets. Is it possible for Receive to upload one block, then compact an overlapping block and re-upload it to the store, so that Compactor then faces two overlapping blocks (one carrying the from-out-of-order hint in its meta.json)?

And if so, what can we do about this on the Compactor side? Is it a new issue I should open?

ts=2023-07-19T07:51:35.265651075Z caller=compact.go:487 level=error msg="critical error detected; halting" err="compaction: group 0@1932730964536117780: pre compaction overlap check: overlaps found while gathering blocks. [mint: 1689645600000, maxt: 1689652800000, range: 2h0m0s, blocks: 2]: <ulid: 01H5KNX6BW1XRZD5R23JCN2AAA, mint: 1689645600000, maxt: 1689652800000, range: 2h0m0s>, <ulid: 01H5KNXJ70R3PPQP079N2NAPWK, mint: 1689645600000, maxt: 1689652800000, range: 2h0m0s>\n[mint: 1689609600005, maxt: 1689616800000, range: 1h59m59s, blocks: 2]: <ulid: 01H5JCQ84D2EDD2CECD24CP481, mint: 1689609600000, maxt: 1689616800000, range: 2h0m0s>, <ulid: 01H5JKJHZFQ3E698G008GQ3GZC, mint: 1689609600005, maxt: 1689616800000, range: 1h59m59s>"

thank you,
Alban

@ahurtaud
Contributor


Hmm, it looks like I have to skip compaction for blocks with out-of-order chunks using the flag introduced in #3442:

skip-block-with-out-of-order-chunks
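For reference, this is passed to the Compactor as a CLI flag. A hedged sketch of the invocation (the data dir and bucket config file paths here are placeholders, not values from this thread):

```shell
thanos compact \
  --data-dir=/var/thanos/compact \
  --objstore.config-file=bucket.yml \
  --compact.skip-block-with-out-of-order-chunks
```

As I understand it, with this flag set the Compactor marks blocks containing out-of-order chunks for no-compaction and continues, instead of halting as in the log above.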
