compactor - "write postings: exceeding max size of 64GiB" #1424

Closed
claytono opened this issue Aug 15, 2019 · 35 comments · Fixed by #3409 or #3410

Comments

@claytono
Contributor

claytono commented Aug 15, 2019

Thanos, Prometheus and Golang version used

Prometheus version: 2.11.1

thanos, version 0.6.0 (branch: master, revision: 579671460611d561fcf991182b3f60e91ea04843)
  build user:       root@packages-build-1127f72e-908b-40dc-b9e1-f95dcacaaa62-sck6t
  build date:       20190806-16:13:48
  go version:       go1.12.7

What happened

When running the compactor, we get the following output (sanitized to remove internal hostnames):

2019-08-13T15:13:54-07:00 observability-compactor-123456 user:notice thanos-compact-observability-036[20057]: level=error ts=2019-08-13T22:13:54.054751118Z caller=main.go:199 msg="running command failed" err="compaction failed: compaction failed for group 0@{host=\"observability-prom-A98765\"}: compact blocks [thanos-compact-tDlEs/compact/0@{host=\"observability-prom-A98765\"}/01DHRV2ADD0DGW2V0WB84D472P thanos-compact-tDlEs/compact/0@{host=\"observability-prom-B98765\"}/01DHTN0B5GK4M9K1YF344C599J thanos-compact-tDlEs/compact/0@{host=\"observability-prom-B98765\"}/01DHTYSE4G3CPN6PT73JD4NFPC thanos-compact-tDlEs/compact/0@{host=\"observability-prom-B98765\"}/01DHV863B7P32AR6CM46ZJCJDR thanos-compact-tDlEs/compact/0@{host=\"observability-prom-B98765\"}/01DHY92FF48P4CPQMGPVMEVJE8 thanos-compact-tDlEs/compact/0@{host=\"observability-prom-B98765\"}/01DHYJM1E1F58STJERMF6PWY0M]: 2 errors: write compaction: write postings: write postings: exceeding max size of 64GiB; exceeding max size o
2019-08-13T15:13:54-07:00 observability-compactor-123456 user:notice thanos-compact-observability-036[20057]: f 64GiB"

What you expected to happen

Compaction would succeed, or these blocks would be skipped without error, since they cannot be compacted.

How to reproduce it (as minimally and precisely as possible):

Try to compact blocks where the postings section of the resulting index would exceed 64GiB.

Full logs to relevant components

See the log lines above. Let me know if anything else would be useful, but I think that's all we have that is relevant.

Anything else we need to know

These blocks are from a pair of Prometheus hosts we have scraping all targets in one of our data centers. Here is more information on the blocks themselves:

  • 01DHRV2ADD0DGW2V0WB84D472P
    • Total size: 90.173 GBytes (96822298627 Bytes)
    • Index size: 19.146 GBytes (20558092904 Bytes)
    • Postings (unique label pairs): 289667
    • Postings entries (total label pairs): 1593491890
  • 01DHTN0B5GK4M9K1YF344C599J
    • Total size: 103.220 GBytes (110831397125 Bytes)
    • Index size: 22.640 GBytes (24309975943 Bytes)
    • Postings (unique label pairs): 347394
    • Postings entries (total label pairs): 1914640990
  • 01DHTYSE4G3CPN6PT73JD4NFPC
    • Total size: 102.934 GBytes (110524213458 Bytes)
    • Index size: 23.570 GBytes (25308181103 Bytes)
    • Postings (unique label pairs): 321638
    • Postings entries (total label pairs): 2014402216
  • 01DHV863B7P32AR6CM46ZJCJDR
    • Total size: 90.703 GBytes (97391214724 Bytes)
    • Index size: 19.473 GBytes (20909065106 Bytes)
    • Postings (unique label pairs): 295153
    • Postings entries (total label pairs): 1625440000
  • 01DHY92FF48P4CPQMGPVMEVJE8
    • Total size: 103.066 GBytes (110666647216 Bytes)
    • Index size: 24.510 GBytes (26317160818 Bytes)
    • Postings (unique label pairs): 383836
    • Postings entries (total label pairs): 2092985560
  • 01DHYJM1E1F58STJERMF6PWY0M
    • Total size: 97.399 GBytes (104581383915 Bytes)
    • Index size: 21.636 GBytes (23231951641 Bytes)
    • Postings (unique label pairs): 322795
    • Postings entries (total label pairs): 1818507147

Each of these blocks represents 8 hours of data.
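
As a rough sanity check on why this group trips the limit, here is a small, illustrative Go sketch (not Thanos code) that sums the index sizes listed above: the six inputs alone add up to roughly 131 GiB, about twice the 64 GiB cap, so even with deduplication across blocks the merged index plausibly overflows. Summing input index sizes as an upper bound is also the kind of check the eventual fix in #3410 applies before planning a compaction.

package main

import "fmt"

// Back-of-the-envelope arithmetic on the numbers reported in this issue;
// not an exact prediction of the merged index size.
func main() {
	indexBytes := []int64{
		20558092904, // 01DHRV2ADD0DGW2V0WB84D472P
		24309975943, // 01DHTN0B5GK4M9K1YF344C599J
		25308181103, // 01DHTYSE4G3CPN6PT73JD4NFPC
		20909065106, // 01DHV863B7P32AR6CM46ZJCJDR
		26317160818, // 01DHY92FF48P4CPQMGPVMEVJE8
		23231951641, // 01DHYJM1E1F58STJERMF6PWY0M
	}
	var total int64
	for _, b := range indexBytes {
		total += b
	}
	fmt.Printf("combined input index size: %.1f GiB (limit: 64 GiB)\n", float64(total)/(1<<30))
}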

This appears to be related to prometheus/prometheus#5868. I'm not sure if the limit is going to change in the TSDB library, but in the meantime it would be nice if this were treated as a warning, or if we could specify a threshold to stop the compactor from trying to compact these blocks, since they're already fairly large.

@claytono
Contributor Author

@jojohappy I'm not sure question is the right tag for this. I believe this is a bug.

@jojohappy
Member

@claytono Sorry for the delay. You are right. I wonder how we could check for that error; maybe it is hard to work around. I'm sorry, I'm not very familiar with TSDB.

@claytono
Contributor Author

@jojohappy I think this is a tricky one unless the issue is fixed in the TSDB code. I suspect this doesn't crop up with Prometheus itself since it's less aggressive about compaction. Right now my best guess on how Thanos might handle this would be to have a maximum block size, and if a block is already that big, it won't compact it any further. This might also help with some scenarios using time-based partitioning.

@claytono
Contributor Author

It looks like this is being addressed upstream: prometheus/prometheus#5884

@jojohappy
Member

jojohappy commented Sep 6, 2019

Sorry for the delay! @claytono

I agree with you, but I also have some questions:

how Thanos might handle this would be to have a maximum block size

How can we set the maximum block size? In your case, each index file is less than 30GB, but together they exceeded the max size of 64GB during compaction. Before compaction finishes, we don't know whether the new index will be more than 64GB or not, right? I think it is difficult to find the right size to check against. Maybe I'm wrong.

This might also help with some scenarios using time-based partitioning.

Sorry, I don't quite follow your point. Could you explain it?

It looks like this is being addressed upstream: prometheus/prometheus#5884

Yes, if it is fixed, that will be great.

@claytono
Contributor Author

claytono commented Sep 6, 2019

How can we set the maximum block size? In your case, each index file is less than 30GB, but together they exceeded the max size of 64GB during compaction. Before compaction finishes, we don't know whether the new index will be more than 64GB or not, right? I think it is difficult to find the right size to check against. Maybe I'm wrong.

I don't think we really can. I think the best we could do would be to have a flag that lets us specify a maximum size an index is allowed to be, and past that we don't try to compact it any more.
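
A minimal sketch of what such a flag could look like on the compactor side, assuming the compactor already knows each candidate block's index size; the type names and the threshold are illustrative, not actual Thanos APIs:

package main

import "fmt"

// candidate is an illustrative stand-in for a block the compactor considers;
// it is not the Thanos planner's real metadata type.
type candidate struct {
	ULID       string
	IndexBytes int64
}

// filterByIndexSize drops blocks whose index already exceeds the configured
// limit, so they are never grouped into a compaction that could overflow.
func filterByIndexSize(blocks []candidate, maxIndexBytes int64) []candidate {
	kept := make([]candidate, 0, len(blocks))
	for _, b := range blocks {
		if b.IndexBytes >= maxIndexBytes {
			continue // already "big enough"; leave the block as-is
		}
		kept = append(kept, b)
	}
	return kept
}

func main() {
	blocks := []candidate{
		{ULID: "01DHRV2ADD0DGW2V0WB84D472P", IndexBytes: 20558092904},
		{ULID: "01DHTN0B5GK4M9K1YF344C599J", IndexBytes: 24309975943},
	}
	// With a hypothetical 16 GiB threshold, both blocks above would be skipped.
	fmt.Println(filterByIndexSize(blocks, 16<<30))
}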

This might also help with some scenarios using time-based partitioning.

Sorry, I don't quite follow your point. Could you explain it?

We're planning to use the new time-based partitioning. As part of this we expect we'll want to rebalance the partitions occasionally as compaction and downsampling occur, to keep memory usage roughly equivalent between thanos-store instances. The limiting factor for thanos-store capacity for us is memory usage. That's roughly proportional to index size, so we'd like to ensure we don't end up with thanos stores whose memory is consumed by just a few very large blocks. If a thanos-store process is serving 2 blocks and that consumes all the memory on the host, then our options for rebalancing aren't as good as if we limit index size and the thanos-store process is serving a dozen blocks. Right now I think the best option we have is to limit the max compaction level, but that's a fairly coarse tool. Ideally we'd like to just tell the compactor something like "Don't compact index files that are already x GB, they're big enough".

That said, this isn't really related to the issue I opened; it would just be a nice side effect if this needed to be fixed or worked around in Thanos.

@jojohappy
Member

Awesome! Thank you for your detailed case.

we don't try to compact it any more.

How about downsampling? If we skip compaction, maybe we cannot do downsampling, because of a lack of enough series in the blocks. Further information is here.

Don't compact index files that are already x GB

What's your idea for the x? I think it will be difficult to find the right x.

If a thanos-store process is serving 2 blocks and that consumes all the memory on the host, then our options for rebalancing aren't as good as if we limit index size and the thanos-store process is serving a dozen blocks

I think that is not the goal of compaction. Did you compare the time cost of long-time-range queries between those two approaches?

@bwplotka is now following issue #1471 about reducing store memory usage; you can also follow that if you would like to.

@stale

stale bot commented Jan 11, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jan 11, 2020
@stale stale bot closed this as completed Jan 18, 2020
@Lemmons

Lemmons commented Jan 31, 2020

I'm currently running into this and am curious if there is something I can do to help mitigate the issue. Right now the compactor errors out with this error and restarts every few hours.

@jojohappy jojohappy reopened this Feb 2, 2020
@stale stale bot removed the stale label Feb 2, 2020
@stale

stale bot commented Mar 3, 2020

This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.

@stale stale bot added the stale label Mar 3, 2020
@stale stale bot closed this as completed Mar 10, 2020
@ipstatic
Contributor

We just hit this issue as well. This limit is due to the use of 32-bit integers within Prometheus' TSDB code. See here for a good explanation. There is an open bug report as well as a PR that will hopefully resolve the issue.
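
For intuition, a back-of-the-envelope sketch of where a 64 GiB figure can come from: if a section of the index is addressed by 32-bit references to 16-byte-aligned entries, the largest addressable section is 2^32 × 16 B = 64 GiB. The 16-byte alignment factor here is an assumption about the TSDB index layout, used only to show the arithmetic.

package main

import "fmt"

func main() {
	// Assumed layout: 32-bit references, each addressing a 16-byte-aligned unit.
	const refs = uint64(1) << 32 // number of distinct 32-bit reference values
	const alignment = uint64(16) // assumed bytes per addressable unit
	maxSize := refs * alignment
	fmt.Printf("max addressable section: %d bytes = %d GiB\n", maxSize, maxSize>>30)
}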

@ipstatic
Contributor

Could this be re-opened until this is resolved?

@pracucci pracucci reopened this Apr 20, 2020
@stale stale bot removed the stale label Apr 20, 2020
@bwplotka
Member

Yea, let's solve this.

I have another suggestion: rather than waiting for an upstream fix here, we could set an upper limit on block size and start splitting blocks based on size. This is tracked by #2340.

Especially on large setups, and with some deduplication (vertical or offline) enabled and planned, we will have a problem with huge TB-sized blocks pretty soon. At some point the index does not scale well with the number of series, so we might want to split it...
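
A rough sketch of the sizing side of that idea, assuming the compactor can estimate the output index size from its inputs (for example by summing them); the function and names here are illustrative, not the planner tracked in #2340:

package main

import "fmt"

// splitCount decides how many output blocks a compaction should produce so
// that each estimated output index stays under the cap (ceiling division).
func splitCount(estimatedIndexBytes, maxIndexBytes int64) int {
	if estimatedIndexBytes <= maxIndexBytes {
		return 1
	}
	return int((estimatedIndexBytes + maxIndexBytes - 1) / maxIndexBytes)
}

func main() {
	const capBytes = int64(64) << 30 // the 64 GiB limit from this issue
	estimated := int64(131) << 30    // e.g. the ~131 GiB of input indexes above
	fmt.Println("output blocks needed:", splitCount(estimated, capBytes))
}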

cc @brancz @pracucci @squat

@ipstatic
Contributor

That sounds good to me and was another fear of mine as well. Sure, we could increase the index limit, but we would reach it again some day.

@stale

stale bot commented May 21, 2020

Hello 👋 Looks like there was no activity on this issue for last 30 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity for next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label May 21, 2020
@stale stale bot added the stale label Oct 24, 2020
@ipstatic
Contributor

This is still a big issue for us. We now have several compactor instances turned off due to this limit.

@stale stale bot removed the stale label Oct 24, 2020
@SuperQ
Contributor

SuperQ commented Oct 29, 2020

We're also now running into this (https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11757)

@bwplotka
Member

bwplotka commented Oct 29, 2020 via email

@SuperQ
Contributor

SuperQ commented Nov 1, 2020

As a workaround, would it be possible to add a JSON field to the meta.json so that the compactor can be told to skip the block?

Something like this:

{
  "thanos": {
    "no_compaction": true
  }
}
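
A minimal sketch of how the compactor side of that proposal could look, with field names mirroring the JSON above; this is only an illustration of the idea (Thanos ultimately went with separate no-compact marker files in #3409 rather than a meta.json field):

package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// meta models just the fields this sketch needs from a block's meta.json;
// the "no_compaction" field is the hypothetical flag proposed above.
type meta struct {
	ULID   string `json:"ulid"`
	Thanos struct {
		NoCompaction bool `json:"no_compaction"`
	} `json:"thanos"`
}

// skipCompaction reports whether a block's meta.json asks to be left alone.
func skipCompaction(metaPath string) (bool, error) {
	b, err := os.ReadFile(metaPath)
	if err != nil {
		return false, err
	}
	var m meta
	if err := json.Unmarshal(b, &m); err != nil {
		return false, err
	}
	return m.Thanos.NoCompaction, nil
}

func main() {
	skip, err := skipCompaction("meta.json")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("skip compaction:", skip)
}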

@bwplotka
Member

bwplotka commented Nov 2, 2020

The main problem with this approach is that such a block might be one portion of a bigger block that "has" to be created in order to reach the 2w compaction level. I need to think more about how to make this work, but something like this could be introduced ASAP, yes.

I am working on size-capped block splitting, so compaction would result in two blocks, not one, etc. Having blocks capped at XXX GB is not only nice for this issue but also caps how much you need to download for manual interactions, potential deletions, downsampling, and other operations.

Just curious, @SuperQ: can you tell us what the block sizes are right now? (I'm mostly interested in the index sizes.) It probably takes more than one block to get 64GB+ of postings.

@bwplotka
Member

bwplotka commented Nov 2, 2020

I should have some workaround today.

@SuperQ
Contributor

SuperQ commented Nov 2, 2020

Here's some metadata on what's failing:

  8.96 GiB  2020-09-20T17:41:23Z  gs://gitlab-gprd-prometheus/01EJP50AZ3P1EN2NWMFPMC5BA7/index
  3.78 GiB  2020-09-21T05:56:53Z  gs://gitlab-gprd-prometheus/01EJQHMW5K9VP7FWW7VD3M9FQB/index
  8.19 GiB  2020-09-23T06:57:05Z  gs://gitlab-gprd-prometheus/01EJWQJQ0M2F33JV47CEFKFK5N/index
 14.08 GiB  2020-09-25T08:00:39Z  gs://gitlab-gprd-prometheus/01EK1XNFBHZ0MB86HBEZ9NFSVV/index
 12.72 GiB  2020-09-27T07:21:45Z  gs://gitlab-gprd-prometheus/01EK711SSHZHPDH5V274M8ESAY/index
 11.35 GiB  2020-09-29T07:32:43Z  gs://gitlab-gprd-prometheus/01EKC6VXQ4A3ZANP0E30KY5AWM/index
 13.31 GiB  2020-10-23T10:03:34Z  gs://gitlab-gprd-prometheus/01ENA7V7YW31BVM1QFRJ57TDVX/index

And an example meta.json

{
	"ulid": "01EKC6VXQ4A3ZANP0E30KY5AWM",
	"minTime": 1601164800000,
	"maxTime": 1601337600000,
	"stats": {
		"numSamples": 24454977719,
		"numSeries": 49668912,
		"numChunks": 229130151
	},
	"compaction": {
		"level": 3,
		"sources": [
			"01EK6R1EW9Y76QPSHE2S2GZ1SC",
			"01EK6YX646B7RPN4FFFCDR62KP",
			"01EK75RXC6HZBKGRB4N8APP638",
			"01EK7CMMM611CJ4TJD04BBG6MM",
			"01EK7KGBWNC515RFKBDMX4HJSP",
			"01EK7TC3463VCS3Z4CXZMEET0X",
			"01EK817TC6KVQ59R73D48BRAC9",
			"01EK883HM7X7V1SFCH8NZY6FW4",
			"01EK8EZ8W5N2JK5ZZY2V4058YG",
			"01EK8NV048MXK605DPJ710DMKS",
			"01EK8WPQC7ZY51T32WHFMBNPKS",
			"01EK93JEN36YBXK58KMTWMPJCK",
			"01EK9AE5X3WR35X0FGP6Y1A0PR",
			"01EK9H9X586N62EGE972PWPT95",
			"01EK9R5MD2B11YRBPE62ETQJX8",
			"01EK9Z1BN8DFE5H765E8JTDEJC",
			"01EKA5X2XE7959NND514559190",
			"01EKACRT59F6KDA9WJ9GP3NZSW",
			"01EKAKMHD7VA5WXK10S8EH4274",
			"01EKATG8N9Q6EYRR3MYHSE3H10",
			"01EKB1BZX6SZRVKMMG83R1SZ2J",
			"01EKB87Q5DHDGVK0MGS5BE8PFH",
			"01EKBF3ED853GQ0TBWT5YTCTGS",
			"01EKBNZ5NNE8GACZDW32RTBWR0"
		],
		"parents": [
			{
				"ulid": "01EK89TVSX9ZZAET1FYXC1WHH6",
				"minTime": 1601164800000,
				"maxTime": 1601193600000
			},
			{
				"ulid": "01EK8HESS4GD6YS55ASCCXGA2N",
				"minTime": 1601193600000,
				"maxTime": 1601222400000
			},
			{
				"ulid": "01EK9CXXF25T12TBV0XRGRXD63",
				"minTime": 1601222400000,
				"maxTime": 1601251200000
			},
			{
				"ulid": "01EKA8B4MF78WYY3FM115Q4C6T",
				"minTime": 1601251200000,
				"maxTime": 1601280000000
			},
			{
				"ulid": "01EKB3XJWV3FMJXSMQ493JV6WN",
				"minTime": 1601280000000,
				"maxTime": 1601308800000
			},
			{
				"ulid": "01EKBZBX8MCBM2T05Y739KDRGV",
				"minTime": 1601308800000,
				"maxTime": 1601337600000
			}
		]
	},
	"version": 1,
	"thanos": {
		"labels": {
			"cluster": "gprd-gitlab-gke",
			"env": "gprd",
			"environment": "gprd",
			"monitor": "default",
			"prometheus": "monitoring/gitlab-monitoring-promethe-prometheus",
			"prometheus_replica": "prometheus-gitlab-monitoring-promethe-prometheus-0",
			"provider": "gcp",
			"region": "us-east1"
		},
		"downsample": {
			"resolution": 0
		},
		"source": "compactor"
	}
}

Due to the interaction of the operator and Helm, we have an excess of overly long labels.

bwplotka added a commit that referenced this issue Nov 2, 2020
…k size).

Related to #1424 and #3068

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
bwplotka added a commit that referenced this issue Nov 4, 2020
The planner algo was adapted to avoid unnecessary changes to
blocks caused by excluded blocks, so we can quickly switch to
different planning logic in next iteration.

Fixes: #1424

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
bwplotka added a commit that referenced this issue Nov 4, 2020
…e over 64GB.

Fixes: #1424

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
@bwplotka
Member

bwplotka commented Nov 5, 2020

Update: the code to fix this issue is done and in review.

bwplotka added a commit that referenced this issue Nov 6, 2020
* compact: Added support for no-compact markers in planner.

The planner algo was adapted to avoid unnecessary changes to
blocks caused by excluded blocks, so we can quickly switch to
different planning logic in next iteration.

Fixes: #1424

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>

* Addressed comments.

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
@bwplotka
Member

bwplotka commented Nov 6, 2020

You can manually exclude blocks from compaction, but the PR for the automatic flow is still in review: #3410 will close this issue once merged.
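
For reference, the manual exclusion mentioned here works by placing a no-compact marker next to the block in object storage, which is what #3409 taught the planner to honor. The sketch below only illustrates the shape of such a marker; the exact file name and field names are assumptions, so check the marker format shipped with #3409 before relying on them.

package main

import (
	"encoding/json"
	"fmt"
	"os"
	"time"
)

// noCompactMark is an illustrative marker payload; the real schema comes from
// Thanos PR #3409 and may differ from these hypothetical field names.
type noCompactMark struct {
	ID            string `json:"id"`
	Version       int    `json:"version"`
	NoCompactTime int64  `json:"no_compact_time"`
	Reason        string `json:"reason"`
	Details       string `json:"details"`
}

func main() {
	m := noCompactMark{
		ID:            "01EKC6VXQ4A3ZANP0E30KY5AWM", // example block ULID from the listing above
		Version:       1,
		NoCompactTime: time.Now().Unix(),
		Reason:        "manual",
		Details:       "compacting this group would exceed the 64GiB index limit",
	}
	b, err := json.MarshalIndent(m, "", "  ")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// In practice the marker file is uploaded next to the block in the bucket
	// (e.g. <ULID>/no-compact-mark.json); here we only print the payload.
	fmt.Println(string(b))
}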

@bwplotka bwplotka reopened this Nov 6, 2020
bwplotka added a commit that referenced this issue Nov 6, 2020
…e over 64GB.

Fixes: #1424

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
bwplotka added a commit that referenced this issue Nov 9, 2020
…e over 64GB.

Fixes: #1424

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
bwplotka added a commit that referenced this issue Nov 9, 2020
…e over 64GB. (#3410)

* compact: Added index size limiting planner detecting output index size over 64GB.

Fixes: #1424

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>

* Addressed comments; added changelog.

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>

* Skipped flaky test.

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
Oghenebrume50 pushed a commit to Oghenebrume50/thanos that referenced this issue Dec 7, 2020
…3409)

* compact: Added support for no-compact markers in planner.

The planner algo was adapted to avoid unnecessary changes to
blocks caused by excluded blocks, so we can quickly switch to
different planning logic in next iteration.

Fixes: thanos-io#1424

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>

* Addressed comments.

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
Signed-off-by: Oghenebrume50 <raphlbrume@gmail.com>
Oghenebrume50 pushed a commit to Oghenebrume50/thanos that referenced this issue Dec 7, 2020
…e over 64GB. (thanos-io#3410)

* compact: Added index size limiting planner detecting output index size over 64GB.

Fixes: thanos-io#1424

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>

* Addressed comments; added changelog.

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>

* Skipped flaky test.

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
Signed-off-by: Oghenebrume50 <raphlbrume@gmail.com>