
sidecar: Greatly increased Thanos sidecar memory usage from 0.32.2 to 0.32.3, still exists in 0.35.0 #7395

Open
mkrull opened this issue May 28, 2024 · 4 comments

Comments

@mkrull commented May 28, 2024

Thanos, Prometheus and Golang version used:

thanos, version 0.32.3 (branch: HEAD, revision: 3d98d7ce7a254b893e4c8ee8122f7f6edd3174bd)
  build user:       root@0b3c549e9dae
  build date:       20230920-07:27:32
  go version:       go1.20.8
  platform:         linux/amd64
  tags:             netgo

Object Storage Provider:

AWS S3

What happened:

After upgrading from 0.31.0 to 0.35.0 we saw greatly increased sidecar memory usage and narrowed it down to a change between 0.32.2 and 0.32.3 (the Prometheus update maybe?).

Memory usage shoots up for certain queries, in our case most likely the recording rules evaluated by the ruler, so we observed constantly high usage.

What you expected to happen:

No significant change in memory usage.

How to reproduce it (as minimally and precisely as possible):

Run the query {job=~".+"} against a Prometheus instance with some metrics, on either sidecar version, and compare memory usage (see the sketch below).
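
A minimal sketch of that reproduction step, assuming a Thanos Querier (or Prometheus) HTTP endpoint at http://localhost:9090; the address is an assumption about the deployment, and the client library is used here only for illustration. Run it while watching the sidecar container's memory.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Endpoint is an assumption; point it at the Querier that fans out to the sidecar.
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
	if err != nil {
		log.Fatal(err)
	}
	promAPI := v1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()

	// The broad matcher from the report; it selects every series.
	result, warnings, err := promAPI.Query(ctx, `{job=~".+"}`, time.Now())
	if err != nil {
		log.Fatal(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println("result type:", result.Type())
}
```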

Full logs to relevant components:

Anything else we need to know:

Heap profiles for 0.32.2 and 0.32.3 with the same query on the same Prometheus node:

[Heap profile screenshot: thanos-0.32.2-heap]

[Heap profile screenshot: thanos-0.32.3-heap]
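
For context, heap profiles like the ones above can be pulled from the sidecar's HTTP port, which exposes the standard Go pprof endpoints; the sketch below assumes the default port 10902, and the output file name is illustrative. Inspect the result with `go tool pprof thanos-sidecar-heap.pb.gz`.

```go
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	// Address is an assumption about the deployment.
	resp, err := http.Get("http://localhost:10902/debug/pprof/heap")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("thanos-sidecar-heap.pb.gz")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	// Save the raw profile for later analysis with `go tool pprof`.
	if _, err := io.Copy(out, resp.Body); err != nil {
		log.Fatal(err)
	}
	log.Println("heap profile written to thanos-sidecar-heap.pb.gz")
}
```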

@mkrull (Author) commented May 28, 2024

This comment probably refers to the same issue: #6744 (comment)

@GiedriusS (Member) commented
I think it's a consequence of #6706. We had to fix a correctness bug, and as a consequence responses now need to be sorted in memory before being sent off. Unfortunately, Prometheus sometimes produces an unsorted response, and that needs to be fixed upstream, or the external labels functionality has to be completely reworked. See prometheus/prometheus#12605
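
To make the trade-off concrete, here is a minimal, hypothetical sketch (not Thanos's actual code from #6706) of buffering series and re-sorting them by labels before sending. Holding the whole response in memory instead of streaming it is what drives the extra heap usage for broad queries such as {job=~".+"}.

```go
package main

import (
	"fmt"
	"sort"

	"github.com/prometheus/prometheus/model/labels"
)

// series stands in for a Store API series frame; chunks are omitted for brevity.
type series struct {
	lset labels.Labels
}

// sortAndSend buffers every series in memory, sorts by label set, then "sends".
// The buffer holds the full response, so memory grows with the number of
// matched series rather than staying constant as it would when streaming.
func sortAndSend(in []series, send func(series)) {
	buf := make([]series, 0, len(in))
	buf = append(buf, in...)

	sort.Slice(buf, func(i, j int) bool {
		return labels.Compare(buf[i].lset, buf[j].lset) < 0
	})

	for _, s := range buf {
		send(s)
	}
}

func main() {
	in := []series{
		{lset: labels.FromStrings("__name__", "up", "job", "node")},
		{lset: labels.FromStrings("__name__", "up", "job", "api")},
	}
	sortAndSend(in, func(s series) { fmt.Println(s.lset.String()) })
}
```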

@mkrull (Author) commented May 28, 2024

Ouch, I see. Upgrading in environments like Kubernetes now comes with a considerable new risk of OOMs for pods running Prometheus with the Thanos sidecar, because it becomes really hard to estimate the maximum memory requirements for the sidecar containers 🤔

@mazad01 commented Aug 5, 2024

Still happening in 0.36.0.

Substantially higher memory usage after going from 0.28.1 -> 0.36.0

[Memory usage screenshot]
