Query: possible goroutine leak seen between queriers on v0.35.1 #7553

Closed

jdgeisler opened this issue Jul 16, 2024 · 5 comments

@jdgeisler
Thanos, Prometheus and Golang version used:
Thanos version: 0.35.1
Prometheus version: 2.48.1

Object Storage Provider:
AWS S3

What happened:

We are running Thanos with query chaining between TLS and non-TLS endpoints.

On Thanos v0.35.1, we are seeing what looks like a goroutine leak between our two Thanos querier deployments. This behavior was not seen until we implemented query chaining on 6/18; from then on, a steady rise with a sawtooth pattern emerges in the goroutine count. Alongside this, we have seen an increase in query timeouts and latency.

[Screenshot: goroutine count climbing in a sawtooth pattern after query chaining was enabled]

This was reproduced in our dev environment; after downgrading to Thanos v0.35.0 (with the exact same query-chaining configuration), goroutines return to normal.

[Screenshot: goroutine count returning to normal after downgrading to v0.35.0]
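As an aside, a minimal sketch of the kind of test assertion that catches leaks like this, using the go.uber.org/goleak package (the package name and test body below are hypothetical placeholders):

```go
package querychain_test

import (
	"testing"

	"go.uber.org/goleak"
)

// TestNoLeakedGoroutines fails if any goroutines started during the test
// are still running when it returns; a leak like the one described above
// would show up as a stack blocked inside the response-set code.
func TestNoLeakedGoroutines(t *testing.T) {
	defer goleak.VerifyNone(t)

	// ... exercise a query through the chained queriers here (hypothetical) ...
}
```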

Possibly related: when testing queries on v0.35.0, I see that certain queries hit our --store.response-timeout of 10s between the non-TLS querier and the TLS querier. This results in the following error when the query takes longer than 10s:

receive series from Addr: thanos-mscprod-query-tls.thanos.svc.cluster.local:10901 
MinTime: 1692727200000 MaxTime: 9223372036854775807: rpc error: code = Canceled desc = context canceled

Interestingly, on Thanos v0.35.1, instead of failing after 10s when the --store.response-timeout is hit, the queries hang indefinitely until we hit our response_header_timeout in the query-frontend. This coincides with the buildup of goroutines shown earlier.
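For context, here is a minimal, self-contained sketch (not the actual Thanos implementation; the names below are illustrative) of how a per-response timeout like --store.response-timeout is typically enforced around a streaming Recv(): each receive races against a timer, and the stream's context is canceled when the timer fires, which is the kind of path that produces the "context canceled" error above instead of an indefinite hang.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// seriesStream stands in for a gRPC storepb.Store_SeriesClient-style stream
// (illustrative interface only).
type seriesStream interface {
	Recv() (string, error)
}

// recvWithTimeout races a single Recv() against a per-response timeout,
// mirroring the intent of --store.response-timeout: if no frame arrives in
// time, the stream's context is canceled and the caller gets an error
// instead of blocking forever.
func recvWithTimeout(ctx context.Context, cancel context.CancelFunc, s seriesStream, timeout time.Duration) (string, error) {
	type result struct {
		msg string
		err error
	}
	ch := make(chan result, 1) // buffered so the receiving goroutine can always exit

	go func() {
		msg, err := s.Recv()
		ch <- result{msg, err}
	}()

	select {
	case r := <-ch:
		return r.msg, r.err
	case <-time.After(timeout):
		cancel() // cancel the stream context; the pending Recv() should then return
		return "", errors.New("response timeout exceeded")
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

// slowStream simulates an upstream querier that takes too long to respond.
type slowStream struct{ delay time.Duration }

func (s slowStream) Recv() (string, error) {
	time.Sleep(s.delay)
	return "series frame", nil
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	_, err := recvWithTimeout(ctx, cancel, slowStream{delay: 2 * time.Second}, 500*time.Millisecond)
	fmt.Println("recv result:", err) // prints the timeout error after ~500ms
}
```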

What you expected to happen:

We expected to be able to use query chaining without the large, steady buildup of goroutines. We are also wondering what causes the --store.response-timeout to be hit (or not hit) in this situation.

How to reproduce it (as minimally and precisely as possible):

Thanos querier config with the sharded store endpoints and the Thanos TLS querier added:

- args:
        - query
        - --query.metadata.default-time-range=24h
        - --query.max-concurrent-select=10
        - --query.auto-downsampling
        - --grpc-compression=snappy
        - --query.max-concurrent=80
        - --store.response-timeout=10000ms
        - --endpoint-strict=thanos-store-short.thanos.svc.cluster.local:10901
        - --endpoint-strict=thanos-store-mid.thanos.svc.cluster.local:10901
        - --endpoint-strict=thanos-store-long.thanos.svc.cluster.local:10901
        - --endpoint-strict=thanos-query-tls.thanos.svc.cluster.local:10901

Thanos TLS querier with the Prometheus sidecar endpoints in different clusters:

- args:
        - query
        - --grpc-client-tls-secure
        - --query.metadata.default-time-range=24h
        - --query.max-concurrent-select=10
        - --query.auto-downsampling
        - --grpc-compression=snappy
        - --query.max-concurrent=80
        - --store.response-timeout=10000ms
        - --endpoint-strict=prometheus-1:443
        - --endpoint-strict=prometheus-2:443
        - --endpoint-strict=...

With the above configuration on Thanos v0.35.1, we see the described goroutine behavior, and queries hang even though the --store.response-timeout should be hit. On v0.35.0, goroutines return to normal and the same query times out at 10s. We see this mainly on larger clusters, against our largest Prometheus, which has around 30 million active series.

@jdgeisler (Author)

I have taken some goroutine dumps from our queriers.

goroutine dump from one of our thanos-querier pods

goroutine dump from one of our thanos-querier-tls pods
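For anyone who wants to capture the same data, a minimal sketch of pulling a full goroutine dump from the querier's Go pprof endpoint (assuming the HTTP port is reachable at localhost:10902, the Thanos default, for example via a port-forward):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

func main() {
	// debug=2 returns full, human-readable stacks for every goroutine.
	// Port 10902 is the default Thanos HTTP port; adjust if overridden.
	resp, err := http.Get("http://localhost:10902/debug/pprof/goroutine?debug=2")
	if err != nil {
		fmt.Fprintln(os.Stderr, "fetching goroutine dump:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	out, err := os.Create("goroutine-dump.txt")
	if err != nil {
		fmt.Fprintln(os.Stderr, "creating output file:", err)
		os.Exit(1)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		fmt.Fprintln(os.Stderr, "writing dump:", err)
		os.Exit(1)
	}
	fmt.Println("wrote goroutine-dump.txt")
}
```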

We also see a correlation between memory usage and goroutines, shown in these dashboards:

[Screenshots: querier memory usage and goroutine count dashboards]

@wiardvanrij (Member)

Maybe related to #6948?

cc @thibaultmg, any chance there could be something missing when chaining queriers?

@thibaultmg (Contributor)

Hey, indeed I made some changes to the eagerRespSet that seem to be the culprit here. IIRC, I removed the context check there because it caused a race condition with the serverAsClient (the implementation of the storepb.Store_SeriesClient interface), in which the eagerRespSet or lazyRespSet would stop calling Recv() too early, leaving a dangling goroutine.

I would suspect bad context management.

Looking quickly at the code, blockSeriesClient seems to be used by the querier, and in Recv() the context is not checked directly, but it seems to be used in the nextBatch function. There might be a code path where a done context is silently ignored 🤷. I need to take a deeper look at it. This is just a hypothesis.
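To make the failure mode concrete, here is a simplified sketch (not the actual eagerRespSet code; the types below are illustrative): if the producing goroutine never checks the stream's context and the consumer stops calling Recv() early, the producer blocks on its channel send forever.

```go
package main

import (
	"context"
	"fmt"
	"runtime"
	"time"
)

// respSet is a simplified stand-in for an eagerRespSet/lazyRespSet-style
// buffer: a producer goroutine pushes responses into a channel that the
// consumer drains via Recv().
type respSet struct {
	ch chan int
}

// newLeakyRespSet starts a producer that deliberately ignores ctx.Done(),
// mimicking the suspected bug: if the consumer stops calling Recv() before
// the producer finishes, the send blocks forever and the goroutine leaks.
func newLeakyRespSet(ctx context.Context, n int) *respSet {
	_ = ctx                           // context intentionally unchecked
	r := &respSet{ch: make(chan int)} // unbuffered: each send waits for a Recv()
	go func() {
		for i := 0; i < n; i++ {
			r.ch <- i // no select on ctx.Done() -> dangling goroutine on early exit
		}
		close(r.ch)
	}()
	return r
}

func (r *respSet) Recv() (int, bool) {
	v, ok := <-r.ch
	return v, ok
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())

	r := newLeakyRespSet(ctx, 100)
	r.Recv() // consumer reads one response...
	cancel() // ...then gives up, e.g. the query was canceled upstream

	time.Sleep(100 * time.Millisecond)
	// The producer is still blocked on its channel send: one leaked goroutine.
	fmt.Println("goroutines:", runtime.NumGoroutine())
}
```

The usual fix is for the producer to select on both the channel send and ctx.Done(), or for the consumer to guarantee it always drains the set before giving up.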

@jdgeisler (Author)

We are now seeing this issue resolved after the following fix, released in v0.36.1: #7618

@MichaHoffmann (Contributor)

> We are now seeing this issue resolved after the following fix released in v0.36.1

Awesome, I'll close the issue then!
