
[Store gateway] Highly increased latency on "get_range" bucket operations on 0.30.2 #6540

Closed
thomas-maurice opened this issue Jul 20, 2023 · 22 comments

Comments

@thomas-maurice

Thanos, Prometheus and Golang version used:

  • Thanos 0.30.2
  • Prometheus 2.44.0

Object Storage Provider:

Amazon S3

What happened:

We recently upgraded Thanos from 0.28.1 to 0.30.2 (we didn't upgrade to the newer 0.31.X version because of the deduplication bug #6257). After the upgrade we have seen a dramatic increase in latency for get_range operations on the store gateway. We didn't spot it initially because the environments we first tested in handle very little data, but after upgrading our bigger clusters we noticed the p99 going from a few hundred milliseconds to sometimes over 2 or even 5 seconds, depending on the environment, as shown on the graph below.

[Screenshot: get_range latency before and after the upgrade]

Our storegateway pods are configured as follows:

    Args:
      store
      --log.level=info
      --log.format=logfmt
      --grpc-address=0.0.0.0:10901
      --http-address=0.0.0.0:10902
      --data-dir=/data
      --objstore.config-file=/conf/objstore.yml
      --max-time=-7d
      --min-time=-1y

What you expected to happen:

No significant change in the global bucket operations latencies

How to reproduce it (as minimally and precisely as possible):

Not sure how to reproduce; in our case it was as simple as upgrading the running container from 0.28.1 to 0.30.2. We have noticed that this behaviour appears on our two busiest clusters.

Full logs to relevant components:

Anything else we need to know:

We are running on Amazon EKS

@fpetkovski
Contributor

Do you have any way to check request latencies in AWS S3 directly? I wonder if the way they are measured has changed between versions. Also has the number of requests gone up after the upgrade?
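One way to check the second question from the store gateway's own metrics is to compare the get_range request rate and p99 latency across the upgrade. A minimal sketch, assuming the default thanos_objstore_bucket_operation_* metrics and a job label of "thanos-store" (adjust the selectors to your setup):

      # get_range requests per second issued against object storage
      sum(rate(thanos_objstore_bucket_operations_total{job="thanos-store", operation="get_range"}[5m]))

      # p99 latency of get_range operations as measured by the store gateway
      histogram_quantile(0.99, sum by (le) (
        rate(thanos_objstore_bucket_operation_duration_seconds_bucket{job="thanos-store", operation="get_range"}[5m])
      ))

If the request rate jumps after the upgrade while object-storage-side latency stays flat, that would point at the store gateway issuing more (or smaller) range requests per query rather than S3 itself getting slower.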

@thomas-maurice
Author

I couldn't find any S3-side latency metrics in AWS. We seem to be observing this only on 2 of our 6 environments, and these are the environments that are the most loaded in terms of metrics/queries. I don't think this has to do with a change in how measurements are done, because I've seen the impact on systems like Grafana after the upgrade.

@kiyanabah

After upgrading to Thanos 0.30.2, we have noticed an increase in sent chunk size. I was wondering whether this might be related to the increased latency on get_range bucket operations, which impacts the operation's performance.
[Screenshot attached]

@mjimeneznet

We've downgraded Thanos back to 0.28.1 and everything went back to previous values. For example:

[Screenshots attached for:]

  • Request duration between Query and StoreGateway
  • Duration of bucket operations in StoreGateway
  • Merge duration in StoreGateway
  • Memory usage in StoreGateway
  • CPU usage in Thanos Query
  • Memory usage in Thanos Query
  • Chunk size

@douglascamata
Contributor

@thomas-maurice could you try 0.29.1 so that we can look at the metrics and try to narrow down the exact release that changed this behavior?

@mjimeneznet

Hello! We have upgraded one of the affected environments to 0.29.0 and, looking at the graphs, I can see that it is behaving similarly to 0.28.1: the performance is good, the timing is good, and everything is working as expected.
I'm sharing the same screenshots (the red arrow is the rollout):

[Screenshots attached, same panels as above: request duration between Query and StoreGateway, duration of bucket operations in StoreGateway, merge duration in StoreGateway, memory usage in StoreGateway, CPU usage in Thanos Query, memory usage in Thanos Query, and chunk size.]

@douglascamata
Contributor

@mjimeneznet would you be kind enough to try 0.30.1, please?

@kiyanabah

kiyanabah commented Aug 2, 2023

> @mjimeneznet would you be kind enough to try 0.30.1, please?

Hi @douglascamata, we've performed the upgrade in one environment, moving the Thanos image in both the Prometheus Operator and the Thanos components from version 0.29.0 to 0.30.1. Additionally, we've updated the Helm chart to version 11.6.8.

Following these upgrades, we've observed a noticeable increase in chunk size and in the number of get_range bucket operations. This increase has triggered the ThanosStoreObjstoreOperationLatencyHigh alert.
[Screenshots attached: store gateway metrics, memory usage, CPU usage, and bucket operations.]

@douglascamata
Contributor

douglascamata commented Aug 2, 2023

Ok, so as a summary:

  • 0.28.1 = good
  • 0.29 = good
  • 0.30.1 = bad
  • 0.30.2 = bad

Now, onto some investigation: the only notable change to my eyes from 0.29 to 0.30.1 in terms of Thanos Store GW is #5837.

I would recommend trying to bump the hidden CLI flag --debug.series-batch-size on the Store GW. The default is 10_000 (10k), so maybe try 100_000 (100k) and see how it changes latency, resource usage, and API calls to object storage.
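For reference, a minimal sketch of how this could look in the store gateway args from the original report (the 100k value is just a starting point to experiment with, not a recommendation):

    Args:
      store
      --log.level=info
      --log.format=logfmt
      --grpc-address=0.0.0.0:10901
      --http-address=0.0.0.0:10902
      --data-dir=/data
      --objstore.config-file=/conf/objstore.yml
      --max-time=-7d
      --min-time=-1y
      --debug.series-batch-size=100000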

@douglascamata
Contributor

Kind reminder: do not run 0.30.1 for long on a production environment. You want 0.30.2 there asap to have a very important fix from #6086. Thanks a lot for testing it though.

@kiyanabah

> Kind reminder: do not run 0.30.1 for long on a production environment. You want 0.30.2 there asap to have a very important fix from #6086. Thanks a lot for testing it though.

I appreciate your attention to the issue. We did downgrade to 0.29.0 as it is more stable for us for now. Thank you 😊

@thomas-maurice
Author

@douglascamata we tried 0.30.2 with --debug.series-batch-size 100000 and it didn't make any notable difference; we were still observing a huge number of requests and high latency.

We also saw a big increase in data fetches for some reason, as well as a big spike in memory (see attached screenshots).

@fpetkovski
Contributor

@thomas-maurice could you post the contents of the /conf/objstore.yml file?

@thomas-maurice
Author

@fpetkovski it is pretty stock; we didn't customise it:

---
type: s3
config:
  bucket: [BUCKET NAME]
  endpoint: [S3 ENDPOINT]
  aws_sdk_auth: true

And that's it, so I am assuming the storegateway is working with the default config values

@thomas-maurice
Author

@fpetkovski any ideas of things we could try to troubleshoot this further?

@fpetkovski
Contributor

The only thing that comes to mind is to try out 0.32 once we release it. We are waiting on #6317 before we can cut a new release, but there have been many changes since 0.30.2 and the issue might have been addressed already.

@thomas-maurice
Author

Okay :) Will do !

@lasermoth

@thomas-maurice with v0.32.0 / v0.32.2, I was wondering if you have had a chance to test this yet?

@thomas-maurice
Author

@lasermoth not yet, I'll update the issue once we've had time to try it out!

@neitrinoweb

Hello! Any information on whether this has been fixed in 0.32.5?

@MichaHoffmann
Contributor

MichaHoffmann commented Dec 11, 2023

> Hello! Any information on whether this has been fixed in 0.32.5?

There were some improvements, but the general method of iterating the bucket has not changed. There is an idea of defaulting to the previous behaviour and enabling the current one (for cases where it helps) via a hidden flag, and, in the long term, adding a bucket index so we can sync cheaply, but no work has been done on that yet.

EDIT: I'm a potato, this was about something else, please disregard.

@thomas-maurice
Author

thomas-maurice commented Feb 5, 2024

Hello! Sorry for the late reply, but yes, this was fixed in subsequent Thanos versions!

I'm closing this.
