
[Store gateway] Highly increased latency on "get_range" bucket operations on 0.30.2 #6540

Closed
thomas-maurice opened this issue Jul 20, 2023 · 22 comments

Comments

@thomas-maurice

Thanos, Prometheus and Golang version used:

  • Thanos 0.30.2
  • Prometheus 2.44.0

Object Storage Provider:

Amazon S3

What happened:

We recently upgraded Thanos from 0.28.1 to 0.30.2 (we didn't upgrade to the newer 0.31.X version because of the deduplication bug #6257). After the upgrade we have seen a dramatic increase in latency for get_range operations on the store gateway. We didn't spot it initially because the environments we first tested in handle very little data, but after upgrading our bigger clusters we noticed the p99 going from a few hundred milliseconds to sometimes over 2 or even 5 seconds, depending on the environment, as shown on the graph below.

[Screenshot: get_range latency before and after the upgrade]

Our storegateway pods are configured as follows:

    Args:
      store
      --log.level=info
      --log.format=logfmt
      --grpc-address=0.0.0.0:10901
      --http-address=0.0.0.0:10902
      --data-dir=/data
      --objstore.config-file=/conf/objstore.yml
      --max-time=-7d
      --min-time=-1y

What you expected to happen:

No significant change in the global bucket operations latencies

How to reproduce it (as minimally and precisely as possible):

Not sure how to reproduce; in our case it was as simple as upgrading the running container from 0.28.1 to 0.30.2. We have noticed that this behaviour appears on our two busiest clusters.

Full logs to relevant components:

Anything else we need to know:

We are running on Amazon EKS

@fpetkovski
Contributor

Do you have any way to check request latencies in AWS S3 directly? I wonder if the way they are measured has changed between versions. Also has the number of requests gone up after the upgrade?
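One way to check the second question from the store gateway's own metrics is to compare the get_range request rate and p99 latency across the upgrade. A minimal sketch, assuming the default thanos_objstore_bucket_operation_* metrics and a job label of "thanos-store" (adjust the selectors to your setup):

      # get_range requests per second issued against object storage
      sum(rate(thanos_objstore_bucket_operations_total{job="thanos-store", operation="get_range"}[5m]))

      # p99 latency of get_range operations as measured by the store gateway
      histogram_quantile(0.99, sum by (le) (
        rate(thanos_objstore_bucket_operation_duration_seconds_bucket{job="thanos-store", operation="get_range"}[5m])
      ))

If the request rate jumps after the upgrade while object-storage-side latency stays flat, that would point at the store gateway issuing more (or smaller) range requests per query rather than S3 itself getting slower.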

@thomas-maurice
Author

I couldn't find any S3-side latency metrics in AWS. We seem to be observing this only on 2 of our 6 environments, and these are the environments that are the most loaded in terms of metrics/queries. I don't think this has to do with a change in how measurements are done, because I've seen the impact on systems like Grafana after the upgrade.

@kiyanabah

After upgrading to Thanos 0.30.2, we have noticed an increase in sent chunk size. I was wondering whether this might be related to the increased latency on get_range bucket operations, which impacts the operation's performance.
[Screenshot attached]

@mjimeneznet

We've downgraded Thanos back to 0.28.1 and everything went back to previous values. For example:

[Screenshots attached for:]

  • Request duration between Query and StoreGateway
  • Duration of bucket operations in StoreGateway
  • Merge duration in StoreGateway
  • Memory usage in StoreGateway
  • CPU usage in Thanos Query
  • Memory usage in Thanos Query
  • Chunk size

@douglascamata
Contributor

@thomas-maurice could you try 0.29.1 so that we can look at the metrics and try to narrow down the exact release that changed this behavior?

@mjimeneznet

Hello! We have upgraded one of the affected environments to 0.29.0 and, looking at the graphs, I can see that it is behaving similarly to 0.28.1: the performance is good, the timing is good, and everything is working as expected.
I'm sharing the same screenshots (the red arrow is the rollout):

[Screenshots attached, same panels as above: request duration between Query and StoreGateway, duration of bucket operations in StoreGateway, merge duration in StoreGateway, memory usage in StoreGateway, CPU usage in Thanos Query, memory usage in Thanos Query, and chunk size.]

@douglascamata
Contributor

@mjimeneznet would you be kind enough to try 0.30.1, please?

@kiyanabah

kiyanabah commented Aug 2, 2023

> @mjimeneznet would you be kind enough to try 0.30.1, please?

Hi @douglascamata, we've performed the upgrade in one environment, moving the Thanos image in both the Prometheus Operator and the Thanos components from version 0.29.0 to 0.30.1. Additionally, we've updated the Helm chart to version 11.6.8.

Following these upgrades, we've observed a noticeable increase in chunk size and in the number of get_range bucket operations. This increase has triggered the ThanosStoreObjstoreOperationLatencyHigh alert.
[Screenshots attached: store gateway metrics, memory usage, CPU usage, and bucket operations.]

@douglascamata
Contributor

douglascamata commented Aug 2, 2023

Ok, so as a summary:

  • 0.28.1 = good
  • 0.29 = good
  • 0.30.1 = bad
  • 0.30.2 = bad

Now, onto some investigation: the only notable change to my eyes from 0.29 to 0.30.1 in terms of Thanos Store GW is #5837.

I would recommend trying to bump the hidden CLI flag --debug.series-batch-size on the Store GW. The default is 10_000 (10k), so maybe try 100_000 (100k) and see how it changes latency, resource usage, and API calls to object storage.
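For reference, a minimal sketch of how this could look in the store gateway args from the original report (the 100k value is just a starting point to experiment with, not a recommendation):

    Args:
      store
      --log.level=info
      --log.format=logfmt
      --grpc-address=0.0.0.0:10901
      --http-address=0.0.0.0:10902
      --data-dir=/data
      --objstore.config-file=/conf/objstore.yml
      --max-time=-7d
      --min-time=-1y
      --debug.series-batch-size=100000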

@douglascamata
Contributor

Kind reminder: do not run 0.30.1 for long on a production environment. You want 0.30.2 there asap to have a very important fix from #6086. Thanks a lot for testing it though.

@kiyanabah

> Kind reminder: do not run 0.30.1 for long on a production environment. You want 0.30.2 there asap to have a very important fix from #6086. Thanks a lot for testing it though.

I appreciate your attention to the issue. We did downgrade to 0.29.0 as it is more stable for us for now. Thank you 😊

@thomas-maurice
Author

@douglascamata we tried 0.30.2 with --debug.series-batch-size 100000 and it didn't make any notable difference; we were still observing a huge number of requests and high latency.

We also saw a big increase in data fetches for some reason, as well as a big spike in memory (see attached screenshots).

@fpetkovski
Contributor

@thomas-maurice could you post the contents of the /conf/objstore.yml file?

@thomas-maurice
Author

@fpetkovski it is pretty stock; we didn't customise it:

---
type: s3
config:
  bucket: [BUCKET NAME]
  endpoint: [S3 ENDPOINT]
  aws_sdk_auth: true

And that's it, so I am assuming the storegateway is working with the default config values

@thomas-maurice
Author

@fpetkovski any ideas of things we could try to troubleshoot this further?

@fpetkovski
Contributor

The only thing that comes to mind is to try out 0.32 once we release it. We are waiting on #6317 before we can cut a new release, but there have been many changes since 0.30.2 and the issue might have been addressed already.

@thomas-maurice
Author

Okay :) Will do !

@lasermoth

@thomas-maurice with v0.32.0 / v0.32.2, I was wondering if you have had a chance to test this yet?

@thomas-maurice
Author

@lasermoth not yet, I'll update the issue once we've had time to try it out!

@neitrinoweb

Hello! Any information on whether this has been fixed in 0.32.5?

@MichaHoffmann
Contributor

MichaHoffmann commented Dec 11, 2023

> Hello! Any information on whether this has been fixed in 0.32.5?

There were some improvements, but the general method of iterating the bucket has not changed. There is an idea of defaulting to the previous behaviour and enabling the current one (for cases where it helps) via a hidden flag, and, in the long term, adding a bucket index so we can sync cheaply, but no work has been done on that yet.

EDIT: I'm a potato, this was about something else, please disregard.

@thomas-maurice
Author

thomas-maurice commented Feb 5, 2024

Hello! Sorry for the late reply, but yes, this was fixed in subsequent Thanos versions!

I'm closing this.
