
Thanos Store Bucket Operation Latency since v0.33.0 #7072

Open
Pidu2 opened this issue Jan 17, 2024 · 4 comments
Labels
component: store, package: objstore

Comments

Pidu2 commented Jan 17, 2024

Thanos, Prometheus and Golang version used:

  • Thanos v0.33.0
  • Prometheus v2.48.1

Object Storage Provider:
Cloudscale.ch

What happened:
Since upgrading Thanos from v0.32.5 to v0.33.0, we have been running into the ThanosStoreObjstoreOperationLatencyHigh alert, and the corresponding metric shows a 3-6x increase in the 99th percentile.

What you expected to happen:
No significant increase in latency.

How to reproduce it (as minimally and precisely as possible):
Screenshot of our production environment (the increase corresponds exactly to the upgrade time):
[Screenshot from 2024-01-17 13-55-14]
Screenshot of our testing environment (same setup, less load) after we downgraded back to 0.32.5 (the decrease corresponds exactly to the downgrade time):
[Screenshot from 2024-01-17 13-54-11]

Full logs to relevant components:

Some logs AFTER upgrade

ts=2024-01-17T13:06:13.343281201Z caller=fetcher.go:529 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=1.903627604s duration_ms=1903 cached=904 returned=594 partial=0
ts=2024-01-17T13:09:13.762500525Z caller=fetcher.go:529 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=2.322083442s duration_ms=2322 cached=904 returned=594 partial=0
ts=2024-01-17T13:12:13.239154955Z caller=fetcher.go:529 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=1.798598508s duration_ms=1798 cached=905 returned=595 partial=0
ts=2024-01-17T13:15:13.725025768Z caller=fetcher.go:529 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=2.285317139s duration_ms=2285 cached=905 returned=595 partial=0
ts=2024-01-17T13:18:13.574275454Z caller=fetcher.go:529 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=2.134137495s duration_ms=2134 cached=905 returned=595 partial=0
ts=2024-01-17T13:21:13.266295648Z caller=fetcher.go:529 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=1.826640613s duration_ms=1826 cached=905 returned=595 partial=0
ts=2024-01-17T13:24:13.620279818Z caller=fetcher.go:529 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=2.179145307s duration_ms=2179 cached=905 returned=595 partial=0
ts=2024-01-17T13:27:13.411437655Z caller=fetcher.go:529 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=1.971582294s duration_ms=1971 cached=905 returned=595 partial=0

Some logs BEFORE upgrade

ts=2024-01-09T22:11:42.717109681Z caller=fetcher.go:487 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=407.652353ms duration_ms=407 cached=626 returned=326 partial=0
ts=2024-01-09T22:10:53.586476697Z caller=fetcher.go:487 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=1.943787221s duration_ms=1943 cached=846 returned=546 partial=0
ts=2024-01-09T22:08:42.481675919Z caller=fetcher.go:487 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=172.403147ms duration_ms=172 cached=626 returned=326 partial=0
ts=2024-01-09T22:07:53.490727796Z caller=fetcher.go:487 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=1.848686114s duration_ms=1848 cached=846 returned=546 partial=0
ts=2024-01-09T22:05:42.608920073Z caller=fetcher.go:487 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=299.362529ms duration_ms=299 cached=626 returned=326 partial=0
ts=2024-01-09T22:04:53.447103042Z caller=fetcher.go:487 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=1.804983733s duration_ms=1804 cached=846 returned=546 partial=0
ts=2024-01-09T22:02:42.469191789Z caller=fetcher.go:487 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=159.853342ms duration_ms=159 cached=626 returned=326 partial=0

Anything else we need to know:
args

    - store
    - --data-dir=/data
    - --index-cache-size=500MB
    - --chunk-pool-size=500MB
    - --objstore.config-file=/etc/thanos/thanos-s3-credentials/thanos-s3-credentials.yaml
    - --store.caching-bucket.config-file=/etc/thanos/cache/bucket-cache-config.yaml

caching-bucket.config-file:

type: MEMCACHED # Case-insensitive
config:
  addresses:
    - dns+thanos-store-memcached-headless:11211
  timeout: 450ms
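
For reference, the ThanosStoreObjstoreOperationLatencyHigh alert is computed from the objstore operation-duration histogram. Below is a minimal sketch of an equivalent alerting rule, assuming the standard thanos_objstore_bucket_operation_duration_seconds metric, the 2s threshold and for: 10m mentioned in this thread, and a hypothetical job matcher:

# Sketch only: verify against the rules actually installed by your thanos-mixin/chart.
groups:
  - name: thanos-store-objstore
    rules:
      - alert: ThanosStoreObjstoreOperationLatencyHigh
        expr: |
          histogram_quantile(
            0.99,
            sum by (job, le) (
              rate(thanos_objstore_bucket_operation_duration_seconds_bucket{job=~".*thanos-store.*"}[5m])
            )
          ) > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Thanos Store p99 objstore operation latency is above 2s."
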
hp197 commented Jan 19, 2024

We went from Bitnami chart 12.13.0 (Thanos 0.32.1) to chart 12.20.1 (Thanos 0.33) and see exactly the same behavior; we also use S3 objstore as the backend:

version:

  build user:       root@293b7b0c32a8
  build date:       20231219-10:37:39
  go version:       go1.21.5
  platform:         linux/amd64
  tags:             netgo

args:

      --log.level=info
      --log.format=logfmt
      --grpc-address=0.0.0.0:10901
      --http-address=0.0.0.0:10902
      --data-dir=/data
      --objstore.config-file=/conf/objstore.yml

[image]

*** 2nd Jan is the date of the version change, so it really looks like something in this version is the cause.

pharaujo (Contributor) commented:

We also observed this exact issue when we upgraded from 0.32.5 to 0.33.0, and rolled back. I had a look at the changes between the versions that could have caused this, and I now suspect this might not be an issue per se, but rather something that previously wasn't being measured correctly: thanos-io/objstore#79 instrumented the iter operation, which was previously always reported as 0. In my environment, this operation takes much longer than all the other ones, which makes the p99 latency of bucket operations increase dramatically (the purple line is the p99 of iter, the green one is the overall p99):

[Screenshot 2024-01-19 at 15 34 24]
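
A breakdown along these lines reproduces the comparison above. This is a sketch, assuming the standard Thanos objstore histogram and its operation label; the expressions can equally be run as ad-hoc graph queries:

# Sketch, not upstream rules: compare per-operation p99 (e.g. operation="iter")
# against the overall p99.
groups:
  - name: thanos-store-objstore-latency-breakdown
    rules:
      - record: job_operation:thanos_objstore_bucket_operation_duration_seconds:p99
        expr: |
          histogram_quantile(
            0.99,
            sum by (job, operation, le) (
              rate(thanos_objstore_bucket_operation_duration_seconds_bucket[5m])
            )
          )
      - record: job:thanos_objstore_bucket_operation_duration_seconds:p99
        expr: |
          histogram_quantile(
            0.99,
            sum by (job, le) (
              rate(thanos_objstore_bucket_operation_duration_seconds_bucket[5m])
            )
          )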

Maybe the 2s threshold of the ThanosStoreObjstoreOperationLatencyHigh alert needs to be tweaked to account for the newly instrumented operation?
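
One way to do that, sketched below under the same metric-name assumption, is to override the alert locally: either raise the threshold or exclude iter from the quantile. This is an illustration, not an upstream rule:

# Local override sketch: exclude the newly instrumented "iter" operation
# from the p99 (or alternatively raise the 2s threshold).
- alert: ThanosStoreObjstoreOperationLatencyHigh
  expr: |
    histogram_quantile(
      0.99,
      sum by (job, le) (
        rate(thanos_objstore_bucket_operation_duration_seconds_bucket{operation!="iter"}[5m])
      )
    ) > 2
  for: 10m
  labels:
    severity: warning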

hp197 commented Jan 19, 2024

We haven't looked into it enough yet, but we have also been wondering whether it is simply the reporting that changed. This feeling is fuelled by the fact that in all the other metrics we have, we don't see any behavioural change in S3.

douglascamata added the component: store and package: objstore labels on Jan 23, 2024
Pidu2 (Author) commented Feb 29, 2024

FYI: after upgrading to v0.34.1, the latency spike has come down a bit, enough that it has stopped triggering the alert for now (the for: 10m condition is never reached):

The upgrade happened at around 11:00:
[image]
