
Thanos Store Bucket Operation Latency since v0.33.0 #7072

Open
Pidu2 opened this issue Jan 17, 2024 · 4 comments
Labels
component: store, package: objstore

Comments

Pidu2 commented Jan 17, 2024

Thanos, Prometheus and Golang version used:

  • Thanos v0.33.0
  • Prometheus v2.48.1

Object Storage Provider:
Cloudscale.ch

What happened:
Since upgrading Thanos from v0.32.5 to v0.33.0, we have been running into the ThanosStoreObjstoreOperationLatencyHigh alert, and the corresponding metric shows a 3-6x increase in the 99th percentile.

What you expected to happen:
No significant increase in latency.

How to reproduce it (as minimally and precisely as possible):
Screenshot of our production environment (the increase corresponds exactly to the upgrade time):
[Screenshot from 2024-01-17 13-55-14]
Screenshot of our testing environment (same setup, less load) after we downgraded back to 0.32.5 (the decrease corresponds exactly to the downgrade time):
[Screenshot from 2024-01-17 13-54-11]

Full logs to relevant components:

Some logs AFTER upgrade

ts=2024-01-17T13:06:13.343281201Z caller=fetcher.go:529 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=1.903627604s duration_ms=1903 cached=904 returned=594 partial=0
ts=2024-01-17T13:09:13.762500525Z caller=fetcher.go:529 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=2.322083442s duration_ms=2322 cached=904 returned=594 partial=0
ts=2024-01-17T13:12:13.239154955Z caller=fetcher.go:529 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=1.798598508s duration_ms=1798 cached=905 returned=595 partial=0
ts=2024-01-17T13:15:13.725025768Z caller=fetcher.go:529 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=2.285317139s duration_ms=2285 cached=905 returned=595 partial=0
ts=2024-01-17T13:18:13.574275454Z caller=fetcher.go:529 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=2.134137495s duration_ms=2134 cached=905 returned=595 partial=0
ts=2024-01-17T13:21:13.266295648Z caller=fetcher.go:529 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=1.826640613s duration_ms=1826 cached=905 returned=595 partial=0
ts=2024-01-17T13:24:13.620279818Z caller=fetcher.go:529 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=2.179145307s duration_ms=2179 cached=905 returned=595 partial=0
ts=2024-01-17T13:27:13.411437655Z caller=fetcher.go:529 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=1.971582294s duration_ms=1971 cached=905 returned=595 partial=0

Some logs BEFORE upgrade

ts=2024-01-09T22:11:42.717109681Z caller=fetcher.go:487 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=407.652353ms duration_ms=407 cached=626 returned=326 partial=0
ts=2024-01-09T22:10:53.586476697Z caller=fetcher.go:487 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=1.943787221s duration_ms=1943 cached=846 returned=546 partial=0
ts=2024-01-09T22:08:42.481675919Z caller=fetcher.go:487 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=172.403147ms duration_ms=172 cached=626 returned=326 partial=0
ts=2024-01-09T22:07:53.490727796Z caller=fetcher.go:487 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=1.848686114s duration_ms=1848 cached=846 returned=546 partial=0
ts=2024-01-09T22:05:42.608920073Z caller=fetcher.go:487 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=299.362529ms duration_ms=299 cached=626 returned=326 partial=0
ts=2024-01-09T22:04:53.447103042Z caller=fetcher.go:487 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=1.804983733s duration_ms=1804 cached=846 returned=546 partial=0
ts=2024-01-09T22:02:42.469191789Z caller=fetcher.go:487 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=159.853342ms duration_ms=159 cached=626 returned=326 partial=0

Anything else we need to know:
args

    - store
    - --data-dir=/data
    - --index-cache-size=500MB
    - --chunk-pool-size=500MB
    - --objstore.config-file=/etc/thanos/thanos-s3-credentials/thanos-s3-credentials.yaml
    - --store.caching-bucket.config-file=/etc/thanos/cache/bucket-cache-config.yaml

caching-bucket.config-file:

type: MEMCACHED # Case-insensitive
config:
  addresses:
    - dns+thanos-store-memcached-headless:11211
  timeout: 450ms
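
For reference, the ThanosStoreObjstoreOperationLatencyHigh alert is computed from the objstore operation-duration histogram. Below is a minimal sketch of an equivalent alerting rule, assuming the standard thanos_objstore_bucket_operation_duration_seconds metric, the 2s threshold and for: 10m mentioned in this thread, and a hypothetical job matcher:

# Sketch only: verify against the rules actually installed by your thanos-mixin/chart.
groups:
  - name: thanos-store-objstore
    rules:
      - alert: ThanosStoreObjstoreOperationLatencyHigh
        expr: |
          histogram_quantile(
            0.99,
            sum by (job, le) (
              rate(thanos_objstore_bucket_operation_duration_seconds_bucket{job=~".*thanos-store.*"}[5m])
            )
          ) > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Thanos Store p99 objstore operation latency is above 2s."
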
hp197 commented Jan 19, 2024

We went from Bitnami chart 12.13.0 (Thanos 0.32.1) to chart 12.20.1 (Thanos 0.33) and see exactly the same behavior; we also use S3 objstore as the backend:

version:

  build user:       root@293b7b0c32a8
  build date:       20231219-10:37:39
  go version:       go1.21.5
  platform:         linux/amd64
  tags:             netgo

args:

      --log.level=info
      --log.format=logfmt
      --grpc-address=0.0.0.0:10901
      --http-address=0.0.0.0:10902
      --data-dir=/data
      --objstore.config-file=/conf/objstore.yml

[image]

*** 2nd Jan is the date of the version change, so it really looks like something in this version is the cause.

pharaujo (Contributor) commented:

We also observed this exact issue when we upgraded from 0.32.5 to 0.33.0, and rolled back. I had a look at the changes between the versions that could have caused this, and I now suspect this might not be an issue per se, but rather something that previously wasn't being measured correctly: thanos-io/objstore#79 instrumented the iter operation, which was previously always reported as 0. In my environment, this operation takes much longer than all the other ones, which makes the p99 latency of bucket operations increase dramatically (the purple line is the p99 of iter, the green one is the overall p99):

[Screenshot 2024-01-19 at 15 34 24]
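
A breakdown along these lines reproduces the comparison above. This is a sketch, assuming the standard Thanos objstore histogram and its operation label; the expressions can equally be run as ad-hoc graph queries:

# Sketch, not upstream rules: compare per-operation p99 (e.g. operation="iter")
# against the overall p99.
groups:
  - name: thanos-store-objstore-latency-breakdown
    rules:
      - record: job_operation:thanos_objstore_bucket_operation_duration_seconds:p99
        expr: |
          histogram_quantile(
            0.99,
            sum by (job, operation, le) (
              rate(thanos_objstore_bucket_operation_duration_seconds_bucket[5m])
            )
          )
      - record: job:thanos_objstore_bucket_operation_duration_seconds:p99
        expr: |
          histogram_quantile(
            0.99,
            sum by (job, le) (
              rate(thanos_objstore_bucket_operation_duration_seconds_bucket[5m])
            )
          )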

Maybe the 2s threshold of the ThanosStoreObjstoreOperationLatencyHigh alert needs to be tweaked to account for the newly instrumented operation?
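
One way to do that, sketched below under the same metric-name assumption, is to override the alert locally: either raise the threshold or exclude iter from the quantile. This is an illustration, not an upstream rule:

# Local override sketch: exclude the newly instrumented "iter" operation
# from the p99 (or alternatively raise the 2s threshold).
- alert: ThanosStoreObjstoreOperationLatencyHigh
  expr: |
    histogram_quantile(
      0.99,
      sum by (job, le) (
        rate(thanos_objstore_bucket_operation_duration_seconds_bucket{operation!="iter"}[5m])
      )
    ) > 2
  for: 10m
  labels:
    severity: warning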

hp197 commented Jan 19, 2024

We haven't looked into it enough yet, but we have also been wondering whether it is simply the reporting that changed. This feeling is fuelled by the fact that in all the other metrics we have, we don't see any behavioural change in S3.

douglascamata added the component: store and package: objstore labels on Jan 23, 2024
Pidu2 (Author) commented Feb 29, 2024

FYI: after upgrading to v0.34.1, the latency spike has come down a bit, enough that it has stopped triggering the alert for now (the for: 10m condition is never reached):

The upgrade happened at around 11:00:
[image]
