Store Gateway: Unexpected postings length #6402

Closed
yeya24 opened this issue May 31, 2023 · 8 comments

@yeya24
Contributor

yeya24 commented May 31, 2023

Thanos, Prometheus and Golang version used:

Latest version of Thanos

What happened:

ts=2023-05-27T19:03:40.876442873Z caller=grpc_logging.go:64 level=warn method=/gatewaypb.StoreGateway/Series duration=24.56289ms err="rpc error: code = Aborted desc = fetch series for block 01H14YNTKA70WYGBBZ7ZD1SFQM: expanded matching posting: get postings: decode postings: unexpected postings length, should be 6741151740 bytes for 1685287935 postings, got 2431 bytes" msg=gRPC

This is an error log from the Cortex store gateway component, which is essentially a wrapper around the Thanos store gateway.
The error message from the Thanos part was: fetch series for block 01H14YNTKA70WYGBBZ7ZD1SFQM: expanded matching posting: get postings: decode postings: unexpected postings length, should be 6741151740 bytes for 1685287935 postings, got 2431 bytes.

We got a lot of blocks throwing almost the same error when decoding fetched postings from cache, with the same number of expected postings and length.

I checked the code and found that the error comes from here: https://github.com/thanos-io/thanos/blob/main/pkg/store/bucket.go#L2429

The weird thing is that 1685287935 is a strange value for the number of postings, and none of our blocks have that many series. The number of postings is simply the first uint32 of the data fetched from the cache, so I think something might have gone wrong in the caching layer. We are using memcached.
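To make the numbers in the error concrete, here is a minimal sketch of the kind of length check described above: the first four bytes are read as a big-endian uint32 count, and the rest of the payload is expected to hold 4 bytes per posting. The function name `checkPostingsLength` is hypothetical and this is not the actual Thanos or Prometheus code. The example input also rests on an assumption that is not stated in this thread: that the cached payload began with the ASCII bytes `dss` followed by `0xFF`. Read as a big-endian uint32, those four bytes are exactly 1685287935.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// checkPostingsLength is a hypothetical stand-in for the decoder's check.
// It reads the first 4 bytes as a big-endian uint32 posting count and
// requires the remaining payload to be exactly 4 bytes per posting.
func checkPostingsLength(b []byte) (int64, error) {
	if len(b) < 4 {
		return 0, errors.New("buffer too short for postings header")
	}
	n := int64(binary.BigEndian.Uint32(b[:4]))
	payload := b[4:]
	if int64(len(payload)) != 4*n {
		return 0, fmt.Errorf(
			"unexpected postings length, should be %d bytes for %d postings, got %d bytes",
			4*n, n, len(payload))
	}
	return n, nil
}

func main() {
	// Assumed input: a payload in some other encoding whose first four
	// bytes are 'd', 's', 's', 0xFF, followed by 2431 more bytes.
	buf := append([]byte{'d', 's', 's', 0xFF}, make([]byte, 2431)...)
	_, err := checkPostingsLength(buf)
	fmt.Println(err)
}
```

Under that assumption the check reports the same count and expected length as the log line, which suggests the decoder was reading a header from a different encoding as if it were a postings count.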

The error occurred once, and we have not been able to reproduce it since.

What you expected to happen:

No such issue.

@yeya24
Contributor Author

yeya24 commented May 31, 2023

Closing, as we found that the root cause was #6303.
We had previously been running a Thanos version that includes this PR, which changes the compression scheme used for data in our cache.

We then deployed another image with an older Thanos version that does not include that PR. When the store gateway tried to read data from the cache, it failed because it could not understand the compression scheme used there.

@yeya24
Contributor Author

yeya24 commented May 31, 2023

We think the same issue might happen during the rollout of this change as well.
During the rollout, some store gateways are running the streamed snappy version while others are not. If the cached data is encoded the previous way, with plain snappy, then both kinds of store gateway can read it, because the new version is backward compatible.

However, if the cached data is encoded with streamed snappy, then a store gateway on the older version will fail to decode it.

@yeya24
Contributor Author

yeya24 commented May 31, 2023

One idea I have so far to improve the rollout:

Use different cache keys for snappy and streamed snappy encoding. During rollout, a store gateway using streamed snappy encoding will get a cache miss and fetch the data from S3, while an older store gateway can still use the existing cache key. Eventually all cache keys will be in the new format, because the old entries expire via the cache TTL. The downside is that this might consume more items/memory in our cache, but it is the most seamless way.
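A minimal sketch of that idea, assuming a made-up key layout: the `Codec` type, the `P:` prefix and the `postingsCacheKey` function are all hypothetical and do not reflect the real Thanos cache key format. The point is only that the plain snappy key stays unchanged, so older store gateways keep hitting existing entries, while streamed snappy entries live under a separate key.

```go
package main

import "fmt"

// Codec names the encoding of a cached postings payload (hypothetical type).
type Codec string

const (
	CodecSnappy         Codec = "snappy"
	CodecStreamedSnappy Codec = "streamed-snappy"
)

// postingsCacheKey builds a cache key for the postings of one label pair in
// one block. The layout is illustrative only. Keys for the old codec are
// left unchanged so that old readers still find their existing entries.
func postingsCacheKey(blockID, labelName, labelValue string, codec Codec) string {
	base := fmt.Sprintf("P:%s:%s:%s", blockID, labelName, labelValue)
	if codec == CodecSnappy {
		return base
	}
	// Any newer codec gets its own key space, so an old reader never
	// fetches bytes it cannot decode.
	return base + ":" + string(codec)
}

func main() {
	block := "01H14YNTKA70WYGBBZ7ZD1SFQM"
	fmt.Println(postingsCacheKey(block, "job", "api", CodecSnappy))
	fmt.Println(postingsCacheKey(block, "job", "api", CodecStreamedSnappy))
}
```

With this layout both encodings of the same postings can sit in the cache at once until the old entries expire, which is the extra memory cost mentioned above.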

@fpetkovski
Contributor

We had a similar case when adding native histograms to query frontend. Maybe errors from cache retrieval should lead to invalidating the key?

@yeya24
Contributor Author

yeya24 commented May 31, 2023

Maybe errors from cache retrieval should lead to invalidating the key?

In this case, if the cached content is encoded using streamed snappy, then it is valid and we shouldn't invalidate it. An old version of the store gateway could, I think, ignore the decoding error and fetch the data from S3 without setting any cache entries. But I feel that might have some edge cases as well, so using different cache keys should be easier. WDYT?

Btw, I am not aware of any way we have to invalidate a key in a remote cache. Do you mean we set the key to a predefined value that represents invalid data?
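For comparison, the fallback alternative mentioned above could look roughly like the sketch below. Everything here is hypothetical: the map stands in for the remote cache, and `getPostings`, `errUnknownCodec` and the callback parameters are invented for illustration. A decode failure is treated as a cache miss, and the existing entry is left untouched because a newer store gateway may still be able to read it.

```go
package main

import (
	"errors"
	"fmt"
)

// errUnknownCodec is a hypothetical error for a cached payload whose
// encoding this reader does not understand.
var errUnknownCodec = errors.New("unknown postings codec")

// getPostings returns the postings for key, preferring the cache.
// If the cached bytes cannot be decoded, it falls back to the bucket and
// does not overwrite the cache entry.
func getPostings(
	cache map[string][]byte,
	key string,
	decode func([]byte) ([]uint32, error),
	encode func([]uint32) []byte,
	fromBucket func(string) ([]uint32, error),
) ([]uint32, error) {
	cached, hit := cache[key]
	if hit {
		if p, err := decode(cached); err == nil {
			return p, nil
		}
		// Decode failed: behave like a miss, but keep the entry as it is.
	}
	p, err := fromBucket(key)
	if err != nil {
		return nil, fmt.Errorf("fetch postings from bucket: %w", err)
	}
	if !hit {
		// Only populate the cache on a true miss.
		cache[key] = encode(p)
	}
	return p, nil
}

func main() {
	cache := map[string][]byte{"k": []byte("new-format-bytes")}
	oldDecode := func([]byte) ([]uint32, error) { return nil, errUnknownCodec }
	encode := func(p []uint32) []byte { return []byte(fmt.Sprint(p)) }
	bucket := func(string) ([]uint32, error) { return []uint32{1, 2, 3}, nil }

	p, err := getPostings(cache, "k", oldDecode, encode, bucket)
	fmt.Println(p, err, string(cache["k"]))
}
```

One edge case this leaves open: an old reader pays the bucket fetch on every request for such a key until it is upgraded, which is part of why separate cache keys may be simpler.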

@fpetkovski
Contributor

Hm I see, so an old version of store-gw can't read new cached postings right?

@yeya24
Contributor Author

yeya24 commented Jun 1, 2023

@fpetkovski Yeah... So it might be a problem during rollout, and we currently return an error if decoding fails.

@yeya24
Contributor Author

yeya24 commented Jun 22, 2023

Closing this one, as I think we can fix this by using a different cache key.

@yeya24 yeya24 closed this as completed Jun 22, 2023