Add proposal for store instance high availability #404
# High-availability for store instances

Status: draft | **in-review** | rejected | accepted | complete

Proposal author: [@mattbostock](https://github.com/mattbostock)
Implementation owner: [@mattbostock](https://github.com/mattbostock)

## Motivation

Thanos store instances currently have no explicit support for
high availability; query instances treat all store instances equally. If
multiple store instances are used as gateways to a single bucket in an object
store, Thanos query instances will wait for all instances to respond (subject
to timeouts) before returning a response.

## Goals

- Explicitly support and document high availability for store instances.

- Reduce the query latency incurred by failing store instances when other store
  instances could return the same response faster.

## Proposal

Thanos supports deduplication of metrics retrieved from multiple Prometheus
servers to avoid gaps in query responses where a single Prometheus server
failed but similar data was recorded by another Prometheus server in the same
failure domain. To support deduplication, Thanos must wait for all Thanos
sidecar servers to return their data (subject to timeouts) before returning a
response to a client.

When retrieving data from Thanos bucket store instances, however, the desired
behaviour is different; we want Thanos to use the first successful response it
receives, on the assumption that all bucket store instances that communicate
with the same bucket have access to the same data.

To support the desired behaviour for bucket store instances while still
allowing for deduplication, we propose to expand the [InfoResponse
Protobuf](https://github.com/improbable-eng/thanos/blob/b67aa3a709062be97215045f7488df67a9af2c66/pkg/store/storepb/rpc.proto#L28-L32)
used by the Store API by adding a single boolean field to indicate whether the
store instance in question is acting as a 'gateway':

```diff
--- before	2018-07-02 15:49:09.000000000 +0100
+++ after	2018-07-02 15:49:13.000000000 +0100
@@ -1,5 +1,6 @@
 message InfoResponse {
   repeated Label labels = 1 [(gogoproto.nullable) = false];
   int64 min_time = 2;
   int64 max_time = 3;
+  bool gateway = 4;
 }
```

Thanos bucket store instances (i.e. store instances that act as 'gateways' to
AWS S3 or Google Cloud Storage) will set `gateway` to `true`. A `bool` type in
Protobuf defaults to false, so the behaviour of other existing store instances
that do not explicitly set a value for `gateway` will not be affected.

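The default-false behaviour can be illustrated with a short sketch. The struct and function below are illustrative only, not the generated Protobuf code or the actual Thanos store implementation:

```go
package main

import "fmt"

// InfoResponse mirrors the proposed Store API message; an illustrative
// struct, not the generated Protobuf code.
type InfoResponse struct {
	MinTime int64
	MaxTime int64
	Gateway bool
}

// bucketStoreInfo shows a bucket store gateway setting the new field
// explicitly; sidecars and other stores leave it at the zero value
// (false), matching the Protobuf default.
func bucketStoreInfo(minT, maxT int64) InfoResponse {
	return InfoResponse{MinTime: minT, MaxTime: maxT, Gateway: true}
}

func main() {
	fmt.Println(bucketStoreInfo(0, 1000).Gateway) // gateways advertise true
	fmt.Println(InfoResponse{}.Gateway)           // all other stores default to false
}
```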
If a store instance is a gateway, query instances will treat each store
instance in a label group as having access to the same data. Query instances
will randomly pick any two store instances[1][] from the same gateway group and
use the first response returned.

[1]: https://www.eecs.harvard.edu/~michaelm/postscripts/mythesis.pdf

Otherwise, query instances will wait for all replicas within the same
label group to respond (subject to existing timeouts) before returning a
response, consistent with the current behaviour.

### Scope

Horizontal scaling should be handled separately and is out of scope for this proposal.

## User experience

From a user's point of view, query responses should be faster and more reliable:

- Running multiple bucket store instances will allow the query to be served even
  if a single store instance fails.

- Query latency should be lower since the response will be served from the
  first bucket store instance to reply.

The user experience for query responses involving only Thanos sidecars will be
unaffected.

## Alternatives considered

### Implicitly relying on store labels

Rather than expanding the `InfoResponse` Protobuf, we had originally considered relying on
an empty set of store labels to determine that a store instance was acting as a gateway.

We decided against this approach as it would make debugging harder due to its
implicit nature, and would be likely to cause bugs in future.

## Open issues

### Querying from multiple buckets or object stores

Thanos users may wish to use multiple buckets or multiple object stores
concurrently.

Bucket store instances expose an empty set of labels, so there is no way for the
query instance to distinguish between buckets or object stores.

We may wish to solve this by exposing a distinguishing identifier for the
bucket store in the `InfoResponse`. This identifier field might supplement or
replace the `gateway` boolean field (i.e. a null identifier would mean that the
store instance is not a gateway).

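One possible shape for such an identifier, sketched as a further change to the `InfoResponse` Protobuf (the field name and number are illustrative, not a settled design):

```diff
 message InfoResponse {
   repeated Label labels = 1 [(gogoproto.nullable) = false];
   int64 min_time = 2;
   int64 max_time = 3;
   bool gateway = 4;
+  string bucket_id = 5;
 }
```

An empty `bucket_id` (the Protobuf default for strings) could then indicate a non-gateway store, which might eventually subsume the `gateway` boolean, at the cost of making the behaviour more implicit.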
## Glossary

### Label group

A 'label group' is a group of store instances having an identical set of labels,
including the same label names and label values.

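Grouping stores by label set can be sketched by building a canonical key from sorted `name=value` pairs, so that label order does not affect group membership. The types and function below are illustrative, not the Thanos implementation:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Label mirrors a store label; an illustrative type.
type Label struct{ Name, Value string }

// groupKey builds a canonical key from a store's labels so that stores
// with identical label names and values land in the same label group,
// regardless of the order in which labels are reported.
func groupKey(labels []Label) string {
	parts := make([]string, 0, len(labels))
	for _, l := range labels {
		parts = append(parts, l.Name+"="+l.Value)
	}
	sort.Strings(parts)
	return strings.Join(parts, ",")
}

func main() {
	a := []Label{{"cluster", "eu1"}, {"replica", "1"}}
	b := []Label{{"replica", "1"}, {"cluster", "eu1"}} // same set, different order
	fmt.Println(groupKey(a) == groupKey(b))            // same label group
}
```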
## Related future work

### Sharing data between store instances

Thanos bucket stores download index and metadata from the object store on
start-up. If multiple instances of a bucket store are used to provide high
availability, each instance will download the same files for its own use. These
file sizes can be in the order of gigabytes.

Ideally, the overhead of each store instance downloading its own data would be
avoided. We decided that it would be more appropriate to tackle sharing data as
part of future work to support the horizontal scaling of store instances.