Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add proposal for store instance high availability #404

Merged
merged 2 commits into from
Oct 26, 2018
Merged
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
131 changes: 131 additions & 0 deletions docs/proposals/201807_store_instance_high_availability.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,131 @@
# High-availability for store instances

Status: draft | **in-review** | rejected | accepted | complete

Proposal author: [@mattbostock](https://github.com/mattbostock)
Implementation owner: [@mattbostock](https://github.com/mattbostock)

## Motivation

Thanos store instances currently have no explicit support for
high-availability; query instances treat all store instances equally. If
multiple store instances are used as gateways to a single bucket in an object
store, Thanos query instances will wait for all instances to respond (subject
to timeouts) before returning a response.

## Goals

- Explicitly support and document high availability for store instances.

- Reduce the query latency incurred by failing store instances when other store
instances could return the same response faster.

## Proposal

Thanos supports deduplication of metrics retrieved from multiple Prometheus
servers to avoid gaps in query responses where a single Prometheus server
failed but similar data was recorded by another Prometheus server in the same
failure domain. To support deduplication, Thanos must wait for all Thanos
sidecar servers to return their data (subject to timeouts) before returning a
response to a client.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is worth to mention, that we DO not ignore these timeouts, even with the HA sidecar scenario we treat one replica being inaccessible as an "error case" - which might be wrong or confusing and definitely different to store gateway case (:

Copy link
Contributor Author

@mattbostock mattbostock Jul 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good shout, I'll make that explicit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second thoughts, I'm not sure if we need to expand on this as I want to avoid confusing the reader by giving too much detail. Maybe I should remove the 'subject to timeouts' part?


When retrieving data from Thanos bucket store instances, however, the desired
behaviour is different; we want Thanos use the first successful response it
receives, on the assumption that all bucket store instances that communicate
with the same bucket have access to the same data.

To support the desired behaviour for bucket store instances while still
allowing for deduplication, we propose to expand the [InfoResponse
Protobuf](https://github.com/improbable-eng/thanos/blob/b67aa3a709062be97215045f7488df67a9af2c66/pkg/store/storepb/rpc.proto#L28-L32)
used by the Store API by adding a single boolean field to indicate whether the
store instance in question is acting as a 'gateway':

```diff
--- before 2018-07-02 15:49:09.000000000 +0100
+++ after 2018-07-02 15:49:13.000000000 +0100
@@ -1,5 +1,6 @@
message InfoResponse {
repeated Label labels = 1 [(gogoproto.nullable) = false];
int64 min_time = 2;
int64 max_time = 3;
+ bool gateway = 4;
}
```

Thanos bucket store instances (i.e. store instances that act as 'gateways' to
AWS S3 or Google Cloud Storage) will set `gateway` to `true`. A `bool` type in
Protobuf defaults to false, so the behaviour of other existing store instances
that do not explicitly set a value for `gateway` will not be affected.

If a store instance is a gateway, query instances will treat each store
instance in a label group as having access to the same data. Query instances
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How to tell if stores are from same label groups if they do not put any labels in current logic?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is related to the alternative mentioned below. will comment there

will randomly pick any two store instances[1][] from the same gateway group and
use the first response returned.

[1]: https://www.eecs.harvard.edu/~michaelm/postscripts/mythesis.pdf

Otherwise, query instances will wait for all replicas within the same
label group to respond (subject to existing timeouts) before returning a
response, consistent with the current behaviour.

### Scope

Horizontal scaling should be handled separately and is out of scope for this proposal.

## User experience

From a user's point of view, query responses should be faster and more reliable:

- Running multiple bucket store instances will allow the query to be served even
if a single store instance fails.

- Query latency should be lower since the response will be served from the
first bucket store instance to reply.

The user experience for query responses involving only Thanos sidecars will be
unaffected.

## Alternatives considered

### Implicitly relying on store labels

Rather than expanding the `InfoResponse` Protobuf, we had originally considered relying on
an empty set of store labels to determine that a store instance was acting as a gateway.

We decided against this approach as it would make debugging harder due to its
implicit nature, and is likely to cause bugs in future.

## Open issues

### Querying from multiple buckets or object stores
Copy link
Contributor

@domgreen domgreen Jul 4, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I personally prefer this approach, would allow us to run queries that span multiple buckets.

Example - we collect info from two tenants each of which gets uploaded to their own bucket tenantA and tenantB if we wanted to query the metrics for both adding the a Identifier to the InfoResponse we could have the bucket name on there and it would be valid to wait for both responses.

A further use case of this could be to have buckets that are historical / archive data and some that are within x days and would mean that you could spin up store nodes quicker for the recent bucket as would require less overhead at start.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree here with @domgreen. Maybe slight change in gateway: bool into something like gateway string with bucket name would allow us for basic functional sharding of stores in addition HA. What do you think @mattbostock ?

Copy link
Contributor

@xjewer xjewer Jul 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it makes sense having a string instead of bool

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, I think using a string identifier is the way forward.

The Protobuf definition could be:

bool deduplicated = 4
string store_group_id = 5

The deduplicated boolean would determine how the query instance handles retrieving the data (i.e. whether to wait for all results). I think I prefer deduplicated over gateway as it's more specific to the functionality we care about in this case. Sidecar instances would have deduplicated set to false. Bucket stores would have deduplicated set to true.

It'd be up to the specific implementation to determine what the store_group_id string is. For bucket stores, I think concatenating the object store URL and bucket name would make sense. For all other stores, I'd leave it empty for now until we have a good reason to use it.

The logic would work like this:

if deduplicated:
  query_and_use_fastest_response_per_store_group
else
   query_and_wait_for_all_responses

Stores would be grouped by store_group_id. Stores having an empty store_group_id field would be considered all part of the same group.

Since boolean fields are false by default in Protobufs, deduplicated would be false and existing instances would be treated as before for backwards compatibility (i.e. the query instance will wait for all responses, subject to timeouts).

If you agree, I'll update the proposal to match.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we really need a redundant deduplicated option? Isn't it enough having specified store_group_id as a object store+bucket_name as identifier? Thus, we don't have to wait for all responses if stores have the same store_group_id.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can account only non-zero string in store_group_id for the deduplication, hence older version would work as before

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think relying on an empty string is too implicit as it alters the behaviour of how responses are treated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this's a disadvantage, if you describe the behaviour in the documentation and the help. Otherwise you have to bear in mind both options to be working as expected.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the massive delay in reviews and thanks for this (:

store_group_id makes sense to me. deduplicated is tricky. I don't like it because both store gateways with query_and_use_fastest_response_per_store_group and sidecars with query_and_wait_for_all_responses logic are kind of duplicated, so we can work on naming definitely.

However, what about another idea:
In Info let's add:

  • store_group_id
  • peer_type (like it is in gossip cluster.PeerState.Type).

I am proposing this because I am trying to figure out how to make deduplication flow consistent across ALL components. Including not yet solved: Ruler deduplication!

Basically, with this we can deduct correct behavior everywhere:

  • Store gateway: store_group_id would be as @mattbostock suggested and peerType=store so query can perform query_and_use_fastest_response_per_store_group logic.
  • Sidecars: store_group_id would be all external labels as it is ID of those. peerType=source so query can assume query_and_wait_for_all_responses flow with some replica label deduplication.
  • Ruler: The tricky part is that we might want these in HA, but it's hard to define external labels for these, so having some arbitrary store_group_id would be perfect. Plus ruler is peerType=source so same replica deduplication might apply here as well.

I am not saying we want solve ruler with this proposal, just that we might want to reuse same thing and move ALL to store_group_id for deduplication logic for consistency.

Tricky part here:
4 Rules (in 2 different groups) - it would be not trivial to implement this, because the query_and_wait_for_all_responses + dedup is done truly by removing (or not) replica label. But that is something not for proposal anyway.

What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comments all.

@Bplotka: I think your proposal makes sense, I'll update this PR to use that.


Thanos users may wish to use multiple buckets or multiple object stores
concurrently.

Bucket store instances expose a empty set of labels, so there is no way for the
query instance to distinguish between buckets or object stores.

We may wish to solve this by exposing a distinguishing identifier for the
bucket store in the `InfoResponse`. This identifier field might supplement or
replace the `gateway` boolean field (i.e. a null identifier would mean that the
store instance is not a gateway).

## Glossary

### Label group

A 'label group' is a group of store instances having an identical set of labels,
including the same label names and label values.

## Related future work

### Sharing data between store instances

Thanos bucket stores download index and metadata from the object store on
start-up. If multiple instances of a bucket store are used to provide high
availability, each instance will download the same files for its own use. These
file sizes can be in the order of gigabytes.

Ideally, the overhead of each store instance downloading its own data would be
avoided. We decided that it would be more appropriate to tackle sharing data as
part of future work to support the horizontal scaling of store instances.