
store: add ability to limit max num of samples / concurrent queries #798

Merged: 57 commits merged into thanos-io:master on Mar 23, 2019

Conversation

@GiedriusS (Member) commented Feb 1, 2019

Add the ability to limit the number of samples that can be retrieved in a single query. We do this by checking how many chunks would have to be downloaded to satisfy any given query and multiplying that by 120 -- the new maxSamplesPerChunk constant. Why 120 was chosen is explained above the constant and in the comments below.

Add the ability to limit the number of concurrent queries - this is done by using a Prometheus gate. Another type has been added which wraps that gate and registers metrics so that users know how many queries are in the queue, what the limit is, and how long it takes for queries to get through the gate.

This is a huge pain point for us as it is impossible to do capacity planning otherwise.

Changes

Two new options:

  • store.grpc.series-sample-limit, which lets you control how many samples can be returned at most via a single Series() call. It can potentially overestimate the number of samples, but that is noted in the command line parameter's help text.
  • store.grpc.series-max-concurrency, which lets you control the maximum number of concurrent queries - they get blocked until there is free space in the channel. Users can look at the new metrics to determine whether a lot of queries are waiting to be executed. A sketch of how these flags could be registered is shown right after this list.
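For illustration only, here is a minimal standalone sketch of how the two flags above could be registered with kingpin. The flag names and the concurrency default of 20 come from this PR's discussion; the variable names, help strings, sample-limit default, and the `main` wrapper are assumptions, not the merged code.

```go
// Sketch only: flag names follow this PR, everything else is illustrative.
package main

import (
	"fmt"

	"gopkg.in/alecthomas/kingpin.v2"
)

func main() {
	app := kingpin.New("store", "sketch of the new Thanos Store limit flags")

	// 0 means "no limit" for the sample limit, per the phased-rollout discussion below.
	maxSampleCount := app.Flag("store.grpc.series-sample-limit",
		"Maximum amount of samples returned via a single Series call. 0 means no limit.").
		Default("0").Uint()
	// 20 mirrors the concurrency limit discussed for Thanos Query.
	maxConcurrent := app.Flag("store.grpc.series-max-concurrency",
		"Maximum number of concurrent Series calls.").
		Default("20").Int()

	kingpin.MustParse(app.Parse([]string{}))
	fmt.Println(*maxSampleCount, *maxConcurrent)
}
```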

New metrics:

  • thanos_bucket_store_queries_dropped_total shows how many queries were dropped due to the samples limit;
  • thanos_bucket_store_queries_limit is a constant metric which shows how many concurrent queries can come into Thanos Store;
  • thanos_bucket_store_queries_in_flight_total shows how many queries are currently "in flight" i.e. they are being executed;
  • thanos_bucket_store_gate_duration_seconds shows how many seconds it took for queries to pass through the gate, both when passing fails and when it succeeds. A sketch of the gate wrapper that exposes these metrics follows below.
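A hedged sketch of the gate wrapper described above: it wraps the Prometheus gate and exposes the in-flight gauge and gate-duration histogram. The metric names come from this PR; the type, field, and method names, and the import path for the Prometheus gate package, are assumptions and may differ from the merged code.

```go
// Sketch of a metrics-wrapping gate; not the exact merged implementation.
package store

import (
	"context"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/prometheus/pkg/gate"
)

// Gate wraps the Prometheus gate with metrics about in-flight and queued queries.
type Gate struct {
	g               *gate.Gate
	inflightQueries prometheus.Gauge
	gateTiming      prometheus.Histogram
}

// NewGate returns a gate limiting maxConcurrent queries and registers its metrics.
func NewGate(maxConcurrent int, reg prometheus.Registerer) *Gate {
	g := &Gate{
		g: gate.New(maxConcurrent),
		inflightQueries: prometheus.NewGauge(prometheus.GaugeOpts{
			Name: "thanos_bucket_store_queries_in_flight_total",
			Help: "Number of queries that are currently in flight.",
		}),
		gateTiming: prometheus.NewHistogram(prometheus.HistogramOpts{
			Name: "thanos_bucket_store_gate_duration_seconds",
			Help: "How many seconds it took for queries to wait at the gate.",
		}),
	}
	if reg != nil {
		reg.MustRegister(g.inflightQueries, g.gateTiming)
	}
	return g
}

// IsMyTurn blocks until a slot is free (or ctx is done) and records how long that took.
func (g *Gate) IsMyTurn(ctx context.Context) error {
	start := time.Now()
	defer func() { g.gateTiming.Observe(time.Since(start).Seconds()) }()

	if err := g.g.Start(ctx); err != nil {
		return err
	}
	g.inflightQueries.Inc()
	return nil
}

// Done releases a slot at the gate.
func (g *Gate) Done() {
	g.inflightQueries.Dec()
	g.g.Done()
}
```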

Verification

E2E tests that exercise the new limits pass, plus manual testing locally.

@bwplotka (Member) commented Feb 1, 2019

Hey, the idea is super nice and we definitely need this, but I will have time to review only from next week.

I would aim for the same options as Prometheus has, so a sample limit but also a concurrent-queries limit. We should aim for the same experience basically (:

Thanks for this!

@GiedriusS GiedriusS changed the title [WIP] store: add ability to limit max num of samples store: add ability to limit max num of samples / concurrent queries Feb 4, 2019
@GiedriusS (Member, Author) commented Feb 4, 2019

Sorry for the many commits - I noticed after the fact that the index reader only returns chunk references, not data about the chunks themselves. I moved the check down to the place where the retrieved chunks are converted into the "proper" format - it seems Prometheus does the same, I took a look at it too. It would be nice to extend the e2e test to exercise this spot, and a new metric showing how many queries are currently being executed would probably be nice too.

Convert raw chunks into XOR encoded chunks and call the NumSamples()
method on them to calculate the number of samples. Rip out the samples
calculation into a different function because it is used in two
different places.
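A sketch of what the approach in this commit message amounts to, assuming the TSDB chunkenc package (this decode-and-count path was later replaced by the 120-samples-per-chunk estimate discussed further down); the function name and surrounding structure are illustrative.

```go
// Sketch: decode raw chunk bytes as XOR chunks and sum their samples.
// Assumes github.com/prometheus/tsdb/chunkenc; not the exact merged code.
package store

import (
	"github.com/prometheus/tsdb/chunkenc"
)

// countSamples decodes every raw chunk and adds up the samples it contains.
func countSamples(rawChunks [][]byte) (int, error) {
	total := 0
	for _, raw := range rawChunks {
		c, err := chunkenc.FromData(chunkenc.EncXOR, raw)
		if err != nil {
			return 0, err
		}
		total += c.NumSamples()
	}
	return total, nil
}
```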
@GiedriusS force-pushed the feature/store_sample_limit branch 2 times, most recently from cd6cb48 to e87f763 on February 5, 2019 at 09:23
@GiedriusS (Member, Author) commented Feb 5, 2019

I believe this is ready for a review. My only concern is that, AFAICT, there can be huge compacted chunks made by thanos compact; those would be downloaded, and only after the fact could we check that we have exceeded the limit. It seems the index reader only stores the minimum and maximum time and the chunk ref, not the other metadata - we have to download the time series data first and decode it to check the number of samples. Any thoughts?

@bwplotka (Member) commented Feb 5, 2019

Well, each chunk has at most ~120 samples, so I think it is fine if we just estimate this.
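For illustration, a hedged sketch of the estimate this suggests: count the chunks a Series() call would touch and reject the query before downloading anything. The function and parameter names here are made up, not the PR's actual code.

```go
// Sketch of the chunk-count-based estimate discussed here; names are
// illustrative, not the merged implementation.
package store

import "fmt"

const maxSamplesPerChunk = 120

// estimateAndCheck returns an error if fetching numChunks chunks could
// exceed the configured sample limit.
func estimateAndCheck(numChunks, sampleLimit uint64) error {
	if sampleLimit == 0 {
		return nil // 0 means no limit.
	}
	// Each chunk holds at most ~120 samples, so this can only overestimate.
	if estimated := numChunks * maxSamplesPerChunk; estimated > sampleLimit {
		return fmt.Errorf("estimated %d samples exceeds limit of %d", estimated, sampleLimit)
	}
	return nil
}
```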

@bwplotka (Member) commented Feb 5, 2019

Compaction compacts blocks by compacting the index file. Chunks stay the same (they just end up in different "chunk files"), but the chunk (or segment) length stays the same.

@GiedriusS (Member, Author) commented Feb 6, 2019

Thank you for clarifying the part about TSDB/compaction - I wasn't sure since I don't yet have a good mental model of how it all works. Then it is all good, as we will not have to download huge chunks.

As for the first message - are you saying that we should hardcode the number 120 inside the code? It doesn't seem like a good idea to me. Let me explain why:

  • I think it is much better if we overestimate the number of samples that would be in the response by ~120 than underestimate it by thousands. It seems to me that a lot of metrics nowadays are short-lived, or in other words spread over a lot of different chunks - especially in "service discovery" setups. For example, Consul nodes could stay around for only ~10 minutes, go away, and come back after ~2 hours. At worst we could be off by ~119 samples for each chunk, which seems like a lot to me, especially if the query's range is, say, 1.5 years. What is also important to me and to users, IMHO, is getting the bang for the buck - I wouldn't want our servers to sit unused just because the software that we use overestimates the load :(
  • Also, what happens if at some point in the future the default length of a TSDB block changes from ~2 hours to some other duration? The estimation would be even more off;
  • Also, I don't see a big problem with downloading the chunks until the limit has been reached. It seems to me that with the other option --chunk-pool-size we could achieve very predictable RAM usage, since Go's GC should quickly determine that the RAM has been deallocated because we aren't allocating ad-hoc objects in this process. The limit is RAM, not network bandwidth. Furthermore, I think the most important thing here is that the query gets stopped at the lowest stage possible so that Thanos Query doesn't receive this huge block of data.

Let me know what you think of my comments. If you really think this is the way to go forward and it's the only way it could get merged then I can change it.

@bwplotka (Member) left a comment:

Great overall, some suggestions though. Three major ones regarding:

  • unnecessary decoding and even downloading;
  • the gate queueing time metric;
  • the sample limit code.

> As for the first message - are you saying that we should hardcode the number 120 inside the code? It doesn't seem like a good idea to me.

> I wouldn't want our servers to sit unused just because the software that we use overestimates the load :(

Agreed, but look at my comment - we can estimate with great confidence by checking the length in bytes.

> Also, what happens if at some point in the future the default length of a TSDB block changes from ~2 hours to some other duration? The estimation would be even more off;

Block length can be 200 years, but chunks will still be 120 samples maximum unless we change that in TSDB. If we do, we can think about a better solution (: But it won't happen anytime soon. See why here: prometheus-junkyard/tsdb#397

> Also, I don't see a big problem with downloading the chunks until the limit has been reached. It seems to me that with the other option --chunk-pool-size we could achieve very predictable RAM usage, since Go's GC should quickly determine that the RAM has been deallocated because we aren't allocating ad-hoc objects in this process. The limit is RAM, not network bandwidth. Furthermore, I think the most important thing here is that the query gets stopped at the lowest stage possible so that Thanos Query doesn't receive this huge block of data.

Sure, but we can avoid downloading everything - see my comment.

(Inline review comments on pkg/store/bucket.go and pkg/store/gate.go - all resolved.)
@FUSAKLA (Member) left a comment:

Generally OK from me, and definitely a useful feature to have, great! :)

Just a few nits and suggestions.

(Inline review comments on cmd/thanos/store.go, pkg/store/gate.go, pkg/store/bucket.go, and pkg/store/bucket_e2e_test.go - all resolved.)
@GiedriusS (Member, Author) commented Feb 7, 2019

Ah, I see where that 120 value comes from now after looking at prometheus-junkyard/tsdb#397. It makes much more sense to me and I agree 👍 I will just add a stern note to that option so that users know it may, in unlikely cases, overestimate the number of samples a single Series() request will take. I also didn't want to put more work into maintaining this in the future, but now I'm sure it will almost certainly stay the same. Will rework that part to calculate with this assumption.
As for the gating tests - I'm afraid that would be impossible, since it is infeasible to generate a big enough TSDB in the unit tests due to the increased times and load. I'm sure we can trust that code since it has been battle-tested in Prometheus and we are reusing it completely, just with the addition of a metric.
I agree with all of the other comments - will fix them :) Thank you again for the reviews! Will cook up the second version of this change soon.

@GiedriusS force-pushed the feature/store_sample_limit branch 2 times, most recently from cd6cb48 to 9d0b8a7 on February 8, 2019 at 09:29
@bwplotka (Member) left a comment:

LGTM

efficiency we take 120 as number of samples in
chunk, so the actual number of samples might be
lower, even though maximum could be hit. Cannot
be bigger than 120.
A reviewer (Member) commented:

I'm confused. You say it cannot be bigger than 120 but default is 50000000 ?

@GiedriusS (Member, Author) replied Mar 17, 2019:
It means that the upper bound of samples per chunk is 120. Maybe we should move this part into parentheses in the former sentence: "NOTE: for efficiency we take 120 as the number of samples in a chunk (it cannot be bigger than that), so the actual number of samples might be lower, even though the maximum could be hit."
Would that be clearer?

Help: "Number of queries that were dropped due to the sample limit.",
})
m.queriesLimit = prometheus.NewGauge(prometheus.GaugeOpts{
Name: "thanos_bucket_store_queries_limit",
A reviewer (Member) commented:
The Prometheus team uses prometheus_engine_queries_concurrent_max. Naming is hard, but I figure we could be consistent with them?

https://github.com/prometheus/prometheus/blob/8155cc49924cd48ec552506b4e5519ebafa8b722/promql/engine.go#L234-L239

@GiedriusS (Member, Author) replied:
Good point. I will rename this to thanos_bucket_store_queries_concurrent_max.
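For reference, a small sketch of what the renamed gauge could look like, mirroring the hunk quoted above; the Help string and the use of `m` and `maxConcurrent` from the surrounding code are assumptions.

```go
// Sketch only: mirrors the quoted hunk, with the metric renamed as discussed.
m.queriesLimit = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "thanos_bucket_store_queries_concurrent_max",
	Help: "Number of maximum concurrent queries.",
})
m.queriesLimit.Set(float64(maxConcurrent))
```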

@domgreen (Contributor) left a comment:

Great stuff, I really like adding this... I think in a future PR we could probably apply it to some of the wider components (query could probably do with limiting), so making the flags more generic would help with this in the future.
Comments are mainly nits around conforming with some of the styles and conventions that are already in the Prometheus codebase.
With the default limit we could also set it to 0, meaning no limit, and then do a phased rollout: set it to 0 for the initial release so we know we will not break anyone, and make sure the release notes say that the following release will set a limit. This way people can do some planning and testing before it is set hard.

@@ -36,6 +36,12 @@ func registerStore(m map[string]setupFunc, app *kingpin.Application, name string
chunkPoolSize := cmd.Flag("chunk-pool-size", "Maximum size of concurrently allocatable bytes for chunks.").
Default("2GB").Bytes()

maxSampleCount := cmd.Flag("grpc-sample-limit",
@domgreen (Contributor) commented:

Maybe align the flags with how Prometheus names its flags? grpc-sample-limit > {query|store|storage}.grpc.read-sample-limit?

See storage.remote.read-sample-limit https://github.com/prometheus/prometheus/blob/master/cmd/prometheus/main.go#L206

"Maximum amount of samples returned via a single Series call. 0 means no limit. NOTE: for efficiency we take 120 as number of samples in chunk, so the actual number of samples might be lower, even though maximum could be hit. Cannot be bigger than 120.").
Default("50000000").Uint()

maxConcurrent := cmd.Flag("grpc-concurrent-limit", "Maximum number of concurrent Series calls. 0 means no limit.").Default("20").Int()
@domgreen (Contributor) commented:

As above grpc-concurrent-limit > query.grpc.max-concurrency?

See query.max-concurrency https://github.com/prometheus/prometheus/blob/master/cmd/prometheus/main.go#L233

@GiedriusS (Member, Author) replied:

Will rename to store.grpc.max-concurrency as we are talking about Thanos Store here :P

@@ -42,6 +42,14 @@ import (
"google.golang.org/grpc/status"
)

// Approximately this is the max number of samples that we may have in any given chunk. This is needed
// for precalculating the number of samples that we may have to retrieve and decode for any given query
// without downloading them. Please take a look at https://github.com/prometheus/tsdb/pull/397 to know
@domgreen (Contributor) commented:

Think this is fine 👍

(Resolved inline comment on pkg/store/bucket.go.)
// Query gate which limits the maximum amount of concurrent queries.
queryGate *Gate

// Samples limiter which limits the number of samples per each Series() call.
@domgreen (Contributor) commented:

Samples limiter which > samplesLimiter

g: gate.New(maxConcurrent),
}
g.inflightQueries = prometheus.NewGauge(prometheus.GaugeOpts{
Name: "queries_in_flight",
@domgreen (Contributor) commented:
queries_in_flight > thanos_store_queries_in_flight_total?

Subsystem: reg.Subsystem(),
})
g.gateTiming = prometheus.NewHistogram(prometheus.HistogramOpts{
Name: "gate_seconds",
@domgreen (Contributor) commented:

gate_seconds > thanos_store_gate_duration_seconds?

CHANGELOG.md Outdated
New tracing span:
* `store_query_gate_ismyturn` shows how long it took for a query to pass (or not) through the gate.

:warning: **WARNING** :warning: #798 adds new default limits for max samples per one Series() gRPC method call and the maximum number of concurrent Series() gRPC method calls. Consider increasing them if you have a very huge deployment.
@domgreen (Contributor) commented:

Let's do a phased rollout of this... can we have 0 mean off? That way we can phase the rollout by warning people that we will be setting a sensible default and that they should ensure their system can deal with it or set it explicitly for their needs.

@GiedriusS (Member, Author) replied:

Sure, perhaps most users won't want this.

(Resolved inline comments on CHANGELOG.md.)
@GiedriusS (Member, Author) commented:

You had already reviewed it again, @domgreen, before I had fixed all of the places 😄 I have only made naming and default-value changes since @bwplotka's review, so I think this is ready for merging.

@bwplotka (Member) left a comment:

Well, I already approved a week ago, so feel free to merge. Found one nit, though.

(Resolved inline comment on pkg/store/bucket.go.)
Giedrius Statkevičius and others added 5 commits March 22, 2019 17:20
The original error already informs us about what is going wrong.
It's still useful to know that we are talking about samples here
exactly.
Setting it to 0 by default doesn't make sense since the Go channel
becomes unbuffered and all queries will time out. Set it to 20 by default
since that's the limit on Thanos Query, so naturally there won't be more
than 20 concurrent queries by default.
@GiedriusS (Member, Author) commented Mar 23, 2019

I reverted the changes to the default value of --store.grpc.series-max-concurrency to make it 20, as it is on Thanos Query and Prometheus. If the value is 0 then an unbuffered channel is created and all queries start failing. That seemed fine to @bwplotka some iterations before this, so I think it's good, as long as we mention it in the changelog, and because we still haven't reached 1.0. Also, in general, it doesn't make sense to provide such "unlimited" interfaces.
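To make the unbuffered-channel behavior concrete, here is a minimal sketch of how the underlying gate is assumed to work (not Thanos code): Start() sends into a channel buffered by the concurrency limit, so with a limit of 0 every Start() blocks until the context times out.

```go
// Minimal sketch of the assumed gate mechanics: with maxConcurrent == 0 the
// channel is unbuffered, so Start blocks and every query hits its deadline.
package main

import (
	"context"
	"fmt"
	"time"
)

type gate struct{ ch chan struct{} }

func newGate(maxConcurrent int) *gate {
	return &gate{ch: make(chan struct{}, maxConcurrent)}
}

func (g *gate) Start(ctx context.Context) error {
	select {
	case g.ch <- struct{}{}: // Succeeds only while there is buffer space.
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func (g *gate) Done() { <-g.ch }

func main() {
	g := newGate(0) // Unbuffered: no query can ever acquire a slot.
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()
	fmt.Println(g.Start(ctx)) // Prints "context deadline exceeded".
}
```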

@GiedriusS (Member, Author) commented:

Thank you all for the reviews and comments, merging (:

@GiedriusS GiedriusS merged commit f24d555 into thanos-io:master Mar 23, 2019
@GiedriusS GiedriusS deleted the feature/store_sample_limit branch March 23, 2019 10:27