Querier Read Limits #5438

Open
bwplotka opened this issue Jun 23, 2022 · 2 comments
Labels
difficulty: hard, dont-go-stale, help wanted

Comments

@bwplotka
Member

With more streamed fanout (#5296) we are finally able to add series/sample limits to Querier for QoS. This is because we now know immediately how many series and samples we fetch globally into Querier, instead of learning that only after we have computed most of the query.

NOTE: This does not mean we don't need limits on other components like Store (there is already some limit there):

image

ref

However, we can control a lot from the Querier when it comes to QoS. The main problem is that queries are never uniform. Some are ultra-small (samples over the last 5 minutes, 10 series), some are ultra-large (samples over a year, 100 million series). This is why the only "limit" options we have on Querier are not enough:

image

ref

With query-frontend we can mitigate this problem by splitting queries into 1d time windows, but this still has some downsides:

  • The same vertical cardinality problem appears (e.g. millions of series with one sample at the same time)
  • We can't leverage downsampled data with such an approach.

AC:

  • Short term:
    • Ability to add query limits so w
  • Long:
    • Limits per tenant
    • Dynamic per query limit, based on available capacity in Querier (other queries).
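
One possible shape for the dynamic per-query limit is a Querier-wide budget that in-flight queries reserve from and give back to. The sketch below is only an illustration under that assumption; `SharedBudget`, `TryReserve`, and `Release` are hypothetical names and not part of Thanos.

```go
package main

import "sync"

// SharedBudget is a hypothetical Querier-wide budget (e.g. in samples)
// shared by all in-flight queries.
type SharedBudget struct {
	mu       sync.Mutex
	capacity uint64
	inUse    uint64
}

// NewSharedBudget creates a budget with the given total capacity.
func NewSharedBudget(capacity uint64) *SharedBudget {
	return &SharedBudget{capacity: capacity}
}

// TryReserve claims n units for a query. It returns false when the other
// running queries have left less than n units available.
func (b *SharedBudget) TryReserve(n uint64) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if n > b.capacity-b.inUse { // inUse never exceeds capacity
		return false
	}
	b.inUse += n
	return true
}

// Release returns n units to the budget once a query finishes.
func (b *SharedBudget) Release(n uint64) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if n > b.inUse {
		n = b.inUse
	}
	b.inUse -= n
}
```

With this shape, how much a single query may fetch depends on what the other running queries currently hold. A per-tenant variant could keep one such budget per tenant.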

Initial Ideas:

  • Start with a simple approach first: add a series/sample limit to Querier.
  • Find a metric that represents the limiting factor (both memory and latency/CPU), e.g. bytes processed, or perhaps something similar to write DPM.

cc @GiedriusS

@bwplotka added the difficulty: hard, help wanted, and dont-go-stale labels on Jun 23, 2022
@douglascamata
Contributor

I am also very interested in this.

Is there already a configurable timeout for queries that will cause work cancellation to be pushed downstream?

@douglascamata
Contributor

  • Long:
    • Limits per tenant

So there is some interest in making other parts of Thanos tenant-aware via configuration of a tenant label name?
