-
Notifications
You must be signed in to change notification settings - Fork 512
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
apply multidimensional query request queuing: supply queue dimensions from frontend & utilize in scheduler #6772
apply multidimensional query request queuing: supply queue dimensions from frontend & utilize in scheduler #6772
Conversation
782a372
to
af50eaa
Compare
saving this benchmarking output here: I thought this didn't look great at first, but the increase in memory usage from adding the second queue dimension scales sublinearly with the number of tenants. There also is a smaller linear increase with the number of requests due to adding a field to the With our worst case
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall design looks good to me, thanks for working on this @francoposa.
You've already pointed out that there are still some tests you want to add - one in particular that I'd like to see is a test that shows the max queue length is enforced regardless of the distribution of requests across dimensions. (For example, if the max queue length is 100, then it shouldn't matter if there are 100 ingester-only requests waiting, or 50 for ingesters and 50 for store-gateways: either way, another request should be rejected.)
default: | ||
// no query time params to parse; cannot infer query component | ||
return nil, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Main open question here is: do we give this case no additional queue dimension or do we make the "ingester-and-store-gateway" the "default"/"we don't know" queue.
If we do not give it in additional queue dimensions it would get enqueued with only the tenant dimension, not into the subqueues. This works just fine as mentioned before - any requests from the v1 frontend will work this way as well.
But I see a case for doing our best to place everything coming in from the v2 frontend into one of the subqueues, and it seems appropriate to make "ingester-and-store-gateway"
the default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What queries would end up in this bucket?
If we're not expecting anything to end up in this bucket, then logging that we got a type of query we didn't understand and assigning them to the tenant-only dimension seems reasonable to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added a warning log.
the only thing that would end up in this bucket is a something that's not recognized as a range, instant, labels, or cardinality query, but somehow passed all validation above this in the query-frontend stack. I do not know if such a query exists.
this should be ready, except for the open question on whether the default additional queue dimension should be none or Tests were added to cover the adapter pretty completely, as well as the issue of the multi-dimensional queues respecting the tenant-level max requests limit. |
default: | ||
// no query time params to parse; cannot infer query component | ||
return nil, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What queries would end up in this bucket?
If we're not expecting anything to end up in this bucket, then logging that we got a type of query we didn't understand and assigning them to the tenant-only dimension seems reasonable to me.
"github.com/stretchr/testify/require" | ||
) | ||
|
||
const rangeURLFormat = "/api/v1/query_range?end=%d&query=&start=%d&step=%d" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nit] Do these need to be constants? Looks like they're only used in one place
084c9bd
to
4280867
Compare
feature flagging, all additional test cases, and warning log should be done. Have continued to run it locally under load gen and with and without the feature flags and everything seems happy. |
…cter handling of query time params, some tests on adapter
…r tests for adapter
…ant queue node and all subqueue nodes
…sions; fix log message
…end and scheduler tests
…to query ingesters within and query store after
6845899
to
6a15ef0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM modulo some minor comments below - thanks for working on this @francoposa!
if qb.additionalQueueDimensionsEnabled { | ||
if schedulerRequest, ok := request.req.(*SchedulerRequest); ok { | ||
return append(QueuePath{string(request.tenantID)}, schedulerRequest.AdditionalQueueDimensions...), nil | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this check (and the corresponding query-scheduler config flag) here? I think the feature flag on query-frontends is enough: if that flag is disabled, frontends won't send any extra dimensions, so the path created here will just have the tenant dimension.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
context from slack, just decided to leave it as is for now:
We don’t technically need it, but I had the separate flag because:
a) it seemed odd to just have the flag in the frontend when all the important/complicated stuff is happening in the scheduler and
b) It just seemed more proper/less coupled? If something needs troubleshooting, we could use on flag to stop sending extra dimensions from the frontend or another flag for the scheduler to just ignore the extra dimensions.Don’t feel strongly though. I scanned the config docs and there are other multi-component features that need a flag on both components to work.
The only member of docs metrics is out for some time, so I will need get a Mimir repo admin to override the merge requirements here |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tee index file looks good.
@francoposa and @lwandz13, thank you for your work on this while I was on PTO. I have added the |
Changelog to be done once we solidify the shape of this.
This is the second part of our multi-dimensional queuing project to mitigate the issue where a single slow query component (like store-gateway) can also slow down a tenant's requests that don't hit that component (ingesters-only queries), or vice versa.
The first pull request #6533 introduced the TreeQueue datastructures which can handle an arbirary number of queue dimensions, but kept it to the single queue dimension of
tenantID
.This pull request adds code to query-frontend to assign the additional queue dimension query component for a query (ingesters, store-gateway, or both), and adds the handling of that into the scheduler as well.
Various design discussions went into this, but the most critical were the decision that the calculation of additional queue dimensions should take place in the query-frontend.
The idea is that the query-frontend is already Prometheus-aware, that is it understands the shape of Prometheus queries and can make decisions based on Prometheus query attributes. The scheduler needs to utilize some arbitrary information to create additional queue dimensions, but it should not need to know anything about the shape of the items in its queue, Prometheus queries or otherwise.
As noted in the comments below, as of PR submission, this is only applied to the v2 frontend - there were some conflicting ideas on whether the functionality should be added to the v1 frontend.
The multi-dimension queue implementation is robust to enqueuing and fairly dequeuing a tenant's request even if the requests are mixed between having an additional queue dimension or not.
This means the scheduler can tolerate failures to calculate additional queue dimensions as well as receiving requests from a frontend which has not implemented the additional queue dimension calculations at the same time as receiving requests from a frontend which has.
What this PR does
Which issue(s) this PR fixes or relates to
Fixes #
Checklist
CHANGELOG.md
updated - the order of entries should be[CHANGE]
,[FEATURE]
,[ENHANCEMENT]
,[BUGFIX]
.about-versioning.md
updated with experimental features.