Query-scheduler performance issue after #5880 #6090

pracucci · 2023-09-21T10:53:27Z

We're analyzing some read path latency which recently got worse and we've found that looks like there's an issue with the query-scheduler after #5880, which causes the time it takes to enqueue a query in the query-scheduler to grow over time. Looking when the issue started, it began with the rollout of the weekly release n. 255.

The average enqueue latency is measured using:

sum by(namespace) (rate(cortex_query_scheduler_enqueue_duration_seconds_sum{container="query-scheduler"}[5m]))
/
sum by(namespace) (rate(cortex_query_scheduler_enqueue_duration_seconds_count{container="query-scheduler"}[5m]))
* 1000

We can observe that it grows over time. For example, this is our staging environment:

These are a couple of our production environments:

The text was updated successfully, but these errors were encountered:

pracucci · 2023-09-21T11:29:07Z

In our staging environment, it looks like the issue didn't show after upgrading from weekly release 255 to 256. The weekly release 256 was rolled out on 2023-09-20 at 16:30 UTC:

charleskorn · 2023-09-22T06:15:31Z

I believe #6100 will fix this issue. See the description in the PR for more details.

pracucci assigned charleskorn Sep 21, 2023

charleskorn mentioned this issue Sep 22, 2023

Fix issue where query-schedulers accumulate stale querier connections #6100

Merged

2 tasks

charleskorn closed this as completed in #6100 Sep 25, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Query-scheduler performance issue after #5880 #6090

Query-scheduler performance issue after #5880 #6090

pracucci commented Sep 21, 2023

pracucci commented Sep 21, 2023

charleskorn commented Sep 22, 2023

Query-scheduler performance issue after #5880 #6090

Query-scheduler performance issue after #5880 #6090

Comments

pracucci commented Sep 21, 2023

pracucci commented Sep 21, 2023

charleskorn commented Sep 22, 2023