
Fix issue where query-schedulers accumulate stale querier connections #6100

Merged

Conversation

@charleskorn (Contributor) commented Sep 22, 2023

What this PR does

This PR fixes an issue where query-schedulers could accumulate stale querier connections.

getNextQueueForQuerier would return nil both when there were no queries for the querier right now, and when the querier was shutting down and so no queries should be sent to it.

This meant that the dispatcher loop in RequestQueue wouldn't remove connections for queriers that are shutting down, and so would accumulate more and more querier connections. Every time a new request arrived, it would try to find a request to send to the stale queriers (all of which would be at the beginning of waitingQuerierConnections, as the oldest connections), and then finally send the request to an active connection. Over time, as more and more queriers stopped, more and more stale querier connections would accumulate and so enqueue latency would increase.
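To make the ambiguity concrete, here is a minimal sketch of the pre-fix shape of getNextQueueForQuerier (the signature, field names, and helper are illustrative assumptions, not the exact Mimir code):

// Sketch only: the same nil return is used for two very different situations.
func (q *queues) getNextQueueForQuerier(querierID string) chan Request {
	querier := q.queriers[querierID]
	if querier == nil || querier.shuttingDown {
		return nil // the querier is going away: this connection should be torn down...
	}
	queue := q.findQueueWithPendingWork(querierID) // hypothetical helper
	if queue == nil {
		return nil // ...yet this "no work right now" case looks identical to the caller
	}
	return queue
}

Because the dispatcher loop could not tell the two cases apart, it treated both as "try again later" and left the connection in waitingQuerierConnections indefinitely.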

I believe this issue was present before #5880, but #5880 made it visible in the enqueue latency:

  • previously, the enqueue latency measurement only included the time taken to push the request to the tenant's queue channel
  • after Improve query-scheduler performance under load #5880, the enqueue latency measurement also includes the time taken to attempt to immediately dispatch the request to a querier, and so the time taken attempting to find work for these accumulated stale connections became visible in the enqueue latency

Which issue(s) this PR fixes or relates to

Fixes #6090

Checklist

  • Tests updated
  • [n/a] Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@charleskorn charleskorn force-pushed the charleskorn/fix-scheduler-accumulating-stale-connections branch from 4073871 to 36268ee on September 22, 2023 06:19
@charleskorn charleskorn marked this pull request as ready for review September 22, 2023 06:20
@charleskorn charleskorn requested a review from a team as a code owner September 22, 2023 06:20
@jhesketh (Contributor) left a comment:

lgtm

@pracucci (Collaborator) left a comment:

Thanks @charleskorn for working on this! To my understanding, the problem is that we never remove connections from waitingQuerierConnections when the querier disconnects from the query-scheduler.

If my understanding is correct, I have a couple of comments:

  1. I don't think it's just about "GetNextRequestForQuerier calls for this querier that nothing is coming" (all in all, GetNextRequestForQuerier() should get its context canceled as soon as the querier terminates, because of the gRPC context), but I think the problem is that we never clean up waitingQuerierConnections.
  2. I think we should enforce the cleanup even on unregisterConnection (which is more important than notifyShutdown, because the latter is just a best effort).
  3. Why did the issue exist even before? I think we didn't have this issue before. What am I missing?

@@ -151,6 +152,9 @@ func (q *RequestQueue) dispatcherLoop() {
case notifyShutdown:
queues.notifyQuerierShutdown(qe.querierID)
needToDispatchQueries = true

// Tell any waiting GetNextRequestForQuerier calls for this querier that nothing is coming.
q.cancelWaitingConnectionsForQuerier(qe.querierID, waitingQuerierConnections)
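Only the helper's name and arguments are visible in this hunk. As a sketch of the idea (using the standard container/list, which the scheduler's linked list suggests; the element type and error value are assumptions), it would walk the waiting connections and fail the ones belonging to the shutting-down querier:

// Sketch: proactively fail waiting GetNextRequestForQuerier calls for
// a querier that announced a graceful shutdown.
func (q *RequestQueue) cancelWaitingConnectionsForQuerier(querierID string, waiting *list.List) {
	for elem := waiting.Front(); elem != nil; {
		next := elem.Next() // capture before Remove, which detaches elem
		conn := elem.Value.(*waitingQuerierConnection) // assumed element type
		if conn.querierID == querierID {
			conn.sendError(ErrQuerierShuttingDown) // assumed error value
			waiting.Remove(elem)
		}
		elem = next
	}
}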
@pracucci (Collaborator):
The shutdown notification is a best effort. It's not guaranteed to be received (e.g. if the querier OOMs or the K8S node where it was running gets abruptly terminated).

@charleskorn (Author):
If the shutdown notification is not received, we'll try to send a request to the querier (in Scheduler.QuerierLoop()), get an error, call unregisterConnection and not call GetNextRequestForQuerier again - so this is OK.

@pracucci (Collaborator), Sep 25, 2023:

Just to be on the same page: what you're describing is the logic in this PR, not main, right? Because I can see that what you describe happens with the change proposed in this PR.

@charleskorn (Author):
Just to be on the same page: what you're describing is the logic in this PR, not main, right? Because I can see that what you describe happens with the change proposed in this PR.

The behaviour in the unclean shutdown case should be the same in both.

I find it a bit difficult to follow this logic. Wouldn't it be safer and clearer if we removed a querier from waitingQuerierConnections even when handling unregisterConnection?

unregisterConnection is only called by Scheduler.QuerierLoop(), and so isn't called concurrently with GetNextRequestForQuerier, so removing the pending connection from waitingQuerierConnections is not necessary, because it won't be there.

@pracucci (Collaborator):
I would improve the comment to say that this is just a nice-to-have: it's OK if we don't receive the shutdown notification, because waitingQuerierConnections will be cleaned up the next time we try to dispatch a request to the querier.

@dimitarvdimitrov (Contributor) commented:
I agree with @pracucci's comment. I think the user queues already handle queriers that have disconnected all their connections well.

It would be nice to have a single management structure for querier connections. Right now they are managed separately in the linked list and in the queues struct. Should we move them out into a separate structure, so that queues and dispatcherLoop are only clients of that structure? It might be easier to track changes to the linked list, querier map, and querier sorted list if they are closer together (perhaps we can have our own implementation of the linked list where each item belongs to two lists: a per-querier list and the global querier connections queue; see the sketch below). But I realize that may be more work than patching this bug.
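For illustration, a hypothetical sketch of that dual-list node (all identifiers invented, not code from this PR):

// Each node is threaded onto two lists at once: the global FIFO that
// dispatcherLoop scans, and a per-querier list, so that dropping a querier
// can unlink all of its waiting connections without scanning the global list.
type connNode struct {
	conn *waitingQuerierConnection // assumed type from the scheduler

	globalPrev, globalNext   *connNode // position in the global dispatch order
	querierPrev, querierNext *connNode // position among one querier's waiting calls
}

type querierConnections struct {
	globalHead, globalTail *connNode
	perQuerier             map[string]*connNode // querierID -> head of its list
}

queues and dispatcherLoop would then go through this one structure instead of each maintaining its own view.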

@charleskorn (Author) commented:

Thanks @charleskorn for working on this! To my understanding, the problem is that we never remove connections from waitingQuerierConnections when the querier disconnects from the query-scheduler.

waitingQuerierConnections is the list of waiting GetNextRequestForQuerier calls (this might need a better name). When a querier informs the scheduler that it is gracefully shutting down, we set querier.shuttingDown to true, but this causes us to get into a situation where we'll never remove the connection from waitingQuerierConnections because we never try to send a request to it.

These are the steps that get us into the bad state:

  1. The querier makes a call to the scheduler, and Scheduler.QuerierLoop() begins.
  2. Scheduler.QuerierLoop() calls RequestQueue.RegisterQuerierConnection().
  3. Scheduler.QuerierLoop() calls RequestQueue.GetNextRequestForQuerier(). Assume there are no pending queries for the querier, so the dispatcher loop does not immediately send a request back to GetNextRequestForQuerier().
  4. The querier begins to shut down, and calls Scheduler.NotifyQuerierShutdown(). This calls RequestQueue.NotifyQuerierShutdown(), which (through the dispatcher loop) calls queues.notifyQuerierShutdown and sets querier.shuttingDown to true.

Once we're in this state, when a query arrives:

  1. Scheduler.FrontendLoop() calls RequestQueue.EnqueueRequest()
  2. The dispatcher loop picks up the request, and tries to find a querier to perform the query
  3. The dispatcher loop calls RequestQueue.dispatchRequestToQuerier, which calls queues.getNextQueueForQuerier. Because querier.shuttingDown is true, this returns a nil queue, so we don't try to send anything to the querier - but we also don't remove the pending connection from waitingQuerierConnections, and we'll keep doing this until the scheduler process stops (see the sketch below).
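In sketch form, the pre-fix dispatch pass looked roughly like this (a paraphrase of the behaviour described above, not the literal code):

// Pre-fix sketch: one pass over the waiting connections when a request arrives.
for elem := waitingQuerierConnections.Front(); elem != nil; elem = elem.Next() {
	conn := elem.Value.(*waitingQuerierConnection)
	queue := queues.getNextQueueForQuerier(conn.querierID)
	if queue == nil {
		// Could mean "no work for this querier" or "querier shutting down";
		// either way the connection stays in the list and is re-scanned on
		// every future enqueue - this is the accumulation.
		continue
	}
	// ... send the next request from queue to conn and remove elem ...
}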

If my understanding is correct, I have a couple of comments:

  1. I don't think it's just about "GetNextRequestForQuerier calls for this querier that nothing is coming" (all in all, GetNextRequestForQuerier() should get its context canceled as soon as the querier terminates, because of the gRPC context), but I think the problem is that we never clean up waitingQuerierConnections.

My understanding is that the context won't be cancelled until we try to use the stream to send or receive a message and therefore realise the stream is broken, and that won't happen until the call to GetNextRequestForQuerier returns to Scheduler.QuerierLoop(). This is how it behaved in my testing.

  2. I think we should enforce the cleanup even on unregisterConnection (which is more important than notifyShutdown, because the latter is just a best effort).

UnregisterQuerierConnection is only called by Scheduler.QuerierLoop(), which means it won't be called while GetNextRequestForQuerier is running, and so can't help us here unfortunately.

  3. Why did the issue exist even before? I think we didn't have this issue before. What am I missing?

I think the previous code assumed the context would be cancelled if the connection was broken, but that's not the case.

@charleskorn (Author) commented:

I agree with @pracucci's comment. I think the user queues already handle queriers that have disconnected all their connections well.

I agree - the problem in this case is that forgetDisconnectedQueriers() will never remove the querier because UnregisterQuerierConnection() will never be called by QuerierLoop(), because it's blocked on GetNextRequestForQuerier().

@charleskorn (Author) commented:

It would be nice to have a single management structure for querier connections. Right now they are managed separately in the linked list and in the queues struct. Should we move them out into a separate structure, so that queues and dispatcherLoop are only clients of that structure? It might be easier to track changes to the linked list, querier map, and querier sorted list if they are closer together (perhaps we can have our own implementation of the linked list where each item belongs to two lists: a per-querier list and the global querier connections queue). But I realize that may be more work than patching this bug.

I agree we could simplify this a little. However, they represent slightly different things (and may need better names) - the waitingQuerierConnections linked list is used only for outstanding GetNextRequestForQuerier calls, whereas the queues struct represents all ongoing QuerierLoop calls for a single querier.

@pracucci (Collaborator) left a comment:

Ok, I think I now better understand the fix. I propose to improve a couple of comments. Thanks!

Comment on lines +239 to +242
if err != nil {
querierConn.sendError(err)
return true
}
@pracucci (Collaborator):
To my understanding this is the real fix. cancelWaitingConnectionsForQuerier() called by notifyShutdown is just a "nice to have" to gracefully handle shutting-down queriers, but what really fixes the issue in every case (which means even when the shutdown notification has not been received) is this one. I suggest adding a comment here to explain it, because I think it will not be obvious when we read this code again in the future.
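In sketch form, the shape of that real fix (the exact signature is assumed; the key point is that a definitive error now removes the waiting connection):

// Post-fix sketch: getNextQueueForQuerier reports "shutting down" (or
// "unknown querier") as an error instead of a bare nil.
func (q *RequestQueue) dispatchRequestToQuerier(qs *queues, querierConn *waitingQuerierConnection) bool {
	queue, err := qs.getNextQueueForQuerier(querierConn.querierID)
	if err != nil {
		// Fail the waiting GetNextRequestForQuerier call and report the
		// connection as handled, so the dispatcher loop drops it from
		// waitingQuerierConnections - even when no shutdown notification
		// was ever received.
		querierConn.sendError(err)
		return true
	}
	if queue == nil {
		return false // genuinely no work yet: keep the connection waiting
	}
	// ... try to send the next request from queue to querierConn ...
	return true
}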


@charleskorn charleskorn enabled auto-merge (squash) September 25, 2023 10:29
@dimitarvdimitrov (Contributor) commented:

My understanding is that the context won't be cancelled until we try to use the stream to send or receive a message and therefore realise the stream is broken, and that won't happen until the call to GetNextRequestForQuerier returns to Scheduler.QuerierLoop(). This is how it behaved in my testing.

Does this mean that this context cancellation tracking has no effect - here and here? Should we remove it?

@charleskorn charleskorn merged commit a6d8fea into main Sep 25, 2023
28 checks passed
@charleskorn charleskorn deleted the charleskorn/fix-scheduler-accumulating-stale-connections branch September 25, 2023 11:22
grafanabot pushed a commit that referenced this pull request Sep 25, 2023
…#6100)

* Fix issue where query-scheduler accumulates connections from queriers that have shut down.

* Add test to ensure GetNextRequestForQuerier still returns if racing with NotifyQuerierShutdown and NotifyQuerierShutdown executes first.

* Add changelog entry.

* Address PR feedback: add comments

* Clarify comment

(cherry picked from commit a6d8fea)
@charleskorn (Author) commented:

My understanding is that the context won't be cancelled until we try to use the stream to send or receive a message and therefore realise the stream is broken, and that won't happen until the call to GetNextRequestForQuerier returns to Scheduler.QuerierLoop(). This is how it behaved in my testing.

Does this mean that this context cancellation tracking has no effect - here and here? Should we remove it?

I don't know what I did when I was testing this last week (maybe I was checking the wrong context?), but what I said earlier is incorrect: the context is cancelled when the querier either gracefully ends the stream (e.g. due to a graceful shutdown) or is terminated abruptly (e.g. due to the querier crashing).

So my earlier statement that #6090 occurred before and after #5880 was also wrong - the implementation before #5880 was not susceptible to the same issue because it would always return from GetNextRequestForQuerier when the context was cancelled.

The issue in #6090 only occurs with the changes from #5880: we didn't remove pending GetNextRequestForQuerier calls from waitingQuerierConnections when the querier shut down gracefully. In any other situation where the context is cancelled, we'll eventually try to send work to the waiting GetNextRequestForQuerier call, realise that its context is cancelled, and remove it from waitingQuerierConnections (see the sketch below).
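That lazy detection could look like the following sketch (the channel and context field names are assumptions):

// Sketch: cancellation is only observed at dispatch time - there is no
// dedicated goroutine watching each waiting connection's context.
func trySend(conn *waitingQuerierConnection, req Request) (delivered bool) {
	select {
	case conn.recv <- req: // assumed channel the waiting call blocks on
		return true
	case <-conn.ctx.Done():
		// A querier that vanished without a shutdown notification is only
		// noticed here, the next time we try to hand it work; the caller
		// then removes the connection from waitingQuerierConnections.
		return false
	}
}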

We could be more proactive about detecting context cancellation earlier, but I don't think this is worthwhile as it would only complicate things further - we'd now have two places checking for context cancellation. Open to feedback on this if you disagree though @dimitarvdimitrov @pracucci.

The fix in this PR still resolves the issue, so I don't believe any further work is required.

@pracucci (Collaborator) commented:

Thanks @charleskorn for the detailed explanation. I'm happy to hear the old version didn't suffer from this issue. I'm fine keeping the current fix as is, without further complicating the logic to handle the cancellation earlier.

@dimitarvdimitrov (Contributor) commented:

I think it's OK to only detect a cancelled querier connection when we attempt to send to it, since the overhead of retrying the query on another connection shouldn't be high, and because we should fairly quickly discover that the querier connection is cancelled (within seconds under reasonable load?).

But your response prompted another question: since the context is cancelled promptly, do we need the extra cleanup in cancelWaitingConnectionsForQuerier? For all pending querier connections in waitingQuerierConnections, we'd eventually hit this condition and then remove the connection from waitingQuerierConnections.

@pracucci (Collaborator) commented:

But your response prompted another question: since the context is cancelled promptly, do we need the extra cleanup in cancelWaitingConnectionsForQuerier? For all pending querier connections in waitingQuerierConnections, we'd eventually hit this condition and then remove the connection from waitingQuerierConnections.

Agree as well. I'm wondering if we really need that extra logic, which was the point of a comment I made on this PR - it was rejected because, at that point, we thought context cancellation wasn't propagated.

@charleskorn (Author) commented:

Good point, I've removed that in #6145.

Successfully merging this pull request may close these issues:

Query-scheduler performance issue after #5880