
Fix issue where query-schedulers accumulate stale querier connections #6100

Merged

Conversation

@charleskorn (Contributor) commented Sep 22, 2023

What this PR does

This PR fixes an issue where query-schedulers could accumulate stale querier connections.

getNextQueueForQuerier would return nil both when there were no queries for the querier right now, and when the querier was shutting down and so no queries should be sent to it.

This meant that the dispatcher loop in RequestQueue wouldn't remove connections for queriers that are shutting down, and so would accumulate more and more querier connections. Every time a new request arrived, it would try to find a request to send to the stale queriers (all of which would be at the beginning of waitingQuerierConnections, as the oldest connections), and then finally send the request to an active connection. Over time, as more and more queriers stopped, more and more stale querier connections would accumulate and so enqueue latency would increase.
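To make the ambiguity concrete, here is a minimal sketch of the pre-fix shape of getNextQueueForQuerier (the signature, field names, and helper are illustrative assumptions, not the exact Mimir code):

// Sketch only: the same nil return is used for two very different situations.
func (q *queues) getNextQueueForQuerier(querierID string) chan Request {
	querier := q.queriers[querierID]
	if querier == nil || querier.shuttingDown {
		return nil // the querier is going away: this connection should be torn down...
	}
	queue := q.findQueueWithPendingWork(querierID) // hypothetical helper
	if queue == nil {
		return nil // ...yet this "no work right now" case looks identical to the caller
	}
	return queue
}

Because the dispatcher loop could not tell the two cases apart, it treated both as "try again later" and left the connection in waitingQuerierConnections indefinitely.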

I believe this issue was present before #5880, but #5880 made it visible in the enqueue latency:

  • previously, the enqueue latency measurement only included the time taken to push the request to the tenant's queue channel
  • after Improve query-scheduler performance under load #5880, the enqueue latency measurement also includes the time taken to attempt to immediately dispatch the request to a querier, and so the time taken attempting to find work for these accumulated stale connections became visible in the enqueue latency

Which issue(s) this PR fixes or relates to

Fixes #6090

Checklist

  • Tests updated
  • [n/a] Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@charleskorn charleskorn force-pushed the charleskorn/fix-scheduler-accumulating-stale-connections branch from 4073871 to 36268ee on September 22, 2023 06:19
@charleskorn charleskorn marked this pull request as ready for review September 22, 2023 06:20
@charleskorn charleskorn requested a review from a team as a code owner September 22, 2023 06:20
@jhesketh (Contributor) left a comment:

lgtm

@pracucci (Collaborator) left a comment:

Thanks @charleskorn for working on this! To my understanding, the problem is that we never remove connections from waitingQuerierConnections when the querier disconnects from the query-scheduler.

If my understanding is correct, I have a couple of comments:

  1. I don't think it's just about "GetNextRequestForQuerier calls for this querier that nothing is coming" (all in all, GetNextRequestForQuerier() should get its context canceled as soon as the querier terminates, because of the gRPC context), but I think the problem is that we never clean up waitingQuerierConnections.
  2. I think we should enforce the cleanup even on unregisterConnection (which is more important than notifyShutdown, because the latter is just a best effort).
  3. Why did the issue exist even before? I think we didn't have this issue before. What am I missing?

@@ -151,6 +152,9 @@ func (q *RequestQueue) dispatcherLoop() {
case notifyShutdown:
queues.notifyQuerierShutdown(qe.querierID)
needToDispatchQueries = true

// Tell any waiting GetNextRequestForQuerier calls for this querier that nothing is coming.
q.cancelWaitingConnectionsForQuerier(qe.querierID, waitingQuerierConnections)
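Only the helper's name and arguments are visible in this hunk. As a sketch of the idea (using the standard container/list, which the scheduler's linked list suggests; the element type and error value are assumptions), it would walk the waiting connections and fail the ones belonging to the shutting-down querier:

// Sketch: proactively fail waiting GetNextRequestForQuerier calls for
// a querier that announced a graceful shutdown.
func (q *RequestQueue) cancelWaitingConnectionsForQuerier(querierID string, waiting *list.List) {
	for elem := waiting.Front(); elem != nil; {
		next := elem.Next() // capture before Remove, which detaches elem
		conn := elem.Value.(*waitingQuerierConnection) // assumed element type
		if conn.querierID == querierID {
			conn.sendError(ErrQuerierShuttingDown) // assumed error value
			waiting.Remove(elem)
		}
		elem = next
	}
}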
@pracucci (Collaborator):
The shutdown notification is a best effort. It's not guaranteed to be received (e.g. if the querier OOMs or the K8S node where it was running gets abruptly terminated).

@charleskorn (Author):
If the shutdown notification is not received, we'll try to send a request to the querier (in Scheduler.QuerierLoop()), get an error, call unregisterConnection and not call GetNextRequestForQuerier again - so this is OK.

@pracucci (Collaborator), Sep 25, 2023:

Just to be on the same page: what you're describing is the logic in this PR, not main, right? Because I can see that what you describe happens with the change proposed in this PR.

@charleskorn (Author):
Just to be on the same page: what you're describing is the logic in this PR, not main, right? Because I can see that what you describe happens with the change proposed in this PR.

The behaviour in the unclean shutdown case should be the same in both.

I find it a bit difficult to follow this logic. Wouldn't it be safer and clearer if we removed a querier from waitingQuerierConnections even when handling unregisterConnection?

unregisterConnection is only called by Scheduler.QuerierLoop(), and so isn't called concurrently with GetNextRequestForQuerier, so removing the pending connection from waitingQuerierConnections is not necessary, because it won't be there.

@pracucci (Collaborator):
I would improve the comment to say that this is just a nice-to-have: it's OK if we don't receive the shutdown notification, because waitingQuerierConnections will be cleaned up the next time we try to dispatch a request to the querier.

@dimitarvdimitrov (Contributor) commented:
I agree with @pracucci's comment. I think the user queues already handle queriers that have disconnected all their connections well.

It would be nice to have a single management structure for querier connections. Right now they are managed separately in the linked list and in the queues struct. Should we move them out into a separate structure, so that queues and dispatcherLoop are only clients of that structure? It might be easier to track changes to the linked list, querier map, and querier sorted list if they are closer together (perhaps we can have our own implementation of the linked list where each item belongs to two lists: a per-querier list and the global querier connections queue; see the sketch below). But I realize that may be more work than patching this bug.
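For illustration, a hypothetical sketch of that dual-list node (all identifiers invented, not code from this PR):

// Each node is threaded onto two lists at once: the global FIFO that
// dispatcherLoop scans, and a per-querier list, so that dropping a querier
// can unlink all of its waiting connections without scanning the global list.
type connNode struct {
	conn *waitingQuerierConnection // assumed type from the scheduler

	globalPrev, globalNext   *connNode // position in the global dispatch order
	querierPrev, querierNext *connNode // position among one querier's waiting calls
}

type querierConnections struct {
	globalHead, globalTail *connNode
	perQuerier             map[string]*connNode // querierID -> head of its list
}

queues and dispatcherLoop would then go through this one structure instead of each maintaining its own view.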

@charleskorn (Author) commented:

Thanks @charleskorn for working on this! To my understanding, the problem is that we never remove connections from waitingQuerierConnections when the querier disconnects from the query-scheduler.

waitingQuerierConnections is the list of waiting GetNextRequestForQuerier calls (this might need a better name). When a querier informs the scheduler that it is gracefully shutting down, we set querier.shuttingDown to true, but this causes us to get into a situation where we'll never remove the connection from waitingQuerierConnections because we never try to send a request to it.

These are the steps that get us into the bad state:

  1. The querier makes a call to the scheduler, and Scheduler.QuerierLoop() begins.
  2. Scheduler.QuerierLoop() calls RequestQueue.RegisterQuerierConnection().
  3. Scheduler.QuerierLoop() calls RequestQueue.GetNextRequestForQuerier(). Assume there are no pending queries for the querier, so the dispatcher loop does not immediately send a request back to GetNextRequestForQuerier().
  4. The querier begins to shut down, and calls Scheduler.NotifyQuerierShutdown(). This calls RequestQueue.NotifyQuerierShutdown(), which (through the dispatcher loop) calls queues.notifyQuerierShutdown and sets querier.shuttingDown to true.

Once we're in this state, when a query arrives:

  1. Scheduler.FrontendLoop() calls RequestQueue.EnqueueRequest()
  2. The dispatcher loop picks up the request, and tries to find a querier to perform the query
  3. The dispatcher loop calls RequestQueue.dispatchRequestToQuerier, which calls queues.getNextQueueForQuerier. Because querier.shuttingDown is true, this returns a nil queue, so we don't try to send anything to the querier - but we also don't remove the pending connection from waitingQuerierConnections, and we'll keep doing this until the scheduler process stops (see the sketch below).
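In sketch form, the pre-fix dispatch pass looked roughly like this (a paraphrase of the behaviour described above, not the literal code):

// Pre-fix sketch: one pass over the waiting connections when a request arrives.
for elem := waitingQuerierConnections.Front(); elem != nil; elem = elem.Next() {
	conn := elem.Value.(*waitingQuerierConnection)
	queue := queues.getNextQueueForQuerier(conn.querierID)
	if queue == nil {
		// Could mean "no work for this querier" or "querier shutting down";
		// either way the connection stays in the list and is re-scanned on
		// every future enqueue - this is the accumulation.
		continue
	}
	// ... send the next request from queue to conn and remove elem ...
}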

If my understanding is correct, I have a couple of comments:

  1. I don't think it's just about "GetNextRequestForQuerier calls for this querier that nothing is coming" (all in all, GetNextRequestForQuerier() should get its context canceled as soon as the querier terminates, because of the gRPC context), but I think the problem is that we never clean up waitingQuerierConnections.

My understanding is that the context won't be cancelled until we try to use the stream to send or receive a message and therefore realise the stream is broken, and that won't happen until the call to GetNextRequestForQuerier returns to Scheduler.QuerierLoop(). This is how it behaved in my testing.

  2. I think we should enforce the cleanup even on unregisterConnection (which is more important than notifyShutdown, because the latter is just a best effort).

UnregisterQuerierConnection is only called by Scheduler.QuerierLoop(), which means it won't be called while GetNextRequestForQuerier is running, and so can't help us here unfortunately.

  3. Why did the issue exist even before? I think we didn't have this issue before. What am I missing?

I think the previous code assumed the context would be cancelled if the connection was broken, but that's not the case.

@charleskorn (Author) commented:

I agree with @pracucci's comment. I think the user queues already handle queriers that have disconnected all their connections well.

I agree - the problem in this case is that forgetDisconnectedQueriers() will never remove the querier because UnregisterQuerierConnection() will never be called by QuerierLoop(), because it's blocked on GetNextRequestForQuerier().

@charleskorn (Author) commented:

It would be nice to have a single management structure for querier connections. Right now they are managed separately in the linked list and in the queues struct. Should we move them out into a separate structure, so that queues and dispatcherLoop are only clients of that structure? It might be easier to track changes to the linked list, querier map, and querier sorted list if they are closer together (perhaps we can have our own implementation of the linked list where each item belongs to two lists: a per-querier list and the global querier connections queue). But I realize that may be more work than patching this bug.

I agree we could simplify this a little. However, they represent slightly different things (and may need better names) - the waitingQuerierConnections linked list is used only for outstanding GetNextRequestForQuerier calls, whereas the queues struct represents all ongoing QuerierLoop calls for a single querier.

@pracucci (Collaborator) left a comment:

Ok, I think I now better understand the fix. I propose to improve a couple of comments. Thanks!

Comment on lines +239 to +242
if err != nil {
querierConn.sendError(err)
return true
}
@pracucci (Collaborator):
To my understanding this is the real fix. cancelWaitingConnectionsForQuerier() called by notifyShutdown is just a "nice to have" to gracefully handle shutting-down queriers, but what really fixes the issue in every case (which means even when the shutdown notification has not been received) is this one. I suggest adding a comment here to explain it, because I think it will not be obvious when we read this code again in the future.
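In sketch form, the shape of that real fix (the exact signature is assumed; the key point is that a definitive error now removes the waiting connection):

// Post-fix sketch: getNextQueueForQuerier reports "shutting down" (or
// "unknown querier") as an error instead of a bare nil.
func (q *RequestQueue) dispatchRequestToQuerier(qs *queues, querierConn *waitingQuerierConnection) bool {
	queue, err := qs.getNextQueueForQuerier(querierConn.querierID)
	if err != nil {
		// Fail the waiting GetNextRequestForQuerier call and report the
		// connection as handled, so the dispatcher loop drops it from
		// waitingQuerierConnections - even when no shutdown notification
		// was ever received.
		querierConn.sendError(err)
		return true
	}
	if queue == nil {
		return false // genuinely no work yet: keep the connection waiting
	}
	// ... try to send the next request from queue to querierConn ...
	return true
}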


@charleskorn charleskorn enabled auto-merge (squash) September 25, 2023 10:29
@dimitarvdimitrov (Contributor) commented:

My understanding is that the context won't be cancelled until we try to use the stream to send or receive a message and therefore realise the stream is broken, and that won't happen until the call to GetNextRequestForQuerier returns to Scheduler.QuerierLoop(). This is how it behaved in my testing.

Does this mean that this context cancellation tracking has no effect - here and here? Should we remove it?

@charleskorn charleskorn merged commit a6d8fea into main Sep 25, 2023
28 checks passed
@charleskorn charleskorn deleted the charleskorn/fix-scheduler-accumulating-stale-connections branch September 25, 2023 11:22
grafanabot pushed a commit that referenced this pull request Sep 25, 2023
…#6100)

* Fix issue where query-scheduler accumulates connections from queriers that have shut down.

* Add test to ensure GetNextRequestForQuerier still returns if racing with NotifyQuerierShutdown and NotifyQuerierShutdown executes first.

* Add changelog entry.

* Address PR feedback: add comments

* Clarify comment

(cherry picked from commit a6d8fea)
@charleskorn (Author) commented:

My understanding is that the context won't be cancelled until we try to use the stream to send or receive a message and therefore realise the stream is broken, and that won't happen until the call to GetNextRequestForQuerier returns to Scheduler.QuerierLoop(). This is how it behaved in my testing.

Does this mean that this context cancellation tracking has no effect - here and here? Should we remove it?

I don't know what I did when I was testing this last week (maybe I was checking the wrong context?), but what I said earlier is incorrect: the context is cancelled when the querier either gracefully ends the stream (e.g. due to a graceful shutdown) or is terminated abruptly (e.g. due to the querier crashing).

So my earlier statement that #6090 occurred before and after #5880 was also wrong - the implementation before #5880 was not susceptible to the same issue because it would always return from GetNextRequestForQuerier when the context was cancelled.

The issue in #6090 only occurs with the changes from #5880: we didn't remove pending GetNextRequestForQuerier calls from waitingQuerierConnections when the querier shut down gracefully. In any other situation where the context is cancelled, we'll eventually try to send work to the waiting GetNextRequestForQuerier call, realise that its context is cancelled, and remove it from waitingQuerierConnections (see the sketch below).
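That lazy detection could look like the following sketch (the channel and context field names are assumptions):

// Sketch: cancellation is only observed at dispatch time - there is no
// dedicated goroutine watching each waiting connection's context.
func trySend(conn *waitingQuerierConnection, req Request) (delivered bool) {
	select {
	case conn.recv <- req: // assumed channel the waiting call blocks on
		return true
	case <-conn.ctx.Done():
		// A querier that vanished without a shutdown notification is only
		// noticed here, the next time we try to hand it work; the caller
		// then removes the connection from waitingQuerierConnections.
		return false
	}
}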

We could be more proactive about detecting context cancellation earlier, but I don't think this is worthwhile as it would only complicate things further - we'd now have two places checking for context cancellation. Open to feedback on this if you disagree though @dimitarvdimitrov @pracucci.

The fix in this PR still resolves the issue, so I don't believe any further work is required.

@pracucci (Collaborator) commented:

Thanks @charleskorn for the detailed explanation. I'm happy to hear the old version didn't suffer from this issue. I'm fine keeping the current fix as is, without further complicating the logic to handle the cancellation earlier.

@dimitarvdimitrov (Contributor) commented:

I think it's OK to only detect a cancelled querier connection when we attempt to send to it, since the overhead of retrying the query on another connection shouldn't be high, and because we should fairly quickly discover that the querier connection is cancelled (within seconds under reasonable load?).

But your response prompted another question: since the context is cancelled promptly, do we need the extra cleanup in cancelWaitingConnectionsForQuerier? For all pending querier connections in waitingQuerierConnections, we'd eventually hit this condition and then remove the connection from waitingQuerierConnections.

@pracucci (Collaborator) commented:

But your response prompted another question: since the context is cancelled promptly, do we need the extra cleanup in cancelWaitingConnectionsForQuerier? For all pending querier connections in waitingQuerierConnections, we'd eventually hit this condition and then remove the connection from waitingQuerierConnections.

Agree as well. I'm wondering if we really need that extra logic, which was the point of a comment I made on this PR - it was rejected because, at that point, we thought context cancellation wasn't propagated.

@charleskorn (Author) commented:

Good point, I've removed that in #6145.

Successfully merging this pull request may close these issues:

Query-scheduler performance issue after #5880