Idle "write" client keeps submitting summaries (Nack: nonexistent client) #8483
Copying over the description of the problem from the above-mentioned PR: The core of the problem is that the relay service sends a leave op for a "write" connection after 5 minutes of inactivity. This change constrains the visible changes to the DM layer, making other layers believe that we are still working with a "write" connection (see items below for more details). It's worth raising that we can go various ways about it, roughly in priority order:
I think we should do option #2. There is already code that reconnects ("read" -> "write") to avoid nacks (i.e., previously we would send an op on a "read" connection and get a nack from the server; now we detect this situation and reconnect as "write", avoiding the extra latency of sending an op and getting a nack). We can likely reuse the existing flow (or add to it) to move in the opposite direction ("write" -> "read").
So, glancing through the code and telemetry, I think there is some mismatch that I want us to get to the bottom of.
First of all, looking through the code, I'd say the goal here is not to fully get rid of these nacks - that's impossible due to a possible race condition. I.e., the server may have issued a leave op for this client, but the client did not have time to process it yet. So the client sends an op and eventually gets the leave op, followed by a nack. But the goal should be understanding why we have so many nacks given the above (the code actively tries to avoid them).

I'd suggest starting by observing any sample (using the npm run start:spo-df flow). Let it sit idle for 5 minutes with no ops - that should trigger this condition (a downgraded connection). Observe what happens when a new op is sent: do we go through the nack flow or not? Does the summarizer hit a nack or not?

In addition, we can glance at any session that hit it and check whether we saw a preceding ReadConnectionTransition event. Understanding this area will help us figure out what to do next. Maybe nothing, maybe we discover something to fix :)
Here is a way to find sessions that hit a nack but did not hit ReadConnectionTransition: query Office_Fluid_FluidRuntime_Performance. Briefly looking at one such session, it's possible that these are the race conditions I mentioned.
Here is another problem that can be observed by opening a file, making a single edit, and letting it sit there - summaries will keep flowing. I think what happens is that PUSH removes the client from the quorum if the client is not active, and that causes the same client to generate an op to re-select itself as leader, as the client believes we are still connected on a "write" connection.
@chensixx, I'd suggest keeping the existing invariants (and thus not making the system more complex) and disconnecting on receipt of such an op. Search for the code next to `this.downgradedConnection`. I'd move the `this.reconnect()` call from where it is today to the place where we set `this.downgradedConnection` to true, and remove `this.downgradedConnection` altogether.
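A before/after sketch of that suggestion, under the assumption that the downgrade is detected in a single handler. The surrounding class shape is invented for illustration; only `downgradedConnection` and `reconnect` are names from the discussion above.

```typescript
// Hypothetical sketch of the suggested change: instead of recording a
// downgrade in a flag (and reconnecting somewhere else later), reconnect
// immediately at the point where the downgrade is detected, so every layer
// observes a consistent connection state.
class DeltaManagerSketch {
  public reconnected = false;

  // Handler invoked when the relay service sends a leave op for this client.
  onSelfLeaveOp(): void {
    // Before: this.downgradedConnection = true; (reconnect happened elsewhere)
    // After:  reconnect right here; the flag is no longer needed.
    this.reconnect();
  }

  private reconnect(): void {
    // In the real system this would tear down and re-establish the socket
    // connection; here we just record that it happened.
    this.reconnected = true;
  }
}
```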
Update: I do not think the summarizer spending extra CPU cycles is important here, as in most cases (where the summarizer is pushed from the quorum) there would be either no ops at all, or another client would be elected summarizer, so this summarizer would exit.
The reason this issue is important is that it helps with #7137 and makes it a reality that any nack becomes a critical error.
Update #2
This issue results in non-stop summaries flowing from clients.
AgentScheduler installs a "removeMember" handler that reacts to a leave op for a client by attempting to assign the task to itself.
In a sample I caught, it was the "leader" task, and the clientId was self.
I think what happens is that PUSH removes the client from the quorum if the client is not active, and that causes the same client to generate an op to re-select itself as leader, as the client believes we are still connected on a "write" connection.
This causes an upgrade to a "write" connection, and given that a number of ops were generated, a summary is generated as well. The cycle repeats and just keeps going non-stop.
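The cycle above can be reduced to a toy simulation showing why it never terminates on its own. All names here are illustrative, not real Fluid Framework identifiers.

```typescript
// Toy model of the feedback loop: idle client is evicted from the quorum,
// the removeMember handler volunteers for the "leader" task, the resulting
// op upgrades the connection to "write", and the new ops trigger a summary.
class IdleClientLoop {
  public summaries = 0;
  private inQuorum = true;

  // One idle period of the cycle.
  tick(): void {
    if (!this.inQuorum) return;     // already evicted; nothing to do
    this.inQuorum = false;          // PUSH removes the inactive client
    this.volunteerForLeader();      // AgentScheduler reacts to removeMember
  }

  private volunteerForLeader(): void {
    this.inQuorum = true;           // the op forces an upgrade back to "write"
    this.summaries++;               // ops in the window produce a summary
  }
}
```

Each tick ends with the client back in the quorum, so the next idle period restarts the cycle; summaries accumulate without bound.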
The key problem: connection properties (like "read" vs. "write", permissions, read-only mode, etc.) are assumed to never change. This invariant is broken, and thus it causes problems in various places.
Old description:
Per the description from PR #8468, and the earlier issue #7753, it would be great to find a solution where the summarizer does not spend cycles creating a summary after its socket connection has been idle for 5 minutes.
In such cases the relay service has already sent a leave op for this client, and thus the summarizer client can't send a summarize op.
But it does not know about it: in the current state of the world (after the above-mentioned PR undid the previous attempt to optimize things), there is no visible indication to the runtime that the downgrade happened.
I think the ideal solution here is option #2 from the above-mentioned PR: actually go through a physical disconnect and reconnect when receiving the leave op. It's expensive (for FRS; less expensive for ODSP, which reuses the socket), so we need to weigh the pros & cons of this approach.
Note that the issue is formulated in terms of impact on the summarizer (and as a result, users & COGS), but it's broader than that.
Punt is an option, of course.
@GaryWilber, @tanviraumi, @pleath - FYI.