Idle "write" client keeps submitting summaries (Nack: nonexistent client) #8483

Closed · vladsud opened this issue Dec 7, 2021 · 6 comments
Labels: bug (Something isn't working), perf

vladsud commented Dec 7, 2021

Update: I do not think the summarizer spending extra CPU cycles is the important part here, since in most cases (where the summarizer is pushed out of the quorum) there would either be no ops at all, or another client would be elected summarizer, so this summarizer would move to exit.
The reason this issue is important is that it helps with #7137, making it a reality that any nack becomes a critical error.

Update #2
This issue results in non-stop summaries flowing from clients.
AgentScheduler installs a "removeMember" handler that reacts to a leave op for a client by attempting to assign the task to itself.
In a sample I caught, it was the "leader" task, and the clientId was the client's own.

I think what happens is: PUSH removes the client from the quorum if the client is not active, and that causes the same client to generate an op to re-elect itself as leader, because the client believes it is still connected with a "write" connection.
That op forces an upgrade to a "write" connection, and since a number of ops were generated, a summary is generated as well. The cycle repeats and just keeps going non-stop.
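
A rough sketch of that loop, using illustrative stand-ins (the names quorum, runtime, and volunteerForTask below are not the actual Fluid / AgentScheduler API):

```typescript
import { EventEmitter } from "events";

// Illustrative stand-ins; these are not the real Fluid / AgentScheduler APIs.
const quorum = new EventEmitter();          // emits "removeMember" when a leave op is processed
const runtime = { clientId: "client-A" };   // this client's own id
const volunteerForTask = (task: string) =>  // submits an op proposing ourselves for the task
    console.log(`submitting op to take task "${task}"`);

quorum.on("removeMember", (removedClientId: string) => {
    if (removedClientId === runtime.clientId) {
        // PUSH removed this idle client from the quorum, but the client still
        // believes it holds a "write" connection, so it submits an op to
        // re-volunteer for the "leader" task...
        volunteerForTask("leader");
        // ...that op forces an upgrade back to "write", ops accumulate, a
        // summary is produced, the client goes idle again, PUSH removes it
        // again, and the cycle repeats.
    }
});
```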

The key problem: connection properties (like "read" vs. "write", permissions, read-only mode, etc.) are assumed to never change. This invariant is broken, and thus it causes problems in various places.

Old description:
Per the description in PR #8468 and the earlier issue #7753, it would be great to find a solution where the summarizer does not spend cycles creating a summary after its socket connection has been idle for 5 minutes.
In such cases the relay service has already sent a leave op for this client, and thus the summarizer client can't send the summarize op.
But it does not know about it: in the current state of the world (after the above-mentioned PR undoes the previous attempt to optimize things), there is no visible indication to the runtime that the downgrade happened.

I think the ideal solution here is option #2 from the above-mentioned PR: actually go through a physical disconnect and reconnect when receiving the leave op. It's expensive (for FRS; less so for ODSP, which reuses the socket), so we need to weigh the pros and cons of this approach.

Note that the issue is formulated in terms of impact on the summarizer (and as a result, on the user & COGS), but it's broader than that.

Punt is an option, of course.

@GaryWilber, @tanviraumi, @pleath - FYI.

vladsud added the bug label Dec 7, 2021
vladsud self-assigned this Dec 7, 2021

vladsud commented Dec 7, 2021

Copying over the description of the problem from the above-mentioned PR:

The core of the problem: the relay service sends a leave op for a "write" connection after 5 minutes of inactivity.
This leaves the connection in a weird state - most of the system believes it's a "write" connection, but you can't send ops on that connection. The above-mentioned PR attempted (among other things) to remove this discrepancy for the loader layer, but I missed that the runtime layer does not expect a change in the properties of a connection.

This change constrains the visible changes to the DM layer, making other layers believe that we still have a "write" connection (see the items below for more details).

It's worth noting that we can go about this in various ways, roughly in priority order:

  1. Best option IMHO - the service disconnects the "write" connection after 5 min of inactivity. Sending a leave op and keeping the connection alive is weird for the client, as some layers do not care about such an op (DM itself, for the most part), while others expect the properties of a connection not to change over its lifetime.
  2. Same as above, but the client does it itself (and reconnects as "read") on receiving its own leave op (see the sketch after this list).
    • I favor this as the long-term direction at the moment (i.e., I think we need to implement it; this change is more of a stop-gap solution).
    • It has a rather big con for the R11S stack, which does not have the (ODSP) socket-reuse logic, though.
  3. Same as above, but the disconnect / reconnect is "virtual", i.e. we keep using the same connection with the same clientId, but expose it to other layers (runtime) as a disconnect from the "write" connection and a reconnect to the same connection (same clientId) but as "read".
    • I'm not sure I like it, as it breaks the fundamental guarantee that every new connection gets a unique clientId.
  4. This approach - the client keeps the connection as if it's a "write" connection, but DM internally knows it's a "read" connection and will go through the reconnection logic if any op is sent by the client.
    • The pro of this change over the next one - we never see weird nacks like "nonexistent client". The con - the summarizer keeps going on such a connection without realizing it.
  5. Undo the above-mentioned PR altogether. Same con as above (plus weird nack/disconnect reasons).
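
A minimal sketch of option 2, assuming a DeltaManager-like connection object; the interface, method, and event names below are hypothetical, not the real loader API:

```typescript
// Hypothetical shape of the connection layer, for illustration only.
interface ConnectionManager {
    readonly clientId: string | undefined;
    disconnect(reason: string): void;
    connect(mode: "read" | "write"): void;
    on(event: "clientLeave", listener: (leftClientId: string) => void): void;
}

// Option 2: the client reacts to its own leave op by going through a real
// disconnect and reconnecting as "read".
function handleServiceDowngrade(cm: ConnectionManager): void {
    cm.on("clientLeave", (leftClientId) => {
        if (leftClientId === cm.clientId) {
            // The service already considers this "write" connection gone, so
            // surface a real disconnect instead of keeping a half-dead one.
            cm.disconnect("idle write connection downgraded by service");
            cm.connect("read");
        }
    });
}
```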

vladsud added a commit that referenced this issue Dec 7, 2021
Closes issue #8398 
Partial undo of PR #7753
Follow up / potential next steps - issue #8483
vladsud added this to the February 2022 milestone Dec 13, 2021
vladsud modified the milestones: February 2022, March 2022 Jan 30, 2022
vladsud assigned jatgarg and unassigned vladsud Jan 30, 2022

vladsud commented Jan 30, 2022

I think we should do option #2. There is already code that reconnects ("read" -> "write") to avoid nacks (i.e., previously we would send an op on a "read" connection and get a nack from the server; now we detect this situation and reconnect as "write", avoiding the extra latency of sending an op and getting a nack). We can likely reuse the existing flow (or add to it) to move in the opposite direction ("write" -> "read").
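
Roughly, the idea as I understand it; the names below are illustrative stand-ins, not the actual container/DeltaManager members:

```typescript
type ConnectionMode = "read" | "write";

// Sketch of extending the existing auto-reconnect logic in both directions.
function ensureConnectionMode(
    currentMode: ConnectionMode,
    downgradedByService: boolean,       // we have processed our own leave op
    hasOutboundOps: boolean,            // the client wants to send ops
    reconnect: (mode: ConnectionMode) => void,
): void {
    if (currentMode === "read" && hasOutboundOps) {
        // Existing flow: upgrade before submitting, instead of sending the op,
        // getting a nack from the server, and only then reconnecting.
        reconnect("write");
    } else if (currentMode === "write" && downgradedByService && !hasOutboundOps) {
        // Proposed opposite direction: the service already sent our leave op,
        // so drop back to "read" rather than pretend the "write" connection
        // is still valid.
        reconnect("read");
    }
}
```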

vladsud changed the title from "Summarizer wastes CPU and bandwidth generating summary on downgraded idle connection" to "Nack: nonexistent client" Feb 17, 2022
vladsud assigned chensixx and unassigned jatgarg Feb 17, 2022

vladsud commented Feb 17, 2022

So, glancing through the code and telemetry, I think there is some mismatch that I want us to get to the bottom of.
First, here is how to observe it:

Office_Fluid_FluidRuntime_Performance
| where Data_eventName contains "_Disconnect"
| where Data_loaderVersion contains "0.54"
| where Data_reason contains "Nack"
| summarize count() by Data_reason

First of all, looking through the code, and specifically tracking downgradedConnection, I'd think we should see no nacks, or a rather small number of nacks here (read on). But the counts are rather high (1-2% of all disconnect reasons, depending on what we count as a disconnect). Basically, once we get a leave op for ourselves, we should issue a ReadConnectionTransition event, and on any op submitted, we should reconnect with the "Switch to write" disconnect reason, i.e. avoid the nack flow altogether.
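
In pseudocode, the flow I'd expect; the class structure below is a sketch, and only the downgradedConnection flag and the event/reason strings come from the actual code and telemetry:

```typescript
// Illustrative model of the expected nack-avoidance behavior; not the real
// DeltaManager implementation.
class DowngradeTracker {
    private downgradedConnection = false;

    // Called when we process a leave op for our own clientId.
    public onSelfLeaveOp(log: (eventName: string) => void): void {
        this.downgradedConnection = true;
        log("ReadConnectionTransition");
    }

    // Called when the runtime tries to submit an op.
    public onSubmit(reconnect: (reason: string) => void): void {
        if (this.downgradedConnection) {
            // Reconnect before the op goes out, so the server never receives
            // an op from a client it has already removed - hence no nack.
            reconnect("Switch to write");
        }
    }
}
```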

I'd say the goal here is not to fully get rid of these nacks - that's impossible due to a possible race condition. I.e., the server may have issued a leave op for this client, but the client did not have time to process it. So the client sends an op, and eventually gets the leave op, followed by a nack.

But the goal should be understanding why we have so many nacks, given the above (the code actively tries to avoid them).
And partially - why the summarizer is hit the most by this (which brings the original motivation / title into question: does the summarizer spend resources to produce a summary, only to find out itself that it is not a "write" client?).

I'd suggest starting by observing any sample (using the npm run start:spo-df flow). Let it sit idle for 5 min with no ops. That should trigger this condition (downgraded connection). Observe what happens when a new op is sent - do we go through the nack flow or not? Does the summarizer hit a nack or not?

In addition to that, we can glance at any session that hit it and observe - did we see a preceding ReadConnectionTransition event?

Understanding this area will help us figure out what to do next. Maybe nothing; maybe we discover something to fix :)


vladsud commented Feb 17, 2022

Here is a way to find sessions that hit a nack but did not hit ReadConnectionTransition:

Office_Fluid_FluidRuntime_Performance
| where Event_Time > ago(1d)
| where Data_eventName contains "ReadConnectionTransition" or (Data_eventName contains "Disconnect" and Data_reason contains "nonexistent")
| where Data_loaderVersion contains "0.53" or Data_loaderVersion contains "0.54" or Data_loaderVersion contains "0.55"
| summarize countif(Data_eventName contains "ReadConnectionTransition"), countif(Data_eventName notcontains "ReadConnectionTransition") by Data_containerId
| where countif_1 > 0 and countif == 0
| project Data_containerId

Briefly looking at one session - it's possible that these are the race conditions I mentioned.
But I doubt it - even with a race, I'd expect the leave op in most cases to be processed by the client (and a ReadConnectionTransition event issued) right before the nack. It's hard to see the race playing out differently.

vladsud modified the milestones: March 2022, April 2022 Mar 5, 2022

vladsud commented Mar 15, 2022

Here is another problem that can be observed by opening a file, making a single edit, and letting it sit there - summaries will keep flowing.
What I see in the debugger: AgentScheduler installs a "removeMember" handler that reacts to a leave op for a client by attempting to assign the task to itself. In the sample I caught, it was the "leader" task, and the clientId was the client's own.

I think what happens is: PUSH removes the client from the quorum if the client is not active, and that causes the same client to generate an op to re-elect itself as leader, because the client believes it is still connected with a "write" connection.

vladsud added the perf label Mar 28, 2022
vladsud changed the title from "Nack: nonexistent client" to "Idle "write" client keeps submitting summaries (Nack: nonexistent client)" Mar 28, 2022

vladsud commented Mar 28, 2022

@chensixx, I'd suggest keeping the existing invariants (and thus not making the system more complex) and disconnecting on receipt of such an op.

Search for the code next to this.downgradedConnection. I'd move the this.reconnect() call from where it is today to the place where we set this.downgradedConnection to true, and remove this.downgradedConnection altogether.
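
A minimal sketch of that change, assuming the current shape of the code (set the flag on our own leave op, reconnect only when an op is later submitted); everything other than downgradedConnection and reconnect() is an illustrative stand-in:

```typescript
// Illustrative stand-in for the relevant piece of DeltaManager.
class DeltaManagerSketch {
    public reconnect(): void { /* go through a real disconnect / reconnect */ }

    // Called when we process a leave op for our own clientId.
    public onSelfLeaveOp(): void {
        // Before: this.downgradedConnection = true; and reconnect() only ran
        // later, when an op was submitted on the downgraded connection.
        // After: reconnect right away, so connection properties never change
        // silently underneath the runtime; downgradedConnection goes away.
        this.reconnect();
    }
}
```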
