Idle "write" client keeps submitting summaries (Nack: nonexistent client) #8483

Closed · vladsud opened this issue Dec 7, 2021 · 6 comments
Labels: bug (Something isn't working), perf

vladsud commented Dec 7, 2021

Update: I do not think the summarizer spending extra CPU cycles is the important part here, since in most cases (where the summarizer is pushed out of the quorum) there would either be no ops at all, or another client would be elected summarizer, so this summarizer would move to exit.
The reason this issue is important is that it helps with #7137, making it a reality that any nack becomes a critical error.

Update #2
This issue results in non-stop summaries flowing from clients.
AgentScheduler installs a "removeMember" handler that reacts to a leave op for a client by attempting to assign the task to itself.
In a sample I caught, it was the "leader" task, and the clientId was the client's own.

I think what happens is: PUSH removes the client from the quorum if the client is not active, and that causes the same client to generate an op to re-elect itself as leader, because the client believes it is still connected with a "write" connection.
That op forces an upgrade to a "write" connection, and since a number of ops were generated, a summary is generated as well. The cycle repeats and just keeps going non-stop.
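
A rough sketch of that loop, using illustrative stand-ins (the names quorum, runtime, and volunteerForTask below are not the actual Fluid / AgentScheduler API):

```typescript
import { EventEmitter } from "events";

// Illustrative stand-ins; these are not the real Fluid / AgentScheduler APIs.
const quorum = new EventEmitter();          // emits "removeMember" when a leave op is processed
const runtime = { clientId: "client-A" };   // this client's own id
const volunteerForTask = (task: string) =>  // submits an op proposing ourselves for the task
    console.log(`submitting op to take task "${task}"`);

quorum.on("removeMember", (removedClientId: string) => {
    if (removedClientId === runtime.clientId) {
        // PUSH removed this idle client from the quorum, but the client still
        // believes it holds a "write" connection, so it submits an op to
        // re-volunteer for the "leader" task...
        volunteerForTask("leader");
        // ...that op forces an upgrade back to "write", ops accumulate, a
        // summary is produced, the client goes idle again, PUSH removes it
        // again, and the cycle repeats.
    }
});
```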

The key problem: connection properties (like "read" vs. "write", permissions, read-only mode, etc.) are assumed to never change. This invariant is broken, and thus it causes problems in various places.

Old description:
Per the description in PR #8468 and the earlier issue #7753, it would be great to find a solution where the summarizer does not spend cycles creating a summary after its socket connection has been idle for 5 minutes.
In such cases the relay service has already sent a leave op for this client, and thus the summarizer client can't send the summarize op.
But it does not know about it: in the current state of the world (after the above-mentioned PR undoes the previous attempt to optimize things), there is no visible indication to the runtime that the downgrade happened.

I think the ideal solution here is option #2 from the above-mentioned PR: actually go through a physical disconnect and reconnect when receiving the leave op. It's expensive (for FRS; less so for ODSP, which reuses the socket), so we need to weigh the pros and cons of this approach.

Note that the issue is formulated in terms of impact on the summarizer (and as a result, on the user & COGS), but it's broader than that.

Punt is an option, of course.

@GaryWilber, @tanviraumi, @pleath - FYI.

vladsud added the bug label Dec 7, 2021
vladsud self-assigned this Dec 7, 2021

vladsud commented Dec 7, 2021

Copying over the description of the problem from the above-mentioned PR:

The core of the problem: the relay service sends a leave op for a "write" connection after 5 minutes of inactivity.
This leaves the connection in a weird state - most of the system believes it's a "write" connection, but you can't send ops on that connection. The above-mentioned PR attempted (among other things) to remove this discrepancy for the loader layer, but I missed that the runtime layer does not expect a change in the properties of a connection.

This change constrains the visible changes to the DM layer, making other layers believe that we still have a "write" connection (see the items below for more details).

It's worth noting that we can go about this in various ways, roughly in priority order:

  1. Best option IMHO - the service disconnects the "write" connection after 5 min of inactivity. Sending a leave op and keeping the connection alive is weird for the client, as some layers do not care about such an op (DM itself, for the most part), while others expect the properties of a connection not to change over its lifetime.
  2. Same as above, but the client does it itself (and reconnects as "read") on receiving its own leave op (see the sketch after this list).
    • I favor this as the long-term direction at the moment (i.e., I think we need to implement it; this change is more of a stop-gap solution).
    • It has a rather big con for the R11S stack, which does not have the (ODSP) socket-reuse logic, though.
  3. Same as above, but the disconnect / reconnect is "virtual", i.e. we keep using the same connection with the same clientId, but expose it to other layers (runtime) as a disconnect from the "write" connection and a reconnect to the same connection (same clientId) but as "read".
    • I'm not sure I like it, as it breaks the fundamental guarantee that every new connection gets a unique clientId.
  4. This approach - the client keeps the connection as if it's a "write" connection, but DM internally knows it's a "read" connection and will go through the reconnection logic if any op is sent by the client.
    • The pro of this change over the next one - we never see weird nacks like "nonexistent client". The con - the summarizer keeps going on such a connection without realizing it.
  5. Undo the above-mentioned PR altogether. Same con as above (plus weird nack/disconnect reasons).
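
A minimal sketch of option 2, assuming a DeltaManager-like connection object; the interface, method, and event names below are hypothetical, not the real loader API:

```typescript
// Hypothetical shape of the connection layer, for illustration only.
interface ConnectionManager {
    readonly clientId: string | undefined;
    disconnect(reason: string): void;
    connect(mode: "read" | "write"): void;
    on(event: "clientLeave", listener: (leftClientId: string) => void): void;
}

// Option 2: the client reacts to its own leave op by going through a real
// disconnect and reconnecting as "read".
function handleServiceDowngrade(cm: ConnectionManager): void {
    cm.on("clientLeave", (leftClientId) => {
        if (leftClientId === cm.clientId) {
            // The service already considers this "write" connection gone, so
            // surface a real disconnect instead of keeping a half-dead one.
            cm.disconnect("idle write connection downgraded by service");
            cm.connect("read");
        }
    });
}
```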

vladsud added a commit that referenced this issue Dec 7, 2021
Closes issue #8398 
Partial undo of PR #7753
Follow up / potential next steps - issue #8483
vladsud added this to the February 2022 milestone Dec 13, 2021
vladsud modified the milestones: February 2022, March 2022 Jan 30, 2022
vladsud assigned jatgarg and unassigned vladsud Jan 30, 2022

vladsud commented Jan 30, 2022

I think we should do option #2. There is already code that reconnects ("read" -> "write") to avoid nacks (i.e., previously we would send an op on a "read" connection and get a nack from the server; now we detect this situation and reconnect as "write", avoiding the extra latency of sending an op and getting a nack). We can likely reuse the existing flow (or add to it) to move in the opposite direction ("write" -> "read").
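
Roughly, the idea as I understand it; the names below are illustrative stand-ins, not the actual container/DeltaManager members:

```typescript
type ConnectionMode = "read" | "write";

// Sketch of extending the existing auto-reconnect logic in both directions.
function ensureConnectionMode(
    currentMode: ConnectionMode,
    downgradedByService: boolean,       // we have processed our own leave op
    hasOutboundOps: boolean,            // the client wants to send ops
    reconnect: (mode: ConnectionMode) => void,
): void {
    if (currentMode === "read" && hasOutboundOps) {
        // Existing flow: upgrade before submitting, instead of sending the op,
        // getting a nack from the server, and only then reconnecting.
        reconnect("write");
    } else if (currentMode === "write" && downgradedByService && !hasOutboundOps) {
        // Proposed opposite direction: the service already sent our leave op,
        // so drop back to "read" rather than pretend the "write" connection
        // is still valid.
        reconnect("read");
    }
}
```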

vladsud changed the title from "Summarizer wastes CPU and bandwidth generating summary on downgraded idle connection" to "Nack: nonexistent client" Feb 17, 2022
vladsud assigned chensixx and unassigned jatgarg Feb 17, 2022

vladsud commented Feb 17, 2022

So, glancing through the code and telemetry, I think there is some mismatch that I want us to get to the bottom of.
First, here is how to observe it:

Office_Fluid_FluidRuntime_Performance
| where Data_eventName contains "_Disconnect"
| where Data_loaderVersion contains "0.54"
| where Data_reason contains "Nack"
| summarize count() by Data_reason

First of all, looking through the code, and specifically tracking downgradedConnection, I'd think we should see no nacks, or a rather small number of nacks here (read on). But the counts are rather high (1-2% of all disconnect reasons, depending on what we count as a disconnect). Basically, once we get a leave op for ourselves, we should issue a ReadConnectionTransition event, and on any op submitted, we should reconnect with the "Switch to write" disconnect reason, i.e. avoid the nack flow altogether.
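
In pseudocode, the flow I'd expect; the class structure below is a sketch, and only the downgradedConnection flag and the event/reason strings come from the actual code and telemetry:

```typescript
// Illustrative model of the expected nack-avoidance behavior; not the real
// DeltaManager implementation.
class DowngradeTracker {
    private downgradedConnection = false;

    // Called when we process a leave op for our own clientId.
    public onSelfLeaveOp(log: (eventName: string) => void): void {
        this.downgradedConnection = true;
        log("ReadConnectionTransition");
    }

    // Called when the runtime tries to submit an op.
    public onSubmit(reconnect: (reason: string) => void): void {
        if (this.downgradedConnection) {
            // Reconnect before the op goes out, so the server never receives
            // an op from a client it has already removed - hence no nack.
            reconnect("Switch to write");
        }
    }
}
```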

I'd say the goal here is not to fully get rid of these nacks - that's impossible due to a possible race condition. I.e., the server may have issued a leave op for this client, but the client did not have time to process it. So the client sends an op, and eventually gets the leave op, followed by a nack.

But the goal should be understanding why we have so many nacks, given the above (the code actively tries to avoid them).
And partially - why the summarizer is hit the most by this (which brings the original motivation / title into question: does the summarizer spend resources to produce a summary, only to find out itself that it is not a "write" client?).

I'd suggest starting by observing any sample (using the npm run start:spo-df flow). Let it sit idle for 5 min with no ops. That should trigger this condition (downgraded connection). Observe what happens when a new op is sent - do we go through the nack flow or not? Does the summarizer hit a nack or not?

In addition to that, we can glance at any session that hit it and observe - did we see a preceding ReadConnectionTransition event?

Understanding this area will help us figure out what to do next. Maybe nothing; maybe we discover something to fix :)


vladsud commented Feb 17, 2022

Here is a way to find sessions that hit a nack but did not hit ReadConnectionTransition:

Office_Fluid_FluidRuntime_Performance
| where Event_Time > ago(1d)
| where Data_eventName contains "ReadConnectionTransition" or (Data_eventName contains "Disconnect" and Data_reason contains "nonexistent")
| where Data_loaderVersion contains "0.53" or Data_loaderVersion contains "0.54" or Data_loaderVersion contains "0.55"
| summarize countif(Data_eventName contains "ReadConnectionTransition"), countif(Data_eventName notcontains "ReadConnectionTransition") by Data_containerId
| where countif_1 > 0 and countif == 0
| project Data_containerId

Briefly looking at one session - it's possible that these are the race conditions I mentioned.
But I doubt it - even with a race, I'd expect the leave op in most cases to be processed by the client (and a ReadConnectionTransition event issued) right before the nack. It's hard to see the race playing out differently.

vladsud modified the milestones: March 2022, April 2022 Mar 5, 2022

vladsud commented Mar 15, 2022

Here is another problem that can be observed by opening a file, making a single edit, and letting it sit there - summaries will keep flowing.
What I see in the debugger: AgentScheduler installs a "removeMember" handler that reacts to a leave op for a client by attempting to assign the task to itself. In the sample I caught, it was the "leader" task, and the clientId was the client's own.

I think what happens is: PUSH removes the client from the quorum if the client is not active, and that causes the same client to generate an op to re-elect itself as leader, because the client believes it is still connected with a "write" connection.

vladsud added the perf label Mar 28, 2022
vladsud changed the title from "Nack: nonexistent client" to "Idle "write" client keeps submitting summaries (Nack: nonexistent client)" Mar 28, 2022

vladsud commented Mar 28, 2022

@chensixx, I'd suggest keeping the existing invariants (and thus not making the system more complex) and disconnecting on receipt of such an op.

Search for the code next to this.downgradedConnection. I'd move the this.reconnect() call from where it is today to the place where we set this.downgradedConnection to true, and remove this.downgradedConnection altogether.
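
A minimal sketch of that change, assuming the current shape of the code (set the flag on our own leave op, reconnect only when an op is later submitted); everything other than downgradedConnection and reconnect() is an illustrative stand-in:

```typescript
// Illustrative stand-in for the relevant piece of DeltaManager.
class DeltaManagerSketch {
    public reconnect(): void { /* go through a real disconnect / reconnect */ }

    // Called when we process a leave op for our own clientId.
    public onSelfLeaveOp(): void {
        // Before: this.downgradedConnection = true; and reconnect() only ran
        // later, when an op was submitted on the downgraded connection.
        // After: reconnect right away, so connection properties never change
        // silently underneath the runtime; downgradedConnection goes away.
        this.reconnect();
    }
}
```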
