
large initialsync may never complete #11611

Open
richvdh opened this issue Dec 20, 2021 · 1 comment
Labels
A-Sync (defects related to /sync) · S-Major (Major functionality / product severely impaired, no satisfactory workaround.) · T-Defect (Bugs, crashes, hangs, security vulnerabilities, or other reported issues.)


richvdh commented Dec 20, 2021

For a sufficiently large account, it's possible that we will never be able to serialise the response before the client times out. This has recently been made a lot worse by #11478 which makes the serialisation process much slower, but it was a potential problem even before that.

The intention is that it shouldn't matter too much if it takes a long time to generate a sync response, because once we finally manage it, we will cache the response, which we can then serve up quickly the next time the client makes a request. However, this oversimplifies the situation.

We can basically break serving a /sync response into four steps:

  1. Figure out which rooms and events should be in the response. The end result of this step is a SyncResult object, which is cached.
  2. "Serialise" each event in the response. This is where we turn the events from the internal python objects (FrozenEvent) into plain JSON dicts which match the format specified by the C-S API; in particular we add the unsigned.age field (which is the main reason this is done after caching); as of Include bundled aggregations in /sync and related fixes #11478 it is also the point at which we add bundled aggregations. The result of this step is a (potentially huge) JsonDict (ie, a dict containing other simple json-serializable values)
  3. JSON-serialise the response dict to a bytes.
  4. Finally, we start streaming the response to the client.
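
To make the shape of this pipeline concrete, here is a heavily simplified sketch in Python. The helpers compute_sync_result and serialize_event are illustrative stand-ins rather than Synapse's actual code paths, but the division of labour follows the four steps above:

```python
# Illustrative sketch only -- names are simplified stand-ins, not Synapse internals.
import json
import time

def serve_sync(request, sync_config):
    # Step 1: work out which rooms and events belong in the response.
    # The resulting SyncResult object is what gets cached.
    sync_result = compute_sync_result(sync_config)  # hypothetical helper

    # Step 2: turn each internal FrozenEvent into a plain C-S API dict.
    # unsigned.age depends on "now", which is why this happens after caching.
    now_ms = int(time.time() * 1000)
    response_dict = {
        "rooms": {
            room_id: [serialize_event(ev, now_ms) for ev in events]
            for room_id, events in sync_result.rooms.items()
        },
        "next_batch": sync_result.next_batch,
    }

    # Step 3: JSON-serialise the whole dict to bytes.
    response_bytes = json.dumps(response_dict).encode("utf-8")

    # Step 4: stream the bytes back to the client.
    request.write(response_bytes)
    request.finish()
```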

Since the result of step 1 is cached (at least on matrix.org), we can effectively neglect it here (though it is important to remember that it is only cached for 2 minutes... more on that later); so the steps of concern to us are steps 2 and 3, which must complete in less than the client's request timeout.
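
For reference, the caching of step 1 behaves roughly like the sketch below: a response cache keyed on the request, with a short expiry. The wiring and the 2-minute constant are written out to illustrate the behaviour described above, not copied from the code.

```python
# Illustrative sketch: step 1's SyncResult is cached with a ~2-minute expiry,
# so an identical request arriving within that window skips the expensive
# "which rooms/events?" computation entirely.
CACHE_TIMEOUT_MS = 2 * 60 * 1000

async def get_sync_result(response_cache, request_key, sync_config):
    # response_cache stands in for the cache of SyncResults; .wrap() either
    # returns the cached result or runs the computation and stores it.
    return await response_cache.wrap(
        request_key,
        compute_sync_result,  # hypothetical helper from the sketch above
        sync_config,
    )
```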

Some notes on timeouts:

  • Cloudflare times out requests if the server doesn't start sending its response within 100 seconds of the request.
  • However, haproxy doesn't notice this until we start trying to send back the response - so Synapse still sees the request as active until our haproxy times out the request, which happens after 180 seconds.
  • Before we start each of steps 2 and 3, we check that the client connection is still up, and skip that step if not. (Step 4 is aborted as soon as the connection drops.)
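
A minimal sketch of the connection-liveness check described in the last bullet, using Twisted's public notifyFinish() API (this is not Synapse's actual implementation, and the helper names are made up):

```python
# Sketch: skip expensive serialisation work once the client has gone away.
from twisted.web.server import Request

class DisconnectTracker:
    """Flips a flag when the client connection is closed early."""

    def __init__(self, request: Request) -> None:
        self.disconnected = False
        # notifyFinish() errbacks if the connection is lost before we finish.
        request.notifyFinish().addErrback(self._on_disconnect)

    def _on_disconnect(self, failure):
        self.disconnected = True

def encode_sync_response(request, sync_result):
    tracker = DisconnectTracker(request)

    if tracker.disconnected:  # check before step 2
        return None
    response_dict = event_serialise(sync_result)  # hypothetical step 2

    if tracker.disconnected:  # check before step 3
        return None
    return json_encode(response_dict)  # hypothetical step 3
```

Note that the flag only updates when the reactor gets a chance to run, which is one reason blocking the reactor (see below) makes these checks less useful.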

In short: we need to serialise the FrozenEvents to dicts (step 2) and JSON-serialise the dicts to bytes within 100 seconds.

That should be plenty though, right? Wrong.


The first problem is that #11478 makes step 2 take 224 seconds (assuming all the relevant aggregations are already cached from the database; the first time through, when we have to hit the database, it takes over 800 seconds):


https://jaeger.int.matrix.org/trace/1a723cd9877a1cc9

(Note that in the above trace, the client connection drops at 180 seconds, so we skip the JSON-serialisation step.)

Still, even with that reverted, we have problems. Consider this example:


https://jaeger.int.matrix.org/trace/2600e970dda6bc65

This request reached haproxy at 12:37:09, but we didn't start processing it until 12:37:56 (a delay of 47 seconds). We then manage to encode the response in 58 seconds, and start streaming the response at 12:38:52, but this doesn't reach haproxy until 12:39:37 (a further 45 seconds). By that time, a total of 148 seconds has elapsed, and Cloudflare has long since given up.

It's hard to trace those 47s and 45s delays precisely, but event serialisation is a synchronous process which blocks the reactor for up to a minute at a time (for example, the preceding request has an encode_response call which blocks the reactor between 12:36:33 and 12:37:29) - after which we have a bunch of other work to catch up on. Prior experience suggests that all bets are off once the reactor starts ticking that slowly.



Ideas on things which might make this better:

  • Counter-intuitively, having a long timeout in haproxy might be making things worse, because it means we're doing work serving responses that will be thrown away at Cloudflare. Possibly we should decrease the timeout in haproxy to (say) 120 seconds.
  • We certainly shouldn't be blocking the reactor for 45 seconds at a time; at the very least, we should stick some sleep(0) calls in sync.encode_response (see the sketch after this list).
  • It would be great to find a way to put most of the event serialisation work inside the SyncResponse cache.
    • Alternatively, maybe we can do it more lazily, so that we can start JSON-encoding without having to event-serialise the entire response.
  • Encode JSON responses on a thread in C, mk2 (#10905) is also part of the problem here: see the discussion at #10905 (comment).
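
A sketch of the sleep(0) idea from the second bullet above (the chunk size and helper names are illustrative, not the actual encode_response implementation): yielding to the reactor every so often lets other requests, and connection-drop notifications, make progress while a huge response is being serialised.

```python
# Illustrative sketch of cooperatively yielding during event serialisation,
# so a huge /sync response doesn't block the reactor for tens of seconds.
YIELD_EVERY = 100  # arbitrary chunk size, for illustration

async def serialise_events(clock, events, now_ms):
    serialised = []
    for i, event in enumerate(events):
        serialised.append(serialize_event(event, now_ms))  # step 2, per event
        if i % YIELD_EVERY == 0:
            # sleep(0) hands control back to the reactor between chunks, so it
            # can service other requests and notice dropped connections.
            await clock.sleep(0)
    return serialised
```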
richvdh added the S-Major and T-Defect labels on Dec 23, 2021

clokep commented Jan 25, 2022

After #11612 / #11659 a bunch of the above is "better" -- the fully calculated data to be serialized is now stored in the ResponseCache. We do not cache the serialized data, however, which means we spend ~1 minute serializing it each time it is requested.

I think the remaining improvements here are:
