
cache sync results for a while after the request completes #3880

Closed
ara4n opened this issue Sep 15, 2018 · 7 comments · Fixed by #10513
Labels
A-Performance (Performance, both client-facing and admin-facing), z-p2 (Deprecated Label)

Comments

@ara4n
Member

ara4n commented Sep 15, 2018

Both @lampholder and @Ralith have reported similar-sounding problems where initial /sync takes ~5x longer to complete from Riot than it should.

@lampholder mentioned that he tried calling initial /sync directly from curl against his HS (bypassing nginx) without a clientside timeout and it took 3 mins, but when called by Riot (which times out requests every 90s or so to avoid network glitches wedging a request) it takes ~15 mins. Meanwhile, @Ralith's homeserver (underpowered, with bad IO) takes 15 mins(!) to generate the sync when called directly, but >1h when called from Riot :/

I haven't seen logs from either, but my theory is that if the client times out the /sync request after the server has generated the response, but before it has sent it to the client, then the next time the client retries, the /sync will be calculated from scratch, and the failure mode may continue.

It looks like the current behaviour on develop caches sync requests via response_cache, which blocks additional requests for the same sync parameters until the ongoing one completes, at which point they all return the same result:

res = yield self.response_cache.wrap(
    sync_config.request_key,
    self._wait_for_sync_for_user,
    sync_config, since_token, timeout, full_state,
)
defer.returnValue(res)
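
To illustrate why that matters here, a minimal sketch of this deduplicate-while-in-flight behaviour, written with asyncio rather than Twisted for brevity; InFlightCache and its internals are hypothetical names, not Synapse's actual ResponseCache:

import asyncio

class InFlightCache:
    """Concurrent callers with the same key share one in-flight computation;
    the result is evicted the moment the computation finishes."""

    def __init__(self):
        self._pending = {}  # key -> asyncio.Task for the shared computation

    async def wrap(self, key, callback, *args):
        task = self._pending.get(key)
        if task is None:
            task = asyncio.ensure_future(callback(*args))
            self._pending[key] = task
            # evict immediately on completion: a late retry starts from scratch
            task.add_done_callback(lambda _: self._pending.pop(key, None))
        # shield, so one caller timing out cannot cancel the shared work
        return await asyncio.shield(task)

Note the eviction on completion: a client that times out just before the task finishes finds nothing in the cache when it retries, triggering a full recalculation.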

However, on the matrix-org-hotfixes branch @richvdh changed it to cache sync responses for an additional 2 minutes, so that any retrying clients will get the cached copy rather than starting from scratch: 0ca2857. This is similar to the old /initialSync snapshot_cache from #457 which never seems to have made it over to /sync.
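
Under the same assumptions as the sketch above, that hotfix amounts to delaying eviction rather than evicting on completion; the 120-second figure is the 2 minutes mentioned, everything else is illustrative. Only the wrap method changes:

CACHE_TTL_SECONDS = 120  # the ~2 minute grace period from the hotfix

    async def wrap(self, key, callback, *args):
        task = self._pending.get(key)
        if task is None:
            task = asyncio.ensure_future(callback(*args))
            self._pending[key] = task

            def _evict_later(_):
                # keep the finished result for the grace period, so a client
                # that timed out and retries is served the cached copy
                loop = asyncio.get_running_loop()
                loop.call_later(CACHE_TTL_SECONDS, self._pending.pop, key, None)

            task.add_done_callback(_evict_later)
        return await asyncio.shield(task)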

I'm not seeing any additional /sync response cache?

If my theory is correct, it looks like this hotfix needs to be ported back into develop, and sytest fixed so that it doesn't fail if initial /sync returns stale data (perhaps we could add a null filter or something as a cachebuster).

Another explanation could be that Twisted is correctly cancelling the ongoing /sync when the client goes away, thus throwing away the would-be cached response. The same fix would mitigate that, however.
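
If that is what is happening, the usual fix is to ensure each caller cancels only its own Deferred, never the shared computation. A hedged sketch using only standard Twisted Deferred APIs (error propagation omitted for brevity):

from twisted.internet import defer

def observe(shared_deferred):
    # hand each caller its own Deferred; cancelling it leaves the shared
    # computation running, so its result can still be cached
    d = defer.Deferred()

    def _deliver(result):
        if not d.called:  # skip callers that have already been cancelled
            d.callback(result)
        return result  # pass the result through to any other observers

    shared_deferred.addCallback(_deliver)
    return d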

@ara4n changed the title from "Initial /sync can take 5x longer to complete than it should" to "Initial /sync can take ~5x longer to complete than it should" Sep 15, 2018
@turt2live
Member

ftr I did observe this behaviour on my personal homeserver a number of times. It was most evident when a client like riot-android decided it needed to resync the world - the logs showed requests coming in and stacking up, holding the lock quite effectively. After about 5-10 minutes the lock would clear and tens of requests would start getting flagged as processed - this would take 10-30 seconds. Then the client sends a plain sync again, to which the logs say there's no cached response and it tries again.

To break the loop, the client had to be lucky enough to have a request just before the homeserver spent several seconds sending responses.

@Ralith
Contributor

Ralith commented Sep 15, 2018

> problems where initial /sync takes 10-15 mins to complete from Riot or similar

To be clear, it takes upwards of an hour for riot to sync, despite individual sync requests completing on the order of 15 minutes. Usually what happens is that Riot cancels numerous syncs but leaves a few running; eventually, all currently-running requests laboriously complete, while new ones start over from scratch. This repeats until somehow one finally catches.

@ara4n
Member Author

ara4n commented Sep 15, 2018

right. 15m * ~5 = ~1h, so that fits this bug - and @turt2live's description of the observed behaviour precisely matches my theory.

@Ralith: can you cherry-pick 0ca2857 and see if it magically fixes itself?

@Ralith
Contributor

Ralith commented Sep 15, 2018

With that commit cherry-picked, I did not observe spurious retries on an initial sync, though I may have just gotten lucky on the timing.

@neilisfragile added the p1 and A-Performance (Performance, both client-facing and admin-facing) labels Oct 5, 2018
@neilisfragile
Contributor

We should offer this as a config option, because, from @erikjohnston:

> a) it's a pain for sytest and b) giving people stale initial syncs makes their client think that all subsequent messages happened "now", so you end up being notified for everything in the last N minutes (both of which could potentially be fixed by caching the initial sync, but then updating it with an incremental sync if necessary)

Also sucks up more mem.
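
A hypothetical sketch of the parenthetical suggestion (serve the cached initial sync, then top it up with an incremental sync); every name below is illustrative rather than Synapse's real API:

async def initial_sync_with_catchup(self, user, request_key):
    cached = self._cache.get(request_key)
    if cached is None:
        cached = await self._compute_initial_sync(user)
        self._cache.set(request_key, cached)
        return cached
    # the cached copy may be minutes stale: sync forward from its own
    # next_batch token so the client is not re-notified for everything
    # that happened since the snapshot was generated
    delta = await self._compute_incremental_sync(user, since=cached.next_batch)
    return self._merge_sync_responses(cached, delta)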

@richvdh added the z-p2 (Deprecated Label) label and removed the p1 label Feb 6, 2020
@richvdh changed the title from "Initial /sync can take ~5x longer to complete than it should" to "cache sync results for a while after the request completes" Jul 24, 2020
@richvdh
Member

richvdh commented Feb 9, 2021

for links: #8518 seems related here. We should make sure that is investigated before "fixing" this issue.

@ShadowJonathan
Contributor

#8518 has been fixed, this issue is unblocked :D
