
Outgoing presence exploded in 1.35.0rc1 #10153

Closed
tulir opened this issue Jun 9, 2021 · 15 comments · Fixed by #10163
Labels
A-Presence
S-Major: Major functionality / product severely impaired, no satisfactory workaround.
T-Defect: Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

@tulir (Member) commented Jun 9, 2021

Description

There seems to be a huge spike in outgoing presence right after 1.35.0rc1 was released. Incoming presence also went up, but not as much.

[screenshot: outgoing and incoming presence graphs]

My federation sender is also having trouble with metrics after the 1.36.0rc1 upgrade (which is what I was looking into when I noticed the presence problem), but that might be a separate issue:

[screenshot: federation sender metrics]

Version information

  • Version: 1.36.0rc1
  • Install method: Docker
  • Workers: federation sender, 3 generic workers (federation reader, synchrotron, events stream writer), appservice sender, media repo
@deepbluev7 (Contributor)

I seem to be seeing the same thing, and the incoming presence suggests I am not alone in this:

[screenshot: outgoing/incoming presence graphs]

2 federation senders, 1 event creator and master process (and some other workers).

@anoadragon453 (Member)

The two presence-related PRs in v1.35.0 are #10014 and #9823. #10014 attempted to prevent presence updates from being accidentally deleted, whereas #9823 deals with optional functionality to send all known user presence to a local or remote user when requested by a Synapse module.

I'm currently looking at the latter as it may explain a large spike in traffic...
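
For context, a minimal sketch of how a module would trigger that functionality. The module shape follows Synapse's module API conventions, and the method name send_local_online_presence_to is my reading of the ModuleApi method that #9823 extends; treat this as an illustration rather than code quoted from Synapse:

```python
# Hypothetical Synapse module illustrating the functionality #9823 touches:
# asking Synapse to (re)send all known presence to a set of users, possibly
# including remote ones. The ModuleApi method name is an assumption based on
# the PR description; the module itself is made up for illustration.
class ExamplePresenceModule:
    def __init__(self, config: dict, api):
        self._api = api  # synapse.module_api.ModuleApi

    async def resend_presence(self) -> None:
        # Ask Synapse to send current presence state to these users, even if
        # it believes they have already received it. For remote users this
        # results in presence EDUs going out over federation.
        await self._api.send_local_online_presence_to(
            ["@alice:example.com", "@bob:remote.example.org"]
        )
```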

@anoadragon453 (Member)

I don't see anything unusual in either PR.

#10014 fixed the insertion of rows into the presence_stream table, and from what I can tell on my personal homeserver, it is performing as intended (there's no more than one row per user).
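
For anyone who wants to run the same check, a minimal diagnostic sketch, assuming a PostgreSQL-backed Synapse; the connection parameters are placeholders for your own setup:

```python
import psycopg2  # assumes a PostgreSQL-backed Synapse

# List users with more than one row in presence_stream; per the above,
# each user should have at most one.
conn = psycopg2.connect(dbname="synapse", user="synapse")  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT user_id, COUNT(*) AS n
        FROM presence_stream
        GROUP BY user_id
        HAVING COUNT(*) > 1
        """
    )
    for user_id, n in cur.fetchall():
        print(f"{user_id}: {n} rows")
conn.close()
```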

#9823 messes about with sending large numbers of user presence out, though I don't see how this could happen unless the users_to_send_full_presence_to database table has some rows in it (which is unlikely).

I have failed to reproduce the bug on a homeserver without a federation sender, while another homeserver with a federation sender does exhibit similar behaviour (but only after upgrading it from v1.35.1 to v1.36.0rc1 + #10149)...

anoadragon453 added the S-Major and T-Defect labels on Jun 9, 2021
@deepbluev7 (Contributor)

I actually have one duplicated row in the presence_stream table; would that cause issues?

@rda0 commented Jun 10, 2021

CPU usage has increased (~1.5-2x) since the upgrade from 1.34.0 to 1.35.1:

  • 4x sync worker (each one increased from ~12% to 17% cpu usage)
  • 2x federation_sender worker (each one increased from 7% to 21%)

I also see periodic CPU spikes (exactly every minute), accompanied by a lot of presence updates in sync and in /_matrix/federation/v1/send on the federation_sender:

[screenshots: CPU usage and presence update graphs]

Open FDs for the federation_sender workers also look unusual. Before the upgrade, each worker held ~100 open FDs, with occasional spikes up to 1500. Now they constantly hold ~1500 open FDs.
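
If anyone wants to cross-check those FD numbers outside of Prometheus, here is a minimal Linux-only sketch; counting the entries in /proc/<pid>/fd is essentially what the standard process_open_fds metric reports (the PID below is a placeholder for your own worker's process id):

```python
import os

def open_fds(pid: int) -> int:
    """Count the open file descriptors of a process (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))

# Example: print the FD count for a federation_sender worker.
# Requires running as the same user as the worker (or as root).
print(open_fds(12345))  # placeholder PID
```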

Version information

  • Version: 1.35.1
  • Install method: pip
  • Workers: 2 federation_sender, 8 generic workers (4 sync, 1 federation, 1 federation_send, 1 client, 1 send), 1 user_dir

@callahad (Contributor)

For the sake of completeness: I just heard someone float the idea that federation patterns are likely to change now that the libera.chat bridge is up and running on its own homeserver, rather than being part of matrix.org. But the observed issues seem to be happening at a variety of times, each most closely associated with that server's upgrade.

@deepbluev7 (Contributor)

While I would expect federation patterns to change, it doesn't make sense that I am sending 20x as much presence because of one additional server. Inbound traffic barely changed.

@ShadowJonathan (Contributor)

I'm also seeing an increase in incoming presence. I have presence disabled, but it is taking a large toll on my system, which is struggling to keep up.

@erikjohnston (Member)

I think #10163 should fix this, but it fixes a bug that I think has been there since v1.33.0?

erikjohnston added a commit that referenced this issue Jun 11, 2021
When using a federation sender we'd send out all local presence updates over
federation even when they shouldn't be.

Fixes #10153.
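
To make the shape of the bug concrete, here is a hypothetical sketch (not Synapse's actual code) of the filtering a dedicated federation sender has to apply. The commit message above describes the failure mode: without restricting destinations to servers that share a room with the user, every local presence update fans out to every known remote server.

```python
# Hypothetical illustration of the bug class fixed by #10163; this is not
# Synapse's actual code. A presence update for a local user should only be
# sent to remote servers that share at least one room with that user.

def presence_destinations(user_id, rooms_for_user, hosts_in_room, my_server_name):
    """Return the remote servers that should receive this user's presence."""
    hosts = set()
    for room_id in rooms_for_user(user_id):
        hosts.update(hosts_in_room(room_id))
    hosts.discard(my_server_name)  # never federate presence back to ourselves
    return hosts

# The bug described above behaves as if this filter were skipped, i.e. as if
# the update were queued for every known destination regardless of rooms.
```
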
@erikjohnston (Member)

I would appreciate it if people could try the linked patch (which has now been merged to develop) to test whether it fixes things for them. It seems to work, but I'm a bit suspicious that people are reporting the breakage in v1.35.0 when the patch fixes a bug introduced in v1.33.0.

@deepbluev7 (Contributor)

Will this be in the next rc or in 1.37.0rc1? If it is the latter, I'll just add the patch to my patchset; otherwise I'll wait for rc2.

@deepbluev7 (Contributor) commented Jun 11, 2021

Okay, I applied the patch. Outgoing presence seems to be a lot lower, but it still needs a long-term test.

[screenshot: outgoing presence after applying the patch]

@callahad (Contributor) commented Jun 11, 2021

> Will this be in the next rc or in 1.37.0rc1? If it is the latter, I'll just add the patch to my patchset; otherwise I'll wait for rc2.

The branch release-v1.36 was created last Tuesday, so under normal circumstances this wouldn't make it into a release candidate until 1.37.0rc1 on June 22nd. However, the patch applies cleanly to the release branch, and it seems like a big enough deal that we should consider backporting to 1.36 and issuing an rc2. Let me think on that.

erikjohnston added a commit that referenced this issue Jun 11, 2021
When using a federation sender we'd send out all local presence updates over
federation even when they shouldn't be.

Fixes #10153.
@davidmehren

I can also confirm that applying the patch significantly reduces outgoing transactions and the CPU usage of our federation sender:
[screenshots: outgoing transactions and federation sender CPU usage]

@erikjohnston (Member)

Thank you both; we have ended up releasing v1.36.0rc2 with a fix for this in it.
