This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

RoomInitialSyncRestServlet internal server error with "AssertionError synapse.util.linked_list in move_after" #11436

Closed
jaywink opened this issue Nov 26, 2021 · 9 comments · Fixed by #11454
Labels
S-Minor Blocks non-critical functionality, workarounds exist. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

@jaywink
Member

jaywink commented Nov 26, 2021

Description

A host started throwing AssertionError synapse.util.linked_list in move_after on GET /_matrix/client/r0/rooms/!abcdefgh%3Adomain.tld/initialSync?limit=20&access_token=<redacted>.

The particular client calling that URL had been fine for a long time. The error seems to have started at, or some time after, a restart of Synapse and its two workers (synchrotron + initial synchrotron).

After restarting everything again, the error disappeared.

Version information

  • Homeserver: EMS customer, please pm for details and times
  • Version: v1.47.1
  • Install method: Docker
@DMRobertson
Contributor

Seems to come from

# We assert that both this node and the target node is still "alive".
assert self.prev_node
assert self.next_node
assert node.prev_node
assert node.next_node
assert self is not node

@squahtx
Contributor

squahtx commented Nov 26, 2021

Are we able to get a stack trace?

@jaywink
Member Author

jaywink commented Nov 29, 2021

We have one now. It's happening again, for the same customer, with the same kind of initial synchrotron EMS worker setup. Please ping me if customer or other details are needed; restarting the worker now.

2021-11-29 09:52:57,042 - synapse.http.server - 97 - ERROR - GET-80- Failed handle request via 'RoomInitialSyncRestServlet': <XForwardedForRequest at 0x7fc86d2242c0 method='GET' uri='/_matrix/client/r0/rooms/!roomid%3Adomain.tld/initialSync?limit=20&access_token=<redacted>' clientproto='HTTP/1.1' site=8009>
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/twisted/internet/defer.py", line 1661, in _inlineCallbacks
    result = current_context.run(gen.send, result)
StopIteration: ([<FrozenEventV3 event_id=redacted, type=m.room.member, state_key=@redacted:matrix.org, outlier=False>, <FrozenEventV3 event_id=$redacted-M, type=m.room.topic, state_key=, outlier=False>, <FrozenEventV3 event_id=$redacted-redacted, type=m.room.message, state_key=None, outlier=False>, <FrozenEventV3 event_id=$redacted, type=m.room.message, state_key=None, outlier=False>, <FrozenEventV3 event_id=$redacted, type=m.room.message, state_key=None, outlier=False>, <FrozenEventV3 event_id=$redacted, type=m.room.member, state_key=@redacted:matrix.redacted.redacted, outlier=False>, <FrozenEventV3 event_id=$redacted, type=m.room.avatar, state_key=, outlier=False>, <FrozenEventV3 event_id=$redacted--redacted, type=m.room.member, state_key=@redacted.redacted:matrix.org, outlier=False>, <FrozenEventV3 event_id=$redacted, type=m.room.member, state_key=@redacted:matrix.org, outlier=False>, <FrozenEventV3 event_id=$redacted, type=m.room.message, state_key=None, outlier=False>, <FrozenEventV3 event_id=$redacted-redacted-redacted, type=m.room.message, state_key=None, outlier=False>, <FrozenEventV3 event_id=$redacted, type=m.room.redaction, state_key=None, outlier=False>, <FrozenEventV3 event_id=$-redacted, type=m.room.message, state_key=None, outlier=False>, <FrozenEventV3 event_id=$d-redacted, type=m.reaction, state_key=None, outlier=False>, <FrozenEventV3 event_id=$redacted-redacted-redacted, type=m.reaction, state_key=None, outlier=False>, <FrozenEventV3 event_id=$redacted-z0, type=m.room.message, state_key=None, outlier=False>, <FrozenEventV3 event_id=$redacted-redacted-redacted, type=m.room.message, state_key=None, outlier=False>, <FrozenEventV3 event_id=$redacted, type=m.room.redaction, state_key=None, outlier=False>, <FrozenEventV3 event_id=$redacted, type=m.room.redaction, state_key=None, outlier=False>, <FrozenEventV3 event_id=$redacted-redacted, type=m.room.member, state_key=@redacted:matrix.org, outlier=False>], RoomStreamToken(topological=45, stream=1547696, 
instance_map=frozendict({})))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/synapse/http/server.py", line 262, in _async_render_wrapper
    callback_return = await self._async_render(request)
  File "/usr/local/lib/python3.8/site-packages/synapse/http/server.py", line 451, in _async_render
    callback_return = await raw_callback_return
  File "/usr/local/lib/python3.8/site-packages/synapse/rest/client/room.py", line 628, in on_GET
    content = await self.initial_sync_handler.room_initial_sync(
  File "/usr/local/lib/python3.8/site-packages/synapse/handlers/initial_sync.py", line 308, in room_initial_sync
    result = await self._room_initial_sync_joined(
  File "/usr/local/lib/python3.8/site-packages/synapse/handlers/initial_sync.py", line 457, in _room_initial_sync_joined
    messages = await filter_events_for_client(
  File "/usr/local/lib/python3.8/site-packages/synapse/visibility.py", line 84, in filter_events_for_client
    event_id_to_state = await storage.state.get_state_for_events(
  File "/usr/local/lib/python3.8/site-packages/synapse/storage/state.py", line 656, in get_state_for_events
    group_to_state = await self.stores.state._get_state_for_groups(
  File "/usr/local/lib/python3.8/site-packages/synapse/storage/databases/state/store.py", line 257, in _get_state_for_groups
    (member_state, incomplete_groups_m,) = self._get_state_for_groups_using_cache(
  File "/usr/local/lib/python3.8/site-packages/synapse/storage/databases/state/store.py", line 324, in _get_state_for_groups_using_cache
    state_dict_ids, got_all = self._get_state_for_group_using_cache(
  File "/usr/local/lib/python3.8/site-packages/synapse/storage/databases/state/store.py", line 205, in _get_state_for_group_using_cache
    cache_entry = cache.get(group)
  File "/usr/local/lib/python3.8/site-packages/synapse/util/caches/dictionary_cache.py", line 100, in get
    entry = self.cache.get(key, _Sentinel.sentinel)
  File "/usr/local/lib/python3.8/site-packages/synapse/util/caches/lrucache.py", line 405, in inner
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/synapse/util/caches/lrucache.py", line 487, in cache_get
    move_node_to_front(node)
  File "/usr/local/lib/python3.8/site-packages/synapse/util/caches/lrucache.py", line 443, in move_node_to_front
    node.move_to_front(real_clock, list_root)
  File "/usr/local/lib/python3.8/site-packages/synapse/util/caches/lrucache.py", line 287, in move_to_front
    self._list_node.move_after(cache_list_root)
  File "/usr/local/lib/python3.8/site-packages/synapse/util/linked_list.py", line 93, in move_after
    assert self.prev_node
AssertionError

@DMRobertson
Contributor

We see these assertions fail in Sentry, but all the examples there are in the server-server API. Curiously this seems to only occur on federation_reader-4. Probably a distinct issue?

https://sentry.matrix.org/sentry/synapse-matrixorg/?query=is%3Aunresolved+linked_list

@DMRobertson
Contributor

Seen in 1.38.1 too.

@babolivier
Contributor

The error points to a ListNode entry in a linked list having no prev_node. The docstrings say that this reference should only be None for nodes that have been removed from the list, or for the root node (though for the root, the code actually sets prev_node to the root node itself). The code seems to agree with that as far as the dropped-node scenario is concerned.
However, something has clearly gone wrong, since we're still trying to access a node that, per the above, must have been dropped from the list. I'll investigate a bit more to see what could be happening here (e.g. are we forgetting to remove the dropped node somewhere at a higher level?).

For context, this code seems to have been introduced by #10205 (1.38.0).
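To illustrate the failure mode described above, here is a minimal doubly-linked-list sketch (illustrative names, not Synapse's actual synapse.util.linked_list API): once a node has been dropped, its prev_node/next_node are None, so a later move_after on it trips the first assertion, matching the reported traceback.

```python
from typing import Optional


class ListNode:
    """Minimal doubly-linked-list node, loosely modelled on the behaviour
    described in the comments above. Names are illustrative only."""

    def __init__(self) -> None:
        self.prev_node: Optional["ListNode"] = None
        self.next_node: Optional["ListNode"] = None

    @classmethod
    def create_root(cls) -> "ListNode":
        # The root node points at itself in both directions.
        root = cls()
        root.prev_node = root
        root.next_node = root
        return root

    def insert_after(self, target: "ListNode") -> None:
        assert target.next_node
        self.prev_node = target
        self.next_node = target.next_node
        target.next_node.prev_node = self
        target.next_node = self

    def remove_from_list(self) -> None:
        # Dropped nodes have both pointers cleared to None.
        assert self.prev_node and self.next_node
        self.prev_node.next_node = self.next_node
        self.next_node.prev_node = self.prev_node
        self.prev_node = None
        self.next_node = None

    def move_after(self, target: "ListNode") -> None:
        # Both this node and the target must still be "alive"; a node that
        # was already dropped trips the first assertion, as in the bug.
        assert self.prev_node
        assert self.next_node
        assert target.prev_node
        assert target.next_node
        assert self is not target
        self.remove_from_list()
        self.insert_after(target)


root = ListNode.create_root()
node = ListNode()
node.insert_after(root)

node.remove_from_list()   # node dropped: prev_node/next_node are now None
try:
    node.move_after(root)  # same shape of failure as the reported traceback
except AssertionError:
    print("AssertionError: dropped node cannot be moved")
```

The open question, then, is why a dropped node is still reachable from the cache at all, which is what the later comments dig into.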

@anoadragon453
Member

@jaywink Has the caches.expiry_time option been enabled recently for this customer? If so, disabling that may help in the meantime.
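For reference, the workaround amounts to removing (or never setting) the expiry option under the caches section of homeserver.yaml. A sketch of the relevant fragment (the 30m value is illustrative):

```yaml
# homeserver.yaml (sketch): the setting that enables the affected code path.
# Removing or commenting out `expiry_time` disables time-based cache
# eviction, sidestepping the buggy drop path until a fix lands.
caches:
  expiry_time: 30m
```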

@jaywink
Member Author

jaywink commented Nov 29, 2021

@jaywink Has the caches.expiry_time option been enabled recently for this customer? If so, disabling that may help in the meantime.

Yes we enabled it for all hosts.

I'll disable it for this customer as a quick patch, thanks!

@squahtx
Contributor

squahtx commented Nov 29, 2021

There's a bug in _Node.drop_from_cache when a size_callback is in use and the total size of the LruCache, as determined by the size_callback, comes out to 0:

if not cache or not cache.pop(self.key, None):
    # `cache.pop` should call `drop_from_lists()`, unless this Node had
    # already been removed from the cache.
    self.drop_from_lists()

LruCache.__len__ is defined as the sum of the sizes of all its entries, so in this rare circumstance, the LruCache is falsey, and the _Node is unlinked without being popped from the cache.

The cache in question is a state group cache:

self._state_group_cache: DictionaryCache[int, StateKey, str] = DictionaryCache(
    "*stateGroupCache*",
    # TODO: this hasn't been tuned yet
    50000,
)

and it has a size_callback of len, which returns 0 for empty dictionaries:

self.cache: LruCache[KT, DictionaryEntry] = LruCache(
    max_size=max_entries, cache_name=name, size_callback=len
)

All the errors in Sentry are for the None state group, which wouldn't exist in the database, and so would have empty, 0-sized cache values.

@babolivier babolivier added T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. S-Minor Blocks non-critical functionality, workarounds exist. labels Nov 30, 2021