multiple PGCoordinators going unhealthy can trigger a storm of DeleteTailnetPeer #12923

spikecurtis · 2024-04-10T07:20:24Z

PGCoordinator has a mechanism where if it is unable to heartbeat over the pub-sub, it declares itself unhealthy, disconnects any coordinatees (agents, server tailnet, CLI), and immediately disconnects any new coordinatees that connect to it.

The purpose of this feature is that if a Coordinator loses its connection to the pubsub/database through a network partition, it drops its connections so that coordinatees can retry and hopefully land on a healthy peer.

However, if multiple PGCoordinators go unhealthy at the same time, coordinatees can bounce between coordinators.

Furthermore, there is a bug in our implementation such that when we disconnect a coordinatee that has never sent a node binding, we trigger an unnecessary DeleteTailnetPeer query to the database. The query is idempotent, so any individual query does no harm, but since we do it once per connection, this can trigger a storm of queries.
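To make the shape of the bug concrete, here is a minimal sketch, not the actual PGCoordinator code. `fakeStore`, `conn`, `bind`, and `closeConn` are hypothetical names invented for illustration; the only thing taken from this issue is the idea that a connection which never sent a node binding has nothing in the database to delete. Note this guard is just one way to avoid the extra query and is not necessarily what the fix does.

```go
package main

import "fmt"

// fakeStore is a hypothetical stand-in for the database. It only counts
// how many DeleteTailnetPeer-style queries would have been issued.
type fakeStore struct{ deletes int }

func (s *fakeStore) DeletePeer(id string) { s.deletes++ }

// conn models a single coordinatee connection (hypothetical type).
type conn struct {
	id       string
	hasBound bool // true once the coordinatee has sent a node binding
}

// bind records that the peer sent a node binding, so a row may exist.
func (c *conn) bind() { c.hasBound = true }

// closeConn issues the delete only if a binding was ever sent.
func closeConn(s *fakeStore, c *conn) {
	if !c.hasBound {
		return // nothing was written for this peer, so skip the query
	}
	s.DeletePeer(c.id)
}

func main() {
	s := &fakeStore{}
	// 1000 connections that are dropped before sending any binding.
	for i := 0; i < 1000; i++ {
		closeConn(s, &conn{id: fmt.Sprintf("peer-%d", i)})
	}
	// One connection that did send a binding before being dropped.
	bound := &conn{id: "agent"}
	bound.bind()
	closeConn(s, bound)
	fmt.Println("deletes issued:", s.deletes)
}
```

Without the `hasBound` check, every one of the dropped connections would cost a query, which is how a reconnect loop turns into a storm.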

Impact:
Contributing or major factor in production outage at a customer

spikecurtis · 2024-04-10T07:45:16Z

A secondary issue may be that agents probably don't back off when reconnecting to Coderd. They have a backoff, but it operates only at the RPC connection layer, so a successful connection resets the backoff, even if the coordination RPC is immediately rejected. EDIT - we never reset the backoff, so the agents will eventually settle down and only retry dialing Coderd once every 10 seconds.

…2925) fixes #12923 Prevents Coordinate peer connections from generating spurious database queries like DeleteTailnetPeer when the coordinator is unhealthy. It does this by checking the health of the querier before accepting a connection, rather than unconditionally accepting it only for it to get swatted down later.

…2925) fixes #12923 Prevents Coordinate peer connections from generating spurious database queries like DeleteTailnetPeer when the coordinator is unhealthy. It does this by checking the health of the querier before accepting a connection, rather than unconditionally accepting it only for it to get swatted down later. (cherry picked from commit 06eae95)

spikecurtis added the s1, bug, networking, and customer-reported labels Apr 10, 2024

spikecurtis mentioned this issue Apr 10, 2024

fix: stop sending DeleteTailnetPeer when coordinator is unhealthy #12925

Merged

spikecurtis closed this as completed in #12925 Apr 10, 2024
