multiple PGCoordinators going unhealthy can trigger a storm of DeleteTailnetPeer #12923

Closed
spikecurtis opened this issue Apr 10, 2024 · 1 comment · Fixed by #12925
Labels
bug Used to filter all bug issues customer-reported Bugs reported by enterprise customers networking Area: networking s1 Bugs that break core workflows. Only humans may set this.

Comments

@spikecurtis

PGCoordinator has a mechanism where if it is unable to heartbeat over the pub-sub, it declares itself unhealthy, disconnects any coordinatees (agents, server tailnet, CLI), and immediately disconnects any new coordinatees that connect to it.

The purpose of this feature is that if a Coordinator loses its connection to the pubsub/database through a network partition, it drops its connections so that coordinatees can retry and hopefully land on a healthy peer.
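To make the mechanism concrete, here is a minimal sketch in Go of heartbeat-driven health tracking. It is not the PGCoordinator code: `healthTracker`, `recordHeartbeat`, `errMissed`, the failure threshold, and the recovery-on-success behavior are all assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errMissed is a hypothetical error standing in for a failed heartbeat.
var errMissed = errors.New("heartbeat failed")

// healthTracker is an illustrative stand-in for the coordinator's health
// state; none of these names come from the real implementation.
type healthTracker struct {
	mu          sync.Mutex
	failures    int
	threshold   int // assumed: consecutive failures before going unhealthy
	healthy     bool
	onUnhealthy func() // e.g. disconnect every coordinatee
}

func newHealthTracker(threshold int, onUnhealthy func()) *healthTracker {
	return &healthTracker{threshold: threshold, healthy: true, onUnhealthy: onUnhealthy}
}

// recordHeartbeat records the outcome of one heartbeat attempt. The callback
// fires once per healthy-to-unhealthy transition and runs under the lock, so
// it must not call back into the tracker.
func (h *healthTracker) recordHeartbeat(err error) {
	h.mu.Lock()
	defer h.mu.Unlock()
	if err == nil {
		h.failures = 0
		h.healthy = true // assumed: a successful heartbeat restores health
		return
	}
	h.failures++
	if h.healthy && h.failures >= h.threshold {
		h.healthy = false
		if h.onUnhealthy != nil {
			h.onUnhealthy()
		}
	}
}

func (h *healthTracker) isHealthy() bool {
	h.mu.Lock()
	defer h.mu.Unlock()
	return h.healthy
}

func main() {
	h := newHealthTracker(3, func() {
		fmt.Println("unhealthy: disconnecting all coordinatees")
	})
	for i := 0; i < 3; i++ {
		h.recordHeartbeat(errMissed)
	}
	fmt.Println("healthy:", h.isHealthy())
}
```

New connections would consult `isHealthy` and be turned away while it returns false.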

However, if multiple PGCoordinators go unhealthy at the same time, coordinatees can bounce between coordinators.

Furthermore, there is a bug in our implementation such that when we disconnect a coordinatee that has never sent a node binding, we trigger an unnecessary DeleteTailnetPeer query to the database. The query is idempotent, so any individual query does no harm, but since we do it once per connection, this can trigger a storm of queries.
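To illustrate how one idempotent query per connection adds up, here is a rough Go sketch with a fake store that only counts deletes. Everything except the `DeleteTailnetPeer` name is hypothetical (`fakeStore`, `conn`, `sentBinding`, and both close functions), and the guarded variant is just one possible way to avoid the query, not necessarily the fix that was merged.

```go
package main

import "fmt"

// fakeStore counts delete queries; it stands in for the real database.
type fakeStore struct{ deletes int }

// DeleteTailnetPeer here is a counting stub, not the real query signature.
func (s *fakeStore) DeleteTailnetPeer(id string) { s.deletes++ }

// conn is a hypothetical coordinatee connection.
type conn struct {
	id          string
	sentBinding bool // true once the coordinatee has sent a node binding
}

// closeNaive mirrors the behavior described above: always delete on disconnect.
func closeNaive(s *fakeStore, c conn) { s.DeleteTailnetPeer(c.id) }

// closeGuarded deletes only if a binding was ever written for this peer.
func closeGuarded(s *fakeStore, c conn) {
	if c.sentBinding {
		s.DeleteTailnetPeer(c.id)
	}
}

// countDeletes closes every connection with closeFn and reports the queries issued.
func countDeletes(closeFn func(*fakeStore, conn), conns []conn) int {
	s := &fakeStore{}
	for _, c := range conns {
		closeFn(s, c)
	}
	return s.deletes
}

func main() {
	// 1000 connections dropped before any of them sent a node binding.
	conns := make([]conn, 1000)
	fmt.Println("naive:", countDeletes(closeNaive, conns))
	fmt.Println("guarded:", countDeletes(closeGuarded, conns))
}
```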

Impact:
Contributing or major factor in a production outage at a customer.

spikecurtis added the s1, bug, networking, and customer-reported labels on Apr 10, 2024

spikecurtis commented Apr 10, 2024

A secondary issue may be that agents probably don't back off when reconnecting to Coderd. They have a backoff, but it operates only at the RPC connection layer, so a successful connection resets the backoff, even if the coordination RPC is immediately rejected. EDIT: we never reset the backoff, so the agents will eventually settle down and only retry dialing Coderd once every 10 seconds.

spikecurtis added a commit that referenced this issue Apr 10, 2024
…2925)

fixes #12923

Prevents Coordinate peer connections from generating spurious database queries like DeleteTailnetPeer when the coordinator is unhealthy.

It does this by checking the health of the querier before accepting a connection, rather than unconditionally accepting it only for it to get swatted down later.
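The shape of that change can be sketched in Go as follows. This is not the code from the commit: `coordinator`, `querier`, `isHealthy`, and `errUnhealthy` are illustrative names, and the real `Coordinate` has a different signature.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// errUnhealthy is a hypothetical error returned to rejected coordinatees.
var errUnhealthy = errors.New("coordinator unhealthy")

// querier is a stand-in holding only the health flag (zero value: unhealthy).
type querier struct{ healthy atomic.Bool }

func (q *querier) isHealthy() bool { return q.healthy.Load() }

// coordinator is a simplified, single-goroutine stand-in.
type coordinator struct {
	q        *querier
	accepted int // connections that got past the health check
}

// Coordinate checks health first, so an unhealthy coordinator never sets up
// (and later tears down) per-connection state that would touch the database.
func (c *coordinator) Coordinate(peerID string) error {
	if !c.q.isHealthy() {
		return errUnhealthy
	}
	c.accepted++
	return nil
}

func main() {
	c := &coordinator{q: &querier{}}
	fmt.Println(c.Coordinate("agent-1")) // rejected up front while unhealthy
	c.q.healthy.Store(true)
	fmt.Println(c.Coordinate("agent-1")) // accepted once healthy
}
```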
coadler pushed a commit that referenced this issue Apr 17, 2024
…2925)

fixes #12923

Prevents Coordinate peer connections from generating spurious database queries like DeleteTailnetPeer when the coordinator is unhealthy.

It does this by checking the health of the querier before accepting a connection, rather than unconditionally accepting it only for it to get swatted down later.

(cherry picked from commit 06eae95)