multiple PGCoordinators going unhealthy can trigger a storm of DeleteTailnetPeer #12923
Labels: bug, customer-reported, networking, s1
Comments
spikecurtis added the s1, bug, networking, and customer-reported labels on Apr 10, 2024
spikecurtis added a commit that referenced this issue on Apr 10, 2024:
…2925) fixes #12923 Prevents Coordinate peer connections from generating spurious database queries like DeleteTailnetPeer when the coordinator is unhealthy. It does this by checking the health of the querier before accepting a connection, rather than unconditionally accepting it only for it to get swatted down later.
coadler pushed a commit that referenced this issue on Apr 17, 2024:
…2925) fixes #12923 Prevents Coordinate peer connections from generating spurious database queries like DeleteTailnetPeer when the coordinator is unhealthy. It does this by checking the health of the querier before accepting a connection, rather than unconditionally accepting it only for it to get swatted down later. (cherry picked from commit 06eae95)
PGCoordinator has a mechanism whereby, if it is unable to heartbeat over the pubsub, it declares itself unhealthy, disconnects all coordinatees (agents, server tailnet, CLI), and immediately disconnects any new coordinatees that connect to it.
The purpose of this feature is that if a coordinator loses its connection to the pubsub/database through a network partition, it drops connections so that coordinatees can retry and hopefully land on a healthy peer.
However, if multiple PGCoordinators go unhealthy at the same time, coordinatees can bounce between coordinators.
Furthermore, there is a bug in our implementation: when we disconnect a coordinatee that has never sent a node binding, we trigger an unnecessary DeleteTailnetPeer query to the database. The query is idempotent, so any individual query does no harm, but since we issue one per connection, repeated reconnects can trigger a storm of queries.
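To illustrate the failure mode and the direction of the fix, here is a minimal, self-contained sketch. It is not the actual PGCoordinator code: `fakeStore`, `coordinator`, `conn`, `Accept`, `Bind`, and `Close` are hypothetical names invented for this example. The health check in `Accept` mirrors what the referenced commit message describes (check health before accepting a connection). The `bound` guard in `Close` is an extra assumption about how the unnecessary delete could be avoided, not something the commit message confirms.

```go
package main

import (
	"errors"
	"fmt"
)

// fakeStore is a hypothetical stand-in for the database; it only
// counts how many DeleteTailnetPeer-style queries would be issued.
type fakeStore struct{ deletes int }

func (s *fakeStore) DeletePeer(id string) { s.deletes++ }

var errUnhealthy = errors.New("coordinator unhealthy")

// coordinator is a hypothetical, heavily simplified coordinator.
type coordinator struct {
	healthy bool
	store   *fakeStore
}

// conn represents one coordinatee connection.
type conn struct {
	id    string
	bound bool // true once the coordinatee has sent a node binding
	coord *coordinator
}

// Accept checks health up front and refuses the connection when the
// coordinator is unhealthy, so no per-connection cleanup ever runs.
func (c *coordinator) Accept(id string) (*conn, error) {
	if !c.healthy {
		return nil, errUnhealthy
	}
	return &conn{id: id, coord: c}, nil
}

// Bind records that the coordinatee sent a node binding.
func (cn *conn) Bind() { cn.bound = true }

// Close only issues a delete if a binding was ever written
// (assumption: with no binding, there is nothing to delete).
func (cn *conn) Close() {
	if cn.bound {
		cn.coord.store.DeletePeer(cn.id)
	}
}

func main() {
	store := &fakeStore{}
	coord := &coordinator{healthy: false, store: store}

	// Simulate coordinatees bouncing off an unhealthy coordinator.
	rejected := 0
	for i := 0; i < 1000; i++ {
		if _, err := coord.Accept(fmt.Sprintf("peer-%d", i)); err != nil {
			rejected++
		}
	}
	fmt.Println("rejected:", rejected, "deletes:", store.deletes)

	// Once healthy, a connection that actually bound is cleaned up.
	coord.healthy = true
	cn, _ := coord.Accept("peer-ok")
	cn.Bind()
	cn.Close()
	fmt.Println("deletes after healthy bound close:", store.deletes)
}
```

In this sketch, rejected connections never reach `Close`, so a reconnect storm against an unhealthy coordinator produces no database queries at all.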
Impact:
Contributing or major factor in a production outage at a customer.