Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove all batches related to a peer on disconnect #5969

Merged

Conversation

pawanjay176
Copy link
Member

Issue Addressed

N/A

Proposed Changes

In #5680 , we decoupled a peer disconnect from an rpc error to simplify the logic in sync a bit.
The peer disconnect message from the network was responsible for removing the peer from all the sync data structures.
The rpc error was responsible for re-requesting the failed batch/lookup.

We assumed that an rpc error would always go together with a peer disconnect, however we learned the hard way that there are many edge cases where the rpc error never gets generated when the peer disconnects. e.g. the peer manager disconnects a peer that has its rpc response mid-flight, peer disconnects before a rpc request reaches the network etc. There are many other cases we may not know about.
This leads to multiple bugs in sync where sync stalls because a batch/lookup never gets re-requested on peer disconnects.

#5967 was an attempt at adding timeouts to the rpc requests sent from sync that trigger re-requests when the timer elapses. However, as @AgeManning noted, this would lead to many more edge cases.

This PR triggers re-request of batches/lookups on peer disconnects partially reverting #5680 . This isn't the cleanest but considering that we cannot always rely on lighthouse_network to generate rpc errors to go with peer disconnects, this seems the best option.

I intend to test this extensively on mainnet before merging

@pawanjay176 pawanjay176 added v5.2.1 Patch release for v5.2.0 do-not-merge labels Jun 20, 2024
@dapplion
Copy link
Collaborator

For block lookups we should iterate the active requests here and fail them

blocks_by_root_requests: FnvHashMap<SingleLookupReqId, ActiveBlocksByRootRequest>,
/// A mapping of active BlobsByRoot requests, including both current slot and parent lookups.
blobs_by_root_requests: FnvHashMap<SingleLookupReqId, ActiveBlobsByRootRequest<T::EthSpec>>,

Plus add tests to ensure that lookups can handle getting late errors for a requests that has already been failed for disconnection. Current lookup code is quite strict in that it expects a single event per network request.

@dapplion
Copy link
Collaborator

dapplion commented Jun 20, 2024

Note that sync doesn't need to check if the peer is connected before sending a request. Consider these two cases:

Case 1: (Receive disconnect after request)

  • Network: peer A disconnects
  • Sync: sends request X to peer A
  • Sync: receives disconnect event for peer A -> Fails request X

Case 2: (Receive disconnect before request)

  • Network: peer A disconnects
  • Sync: receives disconnect event for peer A -> Removes peer A from pool of available peers
  • Sync: sends request X to peer some other peer

@michaelsproul michaelsproul changed the base branch from unstable to release-v5.2.1 June 23, 2024 23:00
@michaelsproul
Copy link
Member

michaelsproul commented Jun 23, 2024

I've re-targeted this at release-v5.2.1. See:

@pawanjay176 pawanjay176 added ready-for-review The code is ready for review and removed do-not-merge labels Jun 24, 2024
@AgeManning
Copy link
Member

I think we can probably also now remove the RPC Error logic we added. I think its not that necessary if we are now relying on the PeerDisconnection message. i.e I think we can remove this entire function: https://github.com/sigp/lighthouse/blob/stable/beacon_node/lighthouse_network/src/rpc/handler.rs#L356

@AgeManning
Copy link
Member

I can also remove this extra code in a future PR if we prefer to test this standalone?

@pawanjay176
Copy link
Member Author

pawanjay176 commented Jun 24, 2024

@AgeManning removed in cd17f9f
I have tested this PR on mainnet for forward sync, backfill and lookups. I think its in a good state now.

@dapplion I changed lookups peer_disconnected behaviour such that it drops a lookup only if there are no peers to drive the request forward or if there are no existing downloaded components.
This would mean that there could be parent requests in lookups that have no associated peer. I think this is okay, lmk if you disagree.

Copy link
Member

@AgeManning AgeManning left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!!

Copy link
Member

@jxs jxs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for driving this Pawan!

beacon_node/network/src/sync/network_context.rs Outdated Show resolved Hide resolved
beacon_node/network/src/sync/block_lookups/mod.rs Outdated Show resolved Hide resolved
beacon_node/network/src/sync/backfill_sync/mod.rs Outdated Show resolved Hide resolved
beacon_node/network/src/sync/backfill_sync/mod.rs Outdated Show resolved Hide resolved
beacon_node/network/src/sync/range_sync/chain.rs Outdated Show resolved Hide resolved
beacon_node/network/src/sync/block_lookups/tests.rs Outdated Show resolved Hide resolved
beacon_node/network/src/sync/block_lookups/tests.rs Outdated Show resolved Hide resolved
beacon_node/network/src/sync/block_lookups/tests.rs Outdated Show resolved Hide resolved
beacon_node/network/src/sync/block_lookups/tests.rs Outdated Show resolved Hide resolved
beacon_node/network/src/sync/block_lookups/tests.rs Outdated Show resolved Hide resolved
michaelsproul added a commit that referenced this pull request Jun 25, 2024
Squashed commit of the following:

commit 0c8efc1
Merge: 62167fb 758b58c
Author: Michael Sproul <michael@sigmaprime.io>
Date:   Tue Jun 25 11:21:39 2024 +1000

    Merge remote-tracking branch 'origin/release-v5.2.1' into rpc-error-on-disconnect-revert

commit 62167fb
Author: Pawan Dhananjay <pawandhananjay@gmail.com>
Date:   Tue Jun 25 02:30:53 2024 +0530

    Fix test

commit b930c7c
Author: Pawan Dhananjay <pawandhananjay@gmail.com>
Date:   Tue Jun 25 01:58:47 2024 +0530

    Cleanup

commit dfaf238
Author: Pawan Dhananjay <pawandhananjay@gmail.com>
Date:   Tue Jun 25 01:56:29 2024 +0530

    Revert pretty response types

commit 3e1c41a
Author: Pawan Dhananjay <pawandhananjay@gmail.com>
Date:   Tue Jun 25 01:45:20 2024 +0530

    Fix issue raised by lion

commit 6ead176
Author: Pawan Dhananjay <pawandhananjay@gmail.com>
Date:   Mon Jun 24 18:13:24 2024 +0530

    Remove redundant test

commit cd17f9f
Author: Pawan Dhananjay <pawandhananjay@gmail.com>
Date:   Mon Jun 24 15:39:34 2024 +0530

    Remove poll_close on rpc behaviour

commit bc10fb2
Author: Pawan Dhananjay <pawandhananjay@gmail.com>
Date:   Mon Jun 24 15:25:56 2024 +0530

    Fix tests

commit a8f64f2
Author: Pawan Dhananjay <pawandhananjay@gmail.com>
Date:   Mon Jun 24 13:03:30 2024 +0530

    Remove lookup if it cannot progress

commit c7fd21d
Author: Pawan Dhananjay <pawandhananjay@gmail.com>
Date:   Mon Jun 24 10:50:37 2024 +0530

    Fix lints

commit a38caf4
Author: Pawan Dhananjay <pawandhananjay@gmail.com>
Date:   Fri Jun 21 21:43:41 2024 +0530

    fmt

commit 312be2c
Author: Pawan Dhananjay <pawandhananjay@gmail.com>
Date:   Fri Jun 21 21:43:33 2024 +0530

    Pretty response types

commit 70af7d1
Author: Pawan Dhananjay <pawandhananjay@gmail.com>
Date:   Fri Jun 21 18:02:40 2024 +0530

    Allow lookups to continue in case of disconnections

commit 9d90e39
Author: Pawan Dhananjay <pawandhananjay@gmail.com>
Date:   Fri Jun 21 17:34:01 2024 +0530

    Cleanup map entries after disconnect

commit 92a6e77
Author: Pawan Dhananjay <pawandhananjay@gmail.com>
Date:   Thu Jun 20 12:50:19 2024 +0530

    Remove all batches related to a peer on disconnect
@dapplion
Copy link
Collaborator

@pawanjay176 Added some comments relevant to the API expectations of lookup sync. I think they are helpful to understand bug sources and overall behaviour

@jimmygchen
Copy link
Member

@mergify queue

Copy link

mergify bot commented Jun 26, 2024

queue

✅ The pull request has been merged automatically

The pull request has been merged automatically at bf4cbd3

@jimmygchen jimmygchen added ready-for-merge This PR is ready to merge. and removed ready-for-review The code is ready for review labels Jun 26, 2024
mergify bot added a commit that referenced this pull request Jun 26, 2024
@mergify mergify bot merged commit bf4cbd3 into sigp:release-v5.2.1 Jun 26, 2024
28 checks passed
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready-for-merge This PR is ready to merge. v5.2.1 Patch release for v5.2.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants