Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Correct gossipsub mesh and connected peer inconsistencies #6244

Open
wants to merge 6 commits into
base: unstable
Choose a base branch
from

Conversation

AgeManning
Copy link
Member

Issue Addressed

We are seeing errors in gossipsub related to an inconsistent state between mesh peers and connected peers.

There appears to be a few problems. This PR will be updated as we find all the bugs.

The first I have noticed is that the IDONTWANT messages also look for in-flight gossipsub promises. When a peer disconnects we dont remove the tracking of in-flight gossipsub promises, because it is inefficient search (it's not index'd by peer-id). Therefore we can have disconnected peers in this mapping and if we go to send an IDONTWANT message, we will not have this peer-id in the connected mapping.

Am currently still searching for the others....

@AgeManning AgeManning requested a review from jxs August 8, 2024 05:17
@AgeManning AgeManning marked this pull request as ready for review August 8, 2024 07:36
@AgeManning AgeManning added the ready-for-review The code is ready for review label Aug 8, 2024
@AgeManning
Copy link
Member Author

Ok, found the other bug. This should fix the libp2p issues I think

beacon_node/lighthouse_network/gossipsub/src/behaviour.rs Outdated Show resolved Hide resolved
@@ -2706,8 +2712,8 @@ where

for peer_id in recipient_peers {
let Some(peer) = self.connected_peers.get_mut(peer_id) else {
tracing::error!(peer = %peer_id,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of removing this error, can we instead filter above for the gossip promises? I.e.

// Gossip promises are kept for disconnected peers.
let iwant_peers = self
            .gossip_promises
            .peers_for_message(msg_id)
            .iter()
            .filter(|peer_id| self.connected_peers.contains_key(peer_id));

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about this. I don't mind.
The logic that went through my head was:
Adding this check adds a hashmap lookup for each peer, but a few lines down the page we do the exact same check.

I figured we are already doing the check, by just removing the error, we have no extra hash lookup, however the cost is we don't log an error in case a mesh peer isn't in the connected mapping (but we have these errors elsewhere).

Its such a tiny difference, so happy to do whatever you think is best, just explaining why I removed the error.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for elaborating age! Do you know where else we check? It def doesn't make sense to double check this adding another iteration as you refer, I just think we shouldn't stop logging for this event

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So yeah happy to do what you prefer.

I can't think of a way to keep the error without adding an extra hashmap lookup. Its a trivial lookup, and maybe the error is worth it. So I'm happy to go with the soln you have suggested here if you want the error.

Co-authored-by: João Oliveira <hello@jxs.pt>
Copy link
Member

@jxs jxs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw age, can we also add an entry to the changelog?

@@ -2706,8 +2712,8 @@ where

for peer_id in recipient_peers {
let Some(peer) = self.connected_peers.get_mut(peer_id) else {
tracing::error!(peer = %peer_id,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for elaborating age! Do you know where else we check? It def doesn't make sense to double check this adding another iteration as you refer, I just think we shouldn't stop logging for this event

@AgeManning AgeManning force-pushed the supress-invalid-gossipsub-error branch from 86fa0b4 to 47039ef Compare September 3, 2024 04:35
@AgeManning
Copy link
Member Author

I've added the changelog :). Feel free to add the error back if you like.

@AgeManning AgeManning requested a review from jxs September 8, 2024 23:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready-for-review The code is ready for review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants