Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Collator using wrong address to connect to Validator #1732

Open
2 tasks done
crystalin opened this issue Sep 27, 2023 · 5 comments
Open
2 tasks done

Collator using wrong address to connect to Validator #1732

crystalin opened this issue Sep 27, 2023 · 5 comments
Labels
I2-bug The node fails to follow expected behavior. I10-unconfirmed Issue might be valid, but it's not yet known.

Comments

@crystalin
Copy link

Is there an existing issue?

  • I have searched the existing issues

Experiencing problems? Have you tried our Stack Exchange first?

  • This is not a support question.

Description of bug

The issue is that some of our Collators, in our internal networks are sometimes not able to get their block backed/included.

After some investigation, we realized that the collator maintains a list of knownAddresses for the validator peer id, which also include a bad address (of a server that is not related to our network)

This should not be a problem as it also contains 3 other addresses which are valid (2x dns4 and 1x ip4).

However when it is time to connect to the validator to later send the candidate, the collator tries to dial the bad address, which doesn't reply (which I think triggers a timeout after 1 minute) and so when the collator tries to send the candidate later, it simply fail saying it can't reach the validator.

Steps to reproduce

I don't know really how to reproduce it on purpose, you need to be able to modify the knownAddresses associated with a peer on a node to trigger it manually

@crystalin crystalin added I10-unconfirmed Issue might be valid, but it's not yet known. I2-bug The node fails to follow expected behavior. labels Sep 27, 2023
@crystalin
Copy link
Author

cc @tomaka (I think you worked on the p2p part no ?)

@altonen
Copy link
Contributor

altonen commented Sep 28, 2023

If the dial fails, the address should be removed from the list of known addresses,

FromSwarm::DialFailure(e @ DialFailure { peer_id, error, .. }) => {
if let Some(peer_id) = peer_id {
if let DialError::Transport(errors) = error {
if let Entry::Occupied(mut entry) = self.ephemeral_addresses.entry(peer_id)
{
for (addr, _error) in errors {
entry.get_mut().retain(|a| a != addr);
}
if entry.get().is_empty() {
entry.remove();
}
}
}
}
self.kademlia.on_swarm_event(FromSwarm::DialFailure(e));

Are you saying the node is permanently unreachable because of this one faulty address?

@crystalin
Copy link
Author

@altonen I'll ask our ops to verify, but in our case no. It is only unreachable for a certain amount of time (I believe for around 1-2 minutes) but then it works again, and later on it will happen again.
The scenario I image happening is:

  • Everytime there is validator core group rotation:
    • Collator tries to connect to the new validator on a random address, in this case the bad one but sometime a good one
    • Collator produces blocks but can't send them because the bad connection is still pending timeout
    • Collator connection timeout
    • After timeout, the collator tries to reconnect to another address, which this times works and is able to produce blocks again
    • At next rotation (every 2 minutes), this might (or not) happen again

@altonen
Copy link
Contributor

altonen commented Sep 28, 2023

This is something we discussed in another issue recently. Basically there should be grace period for addresses that have recently failed and then retry dial N times before the address is marked as undialable for good.

Still, something doesn't make sense because libp2p configures the concurrent dial factor to 8 which means that it should be able to establish a connection if there is at least one address the peer can be reached from. If you're able to reproduce this, could you run the node with -lsub-libp2p=trace and post the logs?

@purestakeoskar
Copy link

I can confirm that when this happens the peer is now in "notConnectedPeers", but all the known addresses are the same, so next time it needs to connect to the same peer it can still try the non functional address.
Another thing i wanted to add is that if we add the validators as bootnodes or reserved nodes to the collators, this problem goes away. My theory is that there is already a known good address to the validator to use create the new substream over.

serban300 pushed a commit to serban300/polkadot-sdk that referenced this issue Mar 26, 2024
* updated weights

* also fix off-by-one in benchmarks
serban300 pushed a commit to serban300/polkadot-sdk that referenced this issue Mar 27, 2024
* updated weights

* also fix off-by-one in benchmarks
serban300 pushed a commit to serban300/polkadot-sdk that referenced this issue Apr 8, 2024
* updated weights

* also fix off-by-one in benchmarks
serban300 pushed a commit to serban300/polkadot-sdk that referenced this issue Apr 8, 2024
* updated weights

* also fix off-by-one in benchmarks
serban300 pushed a commit to serban300/polkadot-sdk that referenced this issue Apr 8, 2024
* updated weights

* also fix off-by-one in benchmarks
serban300 pushed a commit to serban300/polkadot-sdk that referenced this issue Apr 8, 2024
* updated weights

* also fix off-by-one in benchmarks
serban300 pushed a commit to serban300/polkadot-sdk that referenced this issue Apr 8, 2024
* updated weights

* also fix off-by-one in benchmarks
serban300 pushed a commit to serban300/polkadot-sdk that referenced this issue Apr 9, 2024
* updated weights

* also fix off-by-one in benchmarks
serban300 pushed a commit to serban300/polkadot-sdk that referenced this issue Apr 9, 2024
* updated weights

* also fix off-by-one in benchmarks
serban300 pushed a commit to serban300/polkadot-sdk that referenced this issue Apr 9, 2024
* updated weights

* also fix off-by-one in benchmarks
serban300 pushed a commit to serban300/polkadot-sdk that referenced this issue Apr 9, 2024
* updated weights

* also fix off-by-one in benchmarks
serban300 pushed a commit to serban300/polkadot-sdk that referenced this issue Apr 9, 2024
* updated weights

* also fix off-by-one in benchmarks
serban300 pushed a commit to serban300/polkadot-sdk that referenced this issue Apr 9, 2024
* updated weights

* also fix off-by-one in benchmarks
serban300 pushed a commit to serban300/polkadot-sdk that referenced this issue Apr 10, 2024
* updated weights

* also fix off-by-one in benchmarks
serban300 pushed a commit to serban300/polkadot-sdk that referenced this issue Apr 10, 2024
* updated weights

* also fix off-by-one in benchmarks
bkchr pushed a commit that referenced this issue Apr 10, 2024
* updated weights

* also fix off-by-one in benchmarks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
I2-bug The node fails to follow expected behavior. I10-unconfirmed Issue might be valid, but it's not yet known.
Projects
Status: Backlog 🗒
Development

No branches or pull requests

3 participants