Collator using wrong address to connect to Validator #1732
Comments
cc @tomaka (I think you worked on the p2p part, no?)
If the dial fails, the address should be removed from the list of known addresses; see polkadot-sdk/substrate/client/network/src/discovery.rs, lines 604 to 619 in 769bdd3.
Are you saying the node is permanently unreachable because of this one faulty address?
@altonen I'll ask our ops to verify, but in our case no. It is only unreachable for a certain amount of time (I believe for around 1-2 minutes) but then it works again, and later on it will happen again.
This is something we discussed in another issue recently. Basically, there should be a grace period for addresses that have recently failed, and the dial should then be retried N times before the address is marked as undialable for good. Still, something doesn't make sense, because libp2p sets the concurrent dial factor to 8, which means it should be able to establish a connection if there is at least one address the peer can be reached from. If you're able to reproduce this, could you run the node with
I can confirm that when this happens the peer ends up in "notConnectedPeers", but its known addresses are unchanged, so the next time the node needs to connect to the same peer it can still try the non-functional address.
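To make the grace-period idea from the comment above concrete, here is a minimal sketch of an address book that skips a recently failed address until its grace period expires and marks it undialable after N consecutive failures. Everything in it (`AddressBook`, `AddressState`, `record_failure`, `dialable`, the example addresses) is hypothetical and made up for illustration; it is not the polkadot-sdk or libp2p API, and the real fix may look quite different.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical per-address dial state (not a polkadot-sdk type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AddressState {
    /// Never failed, or the last dial succeeded.
    Healthy,
    /// Failed recently; skipped until `retry_at`.
    Backoff { failures: u32, retry_at: Instant },
    /// Failed `max_failures` times in a row; never offered again.
    Undialable,
}

/// Hypothetical address book for a single peer.
pub struct AddressBook {
    grace_period: Duration,
    max_failures: u32,
    addresses: HashMap<String, AddressState>,
}

impl AddressBook {
    pub fn new(grace_period: Duration, max_failures: u32) -> Self {
        Self { grace_period, max_failures, addresses: HashMap::new() }
    }

    /// Add an address; an already known address keeps its current state.
    pub fn insert(&mut self, addr: &str) {
        self.addresses
            .entry(addr.to_string())
            .or_insert(AddressState::Healthy);
    }

    /// Record a failed dial: start (or extend) the grace period, or give up
    /// on the address once it has failed `max_failures` times in a row.
    pub fn record_failure(&mut self, addr: &str, now: Instant) {
        let (max, grace) = (self.max_failures, self.grace_period);
        if let Some(state) = self.addresses.get_mut(addr) {
            let failures = match *state {
                AddressState::Healthy => 1,
                AddressState::Backoff { failures, .. } => failures + 1,
                AddressState::Undialable => return,
            };
            *state = if failures >= max {
                AddressState::Undialable
            } else {
                AddressState::Backoff { failures, retry_at: now + grace }
            };
        }
    }

    /// A successful dial resets the failure counter.
    pub fn record_success(&mut self, addr: &str) {
        if let Some(state) = self.addresses.get_mut(addr) {
            *state = AddressState::Healthy;
        }
    }

    /// Addresses worth dialing at `now`, sorted for a stable order.
    pub fn dialable(&self, now: Instant) -> Vec<String> {
        let mut out = Vec::new();
        for (addr, state) in &self.addresses {
            let ok = match *state {
                AddressState::Healthy => true,
                AddressState::Backoff { retry_at, .. } => now >= retry_at,
                AddressState::Undialable => false,
            };
            if ok {
                out.push(addr.clone());
            }
        }
        out.sort();
        out
    }
}
```

With something like this, a bad address that fails once would drop out of `dialable` for the grace period, come back for a retry afterwards, and disappear for good after the configured number of consecutive failures, while the valid addresses stay available the whole time.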
* updated weights * also fix off-by-one in benchmarks
Description of bug
Some of our collators, in our internal networks, are sometimes unable to get their blocks backed/included.
After some investigation, we realized that the collator maintains a list of knownAddresses for the validator peer ID, and that this list includes a bad address (belonging to a server unrelated to our network).
This should not be a problem, since the list also contains 3 other addresses that are valid (2x dns4 and 1x ip4).
However, when it is time to connect to the validator in order to send the candidate later, the collator dials the bad address, which doesn't reply (I think this triggers a timeout after 1 minute). When the collator then tries to send the candidate, it simply fails, saying it can't reach the validator.
Steps to reproduce
I don't really know how to reproduce it on purpose. To trigger it manually, you would need to be able to modify the knownAddresses associated with a peer on a node.