Collator using wrong address to connect to Validator #1732

crystalin · 2023-09-27T21:53:23Z

Is there an existing issue?

I have searched the existing issues

Experiencing problems? Have you tried our Stack Exchange first?

This is not a support question.

Description of bug

The issue is that some of our Collators, in our internal networks are sometimes not able to get their block backed/included.

After some investigation, we realized that the collator maintains a list of knownAddresses for the validator peer id, which also include a bad address (of a server that is not related to our network)

This should not be a problem as it also contains 3 other addresses which are valid (2x dns4 and 1x ip4).

However when it is time to connect to the validator to later send the candidate, the collator tries to dial the bad address, which doesn't reply (which I think triggers a timeout after 1 minute) and so when the collator tries to send the candidate later, it simply fail saying it can't reach the validator.

Steps to reproduce

I don't know really how to reproduce it on purpose, you need to be able to modify the knownAddresses associated with a peer on a node to trigger it manually

crystalin · 2023-09-27T21:55:48Z

cc @tomaka (I think you worked on the p2p part no ?)

altonen · 2023-09-28T07:48:02Z

If the dial fails, the address should be removed from the list of known addresses,

polkadot-sdk/substrate/client/network/src/discovery.rs

Lines 604 to 619 in 769bdd3

    
           FromSwarm::DialFailure(e @ DialFailure { peer_id, error, .. }) => { 
        
           	if let Some(peer_id) = peer_id { 
        
           		if let DialError::Transport(errors) = error { 
        
           			if let Entry::Occupied(mut entry) = self.ephemeral_addresses.entry(peer_id) 
        
           			{ 
        
           				for (addr, _error) in errors { 
        
           					entry.get_mut().retain(|a| a != addr); 
        
           				} 
        
           				if entry.get().is_empty() { 
        
           					entry.remove(); 
        
           				} 
        
           			} 
        
           		} 
        
           	} 
        
           	self.kademlia.on_swarm_event(FromSwarm::DialFailure(e));

Are you saying the node is permanently unreachable because of this one faulty address?

crystalin · 2023-09-28T11:46:44Z

@altonen I'll ask our ops to verify, but in our case no. It is only unreachable for a certain amount of time (I believe for around 1-2 minutes) but then it works again, and later on it will happen again.
The scenario I image happening is:

Everytime there is validator core group rotation:
- Collator tries to connect to the new validator on a random address, in this case the bad one but sometime a good one
- Collator produces blocks but can't send them because the bad connection is still pending timeout
- Collator connection timeout
- After timeout, the collator tries to reconnect to another address, which this times works and is able to produce blocks again
- At next rotation (every 2 minutes), this might (or not) happen again

altonen · 2023-09-28T12:54:05Z

This is something we discussed in another issue recently. Basically there should be grace period for addresses that have recently failed and then retry dial N times before the address is marked as undialable for good.

Still, something doesn't make sense because libp2p configures the concurrent dial factor to 8 which means that it should be able to establish a connection if there is at least one address the peer can be reached from. If you're able to reproduce this, could you run the node with -lsub-libp2p=trace and post the logs?

purestakeoskar · 2023-09-28T15:31:39Z

I can confirm that when this happens the peer is now in "notConnectedPeers", but all the known addresses are the same, so next time it needs to connect to the same peer it can still try the non functional address.
Another thing i wanted to add is that if we add the validators as bootnodes or reserved nodes to the collators, this problem goes away. My theory is that there is already a known good address to the validator to use create the new substream over.

* updated weights * also fix off-by-one in benchmarks

crystalin added I10-unconfirmed Issue might be valid, but it's not yet known. I2-bug The node fails to follow expected behavior. labels Sep 27, 2023

serban300 pushed a commit to serban300/polkadot-sdk that referenced this issue Mar 26, 2024

Updated db weights and some experiments (paritytech#1732)

c28b3ff

* updated weights * also fix off-by-one in benchmarks

serban300 pushed a commit to serban300/polkadot-sdk that referenced this issue Mar 27, 2024

Updated db weights and some experiments (paritytech#1732)

f4ea091

* updated weights * also fix off-by-one in benchmarks

serban300 pushed a commit to serban300/polkadot-sdk that referenced this issue Apr 8, 2024

Updated db weights and some experiments (paritytech#1732)

668ff08

* updated weights * also fix off-by-one in benchmarks

serban300 pushed a commit to serban300/polkadot-sdk that referenced this issue Apr 8, 2024

Updated db weights and some experiments (paritytech#1732)

32a0264

* updated weights * also fix off-by-one in benchmarks

serban300 pushed a commit to serban300/polkadot-sdk that referenced this issue Apr 8, 2024

Updated db weights and some experiments (paritytech#1732)

ea29336

* updated weights * also fix off-by-one in benchmarks

serban300 pushed a commit to serban300/polkadot-sdk that referenced this issue Apr 8, 2024

Updated db weights and some experiments (paritytech#1732)

e9254b4

* updated weights * also fix off-by-one in benchmarks

serban300 pushed a commit to serban300/polkadot-sdk that referenced this issue Apr 8, 2024

Updated db weights and some experiments (paritytech#1732)

fa22185

* updated weights * also fix off-by-one in benchmarks

serban300 pushed a commit to serban300/polkadot-sdk that referenced this issue Apr 9, 2024

Updated db weights and some experiments (paritytech#1732)

9e8f046

* updated weights * also fix off-by-one in benchmarks

serban300 pushed a commit to serban300/polkadot-sdk that referenced this issue Apr 9, 2024

Updated db weights and some experiments (paritytech#1732)

76f8e44

* updated weights * also fix off-by-one in benchmarks

serban300 pushed a commit to serban300/polkadot-sdk that referenced this issue Apr 9, 2024

Updated db weights and some experiments (paritytech#1732)

3a11aa4

* updated weights * also fix off-by-one in benchmarks

serban300 pushed a commit to serban300/polkadot-sdk that referenced this issue Apr 9, 2024

Updated db weights and some experiments (paritytech#1732)

af4a8ea

* updated weights * also fix off-by-one in benchmarks

serban300 pushed a commit to serban300/polkadot-sdk that referenced this issue Apr 9, 2024

Updated db weights and some experiments (paritytech#1732)

0199b36

* updated weights * also fix off-by-one in benchmarks

serban300 pushed a commit to serban300/polkadot-sdk that referenced this issue Apr 9, 2024

Updated db weights and some experiments (paritytech#1732)

6d1eee1

* updated weights * also fix off-by-one in benchmarks

serban300 pushed a commit to serban300/polkadot-sdk that referenced this issue Apr 10, 2024

Updated db weights and some experiments (paritytech#1732)

2093052

* updated weights * also fix off-by-one in benchmarks

serban300 pushed a commit to serban300/polkadot-sdk that referenced this issue Apr 10, 2024

Updated db weights and some experiments (paritytech#1732)

a5df330

* updated weights * also fix off-by-one in benchmarks

bkchr pushed a commit that referenced this issue Apr 10, 2024

Updated db weights and some experiments (#1732)

df1aed0

* updated weights * also fix off-by-one in benchmarks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Collator using wrong address to connect to Validator #1732

Collator using wrong address to connect to Validator #1732

crystalin commented Sep 27, 2023

crystalin commented Sep 27, 2023

altonen commented Sep 28, 2023

crystalin commented Sep 28, 2023

altonen commented Sep 28, 2023

purestakeoskar commented Sep 28, 2023

Collator using wrong address to connect to Validator #1732

Collator using wrong address to connect to Validator #1732

Comments

crystalin commented Sep 27, 2023

Is there an existing issue?

Experiencing problems? Have you tried our Stack Exchange first?

Description of bug

Steps to reproduce

crystalin commented Sep 27, 2023

altonen commented Sep 28, 2023

crystalin commented Sep 28, 2023

altonen commented Sep 28, 2023

purestakeoskar commented Sep 28, 2023