swarm: better backoff logic #1554

Stebalien · 2017-10-18T20:46:35Z

We should try to distinguish between local failures and remote failures. At the very least, we should be resetting our backoffs when new links/routes come online.
We should probably be backing off on a per-multiaddr basis, not a per-peer basis, unless we establish a connection to the peer and it tells us to go away (that needs a new protocol; related to "disconnect" protocol/message #238).
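
To make the per-multiaddr idea concrete, here is a minimal sketch that keys backoff state by address string and drops all state when a new link or route appears. Every name in it (`addrBackoff`, `Fail`, `Blocked`, `Reset`) and the one-second-per-failure delay are assumptions for illustration, not the swarm's actual types or schedule.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// backoffEntry holds the failure state for a single address.
type backoffEntry struct {
	tries int
	until time.Time
}

// addrBackoff tracks dial backoff per address string rather than per peer.
// Hypothetical type; not part of go-libp2p-swarm.
type addrBackoff struct {
	mu      sync.Mutex
	entries map[string]*backoffEntry
}

func newAddrBackoff() *addrBackoff {
	return &addrBackoff{entries: make(map[string]*backoffEntry)}
}

// Fail records a failed dial to addr and pushes its retry time out.
func (b *addrBackoff) Fail(addr string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	e, ok := b.entries[addr]
	if !ok {
		e = &backoffEntry{}
		b.entries[addr] = e
	}
	e.tries++
	// Illustrative delay only: one second per recorded failure.
	e.until = time.Now().Add(time.Duration(e.tries) * time.Second)
}

// Blocked reports whether addr is still backing off.
func (b *addrBackoff) Blocked(addr string) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	e, ok := b.entries[addr]
	return ok && time.Now().Before(e.until)
}

// Reset forgets all failures, e.g. when a new local link or route comes online.
func (b *addrBackoff) Reset() {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.entries = make(map[string]*backoffEntry)
}

func main() {
	b := newAddrBackoff()
	b.Fail("/ip6/::1/tcp/4001")
	fmt.Println("ip6 blocked:", b.Blocked("/ip6/::1/tcp/4001"))
	fmt.Println("ip4 blocked:", b.Blocked("/ip4/127.0.0.1/tcp/4001"))
	b.Reset()
	fmt.Println("ip6 blocked after reset:", b.Blocked("/ip6/::1/tcp/4001"))
}
```

With state keyed this way, a failing /ip6 address does not block dials to the same peer's /ip4 address.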

mishto · 2018-01-24T23:05:20Z

Can we expose baseBackoffTime and maxBackoffTime? the default values are arbitrary and different applications may want different settings.
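
One possible shape for exposing these settings is sketched below. `BackoffConfig`, its `Base` and `Max` fields, and the doubling schedule are assumptions made for illustration; they are not an existing swarm API.

```go
package main

import (
	"fmt"
	"time"
)

// BackoffConfig is a hypothetical settings struct. Its fields echo the
// baseBackoffTime and maxBackoffTime values mentioned above.
type BackoffConfig struct {
	Base time.Duration // delay after the first failure
	Max  time.Duration // upper bound on any single delay
}

// Delay returns the wait after the given number of consecutive failures,
// doubling from Base and clamping at Max.
func (c BackoffConfig) Delay(failures int) time.Duration {
	if failures <= 0 {
		return 0
	}
	d := c.Base
	for i := 1; i < failures; i++ {
		// Stop before doubling would exceed Max (this also avoids overflow).
		if d > c.Max/2 {
			return c.Max
		}
		d *= 2
	}
	if d > c.Max {
		return c.Max
	}
	return d
}

func main() {
	cfg := BackoffConfig{Base: 5 * time.Second, Max: 5 * time.Minute}
	for n := 1; n <= 8; n++ {
		fmt.Printf("failure %d: wait %s\n", n, cfg.Delay(n))
	}
}
```

An application with different latency or churn characteristics could then pass its own `Base` and `Max` instead of relying on defaults.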

Stebalien · 2018-01-26T04:05:08Z

Fair enough. Also, it looks like our backoffs aren't actually exponential...

Stebalien · 2018-01-26T04:40:34Z

This will be fixed in the large refactor/simplification that's coming down the pipe.

Stebalien · 2018-01-26T04:55:37Z

Note to self: Refund backoff "tries" after a period of time. Currently, if we go to max-backoff, wait an hour, and then fail a single dial, we'll wait the max backoff again. We should, instead, notice that an hour has passed and forget all the previous failures.

Code:

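	now := time.Now()
	// Only refund once more than BackoffBase has elapsed since the backoff
	// expired; otherwise the square root below would see a negative value.
	if sinceLast := now.Sub(bp.until); sinceLast > BackoffBase {
		// Refund backoff time at the same rate.
		refund := int(math.Sqrt(float64(sinceLast-BackoffBase) / float64(BackoffCoef)))
		if refund < bp.tries {
			bp.tries -= refund
		} else {
			bp.tries = 0
		}
	}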

Not going to do this now because we have so many other changes in the pipeline and we may want to discuss this.
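
For anyone who wants to experiment with the refund idea in isolation, here is a self-contained sketch. It assumes a delay schedule of `backoffBase + backoffCoef*tries^2` and picks the constant values arbitrarily; `refundTries` is a hypothetical helper, not code from the swarm.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// Assumed constants and schedule, chosen for illustration only:
// delay(tries) = backoffBase + backoffCoef*tries^2.
const (
	backoffBase = 5 * time.Second
	backoffCoef = time.Second
)

// refundTries returns how many tries remain after idle time has passed since
// the last backoff expired, by inverting the assumed delay formula.
func refundTries(tries int, idle time.Duration) int {
	if idle <= backoffBase {
		// Too little time has passed to refund anything.
		return tries
	}
	refund := int(math.Sqrt(float64(idle-backoffBase) / float64(backoffCoef)))
	if refund >= tries {
		return 0
	}
	return tries - refund
}

func main() {
	for _, idle := range []time.Duration{time.Second, 14 * time.Second, time.Hour} {
		fmt.Printf("idle %s: tries 10 -> %d\n", idle, refundTries(10, idle))
	}
}
```

Under these assumptions, an hour of idle time refunds far more than ten tries, so the next failed dial starts again from the shortest backoff.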

mishto · 2018-01-29T16:11:51Z

Sounds good, thanks.

Stebalien · 2020-03-03T06:57:30Z

Working through all the different backoff cases:

Backoff trying to find a peer.
- This definitely belongs down in the DHT, or as a wrapper around the DHT.
Backoff a port/ip because a TCP dial failed.
- This could happen inside the transport or inside the swarm itself.
  - If it happens inside the transport, we'd need a shared backoff module for backing off dialing multiaddrs with certain prefixes.
  - If it happens inside the swarm, we'd need some way to report the backoff to the swarm. We'd probably do this by returning a special error.
Backoff an IP when we get a "no route to IP" error.
- Same as above.
Backoff a port/ip/peer triple when we end up dialing the wrong peer.
- Same as above.
Backoff a peer/transport when we fail to negotiate a muxer/security transport.
- This is an interesting case. Really, we want to back off the entire peer for all transports using the upgrader. This is a case where applying the backoff from within the transport is really the only solution that makes sense (as the transport knows what sub-transports it uses).
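
The "special error" option from the list above could look roughly like the sketch below. `BackoffError`, its fields, `dialTCP`, and `backoffScope` are all hypothetical names invented for this example; no such type exists in the libp2p interfaces discussed here.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// BackoffError is a hypothetical error a transport could return to tell the
// swarm what to back off and for how long.
type BackoffError struct {
	Scope string        // address prefix to back off, e.g. "/ip4/10.0.0.1"
	For   time.Duration // suggested backoff duration
	Err   error         // underlying dial error
}

func (e *BackoffError) Error() string {
	return fmt.Sprintf("dial failed, back off %s for %s: %v", e.Scope, e.For, e.Err)
}

// Unwrap exposes the underlying error to errors.Is and errors.As.
func (e *BackoffError) Unwrap() error { return e.Err }

// dialTCP stands in for a transport dial that always fails with a routing
// error and asks the caller to back off the whole IP, not just the port.
func dialTCP(addr string) error {
	return &BackoffError{
		Scope: "/ip4/10.0.0.1",
		For:   time.Minute,
		Err:   errors.New("no route to host"),
	}
}

// backoffScope extracts the scope from err if it carries a BackoffError
// anywhere in its wrap chain.
func backoffScope(err error) (string, bool) {
	var be *BackoffError
	if errors.As(err, &be) {
		return be.Scope, true
	}
	return "", false
}

func main() {
	err := fmt.Errorf("swarm dial: %w", dialTCP("/ip4/10.0.0.1/tcp/4001"))
	if scope, ok := backoffScope(err); ok {
		fmt.Println("backing off", scope)
	}
}
```

Because the scope travels with the error, the swarm would not need to know which prefix a given failure implicates; the transport decides.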

This tries to provide a simple-to-reason-about solution to the list of problems in https://github.com/libp2p/go-libp2p-swarm/issues/37

Stebalien · 2020-04-01T22:24:37Z

Status: While @petar's patches are likely the right way to go in the future, they introduce quite a few new interfaces that'll need to be discussed. In the interest of getting a fast fix in, @willscott is implementing (#191) a dumb version that just backs off full addresses inside the swarm itself without changing core libp2p interfaces.

That gives us some breathing room.

This was referenced Oct 18, 2017

backoff triggered inappropriately by context canceled errors from DHT queries. libp2p/go-libp2p-kad-dht#96

Closed

bootstrap: without working /ip6, dial backoff also affects /ip4 ipfs/kubo#4342

Closed

Stebalien mentioned this issue Nov 30, 2017

Ipfs gets into bad state after computer sleep ipfs/kubo#2777

Open

Stebalien mentioned this issue Mar 20, 2018

swarm: make dial backoffs configurable #1549

Open

Stebalien mentioned this issue Apr 4, 2018

Bootstrapping bundle #304

Closed

Stebalien mentioned this issue Jun 25, 2018

Routed hosts should search for new addresses if the available adresses do not work issue#351 #366

Closed

aschmahmann mentioned this issue Oct 25, 2019

Add Backoff Cache Discovery libp2p/go-libp2p-discovery#26

Merged

Stebalien mentioned this issue Jan 29, 2020

dht/find-peers testplan dial backoff problem testground/testground#417

Closed

petar referenced this issue in libp2p/go-libp2p-core Mar 5, 2020

Propose a backoff system for managing interdependent timers.

f8ef82c

This tries to provide a simple-to-reason-about solution to the list of problems in https://github.com/libp2p/go-libp2p-swarm/issues/37

petar mentioned this issue Mar 5, 2020

Propose a backoff system for managing interdependent timers. libp2p/go-libp2p-core#127

Open

petar referenced this issue in libp2p/go-libp2p-core Mar 5, 2020

Propose a backoff system for managing interdependent timers.

df9ffe0

This tries to provide a simple-to-reason-about solution to the list of problems in https://github.com/libp2p/go-libp2p-swarm/issues/37

marten-seemann changed the title ~~Better backoff logic~~ swarm: better backoff logic May 25, 2022

marten-seemann transferred this issue from libp2p/go-libp2p-swarm May 25, 2022
