racy connection management makes channel stall #180

Open
raulk opened this issue Mar 26, 2021 · 2 comments

@raulk
Member

raulk commented Mar 26, 2021

When running the whitenoise tests with an interruption policy of 0.2/1s (every 1s, there is a 20% probability that the connection is interrupted), opening the push channel seems to block forever.

A stall like this is really bad: the system can't make progress, and the sender is effectively stuck because go-data-transfer never returns control.

Output:

Mar 26 15:14:26.862276	INFO	2.9067s    MESSAGE << receiver[000] (a06fd2) >> all networks configured
Mar 26 15:14:26.862369	INFO	2.9067s    MESSAGE << receiver[000] (a06fd2) >> transfer starting
Mar 26 15:14:26.862412	INFO	2.9067s    MESSAGE << receiver[000] (a06fd2) >> we are the receiver
Mar 26 15:14:26.865175	INFO	2.9097s    MESSAGE << sender[000] (154eca) >> all networks configured
Mar 26 15:14:26.865301	INFO	2.9097s    MESSAGE << sender[000] (154eca) >> transfer starting
Mar 26 15:14:26.865394	INFO	2.9097s    MESSAGE << sender[000] (154eca) >> we are the sender
Mar 26 15:14:30.136669	INFO	6.1811s    MESSAGE << sender[000] (154eca) >> import took: 3.27130671s
Mar 26 15:14:32.643670	INFO	8.6882s    MESSAGE << sender[000] (154eca) >> interruptor closing connection ------------
Mar 26 15:14:32.643786	INFO	8.6884s    MESSAGE << sender[000] (154eca) >> opening the push data channel
Mar 26 15:14:34.650872	INFO	10.6954s    MESSAGE << sender[000] (154eca) >> interruptor closing connection ------------
Mar 26 15:14:37.668708	INFO	13.7132s    MESSAGE << sender[000] (154eca) >> interruptor closing connection ------------
Mar 26 15:14:44.188653	INFO	20.2332s    MESSAGE << sender[000] (154eca) >> interruptor closing connection ------------
Mar 26 15:14:48.857653	INFO	24.9022s    MESSAGE << sender[000] (154eca) >> interruptor closing connection ------------

Here are three goroutine traces 2 minutes apart from one another:

stall.zip

@dirkmc
Contributor

dirkmc commented Mar 26, 2021

If we fail to even open the channel, the expected behaviour is that the transfer fails immediately (it doesn't try to restart).

Once the channel is open (once we receive an Accept from the other side) it should attempt restarts.
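
The expected behaviour described above can be sketched as a small decision function. The type and constant names here are hypothetical and are not go-data-transfer's own; the sketch only encodes the rule "fail before Accept, restart after Accept".

```go
package main

import "fmt"

// channelState is a simplified, hypothetical view of the channel lifecycle.
type channelState int

const (
	stateRequested channelState = iota // open request sent, no Accept received yet
	stateAccepted                      // Accept received from the other side
)

// action is what the sender is expected to do when the connection drops.
type action int

const (
	actionFailTransfer action = iota
	actionRestart
)

func (a action) String() string {
	if a == actionRestart {
		return "restart"
	}
	return "fail transfer"
}

// onDisconnect returns the expected reaction to a dropped connection:
// restart only if the channel was already accepted, otherwise fail at once.
func onDisconnect(s channelState) action {
	if s == stateAccepted {
		return actionRestart
	}
	return actionFailTransfer
}

func main() {
	fmt.Println("disconnect before Accept:", onDisconnect(stateRequested))
	fmt.Println("disconnect after Accept: ", onDisconnect(stateAccepted))
}
```

Under this rule neither branch blocks, so a disconnect during opening should surface as an immediate failure.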

Note also that you have to explicitly set the config to enable reconnect behaviour; see the config in lotus:
https://github.com/filecoin-project/lotus/blob/885ecb97ad631fc64f538034390648e4da69966c/node/modules/client.go#L126-L140

@raulk
Member Author

raulk commented Mar 26, 2021

  1. What I'm observing is that the opening blocks entirely -- it does not fail immediately.
  2. Whitenoise is already setting the retry params: https://github.com/raulk/whitenoise/blob/master/testplan/main.go#L105-L113
