Block sync performance unreliable on poor Internet connections #2587

avastmick · 2016-05-20T02:13:03Z

UPDATE: there is another roll-up issue, #2569, that focuses on the errors/bugs seen in fast syncing (the --fast flag on initial synchronisation). To clarify, the issues I am seeing are present regardless of how syncing was initiated.

System information

$ geth version
Geth
Version: 1.4.3-stable
Protocol Versions: [63 62 61]
Network Id: 1
Go Version: go1.6.1
OS: linux
GOPATH=
GOROOT=/usr/lib/go-1.6

Expected behaviour

Consistent block synchronisation regardless of quality of network connection. Robust recovery from loss of connection. Suitable messaging to user for sync failure.

Actual behaviour

I live in rural New Zealand and work over ADSL "broadband" (barely) that is highly contended, which leads to dropped connections, slow connections and high latency. The latter seems to cause considerable issues with Geth finding and retaining peers. If I poll net.peerCount in a loop every 10 seconds, I get values like [0, 3, 10, 1, 1, 1, 8, 8, 3, 3, 0, 0, 1...] and so on. After a period of low to zero peers the client stops syncing altogether (no log entries after the "Synchronisation failed:" message). Restart the client and syncing restarts... rinse and repeat.
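For reference, the console's `net.peerCount` is backed by the `net_peerCount` JSON-RPC method, which returns the count as a hex-encoded quantity (for example `"0xa"` for 10). The following minimal Go sketch assumes the node's JSON-RPC interface is enabled and the response body has already been fetched. It decodes that value and measures the longest zero-peer stretch in a series of samples. The helper names are hypothetical, and the HTTP polling loop is omitted.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
)

// rpcResponse mirrors the parts of a JSON-RPC 2.0 reply that matter here.
type rpcResponse struct {
	Result string `json:"result"`
	Error  *struct {
		Message string `json:"message"`
	} `json:"error"`
}

// parsePeerCount decodes the hex quantity (e.g. "0x3") returned by net_peerCount.
func parsePeerCount(body []byte) (uint64, error) {
	var resp rpcResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return 0, err
	}
	if resp.Error != nil {
		return 0, fmt.Errorf("rpc error: %s", resp.Error.Message)
	}
	if !strings.HasPrefix(resp.Result, "0x") {
		return 0, fmt.Errorf("unexpected quantity %q", resp.Result)
	}
	return strconv.ParseUint(resp.Result[2:], 16, 64)
}

// longestZeroRun returns the longest stretch of consecutive samples with no peers.
func longestZeroRun(samples []uint64) int {
	longest, current := 0, 0
	for _, n := range samples {
		if n == 0 {
			current++
			if current > longest {
				longest = current
			}
		} else {
			current = 0
		}
	}
	return longest
}

func main() {
	n, err := parsePeerCount([]byte(`{"jsonrpc":"2.0","id":1,"result":"0xa"}`))
	fmt.Println(n, err) // 10 <nil>

	// Samples taken every 10 seconds, as quoted in the report above.
	samples := []uint64{0, 3, 10, 1, 1, 1, 8, 8, 3, 3, 0, 0, 1}
	fmt.Println(longestZeroRun(samples)) // 2
}
```

On the quoted samples, the longest peerless stretch is two consecutive readings, i.e. roughly 20 seconds at a 10-second polling interval.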

This occurs with geth on both Windows 10 and Ubuntu 16.04. The latter runs on a laptop, so it is also used on mobile connections. The results are the same, and worse in busy areas (again, a contention-ratio issue).

If I run the same on a Digital Ocean droplet, I get a steady peer count, no stalls, and a full (not --fast) sync in a couple of hours.

Steps to reproduce the behaviour

Run a node on a poor network connection: either 3G mobile in a busy area, or a high-contention-ratio ADSL line.

I'd like to add to this, for information: Geth seems very sensitive to unreliable or highly contended networks. I get stalling/hanging both at home and on 3G/4G mobile.

This is an issue because it tends to mean that only low-latency, low-contention-ratio connections work, which will undermine the goal of meshing and potential usage in less developed locations.


sokoow · 2016-05-20T18:59:36Z

I'm also getting lots of "Synchronisation failed: no peers to keep download active" messages, but on a pretty reliable 50 Mbit connection. What's going on?

avastmick · 2016-05-20T21:17:09Z

Interestingly, there also seems to be no affinity with local nodes that can provide the data. If I stand up nodes on other machines on the same LAN, these too are lost in the same manner unless I add them manually to a static-nodes.json file on the syncing node.
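`static-nodes.json` is a JSON array of enode URLs that geth reads from its data directory (the exact path has moved between releases). The sketch below renders a list of LAN peers in that format. The helper name and the placeholder node IDs are hypothetical; a real enode URL carries the peer's full 128-hex-character node ID.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// staticNodesJSON renders enode URLs as the JSON array of strings that
// static-nodes.json contains. It only does a shallow sanity check.
func staticNodesJSON(enodes []string) ([]byte, error) {
	if enodes == nil {
		enodes = []string{} // marshal as [] rather than null
	}
	for _, e := range enodes {
		if !strings.HasPrefix(e, "enode://") || !strings.Contains(e, "@") {
			return nil, fmt.Errorf("not an enode URL: %q", e)
		}
	}
	return json.MarshalIndent(enodes, "", "  ")
}

func main() {
	// Hypothetical LAN peers; NODE_ID_n stands in for each node's real ID.
	lan := []string{
		"enode://NODE_ID_1@192.168.1.10:30303",
		"enode://NODE_ID_2@192.168.1.11:30303",
	}
	out, err := staticNodesJSON(lan)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out)) // contents to save as static-nodes.json
}
```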

Ideally, if one or two local nodes are fully synced, they would be the primary providers of data, with back-hauling to global peers only to ensure the data is not corrupt/censored.

I'll set up a play pit and have a look at what is going on, though I'm 2 days behind now and can't catch up...

bortzmeyer · 2016-05-22T10:21:03Z

My experience with geth on a poorly connected machine is the same: extreme brittleness and lack of resiliency. Even when the network comes back up, block synchronization does not restart.

obscuren · 2016-05-23T08:36:34Z

NOTE: All switching occurred while the node was active. I did not start/stop/resume it.

Did some syncing using the Network Link Conditioner at various settings. I started with just "Very Bad Network", which basically means a 500 ms delay, bandwidth limited to 1 Mbps and 10% packet loss. With this setting the node wouldn't sync. Switching to "High Latency DNS" obviously didn't pose much of a problem. Switching back to "Very Bad Network" almost instantly dropped all peers, and syncing didn't resume (it also logged a "Rolled back 2048 headers"). After switching to "3G" mode (780 kbps down / 330 kbps up), the download eventually resumed, but it is literally crawling along, exactly what you'd expect of a 3G network.

Considering you're living in NZ with very few nodes near you, this would be somewhat expected.

It would help us if you ran the node as geth --vmodule=p2p=6,downloader=6, let it run for a while, and pasted the output into a gist so we can take a better look at your issue.

avastmick · 2016-05-23T08:50:34Z

@obscuren okay thanks. I'll do that and post back results.

avastmick · 2016-05-23T08:58:21Z

@obscuren I ran geth as you asked for a couple of minutes. The gist can be found HERE

If this is too short, please say so and I'll pipe the output to a log file and attach it.

[EDIT] Oh, and I'd be interested in what some of the errors mean too 👍

Cheers

karalabe · 2016-05-23T09:27:51Z

@avastmick Could you run a speed test via http://www.speedtest.net/ with a European target and send us your specs? I'm really curious what the network latency is (bandwidth too of course).

avastmick · 2016-05-23T09:56:06Z

@karalabe Okay here we go - http://www.speedtest.net/my-result/5347272962 - target was SW UK

Now, with these stats, I get that it's going to be sloooowwww to sync. I know this (trust me); that's not the point. The point is the fragility on connections like this. If it takes days to sync, that's okay; that's what my bitcoind node took: days. But once synced, it is robust and doesn't require me to start all over again, as I seem to have to now with geth. I'd hope that once synced, the maintenance to keep the node up to date is minimal and even a connection like mine can keep up and remain synced.

Thanks for looking into this.

karalabe · 2016-05-23T10:01:32Z

Currently we have quite strict timeouts in place, which are aimed at ensuring some connectivity guarantees and avoiding stalling attacks. Unfortunately, if you yourself are a very remote node with no other peers close by, this essentially causes you to consider everyone else a bad peer :D We're considering adding a CLI flag that would bump the timeouts; we just need to make sure we don't adversely affect performance and/or security.

alexfisher · 2016-05-29T01:37:02Z

Having repeated "Synchronisation failed: no peers to keep download active" messages here in Michigan too. This has been quite annoying today.

karalabe · 2016-06-07T07:53:30Z

All these issues should be solved by our latest 1.4.6 release. Please try with that and open a new ticket if some issues still persist.


avastmick mentioned this issue May 23, 2016

eth: make fast sync great again #2569

Closed

7 tasks

karalabe closed this as completed Jun 7, 2016

maoueh pushed a commit to streamingfast/go-ethereum that referenced this issue Aug 14, 2024

consensus/parlia: support recovery when snapshot of parlia gone in di…

e988d15

…sk (ethereum#2587)
