Block sync performance unreliable on poor Internet connections #2587

Closed
avastmick opened this issue May 20, 2016 · 11 comments

@avastmick

avastmick commented May 20, 2016

UPDATE: there is another roll-up issue, #2569, that focuses on the errors / bugs seen in fast syncing (--fast flag on initial synchronisation). To clarify, the issues I am seeing are present regardless of how the syncing was initiated.

System information

$ geth version
Geth
Version: 1.4.3-stable
Protocol Versions: [63 62 61]
Network Id: 1
Go Version: go1.6.1
OS: linux
GOPATH=
GOROOT=/usr/lib/go-1.6

Expected behaviour

Consistent block synchronisation regardless of quality of network connection. Robust recovery from loss of connection. Suitable messaging to user for sync failure.

Actual behaviour

I live in rural New Zealand and work over a highly contended ADSL (barely) "broadband" connection, which leads to dropped connections, slow connections and high latency. The latency seems to cause considerable issues with Geth finding and retaining peers. If I poll net.peerCount every 10 seconds, I get: [0, 3, 10, 1, 1, 1, 8, 8, 3, 3, 0, 0, 1...] and so on. After a period of low to zero peers the client stops syncing (no log entry beyond the Synchronisation failed: message). Restarting the client restarts the syncing... rinse and repeat.
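For anyone wanting to reproduce the measurement, here is a minimal Python sketch of the peer-count sampling described above. It assumes geth is exposing the standard `net_peerCount` JSON-RPC method, which returns the count as a hex string. The helper names (`parse_peer_count`, `longest_zero_run`) are made up for illustration, and only the parsing and stall-detection logic is shown, not the HTTP call itself.

```python
import json


def parse_peer_count(response_body: str) -> int:
    """Extract the peer count from a net_peerCount JSON-RPC response.

    The result field is a hex-encoded quantity such as "0x3".
    """
    reply = json.loads(response_body)
    return int(reply["result"], 16)


def longest_zero_run(samples):
    """Return the length of the longest run of consecutive zero-peer samples."""
    longest = current = 0
    for count in samples:
        current = current + 1 if count == 0 else 0
        longest = max(longest, current)
    return longest


# Peer counts reported in this issue, sampled roughly every 10 seconds.
samples = [0, 3, 10, 1, 1, 1, 8, 8, 3, 3, 0, 0, 1]
print(parse_peer_count('{"jsonrpc":"2.0","id":1,"result":"0xa"}'))
print(longest_zero_run(samples))
```

A long zero run followed by no further progress would be the pattern to look for when correlating with the Synchronisation failed: message.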

Occurs on both Windows 10 and Ubuntu 16.04 geth clients. The latter is also mobile (laptop). Same results, worse in busy areas (again a contention ratio issue).

If I run the same on a DigitalOcean droplet, I get a steady peer count, no stalls, and a full (not --fast) sync in a couple of hours.

Steps to reproduce the behaviour

Run node on poor network connection - need either 3G mobile in busy area, or a high contention-ratio ADSL.

I'd like to add to this - for information. Geth seems very sensitive to unreliable networks or high contention networks. I get stalling / hanging when at home or on 3G / 4G mobile.

This is an issue as it tends to mean that only low-latency, low-contention ratio connections work and this will undermine the goal of meshing and potential usage in non-developed locations.

@sokoow

sokoow commented May 20, 2016

I'm also getting lots of Synchronisation failed: no peers to keep download active, but on a pretty reliable 50 Mbit connection. What's going on?

@avastmick
Author

Interestingly, there also seems to be no affinity with local nodes that can provide the data, i.e. if I stand up local nodes (on other machines) on the same LAN, these too are lost in the same manner unless I add them manually to a static-nodes.json file on the syncing node.
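As a rough illustration of that workaround, the Python sketch below generates the contents of a static-nodes.json file, which is a JSON array of enode URLs. The node ID and LAN address are placeholders, the helper name `build_static_nodes` is hypothetical, and where the file has to live depends on the geth version and data directory, so check that for your own setup.

```python
import json
import re

# enode://<128 hex chars of node ID>@<host>:<port>
# Simplified check: IPv4 addresses and hostnames only.
ENODE_RE = re.compile(r"^enode://[0-9a-fA-F]{128}@[^@:\s]+:\d+$")


def build_static_nodes(enode_urls):
    """Return the JSON text for a static-nodes.json file.

    Raises ValueError if any entry does not look like an enode URL.
    """
    for url in enode_urls:
        if not ENODE_RE.match(url):
            raise ValueError(f"not a valid enode URL: {url!r}")
    return json.dumps(list(enode_urls), indent=2)


# Placeholder node ID and LAN address, for illustration only.
fake_id = "ab" * 64
print(build_static_nodes([f"enode://{fake_id}@192.168.1.10:30303"]))
```

The real enode URL of each local node has to be read from that node itself; the value above will not connect to anything.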

Ideally, if one or two local nodes are fully synced, they would be the primary providers of data, with the syncing node only back-hauling globally to ensure the data is not corrupt / censored.

I'll set up a play pit and have a look at what is going on, though I'm 2 days behind now and can't catch up...

@bortzmeyer

My experience with geth on a poorly connected machine is the same: extreme brittleness and a lack of resiliency. Even when the network restarts, block synchronization does not.

@obscuren
Contributor

NOTE: All switching occurred while the node was active. I did not start/stop/resume.

Did some syncing using the Network Link Conditioner running at various settings. Started with Very bad internet, which basically means that there’s a 500ms delay, a 1mbps limit and 10% packet loss. Using this setting the node wouldn’t sync. Switching to High Latency DNS obviously didn’t pose much of a problem. Switching to Very Bad Internet again almost instantly dropped all peers and the sync didn’t resume (it also threw in a Rolled back 2048 headers). After switching to 3G mode (780kbps / 330kbps) the download eventually resumed, but it is literally crawling forth, exactly what you’d expect of a 3G network.

Considering you're living in NZ with very few nodes near you, this would be somewhat expected.

What would help us is if you'd run the node as geth --vmodule=p2p=6,downloader=6, let it run for a while, and paste the output in a gist so we can take a better look at your issue.

@avastmick
Author

@obscuren okay thanks. I'll do that and post back results.

@avastmick
Author

avastmick commented May 23, 2016

@obscuren I ran geth as you asked for a couple of minutes. Gist can be found HERE

If this is too short, please say and I'll pipe to a log and attach.

[EDIT] Oh, and I would also be interested in what some of the errors mean 👍

Cheers

@karalabe
Member

@avastmick Could you run a speed test via http://www.speedtest.net/ with a European target and send us your specs? I'm really curious what the network latency is (bandwidth too of course).

@avastmick
Author

@karalabe Okay here we go - http://www.speedtest.net/my-result/5347272962 - target was SW UK

Now with these stats I get it's going to be sloooowwww to sync. I know this (trust me); that's not the point. The point is the fragility for connections like this. If it takes days to sync, that's okay; that's what my bitcoind node took: days. But once synced it is robust and doesn't require me to start all over again, as I seem to have to now with geth. I'd hope that once synced the maintenance to keep the node up-to-date is minimal and even a connection like mine can keep up and remain synced.

Thanks for looking into this.

@karalabe
Member

Currently we have quite strict timeouts in place, which are aimed at ensuring some connectivity guarantees and avoiding stalling attacks. Unfortunately, if you yourself are a very remote node with no other peers close by, this essentially causes you to consider everyone else a bad peer :D We're considering adding a CLI flag that would bump the timeouts; we just need to make sure we don't affect performance and/or security adversely.
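To illustrate the trade-off described here (this is not geth's implementation), the Python sketch below contrasts a fixed timeout with one derived from measured round-trip times, using the smoothing rules from RFC 6298 (TCP's retransmission timer) and omitting its clock-granularity term and 1-second floor. The 0.3 s fixed timeout, the sample RTTs and the class name `RttTimeout` are all hypothetical.

```python
class RttTimeout:
    """Adaptive timeout following the RFC 6298 smoothing rules."""

    ALPHA = 1 / 8  # gain for the smoothed RTT
    BETA = 1 / 4   # gain for the RTT variance

    def __init__(self):
        self.srtt = None
        self.rttvar = None

    def observe(self, rtt: float) -> float:
        """Feed one RTT sample (seconds) and return the updated timeout."""
        if self.srtt is None:
            # First measurement initialises both estimates.
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            # RTTVAR is updated first, using the previous SRTT.
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - rtt)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt
        return self.srtt + 4 * self.rttvar


FIXED_TIMEOUT = 0.3            # hypothetical strict timeout, in seconds
remote_rtts = [0.5, 0.5, 0.6]  # hypothetical RTTs seen by a remote node

# With the fixed timeout, every sample looks like a failed peer.
print([rtt > FIXED_TIMEOUT for rtt in remote_rtts])

# With the adaptive timeout, the limit tracks what the link can deliver.
estimator = RttTimeout()
print([round(estimator.observe(rtt), 4) for rtt in remote_rtts])
```

The catch, as noted above, is that a timeout which stretches to fit slow peers also gives a stalling attacker more room, so any such scheme would need an upper bound.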

@alexfisher

Having repeat "Synchronisation failed: no peers to keep download active" messages here in Michigan, too. This has been quite annoying today.

@karalabe
Member

karalabe commented Jun 7, 2016

All these issues should be solved by our latest 1.4.6 release. Please try with that and open a new ticket if some issues still persist.

@karalabe karalabe closed this as completed Jun 7, 2016
maoueh pushed a commit to streamingfast/go-ethereum that referenced this issue Aug 14, 2024
6 participants