ETCM-168: Discovery part4 #770

Merged 42 commits into develop on Nov 17, 2020

Conversation

@aakoshh (Contributor) commented Nov 3, 2020

Description

Replaces the node discovery currently in Mantis with the one in Scalanet, implementing the Ethereum Discovery v4 spec.

Requires fixes from input-output-hk/scalanet#100

Important changes

Mantis previously lacked automation to resolve its own external IP address. It used to send 0.0.0.0 in the Ping messages as its own address, which the other nodes probably ignored in favour of the address visible on the connection. However, clients that fully implement the discovery protocol take the address from the ENR, so it has to be set correctly.

The PR adds an external IP resolution mechanism; however, if that were to fail, the address can and should be set manually with mantis.network.discovery.host. If it's not set, the discovery module will log an error and not discover new peers; it will only serve previously discovered and persisted ones. See the example in the Testing section below for setting it via command line arguments.
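
For illustration only, a minimal sketch of what such a resolution mechanism with a manual fallback could look like; the object name, the IP-echo service and the config handling are assumptions for this example, not Mantis' actual code:

import java.net.InetAddress
import scala.io.Source
import scala.util.Try

object ExternalIp {
  // Ask a public "what is my IP" service; fall back to the configured host,
  // mirroring the mantis.network.discovery.host override described above.
  def resolve(configuredHost: Option[String]): Option[InetAddress] = {
    val detected = Try {
      val ip = Source.fromURL("https://ifconfig.me/ip").mkString.trim
      InetAddress.getByName(ip)
    }.toOption
    detected.orElse(configuredHost.flatMap(h => Try(InetAddress.getByName(h)).toOption))
  }
}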

Testing

To test it, package Mantis, extract it, and start it with discovery and metrics enabled:

sbt dist
cd target/universal
unzip mantis-3.0.zip 
rm -rf ~/.mantis/mordor
./mantis-3.0/bin/mantis-launcher mordor \
  -Dmantis.metrics.enabled=true \
  -Dmantis.network.discovery.discovery-enabled=true \
  -Dmantis.network.discovery.reuse-known-nodes=false \
  -Dmantis.network.discovery.scan-interval=1.minute \
  -Dmantis.network.discovery.host=$(curl -s ifconfig.me/ip)

With metrics running, the number of discovered peers can be checked with curl:

$ curl -s http://127.0.0.1:13798 | grep discovery
# HELP app_network_discovery_foundPeers_gauge  
# TYPE app_network_discovery_foundPeers_gauge gauge
app_network_discovery_foundPeers_gauge 116.0

I needed fixes in Scalanet at the same time. Those were published locally and copied to Mantis before it was started:

mill scalanet.discovery.publishLocal
cp ~/.ivy2/local/io.iohk/scalanet-discovery_2.12/0.4-SNAPSHOT/jars/scalanet-discovery_2.12.jar ../mantis/target/universal/mantis-3.0/lib/io.iohk.scalanet-discovery_2.12-0.4-SNAPSHOT.jar 

The mordor config comes with 20-something bootstrap nodes, but I noticed that only a few of them seem to work. These can be added to mantis-3.0/conf/mordor.conf to cut down the noise from unreachable nodes. (I also temporarily disabled syncing and consensus to see anything in the logs; otherwise they fill up very quickly and get zipped up by log rotation.)

include "app.conf"

mantis {
  blockchains {
    network = "mordor"
    mordor {
      bootstrap-nodes = [
"enode://15b6ae4e9e18772f297c90d83645b0fbdb56667ce2d747d6d575b21d7b60c2d3cd52b11dec24e418438caf80ddc433232b3685320ed5d0e768e3972596385bfc@51.158.191.43:41235", // @q9f core-geth mizar
"enode://651b484b652c07c72adebfaaf8bc2bd95b420b16952ef3de76a9c00ef63f07cca02a20bd2363426f9e6fe372cef96a42b0fec3c747d118f79fd5e02f2a4ebd4e@51.158.190.99:45678", // @q9f core-geth lyrae
"enode://534d18fd46c5cd5ba48a68250c47cea27a1376869755ed631c94b91386328039eb607cf10dd8d0aa173f5ec21e3fb45c5d7a7aa904f97bc2557e9cb4ccc703f1@51.158.190.99:30303", // @q9f besu lyrae
"enode://642cf9650dd8869d42525dbf6858012e3b4d64f475e733847ab6f7742341a4397414865d953874e8f5ed91b0e4e1c533dee14ad1d6bb276a5459b2471460ff0d@157.230.152.87:30303", // @meowsbits but don't count on it
      ]
    }
  }
}

The logs also indicate how many peers have been discovered:

...
2020-11-03 13:56:31,569 [mantis_system-akka.actor.default-dispatcher-7] INFO  i.i.e.network.PeerManagerActor akka://mantis_system/user/peer-manager - Discovered 103 nodes, Blacklisted 95 nodes, connected to 9/85. Trying to connect to 0 more nodes.
...

@aakoshh marked this pull request as ready for review on November 3, 2020 17:29
@ntallar left a comment

> I needed fixes in Scalanet at the same time. Those were published locally and copied to Mantis before it was started:

Should a new version of scalanet be published then? Which commit is required? (for me to test it)

> These can be added to mantis-3.0/conf/mordor.conf to cut down the noise from unreachable nodes

Where did you get the new bootstrap nodes? So far we have been copying the bootstrap nodes from ETC's geth, though we haven't updated them in more than a month.

But if you found better bootstrap nodes we should definitely add them!

@aakoshh (Contributor, Author) commented Nov 5, 2020

> Should a new version of scalanet be published then? Which commit is required? (for me to test it)

It's been published now; it was the PR I referred to in the description.

> Where did you get the new bootstrap nodes? So far we have been copying the bootstrap nodes from ETC's geth, though we haven't updated them in more than a month. But if you found better bootstrap nodes we should definitely add them!

Sorry, I was vague: by "adding to ./conf/mordor.conf" I meant adding them to the unzipped config file, to override the default values. These 4 nodes are a subset of the defaults. I got them from looking at the block sync logs; they were the only ones logged as serving any data.

@aakoshh (Contributor, Author) commented Nov 6, 2020

I recommend we also include input-output-hk/scalanet#103; it makes the initial lookup phase much faster on mordor.

@aakoshh (Contributor, Author) commented Nov 12, 2020

When trying to connect to mainnet, it turned out that the settings from the PR description led to a rather slow uptake of nodes: it discovered ~1000 nodes in ~30 minutes. Trying these different settings was faster:

./mantis-3.0/bin/mantis-launcher etc \
  -Dmantis.metrics.enabled=true \
  -Dmantis.network.discovery.discovery-enabled=true \
  -Dmantis.network.discovery.reuse-known-nodes=false \
  -Dmantis.network.discovery.scan-interval=30.seconds \
  -Dmantis.network.discovery.kademlia-bucket-size=1024 \
  -Dmantis.network.discovery.kademlia-alpha=1024

The difference is that the algorithm will always keep the closest 1024 nodes in the lookup loop, not just 16, so it will try to bond with many more at the same time. That resulted in ~500 nodes discovered in 2 minutes and ~2100 in ~14 minutes. However, most nodes rejected the TCP handshake because they already had too many peers, and this was the point where the node found 3 peers it could use to pick a pivot block (it needs a majority agreement between them; arguably 2 peers would be enough as long as they agree).
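
To illustrate why raising these two parameters speeds up the uptake, here is a much-simplified sketch of a Kademlia-style lookup loop; the names and structure are assumptions for illustration, not Scalanet's actual implementation:

import scala.annotation.tailrec

final case class Node(id: BigInt)

object Lookup {
  // Simplified Kademlia lookup: keep only the `bucketSize` closest known
  // candidates and query up to `alpha` of them per round.
  def lookup(
      target: BigInt,
      seeds: Set[Node],
      bucketSize: Int,             // kademlia-bucket-size
      alpha: Int,                  // kademlia-alpha
      findNodes: Node => Set[Node] // ask a peer for neighbours of `target`
  ): Set[Node] = {
    def distance(n: Node): BigInt = n.id ^ target // XOR metric

    @tailrec
    def loop(known: Set[Node], queried: Set[Node]): Set[Node] = {
      // With bucket-size=1024 many more candidates survive each round than
      // with the default 16, so far more bonding happens concurrently.
      val closest = known.toSeq.sortBy(distance).take(bucketSize)
      val toQuery = closest.filterNot(queried).take(alpha)
      if (toQuery.isEmpty) closest.toSet
      else loop(known ++ toQuery.flatMap(findNodes), queried ++ toQuery)
    }

    loop(seeds, Set.empty)
  }
}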

The handshaked nodes are persisted, so restarting with normal settings later should see the node reconnect to those:

./mantis-3.0/bin/mantis-launcher etc \
  -Dmantis.metrics.enabled=true \
  -Dmantis.network.discovery.discovery-enabled=true \
  -Dmantis.network.discovery.reuse-known-nodes=true \
  -Dmantis.network.discovery.scan-interval=30.seconds \
  -Dmantis.network.discovery.kademlia-bucket-size=16 \
  -Dmantis.network.discovery.kademlia-alpha=3

However, after a restart it could still only connect to 2 nodes and failed to pick a pivot block, possibly because the other side thought we were already connected and it took them some time to clean up the broken connection.

@KonradStaniec (Contributor) commented

@aakoshh @ntallar my suggestion is to:

  • tweak the default params in this PR to enable quicker finding of nodes
  • create a task to investigate why geth is able to discover enough peers quickly with default params (I remember it took like 2-3 minutes to start fast sync after starting geth)

wdyt?

@aakoshh (Contributor, Author) commented Nov 12, 2020

Some of the default params can be tweaked; for example, the discovery interval of 15 minutes is surely not frequent enough. Other clients like trinity only did discovery when they didn't have enough candidate peers, without periodic lookups, while the Go client does a self lookup and 3 random lookups every 30 minutes. The bucket size is more or less part of the spec, I think. Alpha can be raised if needed, up to the bucket size (I don't think trinity used alpha; the Go client does).
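
As a rough sketch of the Go-client-style schedule described above (a self lookup plus a few random lookups on a timer); the class and callbacks are hypothetical, not geth or Mantis code:

import java.util.concurrent.{Executors, TimeUnit}
import scala.util.Random

class DiscoveryScheduler(
    selfLookup: () => Unit,      // look up our own node ID to refresh neighbours
    randomLookup: BigInt => Unit // look up a random target ID
) {
  private val scheduler = Executors.newSingleThreadScheduledExecutor()

  // Every `intervalMinutes`, do one self lookup plus `randomLookups` random ones.
  def start(intervalMinutes: Long = 30, randomLookups: Int = 3): Unit =
    scheduler.scheduleAtFixedRate(
      () => {
        selfLookup()
        (1 to randomLookups).foreach(_ => randomLookup(BigInt(256, Random)))
      },
      0,
      intervalMinutes,
      TimeUnit.MINUTES
    )

  def stop(): Unit = scheduler.shutdown()
}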

We can have another look at the go-ethereum codebase too if you think it would be useful.

@ntallar left a comment

I agree with what Konrad mentioned about creating a separate task to tweak the parameters! What we have is probably good enough to start deploying on the testnet.

@aakoshh (Contributor, Author) commented Nov 12, 2020

By the way @ntallar, I'm not sure exactly how it works in Mantis now, but I think the situation where we can't connect to almost anyone out of 2000+ nodes because they all have too many peers could be improved in the new network. For example, take a look at the grafting and pruning strategy in gossipsub. Perhaps our node could also have a min-target-max range for incoming/outgoing connections and drop some when it goes above the threshold, opening up room for newcomers.
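
A minimal sketch of such a min-target-max policy, loosely modelled on gossipsub's pruning; all names here are hypothetical, not Mantis code:

import scala.util.Random

final case class ConnectionLimits(min: Int, target: Int, max: Int)

class ConnectionManager(limits: ConnectionLimits) {
  private var peers: Set[String] = Set.empty

  def add(peerId: String): Unit = {
    peers += peerId
    if (peers.size > limits.max) {
      // Prune random connections back down to `target`, freeing room for newcomers.
      val toPrune = Random.shuffle(peers.toSeq).take(peers.size - limits.target)
      toPrune.foreach(disconnect)
    }
  }

  // Below `min`, the node should actively try to graft new connections.
  def needsMorePeers: Boolean = peers.size < limits.min

  private def disconnect(peerId: String): Unit =
    peers -= peerId // real code would also close the underlying connection
}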

@ntallar left a comment

Apart from this and the minor comment update, LGTM!

import org.scalatest.flatspec.AnyFlatSpec
import org.scalatest.matchers.should.Matchers

class PeerDiscoveryManagerSpec

Should we add tests for our new PeerDiscoveryManager?

@aakoshh (Contributor, Author):

I don't know, the actor doesn't have too much logic in it, so I thought I'd leave it up to you if you're happy with the three of us having seen it, and tested it with mordor/etc, or you'd prefer having unit tests with a mock service.

@aakoshh (Contributor, Author) commented Nov 13, 2020

What would be cool is if there was some end-to-end test that started at least a couple of Mantis nodes that discovered each other, so we'd know the service builder is correct as well as the actor.

@aakoshh (Contributor, Author):

Added unit tests.

Reply:

That would be cool... I'm not sure if we had that on Midnight or not 🤔

@aakoshh (Contributor, Author):

Kinda; it builds the project (although I think that could be skipped) and starts the node as a subprocess to test the wallet. Not sure if multiple nodes are tested though, probably not.

@ntallar left a comment

Only a single comment remains, but LGTM!

 .\\            //.
. \ \          / /.
.\  ,\     /` /,.-
 -.   \  /'/ /  .
 ` -   `-'  \  -
   '.       /.\`
      -    .-
      :`//.'
      .`.'
      .' BP 

@aakoshh merged commit 0f73e7e into develop on Nov 17, 2020
@aakoshh deleted the ETCM-168-discovery-part4 branch on November 17, 2020 20:11