ETCM-168: Discovery part4 #770

Merged 42 commits into develop on Nov 17, 2020

Conversation

@aakoshh (Contributor) commented Nov 3, 2020

Description

Replaces the node discovery currently in Mantis with the one in Scalanet, implementing the Ethereum Discovery v4 spec.

Requires fixes from input-output-hk/scalanet#100

Important changes

Mantis previously lacked automation to resolve its own external IP address. It used to send 0.0.0.0 in the Ping messages as its own address, which the other nodes probably ignored in favour of the address visible on the connection. However, clients that fully implement the discovery protocol take the address from the ENR, so it has to be set correctly.

The PR adds an external IP resolution mechanism; however, if that were to fail, the address can and should be set manually with mantis.network.discovery.host. If it's not set, the discovery module will log an error and not discover new peers; it will only serve previously discovered and persisted ones. See the example in the Testing section below for setting it via command line arguments.
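
For illustration only, a minimal sketch of what such a resolution mechanism with a manual fallback could look like; the object name, the IP-echo service and the config handling are assumptions for this example, not Mantis' actual code:

import java.net.InetAddress
import scala.io.Source
import scala.util.Try

object ExternalIp {
  // Ask a public "what is my IP" service; fall back to the configured host,
  // mirroring the mantis.network.discovery.host override described above.
  def resolve(configuredHost: Option[String]): Option[InetAddress] = {
    val detected = Try {
      val ip = Source.fromURL("https://ifconfig.me/ip").mkString.trim
      InetAddress.getByName(ip)
    }.toOption
    detected.orElse(configuredHost.flatMap(h => Try(InetAddress.getByName(h)).toOption))
  }
}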

Testing

To test it, package Mantis, extract it, and start it with discovery and metrics enabled:

sbt dist
cd target/universal
unzip mantis-3.0.zip 
rm -rf ~/.mantis/mordor
./mantis-3.0/bin/mantis-launcher mordor \
  -Dmantis.metrics.enabled=true \
  -Dmantis.network.discovery.discovery-enabled=true \
  -Dmantis.network.discovery.reuse-known-nodes=false \
  -Dmantis.network.discovery.scan-interval=1.minute \
  -Dmantis.network.discovery.host=$(curl -s ifconfig.me/ip)

With metrics running, the number of discovered peers can be checked with curl:

$ curl -s http://127.0.0.1:13798 | grep discovery
# HELP app_network_discovery_foundPeers_gauge  
# TYPE app_network_discovery_foundPeers_gauge gauge
app_network_discovery_foundPeers_gauge 116.0

I needed fixes in Scalanet at the same time. Those were published locally and copied to Mantis before it was started:

mill scalanet.discovery.publishLocal
cp ~/.ivy2/local/io.iohk/scalanet-discovery_2.12/0.4-SNAPSHOT/jars/scalanet-discovery_2.12.jar ../mantis/target/universal/mantis-3.0/lib/io.iohk.scalanet-discovery_2.12-0.4-SNAPSHOT.jar 

The mordor config comes with 20-something bootstrap nodes, but I noticed that only a few of them seem to work. These can be added to mantis-3.0/conf/mordor.conf to cut down the noise from unreachable nodes. (I also temporarily disabled syncing and consensus to see anything in the logs; otherwise they fill up very quickly and get zipped up by log rotation.)

include "app.conf"

mantis {
  blockchains {
    network = "mordor"
    mordor {
      bootstrap-nodes = [
"enode://15b6ae4e9e18772f297c90d83645b0fbdb56667ce2d747d6d575b21d7b60c2d3cd52b11dec24e418438caf80ddc433232b3685320ed5d0e768e3972596385bfc@51.158.191.43:41235", // @q9f core-geth mizar
"enode://651b484b652c07c72adebfaaf8bc2bd95b420b16952ef3de76a9c00ef63f07cca02a20bd2363426f9e6fe372cef96a42b0fec3c747d118f79fd5e02f2a4ebd4e@51.158.190.99:45678", // @q9f core-geth lyrae
"enode://534d18fd46c5cd5ba48a68250c47cea27a1376869755ed631c94b91386328039eb607cf10dd8d0aa173f5ec21e3fb45c5d7a7aa904f97bc2557e9cb4ccc703f1@51.158.190.99:30303", // @q9f besu lyrae
"enode://642cf9650dd8869d42525dbf6858012e3b4d64f475e733847ab6f7742341a4397414865d953874e8f5ed91b0e4e1c533dee14ad1d6bb276a5459b2471460ff0d@157.230.152.87:30303", // @meowsbits but don't count on it
      ]
    }
  }
}

The logs also indicate how many peers have been discovered:

...
2020-11-03 13:56:31,569 [mantis_system-akka.actor.default-dispatcher-7] INFO  i.i.e.network.PeerManagerActor akka://mantis_system/user/peer-manager - Discovered 103 nodes, Blacklisted 95 nodes, connected to 9/85. Trying to connect to 0 more nodes.
...

@aakoshh marked this pull request as ready for review on November 3, 2020 17:29
@ntallar left a comment

> I needed fixes in Scalanet at the same time. Those were published locally and copied to Mantis before it was started:

Should a new version of scalanet be published then? Which commit is required? (for me to test it)

> These can be added to mantis-3.0/conf/mordor.conf to cut down the noise from unreachable nodes

Where did you get the new bootstrap nodes? So far we have been copying the bootstrap nodes from ETC's geth, though we haven't updated them in more than a month.

But if you found better bootstrap nodes we should definitely add them!

@aakoshh (Contributor, Author) commented Nov 5, 2020

> Should a new version of scalanet be published then? Which commit is required? (for me to test it)

It's been published now; it was the PR I referred to in the description.

> Where did you get the new bootstrap nodes? So far we have been copying the bootstrap nodes from ETC's geth, though we haven't updated them in more than a month. But if you found better bootstrap nodes we should definitely add them!

Sorry, I was vague: by "adding to ./conf/mordor.conf" I meant adding them to the unzipped config file, to override the default values. These 4 nodes are a subset of the defaults. I got them from looking at the block sync logs; they were the only ones logged as serving any data.

@aakoshh (Contributor, Author) commented Nov 6, 2020

I recommend we also include input-output-hk/scalanet#103; it makes the initial lookup phase much faster on mordor.

@aakoshh (Contributor, Author) commented Nov 12, 2020

When trying to connect to mainnet, it turned out that the settings from the PR description led to a rather slow uptake of nodes: it discovered ~1000 nodes in ~30 minutes. Trying these different settings was faster:

./mantis-3.0/bin/mantis-launcher etc \
  -Dmantis.metrics.enabled=true \
  -Dmantis.network.discovery.discovery-enabled=true \
  -Dmantis.network.discovery.reuse-known-nodes=false \
  -Dmantis.network.discovery.scan-interval=30.seconds \
  -Dmantis.network.discovery.kademlia-bucket-size=1024 \
  -Dmantis.network.discovery.kademlia-alpha=1024

The difference is that the algorithm will always keep the closest 1024 nodes in the lookup loop, not just 16, so it will try to bond with many more at the same time. That resulted in ~500 nodes discovered in 2 minutes and ~2100 in ~14 minutes. However, most nodes rejected the TCP handshake because they already had too many peers, and this was the point where the node found 3 peers it could use to pick a pivot block (it needs a majority agreement between them; arguably 2 peers would be enough as long as they agree).
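
To illustrate why raising these two parameters speeds up the uptake, here is a much-simplified sketch of a Kademlia-style lookup loop; the names and structure are assumptions for illustration, not Scalanet's actual implementation:

import scala.annotation.tailrec

final case class Node(id: BigInt)

object Lookup {
  // Simplified Kademlia lookup: keep only the `bucketSize` closest known
  // candidates and query up to `alpha` of them per round.
  def lookup(
      target: BigInt,
      seeds: Set[Node],
      bucketSize: Int,             // kademlia-bucket-size
      alpha: Int,                  // kademlia-alpha
      findNodes: Node => Set[Node] // ask a peer for neighbours of `target`
  ): Set[Node] = {
    def distance(n: Node): BigInt = n.id ^ target // XOR metric

    @tailrec
    def loop(known: Set[Node], queried: Set[Node]): Set[Node] = {
      // With bucket-size=1024 many more candidates survive each round than
      // with the default 16, so far more bonding happens concurrently.
      val closest = known.toSeq.sortBy(distance).take(bucketSize)
      val toQuery = closest.filterNot(queried).take(alpha)
      if (toQuery.isEmpty) closest.toSet
      else loop(known ++ toQuery.flatMap(findNodes), queried ++ toQuery)
    }

    loop(seeds, Set.empty)
  }
}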

The handshaked nodes are persisted, so restarting with normal settings later should see the node reconnect to those:

./mantis-3.0/bin/mantis-launcher etc \
  -Dmantis.metrics.enabled=true \
  -Dmantis.network.discovery.discovery-enabled=true \
  -Dmantis.network.discovery.reuse-known-nodes=true \
  -Dmantis.network.discovery.scan-interval=30.seconds \
  -Dmantis.network.discovery.kademlia-bucket-size=16 \
  -Dmantis.network.discovery.kademlia-alpha=3

However, after a restart it could still only connect to 2 nodes and failed to pick a pivot block, possibly because the other side thought we were already connected and it took them some time to clean up the broken connection.

@KonradStaniec (Contributor) commented

@aakoshh @ntallar my suggestion is to:

  • tweak the default params in this PR to enable quicker finding of nodes
  • create a task to investigate why geth is able to discover enough peers quickly with default params (I remember it took like 2-3 minutes to start fast sync after starting geth)

wdyt?

@aakoshh (Contributor, Author) commented Nov 12, 2020

Some of the default params can be tweaked; for example, the discovery interval of 15 minutes is surely not frequent enough. Other clients like trinity only did discovery when they didn't have enough candidate peers, without periodic lookups, while the Go client does a self lookup and 3 random lookups every 30 minutes. The bucket size is more or less part of the spec, I think. Alpha can be raised if needed, up to the bucket size (I don't think trinity used alpha; the Go client does).
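
As a rough sketch of the Go-client-style schedule described above (a self lookup plus a few random lookups on a timer); the class and callbacks are hypothetical, not geth or Mantis code:

import java.util.concurrent.{Executors, TimeUnit}
import scala.util.Random

class DiscoveryScheduler(
    selfLookup: () => Unit,      // look up our own node ID to refresh neighbours
    randomLookup: BigInt => Unit // look up a random target ID
) {
  private val scheduler = Executors.newSingleThreadScheduledExecutor()

  // Every `intervalMinutes`, do one self lookup plus `randomLookups` random ones.
  def start(intervalMinutes: Long = 30, randomLookups: Int = 3): Unit =
    scheduler.scheduleAtFixedRate(
      () => {
        selfLookup()
        (1 to randomLookups).foreach(_ => randomLookup(BigInt(256, Random)))
      },
      0,
      intervalMinutes,
      TimeUnit.MINUTES
    )

  def stop(): Unit = scheduler.shutdown()
}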

We can have another look at the go-ethereum codebase too if you think it would be useful.

@ntallar left a comment

I agree with what Konrad mentioned about creating a separate task to tweak the parameters! What we have is probably good enough to start deploying on the testnet.

@aakoshh (Contributor, Author) commented Nov 12, 2020

By the way @ntallar, I'm not sure exactly how it works in Mantis now, but I think the situation where we can't connect to almost anyone out of 2000+ nodes because they all have too many peers could be improved in the new network. For example, take a look at the grafting and pruning strategy in gossipsub. Perhaps our node could also have a min-target-max range for incoming/outgoing connections and drop some when it goes above the threshold, opening up room for newcomers.
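
A minimal sketch of such a min-target-max policy, loosely modelled on gossipsub's pruning; all names here are hypothetical, not Mantis code:

import scala.util.Random

final case class ConnectionLimits(min: Int, target: Int, max: Int)

class ConnectionManager(limits: ConnectionLimits) {
  private var peers: Set[String] = Set.empty

  def add(peerId: String): Unit = {
    peers += peerId
    if (peers.size > limits.max) {
      // Prune random connections back down to `target`, freeing room for newcomers.
      val toPrune = Random.shuffle(peers.toSeq).take(peers.size - limits.target)
      toPrune.foreach(disconnect)
    }
  }

  // Below `min`, the node should actively try to graft new connections.
  def needsMorePeers: Boolean = peers.size < limits.min

  private def disconnect(peerId: String): Unit =
    peers -= peerId // real code would also close the underlying connection
}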

@ntallar left a comment

Apart from this and the minor comment update, LGTM!

import org.scalatest.flatspec.AnyFlatSpec
import org.scalatest.matchers.should.Matchers

class PeerDiscoveryManagerSpec

Should we add tests for our new PeerDiscoveryManager?

@aakoshh (Contributor, Author):

I don't know, the actor doesn't have too much logic in it, so I thought I'd leave it up to you if you're happy with the three of us having seen it, and tested it with mordor/etc, or you'd prefer having unit tests with a mock service.

@aakoshh (Contributor, Author) commented Nov 13, 2020

What would be cool is if there was some end-to-end test that started at least a couple of Mantis nodes that discovered each other, so we'd know the service builder is correct as well as the actor.

@aakoshh (Contributor, Author):

Added unit tests.

Reply:

That would be cool... I'm not sure if we had that on Midnight or not 🤔

@aakoshh (Contributor, Author):

Kinda; it builds the project (although I think that could be skipped) and starts the node as a subprocess to test the wallet. Not sure if multiple nodes are tested though, probably not.

@ntallar left a comment

Only a single comment remains, but LGTM!

 .\\            //.
. \ \          / /.
.\  ,\     /` /,.-
 -.   \  /'/ /  .
 ` -   `-'  \  -
   '.       /.\`
      -    .-
      :`//.'
      .`.'
      .' BP 

@aakoshh merged commit 0f73e7e into develop on Nov 17, 2020
@aakoshh deleted the ETCM-168-discovery-part4 branch on November 17, 2020 20:11