
Peer count slowly decrease to 0 #6384

Open
mask-pp opened this issue Sep 11, 2024 · 8 comments
Labels
Networking question Further information is requested

Comments

@mask-pp

mask-pp commented Sep 11, 2024

Description

Please provide a brief description of the issue.

The peer count of the beacon_node slowly decreases to 0 on the Holesky network.

Please provide your Lighthouse and Rust version. Are you building from
stable or unstable, which commit?

sigp/lighthouse:v5.3.0

Describe the present behaviour of the application, with regards to this
issue.

Issue behaviour:
Once the peer count drops below a certain threshold (about 96), it begins to slowly decrease with no possibility of any increase.

geth(v1.13.15) cmd:

```shell
geth \
  --holesky \
  --datadir /data/holesky-node-full \
  --metrics \
  --metrics.addr "0.0.0.0" \
  --http \
  --http.addr "0.0.0.0" \
  --http.vhosts "*" \
  --http.corsdomain "*" \
  --http.api eth,net,web3,txpool \
  --ws \
  --ws.addr "0.0.0.0" \
  --ws.origins "*" \
  --ws.api eth,net,web3,txpool \
  --authrpc.addr "0.0.0.0" \
  --authrpc.vhosts "*" \
  --authrpc.jwtsecret /etc/jwt/secret.hex \
  --nat extip:$EXT_IP \
  --allow-insecure-unlock \
  --v5disc
```

beacon(v5.3.0) cmd:

How should the application behave?

The normal peer count is about 100 on the Holesky network, and it should recover automatically when the peer count drops too low.

Please describe the steps required to resolve this issue, if known.

@mask-pp mask-pp changed the title WARN Low peer count peer_count: 0 Peer count slowly decrease to 0 Sep 11, 2024
@AgeManning
Member

I think we are going to need some logs to diagnose this.

The described behaviour is similar to a node that loses an internet connection.

Debug logs can be found in the /data/lighthouse/holesky/beacon_node/logs directory. Pasting those here, or DM'ing me on discord (@AgeManning) will help us figure out the issue.

@mask-pp
Author

mask-pp commented Sep 11, 2024

> I think we are going to need some logs to diagnose this.
>
> The described behaviour is similar to a node that loses an internet connection.
>
> Debug logs can be found in the /data/lighthouse/holesky/beacon_node/logs directory. Pasting those here, or DM'ing me on discord (@AgeManning) will help us figure out the issue.

Thanks man, since the debug log is too large, I need to crop out the useful parts before sending them to you.

@michaelsproul
Member

@mask-pp The logs compress well. It's best if you can compress them and send the whole file, as it is all potentially relevant.
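For reference, a minimal sketch of compressing the whole log directory before sharing it (the `pack_logs` helper name is made up for illustration; the path is the one mentioned in this thread):

```shell
# Hypothetical helper: tar+gzip a directory's "logs" subdirectory.
#   pack_logs <dir-containing-logs> <output.tar.gz>
pack_logs() {
  tar -czf "$2" -C "$1" logs
}

# e.g. pack_logs /data/lighthouse/holesky/beacon_node lighthouse-logs.tar.gz
```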

Something else you could check is your time sync. Make sure you've got NTP running and that `sudo timedatectl status` shows you're synced. You could also try Chrony.
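To double-check that, a small sketch (the `check_sync` helper is hypothetical; the grep pattern assumes systemd's `timedatectl` output format):

```shell
# Hypothetical helper: reads `timedatectl status` output on stdin and reports
# whether the system clock is NTP-synchronized.
check_sync() {
  if grep -q 'System clock synchronized: yes'; then
    echo SYNCED
  else
    echo NOT-SYNCED
  fi
}

# Typical usage on the node:
#   timedatectl status | check_sync
```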

@chong-he chong-he added the question Further information is requested label Sep 12, 2024
@mask-pp
Author

mask-pp commented Sep 12, 2024

> I think we are going to need some logs to diagnose this.
> The described behaviour is similar to a node that loses an internet connection.
> Debug logs can be found in the /data/lighthouse/holesky/beacon_node/logs directory. Pasting those here, or DM'ing me on discord (@AgeManning) will help us figure out the issue.
>
> Thx man, since the debug log is too large, I need to crop out the useful information and then send it to you.

@AgeManning Hey friend, I have sent you the log retrieval command privately. Hope these logs are helpful in solving the issue.

@chong-he
Member

Are you on a VPS or running Lighthouse locally?

@mask-pp
Author

mask-pp commented Sep 12, 2024

> Are you on VPS or running Lighthouse locally?

Running in k8s

@chong-he
Member

Linking a similar issue here: #5271

@AgeManning
Member

I have been through these logs.

The logs show `Socket Updated` events (you can grep through the logs for this).

This log indicates that discovery is changing the contactable IP/port based on what other nodes see as the source in the packets they receive.
It starts out with port 9000 (which is usually correct); then, when it changes to some other random port, Lighthouse can no longer discover peers, because other nodes will not respond if the ENR has invalid settings.
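As a rough sketch of pulling those events out of the debug logs for inspection (the `socket_updates` helper name and the sample log line format are illustrative, not verbatim Lighthouse output):

```shell
# Filter "Socket Updated" events out of one or more Lighthouse debug log files
# so the sequence of advertised IP/port changes can be read at a glance.
socket_updates() {
  grep -h 'Socket Updated' "$@"
}

# e.g. socket_updates /data/lighthouse/holesky/beacon_node/logs/beacon.log*
```

The node's currently advertised ENR can also be inspected via the standard beacon API, e.g. `curl http://localhost:5052/eth/v1/node/identity`, assuming the HTTP API is enabled on the default port.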

It typically means that the router or gateway is sending traffic to other peers on ports other than 9000, for example because of a symmetric NAT. On home routers this usually means the ports are not forwarded correctly. Setting up a UDP port forward should make the router move traffic in and out through the same external port. If it uses other random ports, the ENR can be updated incorrectly.

I've seen this happen a few times, and there are some changes to discovery we can make that might improve this situation. I'll make some PRs.

The immediate solution is to find out why traffic is being sent out on different random ports and to double-check the NAT configuration for the UDP discovery traffic.
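As a possible stopgap, assuming a stable external IP, the ENR can be pinned so discovery stops rewriting it. This is a sketch only; the flag names are taken from the Lighthouse beacon node CLI, so verify them against `lighthouse bn --help` for your version:

```shell
# Pin the advertised ENR address/port and stop discovery from auto-updating it
# (remaining flags as in the existing setup).
lighthouse bn \
  --network holesky \
  --enr-address "$EXT_IP" \
  --enr-udp-port 9000 \
  --disable-enr-auto-update
```

Note this only masks the symptom: if the NAT still rewrites source ports, other peers may still fail to reach the node on the advertised port.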
