health: Implement a health check standard #529

Open
crypto-services opened this issue Mar 18, 2024 · 0 comments

We recently opened a PR on the Go Ethereum repo to implement a health API that works with cloud/load balancer providers' health checks to determine whether a node is in a suitable state to handle requests from clients.

Ideally, consensus on a universal standard for such a service would be reached before it is implemented. @lightclient recommended we open an issue here to get the ball rolling.

How Health Checks Work

Health checks across large providers generally work the same way (GCP, Azure, AWS):

  • The provider sends a GET request to an endpoint running on the service.
  • The service responds with status 200 (healthy) or 500 (unhealthy), or doesn't respond at all (unhealthy).
  • If the service is unhealthy, traffic heading to that piece of infrastructure is routed elsewhere.

There is some nuance: a small number of providers accept a range of 2XX/5XX statuses and take different actions for different statuses. However, the 200/500 model is universal.
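The 200/500 model can be sketched in Go as below. This is only an illustration, not the code from our PR: the `/health` path, the `isHealthy` callback and the `probe` helper are hypothetical names, and `probe` merely simulates a provider's GET request in-process.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// statusFor maps a health verdict onto the 200/500 model.
func statusFor(healthy bool) int {
	if healthy {
		return http.StatusOK
	}
	return http.StatusInternalServerError
}

// healthHandler wraps a hypothetical isHealthy callback in a GET-only handler.
func healthHandler(isHealthy func() bool) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodGet {
			w.Header().Set("Allow", http.MethodGet)
			w.WriteHeader(http.StatusMethodNotAllowed)
			return
		}
		w.WriteHeader(statusFor(isHealthy()))
	}
}

// probe simulates a provider's GET request and returns the status code.
func probe(h http.Handler) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/health", nil))
	return rec.Code
}

func main() {
	fmt.Println(probe(healthHandler(func() bool { return true })))  // healthy node
	fmt.Println(probe(healthHandler(func() bool { return false }))) // unhealthy node
}
```

In a real deployment the handler would be registered on the node's HTTP server, and the provider would treat a timeout the same way as a 500.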

Limitations

Health checks only allow simple GET requests. This is at odds with the current Ethereum JSON-RPC standard, which only allows POST requests of type application/json.

Because only GET requests are allowed, you cannot post the parameters in the JSON-RPC 2.0 format. Users must use custom headers or query strings in the URL to pass parameters.

Checks

As a starting point these are the checks we implemented in our PR (derived from Erigon's solution):

  • syncing: Check if the node is in the syncing state.
  • max_seconds_behind: Check that the timestamp of the latest block is no more than max_seconds_behind seconds in the past.
  • min_peers: Confirm there are at least min_peers number of peers connected to the node.
  • check_block: Confirm the node has synced beyond the height of check_block.

In our implementation, we used custom headers to pass the value for each variable. When a variable is not defined, that check does not run.

Upon finishing the checks, the node returns status 200 if all checks passed or 500 if any failed. It also returns an object containing the result of each check (OK, DISABLED or the error message for the check).

Further Work

Any assistance in developing this idea into a solid standard and moving it forward would be hugely appreciated. Such an endpoint would be massively useful to us as node operators, and likely to many others. Our team has the resources and willingness to assist however possible.
