Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

health: expand health check functionality with health package #29056

Open
wants to merge 13 commits into
base: master
Choose a base branch
from

Conversation

crypto-services
Copy link

Problem

Health checks for many infrastructure providers such as GCP, AWS and Cloudflare require an endpoint which accepts an HTTP GET request and returns status 200 if the service is healthy.

The current mechanism to address this was last updated in 2017 (#15496) where an exception was made to return status 200 for an empty HTTP GET request as the rpc only currently accepts POST requests of type application/json.

It would be useful to expand this functionality to enable operators to define the state required to be considered healthy. Other execution clients such as Erigon and Nethermind have implemented similar solutions.

Proposed Solution

This PR seeks to add a new package called health which extends the functionality of the http server to add a /health endpoint when invoked with the --http.health flag.

When enabled an operator can set parameters using the X-GETH-HEALTHCHECK header.

Available options (these match Erigon for cross compatibility):

  • synced - will check if the node has completed syncing
  • min_peer_count<count> - will check that the node has at least <count> many peers
  • check_block<block> - will check that the node is at least ahead of the <block> specified
  • max_seconds_behind<seconds> - will check that the node is no more than <seconds> behind from its latest block

The request will return:

  • Status 200 if all of the specified parameters are met
  • Status 500 if one or more parameters is not met or invalid
  • A response body containing the status of each check

When no value is provided for an option that check will not run. The status of the check will be DISABLED in the return object.

Example Request

curl --request GET 'http://localhost:8545/health' \
--header 'X-GETH-HEALTHCHECK: min_peer_count10' \
--header 'X-GETH-HEALTHCHECK: synced' \
--header 'X-GETH-HEALTHCHECK: max_seconds_behind60'

Alternatively you can send a POST request using the same parameters in JSON:

curl --request POST http://localhost:8545/health \
--data '{"min_peer_count": 10, "synced": true, "max_seconds_behind": 60}' 

Example Healthy Response (Status: 200)

{
    "query": "OK",
    "check_block":"DISABLED",
    "max_seconds_behind":"OK",
    "min_peer_count":"OK",
    "synced":"OK"
}

Example Unhealthy Response (Status: 500)

{
    "query": "OK",
    "check_block":"DISABLED",
    "max_seconds_behind":"OK",
    "min_peer_count":"not enough peers: 2 (minimum 10)",
    "synced":"OK"
}

Structure

Due to the restriction on the rpc package only accepting POST requests of type application/json and issues with circular imports creating a separate package which can be enabled independently appeared to be the cleanest solution. The package is similar in implementation to the graphql package in that it extends the http server, so the new package is mimicking the structure of this package.

Further Development

The work on this PR was done by the team at Republic Crypto. We're happy to accept input and apply requests for change where needed.

@crypto-services
Copy link
Author

@rjl493456442 failing check related to this PR rectified. The remaining failure on the GETH_ARCH=386 check which is failing with util.go:48 is on master branch 🙏

@learnerLj
Copy link

The functions mentioned can be implemented by RPC method, such as net_peerCount, eth_blockNumber. Thus, there is no need to integrate them to privde a new interface. A new cmd tool might be more appropriate.

@crypto-services
Copy link
Author

Hey @learnerLj, thanks for taking the time to review! Could you elaborate a little bit on what you'd like to see? The problem with the RPC is health checks generally do not allow people running load balancers to use POST requests for checks and are limited in scope. It would of course be possible to have a separate application running to proxy GET requests and convert them to POST requests to the RPC but for those running many nodes in load balancers an integrated solution would be useful in our experience.

@learnerLj
Copy link

Could you please introduce the structure of banlancers and geth? From my perspective, a new package for a specific business can bring unnessesary complexity to geth. Thus, I recomends that you can fork it and write your application in cmd folder. when composing the protocol stack(ref. https://blog-blockchain.xyz/geth/Geth-Unveiled-Ethereum-Startup/), your app get the rpc port by reading the open endpoints.

@crypto-services
Copy link
Author

Ah ok that makes sense and I agree with your recommendations, I will make some updates! Forgive me if I misunderstand your initial question. A balancer is used when you have a pool (or multiple pools) of instances running behind a single url. These are used to distribute traffic across multiple servers to spread the workload and/or link users to their geographically closest services. The health checks automatically add/remove nodes from the pools and route traffic based on the result of health checks to ensure continuity of service. For geth this is most relevant to rpc services and those using geth as the execution client for validators.

@holiman
Copy link
Contributor

holiman commented Mar 5, 2024

We like the proposal, but the integration needs to be done somewhat differently, also it would be good to coordinate this endpoint specification a bit better amongst client devs.

@crypto-services
Copy link
Author

@holiman Sure thing, sounds good. Do you have any advice on how you'd structure it? I had a go at implementing it as per @learnerLj's recommendations but got stuck at integrating a separate app to run via the main process as there doesn't seem to be a precedence for that with the current cmd apps.

@karalabe
Copy link
Member

karalabe commented Mar 6, 2024

A few ides, just to do a braindump:

  • Adding an RPC method to do health queries is very simple, we could add arbitrarily many filter conditions as they make sense and we could return either a combo or split up responses for each query. Here it would be nice to coordinate with other teams so we have a uniform API.
    • One catch is that the Erigon spec (and your proposal too) return 200 vs 50X HTTP codes depending on the result. This goes against the current RPC spec. That does not necessarily mean we can't special case this endpoint, but it's a non obvious thing.
  • Adding a dedicated HTTP endpoint for RESTful health checks also feels ok from my part, here what I'd like to investigate (maybe you can provide some pointers) is how other big players in the cloud/web2 industry do it. If we want to be composable/orthogonal with that world, we should follow their rules on this.
    • Passing queries in headers is kind of ok, but the query language of just concatenating the key and the value (e.g. min_peer_count10) feels very wonky and I'm semi sure this cannot be the way the industry works.

@crypto-services
Copy link
Author

Agree that the concatenation of variable names and values does not feel right. Erigon was the most analogous project to Geth with this feature so we used their structure as the basis for this in an attempt to create some uniformity. Having individual headers for each option would make more sense to me. An example of this from Fortinet.

Unfortunately returning 200 vs 50X HTTP codes depending on the result is a non-configurable standard for the major services (GCP, Azure, AWS) so it would require a caveat to the rule.

In terms of header name formatting this is GCP's guidance on custom header names:

  • The header name must be a valid HTTP header-field name definition (per RFC 7230).
  • The header name must not be X-User-IP or CDN-Loop.
  • The header name must not begin with X-Google, X-Goog-, X-GFE or X-Amz-.
  • A header name must not appear more than once in the list of added headers.

Erigon's implementation does seem to break that final rule.

An alternative solution would be to use query strings and pass the values in the url where providing no value returns the current behaviour and adding values triggers the extra checks:

curl --request GET 'http://localhost:8545/health?min_peer_count=10&max_seconds_behind=60'

@lightclient
Copy link
Member

I agree that this seems like something that would be nice to standardize across EL clients. Maybe worth opening an issue to https://github.com/ethereum/execution-apis and discussing on ACD at some point?

@crypto-services
Copy link
Author

@lightclient sounds like a plan. Done 👍

@crypto-services
Copy link
Author

Can anyone help get this moving in some way? @lightclient the issue raised on the execution-apis has not yet been ACK'd, do you have any advice for us 🙇‍♂️

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants