health: expand health check functionality with `health` package #29056

crypto-services · 2024-02-22T06:09:13Z

Problem

Health checks for many infrastructure providers such as GCP, AWS and Cloudflare require an endpoint which accepts an HTTP GET request and returns status 200 if the service is healthy.

The current mechanism to address this was last updated in 2017 (#15496) where an exception was made to return status 200 for an empty HTTP GET request as the rpc only currently accepts POST requests of type application/json.

It would be useful to expand this functionality to enable operators to define the state required to be considered healthy. Other execution clients such as Erigon and Nethermind have implemented similar solutions.

Proposed Solution

This PR seeks to add a new package called health which extends the functionality of the http server to add a /health endpoint when invoked with the --http.health flag.

When enabled an operator can set parameters using the X-GETH-HEALTHCHECK header.

Available options (these match Erigon for cross compatibility):

synced - will check if the node has completed syncing
min_peer_count<count> - will check that the node has at least <count> many peers
check_block<block> - will check that the node is at least ahead of the <block> specified
max_seconds_behind<seconds> - will check that the node is no more than <seconds> behind from its latest block

The request will return:

Status 200 if all of the specified parameters are met
Status 500 if one or more parameters is not met or invalid
A response body containing the status of each check

When no value is provided for an option that check will not run. The status of the check will be DISABLED in the return object.

Example Request

curl --request GET 'http://localhost:8545/health' \
--header 'X-GETH-HEALTHCHECK: min_peer_count10' \
--header 'X-GETH-HEALTHCHECK: synced' \
--header 'X-GETH-HEALTHCHECK: max_seconds_behind60'

Alternatively you can send a POST request using the same parameters in JSON:

curl --request POST http://localhost:8545/health \
--data '{"min_peer_count": 10, "synced": true, "max_seconds_behind": 60}'

Example Healthy Response (`Status: 200`)

{
    "query": "OK",
    "check_block":"DISABLED",
    "max_seconds_behind":"OK",
    "min_peer_count":"OK",
    "synced":"OK"
}

Example Unhealthy Response (`Status: 500`)

{
    "query": "OK",
    "check_block":"DISABLED",
    "max_seconds_behind":"OK",
    "min_peer_count":"not enough peers: 2 (minimum 10)",
    "synced":"OK"
}

Structure

Due to the restriction on the rpc package only accepting POST requests of type application/json and issues with circular imports creating a separate package which can be enabled independently appeared to be the cleanest solution. The package is similar in implementation to the graphql package in that it extends the http server, so the new package is mimicking the structure of this package.

Further Development

The work on this PR was done by the team at Republic Crypto. We're happy to accept input and apply requests for change where needed.

…h-Healthcheck

crypto-services · 2024-02-22T08:48:42Z

@rjl493456442 failing check related to this PR rectified. The remaining failure on the GETH_ARCH=386 check which is failing with util.go:48 is on master branch 🙏

learnerLj · 2024-02-22T11:53:15Z

The functions mentioned can be implemented by RPC method, such as net_peerCount, eth_blockNumber. Thus, there is no need to integrate them to privde a new interface. A new cmd tool might be more appropriate.

crypto-services · 2024-02-23T04:39:34Z

Hey @learnerLj, thanks for taking the time to review! Could you elaborate a little bit on what you'd like to see? The problem with the RPC is health checks generally do not allow people running load balancers to use POST requests for checks and are limited in scope. It would of course be possible to have a separate application running to proxy GET requests and convert them to POST requests to the RPC but for those running many nodes in load balancers an integrated solution would be useful in our experience.

learnerLj · 2024-02-23T07:43:59Z

Could you please introduce the structure of banlancers and geth? From my perspective, a new package for a specific business can bring unnessesary complexity to geth. Thus, I recomends that you can fork it and write your application in cmd folder. when composing the protocol stack(ref. https://blog-blockchain.xyz/geth/Geth-Unveiled-Ethereum-Startup/), your app get the rpc port by reading the open endpoints.

crypto-services · 2024-02-26T01:48:38Z

Ah ok that makes sense and I agree with your recommendations, I will make some updates! Forgive me if I misunderstand your initial question. A balancer is used when you have a pool (or multiple pools) of instances running behind a single url. These are used to distribute traffic across multiple servers to spread the workload and/or link users to their geographically closest services. The health checks automatically add/remove nodes from the pools and route traffic based on the result of health checks to ensure continuity of service. For geth this is most relevant to rpc services and those using geth as the execution client for validators.

holiman · 2024-03-05T13:37:36Z

We like the proposal, but the integration needs to be done somewhat differently, also it would be good to coordinate this endpoint specification a bit better amongst client devs.

crypto-services · 2024-03-05T21:25:33Z

@holiman Sure thing, sounds good. Do you have any advice on how you'd structure it? I had a go at implementing it as per @learnerLj's recommendations but got stuck at integrating a separate app to run via the main process as there doesn't seem to be a precedence for that with the current cmd apps.

karalabe · 2024-03-06T10:07:32Z

A few ides, just to do a braindump:

Adding an RPC method to do health queries is very simple, we could add arbitrarily many filter conditions as they make sense and we could return either a combo or split up responses for each query. Here it would be nice to coordinate with other teams so we have a uniform API.
- One catch is that the Erigon spec (and your proposal too) return 200 vs 50X HTTP codes depending on the result. This goes against the current RPC spec. That does not necessarily mean we can't special case this endpoint, but it's a non obvious thing.
Adding a dedicated HTTP endpoint for RESTful health checks also feels ok from my part, here what I'd like to investigate (maybe you can provide some pointers) is how other big players in the cloud/web2 industry do it. If we want to be composable/orthogonal with that world, we should follow their rules on this.
- Passing queries in headers is kind of ok, but the query language of just concatenating the key and the value (e.g. min_peer_count10) feels very wonky and I'm semi sure this cannot be the way the industry works.

crypto-services · 2024-03-07T06:55:58Z

Agree that the concatenation of variable names and values does not feel right. Erigon was the most analogous project to Geth with this feature so we used their structure as the basis for this in an attempt to create some uniformity. Having individual headers for each option would make more sense to me. An example of this from Fortinet.

Unfortunately returning 200 vs 50X HTTP codes depending on the result is a non-configurable standard for the major services (GCP, Azure, AWS) so it would require a caveat to the rule.

In terms of header name formatting this is GCP's guidance on custom header names:

The header name must be a valid HTTP header-field name definition (per RFC 7230).
The header name must not be X-User-IP or CDN-Loop.
The header name must not begin with X-Google, X-Goog-, X-GFE or X-Amz-.
A header name must not appear more than once in the list of added headers.

Erigon's implementation does seem to break that final rule.

An alternative solution would be to use query strings and pass the values in the url where providing no value returns the current behaviour and adding values triggers the extra checks:

curl --request GET 'http://localhost:8545/health?min_peer_count=10&max_seconds_behind=60'

lightclient · 2024-03-12T20:49:39Z

I agree that this seems like something that would be nice to standardize across EL clients. Maybe worth opening an issue to https://github.com/ethereum/execution-apis and discussing on ACD at some point?

crypto-services · 2024-03-18T03:09:00Z

@lightclient sounds like a plan. Done 👍

crypto-services · 2024-06-03T05:32:46Z

Can anyone help get this moving in some way? @lightclient the issue raised on the execution-apis has not yet been ACK'd, do you have any advice for us 🙇‍♂️

crypto-services added 13 commits February 15, 2024 16:25

Initial cut

96968b1

Update cors to http

bb6215b

Remove unused backend

3ded1c1

Add testing

cc3cd2f

Update test name

a8be551

Remove parent context

bc2ace9

Merge remote-tracking branch 'upstream/master' into hkhosh_ENG191-Get…

adc14c8

…h-Healthcheck

health: Add CLI flag

6fc5fbe

health: Add comments

b159773

health: Replace header check if with switch

08dc03e

Merge remote-tracking branch 'upstream/master' into hkhosh_ENG191-Get…

2ee7728

…h-Healthcheck

Remove unrequired big int conversion

be21712

health: Remove unrequired uint64 conversation

ff742cb

rjl493456442 added the status:triage label Feb 22, 2024

holiman removed the status:triage label Mar 5, 2024

crypto-services mentioned this pull request Mar 18, 2024

health: Implement a health check standard ethereum/execution-apis#529

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

health: expand health check functionality with `health` package #29056

health: expand health check functionality with `health` package #29056

crypto-services commented Feb 22, 2024

crypto-services commented Feb 22, 2024

learnerLj commented Feb 22, 2024

crypto-services commented Feb 23, 2024

learnerLj commented Feb 23, 2024

crypto-services commented Feb 26, 2024

holiman commented Mar 5, 2024

crypto-services commented Mar 5, 2024

karalabe commented Mar 6, 2024

crypto-services commented Mar 7, 2024

lightclient commented Mar 12, 2024

crypto-services commented Mar 18, 2024

crypto-services commented Jun 3, 2024

health: expand health check functionality with health package #29056

Are you sure you want to change the base?

health: expand health check functionality with health package #29056

Conversation

crypto-services commented Feb 22, 2024

Problem

Proposed Solution

Example Request

Example Healthy Response (Status: 200)

Example Unhealthy Response (Status: 500)

Structure

Further Development

crypto-services commented Feb 22, 2024

learnerLj commented Feb 22, 2024

crypto-services commented Feb 23, 2024

learnerLj commented Feb 23, 2024

crypto-services commented Feb 26, 2024

holiman commented Mar 5, 2024

crypto-services commented Mar 5, 2024

karalabe commented Mar 6, 2024

crypto-services commented Mar 7, 2024

lightclient commented Mar 12, 2024

crypto-services commented Mar 18, 2024

crypto-services commented Jun 3, 2024

health: expand health check functionality with `health` package #29056

health: expand health check functionality with `health` package #29056

Example Healthy Response (`Status: 200`)

Example Unhealthy Response (`Status: 500`)