Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cumulus/minimal-node: added prometheus metrics for the RPC client #5572

Merged
merged 23 commits into from
Sep 19, 2024

Conversation

iulianbarbu
Copy link
Contributor

@iulianbarbu iulianbarbu commented Sep 3, 2024

Description

When we start a node with connections to external RPC servers (as a minimal node), we lack metrics around how many individual calls we're doing to the remote RPC servers and their duration. This PR adds metrics that measure durations of each RPC call made by the minimal nodes, and implicitly how many calls there are.

Closes #5409
Closes #5689

Integration

Node operators should be able to track minimal node metrics and decide appropriate actions according to how the metrics are interpreted/felt. The added metrics can be observed by curl'ing the prometheus metrics endpoint for the relaychain parachain (it was changed based on the review). The metrics are represented by polkadot_parachain_relay_chain_rpc_interface relay_chain_rpc_interface namespace (I realized lining up parachain_relay_chain in the same metric might be confusing :). Excerpt from the curl:

relay_chain_rpc_interface_bucket{method="chain_getBlockHash",chain="rococo_local_testnet",le="0.001"} 15
relay_chain_rpc_interface_bucket{method="chain_getBlockHash",chain="rococo_local_testnet",le="0.004"} 23
relay_chain_rpc_interface_bucket{method="chain_getBlockHash",chain="rococo_local_testnet",le="0.016"} 23
relay_chain_rpc_interface_bucket{method="chain_getBlockHash",chain="rococo_local_testnet",le="0.064"} 23
relay_chain_rpc_interface_bucket{method="chain_getBlockHash",chain="rococo_local_testnet",le="0.256"} 24
relay_chain_rpc_interface_bucket{method="chain_getBlockHash",chain="rococo_local_testnet",le="1.024"} 24
relay_chain_rpc_interface_bucket{method="chain_getBlockHash",chain="rococo_local_testnet",le="4.096"} 24
relay_chain_rpc_interface_bucket{method="chain_getBlockHash",chain="rococo_local_testnet",le="16.384"} 24
relay_chain_rpc_interface_bucket{method="chain_getBlockHash",chain="rococo_local_testnet",le="65.536"} 24
relay_chain_rpc_interface_bucket{method="chain_getBlockHash",chain="rococo_local_testnet",le="+Inf"} 24
relay_chain_rpc_interface_sum{method="chain_getBlockHash",chain="rococo_local_testnet"} 0.11719075
relay_chain_rpc_interface_count{method="chain_getBlockHash",chain="rococo_local_testnet"} 24

Review Notes

The way we measure durations/hits is based on HistogramVec struct which allows us to collect timings for each RPC client method called from the minimal node., It can be extended to measure the RPCs against other dimensions too (status codes, response sizes, etc). The timing measuring is done at the level of the relay-chain-rpc-interface, in the RelayChainRpcClient struct's method 'request_tracing'. A single entry point for all RPC requests done through the relay-chain-rpc-interface. The requests durations will fall under exponential buckets described by start 0.001, factor 4 and count 9.

@iulianbarbu iulianbarbu added the T9-cumulus This PR/Issue is related to cumulus. label Sep 3, 2024
@iulianbarbu iulianbarbu self-assigned this Sep 3, 2024
@iulianbarbu iulianbarbu force-pushed the add-rpc-collator-metrics branch 2 times, most recently from 4c1f326 to a42b3ae Compare September 4, 2024 13:43
@paritytech-cicd-pr
Copy link

The CI pipeline was cancelled due to failure one of the required jobs.
Job name: test-linux-stable-int
Logs: https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7290328

Signed-off-by: Iulian Barbu <iulian.barbu@parity.io>
Signed-off-by: Iulian Barbu <iulian.barbu@parity.io>
Signed-off-by: Iulian Barbu <iulian.barbu@parity.io>
Signed-off-by: Iulian Barbu <iulian.barbu@parity.io>
Signed-off-by: Iulian Barbu <iulian.barbu@parity.io>
Signed-off-by: Iulian Barbu <iulian.barbu@parity.io>
Signed-off-by: Iulian Barbu <iulian.barbu@parity.io>
Signed-off-by: Iulian Barbu <iulian.barbu@parity.io>
Signed-off-by: Iulian Barbu <iulian.barbu@parity.io>
@iulianbarbu iulianbarbu marked this pull request as ready for review September 13, 2024 08:02
Copy link
Contributor

@michalkucharczyk michalkucharczyk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good, nits only.

cumulus/client/relay-chain-rpc-interface/src/lib.rs Outdated Show resolved Hide resolved
cumulus/client/relay-chain-rpc-interface/src/metrics.rs Outdated Show resolved Hide resolved
cumulus/client/relay-chain-rpc-interface/src/metrics.rs Outdated Show resolved Hide resolved
"polkadot_parachain_relay_chain_rpc_interface",
"Tracks stats about cumulus relay chain RPC interface",
),
buckets: prometheus::exponential_buckets(0.001, 4.0, 9)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just curiosity - any reason for these values?

Am I right that buckets will be?

0.001
0.004
0.016
0.064
0.256
1.024
4.096
16.384
65.536

Copy link
Contributor Author

@iulianbarbu iulianbarbu Sep 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct for the buckets.

I picked the buckets split by seeing it used for other requests timers in the code (related to substrate libp2p), although there isn't a particular relationship between them. Ideally we'll have some preliminary measurements for these first and then pick the buckets. I am ok with these values though because they correspond to some rough back of the envelope measurements for exchanging data over the network (e.g USA -> EU -> USA ~ 150 ms). I think that the higher buckets (e.g. >1s) can be considered extreme, and might correspond to super infrequent outliers (assuming the network runs fine most of the time).

LE: on my above higher buckets note, depends as usual. We measure implicitly the time it takes for the external RPC to process the request and return the response, so I think they can hold some of the observations, depending on the nature of the RPC call.

Signed-off-by: Iulian Barbu <iulian.barbu@parity.io>
Signed-off-by: Iulian Barbu <iulian.barbu@parity.io>
@@ -127,6 +128,7 @@ pub async fn build_minimal_relay_chain_node_with_rpc(
let client = cumulus_relay_chain_rpc_interface::create_client_and_start_worker(
relay_chain_url,
task_manager,
polkadot_config.prometheus_registry(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am thinking about whether this is better registered on the parachain side.

Technically, this is doing relay chain calls. However, the calls to the relay chain are basically done only on collators, the code lives in cumulus. As a user I would probably expect these metrics to be attached to the parachain prometheus endpoint.

Copy link
Contributor Author

@iulianbarbu iulianbarbu Sep 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked for reasons to keep the metrics on the relay chain side but couldn't find any. To be honest, it is still not clear to me what kind of metrics should fall under the relay chain prometheus exporter (its concerns on the collator side are not crispy clear in my mind yet), but for our case I agree that these metrics seem more relevant to the internals of how parachains work, so it would be useful to expose them in the "parachain" prometheus exporter.

Changed this here: 309ae23.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest, it is still not clear to me what kind of metrics should fall under the relay chain prometheus exporter

So when you don't use --relay-chain-rpc-urls then the collator will start an embedded node. Which is basically the same as a polkadot full node that you start with the polkadot binary.

If course this internal node will export all its metrics, there is a whole bunch of them defined in substrate. So in these scenarios it can make sense to monitor the relay chain node separately. This RPC functionality however is in the end parachain specific and therefore goes to the parachain prometheus endpoint.

iulianbarbu and others added 5 commits September 18, 2024 21:54
...and rename for clarity

Signed-off-by: Iulian Barbu <iulian.barbu@parity.io>
Signed-off-by: Iulian Barbu <iulian.barbu@parity.io>
Signed-off-by: Iulian Barbu <iulian.barbu@parity.io>
Signed-off-by: Iulian Barbu <iulian.barbu@parity.io>
Copy link
Contributor

@skunert skunert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice!

@iulianbarbu iulianbarbu added this pull request to the merge queue Sep 19, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to no response for status checks Sep 19, 2024
@iulianbarbu iulianbarbu added this pull request to the merge queue Sep 19, 2024
Merged via the queue into paritytech:master with commit c8d5e5a Sep 19, 2024
198 of 208 checks passed
@iulianbarbu iulianbarbu deleted the add-rpc-collator-metrics branch September 19, 2024 17:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
T9-cumulus This PR/Issue is related to cumulus.
Projects
None yet
5 participants