Skip to content

Commit

Permalink
Introduce subsystem benchmarking tool (#2528)
Browse files Browse the repository at this point in the history
This tool makes it easy to run parachain consensus stress/performance
testing on your development machine or in CI.

## Motivation
The parachain consensus node implementation spans across many modules
which we call subsystems. Each subsystem is responsible for a small part
of logic of the parachain consensus pipeline, but in general the most
load and performance issues are localized in just a few core subsystems
like `availability-recovery`, `approval-voting` or
`dispute-coordinator`. In the absence of such a tool, we would run large
test nets to load/stress test these parts of the system. Setting up and
making sense of the amount of data produced by such a large test is very
expensive, hard to orchestrate and is a huge development time sink.

## PR contents
- CLI tool 
- Data Availability Read test
- reusable mockups and components needed so far
- Documentation on how to get started

### Data Availability Read test

An overseer is built with using a real `availability-recovery` susbsytem
instance while dependent subsystems like `av-store`, `network-bridge`
and `runtime-api` are mocked. The network bridge will emulate all the
network peers and their answering to requests.

The test is going to be run for a number of blocks. For each block it
will generate send a “RecoverAvailableData” request for an arbitrary
number of candidates. We wait for the subsystem to respond to all
requests before moving to the next block.
At the same time we collect the usual subsystem metrics and task CPU
metrics and show some nice progress reports while running.

### Here is how the CLI looks like:

```
[2023-11-28T13:06:27Z INFO  subsystem_bench::core::display] n_validators = 1000, n_cores = 20, pov_size = 5120 - 5120, error = 3, latency = Some(PeerLatency { min_latency: 1ms, max_latency: 100ms })
[2023-11-28T13:06:27Z INFO  subsystem-bench::availability] Generating template candidate index=0 pov_size=5242880
[2023-11-28T13:06:27Z INFO  subsystem-bench::availability] Created test environment.
[2023-11-28T13:06:27Z INFO  subsystem-bench::availability] Pre-generating 60 candidates.
[2023-11-28T13:06:30Z INFO  subsystem-bench::core] Initializing network emulation for 1000 peers.
[2023-11-28T13:06:30Z INFO  subsystem-bench::availability] Current block 1/3
[2023-11-28T13:06:30Z INFO  substrate_prometheus_endpoint] 〽️ Prometheus exporter started at 127.0.0.1:9999
[2023-11-28T13:06:30Z INFO  subsystem_bench::availability] 20 recoveries pending
[2023-11-28T13:06:37Z INFO  subsystem_bench::availability] Block time 6262ms
[2023-11-28T13:06:37Z INFO  subsystem-bench::availability] Sleeping till end of block (0ms)
[2023-11-28T13:06:37Z INFO  subsystem-bench::availability] Current block 2/3
[2023-11-28T13:06:37Z INFO  subsystem_bench::availability] 20 recoveries pending
[2023-11-28T13:06:43Z INFO  subsystem_bench::availability] Block time 6369ms
[2023-11-28T13:06:43Z INFO  subsystem-bench::availability] Sleeping till end of block (0ms)
[2023-11-28T13:06:43Z INFO  subsystem-bench::availability] Current block 3/3
[2023-11-28T13:06:43Z INFO  subsystem_bench::availability] 20 recoveries pending
[2023-11-28T13:06:49Z INFO  subsystem_bench::availability] Block time 6194ms
[2023-11-28T13:06:49Z INFO  subsystem-bench::availability] Sleeping till end of block (0ms)
[2023-11-28T13:06:49Z INFO  subsystem_bench::availability] All blocks processed in 18829ms
[2023-11-28T13:06:49Z INFO  subsystem_bench::availability] Throughput: 102400 KiB/block
[2023-11-28T13:06:49Z INFO  subsystem_bench::availability] Block time: 6276 ms
[2023-11-28T13:06:49Z INFO  subsystem_bench::availability] 
    
    Total received from network: 415 MiB
    Total sent to network: 724 KiB
    Total subsystem CPU usage 24.00s
    CPU usage per block 8.00s
    Total test environment CPU usage 0.15s
    CPU usage per block 0.05s
```

### Prometheus/Grafana stack in action
<img width="1246" alt="Screenshot 2023-11-28 at 15 11 10"
src="https://github.com/paritytech/polkadot-sdk/assets/54316454/eaa47422-4a5e-4a3a-aaef-14ca644c1574">
<img width="1246" alt="Screenshot 2023-11-28 at 15 12 01"
src="https://github.com/paritytech/polkadot-sdk/assets/54316454/237329d6-1710-4c27-8f67-5fb11d7f66ea">
<img width="1246" alt="Screenshot 2023-11-28 at 15 12 38"
src="https://github.com/paritytech/polkadot-sdk/assets/54316454/a07119e8-c9f1-4810-a1b3-f1b7b01cf357">

---------

Signed-off-by: Andrei Sandu <andrei-mihail@parity.io>
  • Loading branch information
sandreim authored and command-bot committed Dec 14, 2023
1 parent 2419abf commit 052b83f
Show file tree
Hide file tree
Showing 31 changed files with 5,829 additions and 38 deletions.
92 changes: 90 additions & 2 deletions Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

1 change: 1 addition & 0 deletions Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -151,6 +151,7 @@ members = [
"polkadot/node/primitives",
"polkadot/node/service",
"polkadot/node/subsystem",
"polkadot/node/subsystem-bench",
"polkadot/node/subsystem-test-helpers",
"polkadot/node/subsystem-types",
"polkadot/node/subsystem-util",
Expand Down
4 changes: 4 additions & 0 deletions polkadot/node/network/availability-recovery/Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -11,6 +11,7 @@ workspace = true

[dependencies]
futures = "0.3.21"
tokio = "1.24.2"
schnellru = "0.2.1"
rand = "0.8.5"
fatality = "0.0.6"
Expand Down Expand Up @@ -40,3 +41,6 @@ sc-network = { path = "../../../../substrate/client/network" }

polkadot-node-subsystem-test-helpers = { path = "../../subsystem-test-helpers" }
polkadot-primitives-test-helpers = { path = "../../../primitives/test-helpers" }

[features]
subsystem-benchmarks = []
13 changes: 10 additions & 3 deletions polkadot/node/network/availability-recovery/src/lib.rs
Original file line number Diff line number Diff line change
Expand Up @@ -65,7 +65,7 @@ mod error;
mod futures_undead;
mod metrics;
mod task;
use metrics::Metrics;
pub use metrics::Metrics;

#[cfg(test)]
mod tests;
Expand Down Expand Up @@ -603,7 +603,8 @@ impl AvailabilityRecoverySubsystem {
}
}

async fn run<Context>(self, mut ctx: Context) -> SubsystemResult<()> {
/// Starts the inner subsystem loop.
pub async fn run<Context>(self, mut ctx: Context) -> SubsystemResult<()> {
let mut state = State::default();
let Self {
mut req_receiver,
Expand Down Expand Up @@ -681,6 +682,7 @@ impl AvailabilityRecoverySubsystem {
&mut state,
signal,
).await? {
gum::debug!(target: LOG_TARGET, "subsystem concluded");
return Ok(());
}
FromOrchestra::Communication { msg } => {
Expand Down Expand Up @@ -845,12 +847,17 @@ async fn erasure_task_thread(
let _ = sender.send(maybe_data);
},
None => {
gum::debug!(
gum::trace!(
target: LOG_TARGET,
"Erasure task channel closed. Node shutting down ?",
);
break
},
}

// In benchmarks this is a very hot loop not yielding at all.
// To update CPU metrics for the task we need to yield.
#[cfg(feature = "subsystem-benchmarks")]
tokio::task::yield_now().await;
}
}
17 changes: 14 additions & 3 deletions polkadot/node/network/availability-recovery/src/metrics.rs
Original file line number Diff line number Diff line change
Expand Up @@ -29,7 +29,10 @@ struct MetricsInner {
///
/// Gets incremented on each sent chunk requests.
chunk_requests_issued: Counter<U64>,

/// Total number of bytes recovered
///
/// Gets incremented on each succesful recovery
recovered_bytes_total: Counter<U64>,
/// A counter for finished chunk requests.
///
/// Split by result:
Expand Down Expand Up @@ -133,9 +136,10 @@ impl Metrics {
}

/// A full recovery succeeded.
pub fn on_recovery_succeeded(&self) {
pub fn on_recovery_succeeded(&self, bytes: usize) {
if let Some(metrics) = &self.0 {
metrics.full_recoveries_finished.with_label_values(&["success"]).inc()
metrics.full_recoveries_finished.with_label_values(&["success"]).inc();
metrics.recovered_bytes_total.inc_by(bytes as u64)
}
}

Expand Down Expand Up @@ -171,6 +175,13 @@ impl metrics::Metrics for Metrics {
)?,
registry,
)?,
recovered_bytes_total: prometheus::register(
Counter::new(
"polkadot_parachain_availability_recovery_bytes_total",
"Total number of bytes recovered",
)?,
registry,
)?,
chunk_requests_finished: prometheus::register(
CounterVec::new(
Opts::new(
Expand Down
3 changes: 2 additions & 1 deletion polkadot/node/network/availability-recovery/src/task.rs
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,7 @@ use crate::{
PostRecoveryCheck, LOG_TARGET,
};
use futures::{channel::oneshot, SinkExt};
use parity_scale_codec::Encode;
#[cfg(not(test))]
use polkadot_node_network_protocol::request_response::CHUNK_REQUEST_TIMEOUT;
use polkadot_node_network_protocol::request_response::{
Expand Down Expand Up @@ -432,7 +433,7 @@ where
return Err(err)
},
Ok(data) => {
self.params.metrics.on_recovery_succeeded();
self.params.metrics.on_recovery_succeeded(data.encoded_size());
return Ok(data)
},
}
Expand Down
29 changes: 1 addition & 28 deletions polkadot/node/network/availability-recovery/src/tests.rs
Original file line number Diff line number Diff line change
Expand Up @@ -24,12 +24,12 @@ use parity_scale_codec::Encode;
use polkadot_node_network_protocol::request_response::{
self as req_res, IncomingRequest, Recipient, ReqProtocolNames, Requests,
};
use polkadot_node_subsystem_test_helpers::derive_erasure_chunks_with_proofs_and_root;

use super::*;

use sc_network::{config::RequestResponseConfig, IfDisconnected, OutboundFailure, RequestFailure};

use polkadot_erasure_coding::{branches, obtain_chunks_v1 as obtain_chunks};
use polkadot_node_primitives::{BlockData, PoV, Proof};
use polkadot_node_subsystem::messages::{
AllMessages, NetworkBridgeTxMessage, RuntimeApiMessage, RuntimeApiRequest,
Expand Down Expand Up @@ -456,33 +456,6 @@ fn validator_authority_id(val_ids: &[Sr25519Keyring]) -> Vec<AuthorityDiscoveryI
val_ids.iter().map(|v| v.public().into()).collect()
}

fn derive_erasure_chunks_with_proofs_and_root(
n_validators: usize,
available_data: &AvailableData,
alter_chunk: impl Fn(usize, &mut Vec<u8>),
) -> (Vec<ErasureChunk>, Hash) {
let mut chunks: Vec<Vec<u8>> = obtain_chunks(n_validators, available_data).unwrap();

for (i, chunk) in chunks.iter_mut().enumerate() {
alter_chunk(i, chunk)
}

// create proofs for each erasure chunk
let branches = branches(chunks.as_ref());

let root = branches.root();
let erasure_chunks = branches
.enumerate()
.map(|(index, (proof, chunk))| ErasureChunk {
chunk: chunk.to_vec(),
index: ValidatorIndex(index as _),
proof: Proof::try_from(proof).unwrap(),
})
.collect::<Vec<ErasureChunk>>();

(erasure_chunks, root)
}

impl Default for TestState {
fn default() -> Self {
let validators = vec![
Expand Down
2 changes: 2 additions & 0 deletions polkadot/node/overseer/src/lib.rs
Original file line number Diff line number Diff line change
Expand Up @@ -276,6 +276,7 @@ impl From<FinalityNotification<Block>> for BlockInfo {

/// An event from outside the overseer scope, such
/// as the substrate framework or user interaction.
#[derive(Debug)]
pub enum Event {
/// A new block was imported.
///
Expand All @@ -300,6 +301,7 @@ pub enum Event {
}

/// Some request from outer world.
#[derive(Debug)]
pub enum ExternalRequest {
/// Wait for the activation of a particular hash
/// and be notified by means of the return channel.
Expand Down
61 changes: 61 additions & 0 deletions polkadot/node/subsystem-bench/Cargo.toml
Original file line number Diff line number Diff line change
@@ -0,0 +1,61 @@
[package]
name = "polkadot-subsystem-bench"
description = "Subsystem performance benchmark client"
version = "1.0.0"
authors.workspace = true
edition.workspace = true
license.workspace = true
readme = "README.md"
publish = false

[[bin]]
name = "subsystem-bench"
path = "src/subsystem-bench.rs"

# Prevent rustdoc error. Already documented from top-level Cargo.toml.
doc = false

[dependencies]
polkadot-node-subsystem = { path = "../subsystem" }
polkadot-node-subsystem-util = { path = "../subsystem-util" }
polkadot-node-subsystem-types = { path = "../subsystem-types" }
polkadot-node-primitives = { path = "../primitives" }
polkadot-primitives = { path = "../../primitives" }
polkadot-node-network-protocol = { path = "../network/protocol" }
polkadot-availability-recovery = { path = "../network/availability-recovery", features = ["subsystem-benchmarks"] }
color-eyre = { version = "0.6.1", default-features = false }
polkadot-overseer = { path = "../overseer" }
colored = "2.0.4"
assert_matches = "1.5"
async-trait = "0.1.57"
sp-keystore = { path = "../../../substrate/primitives/keystore" }
sc-keystore = { path = "../../../substrate/client/keystore" }
sp-core = { path = "../../../substrate/primitives/core" }
clap = { version = "4.4.6", features = ["derive"] }
futures = "0.3.21"
futures-timer = "3.0.2"
gum = { package = "tracing-gum", path = "../gum" }
polkadot-erasure-coding = { package = "polkadot-erasure-coding", path = "../../erasure-coding" }
log = "0.4.17"
env_logger = "0.9.0"
rand = "0.8.5"
parity-scale-codec = { version = "3.6.1", features = ["derive", "std"] }
tokio = "1.24.2"
clap-num = "1.0.2"
polkadot-node-subsystem-test-helpers = { path = "../subsystem-test-helpers" }
sp-keyring = { path = "../../../substrate/primitives/keyring" }
sp-application-crypto = { path = "../../../substrate/primitives/application-crypto" }
sc-network = { path = "../../../substrate/client/network" }
sc-service = { path = "../../../substrate/client/service" }
polkadot-node-metrics = { path = "../metrics" }
itertools = "0.11.0"
polkadot-primitives-test-helpers = { path = "../../primitives/test-helpers" }
prometheus_endpoint = { package = "substrate-prometheus-endpoint", path = "../../../substrate/utils/prometheus" }
prometheus = { version = "0.13.0", default-features = false }
serde = "1.0.192"
serde_yaml = "0.9"
paste = "1.0.14"
orchestra = { version = "0.3.3", default-features = false, features = ["futures_channel"] }

[features]
default = []
Loading

0 comments on commit 052b83f

Please sign in to comment.