TPC-H Benchmarks

This document will help you run the TPC-H benchmarks in this directory.

Setup

Clone this repository

git clone git@github.com:coiled/benchmarks
cd benchmarks

Follow the environment creation steps from the root directory of this repository:

mamba env create -n tpch -f ci/environment.yml
conda activate tpch
pip-compile ci/requirements-2nightly.in         # Or `ci/requirements-2tpch-non-dask.in` if you want Spark/DuckDB/Polars
pip install -r ci/requirements-2nightly.txt     # Install the .txt file that matches the .in file you compiled

Run Dask Benchmarks

pytest --benchmark tests/tpch/test_dask.py

Configure

By default the benchmarks run at scale 100 (about 100 GB) on the cloud with Coiled. To change this, edit the values of _local and _scale at the top of the conftest.py file in this directory.

Local Data Generation

If you want to run locally, you'll need to generate data. Run the following from the root directory of this repository.

python tests/tpch/generate_data.py --scale 10

Run Many Tests

When running on the cloud you can run many tests simultaneously. We recommend using pytest-xdist for this with the following options:

-n 4: run four parallel jobs
--dist loadscope: group tests by module, so each module runs on a single worker

pytest --benchmark -n 4 --dist loadscope tests/tpch
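If you want to launch the same run from a script, for example to vary the number of workers, a minimal sketch could assemble the command as below. The helper name build_pytest_command is hypothetical and not part of this repository; the flags are the ones shown above.

```python
import shlex


def build_pytest_command(target="tests/tpch", workers=4, dist="loadscope"):
    """Return the pytest invocation as a list of arguments.

    Hypothetical helper for illustration; it only assembles the flags
    shown above and does not run anything.
    """
    return [
        "pytest",
        "--benchmark",
        "-n", str(workers),  # number of parallel pytest-xdist workers
        "--dist", dist,      # how pytest-xdist distributes tests to workers
        target,
    ]


# Render the default command as a shell-safe string.
command_line = shlex.join(build_pytest_command())
```

Passing the returned list to subprocess.run(..., check=True) from the root of the repository would then start the run.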

Generate Plots

Timing results are written to benchmark.db in the root of this repository. You can generate charts from them using either the notebook visualize.ipynb in this directory (recommended) or the generate-plot.py script in this directory. Both require ibis and altair, which the setup steps above do not install.

These are both meant to be run from the root directory of this repository.

Both pull the most recent records for each query/library pairing. If you're changing scales and want clean results, delete your benchmark.db file between experiments; it is regenerated automatically.
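To make the "most recent record per pairing" selection concrete, here is a small self-contained sketch using an in-memory SQLite database. It assumes benchmark.db is a SQLite file; the table name runs, its columns, and the sample rows are all hypothetical and made up for illustration, so the real layout may differ.

```python
import sqlite3

# Hypothetical schema and made-up rows for illustration only;
# the real benchmark.db layout may differ.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE runs (query INTEGER, library TEXT, started TEXT, duration REAL)"
)
conn.executemany(
    "INSERT INTO runs VALUES (?, ?, ?, ?)",
    [
        (1, "dask", "2024-01-01T10:00:00", 12.5),
        (1, "dask", "2024-01-02T10:00:00", 11.0),  # newer run of the same pairing
        (1, "polars", "2024-01-01T10:00:00", 9.0),
        (2, "dask", "2024-01-01T11:00:00", 20.0),
    ],
)


def latest_per_pairing(conn):
    """Return the most recent row for each (query, library) pairing."""
    sql = """
        SELECT r.query, r.library, r.duration
        FROM runs AS r
        JOIN (
            SELECT query, library, MAX(started) AS started
            FROM runs
            GROUP BY query, library
        ) AS newest
          ON r.query = newest.query
         AND r.library = newest.library
         AND r.started = newest.started
        ORDER BY r.query, r.library
    """
    return conn.execute(sql).fetchall()


latest = latest_per_pairing(conn)
```

Because only the newest row per pairing is kept, a pairing that was not rerun after a scale change would presumably still contribute its older record, which is why starting from a fresh benchmark.db gives cleaner comparisons.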
