
Running benchmarks

The benchmark module measures end-to-end performance of the Meta processing pipeline, from CSV input to triplestore upload.

uv run python -m oc_meta.run.benchmark -c <CONFIG> [options]
| Parameter | Default | Description |
|---|---|---|
| `-c, --config` | Required | Path to the benchmark config YAML |
| `--sizes` | None | Generate N synthetic records; pass multiple values for scalability analysis |
| `--runs` | 1 | Execute the benchmark multiple times for statistical analysis |
| `--seed` | 42 | Random seed for reproducible data |
| `--fresh-data` | False | Generate new data for each run |
| `--no-cleanup` | False | Skip the database reset after the benchmark |
| `--update-scenario` | False | Test graph diff performance (preload partial, then complete data) |
| `--preload-high-authors` | None | Preload a BR (bibliographic resource) with N authors before the benchmark |
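The schema of the config file passed via `-c` is not documented here. As a purely illustrative sketch of what a CSV-to-triplestore pipeline config might contain (every key name below is an assumption, not oc_meta's actual schema; consult the repository's `benchmark_config.yaml` example for the real keys):

```yaml
# Hypothetical keys only -- check the repository's benchmark_config.yaml
# for the actual schema before relying on these names.
csv_input_dir: ./benchmark_data/csv            # where synthetic CSV records are placed
triplestore_url: http://localhost:9999/sparql  # SPARQL endpoint receiving the upload
```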

Single run with 100 synthetic records:

uv run python -m oc_meta.run.benchmark -c benchmark_config.yaml --sizes 100

Statistical analysis with 5 runs:

uv run python -m oc_meta.run.benchmark -c benchmark_config.yaml --sizes 100 --runs 5

Scalability analysis across multiple sizes:

uv run python -m oc_meta.run.benchmark -c benchmark_config.yaml --sizes 10 50 100 500 --runs 3

Update scenario (tests graph diff when updating existing entities):

uv run python -m oc_meta.run.benchmark -c benchmark_config.yaml --sizes 100 --update-scenario

High-author stress test (simulates ATLAS paper with 2869 authors):

uv run python -m oc_meta.run.benchmark -c benchmark_config.yaml --preload-high-authors 2869

Reports are saved in oc_meta/run/benchmark/reports/:

  • benchmark_<size>.json - raw timing data and statistics
  • benchmark_<size>.png - phase breakdown and throughput charts

Each report covers:

  • Total duration
  • Throughput (records/sec)
  • Per-phase timing: curation (collect IDs, clean, merge), RDF creation, storage
  • Memory usage per phase
  • 95% confidence intervals (with multiple runs)
  • Outlier detection
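With `--runs` greater than 1, statistics like those above can be derived from the raw per-run durations. A minimal sketch of a 95% t-based confidence interval and simple IQR outlier flagging, using made-up numbers (this is not the module's actual implementation, whose methods may differ):

```python
import statistics

# Hypothetical per-run total durations in seconds (illustrative, not real output)
durations = [12.1, 11.8, 12.5, 12.0, 12.3]

mean = statistics.mean(durations)
sem = statistics.stdev(durations) / len(durations) ** 0.5  # standard error of the mean
t_crit = 2.776  # two-sided t critical value for 95% CI, n - 1 = 4 degrees of freedom
ci = (mean - t_crit * sem, mean + t_crit * sem)
print(f"mean={mean:.2f}s  95% CI=({ci[0]:.2f}, {ci[1]:.2f})")

# Simple IQR-based outlier detection: flag runs outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, _, q3 = statistics.quantiles(durations, n=4)
iqr = q3 - q1
outliers = [d for d in durations if d < q1 - 1.5 * iqr or d > q3 + 1.5 * iqr]
print("outliers:", outliers)
```

With so few runs, the t-interval is wide; this is why the module's multi-run mode exists.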