Generating test data

Generates CSV files with synthetic bibliographic metadata for benchmark testing.

uv run python -m oc_meta.run.benchmark.generate_benchmark_data -o <OUTPUT> [options]

Parameter      Default    Description
-o, --output   Required   Output CSV file path
-s, --size     100        Number of records to generate
--seed         42         Random seed for reproducibility

Example:

uv run python -m oc_meta.run.benchmark.generate_benchmark_data \
-o test_data.csv \
-s 1000 \
--seed 123
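
A quick way to sanity-check a generated file is to count its rows and confirm the expected columns are present. The following is a minimal sketch, not part of oc_meta: `summarize_csv` and `EXPECTED_COLUMNS` are hypothetical names, and the column list is assumed from the field list further down this page (the real file may contain additional columns).

```python
import csv
from pathlib import Path

# Column names assumed from the documented fields; the real file may have more.
EXPECTED_COLUMNS = [
    "id", "title", "author", "pub_date", "venue",
    "volume", "issue", "page", "type", "publisher",
]


def summarize_csv(path):
    """Return (number of data rows, expected columns missing from the header)."""
    with Path(path).open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        columns = reader.fieldnames or []  # reads the header row
        row_count = sum(1 for _ in reader)  # counts the remaining data rows
    missing = [name for name in EXPECTED_COLUMNS if name not in columns]
    return row_count, missing
```

Calling `summarize_csv("test_data.csv")` after the example command should report 1000 rows, assuming `-s` counts data rows and excludes the header.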

Each record includes:

Field       Values
id          Synthetic DOI (10.1038/benchmark.NNNNNN), optionally PMID
title       Random selection from sample titles
author      1-5 authors with ORCID identifiers
pub_date    Random date 2015-2024
venue       Random journal with ISSN
volume      1-50
issue       1-12
page        Random page range
type        journal article, review, conference paper, etc.
publisher   Random publisher with Crossref ID
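
To make the field list concrete, here is a rough sketch of how one such record could be assembled. It is an illustration under stated assumptions, not the generator's actual code: `make_record`, the sample pools, and the value formatting (identifier prefixes in square brackets, `; ` between authors, a zero-padded index in the DOI) are all hypothetical, and the optional PMID is omitted.

```python
import random

# Hypothetical sample pools; the real generator ships its own fixed lists.
SAMPLE_TITLES = [
    "Synthetic approaches to citation analysis",
    "Benchmarking bibliographic pipelines",
    "Open metadata at scale",
]
SAMPLE_AUTHORS = [  # placeholder names and ORCID-shaped identifiers
    ("Rossi, Anna", "0000-0001-0000-0001"),
    ("Smith, John", "0000-0001-0000-0002"),
    ("Tanaka, Yuki", "0000-0001-0000-0003"),
    ("Garcia, Maria", "0000-0001-0000-0004"),
    ("Nguyen, Linh", "0000-0001-0000-0005"),
]
SAMPLE_VENUES = [("Journal of Synthetic Data", "0000-0001"), ("Benchmark Letters", "0000-0002")]
SAMPLE_PUBLISHERS = [("Example Press", "1"), ("Placeholder Publishing", "2")]
SAMPLE_TYPES = ["journal article", "review", "conference paper"]


def make_record(rng, index):
    """Build one synthetic record whose fields follow the ranges listed above."""
    authors = rng.sample(SAMPLE_AUTHORS, k=rng.randint(1, 5))  # 1-5 distinct authors
    venue_name, issn = rng.choice(SAMPLE_VENUES)
    publisher_name, crossref_id = rng.choice(SAMPLE_PUBLISHERS)
    first_page = rng.randint(1, 900)
    last_page = first_page + rng.randint(1, 30)
    year, month, day = rng.randint(2015, 2024), rng.randint(1, 12), rng.randint(1, 28)
    return {
        "id": f"doi:10.1038/benchmark.{index:06d}",  # assumed zero-padded index
        "title": rng.choice(SAMPLE_TITLES),
        "author": "; ".join(f"{name} [orcid:{orcid}]" for name, orcid in authors),
        "pub_date": f"{year:04d}-{month:02d}-{day:02d}",
        "venue": f"{venue_name} [issn:{issn}]",
        "volume": str(rng.randint(1, 50)),
        "issue": str(rng.randint(1, 12)),
        "page": f"{first_page}-{last_page}",
        "type": rng.choice(SAMPLE_TYPES),
        "publisher": f"{publisher_name} [crossref:{crossref_id}]",
    }


record = make_record(random.Random(42), index=1)
for field, value in record.items():
    print(f"{field}: {value}")
```

Passing a dedicated `random.Random` instance instead of using the module-level functions keeps the sketch independent of any other code that touches the global random state.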

The generator uses fixed sample data to produce realistic but synthetic records. Running it again with the same seed produces identical output.
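
The reproducibility guarantee can be illustrated with Python's standard `random` module. This is a simplified sketch rather than the generator itself: `generate_csv` is a hypothetical helper that writes only three columns, and it assumes that all randomness flows through a single seeded `random.Random` instance.

```python
import csv
import io
import random


def generate_csv(size, seed):
    """Return CSV text for `size` rows, driven entirely by one seeded RNG."""
    rng = random.Random(seed)
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["id", "volume", "issue"])
    for index in range(1, size + 1):
        writer.writerow([
            f"doi:10.1038/benchmark.{index:06d}",
            rng.randint(1, 50),
            rng.randint(1, 12),
        ])
    return buffer.getvalue()


first = generate_csv(size=100, seed=42)
second = generate_csv(size=100, seed=42)
other = generate_csv(size=100, seed=123)

print("same seed, same output:", first == second)
print("different seed, same output:", first == other)
```

Because every random draw comes from the seeded instance, two runs with the same seed and size walk through the same sequence of values and yield byte-identical text.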