Generating test data

Generates CSV files with synthetic bibliographic metadata for benchmark testing.

Usage

uv run python -m oc_meta.run.benchmark.generate_benchmark_data -o <OUTPUT> [options]

Parameter	Default	Description
`-o`, `--output`	Required	Output CSV file path
`-s`, `--size`	100	Number of records to generate
`--seed`	42	Random seed for reproducibility

uv run python -m oc_meta.run.benchmark.generate_benchmark_data \
    -o test_data.csv \
    -s 1000 \
    --seed 123

Each record includes:

Field	Values
id	Synthetic DOI (10.1038/benchmark.NNNNNN), optionally PMID
title	Random selection from sample titles
author	1-5 authors with ORCID identifiers
pub_date	Random date 2015-2024
venue	Random journal with ISSN
volume	1-50
issue	1-12
page	Random page range
type	journal article, review, conference paper, etc.
publisher	Random publisher with Crossref ID

The generator uses fixed sample data to produce realistic but synthetic records. Same seed produces identical output.