Generating test data
Generates CSV files with synthetic bibliographic metadata for benchmark testing.
uv run python -m oc_meta.run.benchmark.generate_benchmark_data -o <OUTPUT> [options]Parameters
Section titled “Parameters”| Parameter | Default | Description |
|---|---|---|
-o, --output | Required | Output CSV file path |
-s, --size | 100 | Number of records to generate |
--seed | 42 | Random seed for reproducibility |
Example
Section titled “Example”uv run python -m oc_meta.run.benchmark.generate_benchmark_data \ -o test_data.csv \ -s 1000 \ --seed 123Generated data
Section titled “Generated data”Each record includes:
| Field | Values |
|---|---|
| id | Synthetic DOI (10.1038/benchmark.NNNNNN), optionally PMID |
| title | Random selection from sample titles |
| author | 1-5 authors with ORCID identifiers |
| pub_date | Random date 2015-2024 |
| venue | Random journal with ISSN |
| volume | 1-50 |
| issue | 1-12 |
| page | Random page range |
| type | journal article, review, conference paper, etc. |
| publisher | Random publisher with Crossref ID |
The generator uses fixed sample data to produce realistic but synthetic records. Same seed produces identical output.