Generating test data#
Generates CSV files with synthetic bibliographic metadata for benchmark testing.
Usage#
uv run python -m oc_meta.run.benchmark.generate_benchmark_data -o <OUTPUT> [options]
Parameters#
Parameter |
Default |
Description |
|---|---|---|
|
Required |
Output CSV file path |
|
100 |
Number of records to generate |
|
42 |
Random seed for reproducibility |
Example#
uv run python -m oc_meta.run.benchmark.generate_benchmark_data \
-o test_data.csv \
-s 1000 \
--seed 123
Generated data#
Each record includes:
Field |
Values |
|---|---|
id |
Synthetic DOI (10.1038/benchmark.NNNNNN), optionally PMID |
title |
Random selection from sample titles |
author |
1-5 authors with ORCID identifiers |
pub_date |
Random date 2015-2024 |
venue |
Random journal with ISSN |
volume |
1-50 |
issue |
1-12 |
page |
Random page range |
type |
journal article, review, conference paper, etc. |
publisher |
Random publisher with Crossref ID |
The generator uses fixed sample data to produce realistic but synthetic records. Same seed produces identical output.