Generate CSV
The CSV generator creates a CSV dump from the RDF data.

    uv run python -m oc_meta.run.meta.generate_csv -c <CONFIG> -o <OUTPUT_DIR> [OPTIONS]

Required arguments
| Argument | Description |
|---|---|
| `-c, --config` | Path to Meta configuration file |
| `-o, --output_dir` | Directory where CSV files will be stored |
Optional arguments
| Argument | Default | Description |
|---|---|---|
| `--redis-host` | `localhost` | Redis server hostname |
| `--redis-port` | `6379` | Redis server port |
| `--redis-db` | `2` | Redis database number for caching |
| `--workers` | `4` | Number of parallel workers |
| `--clean` | - | Clear checkpoint and Redis cache before starting |
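
For orientation, the documented options could be declared with `argparse` roughly as follows. This is a sketch reconstructed from the two argument tables, not the script's actual parser: `build_parser` is a hypothetical name, and the types are inferred from the documented defaults.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(description="Generate a CSV dump from Meta RDF data")
    # Required arguments
    parser.add_argument("-c", "--config", required=True, help="Path to Meta configuration file")
    parser.add_argument("-o", "--output_dir", required=True, help="Directory where CSV files will be stored")
    # Optional arguments, with the defaults listed in the table
    parser.add_argument("--redis-host", default="localhost", help="Redis server hostname")
    parser.add_argument("--redis-port", type=int, default=6379, help="Redis server port")
    parser.add_argument("--redis-db", type=int, default=2, help="Redis database number for caching")
    parser.add_argument("--workers", type=int, default=4, help="Number of parallel workers")
    parser.add_argument("--clean", action="store_true", help="Clear checkpoint and Redis cache before starting")
    return parser


# Parse a sample command line instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["-c", "meta_config.yaml", "-o", "/data/csv_dump", "--workers", "8"])
print(args.workers, args.redis_db, args.clean)  # 8 2 False
```

Options that are not passed fall back to the defaults from the table, so only `-c` and `-o` are mandatory.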
Example

    uv run python -m oc_meta.run.meta.generate_csv \
      -c meta_config.yaml \
      -o /data/csv_dump \
      --workers 8

The script reads RDF data from the directory specified by `output_rdf_dir` in the configuration file. It expects:
- JSON-LD files compressed in ZIP archives
- Standard OpenCitations Meta directory structure (`br/`, `ra/`, `id/`, `ar/`, `re/`)
- Files organized by supplier prefix and numeric ranges
The script only processes bibliographic resources (BR), but resolves related entities:
- Identifiers (ID): DOI, PMID, PMCID, ISBN, ISSN, etc.
- Responsible agents (RA): Authors, editors, publishers with their identifiers
- Agent roles (AR): Links between BR and RA with ordering via `hasNext`
- Resource embodiments (RE): Page numbers
Output
CSV files
Output files are named `output_1.csv`, `output_2.csv`, etc., with a maximum of 3000 rows per file.
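
The file splitting can be pictured with the sketch below. It assumes the 3000-row limit counts data rows (not the header) and that every file repeats the header. `write_in_chunks` and `FIELDNAMES` are hypothetical names, and the column order follows the columns documented for the CSV format.

```python
import csv
import tempfile
from itertools import islice
from pathlib import Path
from typing import Dict, Iterable, List

MAX_ROWS_PER_FILE = 3000  # limit stated in the documentation
FIELDNAMES = ["id", "title", "author", "issue", "volume", "venue",
              "page", "pub_date", "type", "publisher", "editor"]


def write_in_chunks(rows: Iterable[Dict[str, str]], output_dir: str,
                    max_rows: int = MAX_ROWS_PER_FILE) -> List[Path]:
    """Write rows to output_1.csv, output_2.csv, ... with at most max_rows each."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    iterator = iter(rows)
    written: List[Path] = []
    while True:
        chunk = list(islice(iterator, max_rows))  # take the next batch of rows
        if not chunk:
            break
        path = out / f"output_{len(written) + 1}.csv"
        with path.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)  # missing columns become ""
            writer.writeheader()
            writer.writerows(chunk)
        written.append(path)
    return written


# Demo with a small limit so the split is visible: 7 rows, 3 rows per file.
with tempfile.TemporaryDirectory() as tmp:
    demo_rows = [{"id": f"omid:br/{i}", "title": f"Title {i}"} for i in range(7)]
    demo_paths = write_in_chunks(demo_rows, tmp, max_rows=3)
    file_names = [p.name for p in demo_paths]
    with demo_paths[-1].open(newline="", encoding="utf-8") as handle:
        last_file_rows = len(list(csv.DictReader(handle)))

print(file_names)  # ['output_1.csv', 'output_2.csv', 'output_3.csv']
```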
The CSV format matches the standard Meta input format:
| Column | Description |
|---|---|
| `id` | Space-separated identifiers (OMID + external IDs) |
| `title` | Publication title |
| `author` | Semicolon-separated authors in format `Family, Given [identifiers]` |
| `issue` | Issue number |
| `volume` | Volume number |
| `venue` | Venue title with identifiers |
| `page` | Page range (e.g., 123-456) |
| `pub_date` | Publication date |
| `type` | Publication type (e.g., journal article, book chapter) |
| `publisher` | Publisher with identifiers |
| `editor` | Semicolon-separated editors |
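
When consuming the dump, the `author` and `editor` cells have to be split back into individual agents. The helper below is a minimal sketch, assuming each agent looks like `Family, Given [id1 id2]` with space-separated identifiers inside the brackets and that names contain no semicolons. `parse_agents` is a hypothetical name and the sample values are made up.

```python
from typing import Dict, List, Union

Agent = Dict[str, Union[str, List[str]]]


def parse_agents(field: str) -> List[Agent]:
    """Split a semicolon-separated agent cell into family name, given name, and identifiers."""
    agents: List[Agent] = []
    for chunk in field.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue  # tolerate empty cells and trailing separators
        name, bracket, rest = chunk.partition("[")
        # Identifiers are whatever sits between the brackets, split on whitespace.
        ids = rest.rstrip("]").split() if bracket else []
        family, _, given = name.partition(",")
        agents.append({"family": family.strip(), "given": given.strip(), "ids": ids})
    return agents


sample = "Doe, Jane [omid:ra/061 orcid:0000-0000-0000-0001]; Roe, Richard [omid:ra/062]"
parsed = parse_agents(sample)
print([agent["family"] for agent in parsed])  # ['Doe', 'Roe']
```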
Checkpoint file
The script creates `processed_br_files.txt` in the output directory to track which RDF files have been processed. This enables resumability: if the script is interrupted, it will skip already processed files on restart.
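
A file-level checkpoint of this kind can be sketched as follows. The sketch assumes the checkpoint stores one processed file path per line, which the documentation does not specify. The helper names and the sample paths are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List, Set

CHECKPOINT_NAME = "processed_br_files.txt"  # file name given in the documentation


def load_checkpoint(output_dir: Path) -> Set[str]:
    """Return the set of files already recorded as processed (empty on a first run)."""
    path = output_dir / CHECKPOINT_NAME
    if not path.exists():
        return set()
    lines = path.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def mark_processed(output_dir: Path, rdf_file: str) -> None:
    """Append one processed file to the checkpoint (assumed format: one path per line)."""
    with (output_dir / CHECKPOINT_NAME).open("a", encoding="utf-8") as handle:
        handle.write(rdf_file + "\n")


def pending_files(output_dir: Path, candidates: Iterable[str]) -> List[str]:
    """Keep only the candidates that are not yet in the checkpoint."""
    done = load_checkpoint(output_dir)
    return [candidate for candidate in candidates if candidate not in done]


# Demo: after the first file is recorded, a restart only sees the other two.
with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)
    candidates = ["br/060/10000/1000.zip", "br/060/10000/2000.zip", "br/060/10000/3000.zip"]
    mark_processed(out_dir, candidates[0])
    remaining = pending_files(out_dir, candidates)

print(remaining)  # ['br/060/10000/2000.zip', 'br/060/10000/3000.zip']
```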
Redis cache
Processed OMIDs are stored in Redis to avoid duplicates. The cache persists across runs unless `--clean` is specified.
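
One way such a cache could work is a Redis set queried in batches through a pipeline. The sketch below makes several assumptions: that a set is the underlying data structure, that the key is called `processed_omids` (a hypothetical name), and that the client exposes `pipeline()`, `sismember()`, `sadd()`, and `execute()` the way redis-py does. A small in-memory stand-in replaces the real client so the example runs without a server.

```python
from typing import Dict, Iterable, List, Set

PROCESSED_KEY = "processed_omids"  # hypothetical key name


class InMemoryRedis:
    """Stand-in exposing only the calls used below; results are queued until execute()."""

    def __init__(self) -> None:
        self._sets: Dict[str, Set[str]] = {}
        self._queued: list = []

    def pipeline(self) -> "InMemoryRedis":
        return self

    def sismember(self, key: str, member: str) -> "InMemoryRedis":
        self._queued.append(member in self._sets.get(key, set()))
        return self

    def sadd(self, key: str, *members: str) -> "InMemoryRedis":
        target = self._sets.setdefault(key, set())
        self._queued.append(len(set(members) - target))  # number of newly added members
        target.update(members)
        return self

    def execute(self) -> list:
        results, self._queued = self._queued, []
        return results


def filter_new_omids(client, omids: Iterable[str]) -> List[str]:
    """Check all OMIDs in one round trip and keep those not seen before."""
    omids = list(omids)
    pipe = client.pipeline()
    for omid in omids:
        pipe.sismember(PROCESSED_KEY, omid)
    flags = pipe.execute()
    return [omid for omid, seen in zip(omids, flags) if not seen]


def mark_omids(client, omids: Iterable[str]) -> None:
    """Record a batch of OMIDs as processed."""
    omids = list(omids)
    if omids:
        pipe = client.pipeline()
        pipe.sadd(PROCESSED_KEY, *omids)
        pipe.execute()


client = InMemoryRedis()
mark_omids(client, ["omid:br/061", "omid:br/062"])
new_omids = filter_new_omids(client, ["omid:br/062", "omid:br/063"])
print(new_omids)  # ['omid:br/063']
```

With a real server, a redis-py client created with the documented host, port, and database would take the place of `InMemoryRedis`.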
Processing details
Skipped entities
The following entity types are skipped as standalone records (they appear only as venue containers):
- Journal volumes (`fabio:JournalVolume`)
- Journal issues (`fabio:JournalIssue`)
Author ordering
Authors, editors, and publishers are ordered by following the `oc:hasNext` chain from agent roles. The script detects the first agent role (one not referenced by any other `hasNext`) and follows the chain to maintain correct ordering.
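
The ordering logic can be illustrated with a small sketch. It assumes the agent roles of one resource and one role type have already been reduced to a mapping from each role to its `hasNext` target. `order_agent_roles` and the `max_iterations` default are hypothetical, and the role identifiers are made up.

```python
from typing import Dict, List, Optional


def order_agent_roles(next_of: Dict[str, Optional[str]], max_iterations: int = 10000) -> List[str]:
    """Order agent roles by finding the chain head and following hasNext links."""
    # The head is the role that no other role points to.
    referenced = {target for target in next_of.values() if target is not None}
    heads = [role for role in next_of if role not in referenced]
    if not heads:
        return []  # every role is referenced by another one: no usable starting point

    ordered: List[str] = []
    seen = set()
    current: Optional[str] = heads[0]
    iterations = 0
    while current is not None and current in next_of:
        if current in seen or iterations >= max_iterations:
            print(f"Warning: cycle detected in hasNext chain at {current}")
            break
        seen.add(current)
        ordered.append(current)
        current = next_of[current]
        iterations += 1
    return ordered


# Roles arrive in arbitrary order; the chain restores the intended sequence.
chain = {"ar/2": "ar/3", "ar/1": "ar/2", "ar/3": None}
ordered_roles = order_agent_roles(chain)
print(ordered_roles)  # ['ar/1', 'ar/2', 'ar/3']

# A broken chain where ar/3 points back to ar/2 stops with a warning.
looped_roles = order_agent_roles({"ar/1": "ar/2", "ar/2": "ar/3", "ar/3": "ar/2"})
```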
Venue hierarchy
For journal articles, the script traverses the `frbr:partOf` hierarchy:

    Article → Issue → Volume → Journal

Along the way it extracts the issue number, the volume number, and the journal title with its identifiers.
Cycle detection
The script includes safeguards against:
- Cycles in `hasNext` chains (max iterations limit)
- Cycles in venue hierarchy (visited set + max depth of 5)
Warnings are printed when cycles are detected.
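
The venue traversal with both safeguards (visited set and depth limit) might look like the sketch below. Entities are represented as plain dictionaries with hypothetical `type`, `number`, `title`, and `part_of` keys rather than the actual JSON-LD structure, and `resolve_venue` is a hypothetical name.

```python
from typing import Dict, Optional

MAX_DEPTH = 5  # depth limit stated in the documentation


def resolve_venue(article_id: str, entities: Dict[str, dict]) -> Dict[str, Optional[str]]:
    """Walk Article -> Issue -> Volume -> Journal and collect issue, volume, and venue."""
    result: Dict[str, Optional[str]] = {"issue": None, "volume": None, "venue": None}
    visited = {article_id}
    current = entities.get(article_id, {}).get("part_of")
    depth = 0
    while current is not None and current in entities:
        if current in visited or depth >= MAX_DEPTH:
            print(f"Warning: cycle or excessive depth in venue hierarchy at {current}")
            break
        visited.add(current)
        entity = entities[current]
        if entity["type"] == "fabio:JournalIssue":
            result["issue"] = entity.get("number")
        elif entity["type"] == "fabio:JournalVolume":
            result["volume"] = entity.get("number")
        elif entity["type"] == "fabio:Journal":
            result["venue"] = entity.get("title")
        current = entity.get("part_of")  # move one level up the hierarchy
        depth += 1
    return result


entities = {
    "br/1": {"type": "fabio:JournalArticle", "part_of": "br/2"},
    "br/2": {"type": "fabio:JournalIssue", "number": "3", "part_of": "br/3"},
    "br/3": {"type": "fabio:JournalVolume", "number": "12", "part_of": "br/4"},
    "br/4": {"type": "fabio:Journal", "title": "Journal of Examples", "part_of": None},
}
venue_info = resolve_venue("br/1", entities)
print(venue_info)  # {'issue': '3', 'volume': '12', 'venue': 'Journal of Examples'}

# If the journal wrongly points back to the issue, the visited set stops the walk.
looped = dict(entities)
looped["br/4"] = {"type": "fabio:Journal", "title": "Journal of Examples", "part_of": "br/2"}
looped_info = resolve_venue("br/1", looped)
```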
Resumability
The script supports resuming interrupted processing:
- File-level checkpoint: Tracks processed RDF files in `processed_br_files.txt`
- Entity-level cache: Stores processed OMIDs in Redis
To start fresh, use the --clean flag:

    uv run python -m oc_meta.run.meta.generate_csv \
      -c meta_config.yaml \
      -o /data/csv_dump \
      --clean

This removes the checkpoint file and clears the Redis cache.
Performance
- Uses multiprocessing with configurable worker count
- LRU cache (2000 entries) for loaded JSON files
- Redis pipeline batching for efficient cache operations
- Progress bar shows processing status and time estimates
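
As an illustration of the file cache, the sketch below wraps a ZIP loader in `functools.lru_cache` with the documented size of 2000 entries. It assumes each archive holds a single JSON document. `load_zipped_json`, the file names, and the sample IRI are hypothetical, and the script's own cache may be implemented differently.

```python
import json
import tempfile
import zipfile
from functools import lru_cache
from pathlib import Path


@lru_cache(maxsize=2000)  # cache size stated in the documentation
def load_zipped_json(zip_path: str):
    """Load the first member of a ZIP archive as JSON; repeated calls hit the cache."""
    with zipfile.ZipFile(zip_path) as archive:
        first_member = archive.namelist()[0]  # assumption: one JSON file per archive
        with archive.open(first_member) as handle:
            return json.load(handle)


# Demo: build a tiny archive, load it twice, and inspect the cache counters.
with tempfile.TemporaryDirectory() as tmp:
    archive_path = str(Path(tmp) / "1000.zip")
    with zipfile.ZipFile(archive_path, "w") as archive:
        archive.writestr("1000.json", json.dumps([{"@id": "https://w3id.org/oc/meta/br/0601"}]))
    first = load_zipped_json(archive_path)
    second = load_zipped_json(archive_path)  # served from the cache, no disk access
    stats = load_zipped_json.cache_info()

print(stats.hits, stats.misses)  # 1 1
```

Because the cache returns the same object on every hit, callers should treat the loaded data as read-only.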