Generate CSV#
The CSV generator creates a CSV dump from the RDF data.
Usage#
uv run python -m oc_meta.run.meta.generate_csv -c <CONFIG> -o <OUTPUT_DIR> [OPTIONS]
Required arguments#
| Argument | Description |
|---|---|
| -c | Path to Meta configuration file |
| -o | Directory where CSV files will be stored |
Optional arguments#
| Argument | Default | Description |
|---|---|---|
| | | Redis server hostname |
| | | Redis server port |
| | | Redis database number for caching |
| --workers | | Number of parallel workers |
| --clean | - | Clear checkpoint and Redis cache before starting |
Example#
uv run python -m oc_meta.run.meta.generate_csv \
-c meta_config.yaml \
-o /data/csv_dump \
--workers 8
Input#
The script reads RDF data from the directory specified by output_rdf_dir in the configuration file. It expects:
JSON-LD files compressed in ZIP archives
Standard OpenCitations Meta directory structure (br/, ra/, id/, ar/, re/)
Files organized by supplier prefix and numeric ranges
The script only processes bibliographic resources (BR), but resolves related entities:
Identifiers (ID): DOI, PMID, PMCID, ISBN, ISSN, etc.
Responsible agents (RA): Authors, editors, publishers with their identifiers
Agent roles (AR): Links between BR and RA with ordering via hasNext
Resource embodiments (RE): Page numbers
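The hasNext ordering of agent roles behaves like a linked list: each role points at the role that follows it. A minimal sketch of how such a chain could be put in order, assuming the roles have already been loaded into a plain dict that maps each role URI to its successor. The function name `order_agent_roles`, the dict layout, and the `max_iterations` default are illustrative and not taken from the script; the `seen` set is an extra guard added here for clarity.

```python
from typing import Dict, List, Optional


def order_agent_roles(next_of: Dict[str, Optional[str]],
                      max_iterations: int = 10000) -> List[str]:
    """Return role URIs in chain order by following hasNext-style links."""
    # The first role in the chain is the one that no other role points to.
    targets = {nxt for nxt in next_of.values() if nxt is not None}
    heads = [role for role in next_of if role not in targets]
    if not heads:
        return []  # every role is pointed to, so the chain is a pure cycle

    ordered: List[str] = []
    seen = set()
    current: Optional[str] = heads[0]
    # Stop at the end of the chain, at a repeated role, or at the iteration cap.
    while current is not None and len(ordered) < max_iterations:
        if current in seen:
            print(f"Warning: cycle detected at {current}")
            break
        seen.add(current)
        ordered.append(current)
        current = next_of.get(current)
    return ordered
```

With `{"ar/1": "ar/2", "ar/2": "ar/3", "ar/3": None}` as input, the roles come back in the order `ar/1`, `ar/2`, `ar/3`.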
Output#
CSV files#
Output files are named output_1.csv, output_2.csv, etc., with a maximum of 3000 rows per file.
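The file naming and row limit can be illustrated with a short sketch. It assumes the limit counts data rows (the header is written in addition) and that rows arrive as dictionaries; the helper name `write_chunked_csv` and its parameters are hypothetical and may differ from what the script does internally.

```python
import csv
from pathlib import Path
from typing import Dict, Iterable, List

MAX_ROWS_PER_FILE = 3000  # row limit stated in this section


def write_chunked_csv(rows: Iterable[Dict[str, str]], fieldnames: List[str],
                      output_dir: str,
                      max_rows: int = MAX_ROWS_PER_FILE) -> List[Path]:
    """Write rows to output_1.csv, output_2.csv, ... with at most max_rows each."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    chunk: List[Dict[str, str]] = []

    def flush() -> None:
        # File numbering starts at 1 and grows with each completed chunk.
        path = out / f"output_{len(written) + 1}.csv"
        with path.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(chunk)
        written.append(path)
        chunk.clear()

    for row in rows:
        chunk.append(row)
        if len(chunk) >= max_rows:
            flush()
    if chunk:
        flush()  # write the final, possibly shorter, file
    return written
```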
The CSV format matches the standard Meta input format:
| Column | Description |
|---|---|
| | Space-separated identifiers (OMID + external IDs) |
| | Publication title |
| | Semicolon-separated authors in format |
| | Issue number |
| | Volume number |
| | Venue title with identifiers |
| | Page range (e.g., |
| | Publication date |
| | Publication type (e.g., |
| | Publisher with identifiers |
| | Semicolon-separated editors |
Checkpoint file#
The script creates processed_br_files.txt in the output directory to track which RDF files have been processed. This enables resumability: if the script is interrupted, it will skip already processed files on restart.
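A file-level checkpoint of this kind can be sketched in a few lines. The example assumes the checkpoint stores one processed file path per line, which is a guess about the file layout; the helper names `load_checkpoint`, `mark_processed`, and `pending_files` are hypothetical.

```python
from pathlib import Path
from typing import Iterable, Iterator, Set

CHECKPOINT_NAME = "processed_br_files.txt"


def load_checkpoint(output_dir: str) -> Set[str]:
    """Return the set of file paths already recorded in the checkpoint."""
    path = Path(output_dir) / CHECKPOINT_NAME
    if not path.exists():
        return set()
    with path.open(encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def mark_processed(output_dir: str, rdf_file: str) -> None:
    """Append one processed file path to the checkpoint."""
    path = Path(output_dir) / CHECKPOINT_NAME
    with path.open("a", encoding="utf-8") as handle:
        handle.write(rdf_file + "\n")


def pending_files(all_files: Iterable[str], output_dir: str) -> Iterator[str]:
    """Yield only the files that are not yet listed in the checkpoint."""
    done = load_checkpoint(output_dir)
    for rdf_file in all_files:
        if rdf_file not in done:
            yield rdf_file
```

Appending a path only after a file has been fully handled means an interruption costs at most the file that was in progress.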
Redis cache#
Processed OMIDs are stored in Redis to avoid duplicates. The cache persists across runs unless --clean is specified.
Processing details#
Skipped entities#
The following entity types are skipped as standalone records (they appear only as venue containers):
Journal volumes (fabio:JournalVolume)
Journal issues (fabio:JournalIssue)
Venue hierarchy#
For journal articles, the script traverses the frbr:partOf hierarchy:
Article → Issue → Volume → Journal
Along the way, it extracts the issue number, volume number, and journal title with its identifiers.
Cycle detection#
The script includes safeguards against:
Cycles in hasNext chains (max iterations limit)
Cycles in venue hierarchy (visited set + max depth of 5)
Warnings are printed when cycles are detected.
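The venue traversal and its safeguards can be sketched as a bounded walk up the frbr:partOf links. The example assumes entities have been simplified into dicts with `type`, `number`, `title`, and `part_of` keys; that layout and the function name `collect_venue_info` are illustrative only, not the script's actual data model.

```python
from typing import Dict, Optional

MAX_DEPTH = 5  # depth limit stated in this section


def collect_venue_info(start: str,
                       entities: Dict[str, dict]) -> Dict[str, Optional[str]]:
    """Walk part_of links upward, collecting issue, volume, and journal title."""
    info: Dict[str, Optional[str]] = {"issue": None, "volume": None, "venue": None}
    visited = {start}
    current = entities.get(start, {}).get("part_of")
    depth = 0
    while current is not None and depth < MAX_DEPTH:
        if current in visited:
            # A container that was already seen means the hierarchy loops.
            print(f"Warning: cycle detected in venue hierarchy at {current}")
            break
        visited.add(current)
        entity = entities.get(current, {})
        kind = entity.get("type")
        if kind == "issue":
            info["issue"] = entity.get("number")
        elif kind == "volume":
            info["volume"] = entity.get("number")
        elif kind == "journal":
            info["venue"] = entity.get("title")
        current = entity.get("part_of")
        depth += 1
    return info
```

The visited set catches loops as soon as a container repeats, while the depth limit bounds the walk even when the data is malformed in some other way.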
Resumability#
The script supports resuming interrupted processing:
File-level checkpoint: Tracks processed RDF files in processed_br_files.txt
Entity-level cache: Stores processed OMIDs in Redis
To start fresh, use the --clean flag:
uv run python -m oc_meta.run.meta.generate_csv \
-c meta_config.yaml \
-o /data/csv_dump \
--clean
This removes the checkpoint file and clears the Redis cache.
Performance#
Uses multiprocessing with configurable worker count
LRU cache (2000 entries) for loaded JSON files
Redis pipeline batching for efficient cache operations
Progress bar shows processing status and time estimates
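The LRU cache for loaded JSON files can be approximated with `functools.lru_cache` from the standard library. This is a sketch under assumptions: each ZIP archive is taken to hold one JSON file, and the function name `load_json_from_zip` is hypothetical. The script's own loader may be organized differently.

```python
import json
import zipfile
from functools import lru_cache
from typing import Any


@lru_cache(maxsize=2000)  # cache size stated in this section
def load_json_from_zip(zip_path: str) -> Any:
    """Parse the first JSON member of a ZIP archive, caching the result per path."""
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.endswith(".json"):
                with archive.open(name) as member:
                    return json.load(member)
    return None  # archive contains no JSON file
```

Repeated lookups of entities stored in the same archive then skip decompression and parsing. Cached objects are shared between callers, so they should be treated as read-only, and with multiprocessing each worker process holds its own cache.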