
Generate CSV

The CSV generator creates a CSV dump from the RDF data.

```bash
uv run python -m oc_meta.run.meta.generate_csv -c <CONFIG> -o <OUTPUT_DIR> [OPTIONS]
```
| Argument | Description |
| --- | --- |
| `-c, --config` | Path to Meta configuration file |
| `-o, --output_dir` | Directory where CSV files will be stored |
| Argument | Default | Description |
| --- | --- | --- |
| `--redis-host` | `localhost` | Redis server hostname |
| `--redis-port` | `6379` | Redis server port |
| `--redis-db` | `2` | Redis database number for caching |
| `--workers` | `4` | Number of parallel workers |
| `--clean` | - | Clear checkpoint and Redis cache before starting |
```bash
uv run python -m oc_meta.run.meta.generate_csv \
  -c meta_config.yaml \
  -o /data/csv_dump \
  --workers 8
```

The script reads RDF data from the directory specified by output_rdf_dir in the configuration file. It expects:

  • JSON-LD files compressed in ZIP archives
  • Standard OpenCitations Meta directory structure (br/, ra/, id/, ar/, re/)
  • Files organized by supplier prefix and numeric ranges
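
A walk over such a layout can be sketched as follows. This is an illustrative helper, not the script's actual code, and it assumes each ZIP archive under `br/` contains JSON-LD files:

```python
import io
import json
import zipfile
from pathlib import Path

def iter_jsonld_graphs(rdf_dir):
    """Yield (zip_path, parsed JSON-LD) for every JSON member of every
    ZIP archive found under the br/ tree of the RDF output directory."""
    for zip_path in sorted(Path(rdf_dir, "br").rglob("*.zip")):
        with zipfile.ZipFile(zip_path) as archive:
            for name in archive.namelist():
                if name.endswith(".json"):
                    with archive.open(name) as fh:
                        yield zip_path, json.load(io.TextIOWrapper(fh, "utf-8"))
```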

The script processes only bibliographic resources (BR), but resolves the related entities needed to populate each row:

  • Identifiers (ID): DOI, PMID, PMCID, ISBN, ISSN, etc.
  • Responsible agents (RA): Authors, editors, publishers with their identifiers
  • Agent roles (AR): Links between BR and RA with ordering via hasNext
  • Resource embodiments (RE): Page numbers

Output files are named output_1.csv, output_2.csv, etc., with a maximum of 3000 rows per file.
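
The chunked naming scheme can be sketched like this (a hypothetical helper, assuming the rows are already assembled in memory):

```python
import csv
from pathlib import Path

MAX_ROWS = 3000  # cap per output file, as described above

def write_chunked_csv(rows, fieldnames, output_dir):
    """Write dict rows to output_1.csv, output_2.csv, ... with at most
    MAX_ROWS data rows per file. Returns the number of files written."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    for chunk_no, start in enumerate(range(0, len(rows), MAX_ROWS), start=1):
        path = output_dir / f"output_{chunk_no}.csv"
        with open(path, "w", newline="", encoding="utf-8") as fh:
            writer = csv.DictWriter(fh, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows[start:start + MAX_ROWS])
    return chunk_no if rows else 0
```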

The CSV format matches the standard Meta input format:

| Column | Description |
| --- | --- |
| `id` | Space-separated identifiers (OMID + external IDs) |
| `title` | Publication title |
| `author` | Semicolon-separated authors in the format `Family, Given [identifiers]` |
| `issue` | Issue number |
| `volume` | Volume number |
| `venue` | Venue title with identifiers |
| `page` | Page range (e.g., `123-456`) |
| `pub_date` | Publication date |
| `type` | Publication type (e.g., `journal article`, `book chapter`) |
| `publisher` | Publisher with identifiers |
| `editor` | Semicolon-separated editors |

The script creates processed_br_files.txt in the output directory to track which RDF files have been processed. This enables resumability: if the script is interrupted, it will skip already processed files on restart.
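
The checkpoint mechanism can be sketched as follows (illustrative helpers, not the script's actual code):

```python
from pathlib import Path

def load_checkpoint(output_dir):
    """Return the set of RDF file paths recorded as already processed."""
    checkpoint = Path(output_dir) / "processed_br_files.txt"
    if checkpoint.exists():
        return set(checkpoint.read_text(encoding="utf-8").splitlines())
    return set()

def mark_processed(output_dir, rdf_file):
    """Append one processed file path per line so a restart can skip it."""
    checkpoint = Path(output_dir) / "processed_br_files.txt"
    with open(checkpoint, "a", encoding="utf-8") as fh:
        fh.write(f"{rdf_file}\n")
```

On restart, the script would simply skip any file whose path is in the returned set.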

Processed OMIDs are stored in Redis to avoid duplicates. The cache persists across runs unless --clean is specified.
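
Such a dedup check maps naturally onto Redis's `SADD`, which returns 1 only when the member was not already in the set. The sketch below uses a minimal in-memory stand-in for a `redis.Redis` client so it runs without a server; the key name is illustrative:

```python
class FakeRedis:
    """Tiny in-memory stand-in for redis.Redis, supporting only sadd."""
    def __init__(self):
        self._sets = {}

    def sadd(self, key, member):
        members = self._sets.setdefault(key, set())
        if member in members:
            return 0  # already present: not added
        members.add(member)
        return 1  # newly added

def is_new_omid(client, omid, key="processed_omids"):
    """True only the first time an OMID is seen, so duplicates are skipped."""
    return client.sadd(key, omid) == 1
```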

The following entity types are skipped as standalone records (they appear only as venue containers):

  • Journal volumes (fabio:JournalVolume)
  • Journal issues (fabio:JournalIssue)

Authors, editors, and publishers are ordered by following the oc:hasNext chain from agent roles. The script detects the first agent role (one not referenced by any other hasNext) and follows the chain to maintain correct ordering.
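
That ordering logic can be sketched with a simplified model in which each agent role maps to the role its `hasNext` points to (illustrative code, not the script's implementation):

```python
def order_agent_roles(roles):
    """Order agent roles by following hasNext links.

    `roles` maps each role IRI to the IRI it points to via oc:hasNext
    (None for the last role). The first role is the one no other role
    points to; the loop bound acts as a max-iterations cycle guard."""
    targets = set(roles.values())
    first = next((r for r in roles if r not in targets), None)
    ordered, current = [], first
    for _ in range(len(roles)):  # max-iterations guard against hasNext cycles
        if current is None:
            break
        ordered.append(current)
        current = roles.get(current)
    return ordered
```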

For journal articles, the script traverses the frbr:partOf hierarchy:

Article → Issue → Volume → Journal

This yields the issue number, volume number, and journal title with identifiers.

The script includes safeguards against:

  • Cycles in hasNext chains (max iterations limit)
  • Cycles in venue hierarchy (visited set + max depth of 5)

Warnings are printed when cycles are detected.
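
The venue-hierarchy safeguard can be illustrated with a simplified traversal (a hypothetical helper where `part_of` maps each entity IRI to its container IRI):

```python
MAX_DEPTH = 5  # matches the depth limit described above

def collect_venue_chain(start, part_of):
    """Follow frbr:partOf links upward from an entity.

    A visited set plus a depth cap stops traversal if the hierarchy
    contains a cycle, mirroring the safeguards described above."""
    chain, visited = [], set()
    current = part_of.get(start)
    while current and current not in visited and len(chain) < MAX_DEPTH:
        visited.add(current)
        chain.append(current)
        current = part_of.get(current)
    return chain  # e.g. [issue, volume, journal]
```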

The script supports resuming interrupted processing:

  1. File-level checkpoint: Tracks processed RDF files in processed_br_files.txt
  2. Entity-level cache: Stores processed OMIDs in Redis

To start fresh, use the --clean flag:

```bash
uv run python -m oc_meta.run.meta.generate_csv \
  -c meta_config.yaml \
  -o /data/csv_dump \
  --clean
```

This removes the checkpoint file and clears the Redis cache.

  • Uses multiprocessing with configurable worker count
  • LRU cache (2000 entries) for loaded JSON files
  • Redis pipeline batching for efficient cache operations
  • Progress bar shows processing status and time estimates
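
The JSON-file cache can be illustrated with `functools.lru_cache`, sized to match the figure above (an illustrative loader, not the script's actual code):

```python
import functools
import json
import zipfile

@functools.lru_cache(maxsize=2000)  # cache size matches the figure above
def load_jsonld(zip_path, member):
    """Load one JSON member from a ZIP archive, caching by (path, member)
    so repeated lookups of the same file skip decompression and parsing."""
    with zipfile.ZipFile(zip_path) as archive:
        return json.loads(archive.read(member))
```

Because `lru_cache` keys on the function arguments, `zip_path` is passed as a string (paths must be hashable), and a second lookup of the same member is served from memory.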