Processing

The main Meta process reads CSV files, curates the data, generates RDF, and uploads to a triplestore.

```sh
uv run python -m oc_meta.run.meta_process -c meta_config.yaml
```

On startup, this command:
  • Creates output directories (info_dir, output_csv_dir, output_rdf_dir)
  • Initializes Redis connection for OMID counter handling
  • Generates time_agnostic_library_config.json for provenance queries (if it doesn’t exist)
  • Loads list of already processed files from cache to skip them
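
A minimal `meta_config.yaml` might look like the following sketch. The keys shown in the first group appear on this page; the endpoint and Redis key names are assumptions for illustration only:

```yaml
# Sketch only — key names in the last three lines are assumptions,
# not confirmed oc_meta configuration keys.
base_output_dir: ./output
info_dir: ./output/info_dir
output_csv_dir: ./output/csv
output_rdf_dir: ./output/rdf
generate_rdf_files: true
dir_split_number: 10000
items_per_file: 1000
triplestore_url: http://localhost:8890/sparql   # assumed key name
redis_host: localhost                           # assumed key name
redis_port: 6379                                # assumed key name
```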

The Curator processes each CSV row:

  • Parses identifiers from the id column and validates their syntax (DOI regex, ORCID checksum, ISSN checksum, etc.)
  • Normalizes metadata: title casing, date format standardization, author name parsing
  • Uses ResourceFinder to query the triplestore and check if entities already exist
  • Builds an in-memory graph (internally named everything_everywhere_allatonce) holding data from existing entities
  • Outputs a curated CSV file with normalized data and assigned OMIDs
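
The actual validators live inside oc_meta, but the ISSN check the Curator performs (syntax plus mod-11 checksum) can be sketched in isolation; the function name here is ours:

```python
import re

def valid_issn(issn: str) -> bool:
    """Validate an ISSN: NNNN-NNNC syntax, then the mod-11 checksum.

    The first seven digits are weighted 8..2; the check character is
    (11 - sum mod 11) mod 11, written as 'X' when it equals 10.
    """
    issn = issn.strip().upper()
    if not re.fullmatch(r"\d{4}-\d{3}[\dX]", issn):
        return False
    digits = issn.replace("-", "")
    total = sum(int(d) * w for d, w in zip(digits[:7], range(8, 1, -1)))
    check = (11 - total % 11) % 11
    return digits[7] == ("X" if check == 10 else str(check))
```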

During curation, Meta builds indexes that map:

  • index_id_ra: Identifiers → Responsible agent OMIDs
  • index_id_br: Identifiers → Bibliographic resource OMIDs
  • re_index: Resource embodiment data
  • ar_index: Agent role sequences (author/editor chains)
  • VolIss: Volume/issue structure for venues

These indexes avoid repeated SPARQL queries during RDF creation.
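
Conceptually, each index is a lookup table consulted before falling back to SPARQL. A toy sketch with placeholder identifiers and OMIDs (the real indexes inside oc_meta are richer objects):

```python
# Toy sketch: all identifiers and OMIDs below are placeholders.
index_id_br = {"doi:10.1234/example": "omid:br/0601"}          # identifier -> BR OMID
index_id_ra = {"orcid:0000-0000-0000-0000": "omid:ra/0601"}    # identifier -> RA OMID

def resolve_br(identifier: str):
    """Hit the in-memory index first; only a miss would need a SPARQL query."""
    return index_id_br.get(identifier)
```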

The Creator generates RDF using GraphSet:

  • Bibliographic resources (BR): Articles, books, journals, proceedings, etc.
  • Responsible agents (RA): Authors, editors, publishers (persons or organizations)
  • Identifiers (ID): DOIs, ORCIDs, ISSNs, ISBNs, etc.
  • Agent roles (AR): Proxy entities linking BR to RA, with role type and sequence (hasNext chain)
  • Resource embodiments (RE): Page ranges

After entity creation, ProvSet generates provenance snapshots tracking creation time, responsible agent, and primary source.
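
A snapshot can be pictured as a record carrying PROV-O terms. The field set and the `.../prov/se/1` IRI pattern below are a sketch for illustration, not the exact ProvSet output:

```python
from datetime import datetime, timezone

def make_snapshot(entity_iri: str, agent: str, source: str) -> dict:
    """Sketch of what a provenance snapshot tracks (PROV-O vocabulary).

    The snapshot IRI pattern is an assumption for illustration.
    """
    return {
        "@id": f"{entity_iri}/prov/se/1",
        "prov:generatedAtTime": datetime.now(timezone.utc).isoformat(),
        "prov:wasAttributedTo": agent,
        "prov:hadPrimarySource": source,
    }
```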

Meta runs four parallel processes using multiprocessing:

  1. Data RDF storage: Writes data entities to JSON-LD files (if generate_rdf_files: true)
  2. Provenance RDF storage: Writes provenance to JSON-LD files
  3. Data SPARQL generation: Generates SPARQL UPDATE queries for data triplestore
  4. Provenance SPARQL generation: Generates SPARQL UPDATE queries for provenance triplestore
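
The fan-out can be sketched with the standard multiprocessing module; the worker bodies here are placeholders for the real storage and query-generation work:

```python
from multiprocessing import Process, Queue

def worker(task_name: str, done: Queue) -> None:
    # Placeholder body: the real workers write JSON-LD files
    # or generate SPARQL UPDATE queries.
    done.put(task_name)

def run_file_tasks() -> set:
    """Fan out the four per-file tasks in parallel and wait for all of them."""
    tasks = ["data_rdf", "prov_rdf", "data_sparql", "prov_sparql"]
    done = Queue()
    procs = [Process(target=worker, args=(t, done)) for t in tasks]
    for p in procs:
        p.start()
    for p in procs:
        p.join()  # wait for all four before uploading
    return {done.get() for _ in tasks}
```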

After query generation, Meta uploads SPARQL queries to both triplestores using piccione.upload_sparql_updates.

Meta processes CSV files sequentially (one at a time), but parallelizes I/O operations within each file. Sequential file processing is hard-coded rather than configurable, for stability: Virtuoso does not handle concurrent SPARQL updates reliably.

For each file, Meta spawns up to 4 parallel processes:

  • 2 for RDF file storage (data + provenance)
  • 2 for SPARQL query generation (data + provenance)

If the automatic upload fails mid-process (connection timeout, triplestore restart, etc.), you can retry with the manual upload script:

```sh
uv run python -m oc_meta.run.upload.on_triplestore <ENDPOINT_URL> <SPARQL_FOLDER>
```

Options:

  • --batch_size (default: 10): quadruples per batch
  • --cache_file (default: ts_upload_cache.json): tracks processed files
  • --failed_file (default: failed_queries.txt): logs failed queries
  • --stop_file (default: .stop_upload): touch this file to stop gracefully

The script tracks progress in the cache file, so you can restart without reprocessing completed files.
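
The resume behaviour can be sketched as follows; the cache format (a JSON list of filenames) and the `*.sparql` file extension are assumptions for illustration:

```python
import json
from pathlib import Path

def pending_queries(sparql_dir: str, cache_file: str) -> list:
    """Return .sparql files not yet recorded in the upload cache.

    Assumes the cache is a JSON list of already-processed filenames.
    """
    done = set()
    cache = Path(cache_file)
    if cache.exists():
        done = set(json.loads(cache.read_text()))
    return sorted(p.name for p in Path(sparql_dir).glob("*.sparql")
                  if p.name not in done)
```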

Meta tracks progress in two files inside base_output_dir:

  • cache.txt: Lists successfully processed CSV files. On restart, Meta skips files already in this list.
  • errors.txt: Logs failed files with their error messages (filename + traceback).

If a file fails, Meta logs the error and continues with the next file. At the end of a complete run, cache.txt is renamed with a timestamp (e.g., cache_2024-01-15T10_30_00.txt).

When generate_rdf_files: true:

output/
├── br/                      # Bibliographic resources
│   └── 060/                 # Supplier prefix
│       └── 10000/           # Entities 1-10000 (dir_split_number)
│           ├── 1000.zip     # Entities 1-1000 (items_per_file)
│           ├── 2000.zip     # Entities 1001-2000
│           └── ...
├── ra/                      # Responsible agents
├── id/                      # Identifiers
├── ar/                      # Agent roles
├── re/                      # Resource embodiments
└── prov/                    # Provenance graphs

The directory structure is determined by dir_split_number (entities per subdirectory) and items_per_file (entities per JSON file). For example, with dir_split_number: 10000 and items_per_file: 1000, entity br/060/15234 is stored in br/060/20000/16000.zip.
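
The mapping from entity number to path can be written out directly. Directories and files are labelled by the upper bound of the range they contain; the function name below is ours:

```python
import math

def rdf_file_path(entity_number: int, supplier_prefix: str = "060",
                  dir_split_number: int = 10000,
                  items_per_file: int = 1000) -> str:
    """Compute where a br entity's JSON-LD archive lives.

    Both path components round the entity number up to the nearest
    multiple of their split size (the range's upper bound).
    """
    dir_part = math.ceil(entity_number / dir_split_number) * dir_split_number
    file_part = math.ceil(entity_number / items_per_file) * items_per_file
    return f"br/{supplier_prefix}/{dir_part}/{file_part}.zip"
```

For entity 15234 with the defaults this yields br/060/20000/16000.zip, matching the worked example above.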