Processing#
The main Meta process reads CSV files, curates the data, generates RDF, and uploads the result to a triplestore.
Running Meta#
uv run python -m oc_meta.run.meta_process -c meta_config.yaml
What happens during processing#
1. Preparation#
Creates output directories (info_dir, output_csv_dir, output_rdf_dir)
Initializes filesystem-based counter handling via FilesystemCounterHandler (counters stored in info_dir)
Generates time_agnostic_library_config.json for provenance queries (if it doesn’t exist)
Loads the list of already processed files from the cache to skip them
2. Data curation#
The Curator processes each CSV row:
Parses identifiers from the id column and validates their syntax (DOI regex, ORCID checksum, ISSN checksum, etc.)
Normalizes metadata: title casing, date format standardization, author name parsing
Uses ResourceFinder to query the triplestore and check if entities already exist
Builds in-memory dict indexes with data from existing entities
Outputs a curated CSV file with normalized data and assigned OMIDs
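The syntax checks mentioned above can be sketched as follows. This is a minimal illustration, not Meta's own validation code: the function names and the simplified DOI pattern are assumptions, while the ORCID (ISO 7064 MOD 11-2) and ISSN (weighted modulo 11) check-digit rules are the standard published algorithms.

```python
import re

# Simplified DOI pattern, assumed for illustration only.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def is_valid_doi(doi: str) -> bool:
    """Syntax-only check: '10.' prefix, registrant code, slash, suffix."""
    return bool(DOI_PATTERN.match(doi))


def is_valid_orcid(orcid: str) -> bool:
    """Verify the ISO 7064 MOD 11-2 check digit of an ORCID iD."""
    chars = orcid.replace("-", "").upper()
    if not re.fullmatch(r"\d{15}[\dX]", chars):
        return False
    total = 0
    for digit in chars[:15]:
        total = (total + int(digit)) * 2
    result = (12 - total % 11) % 11
    expected = "X" if result == 10 else str(result)
    return chars[15] == expected


def is_valid_issn(issn: str) -> bool:
    """Verify the ISSN check digit (weights 8 down to 2, modulo 11)."""
    chars = issn.replace("-", "").upper()
    if not re.fullmatch(r"\d{7}[\dX]", chars):
        return False
    weighted = sum(int(d) * w for d, w in zip(chars[:7], range(8, 1, -1)))
    result = (11 - weighted % 11) % 11
    expected = "X" if result == 10 else str(result)
    return chars[7] == expected
```

A row whose identifier fails such a check can be rejected before any triplestore lookup is attempted, which keeps malformed values out of the indexes.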
3. Index building#
During curation, Meta builds indexes that map:
index_id_ra: Identifiers → Responsible agent OMIDs
index_id_br: Identifiers → Bibliographic resource OMIDs
re_index: Resource embodiment data
ar_index: Agent role sequences (author/editor chains)
VolIss: Volume/issue structure for venues
These indexes avoid repeated SPARQL queries during RDF creation.
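To illustrate why an in-memory index saves queries, here is a minimal sketch of an identifier-to-OMID lookup. The dict layout, the `lookup_omid` helper, the `query_triplestore` callback, and the sample identifier and OMID values are all hypothetical; Meta's actual index structures may differ.

```python
from typing import Callable, Dict, Optional


def lookup_omid(
    identifier: str,
    index: Dict[str, str],
    query_triplestore: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Return the OMID for an identifier, querying only on an index miss."""
    if identifier in index:
        return index[identifier]
    # Hypothetical fallback: ask the triplestore once, then remember the answer.
    omid = query_triplestore(identifier)
    if omid is not None:
        index[identifier] = omid
    return omid


# Usage with a fake query function that counts how often it is called.
calls = []


def fake_query(identifier: str) -> Optional[str]:
    calls.append(identifier)
    return "omid:br/0601" if identifier == "doi:10.1000/xyz123" else None


index_id_br: Dict[str, str] = {}
first = lookup_omid("doi:10.1000/xyz123", index_id_br, fake_query)
second = lookup_omid("doi:10.1000/xyz123", index_id_br, fake_query)
```

The second lookup is answered from the dict, so the fake query function is called only once.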
4. RDF creation#
The Creator generates RDF using GraphSet:
Bibliographic resources (BR): Articles, books, journals, proceedings, etc.
Responsible agents (RA): Authors, editors, publishers (persons or organizations)
Identifiers (ID): DOIs, ORCIDs, ISSNs, ISBNs, etc.
Agent roles (AR): Proxy entities linking BR to RA, with role type and sequence (hasNext chain)
Resource embodiments (RE): Page ranges
After entity creation, ProvSet generates provenance snapshots tracking creation time, responsible agent, and primary source.
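The hasNext chain of agent roles can be pictured as a singly linked list. The sketch below uses plain Python objects rather than GraphSet, so the `AgentRole` class, its fields, and the helper functions are illustrative assumptions and not the library's API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AgentRole:
    """Proxy entity linking a bibliographic resource to a responsible agent."""
    role_type: str                          # e.g. "author" or "editor"
    agent: str                              # reference to the responsible agent
    has_next: Optional["AgentRole"] = None  # next role in the sequence


def build_chain(agents: List[str], role_type: str) -> Optional[AgentRole]:
    """Link agents in order and return the head of the chain."""
    head: Optional[AgentRole] = None
    previous: Optional[AgentRole] = None
    for agent in agents:
        current = AgentRole(role_type, agent)
        if previous is None:
            head = current
        else:
            previous.has_next = current
        previous = current
    return head


def read_chain(head: Optional[AgentRole]) -> List[str]:
    """Follow has_next links to recover the original agent order."""
    ordered = []
    node = head
    while node is not None:
        ordered.append(node.agent)
        node = node.has_next
    return ordered
```

Storing the order as links between roles, rather than as positions on the agents themselves, lets the same agent appear at different positions in different resources.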
5. Storage#
Meta runs four parallel processes using multiprocessing:
Data RDF storage: Writes data entities to JSON-LD files
Provenance RDF storage: Writes provenance to JSON-LD files
Data SPARQL generation: Generates SPARQL UPDATE queries for data triplestore
Provenance SPARQL generation: Generates SPARQL UPDATE queries for provenance triplestore
After query generation, Meta uploads SPARQL queries to both triplestores using piccione.upload_sparql_updates.
Multiprocessing#
Meta processes CSV files sequentially (one at a time), but uses parallel processes within each file for I/O operations. This design is hard-coded for stability reasons: Virtuoso does not handle parallel SPARQL queries well.
For each file, Meta spawns up to 4 parallel processes:
2 for RDF file storage (data + provenance)
2 for SPARQL query generation (data + provenance)
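The four-process layout can be sketched with the standard `multiprocessing` module. The task functions below are placeholders that only write marker files; their names, arguments, and output file names are assumptions made for illustration and do not mirror Meta's internal functions.

```python
import multiprocessing
import tempfile
from pathlib import Path
from typing import Callable, List, Tuple

Task = Tuple[Callable[[str, str], None], Tuple[str, str]]


def store_rdf(kind: str, out_dir: str) -> None:
    """Placeholder for writing data or provenance RDF files."""
    Path(out_dir, f"{kind}_rdf.txt").write_text(f"{kind} RDF stored\n")


def generate_sparql(kind: str, out_dir: str) -> None:
    """Placeholder for generating SPARQL UPDATE queries."""
    Path(out_dir, f"{kind}_sparql.txt").write_text(f"{kind} queries generated\n")


def build_tasks(out_dir: str) -> List[Task]:
    """Two storage tasks and two query-generation tasks (data + provenance)."""
    return [
        (store_rdf, ("data", out_dir)),
        (store_rdf, ("prov", out_dir)),
        (generate_sparql, ("data", out_dir)),
        (generate_sparql, ("prov", out_dir)),
    ]


def run_in_parallel(tasks: List[Task]) -> List[int]:
    """Start one process per task, wait for all, and return the exit codes."""
    processes = [
        multiprocessing.Process(target=fn, args=args) for fn, args in tasks
    ]
    for process in processes:
        process.start()
    for process in processes:
        process.join()
    return [process.exitcode for process in processes]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        exit_codes = run_in_parallel(build_tasks(tmp))
        print(exit_codes, sorted(p.name for p in Path(tmp).iterdir()))
```

Because all four tasks are joined before the function returns, the next CSV file only starts once the current one is fully written, which matches the sequential file-by-file behavior described above.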
Manual upload#
If the automatic upload fails mid-process (connection timeout, triplestore restart, etc.), you can retry with the manual upload script:
uv run python -m oc_meta.run.upload.on_triplestore <ENDPOINT_URL> <SPARQL_FOLDER>
Options:
| Option | Default | Description |
|---|---|---|
| | 10 | Quadruples per batch |
| | ts_upload_cache.json | Track processed files |
| | failed_queries.txt | Log failed queries |
| | .stop_upload | Touch this file to stop gracefully |
The script tracks progress in the cache file, so you can restart without reprocessing completed files.
Error handling#
Meta tracks progress in two files inside base_output_dir:
cache.txt: Lists successfully processed CSV files. On restart, Meta skips files already in this list.
errors.txt: Logs failed files with their error messages (filename + traceback).
If a file fails, Meta logs the error and continues with the next file. At the end of a complete run, cache.txt is renamed with a timestamp (e.g., cache_2024-01-15T10_30_00.txt).
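The skip, log, and rename behavior described above can be sketched in a few lines. This is a simplified model, not Meta's code: the `process_all` function and its `process_file` callback are hypothetical, and only the file names and the timestamp pattern come from this page.

```python
import traceback
from datetime import datetime
from pathlib import Path
from typing import Callable, Iterable, Optional


def process_all(
    filenames: Iterable[str],
    process_file: Callable[[str], None],
    base_output_dir: str,
) -> Optional[Path]:
    """Process files, skipping cached ones and logging failures."""
    base = Path(base_output_dir)
    cache_path = base / "cache.txt"
    errors_path = base / "errors.txt"
    done = set()
    if cache_path.exists():
        done = set(cache_path.read_text(encoding="utf-8").splitlines())

    for name in filenames:
        if name in done:
            continue  # already processed in a previous run
        try:
            process_file(name)
        except Exception:
            # Record filename plus traceback, then move on to the next file.
            with errors_path.open("a", encoding="utf-8") as handle:
                handle.write(f"{name}\n{traceback.format_exc()}\n")
            continue
        with cache_path.open("a", encoding="utf-8") as handle:
            handle.write(name + "\n")

    # A complete run ends by archiving the cache under a timestamped name.
    if not cache_path.exists():
        return None
    stamp = datetime.now().strftime("%Y-%m-%dT%H_%M_%S")
    archived = base / f"cache_{stamp}.txt"
    cache_path.rename(archived)
    return archived
```

Appending to cache.txt only after a file succeeds means an interrupted run never marks a half-processed file as done.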
Output files#
RDF files are always generated. When rdf_files_only: false (default), SPARQL queries are also produced:
output/
├── br/ # Bibliographic resources
│ └── 060/ # Supplier prefix
│ └── 10000/ # Entities 1-10000 (dir_split_number)
│ ├── 1000.zip # Entities 1-1000 (items_per_file)
│ ├── 2000.zip # Entities 1001-2000
│ └── ...
├── ra/ # Responsible agents
├── id/ # Identifiers
├── ar/ # Agent roles
├── re/ # Resource embodiments
└── prov/ # Provenance graphs
The directory structure is determined by dir_split_number (entities per subdirectory) and items_per_file (entities per JSON file). For example, with dir_split_number: 10000 and items_per_file: 1000, entity br/060/15234 is stored in br/060/20000/16000.zip.
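The path arithmetic in this example amounts to rounding the entity number up to the next multiple of each setting. The helper below is a hypothetical sketch of that calculation, not the function Meta uses internally.

```python
def entity_path(
    short_name: str,
    supplier_prefix: str,
    number: int,
    dir_split_number: int = 10000,
    items_per_file: int = 1000,
) -> str:
    """Compute the storage path of an entity from its sequential number."""
    if number < 1:
        raise ValueError("entity numbers start at 1")

    def round_up(value: int, step: int) -> int:
        # Smallest multiple of step that is >= value.
        return ((value + step - 1) // step) * step

    directory = round_up(number, dir_split_number)
    file_stem = round_up(number, items_per_file)
    return f"{short_name}/{supplier_prefix}/{directory}/{file_stem}.zip"


path = entity_path("br", "060", 15234)
```

With the default settings, entity 15234 falls in the 10001-20000 directory and the 15001-16000 file, so `path` is `br/060/20000/16000.zip`.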