Processing
The main Meta process reads CSV files, curates the data, generates RDF, and uploads to a triplestore.
Running Meta
    uv run python -m oc_meta.run.meta_process -c meta_config.yaml

What happens during processing
1. Preparation
- Creates output directories (info_dir, output_csv_dir, output_rdf_dir)
- Initializes Redis connection for OMID counter handling
- Generates time_agnostic_library_config.json for provenance queries (if it doesn’t exist)
- Loads list of already processed files from cache to skip them
2. Data curation
The Curator processes each CSV row:
- Parses identifiers from the id column and validates their syntax (DOI regex, ORCID checksum, ISSN checksum, etc.)
- Normalizes metadata: title casing, date format standardization, author name parsing
- Uses ResourceFinder to query the triplestore and check if entities already exist
- Builds in-memory graphs (everything_everywhere_allatonce) with data from existing entities
- Outputs a curated CSV file with normalized data and assigned OMIDs
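The syntax checks mentioned above can be sketched as follows. This is a minimal illustration, not Meta's own validation code: the function names are hypothetical, the DOI pattern is deliberately simplified, and only the checksum arithmetic (ISO 7064 MOD 11-2 for ORCID, the weighted modulo 11 scheme for ISSN) follows the published identifier rules.

```python
import re

# Simplified, illustrative DOI pattern (an assumption, not Meta's actual regex).
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def orcid_check_char(base_digits: str) -> str:
    """Compute the ORCID check character (ISO 7064 MOD 11-2) for 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Check the shape and the final check character of an ORCID."""
    compact = orcid.replace("-", "").upper()
    if not re.fullmatch(r"\d{15}[\dX]", compact):
        return False
    return orcid_check_char(compact[:15]) == compact[15]


def is_valid_issn(issn: str) -> bool:
    """Check the shape and the final check character of an ISSN."""
    compact = issn.replace("-", "").upper()
    if not re.fullmatch(r"\d{7}[\dX]", compact):
        return False
    # The first seven digits are weighted 8, 7, ..., 2.
    total = sum(int(d) * w for d, w in zip(compact[:7], range(8, 1, -1)))
    check = (11 - total % 11) % 11
    return compact[7] == ("X" if check == 10 else str(check))
```

A row whose identifier fails such a check can be rejected before any triplestore lookup, which keeps malformed identifiers out of the curated CSV.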
3. Index building
During curation, Meta builds indexes that map:
- index_id_ra: Identifiers → Responsible agent OMIDs
- index_id_br: Identifiers → Bibliographic resource OMIDs
- re_index: Resource embodiment data
- ar_index: Agent role sequences (author/editor chains)
- VolIss: Volume/issue structure for venues
These indexes avoid repeated SPARQL queries during RDF creation.
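To illustrate why such indexes save queries, here is a small sketch using plain dictionaries. The shapes, the helper names (register, lookup) and the example identifier and OMID strings are assumptions made for illustration; the real index structures in Meta may differ.

```python
# Hypothetical, simplified index shapes: identifier string -> OMID string.
index_id_br: dict[str, str] = {}
index_id_ra: dict[str, str] = {}


def register(index: dict[str, str], identifier: str, omid: str) -> str:
    """Record the OMID for an identifier, keeping the first one seen."""
    return index.setdefault(identifier, omid)


def lookup(index: dict[str, str], identifier: str):
    """Return the known OMID, or None if the identifier was never seen."""
    return index.get(identifier)


# Illustrative values only.
register(index_id_br, "doi:10.1000/xyz123", "omid:br/0601")
register(index_id_ra, "orcid:0000-0002-1825-0097", "omid:ra/0601")
```

During RDF creation, a lookup in memory replaces what would otherwise be one SPARQL query per identifier.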
4. RDF creation
The Creator generates RDF using GraphSet:
- Bibliographic resources (BR): Articles, books, journals, proceedings, etc.
- Responsible agents (RA): Authors, editors, publishers (persons or organizations)
- Identifiers (ID): DOIs, ORCIDs, ISSNs, ISBNs, etc.
- Agent roles (AR): Proxy entities linking BR to RA, with role type and sequence (hasNext chain)
- Resource embodiments (RE): Page ranges
After entity creation, ProvSet generates provenance snapshots tracking creation time, responsible agent, and primary source.
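The agent role chain is the least obvious part of this model, so a sketch may help. It builds triples as plain tuples rather than through GraphSet, and the function name, the short entity labels, and the prefixed predicate names (pro:isDocumentContextFor, pro:withRole, pro:isHeldBy, oco:hasNext) are assumptions based on the OpenCitations Data Model, not code taken from Meta.

```python
def build_agent_role_chain(br, agents, role="pro:author", ar_prefix="ar/"):
    """Link a bibliographic resource to its agents through ordered agent roles.

    Returns a list of (subject, predicate, object) tuples. Each agent gets
    one agent role (AR); consecutive ARs are linked with oco:hasNext so the
    author or editor order can be reconstructed.
    """
    triples = []
    previous_ar = None
    for position, ra in enumerate(agents, start=1):
        ar = f"{ar_prefix}{position}"  # hypothetical AR label
        triples.append((br, "pro:isDocumentContextFor", ar))
        triples.append((ar, "pro:withRole", role))
        triples.append((ar, "pro:isHeldBy", ra))
        if previous_ar is not None:
            triples.append((previous_ar, "oco:hasNext", ar))
        previous_ar = ar
    return triples


chain = build_agent_role_chain("br/1", ["ra/1", "ra/2", "ra/3"])
```

The last agent role has no hasNext triple, which marks the end of the sequence.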
5. Storage
Meta runs four parallel processes using multiprocessing:
- Data RDF storage: Writes data entities to JSON-LD files (if generate_rdf_files: true)
- Provenance RDF storage: Writes provenance to JSON-LD files
- Data SPARQL generation: Generates SPARQL UPDATE queries for data triplestore
- Provenance SPARQL generation: Generates SPARQL UPDATE queries for provenance triplestore
After query generation, Meta uploads SPARQL queries to both triplestores using piccione.upload_sparql_updates.
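As a rough picture of what SPARQL generation involves, the sketch below groups quadruples into fixed-size batches and emits one INSERT DATA query per batch. The function names, the default batch size and the assumption that terms arrive already serialized (for example as IRIs in angle brackets) are illustrative; this is not the query builder Meta actually uses.

```python
def batched(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_insert_queries(quads, batch_size=10):
    """Turn (subject, predicate, object, graph) quads into INSERT DATA queries.

    Every term is expected to be serialized already, e.g. "<https://example.org/s>".
    """
    queries = []
    for batch in batched(quads, batch_size):
        by_graph = {}
        for s, p, o, g in batch:
            by_graph.setdefault(g, []).append(f"{s} {p} {o} .")
        blocks = [
            "GRAPH " + g + " { " + " ".join(triples) + " }"
            for g, triples in by_graph.items()
        ]
        queries.append("INSERT DATA { " + " ".join(blocks) + " }")
    return queries
```

Smaller batches produce more queries but keep each request short, which helps when a triplestore rejects or times out on very large updates.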
Multiprocessing
Meta processes CSV files sequentially (one at a time), but uses parallel processes within each file for I/O operations. This design is hard-coded for stability reasons: Virtuoso does not handle parallel SPARQL queries well.
For each file, Meta spawns up to 4 parallel processes:
- 2 for RDF file storage (data + provenance)
- 2 for SPARQL query generation (data + provenance)
Manual upload
If the automatic upload fails mid-process (connection timeout, triplestore restart, etc.), you can retry with the manual upload script:

    uv run python -m oc_meta.run.upload.on_triplestore <ENDPOINT_URL> <SPARQL_FOLDER>

Options:
| Option | Default | Description |
|---|---|---|
| --batch_size | 10 | Quadruples per batch |
| --cache_file | ts_upload_cache.json | Track processed files |
| --failed_file | failed_queries.txt | Log failed queries |
| --stop_file | .stop_upload | Touch this file to stop gracefully |
The script tracks progress in the cache file, so you can restart without reprocessing completed files.
Error handling
Meta tracks progress in two files inside base_output_dir:
- cache.txt: Lists successfully processed CSV files. On restart, Meta skips files already in this list.
- errors.txt: Logs failed files with their error messages (filename + traceback).
If a file fails, Meta logs the error and continues with the next file. At the end of a complete run, cache.txt is renamed with a timestamp (e.g., cache_2024-01-15T10_30_00.txt).
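The skip, log and continue behaviour described here can be sketched as a small loop. The function process_all and its process_file callback are hypothetical stand-ins for Meta's per-file processing, and the exact layout of errors.txt is an assumption; only the file names and the timestamp pattern come from the description above.

```python
import traceback
from datetime import datetime
from pathlib import Path


def process_all(csv_files, process_file, base_output_dir):
    """Process files in order, skipping cached ones and logging failures."""
    base = Path(base_output_dir)
    cache_path = base / "cache.txt"
    errors_path = base / "errors.txt"
    processed = set(cache_path.read_text().splitlines()) if cache_path.exists() else set()

    for name in csv_files:
        if name in processed:
            continue  # already handled in a previous run
        try:
            process_file(name)
        except Exception:
            # Record the filename and traceback, then move on to the next file.
            with errors_path.open("a") as fh:
                fh.write(f"{name}\n{traceback.format_exc()}\n")
            continue
        with cache_path.open("a") as fh:
            fh.write(name + "\n")

    # A complete run archives the cache under a timestamped name.
    if cache_path.exists():
        stamp = datetime.now().strftime("%Y-%m-%dT%H_%M_%S")
        cache_path.rename(base / f"cache_{stamp}.txt")
```

Because the cache is appended after each successful file, an interrupted run loses at most the file that was in progress.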
Output files
When generate_rdf_files: true:

    output/
    ├── br/                  # Bibliographic resources
    │   └── 060/             # Supplier prefix
    │       └── 10000/       # Entities 1-10000 (dir_split_number)
    │           ├── 1000.zip # Entities 1-1000 (items_per_file)
    │           ├── 2000.zip # Entities 1001-2000
    │           └── ...
    ├── ra/                  # Responsible agents
    ├── id/                  # Identifiers
    ├── ar/                  # Agent roles
    ├── re/                  # Resource embodiments
    └── prov/                # Provenance graphs

The directory structure is determined by dir_split_number (entities per subdirectory) and items_per_file (entities per JSON file). For example, with dir_split_number: 10000 and items_per_file: 1000, entity br/060/15234 is stored in br/060/20000/16000.zip.
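The mapping from an entity number to its file can be reproduced with integer arithmetic. The helper below is a sketch with a hypothetical name (storage_path); it assumes, as the example above suggests, that each directory and file is named after the upper bound of the range it holds.

```python
def storage_path(short_name, prefix, number, dir_split_number=10000, items_per_file=1000):
    """Return the relative path of the archive holding entity `number`."""
    # Round the entity number up to the next multiple of each split size.
    dir_upper = ((number - 1) // dir_split_number + 1) * dir_split_number
    file_upper = ((number - 1) // items_per_file + 1) * items_per_file
    return f"{short_name}/{prefix}/{dir_upper}/{file_upper}.zip"


print(storage_path("br", "060", 15234))  # br/060/20000/16000.zip
```

Subtracting one before the division keeps boundary entities in the lower bucket, so entity 1000 lands in 1000.zip and entity 1001 in 2000.zip.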