Processing

Processing#

The main Meta process reads CSV files, curates the data, generates RDF, and uploads to a triplestore.

Running Meta#

uv run python -m oc_meta.run.meta_process -c meta_config.yaml

What happens during processing#

1. Preparation#

Creates output directories (info_dir, output_csv_dir, output_rdf_dir)
Initializes filesystem-based counter handling via FilesystemCounterHandler (counters stored in info_dir)
Generates time_agnostic_library_config.json for provenance queries (if it doesn’t exist)
Loads list of already processed files from cache to skip them

2. Data curation#

The Curator processes each CSV row:

Parses identifiers from the id column and validates their syntax (DOI regex, ORCID checksum, ISSN checksum, etc.)
Normalizes metadata: title casing, date format standardization, author name parsing
Uses ResourceFinder to query the triplestore and check if entities already exist
Builds in-memory dict indexes with data from existing entities
Outputs a curated CSV file with normalized data and assigned OMIDs

3. Index building#

During curation, Meta builds indexes that map:

index_id_ra: Identifiers → Responsible agent OMIDs
index_id_br: Identifiers → Bibliographic resource OMIDs
re_index: Resource embodiment data
ar_index: Agent role sequences (author/editor chains)
VolIss: Volume/issue structure for venues

These indexes avoid repeated SPARQL queries during RDF creation.

4. RDF creation#

The Creator generates RDF using GraphSet:

Bibliographic resources (BR): Articles, books, journals, proceedings, etc.
Responsible agents (RA): Authors, editors, publishers (persons or organizations)
Identifiers (ID): DOIs, ORCIDs, ISSNs, ISBNs, etc.
Agent roles (AR): Proxy entities linking BR to RA, with role type and sequence (hasNext chain)
Resource embodiments (RE): Page ranges

After entity creation, ProvSet generates provenance snapshots tracking creation time, responsible agent, and primary source.

5. Storage#

Meta runs four parallel processes using multiprocessing:

Data RDF storage: Writes data entities to JSON-LD files
Provenance RDF storage: Writes provenance to JSON-LD files
Data SPARQL generation: Generates SPARQL UPDATE queries for data triplestore
Provenance SPARQL generation: Generates SPARQL UPDATE queries for provenance triplestore

After query generation, Meta uploads SPARQL queries to both triplestores using piccione.upload_sparql_updates.

Multiprocessing#

Meta processes CSV files sequentially (one at a time), but uses parallel processes within each file for I/O operations. This design is hard-coded for stability reasons: Virtuoso does not handle parallel SPARQL queries well.

For each file, Meta spawns up to 4 parallel processes:

2 for RDF file storage (data + provenance)
2 for SPARQL query generation (data + provenance)

Manual upload#

If the automatic upload fails mid-process (connection timeout, triplestore restart, etc.), you can retry with the manual upload script:

uv run python -m oc_meta.run.upload.on_triplestore <ENDPOINT_URL> <SPARQL_FOLDER>

Options:

Option	Default	Description
`--batch_size`	10	Quadruples per batch
`--cache_file`	ts_upload_cache.json	Track processed files
`--failed_file`	failed_queries.txt	Log failed queries
`--stop_file`	.stop_upload	Touch this file to stop gracefully

The script tracks progress in the cache file, so you can restart without reprocessing completed files.

Error handling#

Meta tracks progress in two files inside base_output_dir:

cache.txt: Lists successfully processed CSV files. On restart, Meta skips files already in this list.
errors.txt: Logs failed files with their error messages (filename + traceback).

If a file fails, Meta logs the error and continues with the next file. At the end of a complete run, cache.txt is renamed with a timestamp (e.g., cache_2024-01-15T10_30_00.txt).

Output files#

RDF files are always generated. When rdf_files_only: false (default), SPARQL queries are also produced:

output/
├── br/                      # Bibliographic resources
│   └── 060/                 # Supplier prefix
│       └── 10000/           # Entities 1-10000 (dir_split_number)
│           ├── 1000.zip     # Entities 1-1000 (items_per_file)
│           ├── 2000.zip     # Entities 1001-2000
│           └── ...
├── ra/                      # Responsible agents
├── id/                      # Identifiers
├── ar/                      # Agent roles
├── re/                      # Resource embodiments
└── prov/                    # Provenance graphs

The directory structure is determined by dir_split_number (entities per subdirectory) and items_per_file (entities per JSON file). For example, with dir_split_number: 10000 and items_per_file: 1000, entity br/060/15234 is stored in br/060/20000/16000.zip.