Preprocessing#

The preprocessing script filters and deduplicates CSV files before they enter the main Meta pipeline. It removes duplicate rows and filters out IDs that already exist in the database, so the pipeline only processes new data.

What preprocessing does#

Filters out IDs that already exist in Redis or in a SPARQL triplestore
Removes duplicates across all input files
Splits output into smaller, manageable chunks
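
The deduplication and splitting steps can be sketched as follows. This is a minimal illustration, not the script's actual code: the function names `deduplicate` and `chunked` are hypothetical, and it assumes that two rows count as duplicates only when every column matches.

```python
from typing import Iterable, Iterator, List


def deduplicate(rows: Iterable[dict]) -> Iterator[dict]:
    """Yield each distinct row once, preserving first-seen order.

    Assumption: rows are duplicates when all their columns are equal.
    """
    seen = set()
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            yield row


def chunked(rows: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Group rows into lists of at most `size` rows each."""
    chunk: List[dict] = []
    for row in rows:
        chunk.append(row)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:  # emit the final, possibly shorter, chunk
        yield chunk


rows = [{"id": "doi:10.1/a"}, {"id": "doi:10.1/b"}, {"id": "doi:10.1/a"}]
unique_rows = list(deduplicate(rows))
chunks = list(chunked(unique_rows, 1))
```

Here `unique_rows` holds two rows and `chunks` holds two one-row lists; each chunk would correspond to one output file.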

Storage backends#

The script supports two backends for checking whether identifiers already exist.

Redis#

The default mode checks IDs against a Redis database that maps external identifiers to OMIDs. This database is generated by the meta2redis.py script from the OpenCitations Index repository.

The meta2redis.py script populates three Redis databases, but this preprocessing script only uses database 10 (db_br), which contains bibliographic resource identifiers (DOI, ISBN, PMID, etc.).

Each identifier is stored as a Redis set (using SADD), with the identifier as key and the corresponding OMIDs as set members:

doi:10.1234/example -> {omid:br/0601234}
issn:1234-5678 -> {omid:br/0605678}

The preprocessing script uses EXISTS to check whether an identifier is present in the database. Redis checks are parallelized per file across multiple workers.
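
As a rough sketch of the EXISTS check, the function below keeps only the identifiers that have no key in the database. The helper name `find_new_identifiers` and the `FakeRedis` stand-in are hypothetical and exist only so the example runs without a server; the one assumption about the client is that it exposes an `exists(key)` method returning the number of matching keys, as redis-py does.

```python
from typing import Iterable, List


def find_new_identifiers(client, identifiers: Iterable[str]) -> List[str]:
    """Return identifiers that are not present as keys in the database.

    `client` is any object with an `exists(key)` method that returns
    how many of the given keys exist (0 or 1 for a single key).
    """
    return [ident for ident in identifiers if client.exists(ident) == 0]


class FakeRedis:
    """In-memory stand-in for a Redis client (illustration only)."""

    def __init__(self, data: dict):
        self._data = data

    def exists(self, key: str) -> int:
        return 1 if key in self._data else 0


db = FakeRedis({"doi:10.1234/example": {"omid:br/0601234"}})
print(find_new_identifiers(db, ["doi:10.1234/example", "doi:10.9999/new"]))
# prints ['doi:10.9999/new']
```

With a real server, a redis-py client such as `redis.Redis(host="localhost", port=6379, db=10)` could presumably be passed in place of `FakeRedis`.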

SPARQL#

When a --sparql-endpoint is provided, the script queries a SPARQL triplestore (QLever) instead of Redis. This mode:

Reads all CSV files in parallel
Collects all unique identifiers across files
Batches identifiers into SPARQL VALUES queries (batch size: 30)
Executes queries in parallel using ProcessPoolExecutor (groups of 100 queries per worker)
Filters out rows where all identifiers already exist

This approach minimizes network roundtrips: identifiers are deduplicated across all files first and then queried in batches, rather than one at a time.
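
The batching described above could look roughly like this. The batch size (30) and the group size (100 queries per worker) come from the text; everything else is an assumption made for illustration, in particular the helper names and the query shape, which uses a placeholder predicate `<http://example.org/hasLiteralValue>` rather than the one the script actually queries.

```python
from typing import Iterator, List

BATCH_SIZE = 30          # identifiers per VALUES query
QUERIES_PER_TASK = 100   # queries grouped into one worker task


def batched(items: List, size: int) -> Iterator[List]:
    """Split a list into consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_values_query(literals: List[str]) -> str:
    """Build one SPARQL query that checks a whole batch of literals.

    Hypothetical query shape: the predicate is a placeholder.
    """
    escaped = (v.replace("\\", "\\\\").replace('"', '\\"') for v in literals)
    values = " ".join(f'"{v}"' for v in escaped)
    return (
        "SELECT ?value WHERE { "
        f"VALUES ?value {{ {values} }} "
        "?id <http://example.org/hasLiteralValue> ?value . }"
    )


# 65 unique identifiers -> 3 queries (30 + 30 + 5) -> 1 worker task
unique_ids = sorted({f"10.1234/{n}" for n in range(65)})
queries = [build_values_query(batch) for batch in batched(unique_ids, BATCH_SIZE)]
tasks = list(batched(queries, QUERIES_PER_TASK))
```

Each entry of `tasks` could then be submitted to a `concurrent.futures.ProcessPoolExecutor`, which is omitted here to keep the sketch independent of a running endpoint.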

Basic usage#

With Redis:

uv run python -m oc_meta.run.meta.preprocess_input <INPUT_DIR> <OUTPUT_DIR> --redis-port <PORT>

With SPARQL:

uv run python -m oc_meta.run.meta.preprocess_input <INPUT_DIR> <OUTPUT_DIR> \
  --sparql-endpoint "http://localhost:8805?access-token=TOKEN"

Example with all options#

Redis mode:

uv run python -m oc_meta.run.meta.preprocess_input input/ output/ \
  --redis-port 6379 \
  --rows-per-file 5000 \
  --redis-host 192.168.1.100 \
  --redis-db 10 \
  --workers 8

SPARQL mode:

uv run python -m oc_meta.run.meta.preprocess_input input/ output/ \
  --sparql-endpoint "http://localhost:8805?access-token=TOKEN" \
  --rows-per-file 5000 \
  --workers 24

Options#

| Option | Required | Default | Description |
|---|---|---|---|
| `input_dir` | Yes | - | Directory containing input CSV files |
| `output` | Yes | - | Output path: directory for split files, or path ending in `.csv` for single file |
| `--redis-port` | Yes (Redis mode) | - | Redis port |
| `--sparql-endpoint` | Yes (SPARQL mode) | - | SPARQL endpoint URL (alternative to Redis) |
| `--rows-per-file` | No | 3000 | Split output into files of N rows each |
| `--single-file` | No | - | Write all output rows to a single CSV file |
| `--redis-host` | No | localhost | Redis hostname |
| `--redis-db` | No | 10 | Redis database number |
| `--workers` | No | 4 | Number of parallel workers |

Either --redis-port or --sparql-endpoint must be provided. When --sparql-endpoint is set, Redis options are ignored.

--rows-per-file and --single-file are mutually exclusive. With --single-file, the output path determines the file name: if it ends in .csv, that exact path is used; otherwise, a merged.csv file is created in the given directory.
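
The option rules above can be expressed with `argparse` along these lines. This is a sketch under stated assumptions, not the script's real parser: the flag names and defaults are taken from the options table, while `build_parser`, `parse_args`, and `single_file_path` are hypothetical helper names.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="preprocess_input")
    parser.add_argument("input_dir")
    parser.add_argument("output")
    parser.add_argument("--redis-port", type=int)
    parser.add_argument("--sparql-endpoint")
    parser.add_argument("--redis-host", default="localhost")
    parser.add_argument("--redis-db", type=int, default=10)
    parser.add_argument("--workers", type=int, default=4)
    # --rows-per-file and --single-file cannot be combined
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--rows-per-file", type=int, default=3000)
    group.add_argument("--single-file", action="store_true")
    return parser


def parse_args(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    # One of the two backends has to be selected
    if args.redis_port is None and args.sparql_endpoint is None:
        parser.error("either --redis-port or --sparql-endpoint must be provided")
    return args


def single_file_path(output: str) -> Path:
    """Resolve the file written in --single-file mode."""
    if output.endswith(".csv"):
        return Path(output)          # exact path given by the user
    return Path(output) / "merged.csv"  # default name inside the directory


args = parse_args(["input/", "output/", "--redis-port", "6379", "--single-file"])
target = single_file_path(args.output)
```

In this example `target` resolves to `output/merged.csv`; passing `output/final.csv` instead would use that exact path.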

Progress display#

During execution, the script shows a progress bar with:

Current phase being processed
Progress percentage and count (e.g., 50% 5/10)
Elapsed time
Estimated time remaining

Output report#

When finished, the script prints a summary table:

               Processing Report
┏━━━━━━━━━━━━━━��━━━━━━━━━━━━━━━┳━━━━━━━━━━┓
┃ Metric                       ┃ Value    ┃
┡━━━━━━━━━��━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━┩
│ Total input files processed  │ 10       │
│ Total input rows             │ 150000   │
│ Rows discarded (duplicates)  │ 12500    │
│ Rows discarded (existing IDs)│ 8200     │
│ Rows written to output       │ 129300   │
│                              │          │
│ Duplicate rows %             │ 8.3%     │
│ Existing IDs %               │ 5.5%     │
│ Processed rows %             │ 86.2%    │
└──────────────────────────────┴──────────┘
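
The percentages in the report follow from the row counts. The sketch below reproduces the sample figures under two assumptions that match those numbers but are not confirmed by the script itself: rows written equal total rows minus both kinds of discarded rows, and every percentage is relative to the total input rows. The function name `report_percentages` is hypothetical.

```python
def report_percentages(total_rows: int, duplicates: int, existing: int) -> dict:
    """Derive the summary metrics from the three raw counts."""
    written = total_rows - duplicates - existing

    def pct(count: int) -> str:
        # Guard against empty input to avoid division by zero
        if total_rows == 0:
            return "0.0%"
        return f"{count / total_rows * 100:.1f}%"

    return {
        "Rows written to output": written,
        "Duplicate rows %": pct(duplicates),
        "Existing IDs %": pct(existing),
        "Processed rows %": pct(written),
    }


summary = report_percentages(total_rows=150000, duplicates=12500, existing=8200)
```

With the counts from the sample table, `summary` contains 129300 rows written and the percentages 8.3%, 5.5%, and 86.2%.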