
Preprocessing

The preprocessing script filters and deduplicates CSV files before they enter the main Meta pipeline. This step removes duplicate rows and discards rows whose identifiers already exist in the database.

  1. Filters out IDs already present in Redis or a SPARQL triplestore
  2. Removes duplicates across all input files
  3. Splits output into smaller, manageable chunks
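The three steps above can be sketched in plain Python. This is an illustration only: `preprocess` and `existing_ids` are hypothetical names, and rows are modeled as dicts with a space-separated `id` column as in Meta input CSVs.

```python
def preprocess(rows, existing_ids, rows_per_file=3000):
    """Sketch of the three steps; not the script's actual API."""
    seen = set()
    kept = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:
            continue                  # step 2: drop duplicate rows
        seen.add(key)
        ids = row["id"].split()
        if ids and all(i in existing_ids for i in ids):
            continue                  # step 1: every ID is already known
        kept.append(row)
    # step 3: split the surviving rows into chunks of rows_per_file each
    return [kept[i:i + rows_per_file] for i in range(0, len(kept), rows_per_file)]
```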

The script supports two backends for checking whether identifiers already exist.

The default mode checks IDs against a Redis database that maps external identifiers to OMIDs. This database is generated by the meta2redis.py script from the OpenCitations Index repository.

The meta2redis.py script populates three Redis databases, but this preprocessing script only uses database 10 (db_br), which contains bibliographic resource identifiers (DOI, ISBN, PMID, etc.).

Each identifier is stored as a Redis set (using SADD), with the identifier as key and the corresponding OMIDs as set members:

doi:10.1234/example -> {omid:br/0601234}
issn:1234-5678 -> {omid:br/0605678}

The preprocessing script uses EXISTS to check if an identifier is present in the database. Redis checks are parallelized per-file across multiple workers.
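As a rough illustration of the check, the helper below (a hypothetical name, not the script's actual code) accepts any `client` exposing the redis-py `pipeline`/`exists` interface; pipelining keeps a whole batch of EXISTS calls to a single round trip.

```python
def filter_unknown(identifiers, client):
    """Return the identifiers that have no key in the db_br database.
    Hypothetical helper; `client` follows the redis-py interface."""
    pipe = client.pipeline()
    for ident in identifiers:
        pipe.exists(ident)  # EXISTS: 1 if the key is present, else 0
    flags = pipe.execute()
    return [i for i, found in zip(identifiers, flags) if not found]

# Against a real server (redis-py, database 10 = db_br):
#   import redis
#   client = redis.Redis(port=6379, db=10, decode_responses=True)
#   fresh = filter_unknown(["doi:10.1234/example"], client)
```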

When a --sparql-endpoint is provided, the script queries a SPARQL triplestore (QLever) instead of Redis. This mode:

  1. Reads all CSV files in parallel
  2. Collects all unique identifiers across files
  3. Batches identifiers into SPARQL VALUES queries (batch size: 30)
  4. Executes queries in parallel using ProcessPoolExecutor (groups of 100 queries per worker)
  5. Filters out rows where all identifiers already exist

This approach minimizes network roundtrips by querying all unique identifiers at once, rather than one at a time.
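A minimal sketch of the batching step is shown below. The triple pattern is an assumption for illustration (it is not taken from the script itself); only the batch size of 30 comes from the description above.

```python
def batch_values_queries(identifiers, batch_size=30):
    """Group identifiers into SPARQL VALUES blocks of batch_size each.
    The graph pattern is a placeholder, not the script's actual query."""
    queries = []
    for start in range(0, len(identifiers), batch_size):
        batch = identifiers[start:start + batch_size]
        values = " ".join(f'"{ident}"' for ident in batch)
        queries.append(
            "SELECT ?res WHERE { "
            f"VALUES ?literal {{ {values} }} "
            "?id <http://www.essepuntato.it/2010/06/literalreification/hasLiteralValue> ?literal . "
            "?res <http://purl.org/spar/datacite/hasIdentifier> ?id . }"
        )
    return queries
```

Each query resolves up to 30 identifiers in one round trip; the resulting query list can then be distributed across workers in groups of 100, as described above.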

With Redis:

uv run python -m oc_meta.run.meta.preprocess_input <INPUT_DIR> <OUTPUT_DIR> --redis-port <PORT>

With SPARQL:

uv run python -m oc_meta.run.meta.preprocess_input <INPUT_DIR> <OUTPUT_DIR> \
--sparql-endpoint "http://localhost:8805?access-token=TOKEN"

Redis mode:

uv run python -m oc_meta.run.meta.preprocess_input input/ output/ \
--redis-port 6379 \
--rows-per-file 5000 \
--redis-host 192.168.1.100 \
--redis-db 10 \
--workers 8

SPARQL mode:

uv run python -m oc_meta.run.meta.preprocess_input input/ output/ \
--sparql-endpoint "http://localhost:8805?access-token=TOKEN" \
--rows-per-file 5000 \
--workers 24
| Option | Required | Default | Description |
|---|---|---|---|
| input_dir | Yes | - | Directory containing input CSV files |
| output | Yes | - | Output path: directory for split files, or path ending in .csv for single file |
| --redis-port | Yes (Redis mode) | - | Redis port |
| --sparql-endpoint | Yes (SPARQL mode) | - | SPARQL endpoint URL (alternative to Redis) |
| --rows-per-file | No | 3000 | Split output into files of N rows each |
| --single-file | No | - | Write all output rows to a single CSV file |
| --redis-host | No | localhost | Redis hostname |
| --redis-db | No | 10 | Redis database number |
| --workers | No | 4 | Number of parallel workers |

Either --redis-port or --sparql-endpoint must be provided. When --sparql-endpoint is set, Redis options are ignored.

--rows-per-file and --single-file are mutually exclusive. With --single-file, the output path determines the file name: if it ends in .csv, that exact path is used; otherwise, a merged.csv file is created in the given directory.
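The naming rule for --single-file can be expressed in a couple of lines (with `resolve_single_file_target` a hypothetical helper name):

```python
from pathlib import Path

def resolve_single_file_target(output):
    """With --single-file: use the path as-is if it ends in .csv,
    otherwise write merged.csv inside the given directory."""
    path = Path(output)
    return path if path.suffix == ".csv" else path / "merged.csv"
```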

During execution, the script shows a progress bar with:

  • Current phase being processed
  • Progress percentage and count (e.g., 50% 5/10)
  • Elapsed time
  • Estimated time remaining

When finished, the script prints a summary table:

Processing Report
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓
┃ Metric                        ┃ Value  ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩
│ Total input files processed   │ 10     │
│ Total input rows              │ 150000 │
│ Rows discarded (duplicates)   │ 12500  │
│ Rows discarded (existing IDs) │ 8200   │
│ Rows written to output        │ 129300 │
│                               │        │
│ Duplicate rows %              │ 8.3%   │
│ Existing IDs %                │ 5.5%   │
│ Processed rows %              │ 86.2%  │
└───────────────────────────────┴────────┘