Preprocessing#

The preprocessing script filters and deduplicates CSV files before they enter the main Meta pipeline. It removes duplicate rows and filters out rows whose IDs already exist in the database.

What preprocessing does#

  1. Filters out rows whose IDs already exist in Redis or in a SPARQL triplestore

  2. Removes duplicates across all input files

  3. Splits output into smaller, manageable chunks
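The three steps can be sketched in plain Python. This is a minimal illustration and not the script's actual code: it assumes each row carries its identifiers space-separated in an `id` column, treats two rows as duplicates only when every column matches, and uses an in-memory set (`known_ids`) in place of Redis or SPARQL. The function names are hypothetical.

```python
from typing import Iterable, Iterator

Row = dict[str, str]


def filter_and_dedupe(rows: Iterable[Row], known_ids: set[str]) -> Iterator[Row]:
    """Yield rows that are neither fully known already nor exact duplicates."""
    seen: set[tuple[tuple[str, str], ...]] = set()
    for row in rows:
        ids = row.get("id", "").split()
        # Step 1: drop the row when every one of its identifiers is already known.
        if ids and all(identifier in known_ids for identifier in ids):
            continue
        # Step 2: drop exact duplicates (assumption: identical values in all columns).
        key = tuple(sorted(row.items()))
        if key in seen:
            continue
        seen.add(key)
        yield row


def chunk(rows: Iterable[Row], size: int) -> Iterator[list[Row]]:
    """Step 3: group rows into lists of at most `size` rows."""
    batch: list[Row] = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


rows = [
    {"id": "doi:10.1234/example", "title": "Known"},
    {"id": "doi:10.1234/new", "title": "New"},
    {"id": "doi:10.1234/new", "title": "New"},
    {"id": "doi:10.1234/other issn:1234-5678", "title": "Partly known"},
]
known_ids = {"doi:10.1234/example", "issn:1234-5678"}

kept = list(filter_and_dedupe(rows, known_ids))
chunks = list(chunk(kept, size=1))
```

In this sketch a row with at least one unknown identifier is kept, which mirrors the SPARQL mode's rule of dropping only rows where all identifiers already exist.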

Storage backends#

The script supports two backends for checking whether identifiers already exist.

Redis#

The default mode checks IDs against a Redis database that maps external identifiers to OMIDs. This database is generated by the meta2redis.py script from the OpenCitations Index repository.

The meta2redis.py script populates three Redis databases, but this preprocessing script only uses database 10 (db_br), which contains bibliographic resource identifiers (DOI, ISBN, PMID, etc.).

Each identifier is stored as a Redis set (using SADD), with the identifier as key and the corresponding OMIDs as set members:

doi:10.1234/example -> {omid:br/0601234}
issn:1234-5678 -> {omid:br/0605678}

The preprocessing script uses EXISTS to check if an identifier is present in the database. Redis checks are parallelized per-file across multiple workers.
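The lookup described above can be sketched as follows. This is an illustration under stated assumptions, not the script's code: `existing_identifiers` and `FakeRedis` are hypothetical names, and the in-memory stand-in only mimics the two redis-py methods used here (`sadd` and `exists`) so the example runs without a server. With a real server, the client would be something like `redis.Redis(host="localhost", port=6379, db=10)`.

```python
from typing import Iterable, Protocol


class SupportsExists(Protocol):
    """The one method this sketch needs from a Redis client."""

    def exists(self, *names: str) -> int: ...


def existing_identifiers(client: SupportsExists, identifiers: Iterable[str]) -> list[str]:
    """Return the identifiers that are present as keys in the database."""
    # EXISTS returns the number of given keys that exist, so 1 means "present".
    return [identifier for identifier in identifiers if client.exists(identifier)]


class FakeRedis:
    """In-memory stand-in for a Redis client (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self._sets: dict[str, set[str]] = {}

    def sadd(self, name: str, *values: str) -> int:
        members = self._sets.setdefault(name, set())
        before = len(members)
        members.update(values)
        return len(members) - before

    def exists(self, *names: str) -> int:
        return sum(1 for name in names if name in self._sets)


client = FakeRedis()
# Same layout as the example above: identifier as key, OMIDs as set members.
client.sadd("doi:10.1234/example", "omid:br/0601234")
client.sadd("issn:1234-5678", "omid:br/0605678")

found = existing_identifiers(client, ["doi:10.1234/example", "doi:10.9999/unknown"])
```

Because only the presence of the key matters, the check never needs to read the set members themselves.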

SPARQL#

When a --sparql-endpoint is provided, the script queries a SPARQL triplestore (QLever) instead of Redis. This mode:

  1. Reads all CSV files in parallel

  2. Collects all unique identifiers across files

  3. Batches identifiers into SPARQL VALUES queries (batch size: 30)

  4. Executes queries in parallel using ProcessPoolExecutor (groups of 100 queries per worker)

  5. Filters out rows where all identifiers already exist

This approach reduces network round trips by deduplicating identifiers across all files and querying them in batches, rather than one at a time.
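The batching in steps 3 and 4 can be sketched as below. The batch size of 30 and the groups of 100 queries come from the list above; everything else is an assumption made for illustration. In particular, `build_values_query` is a hypothetical helper and the query shape (matching on `literal:hasLiteralValue`) is only a guess at what such a lookup might look like, not the query the script sends.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")

BATCH_SIZE = 30          # identifiers per VALUES query
QUERIES_PER_GROUP = 100  # queries handed to one worker at a time


def batched(items: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_values_query(values: Sequence[str]) -> str:
    """Build one SPARQL query that looks up a whole batch of literal values."""
    # Escape backslashes and double quotes so each value is a valid string literal.
    literals = " ".join(
        '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'
        for value in values
    )
    # Hypothetical query shape, for illustration only.
    return (
        "PREFIX literal: <http://www.essepuntato.it/2010/06/literalreification/>\n"
        "SELECT DISTINCT ?value WHERE {\n"
        f"  VALUES ?value {{ {literals} }}\n"
        "  ?id literal:hasLiteralValue ?value .\n"
        "}"
    )


identifiers = [f"10.1234/example-{n}" for n in range(65)]
queries = [build_values_query(batch) for batch in batched(identifiers, BATCH_SIZE)]
groups = list(batched(queries, QUERIES_PER_GROUP))
```

With 65 unique identifiers this produces three queries (30, 30, and 5 values) in a single group. Each group could then be submitted to a `concurrent.futures.ProcessPoolExecutor` worker.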

Basic usage#

With Redis:

uv run python -m oc_meta.run.meta.preprocess_input <INPUT_DIR> <OUTPUT_DIR> --redis-port <PORT>

With SPARQL:

uv run python -m oc_meta.run.meta.preprocess_input <INPUT_DIR> <OUTPUT_DIR> \
  --sparql-endpoint "http://localhost:8805?access-token=TOKEN"

Example with all options#

Redis mode:

uv run python -m oc_meta.run.meta.preprocess_input input/ output/ \
  --redis-port 6379 \
  --rows-per-file 5000 \
  --redis-host 192.168.1.100 \
  --redis-db 10 \
  --workers 8

SPARQL mode:

uv run python -m oc_meta.run.meta.preprocess_input input/ output/ \
  --sparql-endpoint "http://localhost:8805?access-token=TOKEN" \
  --rows-per-file 5000 \
  --workers 24

Options#

Option             Required           Default    Description
input_dir          Yes                -          Directory containing input CSV files
output             Yes                -          Output path: directory for split files, or path ending in .csv for single file
--redis-port       Yes (Redis mode)   -          Redis port
--sparql-endpoint  Yes (SPARQL mode)  -          SPARQL endpoint URL (alternative to Redis)
--rows-per-file    No                 3000       Split output into files of N rows each
--single-file      No                 -          Write all output rows to a single CSV file
--redis-host       No                 localhost  Redis hostname
--redis-db         No                 10         Redis database number
--workers          No                 4          Number of parallel workers

Either --redis-port or --sparql-endpoint must be provided. When --sparql-endpoint is set, Redis options are ignored.

--rows-per-file and --single-file are mutually exclusive. With --single-file, the output path determines the file name: if it ends in .csv, that exact path is used; otherwise, a merged.csv file is created in the given directory.
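The path rule for --single-file can be written as a small function. This is a sketch of the behavior described above, not the script's code; `resolve_single_file_path` is a hypothetical name.

```python
from pathlib import Path


def resolve_single_file_path(output: str) -> Path:
    """Return the CSV path that --single-file output would be written to."""
    if output.endswith(".csv"):
        # The output path names the file itself: use it exactly as given.
        return Path(output)
    # Otherwise the output path is a directory: write merged.csv inside it.
    return Path(output) / "merged.csv"


explicit = resolve_single_file_path("output/result.csv")
in_directory = resolve_single_file_path("output/")
```

Here `explicit` is `output/result.csv` and `in_directory` is `output/merged.csv`.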

Progress display#

During execution, the script shows a progress bar with:

  • Current phase being processed

  • Progress percentage and count (e.g., 50% 5/10)

  • Elapsed time

  • Estimated time remaining

Output report#

When finished, the script prints a summary table:

               Processing Report
┏━━━━━━━━━━━━━━��━━━━━━━━━━━━━━━┳━━━━━━━━━━┓
┃ Metric                       ┃ Value    ┃
┡━━━━━━━━━��━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━┩
│ Total input files processed  │ 10       │
│ Total input rows             │ 150000   │
│ Rows discarded (duplicates)  │ 12500    │
│ Rows discarded (existing IDs)│ 8200     │
│ Rows written to output       │ 129300   │
│                              │          │
│ Duplicate rows %             │ 8.3%     │
│ Existing IDs %               │ 5.5%     │
│ Processed rows %             │ 86.2%    │
└──────────────────────────────┴──────────┘
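The figures in the sample report are consistent with each percentage being taken relative to the total number of input rows. The arithmetic can be checked with a short sketch. The variable names and the `pct` helper are hypothetical, and the rounding to one decimal place is inferred from the sample output.

```python
total_rows = 150000
duplicates = 12500
existing = 8200

# Rows written are whatever remains after both kinds of discard.
written = total_rows - duplicates - existing


def pct(part: int, total: int) -> str:
    """Format `part` as a percentage of `total` with one decimal place."""
    if total == 0:
        return "0.0%"
    return f"{100 * part / total:.1f}%"


duplicate_pct = pct(duplicates, total_rows)
existing_pct = pct(existing, total_rows)
processed_pct = pct(written, total_rows)
```

Running this reproduces the values in the table: 129300 rows written, and 8.3%, 5.5%, and 86.2% for the three percentages.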