# Preprocessing
The preprocessing script filters and optimizes CSV files before they enter the main Meta pipeline. This step removes duplicates and filters out IDs that already exist in the database.
## What preprocessing does

- Filters out IDs that already exist in Redis or a SPARQL triplestore
- Removes duplicates across all input files
- Splits output into smaller, manageable chunks
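The deduplication step can be sketched as follows. This is an illustrative model, not the script's actual code: here rows are keyed on their full content, while the real script may key on identifiers.

```python
import csv
import io

def deduplicate_rows(files):
    """Yield CSV rows from multiple files, skipping rows already seen.

    Rows are keyed on their full content (an illustrative assumption);
    a shared header line is therefore also emitted only once.
    """
    seen = set()
    for f in files:
        for row in csv.reader(f):
            key = tuple(row)
            if key not in seen:
                seen.add(key)
                yield row

# Two input files sharing one row (and the header)
file_a = io.StringIO("id,title\ndoi:10.1/x,Paper A\n")
file_b = io.StringIO("id,title\ndoi:10.1/x,Paper A\ndoi:10.2/y,Paper B\n")
rows = list(deduplicate_rows([file_a, file_b]))
```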
## Storage backends

The script supports two backends for checking whether identifiers already exist.

### Redis
The default mode checks IDs against a Redis database that maps external identifiers to OMIDs. This database is generated by the meta2redis.py script from the OpenCitations Index repository.
The meta2redis.py script populates three Redis databases, but this preprocessing script only uses database 10 (db_br), which contains bibliographic resource identifiers (DOI, ISBN, PMID, etc.).
Each identifier is stored as a Redis set (via SADD), with the identifier as key and the corresponding OMIDs as set members:

```
doi:10.1234/example -> {omid:br/0601234}
issn:1234-5678      -> {omid:br/0605678}
```

The preprocessing script uses EXISTS to check whether an identifier is present in the database. Redis checks are parallelized per file across multiple workers.
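The key layout and existence check can be modeled with a plain dict of sets standing in for Redis (SADD maps to `set.add`, EXISTS maps to key membership). The values below are illustrative; this is a sketch of the data model, not the script's implementation.

```python
# A dict of sets standing in for Redis database 10 (db_br)
db_br = {}

def sadd(key, member):
    """Add a member to the set stored at key (like Redis SADD)."""
    db_br.setdefault(key, set()).add(member)

def exists(key):
    """Return whether the key is present (like Redis EXISTS)."""
    return key in db_br

# Populate as meta2redis.py would (example values)
sadd("doi:10.1234/example", "omid:br/0601234")
sadd("issn:1234-5678", "omid:br/0605678")

# The preprocessing check: flag identifiers that are already known
row_ids = ["doi:10.1234/example", "doi:10.9999/new"]
already_known = [i for i in row_ids if exists(i)]
```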
### SPARQL

When a --sparql-endpoint is provided, the script queries a SPARQL triplestore (QLever) instead of Redis. This mode:
- Reads all CSV files in parallel
- Collects all unique identifiers across files
- Batches identifiers into SPARQL VALUES queries (batch size: 30)
- Executes queries in parallel using ProcessPoolExecutor (groups of 100 queries per worker)
- Filters out rows where all identifiers already exist
This approach minimizes network roundtrips by querying all unique identifiers at once, rather than one at a time.
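The batching step can be sketched like this. The graph pattern and predicate in the query are assumptions for illustration; the real script's query shape depends on the Meta data model.

```python
def batch(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def build_values_query(identifiers):
    """Build one SPARQL query asking which of these identifiers exist.

    The predicate here is illustrative, not necessarily the one the
    script uses.
    """
    values = " ".join(f'"{i}"' for i in identifiers)
    return (
        "SELECT ?id WHERE { "
        f"VALUES ?literal {{ {values} }} "
        "?id <http://www.essepuntato.it/2010/06/literalreification/hasLiteralValue> ?literal . "
        "}"
    )

# 65 identifiers split into VALUES batches of 30
ids = [f"doi:10.1234/item-{n}" for n in range(65)]
batches = batch(ids, 30)
queries = [build_values_query(b) for b in batches]
```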
## Basic usage

With Redis:

```bash
uv run python -m oc_meta.run.meta.preprocess_input <INPUT_DIR> <OUTPUT_DIR> --redis-port <PORT>
```

With SPARQL:

```bash
uv run python -m oc_meta.run.meta.preprocess_input <INPUT_DIR> <OUTPUT_DIR> \
  --sparql-endpoint "http://localhost:8805?access-token=TOKEN"
```

## Example with all options
Redis mode:

```bash
uv run python -m oc_meta.run.meta.preprocess_input input/ output/ \
  --redis-port 6379 \
  --rows-per-file 5000 \
  --redis-host 192.168.1.100 \
  --redis-db 10 \
  --workers 8
```

SPARQL mode:

```bash
uv run python -m oc_meta.run.meta.preprocess_input input/ output/ \
  --sparql-endpoint "http://localhost:8805?access-token=TOKEN" \
  --rows-per-file 5000 \
  --workers 24
```

## Options
| Option | Required | Default | Description |
|---|---|---|---|
| input_dir | Yes | - | Directory containing input CSV files |
| output | Yes | - | Output path: directory for split files, or a path ending in .csv for a single file |
| --redis-port | Yes (Redis mode) | - | Redis port |
| --sparql-endpoint | Yes (SPARQL mode) | - | SPARQL endpoint URL (alternative to Redis) |
| --rows-per-file | No | 3000 | Split output into files of N rows each |
| --single-file | No | - | Write all output rows to a single CSV file |
| --redis-host | No | localhost | Redis hostname |
| --redis-db | No | 10 | Redis database number |
| --workers | No | 4 | Number of parallel workers |
Either --redis-port or --sparql-endpoint must be provided. When --sparql-endpoint is set, Redis options are ignored.
--rows-per-file and --single-file are mutually exclusive. With --single-file, the output path determines the file name: if it ends in .csv, that exact path is used; otherwise, a merged.csv file is created in the given directory.
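The output-naming and splitting rules described above can be sketched as follows. This is an illustrative model of the documented behavior, not the script's code; the function names are hypothetical.

```python
import os

def resolve_single_file_path(output_path):
    """Return the CSV path used in --single-file mode.

    If the given path ends in .csv it is used as-is; otherwise a
    merged.csv is placed inside the given directory.
    """
    if output_path.endswith(".csv"):
        return output_path
    return os.path.join(output_path, "merged.csv")

def split_into_chunks(rows, rows_per_file=3000):
    """Group output rows into chunks of at most rows_per_file each,
    one chunk per output CSV file (--rows-per-file behavior)."""
    return [rows[i:i + rows_per_file]
            for i in range(0, len(rows), rows_per_file)]

single = resolve_single_file_path("output")
chunks = split_into_chunks(list(range(7000)), rows_per_file=3000)
```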
## Progress display

During execution, the script shows a progress bar with:

- Current phase being processed
- Progress percentage and count (e.g., 50% 5/10)
- Elapsed time
- Estimated time remaining
## Output report

When finished, the script prints a summary table:

```
                Processing Report
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┓
┃ Metric                       ┃ Value    ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━┩
│ Total input files processed  │ 10       │
│ Total input rows             │ 150000   │
│ Rows discarded (duplicates)  │ 12500    │
│ Rows discarded (existing IDs)│ 8200     │
│ Rows written to output       │ 129300   │
│                              │          │
│ Duplicate rows %             │ 8.3%     │
│ Existing IDs %               │ 5.5%     │
│ Processed rows %             │ 86.2%    │
└──────────────────────────────┴──────────┘
```