<!--
SPDX-FileCopyrightText: 2026 Arcangelo Massari <arcangelo.massari@unibo.it>

SPDX-License-Identifier: ISC
-->

# Preprocessing

The preprocessing script filters and optimizes CSV files before they enter the main Meta pipeline. This step removes duplicates and filters out IDs that already exist in the database.

## What preprocessing does

1. **Filters existing IDs**: discards rows whose identifiers are already in Redis or a SPARQL triplestore
2. **Removes duplicates** across all input files
3. **Splits output** into smaller, manageable chunks
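
As a rough illustration of steps 2 and 3, the sketch below deduplicates rows and splits them into chunks. It assumes two rows count as duplicates when every column matches, which may differ from the criterion the script actually applies. The function names and the `id`/`title` columns are illustrative only.

```python
def deduplicate_rows(rows):
    """Keep the first occurrence of each row (assumption: full-row equality)."""
    seen = set()
    unique = []
    for row in rows:
        # A dict is not hashable, so build a stable, hashable key from its items.
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def split_rows(rows, rows_per_file=3000):
    """Split rows into consecutive chunks of at most rows_per_file items."""
    return [rows[i:i + rows_per_file] for i in range(0, len(rows), rows_per_file)]


rows = [
    {"id": "doi:10.1234/a", "title": "A"},
    {"id": "doi:10.1234/b", "title": "B"},
    {"id": "doi:10.1234/a", "title": "A"},  # exact duplicate of the first row
]
unique_rows = deduplicate_rows(rows)
chunks = split_rows(unique_rows, rows_per_file=1)
```

Here `unique_rows` keeps two rows, and `chunks` holds one row per chunk because `rows_per_file` is set to 1 for the demonstration.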

## Storage backends

The script supports two backends for checking whether identifiers already exist.

### Redis

The default mode checks IDs against a Redis database that maps external identifiers to OMIDs. This database is generated by the [`meta2redis.py`](https://github.com/opencitations/index/blob/master/scripts/meta2redis.py) script from the OpenCitations Index repository.

The `meta2redis.py` script populates three Redis databases, but this preprocessing script only uses **database 10** (`db_br`), which contains bibliographic resource identifiers (DOI, ISBN, PMID, etc.).

Each identifier is stored as a Redis **set** (using `SADD`), with the identifier as key and the corresponding OMIDs as set members:

```
doi:10.1234/example -> {omid:br/0601234}
issn:1234-5678 -> {omid:br/0605678}
```

The preprocessing script uses `EXISTS` to check whether an identifier is present in the database. Redis checks are parallelized per file across multiple workers.
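
The lookup can be sketched as follows. This is a minimal illustration, not the script's actual code: `split_known_ids` and `FakeRedis` are hypothetical names, and the in-memory stand-in only mimics the `exists` method of a redis-py client so the example runs without a server.

```python
def split_known_ids(client, identifiers):
    """Partition identifiers into those present as Redis keys and those absent."""
    known, unknown = [], []
    for identifier in identifiers:
        # EXISTS returns the number of matching keys (0 or 1 for a single key).
        if client.exists(identifier):
            known.append(identifier)
        else:
            unknown.append(identifier)
    return known, unknown


class FakeRedis:
    """In-memory stand-in exposing only exists(), for demonstration purposes."""

    def __init__(self, keys):
        self._keys = set(keys)

    def exists(self, *names):
        return sum(1 for name in names if name in self._keys)


# With a real server, the client would instead be created with redis-py, e.g.
#   redis.Redis(host="localhost", port=6379, db=10, decode_responses=True)
client = FakeRedis({"doi:10.1234/example", "issn:1234-5678"})
known, unknown = split_known_ids(
    client, ["doi:10.1234/example", "doi:10.9999/new-record"]
)
```

In this example `known` contains `doi:10.1234/example` and `unknown` contains `doi:10.9999/new-record`.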

### SPARQL

When a `--sparql-endpoint` is provided, the script queries a SPARQL triplestore (QLever) instead of Redis. This mode:

1. Reads all CSV files in parallel
2. Collects all unique identifiers across files
3. Batches identifiers into SPARQL `VALUES` queries (batch size: 30)
4. Executes queries in parallel using `ProcessPoolExecutor` (groups of 100 queries per worker)
5. Filters out rows where all identifiers already exist

This approach minimizes network round trips by deduplicating identifiers across all files and querying them in batches, rather than issuing one query per identifier.
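
The batching in steps 3 and 4 can be sketched as below. The batch size (30) and group size (100) come from the steps above. Everything else is an assumption made for illustration: the helper names, the `?id` variable, and passing the whole `scheme:value` string as a literal. The real query shape may differ.

```python
from itertools import islice

BATCH_SIZE = 30           # identifiers per VALUES query (from the steps above)
QUERIES_PER_WORKER = 100  # queries handed to each worker (from the steps above)


def chunked(items, size):
    """Yield consecutive lists of at most `size` items."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


def build_values_clause(identifiers):
    """Render identifiers as a SPARQL VALUES clause (hypothetical ?id variable)."""
    literals = " ".join(
        '"' + ident.replace("\\", "\\\\").replace('"', '\\"') + '"'
        for ident in identifiers
    )
    return f"VALUES ?id {{ {literals} }}"


# 65 unique identifiers become batches of 30, 30, and 5.
unique_ids = [f"doi:10.1234/{n}" for n in range(65)]
batches = list(chunked(unique_ids, BATCH_SIZE))
clauses = [build_values_clause(batch) for batch in batches]
# Each worker would then receive up to QUERIES_PER_WORKER clauses.
worker_groups = list(chunked(clauses, QUERIES_PER_WORKER))
```

With 65 identifiers, all three queries fit into a single worker group. Larger inputs would spread across several groups that can be submitted to the process pool.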

## Basic usage

With Redis:

```bash
uv run python -m oc_meta.run.meta.preprocess_input <INPUT_DIR> <OUTPUT_DIR> --redis-port <PORT>
```

With SPARQL:

```bash
uv run python -m oc_meta.run.meta.preprocess_input <INPUT_DIR> <OUTPUT_DIR> \
  --sparql-endpoint "http://localhost:8805?access-token=TOKEN"
```

## Example with all options

Redis mode:

```bash
uv run python -m oc_meta.run.meta.preprocess_input input/ output/ \
  --redis-port 6379 \
  --rows-per-file 5000 \
  --redis-host 192.168.1.100 \
  --redis-db 10 \
  --workers 8
```

SPARQL mode:

```bash
uv run python -m oc_meta.run.meta.preprocess_input input/ output/ \
  --sparql-endpoint "http://localhost:8805?access-token=TOKEN" \
  --rows-per-file 5000 \
  --workers 24
```

## Options

| Option | Required | Default | Description |
|--------|----------|---------|-------------|
| `input_dir` | Yes | - | Directory containing input CSV files |
| `output` | Yes | - | Output path: directory for split files, or path ending in `.csv` for single file |
| `--redis-port` | Yes (Redis mode) | - | Redis port |
| `--sparql-endpoint` | Yes (SPARQL mode) | - | SPARQL endpoint URL (alternative to Redis) |
| `--rows-per-file` | No | 3000 | Split output into files of N rows each |
| `--single-file` | No | - | Write all output rows to a single CSV file |
| `--redis-host` | No | localhost | Redis hostname |
| `--redis-db` | No | 10 | Redis database number |
| `--workers` | No | 4 | Number of parallel workers |

Either `--redis-port` or `--sparql-endpoint` must be provided. When `--sparql-endpoint` is set, Redis options are ignored.

`--rows-per-file` and `--single-file` are mutually exclusive. With `--single-file`, the output path determines the file name: if it ends in `.csv`, that exact path is used; otherwise, a `merged.csv` file is created in the given directory.
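
The path rule for `--single-file` can be expressed in a few lines. This is a sketch of the behavior described above, with a hypothetical function name, not the script's own code.

```python
from pathlib import Path


def resolve_single_file_path(output):
    """Return the CSV path used when all rows go to a single file."""
    if output.endswith(".csv"):
        # The output argument already names the file: use it as-is.
        return Path(output)
    # Otherwise treat the argument as a directory and write merged.csv in it.
    return Path(output) / "merged.csv"


explicit = resolve_single_file_path("output/all_rows.csv")
implicit = resolve_single_file_path("output")
```

Here `explicit` points at `output/all_rows.csv`, while `implicit` points at `output/merged.csv`.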

## Progress display

During execution, the script shows a progress bar with:
- Current phase being processed
- Progress percentage and count (e.g., `50% 5/10`)
- Elapsed time
- Estimated time remaining

## Output report

When finished, the script prints a summary table:

```
               Processing Report
┏━━━━━━━━━━━━━━��━━━━━━━━━━━━━━━┳━━━━━━━━━━┓
┃ Metric                       ┃ Value    ┃
┡━━━━━━━━━��━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━┩
│ Total input files processed  │ 10       │
│ Total input rows             │ 150000   │
│ Rows discarded (duplicates)  │ 12500    │
│ Rows discarded (existing IDs)│ 8200     │
│ Rows written to output       │ 129300   │
│                              │          │
│ Duplicate rows %             │ 8.3%     │
│ Existing IDs %               │ 5.5%     │
│ Processed rows %             │ 86.2%    │
└──────────────────────────────┴──────────┘
```
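
The figures in the sample report are consistent with the arithmetic below. The sketch assumes that each percentage is computed against the total number of input rows and that rows written equal the total minus both discarded counts. The function name is illustrative.

```python
def report_percentages(total_rows, duplicate_rows, existing_rows):
    """Derive the summary values from the three raw counters."""
    written_rows = total_rows - duplicate_rows - existing_rows

    def pct(count):
        # Guard against an empty input directory.
        return f"{count / total_rows * 100:.1f}%" if total_rows else "0.0%"

    return {
        "Rows written to output": written_rows,
        "Duplicate rows %": pct(duplicate_rows),
        "Existing IDs %": pct(existing_rows),
        "Processed rows %": pct(written_rows),
    }


summary = report_percentages(
    total_rows=150000, duplicate_rows=12500, existing_rows=8200
)
```

With the sample counters, `summary` reproduces the 129300 rows written and the 8.3%, 5.5%, and 86.2% shown in the table.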
