Merge entities
The merge script processes CSV files with merge instructions and consolidates duplicate entities.

    uv run python -m oc_meta.run.merge.entities <CSV_FOLDER> <META_CONFIG> <RESP_AGENT> [OPTIONS]

| Argument | Description |
|---|---|
| CSV_FOLDER | Folder with merge instruction CSVs |
| META_CONFIG | Path to Meta config file |
| RESP_AGENT | Responsible agent URI for provenance |
| Option | Default | Description |
|---|---|---|
| --entity_types | ra br id | Entity types to merge (space-separated) |
| --stop_file | stop.out | File to trigger graceful stop |
| --workers | 4 | Parallel workers |
Examples
Basic merge:

    uv run python -m oc_meta.run.merge.entities \
      groups/ \
      meta_config.yaml \
      https://w3id.org/oc/meta/prov/pa/1

With more workers:

    uv run python -m oc_meta.run.merge.entities \
      groups/ \
      meta_config.yaml \
      https://w3id.org/oc/meta/prov/pa/1 \
      --workers 8

Merge only bibliographic resources:

    uv run python -m oc_meta.run.merge.entities \
      groups/ \
      meta_config.yaml \
      https://w3id.org/oc/meta/prov/pa/1 \
      --entity_types br

CSV input format
Each CSV file should have a surviving_entity column and a merged_entities column, with multiple merged entities separated by semicolons:

    surviving_entity,merged_entities
    https://w3id.org/oc/meta/br/060/1,https://w3id.org/oc/meta/br/060/2;https://w3id.org/oc/meta/br/060/3

Use the output of find duplicates or group entities as input.
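To make the row layout concrete, here is a minimal sketch of reading such a file with Python's standard `csv` module. The helper name `read_merge_rows` is hypothetical and not part of oc_meta; the sketch only assumes the two column names and the semicolon separator shown in the sample above.

```python
import csv
import io


def read_merge_rows(handle):
    """Yield (surviving_entity, [merged_entities]) pairs from a merge CSV.

    Hypothetical helper for illustration; not the oc_meta reader.
    """
    for row in csv.DictReader(handle):
        surviving = row["surviving_entity"].strip()
        # Merged entities share one cell, separated by semicolons.
        merged = [
            uri.strip()
            for uri in row["merged_entities"].split(";")
            if uri.strip()
        ]
        yield surviving, merged


# In-memory sample matching the format shown above.
sample = io.StringIO(
    "surviving_entity,merged_entities\n"
    "https://w3id.org/oc/meta/br/060/1,"
    "https://w3id.org/oc/meta/br/060/2;https://w3id.org/oc/meta/br/060/3\n"
)
rows = list(read_merge_rows(sample))
```

Each element of `rows` pairs one surviving entity URI with the list of URIs to be merged into it.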
What the merge does
For each row, the script:
- Loads entities from RDF files
- Copies identifiers from merged entities to surviving entity
- Fills metadata gaps (title, date, etc.) from merged entities
- Updates references in other entities pointing to merged entities
- Keeps author/editor chains from the surviving entity (chains from merged entities are discarded)
- Records provenance for the merge operation
- Invalidates merged entities, marking them as merged
- Writes updated RDF back to files
- Uploads changes to triplestore
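The identifier, metadata, and author/editor rules in the list above can be illustrated with a toy model on plain dictionaries. This is only a sketch of the merge semantics, not the oc_meta implementation: the function `merge_records`, the dictionary layout, and the sample values are all assumptions made for illustration, and RDF loading, provenance, and triplestore upload are left out.

```python
def merge_records(surviving, merged_records):
    """Toy merge on plain dicts; illustrative only, not oc_meta code."""
    result = {
        "identifiers": list(surviving["identifiers"]),
        "metadata": dict(surviving["metadata"]),
        # The surviving entity's author/editor chain is kept unchanged.
        "contributors": list(surviving["contributors"]),
    }
    for record in merged_records:
        # Copy identifiers that the surviving entity does not have yet.
        for identifier in record["identifiers"]:
            if identifier not in result["identifiers"]:
                result["identifiers"].append(identifier)
        # Fill only the metadata fields that are missing or empty.
        for field, value in record["metadata"].items():
            if not result["metadata"].get(field):
                result["metadata"][field] = value
        # record["contributors"] is deliberately ignored.
    return result


# Hypothetical sample data.
surviving = {
    "identifiers": ["doi:10.1000/a"],
    "metadata": {"title": "A title", "date": ""},
    "contributors": ["author-1"],
}
duplicate = {
    "identifiers": ["doi:10.1000/a", "pmid:123"],
    "metadata": {"title": "Other title", "date": "2020"},
    "contributors": ["author-2"],
}
merged = merge_records(surviving, [duplicate])
```

In this model the surviving title wins, the empty date is filled from the duplicate, the extra identifier is copied over, and the duplicate's contributor chain is dropped.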
File locking
The script uses FileLock from oc_ocdm.Storer to prevent concurrent writes to the same file. Even with proper grouping, locks provide a safety net.
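For readers unfamiliar with lock files, the general idea can be sketched with the standard library alone. This is a stand-in technique (atomic creation of a `.lock` file), not the FileLock class the script actually uses; the name `exclusive_lock` and its parameters are hypothetical.

```python
import os
import tempfile
import time
from contextlib import contextmanager


@contextmanager
def exclusive_lock(target_path, timeout=10.0, poll=0.05):
    """Hold `<target_path>.lock` for the duration of the with-block.

    Illustrative stdlib sketch; not the lock used by oc_ocdm.
    """
    lock_path = target_path + ".lock"
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_CREAT | O_EXCL fails if the lock file already exists,
            # so only one process can create it at a time.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"could not acquire {lock_path}")
            time.sleep(poll)
    try:
        yield lock_path
    finally:
        os.close(fd)
        os.remove(lock_path)


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "data.json")
    with exclusive_lock(target) as lock_path:
        held = os.path.exists(lock_path)  # lock file exists while held
    released = not os.path.exists(lock_path)  # removed on exit
```

A writer that wraps every write to a file in such a block cannot interleave with another writer using the same lock path.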
Graceful interruption
To stop processing cleanly:

    touch stop.out

The script will:
- Finish current merge operations
- Save progress
- Exit with status code 0
To resume, run the same command again. Already-processed files are skipped.
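The stop-file pattern itself is simple and can be sketched as a loop that checks for the file between units of work. The function `process_files`, its parameters, and the file names below are hypothetical; where the real script checks for the stop file and how it remembers completed files are not specified here, so both are assumptions of this sketch.

```python
import os
import tempfile


def process_files(csv_files, stop_file, handle_file, already_done=()):
    """Process files in order, stopping cleanly once stop_file exists.

    Illustrative sketch; not the oc_meta control loop.
    """
    done = set(already_done)
    for path in csv_files:
        if os.path.exists(stop_file):
            break  # stop requested: leave remaining files for a later run
        if path in done:
            continue  # already processed in an earlier run
        handle_file(path)
        done.add(path)
    return done


with tempfile.TemporaryDirectory() as tmp:
    stop_file = os.path.join(tmp, "stop.out")
    files = ["a.csv", "b.csv", "c.csv"]
    seen = []

    def handler(path):
        seen.append(path)
        if path == "b.csv":
            # Simulate `touch stop.out` arriving during the run.
            open(stop_file, "w").close()

    first_run = process_files(files, stop_file, handler)
    os.remove(stop_file)  # clear the stop file before resuming
    second_run = process_files(files, stop_file, handler, already_done=first_run)
```

In this sketch the first run stops after `b.csv`, and the second run handles only `c.csv`. It also shows that a leftover stop file would halt the next run immediately, so removing it before resuming is a sensible precaution.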
Progress tracking
The script tracks processed files in memory. If interrupted and resumed, it re-processes from the beginning of the current file but skips completed files.
For very long-running merges, monitor the output for progress:

    Processing group_0001.csv: 45/100 entities
    Processing group_0001.csv: 46/100 entities
    ...
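If the output is captured to a log, progress lines of this shape can be parsed with a small regular expression. This is a sketch that assumes the line format shown above; the name `parse_progress` is hypothetical.

```python
import re

# Matches lines such as "Processing group_0001.csv: 45/100 entities".
PROGRESS_RE = re.compile(
    r"Processing (?P<file>\S+): (?P<done>\d+)/(?P<total>\d+) entities"
)


def parse_progress(line):
    """Return (file, done, total), or None if the line is not a progress line."""
    match = PROGRESS_RE.search(line)
    if match is None:
        return None
    return match["file"], int(match["done"]), int(match["total"])


progress = parse_progress("Processing group_0001.csv: 45/100 entities")
unrelated = parse_progress("Uploading changes")
```

Here `progress` is `("group_0001.csv", 45, 100)`, and lines that do not match the pattern yield `None`.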