Merge entities#
The merge script processes CSV files with merge instructions and consolidates duplicate entities.
Usage#
uv run python -m oc_meta.run.merge.entities <CSV_FOLDER> <META_CONFIG> <RESP_AGENT> [OPTIONS]
| Argument | Description |
|---|---|
| CSV_FOLDER | Folder with merge instruction CSVs |
| META_CONFIG | Path to Meta config file |
| RESP_AGENT | Responsible agent URI for provenance |
| Option | Default | Description |
|---|---|---|
| --entity_types | ra br id | Entity types to merge (space-separated) |
| | stop.out | File to trigger graceful stop |
| --workers | 4 | Parallel workers |
Examples#
Basic merge:
uv run python -m oc_meta.run.merge.entities \
groups/ \
meta_config.yaml \
https://w3id.org/oc/meta/prov/pa/1
With more workers:
uv run python -m oc_meta.run.merge.entities \
groups/ \
meta_config.yaml \
https://w3id.org/oc/meta/prov/pa/1 \
--workers 8
Merge only bibliographic resources:
uv run python -m oc_meta.run.merge.entities \
groups/ \
meta_config.yaml \
https://w3id.org/oc/meta/prov/pa/1 \
--entity_types br
CSV input format#
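Each CSV file should have a header row followed by one row per merge. The first column holds the surviving entity and the second holds the entities to merge into it, separated by semicolons: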
surviving_entity,merged_entities
https://w3id.org/oc/meta/br/060/1,https://w3id.org/oc/meta/br/060/2;https://w3id.org/oc/meta/br/060/3
Use the output of find duplicates or group entities as input.
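To make the row layout concrete, here is a minimal sketch of reading such a file with Python's standard `csv` module. The function name `read_merge_instructions` is an assumption made for illustration and is not part of oc_meta; only the column names and the semicolon separator come from the format shown above.

```python
import csv
import io


def read_merge_instructions(csv_text):
    """Return a list of (surviving_entity, [merged_entities]) pairs."""
    instructions = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Merged entities share one cell and are separated by semicolons.
        merged = [uri.strip() for uri in row["merged_entities"].split(";") if uri.strip()]
        instructions.append((row["surviving_entity"].strip(), merged))
    return instructions


sample = (
    "surviving_entity,merged_entities\n"
    "https://w3id.org/oc/meta/br/060/1,"
    "https://w3id.org/oc/meta/br/060/2;https://w3id.org/oc/meta/br/060/3\n"
)
instructions = read_merge_instructions(sample)
for surviving, merged in instructions:
    print(surviving, "<-", merged)
```

A check like this can be useful before a long run, for example to confirm that no row lists the surviving entity among its own merged entities.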
What the merge does#
For each row, the script:
1. Loads entities from RDF files
2. Copies identifiers from the merged entities to the surviving entity
3. Fills metadata gaps (title, date, etc.) in the surviving entity from the merged entities
4. Updates references in other entities that point to the merged entities
5. Keeps the author/editor chains of the surviving entity (the merged entities' chains are discarded)
6. Records provenance for the merge operation
7. Invalidates the merged entities, marking them as merged
8. Writes the updated RDF back to files
9. Uploads the changes to the triplestore
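The identifier, metadata and chain rules above can be sketched on plain dictionaries. This is an illustration only: the real script works on RDF entities, and `merge_records`, the field names and the sample values below are all hypothetical.

```python
def merge_records(surviving, merged_list):
    """Merge plain-dict records; a hypothetical stand-in for RDF entities."""
    result = dict(surviving)
    result["identifiers"] = list(surviving.get("identifiers", []))
    for merged in merged_list:
        # Copy identifiers the surviving record does not already have.
        for identifier in merged.get("identifiers", []):
            if identifier not in result["identifiers"]:
                result["identifiers"].append(identifier)
        # Fill metadata gaps only: existing values are never overwritten.
        for field in ("title", "date"):
            if not result.get(field) and merged.get(field):
                result[field] = merged[field]
        # Author chains are deliberately not copied from merged records.
    return result


# Made-up sample records, not real OpenCitations data.
surviving = {
    "identifiers": ["doi:10.1000/a"],
    "title": "A title",
    "date": None,
    "authors": ["Doe, J."],
}
duplicate = {
    "identifiers": ["doi:10.1000/a", "pmid:123"],
    "title": "Another title",
    "date": "2020",
    "authors": ["Doe, Jane"],
}
merged = merge_records(surviving, [duplicate])
print(merged)
```

In this sketch the surviving record keeps its own title and authors, gains the missing date, and ends up with the union of both identifier lists.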
File locking#
The script uses FileLock from oc_ocdm.Storer to prevent concurrent writes to the same file. Even when the input has been grouped so that parallel workers should not touch the same files, the locks provide a safety net.
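The general idea behind a lock file can be sketched with the standard library alone. This is not the FileLock mechanism used by oc_ocdm, whose internals are not described here; `exclusive_lock` is a hypothetical helper that relies on atomic exclusive file creation instead.

```python
import os
import tempfile
import time
from contextlib import contextmanager


@contextmanager
def exclusive_lock(lock_path, timeout=10.0, poll=0.05):
    """Hold lock_path until the block exits; raise TimeoutError if it stays taken."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_EXCL makes creation fail if the lock file already exists.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"could not acquire {lock_path}")
            time.sleep(poll)
    try:
        yield
    finally:
        # Release the lock even if the protected block raised.
        os.close(fd)
        os.remove(lock_path)


workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "data.json")
lock = target + ".lock"
with exclusive_lock(lock):
    with open(target, "w", encoding="utf-8") as handle:
        handle.write("{}")
    held_during_write = os.path.exists(lock)
released_after_write = not os.path.exists(lock)
```

A second worker that calls `exclusive_lock` on the same path waits until the first one leaves its `with` block, so the two writes cannot interleave.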
Graceful interruption#
To stop processing cleanly:
touch stop.out
The script will:
1. Finish current merge operations
2. Save progress
3. Exit with status code 0
To resume, run the same command again. Already-processed files are skipped.
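A stop-file check of this kind can be sketched as a loop that looks for the file before starting each new unit of work. The function `process_files` and the callback are assumptions made for illustration; the actual script's control flow may differ, and the file names after group_0001.csv are invented.

```python
import os
import tempfile


def process_files(csv_files, stop_file, process_one):
    """Process files in order, stopping cleanly once stop_file exists."""
    completed = []
    for csv_file in csv_files:
        if os.path.exists(stop_file):
            # Stop before starting new work; finished files stay finished.
            break
        process_one(csv_file)
        completed.append(csv_file)
    return completed


workdir = tempfile.mkdtemp()
stop_file = os.path.join(workdir, "stop.out")


def fake_merge(csv_file):
    # Simulate an operator running `touch stop.out` during the second file.
    if csv_file == "group_0002.csv":
        open(stop_file, "w").close()


done = process_files(
    ["group_0001.csv", "group_0002.csv", "group_0003.csv"], stop_file, fake_merge
)
print(done)  # ['group_0001.csv', 'group_0002.csv']
```

The file that is in progress when the stop file appears is allowed to finish, and only the remaining files are left for the next run.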
Progress tracking#
The script tracks processed files in memory. If it is interrupted and resumed, the file that was in progress is re-processed from its beginning, while completed files are skipped.
For very long-running merges, monitor output for progress:
Processing group_0001.csv: 45/100 entities
Processing group_0001.csv: 46/100 entities
...
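For automated monitoring, progress lines in the shape shown above could be parsed with a small regular expression. This sketch assumes the log format is exactly as in the sample; `parse_progress` is a hypothetical helper, and the real output may contain other lines or differ between versions.

```python
import re

# Matches lines such as "Processing group_0001.csv: 45/100 entities".
PROGRESS_RE = re.compile(
    r"^Processing (?P<file>\S+): (?P<done>\d+)/(?P<total>\d+) entities$"
)


def parse_progress(line):
    """Return (file name, done, total), or None if the line is not a progress line."""
    match = PROGRESS_RE.match(line.strip())
    if match is None:
        return None
    return match["file"], int(match["done"]), int(match["total"])


parsed = parse_progress("Processing group_0001.csv: 45/100 entities")
print(parsed)  # ('group_0001.csv', 45, 100)
```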