Merge overview
The merge tools find duplicate entities and consolidate them, combining their data and updating all references.
Workflow
1. Find duplicates - Scan RDF files to find entities sharing identifiers
2. Group entities - Prepare for parallel processing
3. Execute merge - Consolidate entities with provenance tracking
4. Track history - Reconstruct what was merged (optional)
Find duplicates:
uv run python -m oc_meta.run.find.duplicated_entities /data/rdf duplicates.csv br
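To make "entities sharing identifiers" concrete, here is a minimal sketch on a toy in-memory mapping. It is not how `duplicated_entities` is implemented: the function name `find_duplicate_groups`, the input shape, and the sample entity names and identifiers are all assumptions made for illustration. Entities are linked whenever they share an identifier, so indirect matches end up in the same group.

```python
from collections import defaultdict


def find_duplicate_groups(entity_ids):
    """Group entities that share at least one identifier (toy model).

    entity_ids maps an entity name to the set of identifier strings it carries.
    Returns a sorted list of sorted groups, each with two or more entities.
    """
    parent = {entity: entity for entity in entity_ids}

    def find(entity):
        # Follow parent links to the representative, compressing the path.
        while parent[entity] != entity:
            parent[entity] = parent[parent[entity]]
            entity = parent[entity]
        return entity

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the lexicographically smaller name as representative.
            if root_b < root_a:
                root_a, root_b = root_b, root_a
            parent[root_b] = root_a

    # Index entities by identifier, then link every entity sharing one.
    by_identifier = defaultdict(list)
    for entity, identifiers in entity_ids.items():
        for identifier in identifiers:
            by_identifier[identifier].append(entity)
    for entities in by_identifier.values():
        for other in entities[1:]:
            union(entities[0], other)

    groups = defaultdict(list)
    for entity in entity_ids:
        groups[find(entity)].append(entity)
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)


# Hypothetical sample data: br/1 and br/3 share no identifier directly,
# but both match br/2, so all three form one group.
sample = {
    "br/1": {"doi:10.1000/a"},
    "br/2": {"doi:10.1000/a", "pmid:111"},
    "br/3": {"pmid:111"},
    "br/4": {"doi:10.1000/b"},
}
print(find_duplicate_groups(sample))  # [['br/1', 'br/2', 'br/3']]
```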
Group for parallel processing:
uv run python -m oc_meta.run.merge.group_entities duplicates.csv groups/ meta_config.yaml
Merge:
uv run python -m oc_meta.run.merge.entities groups/ meta_config.yaml https://w3id.org/oc/meta/prov/pa/1
Optional - see what was merged:
uv run python -m oc_meta.run.find.merged_entities -c meta_config.yaml -o merged.csv --entity-type br
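The four commands above can also be chained from a small driver script. The following is a sketch only: `build_pipeline` and `run_pipeline` are hypothetical helpers, the parameter names (including `agent_uri` for the URI passed as the last argument of the merge step) are assumptions, and the argument values are simply the ones used in the examples above. By default it performs a dry run that prints the commands without executing them.

```python
import shlex
import subprocess


def build_pipeline(rdf_dir, config, agent_uri, entity_type="br",
                   duplicates_csv="duplicates.csv", groups_dir="groups/",
                   merged_csv="merged.csv"):
    """Return the four workflow commands as argument lists, in order."""
    base = ["uv", "run", "python", "-m"]
    return [
        base + ["oc_meta.run.find.duplicated_entities",
                rdf_dir, duplicates_csv, entity_type],
        base + ["oc_meta.run.merge.group_entities",
                duplicates_csv, groups_dir, config],
        base + ["oc_meta.run.merge.entities",
                groups_dir, config, agent_uri],
        base + ["oc_meta.run.find.merged_entities",
                "-c", config, "-o", merged_csv, "--entity-type", entity_type],
    ]


def run_pipeline(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for command in commands:
        print(shlex.join(command))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(command, check=True)


commands = build_pipeline("/data/rdf", "meta_config.yaml",
                          "https://w3id.org/oc/meta/prov/pa/1")
run_pipeline(commands)  # dry run: prints the commands without executing them
```

Calling `run_pipeline(commands, dry_run=False)` would execute the steps in order and stop at the first one that exits with a non-zero status, so a failed duplicate scan never feeds the merge step.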
Available tools
| Tool | Purpose |
|---|---|
| | Scan RDF files for duplicate identifiers and entities |
| | Prepare duplicates for parallel merging |
| | Execute merge operations |
| | Check merge results and generate fix queries |
| | Extract completed merges into a single file |
| | Reconstruct merge history from provenance |
What happens during merge
When entity B is merged into entity A:
- Identifiers from B are added to A
- Metadata from B fills gaps in A (titles, dates, etc.)
- Relationships pointing to B are redirected to A
- Author/editor chains from A are kept (B’s chains are discarded)
- Provenance records the merge operation
- Entity B is marked as merged and invalidated
The surviving entity (A) becomes the canonical representation. The merged entity (B) is preserved in provenance for historical queries but is no longer active.
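The merge rules described in this section can be sketched on a toy in-memory model. This is an illustration of the semantics only, not the oc_meta implementation: the `merge_entities` function, the dictionary layout (`identifiers`, `metadata`, `chain`, `merged_into`), and the sample data are all assumptions, and a "gap" is simplified to mean a metadata key that A does not have.

```python
def merge_entities(store, references, survivor, merged):
    """Merge `merged` (B) into `survivor` (A) in a toy in-memory store.

    store maps entity name -> dict with "identifiers" (set), "metadata" (dict),
    "chain" (list of author/editor names) and "merged_into" (None or a name).
    references maps a source entity -> the entity it points to.
    Returns a simple record describing the operation.
    """
    a, b = store[survivor], store[merged]

    # Identifiers from B are added to A.
    a["identifiers"] |= b["identifiers"]

    # Metadata from B fills gaps only; values already present on A win.
    for key, value in b["metadata"].items():
        a["metadata"].setdefault(key, value)

    # Relationships pointing to B are redirected to A.
    for source, target in references.items():
        if target == merged:
            references[source] = survivor

    # A's author/editor chain is left untouched; B's chain is not copied.

    # B stays in the store for historical lookups but is flagged as inactive.
    b["merged_into"] = survivor

    return {"operation": "merge", "survivor": survivor, "merged": merged}


# Hypothetical sample data.
store = {
    "br/1": {"identifiers": {"doi:10.1000/a"},
             "metadata": {"title": "A title"},
             "chain": ["Author One"], "merged_into": None},
    "br/2": {"identifiers": {"pmid:111"},
             "metadata": {"title": "Other title", "date": "2020"},
             "chain": ["Author Uno"], "merged_into": None},
}
references = {"br/9": "br/2"}

record = merge_entities(store, references, "br/1", "br/2")
print(store["br/1"]["metadata"])  # {'title': 'A title', 'date': '2020'}
print(references)                 # {'br/9': 'br/1'}
```

In this model A keeps its own title and only gains the missing date, which mirrors the "fills gaps" rule, while B remains available with a pointer to the entity that replaced it.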