Merge overview

The merge tools find duplicate entities and consolidate them, combining their data and updating all references.

  1. Find duplicates - Scan RDF files to find entities sharing identifiers
  2. Group entities - Prepare for parallel processing
  3. Execute merge - Consolidate entities with provenance tracking
  4. Track history - Reconstruct what was merged (optional)

Find duplicates:

uv run python -m oc_meta.run.find.duplicated_entities /data/rdf duplicates.csv br

Group for parallel processing:

uv run python -m oc_meta.run.merge.group_entities duplicates.csv groups/ meta_config.yaml

Merge:

uv run python -m oc_meta.run.merge.entities groups/ meta_config.yaml https://w3id.org/oc/meta/prov/pa/1

Optional - see what was merged:

uv run python -m oc_meta.run.find.merged_entities -c meta_config.yaml -o merged.csv --entity-type br
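
The four commands above can also be chained from a small driver script. The following is a minimal sketch, not part of oc_meta: the `PIPELINE` list and the `run_pipeline` helper are hypothetical names introduced here, and the arguments are copied unchanged from the commands above, so the paths remain placeholders to adapt.

```python
import subprocess

# Commands copied verbatim from the quick-start steps above.
# PIPELINE and run_pipeline are illustrative names, not oc_meta APIs.
PIPELINE = [
    ["uv", "run", "python", "-m", "oc_meta.run.find.duplicated_entities",
     "/data/rdf", "duplicates.csv", "br"],
    ["uv", "run", "python", "-m", "oc_meta.run.merge.group_entities",
     "duplicates.csv", "groups/", "meta_config.yaml"],
    ["uv", "run", "python", "-m", "oc_meta.run.merge.entities",
     "groups/", "meta_config.yaml", "https://w3id.org/oc/meta/prov/pa/1"],
    ["uv", "run", "python", "-m", "oc_meta.run.find.merged_entities",
     "-c", "meta_config.yaml", "-o", "merged.csv", "--entity-type", "br"],
]


def run_pipeline(commands, runner=subprocess.run):
    """Run each step in order, stopping at the first failure."""
    for cmd in commands:
        # check=True makes subprocess.run raise CalledProcessError on a
        # non-zero exit code, so later steps never see incomplete output.
        runner(cmd, check=True)
```

Calling `run_pipeline(PIPELINE)` executes the steps in sequence. The `runner` parameter exists only so the sequence can be dry-run or tested without invoking `uv`.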

| Tool | Purpose |
| --- | --- |
| Find duplicates | Scan RDF files for duplicate identifiers and entities |
| Group entities | Prepare duplicates for parallel merging |
| Merge entities | Execute merge operations |
| Verify merge | Check merge results and generate fix queries |
| Compact CSV | Extract completed merges into a single file |
| Merge history | Reconstruct merge history from provenance |

When entity B is merged into entity A:

  1. Identifiers from B are added to A
  2. Metadata from B fills gaps in A (titles, dates, etc.)
  3. Relationships pointing to B are redirected to A
  4. Author/editor chains from A are kept (B’s chains are discarded)
  5. Provenance records the merge operation
  6. Entity B is marked as merged and invalidated

The surviving entity (A) becomes the canonical representation. The merged entity (B) is preserved in provenance for historical queries but is no longer active.
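
The six effects listed above can be illustrated with a small in-memory model. This is a simplified sketch under stated assumptions, not the oc_meta implementation: the `Entity` class, the `merge_into` function, the dictionary of references, and the short entity names such as `br/1` are all hypothetical and stand in for the real RDF data and provenance snapshots.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Entity:
    # Hypothetical in-memory stand-in for an RDF entity.
    uri: str
    identifiers: set = field(default_factory=set)
    metadata: dict = field(default_factory=dict)
    agent_chain: list = field(default_factory=list)  # author/editor order
    active: bool = True
    merged_into: Optional[str] = None


def merge_into(a, b, references, provenance):
    """Merge entity b into entity a, mirroring the six steps above."""
    # 1. Identifiers from B are added to A.
    a.identifiers |= b.identifiers
    # 2. Metadata from B only fills gaps; values already on A win.
    for key, value in b.metadata.items():
        if not a.metadata.get(key):
            a.metadata[key] = value
    # 3. Relationships pointing to B are redirected to A.
    for source, target in list(references.items()):
        if target == b.uri:
            references[source] = a.uri
    # 4. A's author/editor chain is kept; B's chain is simply not copied.
    # 5. Provenance records the merge operation.
    provenance.append({"survivor": a.uri, "merged": b.uri})
    # 6. B is marked as merged and invalidated.
    b.active = False
    b.merged_into = a.uri
    return a
```

A short usage example, with made-up values:

```python
a = Entity("br/1", {"doi:10.1/x"}, {"title": "Kept title"}, ["ar/1"])
b = Entity("br/2", {"pmid:1"}, {"title": "Other", "date": "2020"}, ["ar/9"])
references = {"br/3": "br/2"}  # br/3 points at the duplicate
provenance = []
merge_into(a, b, references, provenance)
# a now carries both identifiers, keeps its own title, gains the date,
# and br/3 points at br/1; b is inactive with merged_into == "br/1".
```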