
# Merge entities

The merge script processes CSV files with merge instructions and consolidates duplicate entities.

```shell
uv run python -m oc_meta.run.merge.entities <CSV_FOLDER> <META_CONFIG> <RESP_AGENT> [OPTIONS]
```

| Argument | Description |
| --- | --- |
| `CSV_FOLDER` | Folder with merge instruction CSVs |
| `META_CONFIG` | Path to the Meta configuration file |
| `RESP_AGENT` | Responsible agent URI for provenance |

| Option | Default | Description |
| --- | --- | --- |
| `--entity_types` | `ra br id` | Entity types to merge (space-separated) |
| `--stop_file` | `stop.out` | File that triggers a graceful stop |
| `--workers` | `4` | Number of parallel workers |

Basic merge:

```shell
uv run python -m oc_meta.run.merge.entities \
  groups/ \
  meta_config.yaml \
  https://w3id.org/oc/meta/prov/pa/1
```

With more workers:

```shell
uv run python -m oc_meta.run.merge.entities \
  groups/ \
  meta_config.yaml \
  https://w3id.org/oc/meta/prov/pa/1 \
  --workers 8
```

Merge only bibliographic resources:

```shell
uv run python -m oc_meta.run.merge.entities \
  groups/ \
  meta_config.yaml \
  https://w3id.org/oc/meta/prov/pa/1 \
  --entity_types br
```

Each CSV file must contain a `surviving_entity` column and a `merged_entities` column, with multiple merged entities separated by semicolons:

```csv
surviving_entity,merged_entities
https://w3id.org/oc/meta/br/060/1,https://w3id.org/oc/meta/br/060/2;https://w3id.org/oc/meta/br/060/3
```
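If you generate merge instructions programmatically, the standard `csv` module produces the expected layout. The entity URIs below are the example values from above; the `rows` variable is purely illustrative:

```python
import csv
import io

# Each row: one surviving entity plus its duplicates,
# duplicates joined with ";" as the merge script expects.
rows = [
    {
        "surviving_entity": "https://w3id.org/oc/meta/br/060/1",
        "merged_entities": ";".join(
            [
                "https://w3id.org/oc/meta/br/060/2",
                "https://w3id.org/oc/meta/br/060/3",
            ]
        ),
    }
]

buffer = io.StringIO()  # write to a real file path in practice
writer = csv.DictWriter(buffer, fieldnames=["surviving_entity", "merged_entities"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```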

Use the output from the find-duplicates or group-entities scripts.

For each row, the script:

  1. Loads entities from RDF files
  2. Copies identifiers from merged entities to surviving entity
  3. Fills metadata gaps (title, date, etc.) from merged entities
  4. Updates references in other entities pointing to merged entities
  5. Keeps author/editor chains from the surviving entity (the merged entities' chains are discarded)
  6. Records provenance for the merge operation
  7. Invalidates the merged entities, marking them as merged
  8. Writes updated RDF back to files
  9. Uploads changes to triplestore
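Steps 2 and 3 above can be sketched in simplified form, treating entities as plain dicts rather than RDF resources. This is a hypothetical illustration of the merge semantics, not the script's actual code:

```python
def merge_row(surviving: dict, merged: list[dict]) -> dict:
    """Merge duplicate entity records into the surviving one (in-memory sketch)."""
    result = dict(surviving)
    ids = list(result.get("identifiers", []))
    for entity in merged:
        # Step 2: copy identifiers not already on the survivor
        for identifier in entity.get("identifiers", []):
            if identifier not in ids:
                ids.append(identifier)
        # Step 3: fill metadata gaps (title, date, ...) from merged entities,
        # never overwriting values the surviving entity already has
        for field in ("title", "pub_date"):
            if not result.get(field) and entity.get(field):
                result[field] = entity[field]
    result["identifiers"] = ids
    return result
```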

The script uses FileLock from oc_ocdm.Storer to prevent concurrent writes to the same file. Even with proper grouping, locks provide a safety net.

To stop processing cleanly:

```shell
touch stop.out
```

The script will:

  1. Finish current merge operations
  2. Save progress
  3. Exit with status code 0

To resume, run the same command again. Already-processed files are skipped.

The script tracks processed files in memory. If interrupted and resumed, it re-processes the current file from the beginning but skips files that were already completed.

For very long-running merges, monitor the output for progress:

```
Processing group_0001.csv: 45/100 entities
Processing group_0001.csv: 46/100 entities
...
```