Find duplicates

These scripts scan RDF files in ZIP archives to find duplicates that need merging.

duplicated_ids finds identifier entities that share the same scheme and literal value, indicating duplicates in the id/ folder.

uv run python -m oc_meta.run.find.duplicated_ids <FOLDER_PATH> <CSV_PATH> [OPTIONS]

| Argument | Description |
|---|---|
| FOLDER_PATH | Path to folder containing the id/ subfolder with ZIP files |
| CSV_PATH | Output CSV file for duplicates |

| Option | Default | Description |
|---|---|---|
| --chunk-size | 5000 | ZIP files to process per chunk (results saved to temp files between chunks) |
| --temp-dir | system temp | Directory for temporary files |

Example:

uv run python -m oc_meta.run.find.duplicated_ids /data/meta/rdf duplicated_ids.csv

Output CSV format:

surviving_entity,merged_entities
https://w3id.org/oc/meta/id/0601,https://w3id.org/oc/meta/id/0602; https://w3id.org/oc/meta/id/0603

The surviving entity is arbitrarily selected from the duplicate set.

duplicated_entities finds bibliographic resources (BR) or responsible agents (RA) that share identifiers.

uv run python -m oc_meta.run.find.duplicated_entities <FOLDER_PATH> <CSV_PATH> <RESOURCE_TYPE>

| Argument | Description |
|---|---|
| FOLDER_PATH | Path to RDF folder (should contain br/ and/or ra/) |
| CSV_PATH | Output CSV file |
| RESOURCE_TYPE | br for bibliographic resources, ra for responsible agents, both for both |

Find duplicate bibliographic resources:

uv run python -m oc_meta.run.find.duplicated_entities /data/rdf dup_br.csv br

Find duplicate responsible agents:

uv run python -m oc_meta.run.find.duplicated_entities /data/rdf dup_ra.csv ra

Find both:

uv run python -m oc_meta.run.find.duplicated_entities /data/rdf dup_all.csv both

Output CSV format:

surviving_entity,merged_entities
https://w3id.org/oc/meta/br/0601,https://w3id.org/oc/meta/br/0602; https://w3id.org/oc/meta/br/0603

The surviving entity is arbitrarily selected from the duplicate set.
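
Both scripts write the same two-column layout, so one reader can handle either report. The following is a minimal sketch for loading it in Python, assuming the merged_entities cell separates URIs with a semicolon as in the sample rows above. The function name read_duplicates is hypothetical and not part of oc_meta.

```python
import csv
import io


def read_duplicates(csv_file):
    """Yield (surviving_entity, [merged_entities]) for each row of the report."""
    reader = csv.DictReader(csv_file)
    for row in reader:
        # merged_entities is a single cell; URIs are assumed to be
        # separated by ";" (with optional surrounding whitespace).
        merged = [uri.strip() for uri in row["merged_entities"].split(";") if uri.strip()]
        yield row["surviving_entity"], merged


# In-memory sample mirroring the output shown above.
sample = io.StringIO(
    "surviving_entity,merged_entities\n"
    "https://w3id.org/oc/meta/br/0601,"
    "https://w3id.org/oc/meta/br/0602; https://w3id.org/oc/meta/br/0603\n"
)
rows = list(read_duplicates(sample))
```

To read a real report, pass an open file instead of the StringIO object, for example open("dup_br.csv", newline="", encoding="utf-8").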

duplicated_ids: Finds identifier entities (id/) that have the same scheme and literal value. For example, two ID entities both representing doi:10.1234/a are duplicates.
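
The grouping rule can be illustrated with a short sketch. It is not the script's actual code: it assumes the identifier entities have already been extracted from the RDF files into (entity URI, scheme, value) tuples, and the name find_duplicate_ids is hypothetical.

```python
from collections import defaultdict


def find_duplicate_ids(id_records):
    """Group identifier entity URIs by (scheme, literal value).

    id_records: iterable of (entity_uri, scheme, value) tuples.
    Returns a list of (surviving_entity, merged_entities) pairs.
    """
    groups = defaultdict(list)
    for uri, scheme, value in id_records:
        groups[(scheme, value)].append(uri)

    duplicates = []
    for uris in groups.values():
        if len(uris) > 1:
            # The survivor is arbitrary; this sketch keeps the first one seen.
            duplicates.append((uris[0], uris[1:]))
    return duplicates


# Illustrative records: two entities for doi:10.1234/a, one for another DOI.
records = [
    ("https://w3id.org/oc/meta/id/0601", "doi", "10.1234/a"),
    ("https://w3id.org/oc/meta/id/0602", "doi", "10.1234/a"),
    ("https://w3id.org/oc/meta/id/0604", "doi", "10.1234/b"),
]
result = find_duplicate_ids(records)
```

Only the pair sharing doi:10.1234/a is reported; the entity with a unique value is left out.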

duplicated_entities: Finds BR or RA entities that reference the same identifier URI. For example:

  • br/0601 has datacite:hasIdentifier pointing to id/0610
  • br/0602 has datacite:hasIdentifier pointing to id/0610

These share the same identifier entity, so they’re duplicates.

The duplicated_entities script uses Union-Find to handle transitive relationships. If A shares an identifier with B, and B shares an identifier with C, then A, B, and C are all grouped together even if A and C share no direct identifier.
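
The transitive grouping can be sketched with a small Union-Find structure. This is an illustration of the technique under simplifying assumptions, not the script's implementation: the input is a plain dictionary mapping each entity to the identifier URIs it references, and the names UnionFind and group_entities are hypothetical.

```python
from collections import defaultdict


class UnionFind:
    """Minimal disjoint-set structure with path compression."""

    def __init__(self):
        self.parent = {}

    def find(self, item):
        self.parent.setdefault(item, item)
        # Walk up to the root of the set.
        root = item
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited node directly at the root.
        while self.parent[item] != root:
            next_item = self.parent[item]
            self.parent[item] = root
            item = next_item
        return root

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def group_entities(entity_to_ids):
    """Return sets of entities connected through shared identifier URIs."""
    uf = UnionFind()
    first_seen = {}  # identifier URI -> first entity seen referencing it
    for entity, ids in entity_to_ids.items():
        uf.find(entity)  # register the entity even if it has no identifiers
        for id_uri in ids:
            if id_uri in first_seen:
                uf.union(first_seen[id_uri], entity)
            else:
                first_seen[id_uri] = entity

    groups = defaultdict(set)
    for entity in entity_to_ids:
        groups[uf.find(entity)].add(entity)
    # Singleton sets are not duplicates.
    return [members for members in groups.values() if len(members) > 1]


# br/0601 and br/0603 share no identifier directly, but both link to br/0602.
links = {
    "br/0601": ["id/0610"],
    "br/0602": ["id/0610", "id/0611"],
    "br/0603": ["id/0611"],
    "br/0604": ["id/0612"],
}
groups = group_entities(links)
```

Here br/0601, br/0602, and br/0603 end up in one set, while br/0604 stays out because its identifier is not shared.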

Expected directory structure:

/data/meta/rdf/
├── br/
│   └── 060/
│       └── 10000/
│           ├── 1000.zip
│           └── ...
├── ra/
│   └── ...
└── id/
    └── ...
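
For orientation, a layout like this can be walked in chunks of ZIP files, which is the idea behind the --chunk-size option. The sketch below is an assumption about how such a traversal might look, not the scripts' code. The helper names are hypothetical, and the demo builds a throwaway tree with placeholder JSON members because the real archive contents are not described here.

```python
import tempfile
import zipfile
from itertools import islice
from pathlib import Path


def iter_zip_chunks(rdf_root, subfolder, chunk_size):
    """Yield lists of at most chunk_size ZIP paths found under rdf_root/subfolder."""
    zip_paths = iter(sorted((Path(rdf_root) / subfolder).rglob("*.zip")))
    while True:
        chunk = list(islice(zip_paths, chunk_size))
        if not chunk:
            return
        yield chunk


def read_zip_members(zip_path):
    """Yield (member_name, raw_bytes) for every file in one archive."""
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            yield name, archive.read(name)


# Demo on a temporary tree that mimics br/060/10000/*.zip.
with tempfile.TemporaryDirectory() as root:
    folder = Path(root) / "br" / "060" / "10000"
    folder.mkdir(parents=True)
    for stem in ("1000", "2000", "3000"):
        with zipfile.ZipFile(folder / f"{stem}.zip", "w") as archive:
            archive.writestr(f"{stem}.json", "{}")  # placeholder content

    chunks = [[p.name for p in chunk] for chunk in iter_zip_chunks(root, "br", 2)]
    members = [name for name, _ in read_zip_members(folder / "1000.zip")]
```

With a chunk size of 2, the three demo archives are split into one chunk of two files and one chunk of one file.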

Use the output CSV with:

  1. Group entities - Prepare for parallel merging
  2. Merge entities - Execute the merge