Find duplicates

These scripts scan RDF files in ZIP archives to find duplicates that need merging.

Find duplicate identifiers

Finds identifier entities in the id/ folder that share the same scheme and literal value, which marks them as duplicates.

uv run python -m oc_meta.run.find.duplicated_ids <FOLDER_PATH> <CSV_PATH> [OPTIONS]

Argument	Description
`FOLDER_PATH`	Path to folder containing the `id/` subfolder with ZIP files
`CSV_PATH`	Output CSV file for duplicates

Option	Default	Description
`--chunk-size`	5000	Number of ZIP files to process per chunk (results are saved to temporary files between chunks)
`--temp-dir`	system temp	Directory for temporary files

Example:

uv run python -m oc_meta.run.find.duplicated_ids /data/meta/rdf duplicated_ids.csv

Output format

surviving_entity,merged_entities
https://w3id.org/oc/meta/id/0601,https://w3id.org/oc/meta/id/0602; https://w3id.org/oc/meta/id/0603

The surviving entity is selected arbitrarily from the duplicate set; the entities to merge into it are listed in merged_entities, separated by "; ".
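
As a minimal sketch of how this output could be consumed, the snippet below parses the CSV into a mapping from each surviving entity to its merged entities. It assumes the merged entities are separated by semicolons, as in the sample row above. The function name `parse_duplicates` is hypothetical and not part of oc_meta.

```python
import csv
import io


def parse_duplicates(csv_text):
    """Map each surviving entity to the list of entities merged into it."""
    reader = csv.DictReader(io.StringIO(csv_text))
    result = {}
    for row in reader:
        # Assumption: merged entities are separated by ';' with optional spaces.
        merged = [uri.strip() for uri in row["merged_entities"].split(";") if uri.strip()]
        result[row["surviving_entity"]] = merged
    return result


# Sample taken from the output format shown above.
sample = (
    "surviving_entity,merged_entities\n"
    "https://w3id.org/oc/meta/id/0601,"
    "https://w3id.org/oc/meta/id/0602; https://w3id.org/oc/meta/id/0603\n"
)
groups = parse_duplicates(sample)
```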

Find duplicate entities

Finds bibliographic resources (BR) or responsible agents (RA) that share at least one identifier.

uv run python -m oc_meta.run.find.duplicated_entities <FOLDER_PATH> <CSV_PATH> <RESOURCE_TYPE>

Argument	Description
`FOLDER_PATH`	Path to RDF folder (should contain `br/` and/or `ra/`)
`CSV_PATH`	Output CSV file
`RESOURCE_TYPE`	`br` for bibliographic resources, `ra` for responsible agents, `both` for both

Find duplicate bibliographic resources:

uv run python -m oc_meta.run.find.duplicated_entities /data/rdf dup_br.csv br

Find duplicate responsible agents:

uv run python -m oc_meta.run.find.duplicated_entities /data/rdf dup_ra.csv ra

Find both:

uv run python -m oc_meta.run.find.duplicated_entities /data/rdf dup_all.csv both

Output format

surviving_entity,merged_entities
https://w3id.org/oc/meta/br/0601,https://w3id.org/oc/meta/br/0602; https://w3id.org/oc/meta/br/0603

The surviving entity is selected arbitrarily from the duplicate set; the entities to merge into it are listed in merged_entities, separated by "; ".

How duplicates are detected

duplicated_ids: Finds identifier entities (id/) that have the same scheme and literal value. For example, two ID entities both representing doi:10.1234/a are duplicates.
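
The grouping idea can be illustrated with a short sketch. This is not the script's actual code: the input here is a hypothetical list of `(entity_uri, scheme, value)` tuples, whereas the real script extracts this information from the RDF files inside the ZIP archives. The function name `group_duplicate_ids` is also an assumption.

```python
from collections import defaultdict


def group_duplicate_ids(records):
    """Group identifier entities by (scheme, literal value).

    `records` is a hypothetical iterable of (entity_uri, scheme, value)
    tuples, used here only for illustration.
    """
    by_key = defaultdict(list)
    for entity_uri, scheme, value in records:
        by_key[(scheme, value)].append(entity_uri)
    # Only keys shared by two or more entities indicate duplicates.
    return {key: uris for key, uris in by_key.items() if len(uris) > 1}


records = [
    ("https://w3id.org/oc/meta/id/0601", "doi", "10.1234/a"),
    ("https://w3id.org/oc/meta/id/0602", "doi", "10.1234/a"),
    ("https://w3id.org/oc/meta/id/0603", "doi", "10.5678/b"),
]
duplicates = group_duplicate_ids(records)
```

In this example only the two entities representing doi:10.1234/a end up in a duplicate group; the third entity has a unique value and is left out.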

duplicated_entities: Finds BR or RA entities that reference the same identifier URI. For example:

br/0601 has datacite:hasIdentifier pointing to id/0610
br/0602 has datacite:hasIdentifier pointing to id/0610

These share the same identifier entity, so they’re duplicates.

The duplicated_entities script uses Union-Find to handle transitive relationships. If A shares an identifier with B, and B shares an identifier with C, then A, B, and C are all grouped together even if A and C share no direct identifier.
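
The transitive grouping described above can be sketched with a small Union-Find structure. This is an illustrative implementation under stated assumptions, not the one used by the script: the class `UnionFind`, the function `group_by_shared_identifier`, and the input mapping from entity to identifier URIs are all hypothetical.

```python
class UnionFind:
    """Minimal disjoint-set structure with path compression."""

    def __init__(self):
        self.parent = {}

    def find(self, item):
        self.parent.setdefault(item, item)
        root = item
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited node directly at the root.
        while self.parent[item] != root:
            self.parent[item], item = root, self.parent[item]
        return root

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def group_by_shared_identifier(entity_ids):
    """Group entities that are linked, directly or transitively, by identifiers."""
    uf = UnionFind()
    first_seen = {}  # identifier URI -> first entity seen referencing it
    for entity, identifiers in entity_ids.items():
        uf.find(entity)  # register the entity even if it shares nothing
        for identifier in identifiers:
            if identifier in first_seen:
                uf.union(first_seen[identifier], entity)
            else:
                first_seen[identifier] = entity
    groups = {}
    for entity in entity_ids:
        groups.setdefault(uf.find(entity), set()).add(entity)
    # Keep only groups with at least two members, i.e. actual duplicates.
    return [group for group in groups.values() if len(group) > 1]


# br/0601 and br/0603 share no identifier directly, but both are linked to br/0602.
entity_ids = {
    "br/0601": ["id/0610"],
    "br/0602": ["id/0610", "id/0611"],
    "br/0603": ["id/0611"],
    "br/0604": ["id/0612"],
}
groups = group_by_shared_identifier(entity_ids)
```

Here br/0601, br/0602, and br/0603 form a single group, while br/0604 is not reported because it shares no identifier with any other entity.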

Expected directory structure

/data/meta/rdf/
├── br/
│   └── 060/
│       └── 10000/
│           ├── 1000.zip
│           └── ...
├── ra/
│   └── ...
└── id/
    └── ...
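
To check that a folder matches this layout before running the scripts, a quick scan for ZIP files can help. The sketch below is an assumption-laden illustration: `list_zip_files` is a hypothetical helper, and the file name written inside the test archive is invented for the example.

```python
import tempfile
import zipfile
from pathlib import Path


def list_zip_files(rdf_root, entity_type):
    """Return the sorted ZIP paths found under <rdf_root>/<entity_type>/."""
    return sorted((Path(rdf_root) / entity_type).rglob("*.zip"))


# Build a tiny throwaway tree mirroring the layout above, then scan it.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "br" / "060" / "10000"
    folder.mkdir(parents=True)
    with zipfile.ZipFile(folder / "1000.zip", "w") as archive:
        # Assumption: the name and content of the archived file are placeholders.
        archive.writestr("1000.json", "[]")
    found = [p.relative_to(tmp).as_posix() for p in list_zip_files(tmp, "br")]
```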

Next steps

Use the output CSV with:

Group entities - Prepare for parallel merging
Merge entities - Execute the merge