Find duplicates
These scripts scan RDF files in ZIP archives to find duplicates that need merging.
Find duplicate identifiers
Finds identifier entities that share the same value, indicating duplicates in the id/ folder.
uv run python -m oc_meta.run.find.duplicated_ids <FOLDER_PATH> <CSV_PATH> [OPTIONS]
| Argument | Description |
|---|---|
| FOLDER_PATH | Path to folder containing the id/ subfolder with ZIP files |
| CSV_PATH | Output CSV file for duplicates |
| Option | Default | Description |
|---|---|---|
| --chunk-size | 5000 | ZIP files to process per chunk (results saved to temp files between chunks) |
| --temp-dir | system temp | Directory for temporary files |
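The chunking behaviour described for --chunk-size and --temp-dir can be pictured with the minimal sketch below. It is not the script's actual code: the function names, the JSON temp-file format, and the `process_chunk` callback are all hypothetical and only illustrate the idea of saving partial results between chunks.

```python
import json
import tempfile
from pathlib import Path


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def process_in_chunks(zip_paths, chunk_size, temp_dir, process_chunk):
    """Process paths chunk by chunk, saving each partial result to a temp file.

    `process_chunk` is a hypothetical worker that takes a list of paths and
    returns a JSON-serializable partial result. `temp_dir=None` falls back to
    the system temp directory.
    """
    temp_files = []
    for chunk in chunked(zip_paths, chunk_size):
        partial = process_chunk(chunk)
        with tempfile.NamedTemporaryFile(
            "w", suffix=".json", dir=temp_dir, delete=False
        ) as handle:
            json.dump(partial, handle)
            temp_files.append(Path(handle.name))
    return temp_files
```

Keeping only one chunk's results in memory at a time is the usual reason for this pattern; a smaller chunk size trades more temp files for a lower memory peak.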
Example:
uv run python -m oc_meta.run.find.duplicated_ids /data/meta/rdf duplicated_ids.csv
Output format
surviving_entity,merged_entities
https://w3id.org/oc/meta/id/0601,https://w3id.org/oc/meta/id/0602; https://w3id.org/oc/meta/id/0603
The surviving entity is arbitrarily selected from the duplicate set.
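As a rough illustration of how this output could be consumed, the sketch below reads the CSV into a dictionary. It assumes the two column names shown above and that merged entities are separated by semicolons, as in the sample row; `parse_duplicates_csv` is a hypothetical helper, not part of oc_meta.

```python
import csv
import io


def parse_duplicates_csv(text):
    """Map each surviving entity URI to the list of entities merged into it."""
    reader = csv.DictReader(io.StringIO(text))
    result = {}
    for row in reader:
        # Merged entities are assumed to be separated by ';' with optional spaces.
        merged = [uri.strip() for uri in row["merged_entities"].split(";") if uri.strip()]
        result[row["surviving_entity"]] = merged
    return result


sample = (
    "surviving_entity,merged_entities\n"
    "https://w3id.org/oc/meta/id/0601,"
    "https://w3id.org/oc/meta/id/0602; https://w3id.org/oc/meta/id/0603\n"
)
groups = parse_duplicates_csv(sample)
```

For a real file, the same function can be fed the contents of the CSV path passed to the script.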
Find duplicate entities
Finds bibliographic resources or responsible agents that share identifiers.
uv run python -m oc_meta.run.find.duplicated_entities <FOLDER_PATH> <CSV_PATH> <RESOURCE_TYPE>
| Argument | Description |
|---|---|
| FOLDER_PATH | Path to RDF folder (should contain br/ and/or ra/) |
| CSV_PATH | Output CSV file |
| RESOURCE_TYPE | br for bibliographic resources, ra for responsible agents, both for both |
Find duplicate bibliographic resources:
uv run python -m oc_meta.run.find.duplicated_entities /data/rdf dup_br.csv br
Find duplicate responsible agents:
uv run python -m oc_meta.run.find.duplicated_entities /data/rdf dup_ra.csv ra
Find both:
uv run python -m oc_meta.run.find.duplicated_entities /data/rdf dup_all.csv both
Output format
surviving_entity,merged_entities
https://w3id.org/oc/meta/br/0601,https://w3id.org/oc/meta/br/0602; https://w3id.org/oc/meta/br/0603
The surviving entity is arbitrarily selected from the duplicate set.
How duplicates are detected
duplicated_ids: Finds identifier entities (id/) that have the same scheme and literal value. For example, two ID entities both representing doi:10.1234/a are duplicates.
duplicated_entities: Finds BR or RA entities that reference the same identifier URI. For example:
- br/0601 has datacite:hasIdentifier pointing to id/0610
- br/0602 has datacite:hasIdentifier pointing to id/0610
These share the same identifier entity, so they’re duplicates.
The duplicated_entities script uses Union-Find to handle transitive relationships. If A shares an identifier with B, and B shares an identifier with C, then A, B, and C are all grouped together even if A and C share no direct identifier.
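The transitive grouping can be sketched with a small Union-Find structure. This is an illustrative version only, not the script's own implementation: the `UnionFind` class, the `group_duplicates` function, and the shortened entity names are assumptions made for the example.

```python
from collections import defaultdict


class UnionFind:
    """Minimal disjoint-set structure with path compression."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path directly at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def group_duplicates(entity_ids):
    """Group entities that are linked, directly or transitively, by shared identifiers."""
    uf = UnionFind()
    first_seen = {}  # identifier URI -> first entity seen referencing it
    for entity, identifiers in entity_ids.items():
        uf.find(entity)  # register the entity even if it shares nothing
        for identifier in identifiers:
            if identifier in first_seen:
                uf.union(first_seen[identifier], entity)
            else:
                first_seen[identifier] = entity
    groups = defaultdict(set)
    for entity in entity_ids:
        groups[uf.find(entity)].add(entity)
    # Only sets with more than one member are duplicates.
    return [members for members in groups.values() if len(members) > 1]


entity_ids = {
    "br/0601": {"id/0610"},
    "br/0602": {"id/0610", "id/0611"},
    "br/0603": {"id/0611"},
    "br/0604": {"id/0612"},
}
duplicate_groups = group_duplicates(entity_ids)
```

Here br/0601 and br/0603 share no identifier directly, but both are linked through br/0602, so all three land in one group while br/0604 is left out.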
Expected directory structure
/data/meta/rdf/
├── br/
│   └── 060/
│       └── 10000/
│           ├── 1000.zip
│           └── ...
├── ra/
│   └── ...
└── id/
    └── ...
Next steps
Use the output CSV with:
- Group entities - Prepare for parallel merging
- Merge entities - Execute the merge