Verify merge#
These scripts verify that merge operations completed correctly by checking RDF files, provenance, and the triplestore. If issues are found, they generate SPARQL queries to fix them.
Scripts#
Three scripts are available, one for each entity type:
| Script | Entity type |
|---|---|
| check_merged_brs_results | Bibliographic resources (BR) |
| | Responsible agents (RA) |
| | Identifiers (ID) |
Usage#
uv run python -m oc_meta.run.merge.check_merged_brs_results <CSV_FOLDER> <RDF_DIR> --meta_config <CONFIG> --query_output <OUTPUT_DIR>
| Argument | Description |
|---|---|
| <CSV_FOLDER> | Folder containing merge CSV files (with |
| <RDF_DIR> | Path to RDF directory |
| --meta_config | Path to meta configuration file |
| --query_output | Folder where fix queries will be saved |
Example#
uv run python -m oc_meta.run.merge.check_merged_brs_results \
groups/ \
/data/rdf \
--meta_config meta_config.yaml \
--query_output fix_queries/
What gets checked#
For each row marked as Done=True in the CSV files:
RDF files:
Surviving entity exists
Merged entities are deleted
Entity constraints are valid (types, identifiers, required properties)
Provenance:
Correct number of snapshots
Sequential snapshot numbering
Generation and invalidation timestamps
Derivation chain (prov:wasDerivedFrom)
Merge snapshots derived from multiple sources
Triplestore (SPARQL):
Surviving entity exists
Merged entities don’t exist
No references to merged entities remain
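The provenance checks listed above can be pictured with a small sketch. It is not the scripts' actual code: it assumes snapshot URIs end in `/prov/se/<number>` (an assumption made for this illustration), and the function names are hypothetical.

```python
import re

# Assumed snapshot URI shape for this sketch: <entity>/prov/se/<number>
SNAPSHOT_NUMBER = re.compile(r"/prov/se/(\d+)$")


def snapshot_numbers(snapshot_uris):
    """Extract the trailing snapshot number from each URI, sorted ascending."""
    numbers = []
    for uri in snapshot_uris:
        match = SNAPSHOT_NUMBER.search(uri)
        if match is None:
            raise ValueError(f"not a snapshot URI: {uri}")
        numbers.append(int(match.group(1)))
    return sorted(numbers)


def check_sequential(snapshot_uris):
    """Return a list of problems; empty means the numbering is 1..n without gaps."""
    numbers = snapshot_numbers(snapshot_uris)
    expected = list(range(1, len(numbers) + 1))
    if numbers == expected:
        return []
    missing = sorted(set(expected) - set(numbers))
    return [f"snapshot numbering is not sequential: found {numbers}, missing {missing}"]


def check_merge_derivation(derived_from):
    """A merge snapshot is expected to be derived from more than one source."""
    if len(set(derived_from)) < 2:
        return ["merge snapshot is derived from fewer than two sources"]
    return []
```

Each function returns a list of problem descriptions rather than raising, so that all issues for an entity can be collected and reported together.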
Entity-specific constraints#
BR (bibliographic resources):
Must be fabio:Expression
At most two types
At least one identifier
At most one title, partOf, publication date, sequence identifier
RA (responsible agents):
Must be foaf:Agent
Exactly one type
At least one identifier
At least one name property (name, givenName, or familyName)
ID (identifiers):
Must be datacite:Identifier
Exactly one usesIdentifierScheme
Exactly one hasLiteralValue
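As an illustration of the BR rules above, here is a minimal sketch of a constraint check. The in-memory entity layout (a dict with `types`, `identifiers`, and `properties` keys) and the property labels are hypothetical and chosen only for this example; the real scripts read these values from the RDF files.

```python
FABIO_EXPRESSION = "http://purl.org/spar/fabio/Expression"

# Properties that a BR may carry at most once (labels are illustrative).
SINGLE_VALUED = ("title", "partOf", "publicationDate", "sequenceIdentifier")


def check_br(entity):
    """Return a list of constraint violations for one bibliographic resource."""
    problems = []
    types = entity.get("types", set())
    if FABIO_EXPRESSION not in types:
        problems.append("missing type fabio:Expression")
    if len(types) > 2:
        problems.append(f"too many types: {len(types)}")
    if not entity.get("identifiers"):
        problems.append("no identifier")
    properties = entity.get("properties", {})
    for name in SINGLE_VALUED:
        if len(properties.get(name, [])) > 1:
            problems.append(f"more than one {name}")
    return problems
```

The RA and ID rules follow the same pattern with different required types and cardinalities.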
Fix queries#
When issues are found with merged entities that still exist or are still referenced, the script generates SPARQL UPDATE queries in the output folder:
fix_queries/
├── update_12345.sparql
├── update_12346.sparql
└── ...
Each query deletes the merged entity’s triples and redirects references to the surviving entity.
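To make the shape of such a fix concrete, the sketch below builds a SPARQL UPDATE string that first redirects references and then deletes the merged entity's own triples. It is an assumption about what a fix could look like, not the query text the scripts actually emit; `build_fix_query` is a hypothetical helper, and named graphs are ignored for brevity.

```python
def _require_plain_iri(uri):
    """Reject strings that could break out of the <...> IRI delimiters."""
    if any(ch in uri for ch in '<>"{}|\\^`') or any(ch.isspace() for ch in uri):
        raise ValueError(f"not a safe IRI: {uri!r}")


def build_fix_query(merged_uri, surviving_uri):
    """Build a two-step SPARQL UPDATE for one merged entity (illustrative only)."""
    _require_plain_iri(merged_uri)
    _require_plain_iri(surviving_uri)
    return (
        # Step 1: point every reference at the surviving entity instead.
        f"DELETE {{ ?s ?p <{merged_uri}> }}\n"
        f"INSERT {{ ?s ?p <{surviving_uri}> }}\n"
        f"WHERE  {{ ?s ?p <{merged_uri}> }} ;\n"
        # Step 2: remove the triples that describe the merged entity itself.
        f"DELETE WHERE {{ <{merged_uri}> ?p ?o }}"
    )
```

For example, `print(build_fix_query("https://example.org/br/2", "https://example.org/br/1"))` shows the two operations separated by a semicolon, as SPARQL 1.1 Update allows.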
Parallel processing#
The scripts use multiprocessing to check entities in parallel. They group entities by file to minimize file I/O (each RDF file is opened once for all entities it contains).
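The grouping idea can be sketched as follows. The mapping from an entity URI to a file name (`file_for_entity`, with 1000 entities per file) is a hypothetical stand-in for whatever layout the RDF directory really uses, and `check_file_group` is a placeholder worker that only counts entities.

```python
import re
from collections import defaultdict
from multiprocessing import Pool


def file_for_entity(uri, items_per_file=1000):
    """Hypothetical mapping from an entity URI to the file that stores it."""
    number = int(re.search(r"(\d+)$", uri).group(1))
    # Entities 1..1000 map to 1000.json, 1001..2000 to 2000.json, and so on.
    bucket = ((number - 1) // items_per_file + 1) * items_per_file
    return f"{bucket}.json"


def group_by_file(entity_uris, items_per_file=1000):
    """Group entity URIs by file so each file is opened only once."""
    groups = defaultdict(list)
    for uri in entity_uris:
        groups[file_for_entity(uri, items_per_file)].append(uri)
    return dict(groups)


def check_file_group(item):
    """Placeholder worker: a real check would load the file once and inspect each entity."""
    file_name, uris = item
    return file_name, len(uris)


def check_all(entity_uris, processes=2):
    """Distribute one task per file across a pool of worker processes."""
    groups = group_by_file(entity_uris)
    with Pool(processes=processes) as pool:
        return dict(pool.map(check_file_group, list(groups.items())))
```

When trying this out, call `check_all(...)` from inside an `if __name__ == "__main__":` block so that worker processes can import the module safely on platforms that spawn rather than fork.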