Verify merge

These scripts verify that merge operations completed correctly by checking the RDF files, the provenance, and the triplestore. When a merged entity still exists or is still referenced, they generate SPARQL UPDATE queries to fix it.

Scripts

Three scripts are available, one for each entity type:

| Script | Entity type |
| --- | --- |
| `check_merged_brs_results.py` | Bibliographic resources (BR) |
| `check_merged_ras_results.py` | Responsible agents (RA) |
| `check_merged_ids_results.py` | Identifiers (ID) |

Usage

uv run python -m oc_meta.run.merge.check_merged_brs_results <CSV_FOLDER> <RDF_DIR> --meta_config <CONFIG> --query_output <OUTPUT_DIR>

| Argument | Description |
| --- | --- |
| `CSV_FOLDER` | Folder containing merge CSV files (with `Done` column) |
| `RDF_DIR` | Path to RDF directory |
| `--meta_config` | Path to meta configuration file |
| `--query_output` | Folder where fix queries will be saved |
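
Only rows whose `Done` column is true are verified. As a rough illustration of that filter (not the scripts' actual code), the sketch below reads every CSV file in a folder and keeps the completed rows. It assumes the value is stored as the text `True` or `False` and that the files use the `.csv` extension; `iter_done_rows` is a hypothetical helper name.

```python
import csv
from pathlib import Path


def iter_done_rows(csv_folder):
    """Yield rows whose Done column is 'True' from every CSV file in csv_folder."""
    for csv_path in sorted(Path(csv_folder).glob("*.csv")):
        with open(csv_path, newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                # Assumption: Done holds the text "True" or "False".
                # A missing or empty value is treated as not done.
                if (row.get("Done") or "").strip().lower() == "true":
                    yield row
```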

Example

uv run python -m oc_meta.run.merge.check_merged_brs_results \
  groups/ \
  /data/rdf \
  --meta_config meta_config.yaml \
  --query_output fix_queries/

What gets checked

For each row marked `Done=True` in the CSV files, the scripts check the following:

RDF files:

Surviving entity exists
Merged entities are deleted
Entity constraints are valid (types, identifiers, required properties)

Provenance:

Correct number of snapshots
Sequential snapshot numbering
Generation and invalidation timestamps
Derivation chain (prov:wasDerivedFrom)
Merge snapshots derived from multiple sources
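
The numbering and derivation checks can be pictured with a small sketch. It is not the scripts' implementation: it assumes snapshot IRIs end in `/prov/se/<number>` and that each snapshot after the first lists its immediate predecessor under `prov:wasDerivedFrom` (a merge snapshot may list additional sources). The function names are hypothetical.

```python
import re

# Assumption: snapshot IRIs end in "/prov/se/<number>".
SNAPSHOT_NUMBER = re.compile(r"/prov/se/(\d+)$")


def snapshot_number(snapshot_iri):
    """Return the trailing snapshot number, or None if the IRI does not match."""
    match = SNAPSHOT_NUMBER.search(snapshot_iri)
    return int(match.group(1)) if match else None


def find_numbering_gaps(snapshot_iris):
    """Return the snapshot numbers missing from the sequence 1..max."""
    numbers = {snapshot_number(iri) for iri in snapshot_iris}
    numbers.discard(None)
    if not numbers:
        return []
    return [n for n in range(1, max(numbers) + 1) if n not in numbers]


def find_broken_derivations(derived_from):
    """Return snapshots that do not derive from their immediate predecessor.

    derived_from maps each snapshot IRI to the set of IRIs it lists under
    prov:wasDerivedFrom. Snapshot 1 is skipped because it has no predecessor.
    """
    broken = []
    for iri, sources in derived_from.items():
        number = snapshot_number(iri)
        if number is None or number == 1:
            continue
        expected = SNAPSHOT_NUMBER.sub(f"/prov/se/{number - 1}", iri)
        if expected not in sources:
            broken.append(iri)
    return sorted(broken)
```

Because a merge snapshot only needs to include its predecessor among its sources, the membership test leaves room for the extra `prov:wasDerivedFrom` links that a merge adds.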

Triplestore (SPARQL):

Surviving entity exists
Merged entities don’t exist
No references to merged entities remain
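
One way to express the three triplestore checks is as SPARQL `ASK` queries, each paired with the answer a clean merge should give. The sketch below only builds the query strings; the helper names are hypothetical, and the queries the scripts actually send may differ (for example, a store that keeps data in named graphs may need a `GRAPH` clause).

```python
def ask_entity_exists(entity_iri):
    """ASK whether the entity is the subject of any triple."""
    return f"ASK {{ <{entity_iri}> ?p ?o }}"


def ask_entity_referenced(entity_iri):
    """ASK whether any triple still points at the entity as its object."""
    return f"ASK {{ ?s ?p <{entity_iri}> }}"


def expected_answers(surviving_iri, merged_iris):
    """Pair each ASK query with the boolean a clean merge should return."""
    checks = [(ask_entity_exists(surviving_iri), True)]
    for merged in merged_iris:
        checks.append((ask_entity_exists(merged), False))
        checks.append((ask_entity_referenced(merged), False))
    return checks
```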

Entity-specific constraints

BR (bibliographic resources):

Must be fabio:Expression
At most two types
At least one identifier
At most one each of: title, partOf, publication date, sequence identifier

RA (responsible agents):

Must be foaf:Agent
Exactly one type
At least one identifier
At least one name property (name, givenName, or familyName)

ID (identifiers):

Must be datacite:Identifier
Exactly one usesIdentifierScheme
Exactly one hasLiteralValue
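
As an illustration of how such constraints translate into code, here is a minimal sketch for the ID rules only. It works on an already-parsed entity rather than on RDF files, uses the short property names from the list above in place of full IRIs, and `check_identifier` is a hypothetical name, not a function from the scripts.

```python
def check_identifier(types, properties):
    """Return constraint violations for one ID entity.

    types is the set of rdf:type values; properties maps a property name to
    the list of values found for it. Short names stand in for full IRIs.
    """
    problems = []
    if "datacite:Identifier" not in types:
        problems.append("missing type datacite:Identifier")
    for name in ("usesIdentifierScheme", "hasLiteralValue"):
        count = len(properties.get(name, []))
        if count != 1:
            problems.append(f"expected exactly one {name}, found {count}")
    return problems
```

An empty list means the entity satisfies all three ID constraints. The BR and RA rules could follow the same pattern with "at most" and "at least" comparisons.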

Fix queries

When a merged entity still exists or is still referenced, the script writes SPARQL UPDATE queries to the `--query_output` folder:

fix_queries/
├── update_12345.sparql
├── update_12346.sparql
└── ...

Each query deletes the merged entity’s triples and redirects references to the surviving entity.
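
The shape of such a query might look like the sketch below. This is an assumption about the form, not a copy of what the scripts emit: it redirects every triple that points at the merged entity, then removes the triples the merged entity is the subject of. It targets the default graph, so a store organised in named graphs would need `GRAPH` or `WITH` clauses. `build_fix_query` is a hypothetical helper.

```python
def build_fix_query(surviving_iri, merged_iri):
    """Build a SPARQL UPDATE that redirects references, then removes the merged entity."""
    return (
        # Step 1: repoint incoming references to the surviving entity.
        f"DELETE {{ ?s ?p <{merged_iri}> }}\n"
        f"INSERT {{ ?s ?p <{surviving_iri}> }}\n"
        f"WHERE {{ ?s ?p <{merged_iri}> }} ;\n"
        # Step 2: drop every triple whose subject is the merged entity.
        f"DELETE WHERE {{ <{merged_iri}> ?p ?o }}"
    )
```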

Parallel processing

The scripts use multiprocessing to check entities in parallel. They group entities by file to minimize file I/O (each RDF file is opened once for all entities it contains).
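
The group-by-file idea can be sketched as follows. This is a simplified illustration, not the scripts' code: `file_for_entity` is a hypothetical callable that maps an entity to the RDF file storing it, and `check_file` is a placeholder worker that only counts entities where a real one would open the file once and check each entity in it.

```python
from collections import defaultdict
from multiprocessing import Pool


def group_by_file(entities, file_for_entity):
    """Map each RDF file path to the entities stored in it."""
    groups = defaultdict(list)
    for entity in entities:
        groups[file_for_entity(entity)].append(entity)
    return dict(groups)


def check_file(item):
    """Placeholder worker: receives (file_path, entities) and returns a report.

    A real worker would open file_path once and check every entity in it.
    """
    file_path, entities = item
    return file_path, len(entities)


def check_in_parallel(groups, workers=4):
    """Submit each file group as one task to a pool of worker processes."""
    with Pool(processes=workers) as pool:
        return dict(pool.imap_unordered(check_file, groups.items()))
```

On platforms where `multiprocessing` starts workers with the spawn method, `check_in_parallel` should be called from under an `if __name__ == "__main__":` guard.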