Verify merge

These scripts verify that merge operations completed correctly by checking RDF files, provenance, and the triplestore. If issues are found, they generate SPARQL queries to fix them.

Scripts

Three scripts are available, one for each entity type:

Script	Entity type
`check_merged_brs_results.py`	Bibliographic resources (BR)
`check_merged_ras_results.py`	Responsible agents (RA)
`check_merged_ids_results.py`	Identifiers (ID)

Usage

uv run python -m oc_meta.run.merge.check_merged_brs_results <CSV_FOLDER> <RDF_DIR> --meta_config <CONFIG> --query_output <OUTPUT_DIR>

Argument	Description
`CSV_FOLDER`	Folder containing merge CSV files (with `Done` column)
`RDF_DIR`	Path to RDF directory
`--meta_config`	Path to meta configuration file
`--query_output`	Folder where fix queries will be saved

Example

uv run python -m oc_meta.run.merge.check_merged_brs_results \
  groups/ \
  /data/rdf \
  --meta_config meta_config.yaml \
  --query_output fix_queries/

What gets checked

For each row marked as Done=True in the CSV files:

RDF files:

Surviving entity exists
Merged entities are deleted
Entity constraints are valid (types, identifiers, required properties)

Provenance:

Correct number of snapshots
Sequential snapshot numbering
Generation and invalidation timestamps
Derivation chain (prov:wasDerivedFrom)
Merge snapshots derived from multiple sources

Triplestore (SPARQL):

Surviving entity exists
Merged entities don’t exist
No references to merged entities remain

Entity-specific constraints

BR (bibliographic resources):

Must be fabio:Expression
At most two types
At least one identifier
At most one title, partOf, publication date, sequence identifier

RA (responsible agents):

Must be foaf:Agent
Exactly one type
At least one identifier
At least one name property (name, givenName, or familyName)

ID (identifiers):

Must be datacite:Identifier
Exactly one usesIdentifierScheme
Exactly one hasLiteralValue

Fix queries

When issues are found with merged entities that still exist or are still referenced, the script generates SPARQL UPDATE queries in the output folder:

fix_queries/
├── update_12345.sparql
├── update_12346.sparql
└── ...

Each query deletes the merged entity’s triples and redirects references to the surviving entity.

Parallel processing

The scripts use multiprocessing to check entities in parallel. They group entities by file to minimize file I/O (each RDF file is opened once for all entities it contains).