
Verify merge

These scripts verify that merge operations completed correctly by checking RDF files, provenance, and the triplestore. If issues are found, they generate SPARQL queries to fix them.

Three scripts are available, one for each entity type:

Script                        Entity type
check_merged_brs_results.py   Bibliographic resources (BR)
check_merged_ras_results.py   Responsible agents (RA)
check_merged_ids_results.py   Identifiers (ID)

uv run python -m oc_meta.run.merge.check_merged_brs_results <CSV_FOLDER> <RDF_DIR> --meta_config <CONFIG> --query_output <OUTPUT_DIR>
Argument         Description
CSV_FOLDER       Folder containing merge CSV files (with Done column)
RDF_DIR          Path to RDF directory
--meta_config    Path to meta configuration file
--query_output   Folder where fix queries will be saved

uv run python -m oc_meta.run.merge.check_merged_brs_results \
groups/ \
/data/rdf \
--meta_config meta_config.yaml \
--query_output fix_queries/
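
The Done column mentioned in the argument description determines which rows are verified. As a rough illustration only (not the scripts' actual code), the row selection could look like the sketch below. The column names `surviving_entity` and `merged_entities` are assumptions made for the example; only `Done` is named on this page.

```python
import csv
from pathlib import Path


def done_rows(lines):
    """Return the rows whose Done column reads "True" (case-insensitive).

    `lines` is any iterable of CSV text lines, header first.
    """
    return [
        row
        for row in csv.DictReader(lines)
        if (row.get("Done") or "").strip().lower() == "true"
    ]


def done_rows_in_folder(csv_folder):
    """Yield the completed rows from every *.csv file in a folder."""
    for csv_path in sorted(Path(csv_folder).glob("*.csv")):
        with csv_path.open(newline="", encoding="utf-8") as handle:
            yield from done_rows(handle)


# Column names other than Done are hypothetical.
sample = [
    "surviving_entity,merged_entities,Done",
    "https://example.org/br/1,https://example.org/br/2,True",
    "https://example.org/br/3,https://example.org/br/4,False",
]
completed = done_rows(sample)
```

Rows that are not marked as done are skipped, so a partially processed CSV folder can still be checked.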

For each row marked Done=True in the CSV files, the scripts run the following checks:

RDF files:

  • Surviving entity exists
  • Merged entities are deleted
  • Entity constraints are valid (types, identifiers, required properties)

Provenance:

  • Correct number of snapshots
  • Sequential snapshot numbering
  • Generation and invalidation timestamps
  • Derivation chain (prov:wasDerivedFrom)
  • Merge snapshots derived from multiple sources
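
The numbering and derivation checks above can be sketched in a few lines. This is a simplified illustration, not the scripts' implementation: it assumes snapshot URIs end in `/prov/se/<number>` and that numbering starts at 1, and it treats a snapshot as a merge snapshot when it has more than one distinct `prov:wasDerivedFrom` source.

```python
import re

# Assumed URI shape: <entity>/prov/se/<number>
SNAPSHOT_RE = re.compile(r"/prov/se/(\d+)$")


def snapshot_numbers(snapshot_uris):
    """Extract and sort the numeric suffix of each snapshot URI."""
    numbers = []
    for uri in snapshot_uris:
        match = SNAPSHOT_RE.search(uri)
        if match is None:
            raise ValueError(f"not a snapshot URI: {uri}")
        numbers.append(int(match.group(1)))
    return sorted(numbers)


def missing_snapshots(snapshot_uris):
    """Return the snapshot numbers absent from the sequence 1..max."""
    numbers = snapshot_numbers(snapshot_uris)
    if not numbers:
        return []
    expected = set(range(1, max(numbers) + 1))
    return sorted(expected - set(numbers))


def is_merge_snapshot(derived_from):
    """A merge snapshot is derived from more than one distinct source."""
    return len(set(derived_from)) > 1
```

An empty result from `missing_snapshots` means the numbering is sequential; any number it returns points to a gap in the provenance chain.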

Triplestore (SPARQL):

  • Surviving entity exists
  • Merged entities don’t exist
  • No references to merged entities remain
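
The three triplestore checks map naturally onto SPARQL ASK queries. The sketch below is an assumption about how such checks could be phrased, not the queries the scripts actually send. The `ask` argument is a hypothetical callable that runs a query against the endpoint and returns a boolean, so the logic can be shown without committing to a specific client library.

```python
def ask_exists(uri):
    """ASK whether the entity appears as the subject of any triple."""
    return f"ASK {{ <{uri}> ?p ?o }}"


def ask_referenced(uri):
    """ASK whether any triple still points at the entity."""
    return f"ASK {{ ?s ?p <{uri}> }}"


def triplestore_issues(surviving, merged, ask):
    """Collect human-readable problems for one merge row.

    `ask` is a caller-supplied function: query string in, bool out.
    """
    issues = []
    if not ask(ask_exists(surviving)):
        issues.append(f"surviving entity missing: {surviving}")
    for uri in merged:
        if ask(ask_exists(uri)):
            issues.append(f"merged entity still exists: {uri}")
        if ask(ask_referenced(uri)):
            issues.append(f"merged entity still referenced: {uri}")
    return issues
```

The URIs are interpolated directly into the query text, which is only reasonable when they come from a trusted source such as the merge CSV files.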

BR (bibliographic resources):

  • Must be fabio:Expression
  • At most two types
  • At least one identifier
  • At most one each of: title, partOf, publication date, sequence identifier

RA (responsible agents):

  • Must be foaf:Agent
  • Exactly one type
  • At least one identifier
  • At least one name property (name, givenName, or familyName)

ID (identifiers):

  • Must be datacite:Identifier
  • Exactly one usesIdentifierScheme
  • Exactly one hasLiteralValue
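
The constraints for the three entity types can be read as a small table of cardinality rules. The sketch below expresses them that way, purely as an illustration. The prefixed property names (for example `dcterms:title` or `literal:hasLiteralValue`) and the dictionary representation of an entity are assumptions for the example; this page only gives the local names.

```python
# Cardinality is (minimum, maximum); None means unbounded.
RULES = {
    "br": {
        "type": "fabio:Expression",
        "cardinality": {
            "rdf:type": (1, 2),
            "datacite:hasIdentifier": (1, None),
            "dcterms:title": (0, 1),
            "frbr:partOf": (0, 1),
            "prism:publicationDate": (0, 1),
            "fabio:hasSequenceIdentifier": (0, 1),
        },
        "any_of": [],
    },
    "ra": {
        "type": "foaf:Agent",
        "cardinality": {
            "rdf:type": (1, 1),
            "datacite:hasIdentifier": (1, None),
        },
        "any_of": ["foaf:name", "foaf:givenName", "foaf:familyName"],
    },
    "id": {
        "type": "datacite:Identifier",
        "cardinality": {
            "datacite:usesIdentifierScheme": (1, 1),
            "literal:hasLiteralValue": (1, 1),
        },
        "any_of": [],
    },
}


def check_entity(kind, props):
    """Return constraint violations for an entity.

    `props` maps a property name to the list of its values.
    """
    rule = RULES[kind]
    errors = []
    if rule["type"] not in props.get("rdf:type", []):
        errors.append(f"missing required type {rule['type']}")
    for prop, (low, high) in rule["cardinality"].items():
        count = len(props.get(prop, []))
        if count < low:
            errors.append(f"{prop}: expected at least {low}, found {count}")
        elif high is not None and count > high:
            errors.append(f"{prop}: expected at most {high}, found {count}")
    if rule["any_of"] and not any(props.get(p) for p in rule["any_of"]):
        errors.append("missing one of: " + ", ".join(rule["any_of"]))
    return errors
```

Keeping the rules in data rather than in branching code makes it easy to compare them line by line with the lists above.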

When a merged entity still exists or is still referenced, the scripts write SPARQL UPDATE queries to the output folder:

fix_queries/
├── update_12345.sparql
├── update_12346.sparql
└── ...

Each query deletes the merged entity’s triples and redirects references to the surviving entity.
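
A query with that shape could look like the following sketch. It is an assumption about the general form, not a copy of what the scripts emit. In particular, it ignores named graphs (a store that keeps data in named graphs would need GRAPH clauses), and the `query_id` used in the file name is a hypothetical parameter, since this page does not say what the number in `update_12345.sparql` refers to.

```python
from pathlib import Path


def build_fix_query(surviving, merged):
    """Build a two-step SPARQL UPDATE for one leftover merged entity."""
    return (
        # Step 1: remove every triple that has the merged entity as subject.
        f"DELETE {{ <{merged}> ?p ?o }} WHERE {{ <{merged}> ?p ?o }};\n"
        # Step 2: repoint incoming references to the surviving entity.
        f"DELETE {{ ?s ?p <{merged}> }} "
        f"INSERT {{ ?s ?p <{surviving}> }} "
        f"WHERE {{ ?s ?p <{merged}> }}\n"
    )


def write_fix_query(output_dir, query_id, query):
    """Save the query as update_<query_id>.sparql and return its path."""
    folder = Path(output_dir)
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"update_{query_id}.sparql"
    path.write_text(query, encoding="utf-8")
    return path
```

Generated queries are worth reviewing before they are run against the triplestore, since they modify data in place.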

The scripts use multiprocessing to check entities in parallel. They group entities by file to minimize file I/O (each RDF file is opened once for all entities it contains).
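
The grouping idea can be illustrated with a short sketch. It is not the scripts' code: `toy_locate` is a placeholder that maps a URI to a made-up file name (the real directory layout is defined by the meta configuration), and `check_file` only counts entities where a real worker would open the file and inspect each one.

```python
from collections import defaultdict
from multiprocessing import Pool


def toy_locate(uri):
    """Placeholder mapping from entity URI to file name (1000 per file)."""
    number = int(uri.rsplit("/", 1)[1])
    return f"file_{(number - 1) // 1000 + 1}.json"


def group_by_file(uris, locate):
    """Group entity URIs by the file that contains them."""
    groups = defaultdict(list)
    for uri in uris:
        groups[locate(uri)].append(uri)
    return dict(groups)


def check_file(item):
    """Worker: handle all entities of one file in a single pass."""
    path, uris = item
    # A real worker would open `path` once here and check every URI in it.
    return path, len(uris)


def check_in_parallel(groups, workers=2):
    """Distribute the per-file work across a pool of processes."""
    with Pool(processes=workers) as pool:
        return dict(pool.map(check_file, list(groups.items())))
```

Because each task is a whole file rather than a single entity, no file is parsed twice. When calling `check_in_parallel`, place the call under an `if __name__ == "__main__":` guard so it also works on platforms that spawn worker processes.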