Group entities

The grouping script analyzes merge instructions and groups related entities together. This enables parallel processing without conflicts.

uv run python -m oc_meta.run.merge.group_entities <CSV_FILE> <OUTPUT_DIR> <META_CONFIG> [OPTIONS]
Argument      Description
CSV_FILE      CSV file with merge instructions (from find duplicates)
OUTPUT_DIR    Directory for grouped CSV files
META_CONFIG   Path to the Meta configuration file
Option             Default   Description
--min_group_size   50        Minimum number of entities per group
uv run python -m oc_meta.run.merge.group_entities \
duplicates.csv \
groups/ \
meta_config.yaml \
--min_group_size 100

The script queries the SPARQL endpoint to find all entities related to those being merged:

  • Author/editor references
  • Publisher references
  • Venue containment
  • Identifier assignments

Uses a Union-Find (disjoint set) algorithm to group entities that share relationships. If A is related to B, and B is related to C, then A, B, and C end up in the same group.
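The transitive grouping described above can be sketched with a minimal Union-Find structure. This is an illustrative implementation, not the script's actual code; entity names are placeholders.

```python
class UnionFind:
    """Disjoint-set structure: entities that share any relationship
    end up under the same root, forming one group."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        # Each new entity starts as its own root
        self.parent.setdefault(x, x)
        if self.parent[x] != x:
            # Path compression: point directly at the root
            self.parent[x] = self.find(self.parent[x])
        return self.parent[x]

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


uf = UnionFind()
uf.union("A", "B")  # A is related to B
uf.union("B", "C")  # B is related to C
# A, B, and C now share one root, so they land in the same group
```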

Entities sharing the same RDF file path are grouped together. The script calculates file paths from OMIDs using the config settings (supplier_prefix, dir_split_number, items_per_file).

For example, with dir_split_number 10000 and items_per_file 1000, these entities all map to the file br/060/10000/1000.zip:

  • br/060/1
  • br/060/500
  • br/060/999
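A simplified sketch of how such a path can be derived from an OMID (assuming the directory component rounds the sequential number up to the next multiple of dir_split_number, and the file component to the next multiple of items_per_file; this is not the library's actual function, and the non-zip extension is an assumption):

```python
import math


def file_path(omid, dir_split_number=10000, items_per_file=1000,
              zip_output_rdf=True):
    """Map an OMID like 'br/060/1' (type / supplier prefix / sequential
    number) to its RDF file path."""
    entity_type, supplier_prefix, seq = omid.split("/")
    n = int(seq)
    # Round the sequential number up to the enclosing directory and file
    dir_part = math.ceil(n / dir_split_number) * dir_split_number
    file_part = math.ceil(n / items_per_file) * items_per_file
    ext = "zip" if zip_output_rdf else "json"  # extension assumed
    return f"{entity_type}/{supplier_prefix}/{dir_part}/{file_part}.{ext}"
```

With the defaults above, br/060/1, br/060/500, and br/060/999 all resolve to br/060/10000/1000.zip, matching the example.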

Small independent groups are combined until they reach min_group_size. Large interconnected groups are kept separate. The goal is balanced worker loads.
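One way the combining step can work is a simple greedy pass: independent groups below the threshold are pooled until the pool reaches min_group_size, while groups already at or above it pass through unchanged. This is an illustrative sketch, not the script's actual balancing logic.

```python
def balance_groups(groups, min_group_size=50):
    """groups: list of lists of entity IDs (independent groups).
    Small groups are pooled until they reach min_group_size;
    large interconnected groups are kept separate."""
    merged, pool = [], []
    for g in sorted(groups, key=len):
        if len(g) >= min_group_size:
            merged.append(g)          # already big enough: keep as-is
        else:
            pool.extend(g)            # pool small groups together
            if len(pool) >= min_group_size:
                merged.append(pool)
                pool = []
    if pool:                          # leftover entities form a final group
        merged.append(pool)
    return merged
```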

The script creates multiple CSV files in the output directory:

groups/
├── group_0001.csv
├── group_0002.csv
├── group_0003.csv
└── ...

Each file contains merge instructions for related entities that should be processed together.
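For illustration, a group file might look like the following; the column names here are hypothetical, since the exact format comes from the find-duplicates step:

```
surviving_entity,merged_entities
omid:ra/0601,omid:ra/0602; omid:ra/0603
omid:ra/0610,omid:ra/0611
```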

From meta_config.yaml:

Setting            Purpose
triplestore_url    Query entity relationships
supplier_prefix    Calculate file paths
dir_split_number   Calculate directory structure
items_per_file     Calculate file assignments
zip_output_rdf     Determine file extensions
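Putting those settings together, the relevant portion of meta_config.yaml might look like this (values are illustrative, not defaults):

```yaml
# meta_config.yaml — fields read by the grouping script
triplestore_url: "http://localhost:8890/sparql"
supplier_prefix: "060"
dir_split_number: 10000
items_per_file: 1000
zip_output_rdf: true
```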