Group entities

The grouping script analyzes merge instructions and places related entities in the same group. Because entities that share relationships or RDF files are kept together, separate groups can be processed in parallel without conflicts.

Usage

uv run python -m oc_meta.run.merge.group_entities <CSV_FILE> <OUTPUT_DIR> <META_CONFIG> [OPTIONS]

Argument	Description
`CSV_FILE`	CSV file with merge instructions (from find duplicates)
`OUTPUT_DIR`	Directory for grouped CSV files
`META_CONFIG`	Path to Meta config file

Option	Default	Description
`--min_group_size`	50	Minimum entities per group

Example

uv run python -m oc_meta.run.merge.group_entities \
  duplicates.csv \
  groups/ \
  meta_config.yaml \
  --min_group_size 100

What the script does

1. Identifies relationships

Queries the SPARQL endpoint to find all entities related to those being merged:

Author/editor references
Publisher references
Venue containment
Identifier assignments
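
The exact queries the script sends are not shown on this page. As a rough sketch only, a generic query that finds every entity linked to or from a given entity could be built as below. The function name `build_related_entities_query` and the example URI are hypothetical and are not part of oc_meta.

```python
def build_related_entities_query(entity_uri):
    """Build a generic SPARQL query for entities linked to or from entity_uri.

    Illustration only: the real script may use specific predicates
    for authors, editors, publishers, venues and identifiers.
    """
    return f"""
SELECT DISTINCT ?related WHERE {{
  {{ ?related ?p <{entity_uri}> . }}
  UNION
  {{ <{entity_uri}> ?p ?related . FILTER(isIRI(?related)) }}
}}
""".strip()


if __name__ == "__main__":
    # Placeholder URI, not a real OMID.
    print(build_related_entities_query("https://example.org/br/1"))
```

The first branch collects entities that point at the target, the second collects IRIs the target points at. Literals are filtered out because they cannot be merge candidates.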

2. Groups by RDF connections

Uses a Union-Find (disjoint set) algorithm to group entities that share relationships. If A is related to B, and B is related to C, then A, B, and C end up in the same group.
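
The transitive grouping described above can be sketched with a minimal disjoint-set structure. This is an illustration of the technique, not the script's own code. The names `UnionFind` and `group_related` are hypothetical, and the input is assumed to be a list of (entity, related entity) pairs.

```python
class UnionFind:
    """Minimal disjoint-set structure with path compression."""

    def __init__(self):
        self.parent = {}

    def find(self, item):
        # Register unseen items as their own root.
        self.parent.setdefault(item, item)
        root = item
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited node straight at the root.
        while self.parent[item] != root:
            next_item = self.parent[item]
            self.parent[item] = root
            item = next_item
        return root

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def group_related(pairs):
    """Return the connected components implied by (entity, related) pairs."""
    uf = UnionFind()
    for a, b in pairs:
        uf.union(a, b)
    groups = {}
    for item in list(uf.parent):
        groups.setdefault(uf.find(item), set()).add(item)
    return list(groups.values())


if __name__ == "__main__":
    # A-B and B-C are related, so A, B and C share one group; D-E form another.
    print(group_related([("A", "B"), ("B", "C"), ("D", "E")]))
```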

3. Groups by file range

Entities sharing the same RDF file path are grouped together. The script calculates file paths from OMIDs using the config settings (`supplier_prefix`, `dir_split_number`, `items_per_file`).

For example, these entities share the file br/060/10000/1000.zip:

br/060/1
br/060/500
br/060/999
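
One way to reproduce the example above is to round the entity number up to the next multiple of each bucket size. The sketch below assumes `dir_split_number` is 10000 and `items_per_file` is 1000 (values inferred from the example path, not stated on this page), and assumes a `.json` extension when `zip_output_rdf` is false. The function name `rdf_file_path` is hypothetical, and the real script may compute paths differently.

```python
def rdf_file_path(entity, dir_split_number, items_per_file, zip_output_rdf=True):
    """Compute the RDF file path for an entity written as 'br/060/1'."""
    short_name, supplier_prefix, number_text = entity.split("/")
    number = int(number_text)
    # Round up to the next multiple of each bucket size (integer arithmetic).
    dir_bucket = ((number - 1) // dir_split_number + 1) * dir_split_number
    file_bucket = ((number - 1) // items_per_file + 1) * items_per_file
    # Assumption: uncompressed output uses a .json extension.
    extension = "zip" if zip_output_rdf else "json"
    return f"{short_name}/{supplier_prefix}/{dir_bucket}/{file_bucket}.{extension}"


if __name__ == "__main__":
    for entity in ("br/060/1", "br/060/500", "br/060/999"):
        # Each of the three entities maps to br/060/10000/1000.zip
        print(entity, "->", rdf_file_path(entity, 10000, 1000))
```

Under these assumptions, br/060/1001 would map to a different file (br/060/10000/2000.zip), so it would not be grouped with the three entities above on file range alone.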

4. Balances workloads

Small independent groups are combined until they reach `min_group_size` entities. Large interconnected groups are kept separate. The goal is to give each worker a load of comparable size.
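
The balancing step can be pictured with a simple greedy pass. This is a sketch under the assumption that the input groups are already independent of each other (no shared relationships or files). The function name `balance_groups` is hypothetical, and the script's actual strategy may differ.

```python
def balance_groups(groups, min_group_size):
    """Merge small groups into batches of at least min_group_size entities.

    Groups that already meet the threshold are kept as they are.
    """
    balanced = []
    current = []
    # Visit larger groups first so oversized ones are emitted untouched.
    for group in sorted(groups, key=len, reverse=True):
        if len(group) >= min_group_size:
            balanced.append(list(group))
            continue
        current.extend(group)
        if len(current) >= min_group_size:
            balanced.append(current)
            current = []
    # Whatever is left forms a final, possibly undersized, batch.
    if current:
        balanced.append(current)
    return balanced


if __name__ == "__main__":
    groups = [list(range(120)), ["a", "b", "c"], ["d", "e"], ["f"]]
    print([len(batch) for batch in balance_groups(groups, min_group_size=5)])
```

Note that the last batch can fall below the threshold when there are not enough small groups left to fill it.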

Output

The script creates multiple CSV files in the output directory:

groups/
├── group_0001.csv
├── group_0002.csv
├── group_0003.csv
└── ...

Each file contains merge instructions for related entities that should be processed together.
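
To check how evenly the groups came out, the output directory can be inspected with a few lines of Python. This sketch assumes each group file starts with a header row and follows the `group_*.csv` naming shown above. The helper `group_sizes` is hypothetical and not part of oc_meta.

```python
import csv
from pathlib import Path


def group_sizes(output_dir):
    """Map each group file name to its number of data rows (header excluded)."""
    sizes = {}
    for path in sorted(Path(output_dir).glob("group_*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle)
            next(reader, None)  # skip the assumed header row
            sizes[path.name] = sum(1 for _ in reader)
    return sizes


if __name__ == "__main__":
    for name, rows in group_sizes("groups").items():
        print(f"{name}: {rows} merge instructions")
```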

Config settings used

From meta_config.yaml:

Setting	Purpose
`triplestore_url`	Query entity relationships
`supplier_prefix`	Calculate file paths
`dir_split_number`	Calculate directory structure
`items_per_file`	Calculate file assignments
`zip_output_rdf`	Determine file extensions