Group entities

The grouping script analyzes merge instructions and places related entities in the same group. Groups can then be processed in parallel without workers conflicting over shared entities or files.

Usage

uv run python -m oc_meta.run.merge.group_entities <CSV_FILE> <OUTPUT_DIR> <META_CONFIG> [OPTIONS]

| Argument | Description |
| --- | --- |
| CSV_FILE | CSV file with merge instructions (from the find duplicates step) |
| OUTPUT_DIR | Directory for grouped CSV files |
| META_CONFIG | Path to Meta config file |

| Option | Default | Description |
| --- | --- | --- |
| --min_group_size | 50 | Minimum entities per group |

Example

uv run python -m oc_meta.run.merge.group_entities \
  duplicates.csv \
  groups/ \
  meta_config.yaml \
  --min_group_size 100

What the script does

1. Identifies relationships

Queries the SPARQL endpoint to find all entities related to those being merged:

  • Author/editor references

  • Publisher references

  • Venue containment

  • Identifier assignments

2. Groups by RDF connections

Uses a Union-Find (disjoint set) algorithm to group entities that share relationships. If A is related to B, and B is related to C, then A, B, and C end up in the same group.
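The transitive grouping described above can be illustrated with a minimal Union-Find sketch. This is not the script's actual code: the `UnionFind` class, the `group_related` helper, and the entity labels are hypothetical names chosen for illustration, and the input is assumed to be a list of (entity, related entity) pairs.

```python
class UnionFind:
    """Disjoint-set structure with path compression (illustrative only)."""

    def __init__(self):
        self.parent = {}

    def find(self, item):
        # Register unseen items as their own root.
        self.parent.setdefault(item, item)
        root = item
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited node directly at the root.
        while self.parent[item] != root:
            self.parent[item], item = root, self.parent[item]
        return root

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def group_related(pairs):
    """Return a list of sets, one per connected group of entities."""
    uf = UnionFind()
    for a, b in pairs:
        uf.union(a, b)
    groups = {}
    for item in list(uf.parent):
        groups.setdefault(uf.find(item), set()).add(item)
    return list(groups.values())


# A is related to B and B to C, so A, B and C share a group; D and E form another.
groups = group_related([("A", "B"), ("B", "C"), ("D", "E")])
```

Because each `union` call only links two roots, the grouping does not depend on the order in which relationships are discovered.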

3. Groups by file range

Entities sharing the same RDF file path are grouped together. The script calculates file paths from OMIDs using the config settings (supplier_prefix, dir_split_number, items_per_file).

For example, these entities all map to the same file, br/060/10000/1000.zip:

  • br/060/1

  • br/060/500

  • br/060/999
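The mapping from entity to file can be sketched as follows. This is an assumption-laden illustration rather than the script's implementation: the function name `rdf_file_path` is hypothetical, the defaults `items_per_file=1000` and `dir_split_number=10000` are inferred from the example path above, the entity notation follows the list above, and the `.json` extension for unzipped output is assumed.

```python
def rdf_file_path(entity, items_per_file=1000, dir_split_number=10000,
                  zip_output_rdf=True):
    """Map an entity such as 'br/060/500' to the RDF file assumed to hold it."""
    short_name, supplier_prefix, number = entity.split("/")
    number = int(number)
    # Round the entity number up to the upper bound of its directory range
    # and of its file range (integer arithmetic avoids float rounding).
    dir_bound = ((number - 1) // dir_split_number + 1) * dir_split_number
    file_bound = ((number - 1) // items_per_file + 1) * items_per_file
    # Assumption: unzipped output uses a .json extension.
    extension = "zip" if zip_output_rdf else "json"
    return f"{short_name}/{supplier_prefix}/{dir_bound}/{file_bound}.{extension}"


paths = {rdf_file_path(e) for e in ("br/060/1", "br/060/500", "br/060/999")}
```

Under these assumptions the three entities resolve to a single path, which is why they must be handled by the same worker.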

4. Balances workloads

Small independent groups are combined until they reach min_group_size entities, while large interconnected groups are kept separate. The goal is to give each worker a comparable load.
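One simple way to picture this step is a greedy pass that accumulates small groups into a batch until the threshold is met. The sketch below is hypothetical (`balance_groups` is not a function from the script) and assumes the input groups are already independent of one another, so merging them cannot introduce conflicts.

```python
def balance_groups(groups, min_group_size=50):
    """Combine groups smaller than min_group_size; keep larger ones unchanged."""
    balanced = []
    pending = []  # entities from small groups accumulated so far
    for group in groups:
        if len(group) >= min_group_size:
            # Large groups are already a full unit of work.
            balanced.append(list(group))
            continue
        pending.extend(group)
        if len(pending) >= min_group_size:
            balanced.append(pending)
            pending = []
    if pending:
        # Leftover small groups form one final, possibly undersized, batch.
        balanced.append(pending)
    return balanced


sizes = [len(g) for g in balance_groups(
    [list(range(5)), list(range(100, 160)), list(range(10, 15)), [20]],
    min_group_size=10,
)]
```

In this sketch a large group is never split, so a single heavily interconnected group can still exceed the size of the other batches.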

Output

The script creates multiple CSV files in the output directory:

groups/
├── group_0001.csv
├── group_0002.csv
├── group_0003.csv
└── ...

Each file contains merge instructions for related entities that should be processed together.

Config settings used

From meta_config.yaml:

| Setting | Purpose |
| --- | --- |
| triplestore_url | Query entity relationships |
| supplier_prefix | Calculate file paths |
| dir_split_number | Calculate directory structure |
| items_per_file | Calculate file assignments |
| zip_output_rdf | Determine file extensions |