Group entities
The grouping script analyzes merge instructions and groups related entities together. This enables parallel processing without conflicts.

    uv run python -m oc_meta.run.merge.group_entities <CSV_FILE> <OUTPUT_DIR> <META_CONFIG> [OPTIONS]

| Argument | Description |
|---|---|
| `CSV_FILE` | CSV file with merge instructions (from find duplicates) |
| `OUTPUT_DIR` | Directory for grouped CSV files |
| `META_CONFIG` | Path to Meta config file |

| Option | Default | Description |
|---|---|---|
| `--min_group_size` | 50 | Minimum entities per group |

Example

    uv run python -m oc_meta.run.merge.group_entities \
        duplicates.csv \
        groups/ \
        meta_config.yaml \
        --min_group_size 100

What the script does
1. Identifies relationships
Queries the SPARQL endpoint to find all entities related to those being merged:
- Author/editor references
- Publisher references
- Venue containment
- Identifier assignments
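
As a rough illustration of this step, the sketch below builds a generic SPARQL query that returns every subject pointing at any of the entities being merged. It is not the script's actual query: the helper name `build_related_query`, the `example.org` URIs, and the choice to match any predicate rather than specific ones are all assumptions made for illustration.

```python
def build_related_query(entity_uris):
    """Build a SPARQL query matching every triple whose object is one of the URIs.

    Hypothetical helper: the real script may restrict the predicates it follows.
    """
    # VALUES binds ?entity to each URI in turn.
    values = " ".join(f"<{uri}>" for uri in entity_uris)
    return (
        "SELECT ?subject ?predicate ?entity WHERE { "
        f"VALUES ?entity {{ {values} }} "
        "?subject ?predicate ?entity . }"
    )


# Illustrative URIs only, not real Meta identifiers.
query = build_related_query([
    "https://example.org/br/060/1",
    "https://example.org/br/060/500",
])
print(query)
```

Batching the URIs into one `VALUES` clause keeps the number of round trips to the endpoint low, which matters when the merge list is long.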
2. Groups by RDF connections
Uses a Union-Find (disjoint-set) algorithm to group entities that share relationships. If A is related to B, and B is related to C, then A, B, and C end up in the same group.
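
The transitive grouping described above can be sketched with a small disjoint-set structure. This is a minimal illustration of the technique, not the script's own code: the class `UnionFind`, the function `group_related`, and the input format (pairs of related entity names) are assumptions.

```python
class UnionFind:
    """Minimal disjoint-set structure with path compression."""

    def __init__(self):
        self.parent = {}

    def find(self, item):
        # Register unseen items as their own root.
        self.parent.setdefault(item, item)
        root = item
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited node directly at the root.
        while self.parent[item] != root:
            next_item = self.parent[item]
            self.parent[item] = root
            item = next_item
        return root

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def group_related(pairs):
    """Return the connected groups implied by (entity, related_entity) pairs."""
    uf = UnionFind()
    for a, b in pairs:
        uf.union(a, b)
    groups = {}
    for item in list(uf.parent):
        groups.setdefault(uf.find(item), set()).add(item)
    return list(groups.values())


# A-B and B-C are linked, so A, B, and C land in one group; D and E form another.
groups = group_related([("A", "B"), ("B", "C"), ("D", "E")])
print(groups)
```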
3. Groups by file range
Entities sharing the same RDF file path are grouped together. The script calculates file paths from OMIDs using the config settings (supplier_prefix, dir_split_number, items_per_file).
For example, these entities share the file br/060/10000/1000.zip:

    br/060/1
    br/060/500
    br/060/999
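
A possible way to reproduce this mapping is to round each entity number up to the next multiple of the bucket sizes. The sketch below assumes `dir_split_number = 10000` and `items_per_file = 1000` (values inferred from the example path, not read from a real config), assumes round-up bucketing, and assumes that unzipped files use a `.json` extension. The function name `rdf_file_path` is hypothetical.

```python
def round_up(number, size):
    """Round number up to the nearest multiple of size (integer arithmetic)."""
    return ((number + size - 1) // size) * size


def rdf_file_path(short_name, supplier_prefix, number,
                  dir_split_number=10000, items_per_file=1000,
                  zip_output_rdf=True):
    """Map an entity to the file assumed to store it.

    Assumed layout: <short_name>/<supplier_prefix>/<dir bucket>/<file bucket>.<ext>
    """
    dir_bucket = round_up(number, dir_split_number)
    file_bucket = round_up(number, items_per_file)
    # Assumption: files are JSON when zip_output_rdf is disabled.
    extension = "zip" if zip_output_rdf else "json"
    return f"{short_name}/{supplier_prefix}/{dir_bucket}/{file_bucket}.{extension}"


# The three entities from the example collapse to a single path.
paths = {rdf_file_path("br", "060", n) for n in (1, 500, 999)}
print(paths)
```

Under these assumptions, entity number 1001 would fall into the next file, `br/060/10000/2000.zip`, and so could be handled by a different worker.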
4. Balances workloads
Small independent groups are combined until they reach min_group_size. Large interconnected groups are kept separate. The goal is balanced worker loads.
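
One simple way to picture this balancing is a greedy pass that accumulates small groups into a batch until the batch reaches the threshold. The sketch below is an assumption about the general approach, not the script's actual logic: the function name `balance_groups` is hypothetical, group size is measured as a plain entity count, and the small groups are assumed to be independent (no shared relationships or files).

```python
def balance_groups(groups, min_group_size=50):
    """Greedily merge small groups until each batch has at least min_group_size items."""
    balanced = []
    pending = []
    for group in groups:
        if len(group) >= min_group_size:
            # Large group: keep it as its own batch.
            balanced.append(list(group))
            continue
        pending.extend(group)
        if len(pending) >= min_group_size:
            balanced.append(pending)
            pending = []
    if pending:
        # Leftover entities that never reached the threshold.
        balanced.append(pending)
    return balanced


batches = balance_groups(
    [["a"] * 30, ["b"] * 120, ["c"] * 25, ["d"] * 10],
    min_group_size=50,
)
sizes = [len(batch) for batch in batches]
print(sizes)  # [120, 55, 10]
```

The last batch can stay below the threshold when there are not enough small groups left to fill it.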
Output
The script creates multiple CSV files in the output directory:

    groups/
    ├── group_0001.csv
    ├── group_0002.csv
    ├── group_0003.csv
    └── ...

Each file contains merge instructions for related entities that should be processed together.
Config settings used
From meta_config.yaml:

| Setting | Purpose |
|---|---|
| `triplestore_url` | Query entity relationships |
| `supplier_prefix` | Calculate file paths |
| `dir_split_number` | Calculate directory structure |
| `items_per_file` | Calculate file assignments |
| `zip_output_rdf` | Determine file extensions |