Data curation
Validates, normalizes, and cleans bibliographic metadata from CSV files. Handles identifier validation, duplicate detection, and data normalization.
RDF generation
Converts curated data into RDF following the OpenCitations Data Model. Creates bibliographic resources, responsible agents, and identifiers with provenance tracking.
Duplicate detection
Identifies duplicate entities across the dataset by analyzing identifiers in RDF files. Groups related entities for batch processing.
Entity merging
Merges duplicate entities using a Union-Find algorithm. Handles bibliographic resources, responsible agents, and identifiers, with parallel processing.
Entity editing
Modifies existing RDF entities: add, update, or delete triples. Generates provenance snapshots for each modification.
CSV generation
Generates CSV dumps from RDF data. Extracts bibliographic metadata back to tabular format for analysis or migration.
Info dir management
Manages Redis counters for entity numbering. Rebuilds counters from RDF files after system recovery or data import.
Migration tools
Imports RDF from external sources, extracts subsets from triplestores, and converts provenance formats.
Benchmarks
Measures processing performance with synthetic data. Supports scalability analysis across dataset sizes.
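The entity merging step above names Union-Find as its grouping technique. The following is a minimal sketch of how pairs of duplicate entities can be collapsed into merge groups with a disjoint-set structure. It is not the code used by oc_meta: the `UnionFind` class, the `group_duplicates` function, and the `br/…` entity labels are all hypothetical and only illustrate the idea.

```python
from collections import defaultdict


class UnionFind:
    """Disjoint-set structure with path compression (illustrative only)."""

    def __init__(self):
        self.parent = {}

    def find(self, item):
        # Register unseen items as their own root.
        self.parent.setdefault(item, item)
        root = item
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited node directly at the root.
        while self.parent[item] != root:
            self.parent[item], item = root, self.parent[item]
        return root

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def group_duplicates(pairs):
    """Turn (entity, duplicate) pairs into groups of entities to merge."""
    uf = UnionFind()
    for a, b in pairs:
        uf.union(a, b)
    groups = defaultdict(set)
    for entity in list(uf.parent):
        groups[uf.find(entity)].add(entity)
    return [sorted(group) for group in groups.values() if len(group) > 1]


if __name__ == "__main__":
    # Hypothetical entity labels; real entities would be identified by URIs.
    pairs = [("br/1", "br/2"), ("br/2", "br/3"), ("br/4", "br/5")]
    for group in group_duplicates(pairs):
        print(group)
```

Because duplicate relations are transitive, `br/1`, `br/2`, and `br/3` end up in a single group even though `br/1` and `br/3` never appear together in a pair, so each group can be merged in one pass.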
Install:

    pip install oc_meta

Run the main processing pipeline:

    python -m oc_meta.run.meta_process -c meta_config.yaml

Meta expects CSV files with these columns:
| Column | Description |
|---|---|
| id | Space-separated identifiers (doi:10.1162/qss_a_00292 pmid:38034492) |
| title | Title of the work |
| author | Semicolon-separated names with optional identifiers (Peroni, Silvio [orcid:0000-0003-0530-4305]; Shotton, David) |
| pub_date | ISO 8601 date (2024-01-22, 2024-01, or 2024) |
| venue | Container title with optional identifier (Quantitative Science Studies [issn:2641-3337]) |
| volume | Volume number |
| issue | Issue number |
| page | Page range (50-75) |
| type | Resource type (journal article, book chapter, proceedings article, etc.) |
| publisher | Publisher name with optional identifier (MIT Press [crossref:281]) |
| editor | Same format as author |
See the CSV format reference for the complete specification.
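To make the field conventions in the table concrete, here is a rough sketch of how the `id`, `author`, and `pub_date` values could be split into structured data. It is an assumption-based illustration, not the parser that oc_meta ships: the function names and regular expressions are hypothetical, and the date check only verifies the `YYYY`, `YYYY-MM`, or `YYYY-MM-DD` shape.

```python
import re

# scheme:value tokens such as doi:10.1162/qss_a_00292
ID_PATTERN = re.compile(r"^(?P<scheme>[a-z]+):(?P<value>\S+)$")
# A name optionally followed by identifiers in square brackets.
BRACKET_PATTERN = re.compile(
    r"^(?P<name>.*?)\s*(?:\[(?P<ids>[^\]]*)\])?$", re.DOTALL
)
# Shape check only: does not reject impossible days such as 2024-02-31.
DATE_PATTERN = re.compile(r"\d{4}(-(0[1-9]|1[0-2])(-(0[1-9]|[12]\d|3[01]))?)?")


def parse_ids(field):
    """Split a space-separated identifier field into (scheme, value) pairs."""
    ids = []
    for token in field.split():
        match = ID_PATTERN.match(token)
        if match:
            ids.append((match["scheme"], match["value"]))
    return ids


def parse_agents(field):
    """Split an author or editor field into names with optional identifiers."""
    agents = []
    for chunk in field.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue
        match = BRACKET_PATTERN.match(chunk)
        agents.append(
            {"name": match["name"].strip(), "ids": parse_ids(match["ids"] or "")}
        )
    return agents


def is_valid_pub_date(value):
    """Return True for YYYY, YYYY-MM, or YYYY-MM-DD shaped values."""
    return DATE_PATTERN.fullmatch(value) is not None


if __name__ == "__main__":
    print(parse_ids("doi:10.1162/qss_a_00292 pmid:38034492"))
    print(parse_agents("Peroni, Silvio [orcid:0000-0003-0530-4305]; Shotton, David"))
    print(is_valid_pub_date("2024-01"))
```

The same bracket convention applies to the `venue`, `publisher`, and `editor` columns, so a helper like `parse_agents` could be reused for `editor`, while `venue` and `publisher` hold a single name rather than a semicolon-separated list.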