Getting started
Installation
Install via pip:
pip install oc_meta
For development, clone the repository and use uv:
git clone https://github.com/opencitations/oc_meta.git
cd oc_meta
uv sync
Prerequisites
Meta requires:
- Python 3.10+
- Redis for counter handling and caching
- Triplestore (Virtuoso or Blazegraph) for RDF storage
For local development, you can use Docker.
Redis:
docker run -d --name redis -p 6379:6379 redis:latest
Virtuoso (data):
docker run -d --name virtuoso-data -p 8890:8890 -p 1111:1111 openlink/virtuoso-opensource-7:latest
Virtuoso (provenance):
docker run -d --name virtuoso-prov -p 8891:8890 -p 1112:1111 openlink/virtuoso-opensource-7:latest
Your first run
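Before the first run, it can help to confirm that the Redis and Virtuoso containers from the Prerequisites section accept connections. The following is a minimal sketch using only the Python standard library; it is not part of oc_meta, and the names `SERVICES`, `port_is_open`, and `check_services` are hypothetical. It assumes the host ports published by the docker commands in this guide (6379, 8890, 8891) and checks only that a TCP connection succeeds, not that the service behind the port is healthy.

```python
import socket

# Host/port pairs assumed from the docker commands in this guide.
SERVICES = {
    "redis": ("127.0.0.1", 6379),
    "virtuoso-data (SPARQL)": ("127.0.0.1", 8890),
    "virtuoso-prov (SPARQL)": ("127.0.0.1", 8891),
}


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_services(services=SERVICES) -> dict:
    """Map each service name to True (reachable) or False (not reachable)."""
    return {
        name: port_is_open(host, port)
        for name, (host, port) in services.items()
    }
```

Calling `check_services()` returns a dictionary keyed by service name. A `False` entry suggests the corresponding container is not running or its port mapping differs from the one assumed here.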
- Create a configuration file (meta_config.yaml):
triplestore_url: "http://127.0.0.1:8890/sparql"
provenance_triplestore_url: "http://127.0.0.1:8891/sparql"
base_iri: "https://w3id.org/oc/meta/"
context_path: "https://w3id.org/oc/corpus/context.json"
resp_agent: "https://w3id.org/oc/meta/prov/pa/1"
source: "https://api.crossref.org/"

redis_host: "localhost"
redis_port: 6379
redis_db: 0
redis_cache_db: 1

supplier_prefix: "060"
dir_split_number: 10000
items_per_file: 1000

input_csv_dir: "/path/to/input"
- Prepare input CSV with these columns:
| Column | Example |
|---|---|
| id | doi:10.1162/qss_a_00292 |
| title | OpenCitations Meta |
| author | Peroni, Silvio [orcid:0000-0003-0530-4305]; Shotton, David |
| pub_date | 2024-01-22 |
| venue | Quantitative Science Studies [issn:2641-3337] |
| volume | 5 |
| issue | 1 |
| page | 50-75 |
| type | journal article |
| publisher | MIT Press [crossref:281] |
| editor | (same format as author) |
See CSV format for supported identifiers and formats.
- Run processing:
uv run python -m oc_meta.run.meta_process -c meta_config.yaml
See the configuration reference for all available options.
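The input CSV from step 2 can also be produced programmatically. The sketch below writes a one-row file with Python's standard `csv` module, using the column names and example values from the table in this guide. The helper `write_input_csv` and the constants `COLUMNS` and `EXAMPLE_ROW` are hypothetical names, the column order simply mirrors the table, and leaving `editor` empty is an assumption; the CSV format page is the place to confirm what Meta accepts.

```python
import csv
from pathlib import Path

# Column names as listed in the table above (order assumed to match it).
COLUMNS = [
    "id", "title", "author", "pub_date", "venue",
    "volume", "issue", "page", "type", "publisher", "editor",
]

# Example values copied from the table; "editor" is left empty here.
EXAMPLE_ROW = {
    "id": "doi:10.1162/qss_a_00292",
    "title": "OpenCitations Meta",
    "author": "Peroni, Silvio [orcid:0000-0003-0530-4305]; Shotton, David",
    "pub_date": "2024-01-22",
    "venue": "Quantitative Science Studies [issn:2641-3337]",
    "volume": "5",
    "issue": "1",
    "page": "50-75",
    "type": "journal article",
    "publisher": "MIT Press [crossref:281]",
    "editor": "",
}


def write_input_csv(path, rows):
    """Write rows (dicts keyed by column name) to a CSV file with a header."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS)
        writer.writeheader()
        for row in rows:
            # Missing columns are written as empty cells.
            writer.writerow({col: row.get(col, "") for col in COLUMNS})
    return path
```

For example, `write_input_csv("/path/to/input/sample.csv", [EXAMPLE_ROW])` would place the file in the directory that `input_csv_dir` points to in the configuration above. The `csv` module quotes cells that contain commas, such as the author field.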
Typical workflow
A production workflow usually follows these steps:
- Preprocess - Deduplicate input and filter existing IDs
- Process - Run the main Meta pipeline
- Verify - Check that all identifiers were processed correctly
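The three stages can also be chained from a single script that stops at the first failing step. The following is only a sketch, not part of oc_meta: `build_commands` and `run_pipeline` are hypothetical helpers, and the command lines and default paths (`input/`, `preprocessed/`, `meta_config.yaml`, `report.txt`) are the ones given in this section. It assumes that `uv` is on the PATH and that the script runs from the project directory.

```python
import subprocess


def build_commands(config="meta_config.yaml", input_dir="input/",
                   preprocessed_dir="preprocessed/", report="report.txt"):
    """Return the preprocess, process, and verify commands as argument lists."""
    base = ["uv", "run", "python", "-m"]
    return [
        base + ["oc_meta.run.meta.preprocess_input",
                input_dir, preprocessed_dir, "--storage-type", "redis"],
        base + ["oc_meta.run.meta_process", "-c", config],
        base + ["oc_meta.run.meta.check_results", config, "--output", report],
    ]


def run_pipeline(commands, runner=subprocess.run):
    """Run each command in order; return the first non-zero exit code, else 0."""
    for cmd in commands:
        print("running:", " ".join(cmd))
        result = runner(cmd)
        if result.returncode != 0:
            # Stop here so later stages do not run on incomplete output.
            return result.returncode
    return 0
```

Calling `run_pipeline(build_commands())` runs the stages in sequence. The `runner` parameter exists so the control flow can be exercised without launching the real commands.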
Preprocess (optional but recommended):
uv run python -m oc_meta.run.meta.preprocess_input input/ preprocessed/ --storage-type redis
Process:
uv run python -m oc_meta.run.meta_process -c meta_config.yaml
Verify:
uv run python -m oc_meta.run.meta.check_results meta_config.yaml --output report.txt
Next steps
- Configuration reference - All configuration options
- Preprocessing - Filter and deduplicate input data
- Processing - How the pipeline works
- CSV format - Input format and supported identifiers