Getting started
Installation
Install via pip:
pip install oc_meta
For development, clone the repository and use uv:
git clone https://github.com/opencitations/oc_meta.git
cd oc_meta
uv sync
Prerequisites
Meta requires:
Python 3.10+
Triplestore (Virtuoso or Blazegraph) for RDF storage
For local development, you can use Docker.
Virtuoso (data):
docker run -d --name virtuoso-data -p 8890:8890 -p 1111:1111 openlink/virtuoso-opensource-7:latest
Virtuoso (provenance):
docker run -d --name virtuoso-prov -p 8891:8890 -p 1112:1111 openlink/virtuoso-opensource-7:latest
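Before running Meta against the two containers, it can help to confirm that both SPARQL endpoints answer. The following is a minimal sketch using only the Python standard library. The endpoint URLs are derived from the port mappings above and the `/sparql` path used later in this guide; the helper names `build_ask_url` and `is_reachable` are hypothetical and not part of oc_meta.

```python
import urllib.error
import urllib.parse
import urllib.request

# Endpoint URLs follow from the -p 8890:8890 and -p 8891:8890 port mappings;
# the /sparql path is the one used in this guide's configuration example.
ENDPOINTS = {
    "data": "http://127.0.0.1:8890/sparql",
    "provenance": "http://127.0.0.1:8891/sparql",
}


def build_ask_url(endpoint: str) -> str:
    """Return a GET URL that sends a trivial ASK query to the endpoint."""
    query = urllib.parse.urlencode({"query": "ASK { ?s ?p ?o }"})
    return f"{endpoint}?{query}"


def is_reachable(endpoint: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers the ASK query with HTTP 200."""
    request = urllib.request.Request(
        build_ask_url(endpoint),
        headers={"Accept": "application/sparql-results+json"},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


# Example use (requires the containers to be running):
#   for name, url in ENDPOINTS.items():
#       print(name, is_reachable(url))
```

A freshly started container may need a few seconds before it accepts connections, so a `False` result right after `docker run` does not necessarily indicate a problem.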
Your first run
Create a configuration file (meta_config.yaml):
triplestore_url: "http://127.0.0.1:8890/sparql"
provenance_triplestore_url: "http://127.0.0.1:8891/sparql"
base_iri: "https://w3id.org/oc/meta/"
resp_agent: "https://w3id.org/oc/meta/prov/pa/1"
source: "https://api.crossref.org/"
supplier_prefix: "060"
dir_split_number: 10000
items_per_file: 1000
input_csv_dir: "/path/to/input"
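A quick sanity check of the file before the first run can catch typos early. The sketch below is not part of oc_meta: `find_problems` and `load_config` are hypothetical helpers, and treating every key from the example above as required is an assumption of this sketch rather than a rule from the configuration reference. Loading relies on the third-party PyYAML package.

```python
from pathlib import Path

# Keys that appear in the example configuration above. Treating all of them
# as required is an assumption of this sketch.
EXPECTED_KEYS = (
    "triplestore_url",
    "provenance_triplestore_url",
    "base_iri",
    "resp_agent",
    "source",
    "supplier_prefix",
    "dir_split_number",
    "items_per_file",
    "input_csv_dir",
)


def find_problems(config: dict) -> list[str]:
    """Return human-readable problems found in a parsed configuration."""
    problems = [f"missing key: {key}" for key in EXPECTED_KEYS if key not in config]
    for key in ("triplestore_url", "provenance_triplestore_url"):
        value = config.get(key)
        if value is not None and not str(value).startswith(("http://", "https://")):
            problems.append(f"{key} is not an HTTP(S) URL: {value!r}")
    prefix = config.get("supplier_prefix")
    if prefix is not None and not isinstance(prefix, str):
        # An unquoted 060 is read by PyYAML as a number, losing the leading zero.
        problems.append(f"supplier_prefix should be a quoted string: {prefix!r}")
    for key in ("dir_split_number", "items_per_file"):
        value = config.get(key)
        if value is not None and (not isinstance(value, int) or value <= 0):
            problems.append(f"{key} must be a positive integer: {value!r}")
    input_dir = config.get("input_csv_dir")
    if input_dir is not None and not Path(input_dir).is_dir():
        problems.append(f"input_csv_dir does not exist: {input_dir}")
    return problems


def load_config(path: str) -> dict:
    """Parse the YAML file; PyYAML is imported lazily because it is third-party."""
    import yaml  # provided by the PyYAML package

    with open(path, encoding="utf-8") as handle:
        return yaml.safe_load(handle)
```

Calling `find_problems(load_config("meta_config.yaml"))` returns an empty list when nothing looks wrong. The check on `supplier_prefix` reflects why the value is quoted in the example: without quotes, the YAML parser would not keep it as the string "060".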
Prepare input CSV with these columns:
| Column | Example |
|---|---|
| id | |
| title | |
| author | |
| pub_date | |
| venue | |
| volume | |
| issue | |
| page | |
| type | |
| publisher | |
| editor | (same format as author) |
See CSV format for supported identifiers and formats.
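An input file can also be generated programmatically. The sketch below uses the standard `csv` module and assumes the column names `id`, `title`, `author`, `pub_date`, `venue`, `volume`, `issue`, `page`, `type`, `publisher`, and `editor`, in that order; confirm them against the CSV format page before relying on this. The helper `rows_to_csv` and the sample record are hypothetical.

```python
import csv
import io

# Assumed column names and order for Meta input CSVs; verify them against
# the CSV format page.
COLUMNS = [
    "id", "title", "author", "pub_date", "venue", "volume",
    "issue", "page", "type", "publisher", "editor",
]


def rows_to_csv(rows: list[dict]) -> str:
    """Serialize records to CSV text, leaving unspecified columns empty.

    A key that is not in COLUMNS raises ValueError, which catches typos
    in column names.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=COLUMNS, restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical record for illustration only; the DOI does not refer to a
# real publication.
example_rows = [{"id": "doi:10.1234/example", "title": "An example article"}]
csv_text = rows_to_csv(example_rows)
```

Writing `csv_text` to a `.csv` file inside the directory named by `input_csv_dir` would make it available to the processing step.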
Run processing:
uv run python -m oc_meta.run.meta_process -c meta_config.yaml
See the configuration reference for all available options.
Typical workflow
A production workflow usually follows these steps:
Preprocess - Deduplicate input and filter existing IDs
Process - Run the main Meta pipeline
Verify - Check that all identifiers were processed correctly
Preprocess (optional but recommended):
uv run python -m oc_meta.run.meta.preprocess_input input/ preprocessed/ --redis-port 6379
Process:
uv run python -m oc_meta.run.meta_process -c meta_config.yaml
Verify:
uv run python -m oc_meta.run.meta.check_results meta_config.yaml report.json
Next steps
Configuration reference - All configuration options
Preprocessing - Filter and deduplicate input data
Processing - How the pipeline works
CSV format - Input format and supported identifiers