Configuration

The Meta process requires a YAML configuration file. Below is a complete reference of all available options.

```yaml
# Triplestore endpoints
triplestore_url: "http://127.0.0.1:8890/sparql"
provenance_triplestore_url: "http://127.0.0.1:8891/sparql"

# RDF settings
base_iri: "https://w3id.org/oc/meta/"
context_path: "https://w3id.org/oc/corpus/context.json"

# Provenance
resp_agent: "https://w3id.org/oc/meta/prov/pa/1"
source: "https://api.crossref.org/"

# Redis
redis_host: "localhost"
redis_port: 6379
redis_db: 0
redis_cache_db: 1

# File organization
supplier_prefix: "060"
dir_split_number: 10000
items_per_file: 1000
default_dir: "_"

# Input/output
input_csv_dir: "/path/to/input"
output_rdf_dir: "/path/to/output"
generate_rdf_files: false
zip_output_rdf: true

# Processing options
silencer: ["author", "editor", "publisher"]
normalize_titles: true
use_doi_api_service: false

# Virtuoso bulk loading (optional)
virtuoso_bulk_load:
  enabled: false
  data_container: "virtuoso-data"
  prov_container: "virtuoso-prov"
  data_mount_dir: "/srv/meta/data_bulk"
  prov_mount_dir: "/srv/meta/prov_bulk"
  bulk_load_dir: "/database/bulk_load"
```
| Option | Type | Description |
| --- | --- | --- |
| `triplestore_url` | string | SPARQL endpoint for data storage |
| `provenance_triplestore_url` | string | SPARQL endpoint for provenance storage |

| Option | Type | Description |
| --- | --- | --- |
| `base_iri` | string | Base IRI for generated entity URIs |
| `context_path` | string | JSON-LD context URL referenced in output files |

The context_path URL is embedded in output JSON-LD files as "@context": "https://w3id.org/oc/corpus/context.json". This keeps output files small (they reference the context instead of embedding it). The context defines namespace prefixes (e.g., fabio:, datacite:, prism:) that map to the OpenCitations Data Model vocabularies.
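As an illustration, an output file might begin like the following fragment (the entity shown is hypothetical; the exact types and properties depend on the record being serialized):

```json
{
  "@context": "https://w3id.org/oc/corpus/context.json",
  "@graph": [
    {
      "@id": "https://w3id.org/oc/meta/br/060/1",
      "@type": "fabio:JournalArticle"
    }
  ]
}
```

Because `@context` is a URL rather than an inline object, consumers resolve the prefix definitions once and reuse them across all output files.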

| Option | Type | Description |
| --- | --- | --- |
| `resp_agent` | string | URI of the responsible agent for provenance |
| `source` | string | Primary source URI for provenance tracking |

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `redis_host` | string | `localhost` | Redis server hostname |
| `redis_port` | int | `6379` | Redis server port |
| `redis_db` | int | `0` | Database for OMID counters |
| `redis_cache_db` | int | `1` | Database for identifier cache |

Meta uses Redis for two purposes:

  • OMID counters (redis_db): Stores sequential counters for generating unique entity URIs. Each entity type (br, ra, id, ar, re) has its own counter that increments to produce URIs like https://w3id.org/oc/meta/br/060/1, br/060/2, etc. Managed by oc_ocdm.RedisCounterHandler.

  • Upload cache (redis_cache_db): Tracks which SPARQL files have already been uploaded to the triplestore. When uploading is interrupted and resumed, Meta skips files already in the cache. Managed by piccione.CacheManager.
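The counter mechanism can be sketched as follows. This is an illustrative model only, not the actual `oc_ocdm.RedisCounterHandler` API; a plain dict stands in for Redis (in production the counters live in `redis_db` via `INCR`, so they survive restarts):

```python
# Sketch of the OMID counter pattern; a dict replaces Redis for illustration.
BASE_IRI = "https://w3id.org/oc/meta/"
SUPPLIER_PREFIX = "060"

counters = {}  # entity type -> last issued counter (the Redis INCR equivalent)

def mint_omid(entity_type: str) -> str:
    """Increment the per-type counter and build the new entity URI."""
    counters[entity_type] = counters.get(entity_type, 0) + 1
    return f"{BASE_IRI}{entity_type}/{SUPPLIER_PREFIX}/{counters[entity_type]}"

print(mint_omid("br"))  # https://w3id.org/oc/meta/br/060/1
print(mint_omid("br"))  # https://w3id.org/oc/meta/br/060/2
print(mint_omid("ra"))  # https://w3id.org/oc/meta/ra/060/1
```

Each entity type advances independently, which is why `br/060/1` and `ra/060/1` can coexist.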

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `supplier_prefix` | string | – | Prefix for OMID URIs (e.g., `"060"`) |
| `dir_split_number` | int | `10000` | Entities per subdirectory |
| `items_per_file` | int | `1000` | Entities per RDF file |
| `default_dir` | string | `"_"` | Directory name when no prefix exists |

The supplier prefix identifies which OpenCitations dataset an entity belongs to. The prefix appears in entity URIs: https://w3id.org/oc/meta/br/060/1 where 060 identifies Meta as the source. Other prefixes used by OpenCitations include 010 (Wikidata), 020 (Crossref), and 040 (Dryad). See the complete supplier prefix table for all available prefixes.

Note: Some existing entities have prefixes like 0610, 0620, etc. (pattern 06[1-9]0). This was used in the past for multiprocessing, where different processes worked on separate directories. This approach is now deprecated due to stability issues with Virtuoso, which does not handle parallel queries well.

These options control how RDF files are organized on disk:

```
output_rdf_dir/
└── br/                      # Entity type (br, ra, id, ar, re)
    └── 060/                 # Supplier prefix (or default_dir if none)
        ├── 10000/           # dir_split_number: entities 1-10000
        │   ├── 1000.json    # items_per_file: entities 1-1000
        │   ├── 2000.json    # entities 1001-2000
        │   └── ...
        └── 20000/           # entities 10001-20000
            ├── 11000.json
            └── ...
```
  • dir_split_number: Creates subdirectories to avoid having too many files in one folder. With dir_split_number: 10000, entities 1-10000 go in 10000/, entities 10001-20000 go in 20000/, etc.

  • items_per_file: Controls how many entities are stored per JSON file. With items_per_file: 1000, entities 1-1000 go in 1000.json, entities 1001-2000 go in 2000.json, etc.

  • default_dir: When entities have no supplier prefix (e.g., during migration from older formats), this directory name is used instead. Typically set to _.
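The mapping from an entity's sequential number to its directory and file can be sketched as a small helper. This is an illustrative reconstruction of the layout above, not Meta's actual implementation; the function name and signature are assumptions:

```python
import math

DIR_SPLIT_NUMBER = 10000  # dir_split_number from the config
ITEMS_PER_FILE = 1000     # items_per_file from the config

def rdf_file_path(entity_type: str, prefix: str, n: int) -> str:
    """Return the relative path of the JSON file holding entity number n."""
    # Each subdirectory covers a window of DIR_SPLIT_NUMBER entities,
    # named after the upper bound of its window (10000, 20000, ...).
    subdir = math.ceil(n / DIR_SPLIT_NUMBER) * DIR_SPLIT_NUMBER
    # Likewise, each file covers ITEMS_PER_FILE entities and is named
    # after the upper bound of its window (1000.json, 2000.json, ...).
    file_id = math.ceil(n / ITEMS_PER_FILE) * ITEMS_PER_FILE
    return f"{entity_type}/{prefix}/{subdir}/{file_id}.json"

print(rdf_file_path("br", "060", 1))      # br/060/10000/1000.json
print(rdf_file_path("br", "060", 10001))  # br/060/20000/11000.json
```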

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `input_csv_dir` | string | – | Directory containing input CSV files |
| `output_rdf_dir` | string | – | Directory for RDF output (if enabled) |
| `generate_rdf_files` | bool | `false` | Generate RDF files in addition to SPARQL |
| `zip_output_rdf` | bool | `true` | Compress RDF files to ZIP archives |

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `silencer` | list | `[]` | Fields to skip during updates |
| `normalize_titles` | bool | `true` | Normalize title casing |
| `use_doi_api_service` | bool | `false` | Query DOI API for metadata |

The silencer option accepts a list of field names: author, editor, and publisher. Meta always works in addition mode (it never overwrites existing data). The silencer prevents adding new elements to an existing sequence. For example, if silencer: ["author"] is set and a resource already has authors, new authors from the CSV will not be added to the existing author chain.
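The behavior can be sketched as follows. This is an illustrative model of the merge rule described above, not Meta's actual code; the function and variable names are assumptions:

```python
# Sketch of addition mode with a silencer: new values are appended to an
# existing sequence unless the field is silenced AND the sequence is non-empty.
SILENCER = ["author"]  # from the YAML config

def merge_field(field: str, existing: list, incoming: list) -> list:
    """Merge incoming CSV values into a stored sequence, honoring the silencer."""
    if existing and field in SILENCER:
        return existing  # field is silenced: keep the stored chain untouched
    return existing + [v for v in incoming if v not in existing]

# Authors are silenced, so the stored chain is preserved:
print(merge_field("author", ["Doe, Jane"], ["Roe, Richard"]))   # ['Doe, Jane']
# Editors are not silenced, so new values are appended:
print(merge_field("editor", ["Poe, Edgar"], ["Roe, Richard"]))  # ['Poe, Edgar', 'Roe, Richard']
```

Note that the silencer only guards existing sequences: if the resource has no authors yet, authors from the CSV are still added.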

Bulk loading bypasses SPARQL `INSERT` queries and uses Virtuoso’s native bulk loader instead.

Note: We have observed empirically that Virtuoso’s database tends to lose integrity when using bulk loading. Although bulk loading is significantly faster, we recommend keeping this option disabled.

| Option | Type | Description |
| --- | --- | --- |
| `virtuoso_bulk_load.enabled` | bool | Enable bulk loading mode |
| `virtuoso_bulk_load.data_container` | string | Docker container name for data triplestore |
| `virtuoso_bulk_load.prov_container` | string | Docker container name for provenance triplestore |
| `virtuoso_bulk_load.data_mount_dir` | string | Host directory mounted in data container |
| `virtuoso_bulk_load.prov_mount_dir` | string | Host directory mounted in prov container |
| `virtuoso_bulk_load.bulk_load_dir` | string | Path inside container for bulk load |

Requirements for bulk loading:

  1. Both Virtuoso instances must run in Docker containers
  2. Host directories must be mounted as volumes
  3. The bulk load directory must be in DirsAllowed in virtuoso.ini

Example Docker setup:

```sh
docker run -d --name virtuoso-data \
  -v /srv/meta/data_bulk:/database/bulk_load \
  -p 8890:8890 -p 1111:1111 \
  openlink/virtuoso-opensource-7:latest
```

Example virtuoso.ini:

```ini
[Parameters]
DirsAllowed = ., /database, /database/bulk_load
```

When you run Meta with a config file, it automatically generates time_agnostic_library_config.json in the same directory. This file is used by the provenance tracking system and shouldn’t be edited manually.