# Configuration

The Meta process requires a YAML configuration file. This page is a complete reference for all available options.
## Complete example

```yaml
# Triplestore endpoints
triplestore_url: "http://127.0.0.1:8890/sparql"
provenance_triplestore_url: "http://127.0.0.1:8891/sparql"

# RDF settings
base_iri: "https://w3id.org/oc/meta/"
context_path: "https://w3id.org/oc/corpus/context.json"

# Provenance
resp_agent: "https://w3id.org/oc/meta/prov/pa/1"
source: "https://api.crossref.org/"

# Redis
redis_host: "localhost"
redis_port: 6379
redis_db: 0
redis_cache_db: 1

# File organization
supplier_prefix: "060"
dir_split_number: 10000
items_per_file: 1000
default_dir: "_"

# Input/output
input_csv_dir: "/path/to/input"
output_rdf_dir: "/path/to/output"
generate_rdf_files: false
zip_output_rdf: true

# Processing options
silencer: ["author", "editor", "publisher"]
normalize_titles: true
use_doi_api_service: false

# Virtuoso bulk loading (optional)
virtuoso_bulk_load:
  enabled: false
  data_container: "virtuoso-data"
  prov_container: "virtuoso-prov"
  data_mount_dir: "/srv/meta/data_bulk"
  prov_mount_dir: "/srv/meta/prov_bulk"
  bulk_load_dir: "/database/bulk_load"
```

## Option reference

### Triplestore settings
| Option | Type | Description |
|---|---|---|
| `triplestore_url` | string | SPARQL endpoint for data storage |
| `provenance_triplestore_url` | string | SPARQL endpoint for provenance storage |
### RDF settings

| Option | Type | Description |
|---|---|---|
| `base_iri` | string | Base IRI for generated entity URIs |
| `context_path` | string | JSON-LD context URL referenced in output files |
The `context_path` URL is embedded in output JSON-LD files as `"@context": "https://w3id.org/oc/corpus/context.json"`. This keeps output files small: they reference the context instead of embedding it. The context defines namespace prefixes (e.g., `fabio:`, `datacite:`, `prism:`) that map to the OpenCitations Data Model vocabularies.
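As an illustration, an output file references the context by URL rather than embedding it; the entity below is hypothetical and simplified relative to Meta's actual output:

```json
{
  "@context": "https://w3id.org/oc/corpus/context.json",
  "@graph": [
    {
      "@id": "https://w3id.org/oc/meta/br/060/1",
      "@type": "fabio:JournalArticle"
    }
  ]
}
```

Any JSON-LD processor resolves `fabio:` and the other prefixes by fetching the shared context, so the same vocabulary mappings apply to every output file.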
### Provenance

| Option | Type | Description |
|---|---|---|
| `resp_agent` | string | URI of the responsible agent for provenance |
| `source` | string | Primary source URI for provenance tracking |
### Redis

| Option | Type | Default | Description |
|---|---|---|---|
| `redis_host` | string | localhost | Redis server hostname |
| `redis_port` | int | 6379 | Redis server port |
| `redis_db` | int | 0 | Database for OMID counters |
| `redis_cache_db` | int | 1 | Database for identifier cache |
Meta uses Redis for two purposes:

- **OMID counters** (`redis_db`): stores sequential counters for generating unique entity URIs. Each entity type (`br`, `ra`, `id`, `ar`, `re`) has its own counter that increments to produce URIs like `https://w3id.org/oc/meta/br/060/1`, `br/060/2`, etc. Managed by `oc_ocdm.RedisCounterHandler`.
- **Upload cache** (`redis_cache_db`): tracks which SPARQL files have already been uploaded to the triplestore. When uploading is interrupted and resumed, Meta skips files already in the cache. Managed by `piccione.CacheManager`.
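To make the counter mechanics concrete, here is a minimal in-memory sketch of the pattern. This is an illustrative stand-in, not the actual `oc_ocdm.RedisCounterHandler`, which persists the counters in Redis so URIs stay unique across runs:

```python
class InMemoryCounterHandler:
    """Illustrative stand-in for the Redis-backed counter handler."""

    def __init__(self, base_iri="https://w3id.org/oc/meta/", prefix="060"):
        self.base_iri = base_iri
        self.prefix = prefix
        self.counters = {}  # one counter per entity type (br, ra, id, ar, re)

    def mint_uri(self, entity_type):
        # Increment the per-type counter and build the entity URI from it
        self.counters[entity_type] = self.counters.get(entity_type, 0) + 1
        return f"{self.base_iri}{entity_type}/{self.prefix}/{self.counters[entity_type]}"


handler = InMemoryCounterHandler()
print(handler.mint_uri("br"))  # https://w3id.org/oc/meta/br/060/1
print(handler.mint_uri("br"))  # https://w3id.org/oc/meta/br/060/2
```

The real handler performs the increment in Redis (the database selected by `redis_db`), so concurrent and restarted runs never reuse a number.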
### File organization

| Option | Type | Default | Description |
|---|---|---|---|
| `supplier_prefix` | string | - | Prefix for OMID URIs (e.g., "060") |
| `dir_split_number` | int | 10000 | Entities per subdirectory |
| `items_per_file` | int | 1000 | Entities per RDF file |
| `default_dir` | string | `_` | Directory name when no prefix exists |
The supplier prefix identifies which OpenCitations dataset an entity belongs to. The prefix appears in entity URIs: in `https://w3id.org/oc/meta/br/060/1`, the `060` identifies Meta as the source. Other prefixes used by OpenCitations include `010` (Wikidata), `020` (Crossref), and `040` (Dryad). See the complete supplier prefix table for all available prefixes.
Note: Some existing entities have prefixes like `0610`, `0620`, etc. (pattern `06[1-9]0`). These were used in the past for multiprocessing, where different processes worked on separate directories. This approach is now deprecated due to stability issues with Virtuoso, which does not handle parallel queries well.
These options control how RDF files are organized on disk:

```
output_rdf_dir/
└── br/                      # Entity type (br, ra, id, ar, re)
    └── 060/                 # Supplier prefix (or default_dir if none)
        ├── 10000/           # dir_split_number: entities 1-10000
        │   ├── 1000.json    # items_per_file: entities 1-1000
        │   ├── 2000.json    # entities 1001-2000
        │   └── ...
        └── 20000/           # entities 10001-20000
            ├── 11000.json
            └── ...
```

- `dir_split_number`: creates subdirectories to avoid having too many files in one folder. With `dir_split_number: 10000`, entities 1-10000 go in `10000/`, entities 10001-20000 go in `20000/`, etc.
- `items_per_file`: controls how many entities are stored per JSON file. With `items_per_file: 1000`, entities 1-1000 go in `1000.json`, entities 1001-2000 go in `2000.json`, etc.
- `default_dir`: when entities have no supplier prefix (e.g., during migration from older formats), this directory name is used instead. Typically set to `_`.
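The mapping from an entity's sequential number to its directory and file can be sketched as a small helper. `rdf_file_path` is a hypothetical name for illustration, not part of Meta's API; both the directory and the file are named after the upper bound of the range they contain:

```python
import math


def rdf_file_path(entity_number: int,
                  dir_split_number: int = 10000,
                  items_per_file: int = 1000) -> str:
    """Return the subdirectory/file that holds a given entity."""
    # Round the entity number up to the next multiple of each bucket size
    subdir = math.ceil(entity_number / dir_split_number) * dir_split_number
    file_bucket = math.ceil(entity_number / items_per_file) * items_per_file
    return f"{subdir}/{file_bucket}.json"


print(rdf_file_path(1))      # 10000/1000.json
print(rdf_file_path(10001))  # 20000/11000.json
```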
### Input/output

| Option | Type | Default | Description |
|---|---|---|---|
| `input_csv_dir` | string | - | Directory containing input CSV files |
| `output_rdf_dir` | string | - | Directory for RDF output (if enabled) |
| `generate_rdf_files` | bool | false | Generate RDF files in addition to SPARQL |
| `zip_output_rdf` | bool | true | Compress RDF files to ZIP archives |
### Processing options

| Option | Type | Default | Description |
|---|---|---|---|
| `silencer` | list | [] | Fields to skip during updates |
| `normalize_titles` | bool | true | Normalize title casing |
| `use_doi_api_service` | bool | false | Query the DOI API for metadata |
The `silencer` option accepts a list of field names: `author`, `editor`, and `publisher`. Meta always works in addition mode (it never overwrites existing data); the silencer prevents adding new elements to an existing sequence. For example, if `silencer: ["author"]` is set and a resource already has authors, new authors from the CSV will not be added to the existing author chain.
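The interaction between addition mode and the silencer can be sketched as follows. This is illustrative logic only, not Meta's actual implementation, and `merge_field` is a hypothetical name:

```python
def merge_field(existing, incoming, field, silencer):
    """Addition-mode merge: existing values are never removed or overwritten.

    If the field is silenced and already has values, incoming values are
    ignored; otherwise new values are appended (duplicates skipped).
    """
    if existing and field in silencer:
        return list(existing)
    return list(existing) + [v for v in incoming if v not in existing]


# With silencer: ["author"], an existing author chain stays untouched
print(merge_field(["Doe, Jane"], ["Roe, Sam"], "author", ["author"]))
# An unsilenced field still gains the new value: Doe, Jane then Roe, Sam
print(merge_field(["Doe, Jane"], ["Roe, Sam"], "editor", ["author"]))
```

Note that the silencer only blocks additions to *non-empty* sequences: a resource with no authors would still receive authors from the CSV.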
### Virtuoso bulk loading

Bulk loading bypasses SPARQL `INSERT` and uses Virtuoso's native loader:
Note: We have observed empirically that Virtuoso’s database tends to lose integrity when using bulk loading. While it improves performance and speed, we recommend keeping this option disabled.
| Option | Type | Description |
|---|---|---|
| `virtuoso_bulk_load.enabled` | bool | Enable bulk loading mode |
| `virtuoso_bulk_load.data_container` | string | Docker container name for the data triplestore |
| `virtuoso_bulk_load.prov_container` | string | Docker container name for the provenance triplestore |
| `virtuoso_bulk_load.data_mount_dir` | string | Host directory mounted in the data container |
| `virtuoso_bulk_load.prov_mount_dir` | string | Host directory mounted in the provenance container |
| `virtuoso_bulk_load.bulk_load_dir` | string | Path inside the container for bulk loading |
Requirements for bulk loading:

- Both Virtuoso instances must run in Docker containers
- Host directories must be mounted as volumes
- The bulk load directory must be listed in `DirsAllowed` in `virtuoso.ini`
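For context, Virtuoso's native loader is driven by its standard bulk-load functions, run through `isql` inside the container. The graph IRI and file mask below are illustrative assumptions, not values taken from Meta:

```sql
-- Register files in the allowed directory for loading
ld_dir('/database/bulk_load', '*.nq', 'https://w3id.org/oc/meta/');
-- Load everything registered so far
rdf_loader_run();
-- Persist the loaded data
checkpoint;
```

`ld_dir` will fail if the directory is not listed in `DirsAllowed`, which is why the `virtuoso.ini` requirement above matters.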
Example Docker setup:

```bash
docker run -d --name virtuoso-data \
  -v /srv/meta/data_bulk:/database/bulk_load \
  -p 8890:8890 -p 1111:1111 \
  openlink/virtuoso-opensource-7:latest
```

Example `virtuoso.ini`:

```ini
[Parameters]
DirsAllowed = ., /database, /database/bulk_load
```

## Generated files
When you run Meta with a config file, it automatically generates `time_agnostic_library_config.json` in the same directory. This file is used by the provenance tracking system and should not be edited manually.