
Extract subset

Extracts a subset of RDF data from a SPARQL endpoint. The extraction starts from the instances of a specified class, or from a file of entity URIs, and recursively follows URI references. The result is written in N-Quads or N-Triples format.

uv run python -m oc_meta.run.migration.extract_subset [options]
| Parameter | Required | Default | Description |
|-----------|----------|---------|-------------|
| --endpoint | No | http://localhost:8890/sparql | SPARQL endpoint URL |
| --class | No | http://purl.org/spar/fabio/Expression | Class URI to extract instances of (mutually exclusive with --entities-file) |
| --entities-file | No | - | File with entity URIs to extract, one per line (mutually exclusive with --class) |
| --limit | No | 1000 | Maximum number of initial entities |
| --output | No | output.nq | Output file name |
| --compress | No | False | Compress output with gzip |
| --retries | No | 5 | Maximum retries for failed queries |
| --no-graphs | No | False | Disable named graph queries and output N-Triples instead of N-Quads |

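To make the defaults and the mutual exclusion between --class and --entities-file concrete, here is a minimal sketch of how options like these could be declared with Python's standard argparse module. It is an illustration only, not the script's actual source: the function name `build_parser` and the attribute name `class_uri` are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented options (not the real script)."""
    parser = argparse.ArgumentParser(prog="extract_subset")
    parser.add_argument("--endpoint", default="http://localhost:8890/sparql")

    # --class and --entities-file are documented as mutually exclusive.
    source = parser.add_mutually_exclusive_group()
    source.add_argument(
        "--class",
        dest="class_uri",  # "class" is a Python keyword, so use another attribute name
        default="http://purl.org/spar/fabio/Expression",
    )
    source.add_argument("--entities-file", default=None)

    parser.add_argument("--limit", type=int, default=1000)
    parser.add_argument("--output", default="output.nq")
    parser.add_argument("--compress", action="store_true")
    parser.add_argument("--retries", type=int, default=5)
    parser.add_argument("--no-graphs", action="store_true")
    return parser


# Unspecified options fall back to the documented defaults.
args = build_parser().parse_args(["--limit", "500", "--compress"])
print(args.class_uri, args.limit, args.compress, args.no_graphs)
# prints: http://purl.org/spar/fabio/Expression 500 True False
```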
The script proceeds in four steps:

  1. Discovers the initial entities by querying the instances of a class, or loads them from a file
  2. For each entity, fetches all triples (or quads) in which it appears as subject
  3. Recursively applies step 2 to every URI found in object position
  4. Serializes the collected data as N-Quads (default) or N-Triples (--no-graphs)
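
The steps above can be sketched as a breadth-first traversal with a visited set, which keeps the recursion from looping when two entities reference each other. This is a simplified illustration under stated assumptions, not the script's actual implementation: `extract_subset`, `to_ntriples`, and the in-memory `TOY_GRAPH` are hypothetical, the example.org URIs are placeholders, and named graphs are left out. In a real run, `fetch_triples` would query the SPARQL endpoint for the triples having the entity as subject.

```python
from collections import deque
from typing import Callable, Iterable, List, Set, Tuple

# (subject, predicate, object, object_is_uri)
Triple = Tuple[str, str, str, bool]


def extract_subset(
    seeds: Iterable[str],
    fetch_triples: Callable[[str], List[Triple]],
) -> List[Triple]:
    """Collect triples for the seeds and for every URI reachable from them."""
    queue = deque(seeds)
    visited: Set[str] = set()
    collected: List[Triple] = []
    while queue:
        entity = queue.popleft()
        if entity in visited:
            continue  # already processed; this also breaks reference cycles
        visited.add(entity)
        for triple in fetch_triples(entity):
            collected.append(triple)
            _, _, obj, obj_is_uri = triple
            if obj_is_uri and obj not in visited:
                queue.append(obj)  # follow the URI reference later
    return collected


def to_ntriples(triples: Iterable[Triple]) -> str:
    """Serialize triples as N-Triples lines (plain literals only, for brevity)."""
    lines = []
    for s, p, o, o_is_uri in triples:
        if o_is_uri:
            obj = f"<{o}>"
        else:
            escaped = o.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
            obj = f'"{escaped}"'
        lines.append(f"<{s}> <{p}> {obj} .")
    return "\n".join(lines) + "\n"


# Toy data standing in for the endpoint: two entities that reference each other.
TOY_GRAPH = {
    "http://example.org/br/1": [
        ("http://example.org/br/1", "http://purl.org/dc/terms/title", "A title", False),
        ("http://example.org/br/1", "http://example.org/hasAuthor", "http://example.org/ra/1", True),
    ],
    "http://example.org/ra/1": [
        ("http://example.org/ra/1", "http://xmlns.com/foaf/0.1/name", "Ada", False),
        ("http://example.org/ra/1", "http://example.org/wrote", "http://example.org/br/1", True),
    ],
}


def fetch_from_toy_graph(entity: str) -> List[Triple]:
    return TOY_GRAPH.get(entity, [])


subset = extract_subset(["http://example.org/br/1"], fetch_from_toy_graph)
print(to_ntriples(subset), end="")
```

Starting from a single seed, the sketch collects the four triples of both entities and stops, even though the two entities point at each other.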

Extract 500 bibliographic resources with their related entities:

uv run python -m oc_meta.run.migration.extract_subset \
--endpoint http://localhost:8890/sparql \
--class http://purl.org/spar/fabio/Expression \
--limit 500 \
--output subset.nq.gz \
--compress
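
After a run like this, a quick sanity check is to count how many statements each subject received in the output file. The helper below is a hedged sketch rather than part of oc_meta: `count_statements` is a hypothetical name, and it relies only on the fact that every N-Quads or N-Triples line begins with the subject term.

```python
import gzip
from collections import Counter
from pathlib import Path


def count_statements(path: str) -> Counter:
    """Count statements per subject in an N-Quads/N-Triples file (gzip or plain)."""
    opener = gzip.open if path.endswith(".gz") else open
    counts: Counter = Counter()
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            subject = line.split(None, 1)[0]  # first term on the line
            counts[subject] += 1
    return counts


# Only inspect the file if the extraction has actually been run.
output = Path("subset.nq.gz")
if output.exists():
    counts = count_statements(str(output))
    print(f"{sum(counts.values())} statements about {len(counts)} subjects")
```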
Typical use cases:

  • Create test datasets from production data
  • Extract samples for debugging
  • Migrate specific portions of a triplestore