Extract subset

Extracts a subset of RDF data from a SPARQL endpoint by querying instances of a specified class and recursively following URI references. Outputs the result in N-Quads format.

uv run python -m oc_meta.run.migration.extract_subset [options]

| Parameter | Required | Default | Description |
|-----------|----------|---------|-------------|
| --endpoint | No | http://localhost:8890/sparql | SPARQL endpoint URL |
| --class | No | http://purl.org/spar/fabio/Expression | Class URI to extract instances of (mutually exclusive with --predicate) |
| --predicate | No | - | Predicate URI for entity discovery (mutually exclusive with --class) |
| --limit | No | 1000 | Maximum number of initial entities |
| --output | No | output.nq | Output file name |
| --compress | No | False | Compress output with gzip |
| --retries | No | 5 | Maximum retries for failed queries |
| --no-graphs | No | False | Disable named graph queries and output N-Triples instead of N-Quads |
| --no-recurse | No | False | Do not recursively follow URI objects |

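
The extraction proceeds in four steps:
  1. Queries the endpoint for up to --limit subjects that are instances of the specified class (or entities discovered through --predicate, if given)
  2. For each entity, fetches all triples where it appears as subject
  3. Recursively processes any URI found as object, unless --no-recurse is set
  4. Serializes the collected data as N-Quads, or as N-Triples when --no-graphs is set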
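
The four steps can be sketched in Python as a breadth-first traversal. This is a minimal illustration under stated assumptions, not the actual oc_meta implementation: the function names (`discovery_query`, `describe_query`, `extract`, `to_nquads`) and the `fetch_quads` callback are hypothetical, the query shapes are only one plausible way to express steps 1 and 2, and retries and gzip output are left out. Terms are assumed to be in N-Quads syntax, with URIs written as `<...>`.

```python
from collections import deque
from typing import Callable, Iterable

# Terms use N-Quads syntax: URIs as "<...>", literals as quoted strings.
Quad = tuple[str, str, str, str]  # (subject, predicate, object, graph)


def discovery_query(class_term: str, limit: int) -> str:
    """Step 1 (assumed query shape): up to `limit` subjects typed with the class."""
    return f"SELECT DISTINCT ?s WHERE {{ ?s a {class_term} }} LIMIT {limit}"


def describe_query(subject_term: str) -> str:
    """Step 2 (assumed query shape): every triple with this subject, plus its graph."""
    return f"SELECT ?p ?o ?g WHERE {{ GRAPH ?g {{ {subject_term} ?p ?o }} }}"


def is_uri(term: str) -> bool:
    """True for terms written as <...>; literals and blank nodes are not followed."""
    return term.startswith("<") and term.endswith(">")


def extract(
    seeds: Iterable[str],
    fetch_quads: Callable[[str], Iterable[Quad]],
    recurse: bool = True,
) -> list[Quad]:
    """Collect quads for the seeds and, if `recurse`, for every URI reached as object."""
    queue = deque(dict.fromkeys(seeds))  # de-duplicate seeds, keep their order
    seen = set(queue)                    # guards against cycles and repeat fetches
    collected: list[Quad] = []
    while queue:
        subject = queue.popleft()
        for quad in fetch_quads(subject):  # step 2: triples with this subject
            collected.append(quad)
            obj = quad[2]
            # Step 3: enqueue URI objects that have not been visited yet.
            if recurse and is_uri(obj) and obj not in seen:
                seen.add(obj)
                queue.append(obj)
    return collected


def to_nquads(quad: Quad) -> str:
    """Step 4: one N-Quads statement per line, terminated by ' .'."""
    return " ".join(quad) + " ."
```

In this sketch `fetch_quads` stands in for whatever code sends `describe_query` to the endpoint and parses the response; passing `recurse=False` mirrors the effect described for --no-recurse. The `seen` set is what keeps the traversal finite when entities reference each other.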

Extract 500 bibliographic resources with their related entities:

uv run python -m oc_meta.run.migration.extract_subset \
--endpoint http://localhost:8890/sparql \
--class http://purl.org/spar/fabio/Expression \
--limit 500 \
--output subset.nq.gz \
--compress
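
After a run like the one above, a quick sanity check on the output can be useful. The following is a small sketch, assuming the output is a line-based N-Quads or N-Triples file that may be gzip-compressed; the function name `summarize_statements` is hypothetical and is not part of oc_meta. It counts statements and distinct subjects without a full RDF parser, relying only on the fact that each statement occupies one line and that the subject is the first whitespace-delimited term.

```python
import gzip


def summarize_statements(path: str) -> tuple[int, int]:
    """Return (statement count, distinct subject count) for an N-Quads/N-Triples file."""
    # Pick the opener from the file extension; both return text-mode handles.
    opener = gzip.open if path.endswith(".gz") else open
    statements = 0
    subjects = set()
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            statements += 1
            # The subject is an IRI or blank node, so it contains no whitespace.
            subjects.add(line.split(None, 1)[0])
    return statements, len(subjects)


if __name__ == "__main__":
    count, distinct = summarize_statements("subset.nq.gz")
    print(f"{count} statements, {distinct} distinct subjects")
```

Because recursion pulls in related entities, the number of distinct subjects will usually exceed the value passed to --limit.
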

Typical use cases:
  • Create test datasets from production data
  • Extract samples for debugging
  • Migrate specific portions of a triplestore