Extract subset
Extracts a subset of RDF data from a SPARQL endpoint. Starting from the instances of a specified class, or from entity URIs listed in a file, it recursively follows URI references and writes the result in N-Quads or N-Triples format.
Usage
uv run python -m oc_meta.run.migration.extract_subset [options]
Parameters
| Parameter | Required | Default | Description |
|---|---|---|---|
| `--endpoint` | No | | SPARQL endpoint URL |
| `--class` | No | | Class URI to extract instances of (mutually exclusive with the entity file option) |
| | No | - | File with entity URIs to extract, one per line (mutually exclusive with `--class`) |
| `--limit` | No | 1000 | Maximum number of initial entities |
| `--output` | No | output.nq | Output file name |
| `--compress` | No | False | Compress output with gzip |
| | No | 5 | Maximum retries for failed queries |
| `--no-graphs` | No | False | Disable named graph queries and output N-Triples instead of N-Quads |
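The retry parameter listed above caps how many times a failed query is re-attempted. The following is a minimal sketch of that pattern, not the tool's actual implementation: the function name `run_with_retries`, its parameters, and the exponential backoff between attempts are all illustrative assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(
    query_fn: Callable[[], T],
    max_retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call query_fn, retrying up to max_retries times after a failure.

    Hypothetical helper: the wait doubles after each failed attempt
    (base_delay, 2 * base_delay, 4 * base_delay, ...).
    """
    attempt = 0
    while True:
        try:
            return query_fn()
        except Exception:
            # Give up once the retry budget is spent and re-raise the error.
            if attempt >= max_retries:
                raise
            sleep(base_delay * 2 ** attempt)
            attempt += 1
```

With `max_retries=5`, a query is attempted at most six times in total: the initial call plus five retries.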
Process
Discovers entities by querying instances of a class, or loads them from a file
For each entity, fetches all triples (or quads) where it appears as subject
Recursively processes any URI found as object
Serializes the collected data as N-Quads (default) or N-Triples (--no-graphs)
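The steps above can be sketched as a breadth-first traversal. This is an illustrative sketch under stated assumptions, not the tool's source: the `Statement` class, `build_subject_query`, `extract_subset`, and `to_line` are hypothetical names, the SPARQL queries are one plausible way to fetch a subject's triples or quads, and the serializer ignores language tags and datatypes for brevity. The `fetch_statements` callback stands in for the actual endpoint query so the traversal can be tried without a running triplestore.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class Statement:
    subject: str
    predicate: str
    obj: str
    obj_is_uri: bool
    graph: Optional[str] = None  # None when named graphs are disabled


def build_subject_query(entity_uri: str, use_graphs: bool = True) -> str:
    """Build a query for every statement with entity_uri as subject (assumed shape)."""
    if use_graphs:
        return f"SELECT ?p ?o ?g WHERE {{ GRAPH ?g {{ <{entity_uri}> ?p ?o }} }}"
    return f"SELECT ?p ?o WHERE {{ <{entity_uri}> ?p ?o }}"


def extract_subset(
    seeds: Iterable[str],
    fetch_statements: Callable[[str], Iterable[Statement]],
) -> List[Statement]:
    """Collect statements for the seeds and for every URI reachable as an object."""
    visited = set()
    queue = deque(seeds)
    collected: List[Statement] = []
    while queue:
        entity = queue.popleft()
        if entity in visited:
            continue  # already processed; this also breaks reference cycles
        visited.add(entity)
        for statement in fetch_statements(entity):
            collected.append(statement)
            if statement.obj_is_uri and statement.obj not in visited:
                queue.append(statement.obj)
    return collected


def to_line(statement: Statement) -> str:
    """Serialize one statement as an N-Quads line, or N-Triples if graph is None."""
    if statement.obj_is_uri:
        obj = f"<{statement.obj}>"
    else:
        escaped = (
            statement.obj.replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n")
            .replace("\r", "\\r")
        )
        obj = f'"{escaped}"'
    graph = f" <{statement.graph}>" if statement.graph else ""
    return f"<{statement.subject}> <{statement.predicate}> {obj}{graph} ."
```

Tracking visited entities is what keeps the recursion finite: two resources that reference each other are each fetched once.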
Example
Extract 500 bibliographic resources with their related entities:
uv run python -m oc_meta.run.migration.extract_subset \
--endpoint http://localhost:8890/sparql \
--class http://purl.org/spar/fabio/Expression \
--limit 500 \
--output subset.nq.gz \
--compress
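To sanity-check the result of a run like the one above, the statements in the output file can be counted with a few lines of standard-library Python. This is a minimal sketch: `count_statements` is a hypothetical helper, and it assumes one statement per line, as N-Quads and N-Triples require, with blank lines and `#` comment lines skipped.

```python
import gzip


def count_statements(path: str) -> int:
    """Count non-empty, non-comment lines in an N-Quads or N-Triples file."""
    # Choose the opener from the file extension (assumption: gzip output ends in .gz).
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        return sum(
            1
            for line in handle
            if line.strip() and not line.lstrip().startswith("#")
        )
```

For the example above, `count_statements("subset.nq.gz")` would return the number of quads written to the file.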
Use cases
Create test datasets from production data
Extract samples for debugging
Migrate specific portions of a triplestore