Querying HDT documents
Getting started
The primary way of using rdflib-hdt is the rdflib_hdt.HDTStore class.
Upon creation, it searches for an index file in the same directory as the HDT file you wish to load.
For example, if you load the file /home/awesome-user/test.hdt, rdflib_hdt.HDTDocument will look for the index file /home/awesome-user/test.hdt.index.v1-1.
Warning
By default, an HDTStore discards RDF Terms with invalid UTF-8 encoding. You can change this behavior with the safe_mode parameter of the constructor.
Note
Missing indexes are generated automatically, but be careful: generating them requires loading all HDT triples into memory!
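To know in advance whether loading a file will trigger index generation, you can check for the index file yourself. The sketch below uses only the Python standard library and assumes the naming convention from the example above (the HDT file name followed by .index.v1-1). The helper names expected_index_path and index_exists are hypothetical and are not part of rdflib-hdt.

```python
from pathlib import Path

# Suffix taken from the example path above; it is an assumption that
# every HDT file uses the same suffix.
INDEX_SUFFIX = ".index.v1-1"


def expected_index_path(hdt_path: str) -> Path:
    """Return the index path expected next to the given HDT file."""
    path = Path(hdt_path)
    # Append the suffix to the full file name, keeping the same directory.
    return path.with_name(path.name + INDEX_SUFFIX)


def index_exists(hdt_path: str) -> bool:
    """Return True if the expected index file is already on disk."""
    return expected_index_path(hdt_path).is_file()


if not index_exists("test.hdt"):
    print("No index found: loading test.hdt will generate one in memory.")
```

Running this check before creating the store lets you decide whether the machine has enough memory for index generation, or whether to copy an existing index next to the HDT file first.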
from rdflib import Graph
from rdflib_hdt import HDTStore
from rdflib.namespace import FOAF
# Load an HDT file. Missing indexes are generated automatically.
# You can provide index files by placing them in the same directory as the HDT file.
store = HDTStore("test.hdt")
# Display some metadata about the HDT document itself
print(f"Number of RDF triples: {len(store)}")
print(f"Number of subjects: {store.nb_subjects}")
print(f"Number of predicates: {store.nb_predicates}")
print(f"Number of objects: {store.nb_objects}")
print(f"Number of shared subject-object: {store.nb_shared}")
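Because HDTStore is an RDFlib store, it can also be wrapped in a Graph, as the SPARQL example does, and searched with ordinary triple patterns. The following is a minimal sketch rather than the library's documented recipe: Graph.triples() is the standard RDFlib pattern-matching call, while open_hdt_graph and subjects_and_objects are hypothetical helper names introduced here for illustration.

```python
from typing import Iterator, Tuple


def open_hdt_graph(path: str):
    """Wrap an HDT file in an RDFlib Graph."""
    # Imported here so the helper below stays usable with any RDFlib graph.
    from rdflib import Graph
    from rdflib_hdt import HDTStore
    return Graph(store=HDTStore(path))


def subjects_and_objects(graph, predicate) -> Iterator[Tuple[object, object]]:
    """Yield (subject, object) for every triple that uses `predicate`."""
    # None acts as a wildcard in RDFlib triple patterns.
    for s, _, o in graph.triples((None, predicate, None)):
        yield s, o


# Usage, assuming test.hdt exists and contains foaf:name triples:
#   from rdflib.namespace import FOAF
#   graph = open_hdt_graph("test.hdt")
#   for person, name in subjects_and_objects(graph, FOAF.name):
#       print(f"{person} is named {name}")
```

Keeping the pattern lookup in a function that accepts any graph means the same code runs against an in-memory RDFlib graph during testing and against an HDT-backed graph in production.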
Executing SPARQL queries
Using the RDFlib API, you can also execute SPARQL queries over an HDT document.
If you do so, we recommend that you first call the rdflib_hdt.optimize_sparql() function, which optimizes the RDFlib SPARQL query engine for HDT documents.
from rdflib import Graph
from rdflib_hdt import HDTStore, optimize_sparql
# Calling this function optimizes the RDFlib SPARQL engine for HDT documents
optimize_sparql()
graph = Graph(store=HDTStore("test.hdt"))
# You can execute SPARQL queries using the regular RDFlib API
qres = graph.query("""
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name ?friend WHERE {
?a foaf:knows ?b.
?a foaf:name ?name.
?b foaf:name ?friend.
}""")
for row in qres:
    print(f"{row.name} knows {row.friend}")
Note
Calling the rdflib_hdt.optimize_sparql()
function triggers a global modification of the RDFlib SPARQL engine.
However, executing SPARQL queries using other RDFlib stores will continue to work as before,
so you can safely call this function at the beginning of your code.