Low-level Usage

Loading HDT files

The main class for directly manipulating HDT documents with rdflib_hdt is rdflib_hdt.HDTDocument. Upon creation, it searches for an index file in the same directory as the HDT file you wish to load. For example, if you load the file /home/awesome-user/test.hdt, rdflib_hdt.HDTDocument will look for the index file /home/awesome-user/test.hdt.index.v1-1.
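
As a quick illustration of the naming convention above, the following sketch computes the index path that would be looked up for a given HDT file. The helper expected_index_path is hypothetical and not part of rdflib_hdt, and the ".index.v1-1" suffix is simply copied from the example above.

```python
from pathlib import Path

# Suffix copied from the example above; treat it as an assumption elsewhere.
INDEX_SUFFIX = ".index.v1-1"


def expected_index_path(hdt_path: str) -> Path:
    """Return the index path expected next to an HDT file (hypothetical helper)."""
    path = Path(hdt_path)
    # Append the suffix to the full file name, keeping the same directory.
    return path.with_name(path.name + INDEX_SUFFIX)


index_path = expected_index_path("/home/awesome-user/test.hdt")
print(index_path.as_posix())
if index_path.exists():
    print("Index file found")
else:
    print("Index file missing: it would have to be generated on load")
```

A check like this can help decide whether loading a large file is likely to trigger index generation.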

Warning

By default, an HDTDocument discards RDF Terms with invalid UTF-8 encoding. You can change this behavior with the safe_mode parameter of the constructor.

Note

Missing indexes are generated automatically, but be careful: generating them requires loading all HDT triples in memory!

from rdflib_hdt import HDTDocument

# Load an HDT file.
# Missing indexes are generated automatically; pass False as the second argument to disable this.
document = HDTDocument("test.hdt")

# Display some metadata about the HDT document itself
print(f"Number of RDF triples: {document.total_triples}")
print(f"Number of subjects: {document.nb_subjects}")
print(f"Number of predicates: {document.nb_predicates}")
print(f"Number of objects: {document.nb_objects}")
print(f"Number of shared subject-object: {document.nb_shared}")

Searching for triples

You can search for all RDF triples in the HDT file matching a triple pattern using rdflib_hdt.HDTDocument.search(). It returns a 2-element tuple: an rdflib_hdt.HDTIterator over the matching RDF triples and the estimated triple pattern cardinality.

Note

The rdflib_hdt.HDTDocument.search() method also accepts limit and offset parameters, to perform range queries over the matching RDF triples.

from rdflib.namespace import FOAF
from rdflib_hdt import HDTDocument
document = HDTDocument("test.hdt")

# Fetch all triples that match { ?s foaf:name ?o }
# Use None to indicate variables
triples, cardinality = document.search((None, FOAF("name"), None))

print(f"Cardinality of (?s foaf:name ?o): {cardinality}")
for s, p, o in triples:
  print(s, p, o)

# The search also supports limit and offset
triples, cardinality = document.search((None, FOAF("name"), None), limit=10, offset=100)
# etc ...
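
The limit and offset parameters lend themselves to paging through results. The sketch below shows one way to do this; iter_pages is a hypothetical helper, and fake_search is an in-memory stand-in that only mimics the (iterator, cardinality) return shape described above so the example runs without an HDT file. With a real document, document.search would be passed instead.

```python
from typing import Callable, Iterator, List, Tuple


def iter_pages(search: Callable, pattern: Tuple, page_size: int) -> Iterator[List]:
    """Yield successive pages of matches by advancing the offset (hypothetical helper)."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    offset = 0
    while True:
        triples, _cardinality = search(pattern, limit=page_size, offset=offset)
        page = list(triples)
        if not page:
            return
        yield page
        if len(page) < page_size:
            return  # a short page means there is nothing left to fetch
        offset += page_size


# In-memory stand-in for document.search (assumption, for illustration only).
DATA = [(f"ex:s{i}", "foaf:name", f"name{i}") for i in range(25)]


def fake_search(pattern, limit=0, offset=0):
    matches = [t for t in DATA
               if all(p is None or p == v for p, v in zip(pattern, t))]
    return iter(matches[offset:offset + limit]), len(matches)


pages = list(iter_pages(fake_search, (None, "foaf:name", None), page_size=10))
print([len(page) for page in pages])  # [10, 10, 5]
```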

Searching for triple IDs

A typical HDT document encodes a triple’s subject, predicate, and object as unique integers, named TripleID. For example, the triple ("ex:Toto", "ex:type", "ex:Person") can be encoded as (1, 2, 3). An rdflib_hdt.HDTDocument allows for searching RDF triples and retrieving them in this format, using the rdflib_hdt.HDTDocument.search_ids() method, which takes the same parameters as the rdflib_hdt.HDTDocument.search() method.

Note

You can transform RDF terms or RDF triples to/from TripleIDs using the rdflib_hdt.HDTDocument.from_tripleid(), rdflib_hdt.HDTDocument.to_tripleid(), rdflib_hdt.HDTDocument.term_to_id(), and rdflib_hdt.HDTDocument.id_to_term() methods.

from rdflib_hdt import HDTDocument
document = HDTDocument("test.hdt")

(triples, cardinality) = document.search_ids((None, None, None))

for s, p, o in triples:
  print(s, p, o) # will print 3-element tuples of integers

  # convert a triple ID to a string format
  print(f"TripleID {(s, p, o)} = RDF Triple {document.from_tripleid(s, p, o)}")

  # print only the subject
  print(f"ID {s} = Term {document.id_to_term(s, 0)}")
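
To make the idea of encoding terms as integers more concrete, here is a toy dictionary encoder. It is only an illustration under simplifying assumptions: ToyDictionary is a hypothetical class, it uses a single ID space in order of first appearance, and it does not reflect how HDT dictionaries are actually laid out.

```python
from typing import Dict, List, Tuple


class ToyDictionary:
    """Assign consecutive integer IDs to terms (toy model, not the HDT layout)."""

    def __init__(self) -> None:
        self._ids: Dict[str, int] = {}
        self._terms: List[str] = []

    def _id_for(self, term: str) -> int:
        # Register unseen terms; IDs start at 1.
        if term not in self._ids:
            self._terms.append(term)
            self._ids[term] = len(self._terms)
        return self._ids[term]

    def encode(self, triple: Tuple[str, str, str]) -> Tuple[int, ...]:
        return tuple(self._id_for(term) for term in triple)

    def decode(self, ids: Tuple[int, ...]) -> Tuple[str, ...]:
        return tuple(self._terms[i - 1] for i in ids)


dictionary = ToyDictionary()
first = dictionary.encode(("ex:Toto", "ex:type", "ex:Person"))
second = dictionary.encode(("ex:Titi", "ex:type", "ex:Person"))
print(first)   # (1, 2, 3)
print(second)  # (4, 2, 3)
print(dictionary.decode(second))
```

Repeated terms such as ex:type and ex:Person reuse the same ID, which is what makes the integer representation compact.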

Join evaluation

An HDT document also supports evaluating joins over a set of triple patterns.

from rdflib_hdt import HDTDocument
from rdflib import Variable
from rdflib.namespace import FOAF, RDF

document = HDTDocument("test.hdt")

# find the names of two entities that know each other
tp_a = (Variable("a"), FOAF("knows"), Variable("b"))
tp_b = (Variable("a"), FOAF("name"), Variable("name"))
tp_c = (Variable("b"), FOAF("name"), Variable("friend"))
query = set([tp_a, tp_b, tp_c])

iterator = document.search_join(query)
print(f"Estimated join cardinality: {len(iterator)}")

# Join results are produced as ResultRow, like in the RDFlib SPARQL API
for row in iterator:
  print(f"{row.name} knows {row.friend}")
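
For readers unfamiliar with joins over triple patterns, the following sketch evaluates the same kind of query naively over an in-memory list. It is not how search_join is implemented; it is a plain nested-loop join written for illustration, with variables represented as strings starting with "?" and with hypothetical helper names (match, naive_join).

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]
Binding = Dict[str, str]


def match(pattern: Triple, triple: Triple, binding: Binding) -> Optional[Binding]:
    """Extend binding so that pattern matches triple, or return None on conflict."""
    extended = dict(binding)
    for p, value in zip(pattern, triple):
        if p.startswith("?"):
            # A variable must keep the same value everywhere it appears.
            if extended.setdefault(p, value) != value:
                return None
        elif p != value:
            return None
    return extended


def naive_join(patterns: List[Triple], triples: List[Triple]) -> List[Binding]:
    """Evaluate the patterns one after another with a nested-loop join."""
    solutions: List[Binding] = [{}]
    for pattern in patterns:
        next_solutions = []
        for binding in solutions:
            for triple in triples:
                extended = match(pattern, triple, binding)
                if extended is not None:
                    next_solutions.append(extended)
        solutions = next_solutions
    return solutions


data = [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:bob", "foaf:name", "Bob"),
]
query = [
    ("?a", "foaf:knows", "?b"),
    ("?a", "foaf:name", "?name"),
    ("?b", "foaf:name", "?friend"),
]
for row in naive_join(query, data):
    print(f"{row['?name']} knows {row['?friend']}")
```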

Ordering

When searching for triples (in either string or TripleID format), results are ordered by (subject, predicate, object). However, this is an order on triple IDs, not on string values. For example, ("ex:2", "ex:type", "ex:Person") < ("ex:1", "ex:type", "ex:Person") when their triple ID counterparts are (1, 2, 3) and (2, 2, 3), respectively.
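
The difference between the two orders can be reproduced with plain Python sorting. The ID assignments below are hypothetical and simply mirror the example above.

```python
# Hypothetical ID assignments mirroring the example above.
subject_ids = {"ex:2": 1, "ex:1": 2}
predicate_ids = {"ex:type": 2}
object_ids = {"ex:Person": 3}

triples = [
    ("ex:1", "ex:type", "ex:Person"),
    ("ex:2", "ex:type", "ex:Person"),
]


def id_key(triple):
    """Map a triple of strings to its triple of IDs for sorting."""
    s, p, o = triple
    return (subject_ids[s], predicate_ids[p], object_ids[o])


by_string = sorted(triples)
by_id = sorted(triples, key=id_key)

print([t[0] for t in by_string])  # ['ex:1', 'ex:2']
print([t[0] for t in by_id])      # ['ex:2', 'ex:1']
```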

For more details about this topic, please refer to the HDT journal article.

Handling non UTF-8 strings in Python

If the HDT document contains strings that are not valid UTF-8 and it is not running in safe mode, querying the document may raise a UnicodeDecodeError. More details on how strings are converted from C++ to Python are available here.

To allow fine control over string conversion in this case, the API of the HDT document is duplicated by the following methods (parameters and documentation are the same as for their standard counterparts):

  • rdflib_hdt.HDTDocument.search_triples_bytes() returns an iterator of triples as (py::bytes, py::bytes, py::bytes)

  • rdflib_hdt.HDTDocument.search_join_bytes() returns an iterator of sets of solution mappings as py::set(py::bytes, py::bytes)

  • rdflib_hdt.HDTDocument.convert_tripleid_bytes() returns a triple as: (py::bytes, py::bytes, py::bytes)

  • rdflib_hdt.HDTDocument.convert_id_bytes() returns a py::bytes

from rdflib_hdt import HDTDocument

# Load an HDT file.
# Missing indexes are generated automatically; pass False as the second argument to disable this.
document = HDTDocument("test.hdt")
it = document.search_triples_bytes("", "", "")

for s, p, o in it:
  print(s, p, o) # print b'...', b'...', b'...'
  # now decode it, or handle any error
  try:
    s, p, o = s.decode('UTF-8'), p.decode('UTF-8'), o.decode('UTF-8')
  except UnicodeDecodeError as err:
    # try another codec
    pass
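
The "try another codec" step can be factored into a small helper. The sketch below is one possible approach using only the standard library; decode_term and its default codec list are assumptions chosen for illustration and should be adapted to the encodings actually present in your data.

```python
from typing import Sequence


def decode_term(raw: bytes, codecs: Sequence[str] = ("utf-8", "latin-1")) -> str:
    """Decode raw bytes with the first codec that succeeds (hypothetical helper)."""
    for codec in codecs:
        try:
            return raw.decode(codec)
        except UnicodeDecodeError:
            continue
    # Last resort: keep what can be decoded and mark the invalid bytes.
    return raw.decode("utf-8", errors="replace")


print(decode_term("café".encode("utf-8")))    # valid UTF-8, decoded directly
print(decode_term("café".encode("latin-1")))  # invalid UTF-8, falls back to latin-1
```

Note that latin-1 accepts any byte sequence, so placing it last in the list means the replacement fallback is only reached when a custom codec list is supplied.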