Wrongful Indexing
Some years ago I commented on the number and type of indexes that can be used for tuples. At the time, I pointed out that indexing triples required 3 indexes, and there were 2 appropriate sets of indexes to use. Similarly, quads can be indexed with 6 indexes, and there are 4 such sets. (5-tuples get silly, requiring 10 indexes, with 12 possible sets to choose from.) In each case, I said that each set of indexes would work just as well as the others, and so I always selected the set that included the natural ordering of the tuples.
So for RDF triples, the two sets of indexes are ordered by:
- subject,predicate,object
- predicate,object,subject
- object,subject,predicate

and:
- object,predicate,subject
- predicate,subject,object
- subject,object,predicate
For convenience I have always chosen the first set, as this includes the natural ordering of subject/predicate/object, but it looks like I was wrong.
In using these indexes I've always presumed random 3-tuples, but in reality the index is representing RDF. Whenever I thought about the data I was looking for, this seemed OK, but that's because I tended to think about properties on resources, and not other RDF structures. In particular, I was failing to consider lists.
Since first building RDF indexes (2001) and writing about them (2004) I've learnt a lot about functional programming. This, in turn, led to an appreciation of lists, particularly in algorithms. I'm still not enamored of them in on-disk structures, but I do appreciate their utility and elegance in many applications. So it was only natural that when I was representing RDF graphs in Scala and needed to read lists, I used some trivial recursive code to build a Scala list, and it all looked great. But then I decided to port the Graph class to Java to avoid including the Scala JARs in a really lightweight library.
I'd like to point out that I'm talking about a library function that can read a well-formed RDF list and return a list in whatever programming language the library is implemented in. The remainder of this post presumes that lists are well formed, since anything else can never be returned as a list through an API anyway.
Reading a list usually involves the subject/predicate/object (SPO) index. You start by looking up the head of the list as a subject, then the predicates rdf:first for the data at that point in the list, and rdf:rest for the rest of the list. Rinse and repeat until rdf:rest yields a value of rdf:nil. So for each node in the list, there is a lookup by subject, followed by two lookups by predicate. This is perfect for the SPO index.
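As a minimal sketch of that loop (the Node type, Graph interface, and getObject method here are hypothetical stand-ins for whatever a real library provides, not any particular API):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical node type; a real library would distinguish IRIs,
// blank nodes and literals.
record Node(String value) {
    static Node uri(String u) { return new Node(u); }
}

// Hypothetical graph access backed by an SPO index.
interface Graph {
    // Returns the single object for a subject/predicate pair, or null if
    // there is none. A well-formed list has exactly one rdf:first and one
    // rdf:rest per node, so a single value is all we need.
    Node getObject(Node subject, Node predicate);
}

class SpoListReader {
    static final Node RDF_FIRST = Node.uri("http://www.w3.org/1999/02/22-rdf-syntax-ns#first");
    static final Node RDF_REST  = Node.uri("http://www.w3.org/1999/02/22-rdf-syntax-ns#rest");
    static final Node RDF_NIL   = Node.uri("http://www.w3.org/1999/02/22-rdf-syntax-ns#nil");

    // One lookup by subject, then two lookups by predicate, for each node.
    static List<Node> readList(Graph graph, Node head) {
        List<Node> result = new ArrayList<>();
        for (Node node = head; !node.equals(RDF_NIL); node = graph.getObject(node, RDF_REST)) {
            result.add(graph.getObject(node, RDF_FIRST)); // the data at this node
        }
        return result;
    }
}
```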
However, it's been bugging me that I have such a general approach, when the structure is predetermined. Why look up these two predicates so generally, when we know exactly what we want? What if we reduce the set we're looking in to just the predicates that we want and then go looking for the subjects? That would mean looking first by predicate, then subject, then object, leading to a PSO index. So what does that algorithm look like?
First, look up the rdf:rest predicate, leading to an index of subject/object pairs containing all the list structures. Next, look up the rdf:first predicate, retrieving the subject/object pairs containing all the list data. Iterating down the list no longer involves finding the subject followed by the predicate in order to read the next list node; it just requires finding the subject, and the next list node is in the corresponding object. Similarly with the data stored in the node. We're still doing a fixed number of lookups in an index, which means that the overall complexity does not change at all. Tree indexes will still give O(log(N)) complexity, and hash indexes will still give O(1) complexity. However, each step can involve disk seeks, so it's worth seeing the difference.
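Here's the same reader restructured for the predicate-first approach, reusing the hypothetical Node type and RDF constants from the sketch above. The PsoIndex interface is also an assumption: one lookup by predicate yields the subject-to-object mapping for every statement using that predicate (on disk this would be the predicate's subtree of the PSO index, not necessarily a map held in memory):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical predicate-first index: a single lookup by predicate returns
// the subject -> object mapping for that predicate.
interface PsoIndex {
    Map<Node, Node> subjectToObject(Node predicate);
}

class PsoListReader {
    static List<Node> readList(PsoIndex index, Node head) {
        // Initial setup: two lookups across the whole graph, by predicate.
        Map<Node, Node> firsts = index.subjectToObject(SpoListReader.RDF_FIRST);
        Map<Node, Node> rests  = index.subjectToObject(SpoListReader.RDF_REST);

        // Iterating the list now searches only among the subjects that are
        // list nodes, rather than among every subject in the graph.
        List<Node> result = new ArrayList<>();
        for (Node node = head; !node.equals(SpoListReader.RDF_NIL); node = rests.get(node)) {
            result.add(firsts.get(node)); // the data stored at this node
        }
        return result;
    }
}
```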
To compare more directly, using an SPO index requires every node to:
- Lookup across the entire graph by subject.
- Lookup across the subject (2 or 3 predicates) for rdf:first.
- Lookup across the subject (2 or 3 predicates) for rdf:rest.
For the PSO index we have some initial setup:
- Lookup across the entire graph for the rdf:first predicate.
- Lookup across the entire graph for the rdf:rest predicate.
Then for every node:
- Lookup the rdf:first data for the value.
- Lookup the rdf:rest data for the next node.
It's important to note a few things, particularly for tree indexes. Trees are the most likely structure for data stored on disk, so I'm going to concentrate on them. The number of subjects in a graph tends to scale up with the size of the graph, while the number of predicates is bounded. This is because predicates are used to express a model, with each predicate indicating a certain relationship. Any system trying to deal with the model needs some idea of the concepts it is dealing with, so it's almost impossible to deal with completely arbitrary relationships. If we know what the relationships are ahead of time, then there must be a fixed number of them. In contrast, subjects represent individuals, and these can be completely unbounded. So if we look across an entire graph to find a particular subject, then we may have to dive down a very deep tree to find that subject. Looking across the entire graph for a given predicate will never have to go very deep, because there are so few of them.
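To put some rough numbers on this (purely illustrative, and assuming a B-tree with a branching factor of 100): indexing N keys needs about log_100(N) levels, so 100 million subjects means log_100(10^8) = 4 levels, each of which may cost a disk seek, while a vocabulary of, say, 50 predicates fits within a single level.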
So the first algorithm (using the SPO index) iteratively looks across every subject in the graph for each node in the list. The next two lookups are trivial, since nodes in a list will only have the properties rdf:first, rdf:rest, and possibly rdf:type. The data associated with these properties will almost certainly be in the same block where the subject was found, meaning that there will be no more disk seeks.
The second algorithm (using the PSO index) does a pair of lookups across every predicate in the graph. The expected number of disk seeks to find the first predicate is significantly smaller than for any of the "subject" searches in the first algorithm. Given how few predicates are in the system, finding the second predicate may barely involve any disk seeks at all, particularly since the first search will have populated the disk cache with a good portion of the tree, and the similarity in the URIs of the predicates is likely to place both predicates very close to each other. Of course, this presumes that the predicates are even in a tree. Several systems (including one I'm writing right now) treat predicates differently because of how few there are. Indeed, a lot of systems will cache them in a hashtable, regardless of the on-disk structure. So the initial lookup is very inexpensive.
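A cache like that might look something like this sketch (again using the hypothetical PsoIndex and Node types from earlier; real systems will differ):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of an in-memory predicate cache. Because the set of predicates is
// small and bounded, the result of each predicate-level lookup can be kept
// in a hashtable, so repeated lookups of the same predicate never touch the
// on-disk structure again.
class CachingPsoIndex implements PsoIndex {
    private final PsoIndex onDisk;
    private final Map<Node, Map<Node, Node>> byPredicate = new HashMap<>();

    CachingPsoIndex(PsoIndex onDisk) { this.onDisk = onDisk; }

    @Override
    public Map<Node, Node> subjectToObject(Node predicate) {
        // Only the first lookup for a given predicate goes to disk.
        return byPredicate.computeIfAbsent(predicate, onDisk::subjectToObject);
    }
}
```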
The second algorithm then iterates down the list, just like the first one does. However, this time, instead of searching for the nodes among every subject in the graph, it is just searching for them among the subjects that appear as list nodes. While lists are commonly used in some RDF structures, the subjects in all the lists typically form a very small minority of all the subjects in a graph. Consequently, depending on the type and depth of the trees being used, iterating through a list with the second algorithm could result in two or three (or more) fewer disk seeks for each node. That's a saving that can add up.
Solid State Disks
I've been talking about disk seeks, but this is an artificial restriction imposed by spinning disk drives. Solid State Disks (SSDs) don't have this limitation.
People have been promoting SSDs for some years now, but I've yet to use them myself. In fact, most people I know are still using traditional spinning platters. The price difference is still a big deal, and for really large data, disk drives are still the only viable option. But this will change one day, so am I right to be concerned about disk seeks?
Disk seeks are a function of data locality. When data has to be stored somewhere else on a disk, the drive head must physically seek across the surface to this new address. SSDs don't require anything to move, but there are still costs in addressing scattered data.
While it is possible to address every bit of memory in a device in one step, in practice this is never done. This is because the complexity of the circuit grows exponentially as you try to address more and more data in one step. Instead, the memory is broken up into "banks". A portion of the address can now be used to select a bank, allowing the remaining bits in the address to select the required memory in just that bank. This works well, but it does lead to some delays. Selecting a new bank requires "setup", "hold" and "settling" times, all leading to delays. These delays are an order of magnitude smaller than seek delays for a spinning disk, but they do represent a limit on the speed of the device. So while SSDs are much faster than disk drives, there are still limits to their speed, and improvements in data locality can still have a significant impact on performance.
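Purely as an illustration of the idea (the 32-bit addresses and 16 banks below are assumptions for the example, not a description of any real device), splitting an address into a bank selector and an in-bank offset looks like this:

```java
// Illustrative only: a few high bits of a flat address select a bank, and
// the remaining bits address memory within that bank.
class BankAddressing {
    static final int BANK_BITS = 4;                // 2^4 = 16 banks
    static final int OFFSET_BITS = 32 - BANK_BITS; // 28 bits within a bank

    static int bank(int address)   { return address >>> OFFSET_BITS; }
    static int offset(int address) { return address & ((1 << OFFSET_BITS) - 1); }
}
```

Every time the bank bits change between consecutive accesses, the device pays those setup and settling delays, which is why data locality still matters on an SSD.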