Thursday, May 13, 2004

Addendum
I noticed someone referring to the MySQL indexes I described last Monday. I thought I'd better make an extra comment on this.

These indexes allow triples (or quads in this case) to be easily found given any combination of subject, predicate, object and model. However, the indexes don't come for free! For a start, having this number of indexes will seriously slow down loading data. Administrators of large databases normally drop indexes to do insertions, and then add them again afterwards for just this reason. Unfortunately, adding indexes like this to a large amount of data will take a long time as well (just not as long as adding to indexed tables).

Also important is that these indexes all take up space themselves. Significant space. Disk is cheap, so this may not be an issue, but it is definately something to keep in mind.

Kowari approaches things a little differently. It stores data in 6 different orders, corresponding to:

  1. Subject, Predicate, Object, Model
  2. Predicate, Object, Subject, Model
  3. Object, Subject, Predicate, Model
  4. Model, Subject, Predicate, Object
  5. Model, Predicate, Object, Subject
  6. Model, Object, Subject, Predicate
Each ordering effectively forms a binary-searchable index. If you look carefully, you'll notice that any combination of 1, 2, 3 or 4 elements can be searched for with an appropriate index. For instance, if I'm looking for all instances of a predicate in a particular model (a predicate-model search) I can use the 5th index above. All relationships between a given subject and object (a subject-object search) can be done with the 2nd index. To search for this again within a given model only (ie. a subject-object-model search) would use the 6th index.

So Kowari is able to re-use its indexes for different types of queries while MySQL has to needs new ones for each different type. For instance, the Kowari index of (Subject, Predicate, Object, Model) fulfills the requirements of the following MySQL indexes:
  • Subject
  • Subject,Predicate
  • Subject,Predicate,Object
  • Subject,Predicate,Object,Model

When we were back at using just 3 elements, there were 6 possible indexes available. To search for any combination of subject, predicate and object could be done with either of 2 mutually exclusive sets of 3 indexes. We chose:
  1. Subject, Predicate, Object
  2. Predicate, Object, Subject
  3. Object, Subject, Predicate
But we could have chosen the other set.

When we moved to a fourth element (the model), we went through all the combinations available for the indexes (all 24 of them: 4! = 24), and it turned out that there are several sets of indexes which can be used to search for any combination of elements. This time the sets each had 6 indexes in them. We chose the set that most closely resembled our current indexes (notice that they match the 3-element indexes, with the model on the end, and again with the model on the front). Not only did that set of indexes make more intuitive sense for us (some of the other sets had the elements in a haphazard order) but it allowed us to reuse a lot of code without modification.

The cost of all these indexes is drive space, although it is much less than a relational database would use. It also means that data must be inserted into 6 different indexes, and it must be ordered as it goes in. This is the main reason why bulk insertions are limited in speed, though recently DM has done a lot to improve this. (One possible improvement would be to load the data unordered, and then order it for each of the indexes in the background, much as I mentioned in previous blogs for the hashtable based indexes. This is the equivalent of an SQL database dropping its indexes for an insert and adding the indexes afterwards). The purpose of the hashtable based indexes that I've discussed in the last few days has been to try and speed these load times. It also would speed up queries to O(1), as they are currently O(ln(n)) due to the binary searching of the index.

DM has noted on several occasions that usage patterns mean we only tend to use 4 of out 6 indexes. We could chop 30% off our storage requirements and loading speed if we were to remove them, but if we ever need them again, then searching would be painfully slow. If we write a new open source version of a triplestore, and Tucana choose to adopt it for Kowari/TKS, then this would not be an issue anyway.

No comments: