Tuesday, December 02, 2008

Dropping Indexes

One of the optimizations I'm making for XA 1.1 is the removal of 3 of our 6 statement indexes. The reason for this is pretty clear: they're almost never used. Why would I want to double our space, and double our contention for the hard drive on data structures that are superfluous?

To date, Mulgara's indexes have been completely symmetric. I still want to maintain this with respect to subject, predicate and object, but I don't really see the need for it with graphs. (That said, the 2-column index in XA2 will have optimizations around common predicates, but in general there will still be symmetry). I've had people say that they want to use millions of graphs, but in reality I've yet to see it. The query languages (TQL, SPARQL, etc) haven't really supported large numbers of graphs anyway.

The index orderings we've had to date have been:
  SPOG
POSG
OSPG
GSPO
GPOS
GOSP
For G=Graph, S=Subject, P=Predicate, O=Object.

For anyone unfamiliar with these indexes, they permit a group of statements to be found given any possible pattern of 0, 1, 2, 3 or 4 elements.

The first 3 indexes allow for searching on statements that may occur in any graph. However, almost all queries identify the graphs to be searched in, meaning that we always end up binding the "graph" node before looking for statements. That means that the first 3 indexes are almost never used. However, it's the "almost" which is my problem at the moment.

Fortunately, the first 3 indexes can be easily emulated with our "System graph". This countains a list of all the known graphs, particularly the graphs stored with the "System Resolver" (this is the part of the system that uses the above indexes). Using this information, it is possible to pre-bind the graph node for every possible query. However, I really want to do this at the lowest possible level, so the interface on the resolver remains unchanged.

Dropping the first 3 indexes went smoothly, and 97% of the tests still work (OK, it's 96.99%, but who's counting?). However, the emulation of these indexes will probably take me a few days. That's a shame, as I'd prefer to get it all into the next release, but since I want to do a release before I go to Australia for Christmas (on Monday) then I'm pretty sure I can't do it in time (not if I want robust testing anyway).

Compromises

Emulating the indexes which allow unbound graphs, means that I'll need to bind the graph to a series of ALL the graphs in the system. Then for each of those graphs, I'll need to re-execute the resolution of the graph pattern being resolver. That means that for these types of queries, then it will increase in complexity with the number of graphs in the system. This goes completely against what we want in Mulgara, but as I said, it's such a rarely used feature that the cost seems mitigated.

I had thought that I'd be doing a query to find the graphs, and then join this to the resolution of the graph pattern that we want, but that failed to take several things into account. First, resolutions from the resolver come back with a particular order, and the kind of join I was proposing was not going to be ordered the way we wanted (it would have been ordered for within each graph, and then ordered within the graph). Reordering may have been prohibitively expensive (depending on context), so this was out.

It was while thinking through on this that I realized I can create a new Tuples "append" operation. The new append will take arguments that all have the same variables and the same ordering, and will perform a streamed merge-sort. This should give me exactly what I want.

So the next thing I need is the complete list of graphs to bind the "G" node to when querying the indexes. I have thought that I'd be doing a query of the system graph for this, but my thinking has moved on from there. To start with, in order to make this query, I'll need the local node value for <rdf:type> the URI for the "type" of graphs stored in the system resolver, and the system graph itself (a relative URI of <#>). The creation of these occurs during bootstrapping, and is fortunately over before any possibility of my "unusual" queries.

While thinking about getting the local node values for these URIs, it occurred to me that something similar to the mechanism to do this can be used to record whenever a graph is being created in the system graph. That means that I can store each of the graphs in a list (and re-populate this list on startup with a simple constraint resolution). This list then becomes the basis for the graph bindings when I'm trying to emulate the missing indexes.

My first concern was that this might take too much space, thereby limiting the number of graphs that someone can have (as I said, some people have proposed using millions), but then I realized that my merge-join was going to need to reference the same number of resolutions as the number of graphs, and this would take more RAM anyway. It's really a moot point anyway, since the system would choke from performing a million lookups before you need to worry about an Out Of Memory condition. All this reminds me... I should worry too much about optimizations at such at early juncture. Premature optimization is the root of all evil.

Anyway, I'll probably spend a day on this, and may even get it all going, but I won't have it tested in time for a release before the weekend. I'd better let Amit (my boss) know that he won't get it until Christmas. :-)

4 comments:

Unknown said...

... depends on where you define the start of premature optimization ;-) If you skip the graph-indexes and XA2 optimizes the predicates column (if I understood that correctly), where will there be room left for some (efficient) flexibility? I already thought about moving from complex contexts to refining vocabularies through subproperties (at least partiallly). But if the properties column gets optimized too and the vocabularies are already optimized - where's the room where the not so streamlined usecases live?

Paula said...

The room for flexibility comes from the new indexing scheme. Instead of storing the statements 3 (or 6) different ways, the statements only get stored once. The URIs then get mapped to 3 (or 4) lists of statements that those URIs participate in. The first list is where the URI is used as a subject. The second list is where it is used as a predicate, and the third list is where it gets used as an object. If a fourth list is used, then that's where the URI is the graph.

The "general" plan is that each URI is stored in a tree, and the nodes in the tree contain the heads of these lists. However, a limited number of predicates can have their lists stored in their own files, meaning that you don't have to go to the tree to find them. Any predicate URI that doesn't get mapped to a file, will fall back to being searched for in the tree.

Clear as mud? :-)

Unknown said...

clear as mud! but hey: i just trust you, like i always did ;-)

nicodemus said...

can someone please explain to me what building an enterprise scale-data management system really means for someone who knows close to nothing about these data collection systems? Thanks!