Tuesday, December 02, 2008

Dropping Indexes

One of the optimizations I'm making for XA 1.1 is the removal of 3 of our 6 statement indexes. The reason for this is pretty clear: they're almost never used. Why would I want to double our space, and double our contention for the hard drive, for data structures that are superfluous?

To date, Mulgara's indexes have been completely symmetric. I still want to maintain this with respect to subject, predicate and object, but I don't really see the need for it with graphs. (That said, the 2-column index in XA2 will have optimizations around common predicates, but in general there will still be symmetry.) I've had people say that they want to use millions of graphs, but in reality I've yet to see it. The query languages (TQL, SPARQL, etc.) haven't really supported large numbers of graphs anyway.

The index orderings we've had to date have been:
  SPOG
  POSG
  OSPG
  GSPO
  GPOS
  GOSP
For G=Graph, S=Subject, P=Predicate, O=Object.

For anyone unfamiliar with these indexes, they permit a group of statements to be found given any possible pattern of 0, 1, 2, 3 or 4 elements.
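
As a rough illustration (this is not Mulgara's actual resolver code), choosing an index for a constraint pattern amounts to finding an ordering whose leading columns are exactly the bound positions; that index can then answer the pattern with a single range scan:

  // Illustrative sketch only: mapping a set of bound positions to an index ordering.
  import java.util.EnumSet;
  import java.util.List;
  import java.util.Set;

  public class IndexChooser {

    enum Pos { S, P, O, G }

    // The six orderings listed above. An ordering can serve a pattern when the
    // bound positions form a prefix of that ordering.
    static final List<List<Pos>> ORDERINGS = List.of(
        List.of(Pos.S, Pos.P, Pos.O, Pos.G),
        List.of(Pos.P, Pos.O, Pos.S, Pos.G),
        List.of(Pos.O, Pos.S, Pos.P, Pos.G),
        List.of(Pos.G, Pos.S, Pos.P, Pos.O),
        List.of(Pos.G, Pos.P, Pos.O, Pos.S),
        List.of(Pos.G, Pos.O, Pos.S, Pos.P));

    /** Returns an ordering whose leading columns are exactly the bound positions. */
    static List<Pos> choose(Set<Pos> bound) {
      for (List<Pos> ordering : ORDERINGS) {
        if (Set.copyOf(ordering.subList(0, bound.size())).equals(bound)) return ordering;
      }
      throw new IllegalStateException("No ordering covers: " + bound);
    }

    public static void main(String[] args) {
      System.out.println(choose(EnumSet.of(Pos.S, Pos.P)));  // [S, P, O, G]
      System.out.println(choose(EnumSet.of(Pos.G, Pos.O)));  // [G, O, S, P]
    }
  }

With those six orderings, every combination of 0 to 4 bound positions matches the prefix of at least one index, which is what makes the set complete.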

The first 3 indexes allow for searching on statements that may occur in any graph. However, almost all queries identify the graphs to be searched in, meaning that we always end up binding the "graph" node before looking for statements. That means that the first 3 indexes are almost never used. However, it's the "almost" which is my problem at the moment.

Fortunately, the first 3 indexes can be easily emulated with our "System graph". This contains a list of all the known graphs, particularly the graphs stored with the "System Resolver" (this is the part of the system that uses the above indexes). Using this information, it is possible to pre-bind the graph node for every possible query. However, I really want to do this at the lowest possible level, so that the interface on the resolver remains unchanged.

Dropping the first 3 indexes went smoothly, and 97% of the tests still work (OK, it's 96.99%, but who's counting?). However, the emulation of these indexes will probably take me a few days. That's a shame, as I'd prefer to get it all into the next release, but since I want to do a release before I go to Australia for Christmas (on Monday), I'm pretty sure I can't do it in time (not if I want robust testing anyway).

Compromises

Emulating the indexes that allow unbound graphs means that I'll need to bind the graph to a series of ALL the graphs in the system. Then for each of those graphs, I'll need to re-execute the resolution of the graph pattern being resolved. That means that these types of queries will increase in complexity with the number of graphs in the system. This goes completely against what we want in Mulgara, but as I said, it's such a rarely used feature that the cost seems acceptable.

I had thought that I'd be doing a query to find the graphs, and then join this to the resolution of the graph pattern that we want, but that failed to take several things into account. First, resolutions from the resolver come back in a particular order, and the kind of join I was proposing was not going to be ordered the way we wanted (it would have been ordered by graph first, and only then ordered within each graph). Reordering may have been prohibitively expensive (depending on context), so this was out.

It was while thinking this through that I realized I can create a new Tuples "append" operation. The new append will take arguments that all have the same variables and the same ordering, and will perform a streamed merge-sort. This should give me exactly what I want.
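
To sketch the idea, assuming a simplified stand-in for Tuples that just streams rows of long values in a fixed order (not the real interface), the streamed merge can be driven by a priority queue holding the head row of each input:

  // Sketch of a streamed merge over inputs that share variables and ordering.
  import java.util.Arrays;
  import java.util.Iterator;
  import java.util.List;
  import java.util.NoSuchElementException;
  import java.util.PriorityQueue;

  public class MergeAppend implements Iterator<long[]> {

    // One entry per input: the current head row and the iterator it came from.
    private record Head(long[] row, Iterator<long[]> source) {}

    private final PriorityQueue<Head> queue =
        new PriorityQueue<>((a, b) -> Arrays.compare(a.row(), b.row()));

    public MergeAppend(List<Iterator<long[]>> inputs) {
      for (Iterator<long[]> input : inputs) {
        if (input.hasNext()) queue.add(new Head(input.next(), input));
      }
    }

    @Override public boolean hasNext() { return !queue.isEmpty(); }

    @Override public long[] next() {
      if (queue.isEmpty()) throw new NoSuchElementException();
      Head head = queue.poll();
      // Pull the next row from the same input so the merge stays streamed.
      if (head.source().hasNext()) queue.add(new Head(head.source().next(), head.source()));
      return head.row();
    }
  }

Because every input shares the same variables and the same ordering, the output comes out in that order too, and each input is read exactly once, so nothing has to be buffered or re-sorted.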

So the next thing I need is the complete list of graphs to bind the "G" node to when querying the indexes. I had thought that I'd be doing a query of the system graph for this, but my thinking has moved on from there. To start with, in order to make this query, I'll need the local node values for <rdf:type>, the URI for the "type" of graphs stored in the system resolver, and the system graph itself (a relative URI of <#>). The creation of these occurs during bootstrapping, which is fortunately over before any possibility of my "unusual" queries.

While thinking about getting the local node values for these URIs, it occurred to me that a similar mechanism can be used to record whenever a graph is created in the system graph. That means that I can store each of the graphs in a list (and re-populate this list on startup with a simple constraint resolution). This list then becomes the basis for the graph bindings when I'm trying to emulate the missing indexes.
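
Purely as a hypothetical sketch (the names here are mine, not Mulgara's), such a list only needs to accumulate graph nodes as they are recorded in the system graph, and accept a re-population pass at startup:

  // Hypothetical registry of known graph nodes (local node values).
  import java.util.Set;
  import java.util.concurrent.CopyOnWriteArraySet;

  public class GraphRegistry {

    // Local node values of every graph known to the system resolver.
    private final Set<Long> graphNodes = new CopyOnWriteArraySet<>();

    /** Called whenever a graph is recorded in the system graph. */
    public void graphCreated(long graphNode) { graphNodes.add(graphNode); }

    /** Called whenever a graph is removed. */
    public void graphDropped(long graphNode) { graphNodes.remove(graphNode); }

    /** Re-populates the list at startup from a simple constraint resolution. */
    public void populate(Iterable<Long> resolvedGraphs) {
      for (long g : resolvedGraphs) graphNodes.add(g);
    }

    /** The bindings to substitute for "G" when emulating the dropped indexes. */
    public Set<Long> knownGraphs() { return Set.copyOf(graphNodes); }
  }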

My first concern was that this might take too much space, thereby limiting the number of graphs that someone can have (as I said, some people have proposed using millions), but then I realized that my merge operation was going to need to reference the same number of resolutions as there are graphs, and this would take more RAM anyway. It's really a moot point, since the system would choke from performing a million lookups before you need to worry about an Out Of Memory condition. All this reminds me... I shouldn't worry too much about optimizations at such an early juncture. Premature optimization is the root of all evil.

Anyway, I'll probably spend a day on this, and may even get it all going, but I won't have it tested in time for a release before the weekend. I'd better let Amit (my boss) know that he won't get it until Christmas. :-)

Size

Disk usage is probably the second most common question I get about Mulgara, after speed. So to complement the plots from Monday, I've also plotted the disk usage for these "number" graphs.

The lowest line represents the space being used for URIs and literals, and the middle line is for the statements themselves. For convenience, the top line is the sum of the other two.

This storage mechanism is doing no compression on the data whatsoever. The current code in XA2 is already using an order of magnitude less space, both because of more intelligent storage, and also because many blocks will be gzip compressed in our structures. Andrae's reasoning for that is that while CPUs are getting faster all the time, disks are not. This means that any processing we do on the data is essentially free, since the CPU work can usually be done in less time than it takes to wait for a hard drive to return a result, even a solid state drive.
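
As a rough illustration of that trade-off, using nothing more than java.util.zip (XA2's actual block handling is more involved than this), compressing a block before it is written and decompressing it on read looks something like this:

  // Sketch: gzip a block on the way to disk, and un-gzip it on the way back.
  import java.io.ByteArrayInputStream;
  import java.io.ByteArrayOutputStream;
  import java.io.IOException;
  import java.util.zip.GZIPInputStream;
  import java.util.zip.GZIPOutputStream;

  public class BlockCompression {

    static byte[] compress(byte[] block) throws IOException {
      ByteArrayOutputStream out = new ByteArrayOutputStream(block.length);
      try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
        gzip.write(block);
      }
      return out.toByteArray();
    }

    static byte[] decompress(byte[] stored) throws IOException {
      try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(stored))) {
        return gzip.readAllBytes();
      }
    }
  }

The CPU time spent in those two methods is typically far smaller than the time saved by reading and writing fewer blocks.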

I should note that these graphs are all from a version of XA1.1 that is not yet released (it's in SVN, but not in the trunk yet). I've been hoping to get this into the next release, but because I'm doing a release by the end of this week, I'm thinking it will have to be in the release after that (before Christmas).

Monday, December 01, 2008

Mulgara Stats

The one question everyone asks me about Mulgara is always some variation on "How does it scale?" It's never easy to answer, as it depends on the hardware you're using, the OS (Linux does memory mapping better than Windows), and of course, the data.

I wanted to put off benchmarking until XA2 was released (early next year). I've also been hoping to have decent hardware to do it with, though I'm less sure about when that might happen. However, I've improved things by releasing the XA1.1 system recently, and it doesn't hurt to see how things run on desktop hardware.

RDF data varies according to the number of triples, the number of distinct URIs and literals, and the size of the literals. Some data uses only a few predicates, a modest number of URIs, and enormous literals. Other data uses only a few short literals, and has lots of URIs. Then there is the number of triples being stored. As an example, doing complete RDFS inferencing will introduce no new resources, but can increase the number of triples in a graph by an order of magnitude.

There are various standard data sets, and I intend to explore them, but in the meantime I'm going with a metric that Kowari/TKS used back in the days of Tucana, when we had a big 64-bit machine with fast RAID storage. Funnily enough, these days I have a faster CPU, but I still don't have access to storage that is as fast as that box had.

The data I've been working with is the "Numbers" data set that I first created about 4 years ago. I tweaked the program a little bit, adding a couple of options and updating the output. There are probably better ways to model numbers, but the point of this is just to grow a large data set, and it does that well. You can find the code here.
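
Purely as a hypothetical illustration of the approach (the generator linked above may model the data quite differently, and the real files were RDF/XML rather than N-Triples), a program that emits a handful of triples per number might look like this:

  // Hypothetical sketch of a "numbers" data generator writing N-Triples.
  // The namespace and properties are invented for illustration.
  import java.io.IOException;
  import java.io.PrintWriter;
  import java.nio.file.Files;
  import java.nio.file.Path;

  public class NumbersGenerator {

    static final String NS = "http://example.org/numbers#";  // assumed namespace

    public static void main(String[] args) throws IOException {
      long max = Long.parseLong(args[0]);  // e.g. 30000000
      try (PrintWriter out = new PrintWriter(Files.newBufferedWriter(Path.of("numbers.nt")))) {
        for (long n = 1; n <= max; n++) {
          String s = "<" + NS + "n" + n + ">";
          out.println(s + " <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <" + NS + "Number> .");
          out.println(s + " <" + NS + "value> \"" + n + "\"^^<http://www.w3.org/2001/XMLSchema#long> .");
          if (n > 1) out.println(s + " <" + NS + "previous> <" + NS + "n" + (n - 1) + "> .");
          if (n % 2 == 0) out.println(s + " <" + NS + "half> <" + NS + "n" + (n / 2) + "> .");
        }
      }
    }
  }

Whatever the exact modelling, the point is that the triple count grows linearly with the largest number generated, which is all a size benchmark needs.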

Hardware

The computer I've been using is my MacBook Pro, which comes with the following specs:
  Mac OS 10.5.5
  2.6 GHz Intel Core 2 Duo
  4GB 667 MHz DDR2 SDRAM
  4MB L2 Cache
  HDD: Hitachi HTS722020K9SA00
    186.31GB
    Native Command Queuing: Yes
    Queue Depth: 32
  File System: Journaled HFS+

Note that there is nothing about this machine that is even slightly optimized for benchmarking. If I had any sense, I'd be using Linux, and I wouldn't have a journaled filesystem (since Mulgara maintains its own integrity). Even if I couldn't have RAID, it would still be beneficial to use another hard drive. But as I said, this is a standard desktop configuration.

Also, being a desktop system, it was impossible to shut down everything else, though I did turn off backups, and had as few running programs as possible.

The Test

I used a series of files, generating numbers to each million mark, up to 30 million. The number of triples was approximately 8-9 times the largest number, with the numbers from 1 to 30 million generating 267,592,533 triples, or a little over a quarter of a billion triples.

Each load was done with a clean start to Mulgara, and was done in a single transaction. The data was loaded from a gzipped RDF/XML file. I ignored caching in RAM, since the data far exceeded the amount of RAM that I had.

At the conclusion of the load, I ran a query to count the data. We still have linear counting complexity, so this is expected to be an expensive operation (this will change soon).

Due to the time needed for larger loads, I skipped most of the loads in the 20 millions. However, the curve for load times is smooth enough that interpolation is easy. The curve for counting is all over the place, but you'll have to live with that.


The axis on the left is the number of seconds for loading, and the axis on the right is the number of seconds for counting. The X-axis is the number of triples loaded.

Counting was less than a second up to the 8 million mark (70.8 million triples). This would be because most of the index could fit into memory. While the trees in the indexes do get shuffled around as they grow, I don't think that explains the volatility in the counting times. I'm guessing that external processes had a larger influence here, since the total time was still within just a few minutes (as opposed to the full day required to load the quarter billion triples in the final load).

Overall, the graph looks to be gradually increasing beyond linear growth. From experience with tests on XA1, we found linear growth, followed by an elbow, and then an asymptotic approach to a new, much steeper gradient. This occurred at the point where RAM could no longer effectively cache the indexes. If that is happening here, then the new gradient is still somewhere beyond where I've tested.

My next step is to start profiling load times with the XA1 store. I don't have any real comparison here, except that I know that there is a point somewhere in the middle of this graph (depending on my RAM) where XA1 will suddenly turn upwards. I've already seen this from Ronald's tests, but I've yet to chart it against this data.

I'm also very excited to see how this will compare with XA2. I'm meeting Andrae in Brisbane next week, so I'll find out more about the progress then.