Monday, December 01, 2008

Mulgara Stats

The one question everyone asks me about Mulgara is always some variation on "How does it scale?" It's never easy to answer, as it depends on the hardware you're using, the OS (Linux does memory mapping better than Windows), and, of course, the data.

I wanted to put off benchmarking until XA2 was released (early next year). I've also been hoping to have decent hardware to do it with, though I'm less sure about when that might happen. However, the recent release of the XA1.1 system has improved things, and it doesn't hurt to see how it runs on desktop hardware.

RDF data varies according to the number of triples, the number of distinct URIs and literals, and the size of the literals. Some data uses only a few predicates, a modest number of URIs, and enormous literals. Other data uses only a few short literals, and has lots of URIs. Then there is the number of triples being stored. As an example, doing complete RDFS inferencing will introduce no new resources, but can increase the number of triples in a graph by an order of magnitude.
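
To make that last point concrete, here is a rough illustration using Jena (which is not what Mulgara uses internally, just a convenient way to see the effect). The class names, URIs, and the use of Jena's RDFS reasoner are all my own for this sketch, and the package names are the Jena 2 ones of the era:

    import com.hp.hpl.jena.rdf.model.InfModel;
    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.rdf.model.ModelFactory;
    import com.hp.hpl.jena.rdf.model.Resource;
    import com.hp.hpl.jena.vocabulary.RDF;
    import com.hp.hpl.jena.vocabulary.RDFS;

    public class RdfsBlowup {
      public static void main(String[] args) {
        Model base = ModelFactory.createDefaultModel();
        // A chain of ten classes, C0 subClassOf C1 subClassOf ... C9,
        // and a single instance typed only as C0.
        Resource prev = null;
        for (int i = 0; i < 10; i++) {
          Resource c = base.createResource("http://example.org/C" + i);
          if (prev != null) base.add(prev, RDFS.subClassOf, c);
          prev = c;
        }
        base.add(base.createResource("http://example.org/x"), RDF.type,
                 base.createResource("http://example.org/C0"));

        // RDFS entailment introduces no new resources, but the statement count
        // jumps: the instance picks up a type for every superclass, subClassOf
        // closes transitively, and the RDFS axioms come along for the ride.
        InfModel inf = ModelFactory.createRDFSModel(base);
        System.out.println("asserted: " + base.size());
        System.out.println("entailed: " + inf.size());
      }
    }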

There are various standard benchmark data sets and I intend to explore them, but in the meantime I'm going with a metric that Kowari/TKS used back in the days of Tucana, when we had a big 64-bit machine with fast RAID storage. Funnily enough, these days I have a faster CPU, but I still don't have access to storage that is as fast as that box had.

The data I've been working with is the "Numbers" data set that I first created about 4 years ago. I tweaked the program a little bit, adding a couple of options and updating the output. There are probably better ways to model numbers, but the point of this is just to grow a large data set, and it does that well. You can find the code here.
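
For anyone who doesn't want to chase the link, the shape of the generator is roughly the following. Everything here is a sketch: the namespace and predicates are invented, and the real program writes RDF/XML with a richer description of each number, but the essential idea is the same, with each number turning into a handful of triples so that the data set grows into the hundreds of millions.

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.OutputStreamWriter;
    import java.io.PrintWriter;
    import java.util.zip.GZIPOutputStream;

    /** Toy stand-in for the Numbers generator: gzipped N-Triples, made-up predicates. */
    public class NumberGen {
      public static void main(String[] args) throws IOException {
        long max = Long.parseLong(args[0]);  // e.g. 30000000
        String ns = "http://example.org/numbers#";
        PrintWriter out = new PrintWriter(new OutputStreamWriter(
            new GZIPOutputStream(new FileOutputStream("numbers-" + max + ".nt.gz")), "UTF-8"));
        try {
          for (long n = 1; n <= max; n++) {
            String s = "<" + ns + n + ">";
            out.println(s + " <" + ns + "value> \"" + n + "\" .");
            out.println(s + " <" + ns + "successor> <" + ns + (n + 1) + "> .");
            out.println(s + " <" + ns + "even> \"" + (n % 2 == 0) + "\" .");
            // ...plus a few more properties per number (digits, divisors, and so on).
          }
        } finally {
          out.close();
        }
      }
    }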

Hardware

The computer I've been using is my MacBook Pro, which comes with the following specs:
Mac OS 10.5.5
2.6 GHz Intel Core 2 Duo
4GB 667 MHz DDR2 SDRAM
4MB L2 Cache
HDD: Hitachi HTS722020K9SA00
Capacity: 186.31 GB
Native Command Queuing: Yes
Queue Depth: 32
File System: Journaled HFS+

Note that there is nothing about this machine that is even slightly optimized for benchmarking. If I had any sense, I'd be using Linux, and I wouldn't have a journaled filesystem (since Mulgara maintains its own integrity). Even if I couldn't have RAID, it would still be beneficial to use another hard drive. But as I said, this is a standard desktop configuration.

Also, since this is a desktop system, it was impossible to shut down everything else, though I did turn off backups and kept as few programs running as possible.

The Test

I used a series of files, one generated for each million mark up to 30 million. The number of triples was approximately 8-9 times the largest number, with the numbers from 1 to 30 million generating 267,592,533 triples, or a little over a quarter of a billion triples.

Each load was done with a clean start of Mulgara, and was done in a single transaction. The data was loaded from a gzipped RDF/XML file. I ignored caching effects in RAM, since the data far exceeded the amount of RAM that I had.
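
For reference, this is roughly what each load looked like when scripted. The graph name and file path are placeholders, and the ItqlInterpreterBean class and its method names are from memory of the Kowari-era API, so treat the Java plumbing as an approximation; the TQL create and load commands are the important part.

    import org.mulgara.itql.ItqlInterpreterBean;

    /** Sketch of one benchmark load against a freshly started Mulgara server. */
    public class BulkLoad {
      public static void main(String[] args) throws Exception {
        String graph = "<rmi://localhost/server1#numbers>";  // placeholder graph URI
        ItqlInterpreterBean itql = new ItqlInterpreterBean();
        try {
          itql.executeUpdate("create " + graph + ";");
          // One transaction for the whole file, read straight from the gzipped RDF/XML.
          itql.executeUpdate("load <file:/data/numbers-30000000.rdf.gz> into " + graph + ";");
        } finally {
          itql.close();
        }
      }
    }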

At the conclusion of the load, I ran a query to count the data. Counting is still linear in the amount of data, so this is expected to be an expensive operation (this will change soon).

Due to the time needed for larger loads, I skipped most of the loads in the 20 millions. However, the curve for load times is smooth enough that interpolation is easy. The curve for counting is all over the place, but you'll have to live with that.


The axis on the left is the number of seconds for loading, and the axis on the right is the number of seconds for counting. The X-axis is the number of triples loaded.

Counting took less than a second up to the 8 million mark (70.8 million triples). This would be because most of the index could fit into memory. While the trees in the indexes do get shuffled around as they grow, I don't think that explains the volatility in the counting times. I'm guessing that external processes had a larger influence here, since the total time was still within just a few minutes (as opposed to the full day required to load the quarter billion triples in the final load).

Overall, the graph looks to be gradually increasing beyond linear growth. From experience with tests on XA1, we found linear growth, followed by an elbow, and then an asymptotic approach to a new, much steeper gradient. This occurred at the point where RAM could no longer effectively cache the indexes. If that is happening here, then the new gradient is still somewhere beyond where I've tested.

My next step is to start profiling load times with the XA1 store. I don't have any real comparison here, except that I know that there is a point somewhere in the middle of this graph (depending on my RAM) where XA1 will suddenly turn upwards. I've already seen this from Ronald's tests, but I've yet to chart it against this data.

I'm also very excited to see how this will compare with XA2. I'm meeting Andrae in Brisbane next week, so I'll find out more about the progress then.

2 comments:

Rob van Dort said...

Need a fast hard drive? Try an SSD (Solid State Disk). The fastest SSD to date seems to be the Intel X25-M.
Access time: 0.1 ms, read 238.7 MB/s, write 68.6 MB/s.
http://www.anandtech.com/cpuchipsets/intel/showdoc.aspx?i=3403

Paula said...

I'd love one, Rob, but I can't afford it. Got one you can give me? :-)