Thursday, September 18, 2008


New babies are wonderful, but the resulting sleep patterns are far from optimal. Please excuse me if I stop making sense halfway through any of the ensuing sentences.

Hash Tables

Whenever I have a spare moment (and sometimes when I don't) I find myself re-examining how I think RDF should be indexed. After all, I've already found a few different ways to do it, both in Mulgara and outside of it, and each has its own pros and cons.

One of the most interesting indexes to consider is the hash table. Constant-time reads and writes make for a compelling argument in terms of scalability. Re-hashing an index during expansion is painful on systems that should scale, but I was recently reminded that the total re-hashing work is still linear over all insertions (amortized constant per insert), so I shouldn't be too scared.
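To convince myself of that amortization argument, here's a quick sketch (in Python, with a made-up doubling policy, not anything Mulgara actually does) that counts the total copy work done by re-hashing as a table repeatedly doubles:

```python
# Sketch of why doubling growth keeps inserts amortized O(1).
# The initial capacity and growth policy are assumptions for illustration.

def count_copy_work(n_inserts, initial_capacity=8):
    """Insert n_inserts items, doubling capacity whenever the table fills.
    Returns the total number of entries copied during all re-hashes."""
    capacity = initial_capacity
    size = 0
    copies = 0
    for _ in range(n_inserts):
        if size == capacity:
            copies += size      # every entry is re-hashed into the new table
            capacity *= 2
        size += 1
    return copies
```

The total copy work never exceeds 2N, so each insert pays a constant amount on average, even though any individual re-hash is expensive.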

Years ago TKS (the forerunner to both Mulgara and Kowari) used a few on-disk hash tables, but they proved ineffective for us, and we moved to trees. But many of our assumptions back in 2000 no longer apply to modern systems, and I've already found several things worth re-examining for this reason. On top of that, Andy Seaborne was discussing using them for Jena, and while I was initially dubious, on further reflection I can see the reasoning.


It's O(1): That's kind of a trump card. Everything else I have to say here is a discussion as to what could possibly be more important than being O(1).

Opaque Data: Data stored in a hash is treated as an atom, meaning there is no ordering or other structure imposed on it. While this creates problems (mentioned below) it also provides the opportunity to distribute the data across a cluster with something like Hadoop. That's a big deal these days.


Re-Hashing: The first problem I think about with on-disk hash tables is the cost of a re-hashing operation. These are expensive in memory, but on disk they are going to be frightful. Reading the original hash will be OK, as this is a linear scan through the file, but writing will be problematic, as the seeks are essentially random. That's a cost of N seeks for N entries (ignoring seeks for reads, but those are amortized, and could even happen on another drive). There may be algorithms for clustering the writes, but if you're trying to scale with the size of your data, then any such clustering will eventually be overwhelmed.

The best way to address this is to allocate as much space as you can, and to be generous when growing. That could be a problem for some systems, but if you're really in the business of scaling on data, then you'll be up for it.

Space: Hash tables require a lot of empty space to work well, or else you end up with a lot of hashing collisions, and those lovely O(1) properties go out the window (until you expand and re-hash, but I've already talked about that). I shouldn't really make a big deal out of this, especially when you consider that Mulgara was built on the idea that "disk is cheap", but it does still feel a little strange to be that lavish. Being extravagant with space can also lead to speed issues, so it's always worth looking at with a critical eye, even if the final decision is to use the space.

No Ordering: Data in a hash table cannot be ordered. Well, OK, a linked hash table can maintain an order, but only insertion order; linking by anything else means all your O(1) benefits are gone.
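To illustrate the point, here's a small Python sketch (the URIs are invented): a hash table can hand back insertion order for free, but any other ordering means sorting all the keys every time:

```python
# A dict (hash table) remembers insertion order for free, but any other
# ordering requires a full sort over the keys. Example data is made up.

table = {}
for key in ["urn:c", "urn:a", "urn:b"]:
    table[key] = len(table)       # value is just the insertion position

insertion_order = list(table)     # O(n): the order items went in
key_order = sorted(table)         # O(n log n): must re-sort on every request
```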

Hashes and SPARQL

Of the three cons I listed here, the concerns about re-hashing and space are relatively easy to address. In fact, once you decide that "space is no object", re-hashing isn't such a big deal, since you can just start with an enormous table that never (or almost never) gets re-hashed.

The ordering issue bugged me for a while, and it was then that I realized that this actually works well for SPARQL. In fact, this looks like yet another case where the heritage of filtering is showing up again (though maybe it's a coincidence this time).

When you use the appropriate resolvers in Mulgara (in either TQL or SPARQL, since resolvers are just mapped onto graph names) then data can be selected by "range". This lets us select an ordered set of date/times, numbers, or strings that occur between a pair of boundaries (particularly useful for something like selecting events during a particular time window). It is even useful for selecting URIs by namespace. These selections are then joined to the remainder of the query to create a result. The end effect is processing much less data than selecting everything and FILTERing it down to the data that meets the given criteria. We always pursued this approach in Mulgara, as we found that filtering could slow certain queries down by orders of magnitude.
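As a rough sketch of why this matters, the following Python fragment (with invented timestamps, and a sorted list standing in for a tree index) contrasts a range selection against a filter over an unordered store:

```python
import bisect

# An ordered index (sorted list standing in for a tree): two O(log n)
# probes bound the range, and the matching entries are read off directly.
events = sorted([1005, 1001, 1042, 1017, 1099, 1003])
lo, hi = 1002, 1020

left = bisect.bisect_left(events, lo)
right = bisect.bisect_right(events, hi)
by_range = events[left:right]

# A hash-style store has no order, so every entry must be scanned and
# filtered against the criteria, touching all the data.
unordered = {1005, 1001, 1042, 1017, 1099, 1003}
by_filter = sorted(t for t in unordered if lo <= t <= hi)
```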

However, SPARQL was never designed for this kind of thing, and as a result it relies entirely on filtering to do its work. This usually bothers me, but for hash tables it actually works, since they don't provide the ability to select a range anyway, and hence require filtering if you want to use them.

To Tree or not to Tree?

I've been wedded to trees in Mulgara for so long that it feels weird just examining a system without them. Of course, I've already moved away from the use of trees with the new statement store design, but I still thought that the data pool had to be ordered, and hence, no hash tables.

Now I can see the utility of using hash tables in this part of the system, provided you are prepared to use filtering for your results. Jena was always designed around these principles (it's easy to use, it's easy to implement, and it's correct), so I understand why Andy would be attracted to the idea. However, I know that range queries are a big deal in Mulgara, so we really do need a tree somewhere.

But perhaps we can mitigate some of the expense of tree indexes?

Trees are really only needed for two types of query: ranges of data (meaning literals); and selecting strings or URIs by prefix. Neither of these is a common operation, and they are certainly not needed during the time-consuming load operation. So perhaps loads could be done entirely with a fast hash index, and afterwards a slow tree-based indexer could come through and order everything. Background indexing is nothing new, and even AllegroGraph does it, though I'm not sure how to handle a range query that arrives while the indexing is still catching up.
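A toy sketch of that load-then-index split might look like the following Python (all the names here are mine, not Mulgara's): a dict plays the fast hash index used during load, and a sorted list plays the ordered index built later by the background indexer:

```python
import bisect

class TwoPhaseStore:
    """Sketch: O(1) hash lookups during load, ordering deferred to a
    later indexing pass. Purely illustrative, not Mulgara's design."""

    def __init__(self):
        self.terms = {}          # term -> internal id (the fast hash index)
        self.ordered = None      # built lazily by the "background" pass

    def load(self, term):
        # Fast path used for every statement during a bulk load.
        if term not in self.terms:
            self.terms[term] = len(self.terms)
            self.ordered = None  # the ordered index is now stale
        return self.terms[term]

    def build_ordered_index(self):
        # The slow pass a background indexer would perform.
        self.ordered = sorted(self.terms)

    def range(self, lo, hi):
        if self.ordered is None:
            self.build_ordered_index()  # a real system might block or queue here
        i = bisect.bisect_left(self.ordered, lo)
        j = bisect.bisect_right(self.ordered, hi)
        return self.ordered[i:j]
```

The awkward part is exactly the one mentioned above: what `range()` should do when the ordered index hasn't caught up yet. Here it simply builds it on demand.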

Another possibility would be to do inserts into a tree index, and simultaneously index the tree nodes with a hash index. After all, the tree nodes are not being reclaimed, and while their position in the tree may change, their data does not. This would require another seek/write during insertion, but would save log(N) seeks when checking whether a string or URI already exists, which is the single most common operation during a load. That way there would be no background indexing to wait for, and the most common task drops from log(N) seeks to a single seek. Now that has promise. I'll have to see if I can think of a decent Hadoop angle for it.
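A sketch of that idea in Python (using a toy unbalanced binary tree, and names of my own invention) might look like this: every insert writes to the tree and records the node in a hash, so the existence check becomes a single probe:

```python
class Node:
    """A toy tree node; stands in for an on-disk tree block."""
    def __init__(self, term):
        self.term = term
        self.left = self.right = None

class DualIndex:
    """Sketch: a tree for ordered access, plus a hash over the tree
    nodes so existence checks avoid the log(N) tree walk."""

    def __init__(self):
        self.root = None
        self.by_term = {}               # the hash index over tree nodes

    def insert(self, term):
        node = self.by_term.get(term)   # O(1) existence check
        if node is not None:
            return node
        node = Node(term)
        self.by_term[term] = node       # the extra write at insert time
        if self.root is None:
            self.root = node
            return node
        cur = self.root                 # ordinary (unbalanced) tree insert
        while True:
            if term < cur.term:
                if cur.left is None:
                    cur.left = node
                    return node
                cur = cur.left
            else:
                if cur.right is None:
                    cur.right = node
                    return node
                cur = cur.right

    def contains(self, term):
        return term in self.by_term     # one probe, no tree walk
```

Note that the hash maps terms to nodes rather than to positions, which is why it survives the tree rebalancing mentioned above: the node's data never moves, only its place in the tree.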

So now I need to write the hash table file. We already have a few things that are close, so maybe I can leverage one of those?

Other Stuff

There's a LOT to write about with Mulgara, but this year I've tended to do the work rather than write about it. I believe that this is a false economy, since writing about things provides me with an invaluable log of what I did and when, and also helps me work out just what I need to be doing.

On the other hand, late at night is not the time for me to be writing, especially when a baby is going to be waking me at various times between now and morning.

All the same, I'll mention that I now have a couple of cute little servlets that let me do "HTTP GET" requests with SPARQL protocol parameters, and I get back either XML or JSON (depending on the value of an optional parameter called out; hmmm, maybe I should have called it format?). One of the servlets is for TQL queries, while the main one is for SPARQL.

These servlets also accept "HTTP POST" requests. In this case, the TQL servlet will allow commands that update data. The SPARQL servlet will eventually do this too, but not until I've implemented "SPARQL/Update". They will also accept MIME-encoded files containing RDF data (RDF/XML, N3, and I think Turtle) and will load them into the default graph, which can be specified with the default-graph-uri parameter.
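As a sketch of that format dispatch (the out parameter name comes from above; the media types are the standard SPARQL result types, but the helper itself is hypothetical, and in Python rather than servlet Java):

```python
from urllib.parse import parse_qs

def response_mime_type(query_string):
    """Pick a response type from the 'out' query parameter.
    XML is the assumed default when the parameter is absent."""
    params = parse_qs(query_string)
    out = params.get("out", ["xml"])[0]
    return {
        "xml": "application/sparql-results+xml",
        "json": "application/sparql-results+json",
    }.get(out, "application/sparql-results+xml")
```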

I haven't committed all of this code yet, since I ran into a bug when loading an RDF file. It turned out that the final line of this file does not end with a newline character, and this confuses the ARP parser we are using. Of course, I could just wrap the InputStream object in something that appends a newline, but that is an unnecessary (and horrible) hack, so I decided to look for the source of the problem.

At this point I realized that we are still on Jena 2.1, while the world has moved on to 2.5.6. Hopefully the move to 2.5.6 will fix this issue, so I decided to upgrade the Jar. Of course, this led to updating 2 other jars (icu.jar and arq.jar), along with some test failures (I think the tests were trying to compensate for a timezone bug, but this has been fixed now).

While trawling through the Mulgara XSD classes I found what I think is the problem (compensation code for Jena not handling 0 months, though now it should). While there, I also learnt that despite parsing everything needed, the same data was being sent to a Jena object for parsing again. This seems quite redundant. It is also one of the few places that Jena classes are used (as opposed to just the ARP parser), so it would be great to drop this dependency if I can.

So now a simple bug fix (not handling a missing newline character) seems to be leading me into all sorts of updates. Story of my life.

OK, now I'm falling asleep between words, and have even caught myself starting to type something out of a dream on 3 occasions. I think I've overstayed my welcome on my own blog.