Sunday, July 29, 2007

Conversations

I've just spent a week working in Novato, CA. While I didn't get much programming done, I did manage a few very productive conversations. I spent the whole week working with Alan, who is interested in Mulgara (for various reasons), and on Wednesday night I finally got to meet with Peter from Radar Networks.

While I was describing the structure of Mulgara, and particularly the string pool, Alan made a number of astute observations. First of all, our 64-bit gNodes don't have a full 64-bit address space to work in, since each node ID is multiplied by the size of an index entry to get a file offset. This isn't a problem in practice (we'd have to be allocating thousands of nodes a second for decades before running out), but it does mean that the top few bits of a node ID can never appear in a valid offset. Those unreachable bits are an opportunity to store extra type information in the ID itself.
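To make that concrete, here's a rough sketch of what tagging the spare bits might look like. The 56-bit split is a figure I'm making up for illustration; the real number of spare bits depends on the actual index entry size.

// Sketch only: pack a small type tag into the high bits of a 64-bit gNode ID.
// The 56-bit split is an assumed figure, not Mulgara's actual layout.
public final class TaggedNodeId {
    private static final int TAG_SHIFT = 56;                  // assumed: plain IDs fit in 56 bits
    private static final long ID_MASK = (1L << TAG_SHIFT) - 1;

    /** Combine a node ID with a small type tag in the otherwise unreachable bits. */
    static long tag(long nodeId, int typeTag) {
        assert (nodeId & ~ID_MASK) == 0 : "node ID too large to tag";
        return ((long) typeTag << TAG_SHIFT) | nodeId;
    }

    /** Recover the type tag from a tagged ID. */
    static int typeOf(long taggedId) {
        return (int) (taggedId >>> TAG_SHIFT);
    }

    /** Recover the plain node ID, ready to be multiplied into a file offset. */
    static long idOf(long taggedId) {
        return taggedId & ID_MASK;
    }
}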

This observation about the address space took on new relevance when Peter mentioned that another RDF datastore tries to store as much data as possible directly in the indexes, rather than redirecting everything (except blank nodes) through their local equivalent of the string pool. This makes perfect sense to me, as the Mulgara string pool (really a "URI and literal" pool) can already fit a lot of data into fewer than 64 bits. Only short strings (7 ASCII characters or fewer) will fit, but most numeric and date/time data types should fit easily. Even when they can't, we could still map a reduced set of values into this space (how many DateTime values really need more than, say, 58 bits?).
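As a rough illustration of the inlining idea (the layout and marker byte here are my own invention, not anything an existing store actually does), a 7-character ASCII string packs into a single long like this:

// Illustration only: inline up to 7 ASCII characters in one 64-bit value,
// keeping the top byte free for an assumed "inline string" marker.
final class InlineString {
    private static final long MARKER = 0x01L << 56;      // assumed type marker in the top byte

    static long pack(String s) {
        if (s.length() > 7) throw new IllegalArgumentException("too long to inline");
        long packed = MARKER;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c > 0x7F) throw new IllegalArgumentException("not ASCII");
            packed |= ((long) c) << (i * 8);              // one byte per character
        }
        return packed;
    }

    static String unpack(long packed) {
        StringBuilder sb = new StringBuilder(7);
        for (int i = 0; i < 7; i++) {
            char c = (char) ((packed >>> (i * 8)) & 0x7F);
            if (c == 0) break;                            // unused bytes are zero
            sb.append(c);
        }
        return sb.toString();
    }
}

Fixed-width values like integers and date/times would pack even more simply, since they don't need a terminator at all.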

Indeed, I'm only considering the XA store here. When the XA2 store comes online it will have run-length encoded sets of triples in its blocks, which means we can stretch the length of what gets encoded in the indexes even further without having to divert to the string pool.
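I won't guess at XA2's actual block layout here, but just to illustrate the general idea, run-length encoding a sorted column of IDs collapses the repeats into (value, count) pairs:

import java.util.ArrayList;
import java.util.List;

// Illustration only: collapse repeats in a sorted column of IDs into
// (value, run length) pairs. This is not XA2's actual block format.
final class RunLength {
    static List<long[]> encode(long[] sortedColumn) {
        List<long[]> runs = new ArrayList<>();
        int i = 0;
        while (i < sortedColumn.length) {
            int j = i;
            while (j < sortedColumn.length && sortedColumn[j] == sortedColumn[i]) j++;
            runs.add(new long[] { sortedColumn[i], j - i });   // value and its run length
            i = j;
        }
        return runs;
    }
}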

The only thing that this approach might break would be some marginal uses of the Node Type and DataType resolvers. These resolvers are usually used to test or filter for node type information, and this function would not be affected. However, both resolvers are capable of being queried for all the contents of the string pool that meet the type criteria, and this function would be compromised. I'm not too worried though, as these functions are really only useful for administrative processes (and marginally at that). The only reason I allowed for this functionality in the first place was because I could, and because it was the natural semantic extension of the required operations. Besides, some of the other changes we might make to the string pool could invalidate this style of selection of "all uses of a given type".

Permanent Strings

The biggest impediment to load speed at the moment appears to be the string pool. It's not usually a big deal, but if you start to load a lot of string data (like the contents of enwiki) then it really shows. Sure, we can cache pretty well for future lookups, but caching doesn't help when you're simply writing a lot of string data.

The use cases I've seen for this sort of thing usually involve loading a lot of data permanently, or loading it, dropping it, and then re-loading it in a similar form. Either way, optimizing for deleting strings seems pretty pointless. I'm thinking that we really need an index that lets us write strings quickly, at the expense of not being able to delete them (at least, not while the database is live).
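Roughly, the contract I have in mind looks something like this (all the names are hypothetical, just to sketch the shape of it):

// Hypothetical sketch of a write-once string pool contract.
public interface WriteOncePool {
    /** Return the existing gNode for this value, or allocate a new one. */
    long findOrAdd(String lexicalValue);

    /** Return the gNode for a value, or a negative number if it was never stored. */
    long find(String lexicalValue);

    /** Look up the lexical value for a previously allocated gNode. */
    String getValue(long gNode);

    // Deliberately no remove(): entries stay put while the database is live.
}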

I'm not too concerned about over-optimizing for this usage pattern, as it can just be written as an alternative string pool, with the selection made in the mulgara-config.xml file. It may also make more sense to make a write-once pool the default, as it seems that most people would prefer this.

I've been discussing this write-once pool with a few people now, but it was only while talking with Alan that I realized that almost everything I've proposed is already how Lucene works. We already support Lucene as the backend for a resolver, so it wouldn't be a big step to move it up to taking on many of the string pool functions. Factor in that many of the built-in data types (short, int, character, etc.) can be put into the indexes inline, and the majority of things we need to index in the string pool end up being strings after all, which of course is what Lucene is all about. Lucene is a great system, and this kind of integration between projects is one of the big advantages of building open source software.
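For a flavour of what the mapping might look like, here's a sketch against a recent Lucene release. The field names ("lexical", "gnode") and the index path are invented for the example, and this isn't anything that exists in Mulgara today.

import java.io.IOException;
import java.nio.file.Paths;

import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Sketch only: storing literal-to-gNode mappings in a Lucene index.
final class LucenePoolSketch {
    static void add(Directory dir, String lexical, long gNode) throws IOException {
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new KeywordAnalyzer()))) {
            Document doc = new Document();
            doc.add(new StringField("lexical", lexical, Field.Store.YES)); // exact-match key
            doc.add(new StoredField("gnode", gNode));                      // stored, not searched
            writer.addDocument(doc);
        }
    }

    static long find(Directory dir, String lexical) throws IOException {
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            TopDocs hits = searcher.search(new TermQuery(new Term("lexical", lexical)), 1);
            if (hits.scoreDocs.length == 0) return -1L;                    // not present
            Document doc = searcher.doc(hits.scoreDocs[0].doc);
            return doc.getField("gnode").numericValue().longValue();
        }
    }

    public static void main(String[] args) throws IOException {
        try (Directory dir = FSDirectory.open(Paths.get("stringpool-index"))) {
            add(dir, "http://example.org/subject", 42L);
            System.out.println(find(dir, "http://example.org/subject"));   // prints 42
        }
    }
}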

It's been a while since I wrote to the Lucene API. I ought to pull out the docs and read them again.
