Sunday, January 13, 2008

Mulgara Correspondence

Recently I've been in a few email discussions with Andy Seaborne about the architecture of Mulgara. He's been looking at a new Jena-Mulgara bridge, but when he's had the time it appears he's been looking into how Mulgara works. There are certainly areas where Mulgara could be a lot better (distressingly so), so we will be changing a number of things in the not-too-distant future. But in the meantime I'm more than happy to explain how things currently work. It's been a worthwhile exchange, as Andy knows what he's on about, so he's given me some good ideas. It's also nice to talk about some of the issues with indexing with someone who understands the needs, and can see the trade-offs.

Since I wrote so much detail in some of these emails, I asked Andy (just before he suggested it himself) if he'd mind me posting some of the exchange up here. One could argue that if I hadn't been writing to him then I'd have had the time to write here, but the reality is that his questions got me moving whereas the self-motivation required for blogging has failed me of late.

There will be a lack of context from the emails, but hopefully I'll be able to edit it into submission. I should also issue a warning that what I wrote presumes you have some idea of what RDF is, and that you can look at the Mulgara code.

AVL Trees

If you want to keep read/write data on disk, and have it both ordered and efficient to search, then a tree is usually the best approach. There are other things you can do, but they all involve trade-offs that usually make a tree the better choice.

Databases usually have B-trees of some type (there are a few types). These work well, but Mulgara instead opted to go with AVL trees, with a sorted list associated with each node. This structure is much more efficient for writing data, but less efficient for deletion. This suits us well, as RDF is often loaded in bulk, and it gets updated regularly, but bulk deletions are less frequent. I mention the complexity of this later on.

Andy asked about our AVL trees, with comments showing that he was only looking at one of their uses. I think that understanding a particular application of this structure is easier if the general structure is understood first.

AVL trees are used in two places: The triple indexes (indices), and the "StringPool" (which is really an Object pool now).

The trees themselves don't hold large amounts of data. Instead each node holds a "payload" which is specific to the thing it is being used to index. In the case of the "triples" indexes, this payload includes:
  • The number of triples in the block being referenced.
  • The smallest triple stored in the block.
  • The largest triple stored in the block.
  • The ID of the 8K block where the triples are stored (up to 256 of them).
I'm only using the word "triple" because that's what we stored once upon a time (circa 2001). In reality, we store quads. On the first pass, the fourth value was a set of security values, but this quickly became a graph ID. Unfortunately, this happened back when everyone referred to graphs as "models", so you'll see the code uses the name "model" instead of "graph" everywhere. (I'd like to change this).
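
To picture the payload, here is a rough sketch of an index node's payload written as a Java class. The field names and layout are just for illustration (they are not the real code), but the sizes follow from the description above.

 // Hypothetical layout of a triple-index node payload (names are illustrative only).
 final class TripleNodePayload {
   int tripleCount;                // quads currently stored in the referenced block
   long[] lowQuad = new long[4];   // smallest quad in the block: subject, predicate, object, graph
   long[] highQuad = new long[4];  // largest quad in the block
   long blockId;                   // ID of the 8K block holding the quads themselves
 }
 // 256 quads x 4 longs x 8 bytes = 8192 bytes, which is why a block holds up to 256 of them.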

There is also some inefficiency, as we use a lot of 64 bit values, which means that there are a lot of bits set to zero. There are plans to change the on-disk storage this year to make things much more efficient. Fortunately, the storage is completely modular, so all we need to do to use a new storage mechanism is to enter the factory classes into an XML configuration file.

The code in org.mulgara.store.statement.xa.XAStatementStoreImpl shows that there are 6 indices. These are ordered according to "columns" 0, 1, 2, and 3, with the following patterns: 0123, 1203, 2013, 3012, 3120, 3201. The numbers here are just a mapping of: Subject=0, Predicate=1, Object=2, Model=3. Of course, using this set of indices lets you find the result of any "triple pattern" (in SPARQL parlance) as a range inside the index, with the bounds of the range being found with a pair of binary searches.
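
To see why these six orderings are enough, here is a small illustrative sketch (not the real code): whatever positions of a quad pattern are bound, there is an index whose leading columns are exactly those positions, so the matching quads form a single contiguous range that two binary searches can bound.

 import java.util.Set;

 // Illustrative only: pick an index ordering whose prefix covers the bound columns.
 // Columns: 0=Subject, 1=Predicate, 2=Object, 3=Model (graph).
 public class IndexChooser {
   static final int[][] INDEX_ORDERS = {
     {0,1,2,3}, {1,2,0,3}, {2,0,1,3}, {3,0,1,2}, {3,1,2,0}, {3,2,0,1}
   };

   /** Returns an ordering whose leading columns are exactly the bound ones. */
   static int[] chooseIndex(Set<Integer> bound) {
     for (int[] order : INDEX_ORDERS) {
       boolean covered = true;
       for (int i = 0; i < bound.size(); i++) {
         if (!bound.contains(order[i])) { covered = false; break; }
       }
       if (covered) return order;  // every possible set of bound columns matches one ordering
     }
     throw new IllegalStateException("No index covers: " + bound);
   }
 }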

We use AVL trees because they are faster for writing than B-trees. Rebalancing after an insertion takes only O(1) rotations, whereas deletion can require O(log(n)) of them. But since RDF is supposed to be about asserting data rather than removing it, the extra cost of deletion is usually OK. :-)

The other important thing to know about Mulgara AVL trees is that they are stored in phases. This means we have multiple roots for the trees, with each root representing a phase. All phases are read-only, except the most recent. The moment a phase updates a node, it does a copy-on-write for that node, and for all parents (transitively) up to a node that has already been copied for the current phase, or the root (whichever comes first). In this way, there can be multiple representations of the data on disk, meaning that old read operations are always valid, no matter what write operations have occurred since then. Results of a query therefore reference phases, the nodes of which can be reclaimed and reused when the result is closed, or garbage collected (we log a warning if the GC cleans up a phase).
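
A much-simplified, in-memory sketch of the copy-on-write idea looks like the following. The names and structure here are my own for illustration; the real Phase code manages this data in on-disk blocks rather than Java objects.

 // Simplified sketch of phase copy-on-write (illustration only, not the on-disk implementation).
 class PhasedNode {
   int phase;                      // the phase that owns this copy of the node
   PhasedNode parent, left, right;
   long payloadBlock;              // reference to the node's payload

   /** Returns a copy owned by currentPhase, copying ancestors as needed. */
   PhasedNode writable(int currentPhase) {
     if (phase == currentPhase) return this;          // already copied in this phase
     PhasedNode copy = new PhasedNode();
     copy.phase = currentPhase;
     copy.left = left;                                // children are shared until they are modified
     copy.right = right;
     copy.payloadBlock = payloadBlock;
     if (parent != null) {
       PhasedNode p = parent.writable(currentPhase);  // the copy propagates up towards the root
       copy.parent = p;
       if (p.left == this) p.left = copy; else p.right = copy;
     }
     return copy;
   }
 }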

Because all reads and writes are done on phases, the methods inside TripleAVLFile are of less interest than the methods in the inner class TripleAVLFile.Phase. Here you will find the find methods that select a range out of an index, based on one, two, or three fixed values.

The String Pool also uses an AVL tree (just one), though it has a very different payload. However, the whole phase mechanism is still there.

Object Pool

Andy noted that the comments for the ObjectPool class say that it is for reducing constructor overhead, but a cursory inspection revealed that it does more.

There's some complexity to avoid pool contention between multiple threads. Each pool contains an array of "type pools" (see the inner type called ObjectStack), indexed by a (manually) enumerated type ID. If you want an object with type ID 5, you go to element 5 in that array, and you get a pool for just that type. This pool is an ObjectStack, which is just an array that is managed as a stack.

Whenever a new ObjectPool is created it is chained onto a singleton ObjectPool called the SHARED_POOL. To avoid a synchronization bottleneck, each thread uses the pool that it created, but will fall back to using the "next" pool in the chain (almost always the SHARED_POOL) if it has run out of objects for some reason. Since this is only a fallback, then there shouldn't be much waiting.
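
For anyone who finds it easier to read as code, here is a heavily simplified sketch of the scheme. The class name, the type count, and the use of ArrayDeque are all mine; the real ObjectStack is its own array-backed stack.

 import java.util.ArrayDeque;

 // Heavily simplified sketch of per-type, per-thread pooling with a shared fallback.
 final class SimplePool {
   private static final int TYPE_COUNT = 16;           // assumed number of pooled types
   static final SimplePool SHARED_POOL = new SimplePool(null);

   private final SimplePool next;                      // fallback pool (almost always SHARED_POOL)
   private final ArrayDeque<Object>[] stacks;          // one stack of reusable objects per type ID

   @SuppressWarnings("unchecked")
   SimplePool(SimplePool next) {
     this.next = next;
     this.stacks = new ArrayDeque[TYPE_COUNT];
     for (int i = 0; i < TYPE_COUNT; i++) stacks[i] = new ArrayDeque<Object>();
   }

   /** Returns a pooled object of the given type ID, or null if none is available. */
   Object get(int typeId) {
     Object o = stacks[typeId].poll();
     if (o == null && next != null) {
       synchronized (next) {                           // only the shared fallback needs locking
         o = next.get(typeId);
       }
     }
     return o;
   }

   /** Returns an object to this pool for later reuse. */
   void put(int typeId, Object o) {
     stacks[typeId].push(o);
   }
 }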

I know that some people will cringe at the thought of doing object pooling with modern JVMs. However, when Mulgara was first written (back when it was called TKS) this sort of optimization was essential for efficient operation. With more recent JVMs, we have been advised to drop this pooling, but there have been a few reasons to hold back on making this change. First, we have tried to maintain a level of portability into new versions of the JVM (this is not always possible, but we have tried nonetheless), and this change could destroy performance on an older JVM. Second, we do some level of caching of objects while pooling them. This means that we don't always have to initialize objects when they are retrieved. Since some of this initialization comes from disk, and we aren't always comfortable relying on the buffer cache having what we need, then this may have an impact. Finally, it would take some work to remove all of the pooling we do, and recent profiles have not indicated that it is a problem for the moment. I'd hate to do all that work only to find that it did nothing for us, or worse, that it slowed things down.

32/64 Bits and Loading

Andy was curious about our rate of loading data on 32 bit or 64 bit systems, given simple data of short literals and URIs (<100 characters or so). Unfortunately, I didn't have many answers here.

In 2004 Kowari could load 250,000,000 triples of the type described in under an hour on a 64 bit Linux box (lots of RAM, and a TB of striped disks). However, it seems that something happened in 2005 that slowed this down. I don't know for certain, but I've been really disappointed to see slow loads recently. However, I don't have a 64 bit Linux box to play with at the moment, so it's hard to compare apples with apples. After the SPARQL implementation is complete, profiling the loads will be my highest priority.

64 bit systems (excluding Apples, since they don't have a 64 bit JVM) operate differently to 32 bit systems. For a start, they memory map all their files (using an array of maps, since no single map can be larger than 2GB in Java). Also, I think that the "long" native type is really an architecturally native 64 bit value, instead of the 2x32 bit values it has to be on a 32 bit system. Since we do everything with 64 bit numbers, this helps a lot.
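
As an illustration of the mapping scheme (not the actual Mulgara code), this is roughly what mapping a large file as an array of regions looks like with java.nio. The region size here is an arbitrary choice of mine.

 import java.io.IOException;
 import java.io.RandomAccessFile;
 import java.nio.MappedByteBuffer;
 import java.nio.channels.FileChannel;

 // Maps a file as a series of regions, since a single map cannot exceed 2GB in Java.
 public class ChunkedMapper {
   static final long REGION_SIZE = 1L << 30;   // 1GB regions; the real region size is an assumption

   static MappedByteBuffer[] mapFile(String path) throws IOException {
     RandomAccessFile file = new RandomAccessFile(path, "r");
     try {
       FileChannel channel = file.getChannel();
       long size = channel.size();
       int regions = (int) ((size + REGION_SIZE - 1) / REGION_SIZE);
       MappedByteBuffer[] maps = new MappedByteBuffer[regions];
       for (int i = 0; i < regions; i++) {
         long start = i * REGION_SIZE;
         long length = Math.min(REGION_SIZE, size - start);
         maps[i] = channel.map(FileChannel.MapMode.READ_ONLY, start, length);
       }
       return maps;
     } finally {
       file.close();
     }
   }
 }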

After writing this, Inderbir was able to run a quick profile on a load of this type. He immediately found some heavily used code in org.mulgara.store.xa.Block where someone was doing a copy from one ByteBuffer to another by iterating over characters. I cannot imagine who would have done this, since only DavidM and I have ever had a need to be in there, and we certainly would not have done this. I also note that the operation involves copying the contents of ByteBuffers, but this doesn't make a lot of sense either, since the class was built as an abstraction to avoid exactly that (whenever possible).

I haven't seen the profile, but Inderbir said that a block copy gave him an immediate improvement of about 30%. I'd also like to check the stack trace to confirm if a block copy is really needed here anyway. Thinking about it, it might be needed for copying one part of a memory-mapped file to another, but it should be avoided for files that are being accessed with read/write operations.
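
For reference, the difference is between the two methods below: a bulk ByteBuffer transfer can be done as a single optimized copy instead of one byte at a time. (This is a general java.nio illustration, not the code Inderbir found.)

 import java.nio.ByteBuffer;

 // General illustration of byte-at-a-time vs bulk buffer copying.
 public class BufferCopy {
   /** Slow: transfers one byte per call. */
   static void copySlow(ByteBuffer src, ByteBuffer dst) {
     while (src.hasRemaining()) {
       dst.put(src.get());
     }
   }

   /** Fast: a single bulk transfer of everything remaining in src. */
   static void copyFast(ByteBuffer src, ByteBuffer dst) {
     dst.put(src);
   }
 }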

Embedded Servers

Andy was also intrigued at the (sparse) documentation for Embedded Mulgara Servers. While on this track, he pointed out that LocalSession has "DO NOT USE" written across it. I've seen this comment, but don't know why it's there. I should look into what LocalSession was supposed to do. In the meantime, I recommended not worrying about it - I don't.

The Session implementation needed for local (non-client-server) access is org.mulgara.resolver.DatabaseSession. It should be fine, as this is what the server is using.

When doing RMI, you use RemoteSessionWrapperSession. I didn't name these things, but the standard here is that the part before "Wrapper" is the interface being wrapped, and the thing after "Wrapper" is the interface that is being presented. So RemoteSessionWrapperSession means that it's a Session that is a wrapper around a RemoteSession. The idea is to make the Session look completely local. The reason for wrapping is to pick up the RemoteExceptions needed for RMI and convert them into local exceptions. At the server end, you're presenting a SessionWrapperRemoteSession to RMI. This is wrapping a Session to look like a RemoteSession (meaning that all the methods declare that they throw RemoteException). Obviously, from the server's perspective, the Session being wrapped here must be local. And the session that is local for the server is DatabaseSession. So to "embed" a database in your code, you use a DatabaseSession.
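
The wrapping pattern looks roughly like this. It's only a sketch: query() is a made-up method rather than the real Session API, and the real classes do more careful exception handling.

 import java.rmi.RemoteException;

 // Sketch of the naming convention: a "Session" that wraps a "RemoteSession".
 class RemoteSessionWrapperSessionSketch /* implements Session */ {
   private final Object remoteSession;   // stands in for the real RemoteSession RMI stub

   RemoteSessionWrapperSessionSketch(Object remoteSession) {
     this.remoteSession = remoteSession;
   }

   // Each wrapped method catches RemoteException and rethrows a local exception,
   // so callers can treat the session as if it were entirely local.
   public Object query(String tql) {
     try {
       return callRemote(tql);            // delegate over RMI
     } catch (RemoteException e) {
       throw new RuntimeException("Remote query failed", e);
     }
   }

   private Object callRemote(String tql) throws RemoteException {
     // ... invoke the corresponding method on the RemoteSession stub ...
     return null;
   }
 }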

The way to get a DatabaseSession is to create an org.mulgara.resolver.Database, and call Database.newSession(). Databases need a lot of parameters, but most of them are configuration values that are handled automatically by org.mulgara.resolver.DatabaseFactory. Look in this factory for the method:
 public static Database newDatabase(URI uri, File directory, MulgaraConfig config);
A MulgaraConfig is created using the URL of an XML configuration file. By default, we use the one found in conf/mulgara-x-config.xml, which is included in the jar:
 // load the default configuration from the classpath (it is packaged inside the Mulgara jar)
 URL configUrl = ClassLoader.getSystemResource("conf/mulgara-x-config.xml");
 // unmarshal the XML into a MulgaraConfig and check that it is valid
 MulgaraConfig config = MulgaraConfig.unmarshal(new InputStreamReader(configUrl.openStream()));
 config.validate();
(configUrl has a default of: jar:file:/path/to/jar/file/mulgara-1.1.1.jar!/conf/mulgara-x-config.xml)
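
Putting those pieces together, and using the config object from the snippet above, embedding ends up looking something like this sketch (the server URI and data directory are placeholders):

 // Sketch of embedding: build a Database directly and ask it for a local Session.
 Database db = DatabaseFactory.newDatabase(
     URI.create("rmi://localhost/server1"),   // the server URI (placeholder)
     new File("/path/to/data/directory"),     // where the database files will live (placeholder)
     config);
 Session session = db.newSession();           // this is a DatabaseSession: local, no RMI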

As an aside, it's supposed to be possible to do all of this by creating an EmbeddedMulgaraServer with a ServerMBean parameter that isn't doing RMI. Unfortunately, there are no such ServerMBeans available. (Maybe I should write one?)

Also, I believe that the purpose of the embedded-dist Ant target is to create a Jar that has these classes along with all the supporting code, but without anything related to RMI. So the embedded Jar should be all you need for this, but I haven't used it myself, so I'm just making an educated guess. :-)

SPARQL

Since I had been working on this up until mid-December, it is worth noting where I am with it.

TQL includes graph names as a part of the query, and graph names have been URLs (not URIs) - meaning that they include information on how to find the server containing the graph. Unfortunately, the guys who wrote TQL integrated the session management into the query parsing (if that doesn't make you shake your head in disbelief then you're a more forgiving person than I am). I've successfully decoupled this, and now return an AST that a session manager can work with. This also means that graph names no longer have to describe the location of a server, meaning we can now support arbitrary URIs as graph names. This now puts the burden on the session manager to find a server, but that's easy enough to set up with configuration, a registry, or scanning the AST for graph names if we want backward compatibility.

The next part has been parsing SPARQL. (Something that Andy should be intimately familiar with, given that his name is all over the documents I reference).

With so many people talking about extensions to SPARQL, and after discussing this with a few other people, we decided to go with an LALR parser. This means I've had to write my own lexer/parser definition, instead of going with one of the available definitions, like the JavaCC definition written by Andy. We do have SableCC already in Mulgara, but everyone present agrees that this is a BAD LALR parser so I had to use something new. I chose Beaver/JFlex. It's going well, but I still have a lot of classes to write for the CST. The time taken to do this has me wondering if everyone is being a little too particular about the flexibility of an LALR solution, and maybe I should just go back to Andy's JavaCC definition. OTOH, I really like Beaver/JFlex and having an independent module that can do SPARQL using this parser may be a good thing.

Fortunately the SPARQL spec now has a pretty good grammar specification and terminals, though one or two elements seemed redundant, and I've jumped over them (such as PNAME_LN; instead I defined IRIref ::= IRI_REF | PNAME_NS PN_LOCAL?). I've been getting some simple CSTs out of it so far, but have a way to go yet.

Once I have it all parsing, of course, I have to transform the result into an AST. Fortunately, most of the SPARQL AST is compatible with Mulgara. The only exceptions are the FILTER operator (SPARQL's "constraints") and the OPTIONAL operator. I'm pretty sure I can handle OPTIONAL as a cross between a disjunction (which can leave unbound variables) and a conjunction (which matches variables between the left and the right). Filters should be easy, since all our resolutions are performed through nested, lazy evaluation of conjunctions and disjunctions. Handling the syntax of filters is another matter, but I expect it to be more time-consuming than difficult.

Holdup

Since writing these comments about implementing SPARQL, I haven't had time to work on it again. Hopefully that will change soon with the new job. But in the meantime, the loss of time has me thinking that I should reconsider using a pre-built SPARQL definition for a less expressive parser, and come back to Beaver/JFlex at a later date.

I've heard that the Jena JavaCC grammar may be a little too heavily geared towards Jena, but I've been given another definition by my friend Peter which is more general and apparently passes all the relevant tests. I suppose I should go and learn JavaCC now.

3 comments:

Anonymous said...

nice writeup!

leopard has a 64-bit java5 vm, btw :)

Paula said...

It has? I'm pleased to hear it! I'd heard that it hadn't, along with a lot of other missing Java stuff - like Java 6, but never bothered to check for myself.

Just proves that you can't believe everything you read (esp. this blog). Thank you!

Anonymous said...

yup, -d64 when launching