Monday, May 28, 2007

More AllegroGraph and Mulgara

This blog is all about helping me to remember stuff. The fact that anyone else is reading it is just a little too strange for me to remember most of the time. Please remember this when I go on and on about something I'm trying to write about, or when I'm a little too candid for my own good. It's more the former that concerns me in this post.

One important architectural item of AllegroGraph that I forgot to note was the description of their index writer. Richard mentioned that writing to the graph is fast, and that it plays "catch up" in the background while keeping the data live the whole time. If you perform a query during this period, then part of the query will be fast, and the "unindexed" data will come in more slowly. This is so close to the Mulgara XA2 design that I just had to mention it.

Aside from making the XA2 indexes faster to read, we really want to make them faster to write. Reading is actually fine, and I've yet to hear of any index-related problems in reading speed (there have been problems at various times, but they usually come from data type issues).

The fastest way to write triples is to simply write them down in the order they were given to you. You can't do a lot with this, but you can do this faster than any process can practically give them to you. This is the limit we can aim for.

Our original plan (based on skip lists) was going to accept writes in exactly this way. Indeed, this is exactly what you have to do with skip lists, as they are a write-once structure. This means that any new data has to be accumulated before it can be added to the index. Any subsequent query first uses the index (with efficient log(n) complexity) and then adds in the results of a linear search through the "writable" file. A background thread then merges this file in with the main index (into a separate file) and then brings the new file online. This is actually how Lucene does it.
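To make that read/write split concrete, here is a minimal sketch in Python. It is not Mulgara or AllegroGraph code: the class and method names are made up for illustration, a sorted in-memory list stands in for the on-disk index, and the merge is called directly rather than from a background thread.

```python
import bisect


class TripleStore:
    """Toy store (hypothetical): a sorted index plus an unsorted write buffer."""

    def __init__(self):
        self.index = []    # sorted triples, searched in O(log n) with bisect
        self.pending = []  # raw writes in arrival order, searched linearly

    def add(self, triple):
        # Fast path: write the triple down in the order it was given to us.
        self.pending.append(triple)

    def contains(self, triple):
        # First the efficient lookup in the index...
        i = bisect.bisect_left(self.index, triple)
        if i < len(self.index) and self.index[i] == triple:
            return True
        # ...then a linear search through the not-yet-indexed writes.
        return triple in self.pending

    def merge(self):
        # What a background thread would do: build a new index from the
        # old index plus the pending writes, then bring it online.
        merged = sorted(set(self.index) | set(self.pending))
        self.index, self.pending = merged, []
```

The longer `pending` grows before `merge()` runs, the more each query pays for the linear scan, which is exactly the slowdown described above.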

Andrae has since left behind the idea of skip lists (they had a certain advantage in reading, due to forward-file access, but writing can be expensive) but the idea of building up an unordered list and flushing it in a set of background threads is still what we want to do. The result will be a fast write, like AllegroGraph has, but querying will get slower and slower as writes get further ahead of the indexing threads, again - just like AllegroGraph. The big difference is that we only have it planned, while they have it working. If only Mulgara had a dozen programmers still working on it. Oh well - if wishes were fishes...

Thinking of using a flat file with raw writes going into it reminds me of a related architectural piece.

One of the very first features we decided on for XA2 was an ability to have several concurrent writers. The phase trees and two-phase commit process we employ at the moment are perfect for maintaining ACID compliance, but they limit us to only a single writer at a time. If we wanted to go to multiple writers then we'd need some form of merging for multiple concurrent transactions. While it might be possible to merge branches of a phase tree, this approach is guaranteed to make ACID a total nightmare. Besides, it wouldn't really be a phase tree anymore if you try to merge branches (and nodes within those branches). The way to manage this is the way it's always done, and that is to keep transaction logs.

A transaction log is a list of two types of entry: insertions and deletions (or whiteouts). This information is basically what I was describing above. Just as I already stated, the current transaction needs to search this list along with the index whenever a query is resolved.
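As a rough sketch of what resolving a query against both the log and the index might look like (in Python, with hypothetical names, and a plain set standing in for the real index):

```python
INSERT, WHITEOUT = "insert", "whiteout"


class Transaction:
    """Toy transaction (hypothetical): a shared index plus a private log."""

    def __init__(self, index):
        self.index = index  # committed triples; a set stands in for the index
        self.log = []       # ordered list of (operation, triple) entries

    def insert(self, triple):
        self.log.append((INSERT, triple))

    def delete(self, triple):
        self.log.append((WHITEOUT, triple))

    def contains(self, triple):
        # The most recent log entry for a triple wins over the index.
        for op, logged in reversed(self.log):
            if logged == triple:
                return op == INSERT
        # Nothing in the log, so fall back to the committed index.
        return triple in self.index
```

The index itself is never touched until commit, so other transactions keep seeing the committed data.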

Fortunately, RDF is very kind to us when it comes to keeping a log. Statements either exist, or they don't. This means we only need to consider merging logs where opposite operations are recorded, e.g. if one transaction removes a statement, and a concurrent one removes and re-inserts that statement. In fact, an argument can even be made for not caring about merging these things, since almost any sequence of actions is valid.
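One way to spot those cases is to reduce each log to its net effect per statement and compare. This is only a sketch under my own assumptions (log entries are `(operation, triple)` pairs, the last operation on a statement wins, and the function names are invented), not a worked-out merge policy:

```python
def net_effect(log):
    """Collapse a log to the final operation recorded for each triple."""
    effect = {}
    for op, triple in log:
        effect[triple] = op  # later entries overwrite earlier ones
    return effect


def conflicts(log_a, log_b):
    """Triples that both logs touch, but with opposite net operations."""
    a, b = net_effect(log_a), net_effect(log_b)
    return {t for t in a.keys() & b.keys() if a[t] != b[t]}
```

For the example above, a log that removes a statement has a net whiteout, while a log that removes and re-inserts it has a net insert, so `conflicts` reports that statement. Two logs that both end up removing it report nothing.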

Unfortunately, RDFS/OWL isn't anywhere near as forgiving. These systems rely on multiple RDF statements for their structures. While the open world may make this perfectly legal (for most purposes), any practical application trying to talk to inconsistent data is going to choke. Merging transaction logs with these constraints will be more challenging, especially if a simple RDF interface is being used for insertion/deletion of statements. An RDFS or OWL (or even SKOS) interface could keep track of the consistency of the structures, but trying to do this at the RDF level will require a lot of inferring of what the transaction is actually trying to accomplish. Indeed, to allow the commit of OWL data, you'll need a full reasoner to check for consistency before allowing the commit. The problem with doing something at the RDFS/OWL level is that we don't have an API at those levels. Hmmm, that reminds me that I need to get Sofa properly integrated again. That might even help us here.

But these are concerns for another time.

Back to the Conference

After the last talk on Monday we all went to the "Metaland" party. A presentation was going on, but it had been a long day for everyone, and there was free beer on offer. You can imagine how many people were paying attention. Maybe in future the beer can be handed out after the presentation, rather than before and during.

Amit and Rich from Topaz then went out to dinner and drinks with all of us from fourth codex. It was more fun than I thought any evening at a conference was going to be. If I embarrassed myself then no one has mentioned it... yet. Every phase of the evening seemed to involve a new type of drink.

While out, we had two people show up for fourth codex. The first was a new university graduate working for us in sales. This was her first day at work, and the commercial exhibits at the conference were on the next day. Talk about a trial by fire. The other guy was from the Exeura side of fourth codex, and had just flown in from Calabria, after 29 hours of traveling. He must have done something horrible in a previous life to have to go straight into a conference after a trip like that!

Reasoners

As promised, I checked the license for Pellet. It uses the MIT license, making it compatible with practically anything. Yay!
