Monday, May 28, 2007

SemTech Day 1 (Still)

After David and Brian's talk I went to lunch, only to happen on Amit and Rich from Topaz. Topaz is an Open Source development group for the Public Library of Science (PLoS).

For anyone who doesn't know, it was Topaz, along with the Software Freedom Law Center, who came to our defense when we encountered difficulties dealing with NGC last year. An explanation of this, along with the reason for the name change from Kowari to Mulgara, can be heard in a recent podcast interview that David had with Paul Miller of Talis. Topaz has been developing with Mulgara, and used it in the deployment of the PLoS ONE system, which is why our logo made it onto their front page. More recently, they have started contributing to the project, and have even provided some direct funding. I like these guys. :-)

Lunch was spent discussing what Mulgara is doing, and where we hope to take it. Amit also asked about some of the internal operation of Mulgara, and the resulting description took me well into the next session. I hope I didn't miss anything important, but right now I can't even recall what I was hoping to attend, so with that justification I don't feel so bad. The discussions with Amit were worthwhile anyway, and it is my opinion that these impromptu meetings provide the real value of any conference.

AllegroGraph

I wrapped the discussion up in time to attend a talk I was very interested in, by Richard Newman of Franz Inc. It went by the unwieldy title of Working with SPARQL - Large Datasets, Fast Queries: What You Need to Know. This sounded fascinating, and I wasn't disappointed.

I'm embarrassed to say that I didn't know much about Franz Inc, nor the AllegroGraph RDF database. Such an obvious lapse is a clear indication that I need to get out on the web more. Still, Richard's talk and subsequent Q&A went a long way toward describing the system for me. In some cases he described internal details directly; in others, his explanation of how features were implemented made it clear exactly what they had done.

It appears that AllegroGraph is architecturally similar to Mulgara in several ways, so I'm going to provide a bit of a comparison here.

Their approach to querying has had the advantage of being directly applied to SPARQL, which Mulgara is still working towards (hopefully soon), but since the basic structure of TQL and SPARQL is the same, the parallels are easy to draw. There are some distinctions between the two projects, but most of the extras in AllegroGraph are already in the pipeline for Mulgara, such as storing 5-tuples, and a tableaux reasoner (they have Racer, while I'd like to bring in Pellet, which reminds me that I need to check Pellet's license).

Richard's explanation of simple unordered joins returning an immediate solution with no consumption of server memory revealed that AllegroGraph contains the full 6 indexes we use, and also does the same lazy join evaluation as Mulgara. However, when we move to 5-tuples we will only be putting a tuple ID into the first index (and not the remaining 5 indexes). We will also be introducing a new index based on this ID, so that a tuple can be found based on its ID. Surprisingly, according to their introduction, AllegroGraph doesn't appear to have this index. Then again, maybe that's just a documentation issue. Incidentally, AllegroGraph seems to have a tuple ID for every entry, while our plan is to only include it for those statements which have been reified. This won't affect the storage requirements on the first index (the one being widened to 5), but will save space and time on the new index, which will only get entries when that statement gets reified.
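To make the "6 indexes plus lazy joins" idea concrete, here is a minimal Python sketch. It is not Mulgara or AllegroGraph code: the class name `QuadStore`, the `join` helper, and the particular six column orderings are assumptions chosen for illustration. The point is that six sorted orderings are enough for any combination of bound and unbound positions to be answered by a prefix scan, and that a join built on generators can hand back its first solution without materializing anything.

```python
from bisect import bisect_left

# Six orderings of the (subject, predicate, object, graph) positions.
# Assumed for illustration; the post does not list the actual orderings.
ORDERINGS = [(0, 1, 2, 3), (1, 2, 0, 3), (2, 0, 1, 3),
             (3, 0, 1, 2), (3, 1, 2, 0), (3, 2, 0, 1)]


class QuadStore:
    """Toy quad store holding every quad once per ordering, sorted."""

    def __init__(self, quads):
        quads = list(quads)
        self.indexes = {
            order: sorted(tuple(q[i] for i in order) for q in quads)
            for order in ORDERINGS
        }

    def find(self, pattern):
        """Lazily yield quads matching pattern; None marks an unbound position."""
        bound = {i for i, v in enumerate(pattern) if v is not None}
        # Choose an ordering whose leading columns are exactly the bound ones,
        # so that all matches sit in one contiguous range of the index.
        order = next(o for o in ORDERINGS if set(o[:len(bound)]) == bound)
        prefix = tuple(pattern[i] for i in order[:len(bound)])
        index = self.indexes[order]
        pos = bisect_left(index, prefix)
        while pos < len(index) and index[pos][:len(prefix)] == prefix:
            row = index[pos]
            quad = [None] * 4
            for col, i in enumerate(order):
                quad[i] = row[col]  # undo the column permutation
            yield tuple(quad)
            pos += 1


def join(store, pattern_a, pattern_b, pos_a, pos_b):
    """Lazy nested-loop join: bind column pos_a of each left row into pattern_b."""
    for left in store.find(pattern_a):
        probe = list(pattern_b)
        probe[pos_b] = left[pos_a]
        for right in store.find(tuple(probe)):
            yield left, right
```

Because both `find` and `join` are generators, asking for the first solution only walks as far into the indexes as that one solution requires, which is the "immediate solution with no consumption of server memory" behaviour described above.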

AllegroGraph is using an ID-to-object mapping structure just like our "String Pool". I've commented in the past how this is an historical name, based on it originally storing only URIs and string literals, and that it should be renamed given that it now stores all the XML datatypes (and even user types). Franz must have gone through a similar process with AllegroGraph, given that their equivalent structure is called a "String Dictionary".
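For readers who haven't met this kind of structure, the following is a rough Python sketch of an ID-to-object mapping. The class name `NodePool` and its methods are hypothetical, and a real string pool lives on disk rather than in a dict; this only shows the two-way mapping and why typed values fit in as easily as strings.

```python
class NodePool:
    """Toy two-way mapping between (lexical value, datatype) pairs and integer IDs."""

    def __init__(self):
        self._by_value = {}  # (value, datatype) -> id
        self._by_id = []     # position id - 1 -> (value, datatype)

    def intern(self, value, datatype=None):
        """Return the ID for a value, allocating a new one on first sight."""
        key = (value, datatype)
        node_id = self._by_value.get(key)
        if node_id is None:
            node_id = len(self._by_id) + 1  # IDs start at 1 in this sketch
            self._by_value[key] = node_id
            self._by_id.append(key)
        return node_id

    def lookup(self, node_id):
        """Return the (value, datatype) pair stored under an ID."""
        return self._by_id[node_id - 1]
```

The indexes then only ever hold fixed-width IDs, and the pool is consulted when loading data and when turning results back into URIs and literals.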

Getting back to the talk, Richard spent quite a bit of time talking about the ramifications of DISTINCT and ORDER BY. The DISTINCT keyword results in memory usage at the server end, though the result set is returned immediately. This is pretty obvious, given that the server need not do anything beyond storing (and indexing!) those rows already encountered so they are not duplicated.
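The duplicate-removal behaviour described above can be sketched in a few lines of Python. This is an illustration only (the function name `distinct` is mine, and a hash set stands in for whatever index a real server would use): memory grows with the number of different rows seen, yet each new row is passed on as soon as it arrives.

```python
def distinct(rows):
    """Yield each row the first time it is seen, remembering it afterwards."""
    seen = set()  # grows with the number of distinct rows
    for row in rows:
        if row not in seen:
            seen.add(row)
            yield row  # streamed out immediately, before reading further input
```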

More difficult is implementing ORDER BY. Richard referred to server memory, but I suspect he was including secondary memory in that description. In our own case, we use a class called HybridTuples. This is a class that I started, but Andrae was the one who really implemented it. Now that I think about it, he probably didn't even look at my code. I think this class may have been one of the first things he did for Mulgara. The class uses memory up to a point, but then uses a file. It would be nice to memory map it when it can fit, but on a 32-bit system that is just too difficult to test for in Java, and it could cause other connections to fail with an Out Of Memory error (OOM). So on 32-bit systems HybridTuples uses standard explicit IO for managing the file, while on 64-bit systems it uses memory mapping (yes, even if the file is over 2GB in size).
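As a rough picture of "memory up to a point, then a file", here is a small Python sketch of a sort that spills to disk. It is not how HybridTuples is written (that is Java, and I'm not reproducing its design here); the function name `hybrid_sort`, the `max_in_memory` threshold, and the use of pickled sorted runs merged at the end are all assumptions made for the example. It uses explicit file IO only, with no memory mapping.

```python
import heapq
import pickle
import tempfile


def _read_run(f):
    """Stream the rows back out of one spilled, already-sorted run."""
    f.seek(0)
    while True:
        try:
            yield pickle.load(f)
        except EOFError:
            return


def hybrid_sort(rows, max_in_memory=100_000):
    """Sort rows, spilling sorted runs to temporary files past a size threshold."""
    runs = []
    buffer = []
    for row in rows:
        buffer.append(row)
        if len(buffer) >= max_in_memory:
            buffer.sort()
            f = tempfile.TemporaryFile()  # binary read/write by default
            for r in buffer:
                pickle.dump(r, f)
            runs.append(f)
            buffer = []
    buffer.sort()
    if not runs:
        # Everything fitted in memory, so no file was ever created.
        yield from buffer
        return
    try:
        # Merge the in-memory remainder with the on-disk runs.
        yield from heapq.merge(buffer, *(_read_run(f) for f in runs))
    finally:
        for f in runs:
            f.close()
```

Unlike duplicate removal, nothing can be returned until the whole input has been read, which is why ORDER BY is the more expensive of the two.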

All up, the talk did not provide me with many useful tips for SPARQL, though that may have been due to my experience with Mulgara. However, I found the description of AllegroGraph, its features and internal details to be invaluable.

After the talk Amit offered to introduce me to Jans Aasman, who is the director of engineering at Franz. When he heard that I was involved with Mulgara he made a comment I was to hear several times during the week, "It was a real shame. You guys were way ahead of the curve 2 years ago!" This comment was a great endorsement of our work, and yet very frustrating that a great opportunity was lost. All the same, Mulgara is starting to move along nicely again, so we should still have something worthwhile to offer.

I also commented to Jans about the internal structure of AllegroGraph, and he admitted that there isn't much you can do to hide the implementation of these things. I suppose much of what we do is straightforward. It's just a matter of competent people integrating the right components and techniques. On those few occasions where you DO come up with something new (like when we started indexing 3 ways on triples, and 6 ways on quads) then it doesn't take long before everyone does it. While I may form an emotional attachment to my own ideas (see the section marked "Indexing" for the reference here), I still love to see everyone helping everyone else improve the current state of the art (and hence, I hate patents, especially on software).

I'm not sure what others got out of it, but overall, I liked Richard's talk, though probably not for the insights into SPARQL, but rather the exposé of the engineering behind AllegroGraph.

4 comments:

jansaasman said...

I enjoyed your blog, thanks, Jans

Kool Dude said...

hey paul
i was looking to use allegrograph for rdf triples storage and then using jena to create ontology models from retrieved triples. I am getting lost at the bridge since my onto models are created empty using allegro graph jena implementation. I am contemplating several options here: using jena SDB but which documents only using RDBMS, using JRDF which i am not sure how would match with graph returned by AG, moving to mulgara which i know supports TQL but would like to stick with something that supports SPARQL. I would appreciate any help.
thanks
-amit

Paula said...

Good question.

It's going to depend on what you want to do. Jena is good, but is too slow when you get a lot of data. JRDF should match any returned data pretty well, though I don't know what its storage system is like ATM. Using Mulgara with TQL is much more functional than SPARQL, but I totally agree that sticking to the standard is the way to go.

If you can figure out AllegroGraph (and you never exceed their size limit) then that would seem to be a good solution. If Jena is fast enough for you, then that would be good too. And if you can hang on just a little, Mulgara will have SPARQL. :-)

Chocolate Bark Recipes said...

Thanks for a greeat read