Monday, May 26, 2008

Mulgara Alpha

My last few weeks were spent trying to get Mulgara's SPARQL interfaces ready before the Semantic Technology Conference 2008. I met the criteria Amit (from Topaz) and I had agreed to beforehand, which allowed me to get out an Alpha release for the next version of Mulgara. There are still a couple of things missing, but the basics are all there now.

The road to SPARQL took a couple of turns I hadn't expected.

Back in February we were approached by Aduna, who asked if we would be willing to support a level of integration between Sesame and Mulgara. While none of the Mulgara developers had the time to work with them directly, we said that we would be very happy to support Aduna where we could. The majority of this work was done by James Leigh (a programmer who commands my respect more and more on a daily basis), and he was able to get it all done in remarkable time. Even more impressive was that his integration work is 100% SPARQL compliant, even though some of the underlying structure isn't quite there yet!

My own work was to:
  • Parse SPARQL queries.
  • Convert the parsed queries into the Mulgara Algebra.
  • Write new algebraic operations in the Mulgara query engine.
The work by Aduna was going to remove the need for the first and second tasks, but I had already completed the first by the time we heard from them, and most of the remaining work was needed for both the SAIL interface and my own SPARQL implementation. Given that, I decided to continue with my own interface, as there wasn't going to be much redundant work from that point onwards. Even with both interfaces working correctly, the SAIL API will be the one to use, as it also includes a SPARQL Protocol endpoint, which I haven't looked at yet.

While the SAIL integration may have appeared to be independent of my own work, it turned out that James's contribution was invaluable. His need to pass all the SPARQL tests drove a lot of my query engine work, pointing out both missing features and bugs I was unaware of. I still have a couple of things to go, but James has been able to work around them at the higher layers for the time being. These workarounds carry a performance penalty, but the missing pieces will be dealt with in the next couple of weeks.

Notable Feature Implementations

Language Tags

One missing feature that completely floored me was that Mulgara was not supporting language tags on untyped literals. It turns out that this was slated for addition just as Tucana was closed, which is why it never made it. Even so, I must admit that I was surprised that it took that long for this feature to be scheduled!

Fortunately, language tags were quick and easy to implement. The main issue was in the existing tests, as nearly half of our files use literals with language tags in them, and none of the "expected results" included them.

Repeating Variables

Another issue was in "basic graph patterns" that use a repeating variable. Mulgara already had some code to deal with this, but it was failing in most cases. Unfortunately, I responded to this as a "bug report", and fell into the trap of fixing the existing code. I got it working after a day, only to be told the next day that it still failed if the variable is repeated in the position of the graph name.
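To make the requirement concrete, here is a minimal Python sketch of what a repeated variable has to mean: every occurrence must bind to the same value, including in the graph position. It is a naive matcher over a hypothetical in-memory list of statements, not Mulgara's resolver and not the fix itself; the `ex:` names are made up for illustration.

```python
# Naive matcher: a repeated variable must bind to the same value everywhere.
# Illustration only; this is not how Mulgara resolves constraints.

def match(pattern, statement):
    """Return variable bindings if the statement matches the pattern, else None."""
    bindings = {}
    for term, value in zip(pattern, statement):
        if term.startswith("?"):
            # A variable seen earlier in this pattern must agree with its binding.
            if bindings.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return bindings

# Hypothetical statements as (subject, predicate, object, graph).
statements = [
    ("ex:a", "ex:knows", "ex:a", "ex:g1"),
    ("ex:a", "ex:knows", "ex:b", "ex:g1"),
    ("ex:g2", "ex:label", "ex:c", "ex:g2"),
]

def resolve(pattern):
    """Collect the bindings of every statement that matches the pattern."""
    return [b for b in (match(pattern, s) for s in statements) if b is not None]

# ?x repeated in subject and object positions.
same_subject_object = resolve(("?x", "ex:knows", "?x", "?g"))
# ?x repeated in subject and graph positions.
same_subject_graph = resolve(("?x", "?p", "?o", "?x"))
```

Only the first statement satisfies the first pattern, and only the third satisfies the second, which is the case where the variable is repeated in the graph position.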

At that point I stepped back from the problem, and realized that the solution was actually quite easy. All you need do is replace the repeating variable with a set of unique names, and create a conjunction of the constraint repeated with the variables in rotating positions. After mentioning this to Andrae he informed me that he'd worked this out a few years before (even though someone else was implementing the code at the time), but he forgot to let me know. Oh well, at least I'm doing it correctly now.

While looking to implement this fix, I realized that the best way to perform this substitution would be via Andrae's query transformation SPI. This lets you search through a query structure and replace elements with something more appropriate for the engine to work with. It was while working with this that I realized it gives me a tool for solving a problem I've had for some time.

Transitive

The trans feature in Mulgara is a mechanism that lets the user mark the predicate in a constraint as transitive. While it works really well, the syntax in TQL is ugly. However, the query transformer offers an alternative. Instead of wrapping a standard constraint in a trans(...) operator, the predicate can be typed as being transitive in a separate constraint. I was tempted to use the URI of owl:TransitiveProperty for this task, but this will interfere with declarations in ontologies, so a local URI will be much more appropriate (something like mulgara:TransitivePredicate). The really cool thing is that this will be shareable with SPARQL queries as well. That means we can start opening some of our functionality up to SPARQL users without needing to extend the syntax of that language. In fact, there are a few functions we can implement in this way, allowing us to do a lot in SPARQL without sacrificing the speed and functionality of TQL.
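As a rough illustration of the idea, the following Python sketch walks a conjunction of constraints, finds predicates that a separate constraint types as transitive, and wraps the matching constraints in a stand-in for the trans(...) operator. Everything here is hypothetical: the `Constraint` and `Trans` classes, the `rewrite` function, and the prefixed names are assumptions for the example and do not reflect the actual transformation SPI or Mulgara's internal classes.

```python
from dataclasses import dataclass

RDF_TYPE = "rdf:type"
# Hypothetical marker URI, following the local name suggested above.
TRANSITIVE = "mulgara:TransitivePredicate"

@dataclass(frozen=True)
class Constraint:
    subject: str
    predicate: str
    obj: str

@dataclass(frozen=True)
class Trans:
    """Stand-in for a constraint wrapped in a trans(...) operator."""
    inner: Constraint

def is_marker(c):
    """True for a constraint that types its subject as a transitive predicate."""
    return c.predicate == RDF_TYPE and c.obj == TRANSITIVE

def rewrite(conjunction):
    """Wrap constraints whose predicate is marked transitive; drop the markers."""
    marked = {c.subject for c in conjunction if is_marker(c)}
    result = []
    for c in conjunction:
        if is_marker(c):
            continue  # the marker only directs the rewrite
        result.append(Trans(c) if c.predicate in marked else c)
    return result

query = [
    Constraint("?x", "rdfs:subClassOf", "ex:Animal"),
    Constraint("rdfs:subClassOf", RDF_TYPE, TRANSITIVE),
    Constraint("?x", "rdfs:label", "?name"),
]
rewritten = rewrite(query)
```

The point of the design is that the input is an ordinary conjunction of triple patterns, so the same query shape could come from either TQL or SPARQL without new syntax.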

Date Times

One question I regularly received from James was about date times. Unfortunately, Mulgara stores these canonically (using UTC), and hence does not round-trip these values. The solution is to store the timezone offset along with the value. Another tricky thing is to record whether a time of "midnight" was written as "00:00:00" or as "24:00:00", as both are valid, and each needs to be returned as it was provided, not in a normalized form. I haven't done this one yet, but I expect to get it done by the end of the week.
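A minimal sketch of that approach in Python, assuming the stored record is the canonical UTC value plus two extra fields: the original offset in minutes and a flag for the "24:00:00" form. The `StoredDateTime`, `store`, and `restore` names are made up for the example, and it only handles the plain `±hh:mm` form (no `Z`, no fractional seconds, and `-00:00` would come back as `+00:00`).

```python
import re
from datetime import datetime, timedelta
from typing import NamedTuple

LEXICAL = re.compile(
    r"(\d{4})-(\d\d)-(\d\d)T(\d\d):(\d\d):(\d\d)([+-])(\d\d):(\d\d)$"
)

class StoredDateTime(NamedTuple):
    utc: datetime        # canonical UTC value
    offset_minutes: int  # original timezone offset
    end_of_day: bool     # True when the lexical form used 24:00:00

def store(lexical: str) -> StoredDateTime:
    m = LEXICAL.match(lexical)
    if m is None:
        raise ValueError(f"unsupported lexical form: {lexical}")
    year, month, day, hour, minute, second = (int(g) for g in m.group(1, 2, 3, 4, 5, 6))
    sign = 1 if m.group(7) == "+" else -1
    offset = sign * (int(m.group(8)) * 60 + int(m.group(9)))
    end_of_day = hour == 24
    if end_of_day and (minute or second):
        raise ValueError("hour 24 is only valid as 24:00:00")
    local = datetime(year, month, day, 0 if end_of_day else hour, minute, second)
    if end_of_day:
        local += timedelta(days=1)  # 24:00:00 is midnight starting the next day
    return StoredDateTime(local - timedelta(minutes=offset), offset, end_of_day)

def restore(value: StoredDateTime) -> str:
    local = value.utc + timedelta(minutes=value.offset_minutes)
    if value.end_of_day:
        local -= timedelta(days=1)  # go back to the day that was written
        clock = "24:00:00"
    else:
        clock = f"{local.hour:02d}:{local.minute:02d}:{local.second:02d}"
    sign = "+" if value.offset_minutes >= 0 else "-"
    hours, minutes = divmod(abs(value.offset_minutes), 60)
    date = f"{local.year:04d}-{local.month:02d}-{local.day:02d}"
    return f"{date}T{clock}{sign}{hours:02d}:{minutes:02d}"
```

With this layout, "2008-05-26T24:00:00-05:00" and "2008-05-27T00:00:00-05:00" share the same UTC value, so they still compare as equal, but each is returned in the form it arrived in.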

I had a comment from Andy Seaborne that despite timezones being described in hours and minutes, this only requires a resolution of quarter-hour intervals, so I can probably squeeze this into some existing storage somewhere. I appreciate the advice, but it leaves me wondering which timezone appears with a 15 minute offset from its nearest neighbors!
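To see how little space that would need, here is a small Python sketch that packs an offset into quarter-hour units. It assumes offsets stay within ±14:00 (the range XML Schema uses for timezones) and are always multiples of 15 minutes, as suggested above; the function names are made up for the example.

```python
QUARTER = 15                     # minutes per unit
MAX_UNITS = 14 * 60 // QUARTER   # 56 quarter-hours, assuming offsets within ±14:00

def pack_offset(offset_minutes: int) -> int:
    """Map an offset in minutes to a small non-negative integer."""
    if offset_minutes % QUARTER != 0:
        raise ValueError("offset is not a multiple of 15 minutes")
    units = offset_minutes // QUARTER
    if not -MAX_UNITS <= units <= MAX_UNITS:
        raise ValueError("offset outside ±14:00")
    return units + MAX_UNITS     # shifted into 0..112, which fits in 7 bits

def unpack_offset(packed: int) -> int:
    """Inverse of pack_offset: recover the offset in minutes."""
    return (packed - MAX_UNITS) * QUARTER
```

That gives 113 possible values, so the offset fits in seven bits, leaving one bit of a byte free for something like the "24:00:00" flag.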

In the meantime, James got around the problem by removing the xsd:dateTime specific code from the version of Mulgara he is working with, so it gets treated as an unknown type. This modification can be removed as soon as I fix the issue (which I expect to be by the end of this week).

Memorial Day

There is still an enormous amount of information to cover on Mulgara, SPARQL, and especially the SemTech conference, but I'm falling asleep as I type. It's currently Memorial Day here in the USA, and since getting back from the conference on Friday night, I've had a huge weekend with my family. Yesterday I took both of the boys in a trailer for "Bike the Drive", which is a lot more cycling than I've done for a few months. Swimming and running have kept me relatively fit, but it still tired me out! Consequently I just can't think now, so I'll pick this up again later.