Thursday, October 25, 2007

The Road to SPARQL

A long time ago, an interface was built for talking to Mulgara. At the time a query language was needed. We did implement RDQL from Jena in some earlier versions (that code is still in there), but quickly realized that we wanted more and slightly different functionality from what this language offered, and so TQL was born. Initially it was envisioned that there would be a text version for direct interaction with a user (interactive TQL, or iTQL), and an equivalent programmatic structure more appropriate for computers, written in XML (XML TQL, or xTQL). The latter was never developed, but this was the start of iTQL. (The lack of xTQL is the reason I've been advocating that we just call it TQL.)

But a language cannot exist in a vacuum. Queries have to go somewhere. This led to the development of the ItqlInterpreter class, and the associated developer interface, ItqlInterpreterBean. These classes accept queries in the form of strings, and give back the appropriate response.

So far so good, but here is where it comes unstuck. For some reason, someone decided that because the URIs (really, URLs) of Mulgara's graphs describe the server that the query should be sent to, then the interpreter should perform the dispatch as soon as it sees the server in the query string. This led to ItqlInterpreter becoming a horrible mess, combining grammar parsing and automatic remote session management.

I've seen this mess many times, and much as I'd like to have fixed it, the effort required was beyond my limited resources.

But now Mulgara needs to support SPARQL. While SPARQL is limited in its functionality, and imposes inefficiencies when interpreted literally (consider using OPTIONAL on a variable and then FILTERing on whether it is bound), the fact that it is a standard makes it extremely valuable both to the community and to projects like Mulgara.

SPARQL has its own communications protocol for the end user, but internally it makes sense for us to continue using our own systems, especially when we need to continue maintaining TQL compatibility. What we'd like to do then, is to create an interface like ItqlInterpreter, which can parse SPARQL queries, and use the same session and connection code as the existing system. This means we can re-use the entire system between the two languages, with only the interpreter class differing, depending on the query language you want to use.

Inderbir has built a query parser for me, so all we'd need to do would be to build a query AST with this parser, and we have the new SPARQLInterpreter class. There are a couple of missing features (like filter and optional support), but these are entirely compatible with the code we have now, and easily feasible extensions.

However, in order to make this happen, ItqlInterpreter finally had to be dragged (kicking and screaming) into the 21st century, by splitting up its functionality between parsing and connection management. This has been my personal project over the last few months, and I'm nearly there.

Connections

One of the things that I never liked about Mulgara is that I couldn't create connections for it like you do in databases like MySQL. There is the possibility of doing this with Session objects, but this has never been a user interface, and there are some operations that are not handled easily with sessions.

My solution was to create a Connection class. The idea is to create a connection to a server, and to issue queries on this connection. This allows some important functionality. To start with, it permits client control over their connections, such as connection pooling, and multiple parallel connections. It also enables the user to send queries to any server, regardless of the URIs described in the query. This is important for SPARQL, as it may not include a model name in the query, and so the server has to be established when forming the connection. This is also how the SPARQL network protocol works.
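To make the idea concrete, here is a minimal sketch of what connection-first querying looks like from the client side. The SketchConnection interface, the RecordingConnection stand-in and their method names are all made up for illustration and are not the real Mulgara API. The stand-in only records queries instead of talking to a server.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical interface: the server is fixed when the connection is made,
// so a query does not need to name it.
interface SketchConnection {
    URI getServerUri();
    String execute(String query);
    void close();
}

// Stand-in implementation that records queries rather than sending them.
class RecordingConnection implements SketchConnection {
    private final URI serverUri;
    private final List<String> sent = new ArrayList<String>();
    private boolean open = true;

    RecordingConnection(URI serverUri) {
        this.serverUri = serverUri;
    }

    public URI getServerUri() {
        return serverUri;
    }

    public String execute(String query) {
        if (!open) {
            throw new IllegalStateException("connection is closed");
        }
        sent.add(query);
        return "sent to " + serverUri + ": " + query;
    }

    // Exposed so a caller can inspect what went over the connection.
    public List<String> sentQueries() {
        return sent;
    }

    public void close() {
        open = false;
    }
}
```

A client would create one of these per server (or per pool slot) and then call something like execute("SELECT ?s WHERE { ?s ?p ?o }") on it. Note that the query contains no graph name at all, so the connection alone decides where it goes.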

Sending queries to any server was not possible until recently, as a server always presumed that it held the queried graphs locally. However, serendipity lent a hand here, as I created the DistributedQuery resolver just a few months ago, enabling this functionality.

As an interesting aside, when I was creating the Connection code I discovered an unused class called Connection. This was written by Tom, and the accompanying documents explained that this was going to be used to clean up ItqlInterpreter, deprecating the old code. It was never completed, but it looks like I wasn't the only one who decided to fix things this way. It's a shame Tom didn't make further progress, or else I could have merged my effort in with his (saving myself some time).

Compatibility

So I don't break any existing Mulgara clients, my goal has been to provide 100% compatibility with ItqlInterpreterBean. To accomplish this, I've created a new AutoInterpreter class, which internally delegates query parsing to the real interpreter. The resulting AST is then queried for the desired server, and a connection is made, using caching wherever possible. Once this is set up, the query can be sent across the connection.
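The dispatch step can be sketched roughly as follows. This is not the actual AutoInterpreter: the AutoDispatcher and FakeConnection names are hypothetical, a regular expression stands in for asking the real AST, and it assumes the server URI is simply the graph URI with its fragment removed.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of "find the server in the query, then reuse a cached connection".
class AutoDispatcher {

    // Stand-in for a real connection; it only counts the queries it receives.
    static class FakeConnection {
        final URI server;
        int queryCount = 0;

        FakeConnection(URI server) {
            this.server = server;
        }

        void execute(String query) {
            queryCount++;
        }
    }

    // Assumption: the graph is the first <...> after "from". The real code
    // would get this from the parsed AST rather than from a regex.
    private static final Pattern FROM_GRAPH =
        Pattern.compile("\\bfrom\\s+<([^>]+)>", Pattern.CASE_INSENSITIVE);

    private final Map<URI, FakeConnection> cache = new HashMap<URI, FakeConnection>();

    // Assumption: server URI = graph URI minus its fragment.
    static URI serverFor(String query) {
        Matcher m = FROM_GRAPH.matcher(query);
        if (!m.find()) {
            throw new IllegalArgumentException("no graph URI found in query");
        }
        URI graph = URI.create(m.group(1));
        try {
            return new URI(graph.getScheme(), graph.getSchemeSpecificPart(), null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Returns the cached connection for the query's server, creating it on first use.
    FakeConnection connectionFor(String query) {
        URI server = serverFor(query);
        FakeConnection c = cache.get(server);
        if (c == null) {
            c = new FakeConnection(server);
            cache.put(server, c);
        }
        return c;
    }

    void execute(String query) {
        connectionFor(query).execute(query);
    }
}
```

With this shape, two queries against different graphs on the same server share one connection, which is the caching behavior described above.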

This took some work, but it now appears to mostly work. I initially had a few bugs where I forgot certain cases in the TQL syntax, such as backing up to the local client rather than the default of the remote server. But I have overcome many of these now.

Debugging

The main problem has been where ItqlInterpreterBean was built to allow specified sessions to be set, or operations that were set to operate directly on specified sessions. I think I'm emulating most of this behavior correctly, but it's taken me a while to track them all down.

I still have a number of failures and errors in my tests, but I'm working through them all quickly, so I'm pretty happy about the progress. Each time I fix a single bug, the number of problems drops by a dozen or more. The latest one I found was where the session factory is being directly invoked for graphs with a "file:" scheme in the URI. There is supposed to be a fallback to another session factory that knows how to handle protocols that aren't rmi, beep, or local, but it seems that I've missed it. It shouldn't be too hard to find in the morning.

One bug that has me concerned looks like some kind of resource leak. The test involved creates an ItqlInterpreterBean 1000 times. On each iteration a query is invoked on the object, and then the object is closed before the loop is iterated again. For some reason this is consistently failing at about the 631st iteration. I've added in some more logging, so I'm hoping to see the specific exception the next time I go through the tests.

One thing that caught me out a few times was the set of references to the original ItqlInterpreter, ItqlSession and ItqlSessionUI classes. So yesterday I removed all references to these classes. This necessitated a cleanup of the various Ant scripts which defined them as entry points, but everything appears to work correctly now, which gives me hope that I got it all. The new code is now called TqlInterpreter, TqlSession and TqlSessionUI. While the names are similar, and the functionality is the same, most of it was re-written from the ground up. This gave me a more intimate view of the way these classes were built, leading to a few surprises.

One of the things the old UI code used to do was to block on reading a pipe, while simultaneously handling UI events. Only this pipe was never set to anything! It was totally dead code, but would never have been caught by a code usage analyzer, as it got run all the time (it could just never do anything). I decided to address this by having it read from and process the standard input of the process (I suspect this was the initial intent, but I'm not sure). I don't know how useful it is, but it's sort of cute, as I can now send queries to the standard input while the UI is running, and not just paste into the UI.
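One way to get this behavior, sketched under my own assumptions rather than copied from TqlSessionUI, is a reader that runs on a background thread and hands each non-empty input line to a queue that the UI drains. The StdinQueryReader name and the one-query-per-line convention are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: reads queries from an input stream and queues them,
// so the thread handling UI events never blocks on the read.
class StdinQueryReader implements Runnable {
    private final BufferedReader in;
    private final BlockingQueue<String> queries;

    StdinQueryReader(InputStream source, BlockingQueue<String> queries) {
        this.in = new BufferedReader(new InputStreamReader(source));
        this.queries = queries;
    }

    public void run() {
        try {
            String line;
            // readLine() returns null at end of input, which ends the loop.
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.length() > 0) {
                    queries.add(line); // assumption: one query per line
                }
            }
        } catch (IOException e) {
            // The stream was closed underneath us; stop reading quietly.
        }
    }
}
```

In a real UI this would be started with something like new Thread(new StdinQueryReader(System.in, queue)), marked as a daemon thread so it does not keep the process alive after the window closes.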

I've added a few little features like this as I've progressed, though truth be told I can't remember them all! :-)

The other major thing I've been doing has been to fix up the bad code formatting that was imposed at some point in 2005, and to add generics. Sometimes this has proven to be difficult, but in the end it's worth it. It's not such a big issue with already working code, but generics make updating significantly easier, both by documenting what goes in and out of collections, and by doing some checking on the types being used. Unfortunately, there are some strange structures that make generics difficult (trees built from maps with values that are also maps), so some of this work was time consuming. On the plus side, it's much easier to see what that code is now doing.
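As a small illustration of the documentation benefit, here is a made-up two-level map of maps with its types spelled out. The PredicateCounts class is hypothetical and not taken from the Mulgara source, and it only covers the easy case where the nesting depth is fixed.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical example: subject -> (predicate -> number of statements).
// With raw types this would be a Map of Object, and every get() would need a cast.
class PredicateCounts {
    private final Map<String, Map<String, Integer>> index =
        new HashMap<String, Map<String, Integer>>();

    void add(String subject, String predicate) {
        Map<String, Integer> byPredicate = index.get(subject);
        if (byPredicate == null) {
            byPredicate = new HashMap<String, Integer>();
            index.put(subject, byPredicate);
        }
        Integer n = byPredicate.get(predicate);
        byPredicate.put(predicate, n == null ? 1 : n + 1);
    }

    int count(String subject, String predicate) {
        Map<String, Integer> byPredicate = index.get(subject);
        if (byPredicate == null) {
            return 0;
        }
        Integer n = byPredicate.get(predicate);
        return n == null ? 0 : n;
    }
}
```

The declaration of index tells the reader what each level holds, and the compiler rejects an attempt to put the wrong kind of value at either level.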

I hope to be through this set of fixes by the end of the week, so I can get a preliminary version of SPARQL going by next week. That will then let me start on those features of the SPARQL AST that we don't yet support.
