Thursday, October 25, 2007

The Road to SPARQL

A long time ago, an interface was built for talking to Mulgara. At the time a query language was needed. We did implement RDQL from Jena in some earlier versions (that code is still in there), but quickly realized that we wanted more, and slightly different, functionality than that language offered, and so TQL was born. Initially it was envisioned that there would be a text version for direct interaction with a user (interactive TQL, or iTQL), and an equivalent programmatic structure more appropriate for computers, written in XML (XML TQL, or xTQL). The latter was never developed, but this was the start of iTQL. (The lack of xTQL is the reason I've been advocating that we just call it TQL.)

But a language cannot exist in a vacuum. Queries have to go somewhere. This led to the development of the ItqlInterpreter class, and the associated developer interface, ItqlInterpreterBean. These classes accept queries in the form of strings, and give back the appropriate response.
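For anyone who hasn't seen it, client code drives ItqlInterpreterBean along these lines. This is only a sketch from memory, so treat the package and method names as approximate rather than the exact API:

    // Illustrative only: names are from memory and may not match the real API exactly.
    import org.mulgara.itql.ItqlInterpreterBean;
    import org.mulgara.query.Answer;

    public class ItqlClientSketch {
      public static void main(String[] args) throws Exception {
        ItqlInterpreterBean interpreter = new ItqlInterpreterBean();
        try {
          // The graph URI names the server, so the bean both parses the query
          // and works out where to send it.
          Answer answer = interpreter.executeQuery(
              "select $s $p $o from <rmi://localhost/server1#sampledata> " +
              "where $s $p $o;");
          answer.beforeFirst();
          while (answer.next()) {
            System.out.println(answer.getObject(0) + " "
                + answer.getObject(1) + " " + answer.getObject(2));
          }
          answer.close();
        } finally {
          interpreter.close();
        }
      }
    }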

So far so good, but here is where it comes unstuck. For some reason, someone decided that because the URIs (really, URLs) of Mulgara's graphs describe the server a query should be sent to, the interpreter should perform the dispatch as soon as it sees the server in the query string. This led to ItqlInterpreter becoming a horrible mess, combining grammar parsing with automatic remote session management.

I've seen this mess many times, and much as I'd like to have fixed it, the effort required was beyond my limited resources.

But now Mulgara needs to support SPARQL. While SPARQL is limited in its functionality, and imposes inefficiencies when interpreted literally (consider the common pattern of using OPTIONAL on a variable and then FILTERing on whether it is bound), the fact that it is a standard makes it extremely valuable both to the community and to projects like Mulgara.

SPARQL has its own communications protocol for the end user, but internally it makes sense for us to continue using our own systems, especially when we need to maintain TQL compatibility. What we'd like to do, then, is to create an interface like ItqlInterpreter which can parse SPARQL queries and use the same session and connection code as the existing system. This means we can re-use the entire system between the two languages, with only the interpreter class differing, depending on the query language you want to use.

Inderbir has built a query parser for me, so all we need to do is build a query AST with this parser, and we have the new SPARQLInterpreter class. There are a couple of missing features (like FILTER and OPTIONAL support), but these are entirely compatible with the code we have now, and are perfectly feasible extensions.

However, in order to make this happen, ItqlInterpreter finally had to be dragged (kicking and screaming) into the 21st century, by splitting its functionality between parsing and connection management. This has been my personal project over the last few months, and I'm nearly there.

Connections

One of the things that I never liked about Mulgara is that I couldn't create connections to it the way you do with databases like MySQL. There is the possibility of doing this with Session objects, but Session has never been a user-level interface, and some operations are not handled easily with sessions.

My solution was to create a Connection class. The idea is to create a connection to a server, and to issue queries on that connection. This enables some important functionality. To start with, it gives clients control over their connections, allowing things like connection pooling and multiple parallel connections. It also lets the user send queries to any server, regardless of the URIs described in the query. This is important for SPARQL, since a query may not include a graph name at all, so the server has to be established when forming the connection. This is also how the SPARQL network protocol works.
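To give an idea of what I mean, here is a rough sketch of the shape of the interface. The names and signatures are hypothetical, not the final API:

    // Hypothetical sketch of the Connection interface described above.
    // Names and signatures are illustrative only.
    import java.net.URI;

    public interface Connection {

      /** The server this connection talks to, fixed when the connection is created. */
      URI getServerUri();

      /**
       * Execute a query (TQL or SPARQL) against this server, regardless of any
       * graph URIs that appear in the query text. The return type is left
       * abstract here; in practice it would be Mulgara's Answer class.
       */
      Object execute(String query) throws Exception;

      /** Release the underlying session, or return it to a connection pool. */
      void close() throws Exception;
    }

A client holds one or more of these (possibly from a pool) and issues queries over them, which is exactly the model the SPARQL protocol expects.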

Sending queries to an arbitrary server was not possible until recently, as a server always presumed that it held the graphs being queried locally. However, serendipity lent a hand here, as I created the DistributedQuery resolver just a few months ago, enabling exactly this functionality.

As an interesting aside, when I was creating the Connection code I discovered an unused class called Connection. This was written by Tom, and the accompanying documents explained that it was going to be used to clean up ItqlInterpreter, deprecating the old code. It was never completed, but it looks like I wasn't the only one who decided to fix things this way. It's a shame Tom didn't make further progress, or I could have merged my effort with his (saving myself some time).

Compatibility

So as not to break any existing Mulgara clients, my goal has been to provide 100% compatibility with ItqlInterpreterBean. To accomplish this, I've created a new AutoInterpreter class, which internally delegates query parsing to the real interpreter. The resulting AST is then queried for the desired server, and a connection is made, with caching used wherever possible. Once this is set up, the query can be sent across the connection.
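In outline, the flow looks something like the following. Every name here is an illustrative placeholder rather than one of the real classes:

    // Sketch of the delegation described above; placeholder types only.
    import java.net.URI;
    import java.util.HashMap;
    import java.util.Map;

    public class AutoInterpreterSketch {

      /** Connections cached by server URI, so repeated queries reuse them. */
      private final Map<URI, Connection> connections = new HashMap<URI, Connection>();

      private final Interpreter interpreter;  // the real TQL (or SPARQL) parser

      public AutoInterpreterSketch(Interpreter interpreter) {
        this.interpreter = interpreter;
      }

      public Object execute(String queryString) throws Exception {
        Query ast = interpreter.parse(queryString);  // 1. delegate parsing
        URI server = ast.getServerUri();             // 2. ask the AST for its server
        Connection conn = connections.get(server);   // 3. reuse a cached connection...
        if (conn == null) {
          conn = ConnectionFactory.connect(server);  //    ...or create a new one
          connections.put(server, conn);
        }
        return conn.execute(ast);                    // 4. send the query across it
      }

      // Placeholder types standing in for the real Mulgara classes.
      interface Interpreter { Query parse(String query) throws Exception; }
      interface Query { URI getServerUri(); }
      interface Connection { Object execute(Query query) throws Exception; }

      static class ConnectionFactory {
        static Connection connect(URI server) {
          return new Connection() {
            public Object execute(Query query) { return null; }  // stub
          };
        }
      }
    }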

This took some effort, but it now appears to mostly work. I initially had a few bugs where I had forgotten certain cases in the TQL syntax, such as backing up to the local client rather than the default of the remote server, but I have overcome most of these now.

Debugging

The main problem has been that ItqlInterpreterBean was built to allow callers to set specific sessions, or to run operations directly against a given session. I think I'm emulating most of this behavior correctly, but it has taken me a while to track all of these cases down.

I still have a number of failures and errors in my tests, but I'm working through them all quickly, so I'm pretty happy about the progress. Each time I fix a single bug, the number of problems drops by a dozen or more. The latest one I found was where the session factory is being directly invoked for graphs with a "file:" scheme in the URI. There is supposed to be a fallback to another session factory that knows how to handle protocols that aren't rmi, beep, or local, but it seems that I've missed it. It shouldn't be too hard to find in the morning.

One bug that has me concerned looks like some kind of resource leak. The test in question creates an ItqlInterpreterBean 1000 times. On each iteration a query is invoked on the object, and then the object is closed before the loop iterates again. For some reason this consistently fails at about the 631st iteration. I've added some more logging, so I'm hoping to see the specific exception the next time I go through the tests.
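The test is roughly the following shape. This is a paraphrase rather than the actual test source, and the method names are from memory:

    // Paraphrase of the failing test; not the actual test code.
    import org.mulgara.itql.ItqlInterpreterBean;

    public class InterpreterChurnTest {
      public static void main(String[] args) throws Exception {
        for (int i = 0; i < 1000; i++) {
          ItqlInterpreterBean bean = new ItqlInterpreterBean();
          try {
            // Any simple query will do; the point is the churn of beans.
            bean.executeQuery(
                "select $s $p $o from <rmi://localhost/server1#sampledata> " +
                "where $s $p $o;");
          } finally {
            bean.close();  // closed every iteration, yet something still runs out
          }
        }
      }
    }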

One thing that caught me out a few times was the set of references to the original ItqlInterpreter, ItqlSession and ItqlSessionUI classes. So yesterday I removed all references to these classes. This necessitated a cleanup of the various Ant scripts which defined them as entry points, but everything appears to work correctly now, which gives me hope that I got it all. The new code is now called TqlInterpreter, TqlSession and TqlSessionUI. While the names are similar, and the functionality is the same, most of it was re-written from the ground up. This gave me a more intimate view of the way these classes were built, leading to a few surprises.

One of the things the old UI code used to do was block on reading a pipe, while simultaneously handling UI events. Only this pipe was never connected to anything! It was completely dead code, but it would never have been caught by a code usage analyzer, since it ran all the time (it could just never do anything). I decided to address this by having it read from, and process, the standard input of the process (I suspect this was the original intent, but I'm not sure). I don't know how useful it is, but it's sort of cute, as I can now send queries to the standard input while the UI is running, rather than just pasting into the UI.
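The change amounts to little more than a background thread reading standard input and handing each line over as a query. Something like this minimal sketch, where sendQuery() is a stand-in for the real UI wiring:

    // Minimal sketch of reading queries from standard input on a separate
    // thread; sendQuery() is a placeholder for the real UI hand-off.
    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    public class StdinReaderSketch {
      public static void main(String[] args) {
        Thread reader = new Thread(new Runnable() {
          public void run() {
            try {
              BufferedReader in =
                  new BufferedReader(new InputStreamReader(System.in));
              String line;
              // Block on standard input while the UI keeps handling its own events.
              while ((line = in.readLine()) != null) {
                sendQuery(line);  // hand the text over as if it had been typed
              }
            } catch (Exception e) {
              e.printStackTrace();
            }
          }
        });
        reader.setDaemon(true);
        reader.start();
        // ... the UI event loop carries on in its own thread as usual ...
      }

      static void sendQuery(String query) {
        System.out.println("query received: " + query);
      }
    }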

I've added a few little features like this as I've progressed, though truth be told I can't remember them all! :-)

The other major thing I've been doing has been to fix up the bad code formatting that was imposed at some point in 2005, and to add generics. Sometimes this has proven difficult, but in the end it's worth it. It's not such a big issue for code that already works, but generics make updating significantly easier, both by documenting what goes in and out of collections, and by having the compiler check the types being used. Unfortunately, there are some strange structures that make generics difficult (trees built from maps with values that are also maps), so some of this work was time consuming. On the plus side, it's much easier to see what that code is now doing.
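As a contrived example of the kind of structure I mean (not actual Mulgara code), generics at least let the shape of those nested maps show up in the type:

    // Contrived example: a "tree" built from maps whose values are also maps.
    import java.util.HashMap;
    import java.util.Map;

    public class NestedMapSketch {
      public static void main(String[] args) {
        // The type documents the shape: subject -> (predicate -> object).
        Map<String, Map<String, String>> tree =
            new HashMap<String, Map<String, String>>();

        Map<String, String> properties = new HashMap<String, String>();
        properties.put("dc:title", "The Road to SPARQL");
        tree.put("ex:post", properties);

        // The compiler now checks what goes in and out of the collections.
        String title = tree.get("ex:post").get("dc:title");
        System.out.println(title);
      }
    }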

I hope to be through this set of fixes by the end of the week, so I can get a preliminary version of SPARQL going by next week. That will then let me start on those features of the SPARQL AST that we don't yet support.

Web Service Descriptions

Whenever I design ontologies and the tools to use them, I find myself thinking that I should somehow be describing more, and having the computer do more of the work for me. Currently OWL is about describing general structure and relationships, but I keep feeling like ontologies should describe behavior as well.

I'm not advocating the inclusion of an imperative style of operation description here. If that were the case, then we'd just be building programming languages in RDF (something I've been thinking about for a while; after all, it seems natural to represent ASTs or list structures in RDF). I'm thinking more along the lines of OCL for UML. This is a far more declarative approach, and describes what things are, rather than how they do it. As we were taught in first-year programming, OCL is all about describing the pre-conditions and post-conditions of functions.

The main problem I have with OCL is that it looks far too much like a programming language, which is the very thing I'd like to avoid. All the same, it's on the right track. It would be interesting to see something like it in OWL, though OWL still has a long way to go before it is ready for this. The other problem is that OCL is about constraining a description, which seems to be at odds with the open-world model in OWL.

Still, the world appears to be ready for OWL to have something like this, even if OWL isn't ready to include it. Maybe I should be building an RDF/Lisp interpreter after all? :-)

The reason for me thinking along these lines is that I'd like to be able to describe web services in a truly interoperable way. Today we have WSDL and OWL-S, which are very good at describing the names of services and how to interface to them, but do little to describe what those services do. If we could really describe a service correctly, then any client could connect to an unknown server and discover exactly what that server was capable of and how to talk to it, all without user interaction, and without any prior understanding of what that server could do.

Ultimately, I think this is what we are striving for with the Semantic Web. It is a long way beyond us today, but the evolution of hardware and software in computers over the last 60 years has taught us that amazing things are achievable, so long as we take it one step at a time.

To a limited extent, there are already systems like BioMOBY, which use ontologies to query about services, and can work out for themselves how to connect to remote services and chain them together to create entirely new functions. There are still assumptions made about what kind of data is there and how to talk to it, but it includes a level of automation that is astounding for anyone familiar with WSDL standards.

When I last saw BioMOBY nearly 3 years ago, they were using their own RDF-like data structures, and were considering moving to RDF to gain the benefit of using public standards. I should check them out again, and see where they went with that. They certainly had some great ideas that I'd like to see implemented in a more general context.

Monday, October 22, 2007

Theory and Practice

The development of OWL comes from an extensive history of description logic research. This gives it a solid theoretical foundation for developing systems on, but that still doesn't make it practical.

There are numerous practical systems out there which do not have a solid theoretical foundation (and this can sometimes bite you when you really try to push a system), but can still be very useful for real world applications. After all, Gödel showed us that any sufficiently powerful formal system will be either incomplete or inconsistent (maybe that explains why Quantum Physics and General Relativity cannot both be right: since Gödel's theorem requires that there be an inconsistency somewhere in our description of the universe, maybe that's it). :-)

If theoretically solid systems are not an absolute requirement for practical applications, then are they really needed? I'd like to think so, but I don't have any proof of this. In fact, the opposite seems to be true. Systems with obvious technical flaws become successful, while those with a good theoretical underpinning languish. There are many reasons for failure, with social and marketing problems being among the more common. Ironically, the problems can also be technical.

John Sowa's essay on Fads and Fallacies about Logic describes how logic was originally derived from trying to formalize statements made in natural language. He also mentions that a common complaint made about modern logic is its unreadability. But this innocuous statement doesn't do the evolution justice. Consider this example in classical logic:
  • All men are mortal.
  • Socrates is a man.
  • Therefore: Socrates is mortal.
Now look at the following definition of modal logic:

Given a set of propositional letters p₁, p₂, ..., the set of formulae of the modal logic K is the smallest set that:
  • contains p₁, p₂, ...,
  • is closed under the Boolean connectives ∧, ∨ and ¬, and
  • if it contains Φ, then it also contains □Φ and ◇Φ.
The semantics of modal formulae is given by Kripke structures M = ⟨S, π, K⟩, where S is a set of states, π is a projection of propositional letters to sets of states, and K is the accessibility relation, a binary relation on the states S. Then, for a modal formula Φ and a state s ∈ S, the expression M, s ⊨ Φ is read as "Φ holds in M in state s". So:
  • M, s ⊨ pᵢ     iff s ∈ π(pᵢ)
  • M, s ⊨ Φ₁ ∧ Φ₂    iff M, s ⊨ Φ₁ and M, s ⊨ Φ₂
  • M, s ⊨ Φ₁ ∨ Φ₂    iff M, s ⊨ Φ₁ or M, s ⊨ Φ₂
  • M, s ⊨ ¬Φ     iff M, s ⊭ Φ
  • M, s ⊨ □Φ     iff for all s' ∈ S, if (s, s') ∈ K then M, s' ⊨ Φ
  • M, s ⊨ ◇Φ     iff there exists s' ∈ S with (s, s') ∈ K and M, s' ⊨ Φ


(Courtesy of The Description Logic Handbook. Sorry if it doesn't render properly for you... I tried! The Unicode character ⊨ is rendered in Safari, but not in Firefox on my desktop, though it shows up on my notebook.)

While several generations removed, it can still be hard to see how this formalism descended from classical logic. Modal logic is on a firm theoretical foundation, and it is apparent that the above is a precise description, yet it is not the sort of tool that professional programmers are ever likely to use. This is because the complexity of understanding the formalism is a significant barrier to entry.

We see this time and again. Functional programming is superior to imperative programming in many instances, and yet the barrier to entry is too high for many programmers to use it. Many of the elements that were considered fundamental to Object Oriented Programming are avoided or ignored by many programmers, and are not even available in some supposedly Object Oriented languages (for instance, Java does not invoke methods by passing messages to objects). And many logic formalisms are overlooked, or simply not sufficiently understood for many of the applications for which they were intended.

After working with RDF, RDFS and OWL for some years now, I've started to come to the conclusion that these systems suffer from the same problems with the barrier to entry. It took me long enough to understand the complexities introduced by an open world model without a unique name assumption. Contrary to common assumption, RDFS domain and range are descriptive rather than prescriptive. And cardinality restrictions rarely create inconsistencies.

Part of the problem stems from the fact that non-unique names and an open world are a completely different set of assumptions from the paradigms that programmers have been trained to deal with. It takes a real shift in thinking to understand this. Also, computers are good at working with the data they have stored. Working with data that is not stored is more the domain of mathematics: a field that has been receiving less attention in the industry in recent years, particularly as professionals have moved away from "Computer Science" and into "Information Technology". Even those of us who know better still resort to the expediency of using many closed world assumptions when storing RDF data.

Giving RDF, RDFS, and OWL to the general world of programmers today seems like a recipe for implementations of varying correctness with little hope of interoperability - the very thing that these technologies were designed to enable.

However, RDF, RDFS and OWL were designed the way they are for very sound reasons. The internet is an open world. New information is being asserted all the time (and some information is being retracted, meaning that facts on the web are both temporal and non-monotonic, neither of which is dealt with by semantic web technologies, but let's deal with one problem at a time). There are often many different ways of referring to the same things (IP addresses and hostnames are two examples). URIs are the mechanism for identifying things on the internet, and while URIs may not be unique for a single resource, they do describe a single resource, and no other. All of these features were considered when RDF and OWL were developed, and the decisions made were good ones. Trying to build a system that caters to programmers' presumptions by ignoring these characteristics of the internet would be ignoring the world as it is.

So I'm left thinking that the foundations of RDF and OWL are correct, but somehow we have to present them in such a way that programmers don't shoot themselves in the foot with them. Sometimes I think I have some ideas, but it's easy to become disheartened.

Certainly, I believe that education is required. To some extent this has been successful, as I've seen significant improvement in developers' understanding (my own included) in the last few years. We also need to provide tools which help guide developers along the right path, even if that means restricting some of their functionality in some instances. These have started to come online, but we have a long way to go.

Overall, I believe in the vision of the semantic web, but the people who will make it happen are the people who will write software to use it. OWL seems to be an impediment to the understanding they require, and the tools for the task are still rudimentary. It leaves me wondering what can be done to help.