Friday, April 30, 2010

Topaz and Ehcache


Don't ask what I did 2 days ago, because I forget. It's one of the reasons I need to blog more. I also forgot because my brain got clouded due to yesterday's tasks.

I didn't get a lot done yesterday for the simple reason that I was filling in paperwork for an immigration lawyer. For anyone who has ever had to do this mind numbing task, they probably know that you end up filling in mostly the same things that you filled in 12 months ago, but just subtly different, so there's no possibility of using copy/paste. They will also know that getting all of the information together can take half a day. Strangely, the hardest piece of information was my mother's birthday (why do they want this? I have no idea). There is a bizarre story behind this, that I won't go into right now, but with my mother asleep in Australia there was no way to ask her. Fortunately, I had two brothers online at the time: one lives here in the USA, and the other is a student in Australia (who was up late drinking with friends, and decided to get online to say hi before going to bed). Unfortunately, neither of them new either (the well-oiled brother being completely unaware of why we didn't know).

But I finally got it done, cleared my immediate email queue (only 65 to go!) and got down to work.

My first task was to get the Topaz version of Mulgara up and running the same way it used to run 10 months ago. I had already tried going back through the subversion history for the project (ah, that's one of the things I did two days ago!), but with no success. However, I had been able to find out that others have had this error with Ehcache. No one had a fix, since upgrading the library normally made their problem go away. Well I tried upgrading it myself, but without luck. Evidently the problem was in usage, but I didn't know if it was a problem in the code talking to Ehcache, or the XML configuration file that is uses. Since everything used to work without error, I figured that the code was probably OK, and that it was the configuration at fault. The complexity of the configuration file only deepened my suspicion.

I didn't want to learn the ins-and-outs of an Ehcache configuration, so my first non-lawyer related task yesterday was to look at the code where the exception was coming from (thank goodness the Java compiler includes line numbers in class files by default). So it turned out that Terracotta (the company who provides Ehcache) have a nice navigable HTML versions of all their opensource code, which made this task much more pleasant than having to get it all from Subversion. This led me to the line that was throwing the exception, which looked like:
List localCachePeers = cacheManager.getCachePeerListener("RMI").getBoundCachePeers();
Great, a compound statement. OK, so I use them myself, but they're annoying when you debug. Was it cacheManager that was null or was it the return value from getCachePeerListener("RMI")?

At this point I jumped around in the code for a bit (I quite like those hyperlinks. I've seen them before too. I should figure out which project creates these pages), looking for what initialized cacheManager. I didn't find definitive proof that it was set, but it looked pretty good. So I looked at getCachePeerListener("RMI") and discovered that it was a lookup in a Hashmap. This is a prime candidate for returning null, and indeed the documentation for the method even states that it will return null if the scheme is not configured. Since the heartbeat code was making the presumption that it could perform an operation on the return value of this method, then the "RMI" scheme is evidently supposed to be configured in every configuration. The fact that it's possible for this method to return null (even if it's not supposed to) means that the calling code is not defensive enough (any kind of NullPointerException is unacceptable, even if you catch it and log it). Also, the fact that something is always supposed to be configured for "RMI" had me looking in the code to discover where listeners get registered. This turned out to come from some kind of configuration object, which looked like it had been built from an XML file.

So the problem appears to be the combination of something that's missing from the configuration file, and a presumption that it will be there (i.e. the code couldn't handle it if the item was missing). At this point I joined a forum and described the issue, both to point out that the code should be more defensive, and also to ask what is missing. In the meantime, I tried creating my own version of the library with a fix in it, and discovered that the issue did indeed go away. Then this morning I received a message explaining what I needed to configure, and also that the code now deals with the missing configuration. It still complains on every heartbeat (in 5 second intervals), but now it tells you what's wrong, and how to fix it:
WARNING: The RMICacheManagerPeerListener is missing. You need to configure
a cacheManagerPeerListenerFactory with
class="net.sf.ehcache.distribution.RMICacheManagerPeerListenerFactory"
in ehcache.xml.
Kudos to "gluck" for the quick response. (Hey, I just realized – "gluck" is from Brisbane. My home town!)

Incidentally, creating my own version of Ehcache was problematic in itself. It's a Maven project, and when I tried to build "package" it attempted to run all the tests, which took well over an hour. Coincidentally, it also happened to be dinner time, so I came back later, only to discover that not all of the tests had passed, and that the JAR files had not been built. Admittedly, it was an older release, but it was a release, so I found this odd. In the end, I avoided the tests by removing the code, and running the "package" target again.

With all the errors out of the way I went back to the Topaz system again and run it. As I said earlier, it was no longer reporting errors. But then when I tried to use queries against it, it was completely unresponsive. A little probing found that it wasn't listening for HTTP at all, so I checked the log, and sure enough:
EmbeddedMulgaraServer> Unable to start web services due to: null [Continuing]
Argh.

Not only do I have to figure out what's going on here, it also appears that someone (possibly me) didn't code this configuration defensively enough! Sigh.

At that point it was after dinner, and I had technical reading to do for a job I might have. Well, I've received the offer, but it all depends on me not being kicked out of the country.

Monday, April 26, 2010

Multitasking


At the moment I feel like I have too many things on the boil. I'm regularly answering OSU/OSL about porting data over from Topaz and Mulgara, I'm supposed to be getting work done on SPARQL Update 1.1 (which suffered last week while I looked around for a way to stay in the USA), I'm trying to track down some unnecessary sorting that is being done by Mulgara queries in a Topaz configuration, I'm trying to catch up on reading (refreshing my memory on some important Semantic Web topics so that I keep nice and current), I'm trying to find a someone who can help us not get kicked out of the country (long story), I'm responding to requests on Mulgara, and when I have some spare time (in my evenings and weekends) I'm trying to make jSPARQLc look more impressive.

So how's it all going?

OSU/OSL


Well, OSU/OSL are responding slowly, which is frustrating, but also allows me the time to look at other things, so it's a mixed blessing. They keep losing my tickets, and then respond some time later apologizing for not getting back to me. However, they're not entirely at fault, as I have sudo access on out server, and could do some of this work for myself. The thing is that I've been avoiding the learning curves of Mailman and Trac porting while I have other stuff to be doing. All the same, we've made some progress lately, and I'm really hoping to switch the DNS over to the new servers in the next couple of days. Once that happens I'll be cutting an overdue release to Mulgara.

SPARQL Update 1.1


I really should have done some of this work already, but my job (and impending lack thereof) have interfered. Fortunately another editor has stepped up to help here, so with his help we should have it under control for the next publication round.

The biggest issues are:
  1. Writing possible responses for each operation. In some cases this will simply be success/failure, but for others it will mean describing partial success. For instance, a long-running LOAD operation may have loaded 100,000 triples before failing. Most systems want that data to stay in there, and not roll back the change, and we need some way to report what has happened.
  2. Dealing with an equivalent for FROM and FROM NAMED in INSERT/DELETE operations. Using FROM in a DELETE operation looks like this is the graph that you want to remove data from, whereas we really want to describe the list of graphs (and/or named graphs) that affect the WHERE clause. The last I read, the suggestion to use USING and USING NAMED instead was winning out. The problem is that no one really likes it, though they don't like every other suggestion even more. :-)


I doubt I'll get much done before the next meeting, but at least I did a little today, and I've been able to bring the other editor up to speed.

Sorting


This is a hassle that's been plaguing me for a while. A long time back PLoS complained about queries that were taking too long (like, up to 10 minutes!). After looking at them, I found that a lot of sorting of a lot of data was going on, so I investigated why.

From the outset, Mulgara adopted "Set Semantics". This meant that everything appeared only once. It made things a little harder to code, but it also made the algebra easier to work with. In order to accomplish this cleanly, each step in a query resolution removed duplicates. I wasn't there, so I don't know why the decision wasn't made to just leave it to the end. Maybe there was a good reason. Of course, in order to remove these duplicates, it had to order the data.

When SPARQL came along, the pragmatists pointed out that not having duplicates was a cost, and for many applications it didn't matter anyway. So they made duplicates allowable by default, and introduced the DISTINCT keyword to remove them if necessary, just like SQL. Mulgara didn't have this feature (though the Sesame-Mulgara bridge hacked it to work by selecting all variables across the bridge and projecting out the ones that weren't needed), but given the cost of this sorting, it was obvious we needed it.

The sorting in question came about because the query was a UNION between a number of product terms (or a disjunction of conjunctions). In order to make the UNION in order, each of the product terms was sorted first. Of course, without the sorting, a UNION can be a trivial operation, but with it the system was very slow. Actually, the query in question was more like a UNION between multiple products, with some of the product terms being UNIONS themselves. The resulting nested sorting was painful. Unfortunately, the way things stood, it was necessary, since there was no way to do a conjunction (product) without having the terms sorted, and since some of the terms could be UNIONS, then the result of a UNION had to be sorted.

The first thing I did was to factor the query out into a big UNION between terms (a sum-of-products). Then I manually executed each one to find out how long it took. After I added up all the times, the total was about 3 seconds, and most of that time was spent waiting for Lucene to respond (something I have no control over), so this was looking pretty good.

To make this work in a real query I had to make the factoring occur automatically, I had to remove the need to sort the output of a UNION, and I had to add a query syntax to TQL to turn this behavior on and off.

The syntax was already done for SPARQL, but PLoS were using TQL through Topaz. I know that a number of people use TQL, so I wasn't prepared to break the semantics of that language, which in turn meant that I couldn't introduce a DISTINCT keyword. After asking a couple of people, I eventually went with a new keyword of NONDISTINCT. I hate it, but it also seemed to be the best fit.

Next I did the factorization. Fortunately, Andrae had introduced a framework for modifying a query to a fixpoint, so I was able to add to that for my algebraic manipulation. I also looked at other expressions, like differences (which was only in TQL, but is about to become a part of SPARQL) and Optional joins (which were part of SPARQL, and came late into TQL). It turns out that there is a lot that you can do to expand a query to a sum-of-products (or as close to as possible), and fortunately it was easy to accomplish (thanks Andrae).

Finally, I put the code in to only do this factorization if a query was not supposed to be DISTINCT (the default in SPARQL, and if the new keyword is present for TQL). Unexpectedly, this ended up being the trickiest part. Part of the reason was because some UNION operations still needed to have the output sorted if they were embedded in an expression that couldn't be expanded out (a rare, though possible situation, but only when mixing with differences and optional joins).

I needed lots of tests to be sure that I'd done things correctly. I mean, this was a huge change to the query engine. If I'd got it wrong, it would be a serious issue. As a consequence, this code didn't get checked in and used in the timeframe that it ought to have. But finally, I felt it was correct, and I ran my 10 minute queries against the PLoS data.

Now the queries were running at a little over a minute. Well, this was an order of magnitude improvement, but still 30 times slower than I expected. What had happened? I checked where it was spending it's time, and it was still in a sort() method. Sigh. At a guess, I missed something in the code that allows sorting when needed, and avoids it the rest of the time.

Unfortunately, the time taken to get to that point had led to other things becoming important, and I didn't pursue the issue. Also, the only way to take advantage of this change was to update Topaz to use SELECT NONDISTINCT but that keyword was going to fail unless being run on a new Mulgara server. This meant that I couldn't update Topaz until I knew they'd moved to a newer Mulgara, and that didn't happen for a long time. Consequently, PLoS didn't see a performance change, and I ended up trying to improve other things for them rather than tracking it down. In retrospect, I confess that this was a huge mistake. PLoS recently reminded me of their speed issues with certain queries, but now they're looking at other solutions to it. Well, it's my fault that I didn't get it all going for them but that doesn't mean I should never do it, so I'm back at it again.

The problem queries only look really slow when executed against a large amount of data, so I had to get back to the PLoS dataset. The queries also meant running the Topaz setup, since they make use of the Topaz Lucene resolvers. So I updated Topaz and built the system.

Since I was going to work on Topaz, I figured I ought to add in the use of NONDISTINCT. This was trickier than I expected, since it looked like the Topaz code was not only trying to generate TQL code, it was also trying to re-parse it to do transformations on it. The parser in question was Antlr which is one that I've limited experience with, so I spent quite a bit of time trying to figure out what instances of SELECT could have a NONDISTINCT appended to it. In the end, I decided that all of the parsing was really for their own OQL language (which looks a lot like TQL). I hope I was right!

After spending way to long on Topaz, I took the latest updates from SVN, and compiled the Topaz version of Mulgara. Then I ran it to test where it was spending time in the query.

Unfortunately, I immediately started getting regular INFO messages of the form:
MulticastKeepaliveHeartbeatSender> Unexpected throwable in run thread. Continuing...null
java.lang.NullPointerException
at net.sf.ehcache.distribution.MulticastKeepaliveHeartbeatSender$MulticastServerThread.createCachePeersPayload(MulticastKeepaliveHeartbeatSender.java:180)
at net.sf.ehcache.distribution.MulticastKeepaliveHeartbeatSender$MulticastServerThread.run(MulticastKeepaliveHeartbeatSender.java:137)
Now Mulgara doesn't make use of ehcache at all. That's purely a Topaz thing, and my opinion to date has been that it's more trouble than it's worth. This is another example of it. I really don't know what could be going on here, but luckily I kept open the window where I updated the source from SVN, and I can see that someone has modified the class:
  org.topazproject.mulgara.resolver.CacheInvalidator
I can't guarantee that this is the problem, but I've never seen it before, and no other changes look related.

But by this point I'd reached the end of my day, so I decided I should come back to it in the morning (errr, maybe that will be after the SPARQL Working Group meeting).

Jetty


Despite that describing a good portion of my day (at least, those parts not spent in correspondence), I also got a few things done over the weekend. The first of these was a request for a new feature in the Mulgara Jetty configuration.

One of our users has been making heavy use of the REST API (yay! That time wasn't wasted after all!) and had found that Jetty was truncating their POST methods. It turns out that Jetty restricts this to 200,000 characters by default, and it wasn't enough for them. I do have to wonder what they're sticking in their queries, but OK. Or maybe they're POSTing RDF files to the server? That might explain it.

Jetty normally lets you define a lot of configuration with system parameters from the command line, or with an XML configuration file, and I was asked if I could allow either of those methods. Unfortunately, our embedded use of Jetty doesn't allow for either of these, but since I was shown exactly what was wanted I was able to track it down. A bit of 'grepping' for the system parameter showed me the class that gets affected. Then some Javadoc surfing took me to the appropriate interface (Context), and then I was able to go grepping through Mulgara's code. I found where we had access to these Contexts, and fortunately the Jetty configuration was located nearby. Up until this point Jetty's Contexts had not been configurable, but now they are. I only added in the field that had been requested, but everything is set up to add more with just two lines of code each - plus the XSD to describe the configuration in the configuration file.

jSPARQLc


My other weekend task was to add CONSTRUCT support to jSPARQLc. Sure, no one is using it yet, but Java needs so much boilerplate to make SPARQL work, that I figure it will be of use to someone eventually – possibly me. I'm also finding it to be a good learning experience for why JDBC is the wrong paradigm for SPARQL. I'm not too worried about that though, as the boilerplate stuff is all there, and it would be could easy to clean it up to something that doesn't try to conform to SPARQL. But for the moment it's trying to make SPARQL look like JDBC, and besides there's already another library that isn't trying to look like JDBC. I'd better stick to my niche.

I've decided that I'm definitely going to go with StAX to make forward-only result sets. However, I'm not sure if there is supposed to be a standard configuration for JDBC to set the desired form of the result set, so I haven't started on that yet.

The result of a CONSTRUCT is a graph. By default we can expect a RDF/XML document, though other formats are certainly possible. I'm not yet doing content negotiation with jSPARQLc, though that may need to be configurable, so I wanted to keep an open mind about what can be returned. That means that standard SELECT queries could return SPARQL Query Result XML or JSON, and CONSTRUCT queries could result in RDF/XML, N3, or RDF-JSON (Mulgara supports all but the last, but maybe I should add that one in. I've already left space for it).

Without content negotiation, I'm keeping to the XML formats for the moment, with the framework looking for the other formats (though it will report that the format is not handled). Initially I thought I might have to parse the top of the file, until I cursed myself for an idiot and looked up the content type in the header. Once the parameters have been removed, I could use the content type to do a "look up" for a parser constructor. I like this approach, since it means that any new content types I want to handle just become new entries in the look-up table.

This did leave me wondering if every SPARQL endpoint was going to fill in the Content-Type header, but I presume they will. I can always try a survey of servers once I get more features into the code.

Parsing an RDF/XML graph is a complex process that I had no desire to attempt (it could take all week to get it right - if not longer). Luckily, Jena has the ARP parser to do the job for me. However, the ARP parser is part of the main Jena jar, which seemed excessive to me. Fortunately, Jena's license is BSD, so it was possible to bring the ARP code in locally. I just had to update the packages to make sure it wouldn't conflict if anyone happens to have their own Jena in the classpath.

Funnily enough, while editing the ARP files (I'm doing this project "oldschool" with VIM). I discovered copyright notices for Plugged In Software. For anyone who doesn't know, Plugged In Software was the company that created the Tucana Knowledge Store (later to be open sourced as Kowari, then renamed to Mulgara). The company changed its name later on to match the software, but this code predated that. Looking at it, I seem to recall that the code in question was just a few bugfixes that Simon made. But it was still funny to see.

Once I had ARP installed, I could parse a graph, but into what? I'm not trying to run a triplestore here, just an API. So I reused an interface I came up with when I built my SPARQL test framework when I needed to read a graph. The interface isn't fully indexed, but it lets you do a lot of useful things if you want to navigate around a graph. For instance, it lets you ask for the list of properties on a subject, or to find the value(s) of a particular subject's property, or to construct a list from an RDF collection (usually an iterative operation). Thinking that I might also want to ask questions about particular objects (or even predicates) I've added in the other two indexes this time, but I'm in two minds about whether they really need to be there.

The original code for my graph interface was in Scala, and I was tempted to bring it in like this. But one of the purposes of this project was to be lightweight (unfortunately, I lost that advantage when I discovered that ARP needs Xerces), so I thought I should try to avoid the Scala JARs. Also, I thought that the exercise of bringing the Scala code into Java would refresh the API for me, as well as refresh me on Scala (which I haven't used for a couple of months). It did all of this, as well as having the effect of reminding me why Scala is so superior to Java.

Anyway, the project is getting some meat to it now, and it's been fun to work on in my evenings, and while I've been stuck indoors on my weekends. If anyone has any suggestions for it, then please feel free to let me know.

Reblog this post [with Zemanta]

Saturday, April 24, 2010

Program Modeling


The rise of Model Driven Architecture in development, has created a slew of modeling frameworks that can be used as tools for this design process. While some MDA tools involve a static design and a development process, others seek to have an interactive model that allows developers to work with the model at runtime. Since an RDF store allows models developed in RDFS and OWL to be read and updated at runtime, it therefore seems natural to use the modeling capabilities of RDFS and OWL to provide yet another framework for design and development. RDF also has the unusual characteristic of seamlessly mixing instance data with model data (the so-called "ABox" and "TBox"), giving the promise of a system that allows both dynamic modeling and persistent data all in a single integrated system. However, there appear to be some common pitfalls that developers fall prey to, which make this approach less useful than it might otherwise be.

For good or ill, two of the most common languages used on large scale projects today are Java and C#. Java in particular also has good penetration on the web, though for smaller projects more modern languages, such as Python or Ruby, are more often deployed. There are lots of reasons for Java and C# to be so popular on large projects: They are widely known and it can be easy to assemble a team around it; both the JVM and .Net engines have demonstrated substantial benefits in portability, memory management and optimization through JIT compiling; they have established a reputation of stability over their histories. Being a fan of functional programming and modern languages, I often find Java to be frustrating, but these strengths often bring me back to Java again, despite its shortcomings. Consequently, it is usually with Java or C# in mind that MDA projects start out trying to use OWL modeling.

On a related note, enterprise frameworks regularly make use of Hibernate to store and retrieve instance data using a relational database (RDBMS). Hibernate maps object definitions to an equivalent representation in a database table using an Object-Relational Mapping (ORM). While not a formal MDA modeling paradigm, a relational database schema forms a model, in the same way that UML or MOF does (only less expressive). While an ORM is not a tool for MDA, it nevertheless represents the code of a project in a form of model, with instance data that is interactive at runtime.

Unfortunately, the ORM approach offers a very weak form of modeling, and it has no capability to dynamically update at runtime. Several developers have looked at this problem and reasoned that perhaps these problems could be solved by modeling in RDF instead. After all, an RDF store allows a model to be updated as easily as the data, and the expressivity of OWL is far greater than that of the typical RDBMS schema. To this end, we have seen a few systems which have created an Object-Triples Mapping (OTM), with some approaches demonstrating more utility than others.

Static Languages


When OTM approaches are applied to Java and C#, we typically see an RDFS schema that describes the data to be stored and retrieved. This can be just as effective as an ORM on a relational database, and has the added benefit of allowing expansions of the class definitions to be implemented simply as extra data to be stored, rather than the whole change management demanded with the update of a RDBMS. An RDF store also has the benefit of being able to easily annotate the class instances being stored, though this is not reflected in the Java/C# classes and requires a new interface to interact with it. Unfortunately, the flow of modeling information through these systems is typically one-way, and does not make use of the real power being offered by RDF.

ORM in Hibernate embeds the database schema into the programming language by mapping table schemas to class descriptions in Java/C# and table entries to instances of those classes in runtime memory. Perhaps through this example set by ORM, we often see OTM systems mapping OWL classes to Java/C# classes, and RDF instances to Java/C# instances. This mapping seems intuitive, and it has its uses, but it is also fundamentally flawed.

The principle issue with OTM systems that attempt to embed themselves in the language, is that static languages (like the popular Java and C# languages) are unable to deal with arbitrary RDF. RDF and OWL work on an Open World Assumption, meaning that there may well be more of the model that the system is not yet aware of, and should be capable of taking into consideration. However, static languages are only able to update class definitions outside of runtime, meaning that they cannot accept new modeling information during runtime. It is possible to define a new class at runtime using a bytecode editing library, but then the class may only be accessed through meta-programming interfaces like reflection, defeating the purpose of the embedding in the first place. This is what is meant by the flow of modeling information being one-way: updates to the program code can be dynamically handled by a model in an RDF store, but updates to the model cannot be reflected by corresponding updates in the program.

But these programming languages are Turing Complete. We ought to be able to work with dynamic modeling in triples with static languages, so how do we approach it? The solution is to abandon the notion of embedding the model into the language. These classes are not dynamically reconfigurable, and therefore they cannot update with new model updates. Instead, object structures that can be updated can be used to represent the model. Unfortunately, this no longer means that the model is being used to model the programming code (as desired in MDA), but it does mean that the models are now accurate, and can represent the full functionality being expressed in the RDFS/OWL model.

As an example, it is relatively easy to express an instance of a class as a Java Map, with properties being the keys, and the "object" values being the values in the map. This is exactly the same as the way structures are expressed in Perl, so it should be a familiar approach to many developers. These instances should be constructed with a factory that takes a structure that contains the details of an OWL class (or, more likely, that subset of OWL that is relevant to the application). In this way it is possible to accurately represent any information found in an RDF store, regardless of foreknowledge. I can personally attest to the ease and utility of this approach, having written a version of it over two nights, and then providing it to a colleague who used it along with Rails to develop an impressive prototype ontology and data editor, complete with rules engine, all in a single day. I expect others can cite similar experiences.

Really Embedding


So we've seen that static languages like C# and Java can't dynamically embed "Open" models like RDF/OWL, but are there languages that can?

In the last decade we've seen a lot of "dynamic" languages gaining popularity, and to various extents, several of these offer that functionality. The most obvious example is Ruby, which has explicit support for opening up already defined classes in order to add new methods, or redefine existing ones. Smalltalk has Metaprogramming. Meta-programming isn't an explicit feature for many other languages, but so long as the language is dynamic there is often a way, such as these methods for Python and Perl.

Despite the opportunity to embed models into these languages, I'm unaware of any systems which do so. It seems that the only forms of OTM I can find are in the languages which already have an ORM, and are probably better used with that paradigm. There are numerous libraries for accessing RDF in each of the above languages, such as the Redland RDF Language bindings for Ruby and Python, the Rena, and ActiveRDF in Ruby, RDFLib and pyrple in Python, Perl RDF... the list goes and continues to grow. But none of the libraries I know perform any kind of language embedding in a dynamic language. My knowledge in this space is not exhaustive, but the lack of obvious candidates tells a story on its own.

Is this indicative of dynamic languages not needing the same kind of modeling storage that static languages seem to require? Java and C# often used Hibernate and similar systems in large scale commercial applications with large development teams, while dynamic languages are often used by individuals or small groups to quickly put together useful systems that aim at a very different target market. But as commercial acceptance of dynamic languages develops further, perhaps this kind of modeling would be useful in future. In fact a good modeling library like this could well show a Java team struggling with their own version of an OTM, just what they've been missing in their closed world.

Topaz


I wrote this essay partly out of frustration with a system I've worked with called Topaz. Topaz is a very clever piece of OTM code, written for Java, and built over my own project Mulgara. However, Topaz suffers from all of the closed world problems I outlined above, without any apparent attempt to mitigate the use of RDF by reading extra annotations, etc. It has been used by the Public Library of Science for their data persistence, but they have been unhappy with it, and it will soon be replaced.

While performance in Mulgara (something I'm working on), in Topaz's use of Mulgara, and in Topaz itself, has been an issue, I believe that a deeper problem lay in the use of a dynamic system to represent static data. My knowledge of Topaz has me wondering why the system didn't simply choose to use Hibernate. I'm sure the used of RDF and OWL provided some functionality that isn't easy accomplished by Hibernate, but I don't see the benefits being strong enough to make it worth the switch to a dynamic model.

For my money, I'd either adopt the standard static approach that so many systems already employ to great effect, or go the whole hog and design an OTM system that is truly dynamic and open.

Wednesday, April 21, 2010

Work

I've had a number of administrative things to get done this week, since work will be taking a dramatic new turn soon. I've been missing working in a team, so that part will be good, but there are too many unknowns right now, including a visa nightmare that has been unceremoniously dumped in my lap. So, I'm stressed and have a lot to do. But that doesn't mean I'm not working.

Multi-Results


I'd recently been asked to allow the HTTP interfaces to return multiple results. One look at the SPARQL Query Results XML Format makes it clear that SPARQL isn't capable of it, but the TQL XML format has always allowed it - or at least, I think it did. The SPARQL structure is sort of flat, with a declaration of variables at the top, and bindings under it. The TQL structure is similar, but embeds it all in another element called a "query". That name seems odd (since it's a result, not a query), so I wonder if someone had intended to include the original query as an attribute of that tag. Anyway, the structure is available, so I figured I should add it.

This was a little trickier than I expected, since I'd tried to abstract out the streaming of answers. This means that I could select the output type simply by using a different streaming class. For now, the available classes are SPARQL-XML, SPARQL-JSON and TQL-XML, but there could easily be others. However, I now had to modify all of those classes to handle multiple answers. Of course, the SPARQL streaming classes had to ignore them, while the TQL class didn't, but that wasn't too hard. However, I came away feeling that it was somehow messier than it ought to have been. Even so, I thought it worked OK.

One bit of complexity was in handling the GET requests of TQL vs. SPARQL. In SPARQL we can only expect a single query in a GET, but TQL can have multiple queries, separated by semicolons. While I like to keep as much code as possible common to all the classes, in the end I decided that the complexity of doing this was more than it was worth, and I put special multi-query-handling code in the TQL servlet.

All of this was done a little while ago, but because I was waiting on responses on the mulgara.org move, I decided not to release just yet. This was probably fortunate, since I got an email the other day explaining that subqueries were not being embedded properly. They were starting with a new query element tag, but not closing with them. However, these tags should not have appeared at this level at all. The suggested patch would have worked, but it relied on checking the indentation used for pretty-printing in order to find out if the query element should be opened. This would work, but was covering the problem, rather than solving it. A bit of checking, and I realized that I had code to send a header for each answer, code to send the data for the answer, but no code for the "footer". The footer would have been the closing tag for the query element, and this was being handled in other code, meaning that it only came up at the top level, and not in the embedded sub-answers. This in turn meant that it wasn't always matching up to the header. So I introduced a footer method for answers (a no-op in SPARQL-XML and SPARQL-JSON) which cleaned up the process well enough that avoiding the header (and footer) on sub-answers was now easy to see and get right.

So was I done? No. The email also commented on warnings of transactions not being closed. So I went looking at this, and decided that all answers were being closed properly. In confusion, I looked at the email again, and this time realized that the bug report said that they were using POST methods. Since I was only dealing with queries (and not update commands) I had only gone to the GET method. So I looked at POST, and sure enough it was a dogs breakfast.

Part of the problem with a POST is that it can include updates as well as queries. Not having a standard response for an update, I had struggled a little with this in the past. In the end, I'd chosen to only output the final result of all operations, but this was causing all sorts of problems. For a start, if there was more than one query, then only the last would be shown (OK in SPARQL, not in TQL). Also, since I was ignoring so many things, it meant that I wasn't closing anything if it needed it. This was particularly galling to have wrong, since I'd finally added SPARQL support for POST queries.

I'd really have liked to use the same multi-result code that I had for GET requests, but that didn't look like it was going to mix well with the need to support commands in the middle. In the end I copied/pasted some of the GET code (shudder) and fixed it up to deal with the result lists that I'd already built through the course of processing the POST request. It doesn't look too bad, and I've commented on the redundancy and why I've allowed it, so I think it's all OK. Anyway, it's all looking good now. Given that I also have a major bugfix from a few weeks back, then I should get it out the door despite the mulgara.org shuffle not being done.

I didn't mention that major bug, did I? For anyone interested, some time early last year a race bug was avoided by putting a lock into the transaction code. Unfortunately, that lock was to broad, and it prevented any other thread from reading while a transaction was being committed. This locked the database up during large commit operations. It's not the sort of thing that you're likely to see with unit tests, but I was still deeply embarrassed. At least I found it (a big thanks to the guys at PLoS for reporting this bug, and helping me find where it was).

So before I get dragged into any admin stuff tomorrow morning (office admin or sysadmin), I should try to cut a release to clean up some of these problems.

Meanwhile, I'm going to relax with a bit of Hadoop reading. I once talked about putting a triplestore on top of this, and it's an idea that's way overdue. I know others have tried exactly this, but each approach has been different, and I want to see what I can make of it. But I think I need a stronger background in the subject matter before I try to design something in earnest.

Monday, April 19, 2010

jSPARQLc


I spent Sunday and a little bit of this afternoon finishing up the code and testing for the SPARQL API. The whole thing had just a couple of typos in it, which surprised me no end, because I used VIM and not Eclipse. I must be getting more thorough as I age. :-)

Anyway, the whole thing works, though its limited in scope. To start with, the only data accessing methods I wrote were getObject(int) and getObject(String). Ultimately, I'd like to include many of the other get... methods, but these would require tedious mapping. For instance, getInt() would need to map xsd:int, xsd:integer, and all the other integral types to an integer. I've done this work in some of the internal APIs for Mulgara (particularly when dealing with SPARQL), so I know how tedious it is. I suppose I can copy/paste a good portion of it out of Mulgara (it's all Apache 2.0), but for the moment I wanted to get it up and running the way I said.

If I were to do all of this, then there are all sorts of mappings that can be done between data types. For instance, xsd:base64Binary would be good for Blobs. I could even introduce some custom data types to handle things like serialized Java objects, with a data type like: java:org.example.ClassName. Actually, that looks familiar. I should see if anyone has done it.

Anyway, as I progressed, I found that while it was straight forward enough to get basic functionality in, the JDBC interfaces are really inappropriate.

To start with, JDBC usually accesses a "cursor" at the server end, and this is accessed implicitly by ResultSet. It's not absolutely necessary, but moving backwards and forwards through a result that isn't entirely in memory really does need a server-side cursor. Since I'm doing everything in memory right now, then I was able to do an implementation that isn't TYPE_FORWARD_ONLY, but if I were to move over to using StAX (in the comments from my last entry) then I'd have to fall back to that.

The server-side cursor approach also makes it possible to write to a ResultSet, since SQL results are closely tied to the tables they represent. However, this doesn't really apply to RDF, since statements can never be updated, only added and removed. SPARQL Update is coming (I ought to know, as I'm the editor for the document), but there is no real way to convert the update operations on a ResultSet back into SPARQL-Update operations over the network. It might be theoretically, possible but it would need a lot of communication with the server, and it doesn't even make sense. After all, you'd be trying to map one operation paradigm to a completely different one. Even if it could be made to work, it would be confusing to use. Since my whole point in writing this API was to simplify things for people who are used to JDBC, then it would be self defeating.

So if this API were to allow write operations as well, then it would need a new approach, and I'm not sure what that should be. Passing SPARQL Update operations straight through might be the best bet, though it's not offering a lot of help (beyond doing all the HTTP work for you).

The other thing that I noticed was that a blind adherence to the JDBC approach created a few classes that I don't think are really needed. For instance, the ResultSetMetaData interface only contains two methods that make any sense from the perspective of SPARQL: getColumnCount() and getColumnName(). The data comes straight out of the ResultSet, so I would have put them there if the choice were mine. The real metadata is in the list of "link" elements in the result set, but this could encoded with anything (even text) so there was no way to make that metadata fit the JDBC API. Instead, I just let the user ask for the last of links directly (creating a new method to do so).

Another class that didn't make too much sense to me was Statement. It's a handy place to record some state about what you've doing on a Connection, but other than that, it just seems to proxy the Connection it's attached to. I see there some options for caching (that I've never used myself when I've been on JDBC), so I suspect that it does more than I give it credit for, but for the moment it just appears to be an inconvenience.

Anyway, I thought I'd put it up somewhere, and since I haven't tried Google's code repository before, I've put it up there. It's a Java SPARQL Connectivity library, so for lack of anything better I called it jSPARQLc (maybe JRC for Java RDF Connectivity would have been better, but there are lots of JRC things out there, but jSPARQLc didn't return any hits from Google, so I went with that). It's very raw and has very little configuration, but it passes it's tests. :-)

Speaking of tests, if you want to try it, then the connectivity tests won't pass until you've done the following:
  • Start a SPARQL endpoint at http://localhost:8080/sparql/ (the sourcecode in the test needs to change if your endpoint is elsewhere).
  • Create a graph with the URI <test:data>
  • Loaded the data in test.n3 up into it (this file is in the root directory)
I know I shouldn't have hardcoded some of this, but it was just a test on a 0.1 level project. If it seems useful, and/or you have ideas for it, then please let me know.

Mulgara


Other than this, I have some administration I need to do to get both Mulgara and Topaz onto another server, and this seems to slow everything down as I wait for the admins to get back to me. It's also why there hasn't been a Mulgara release recently, even though it's overdue. However, I just got a message from an admin today, so hopefully things have progressed. Even so, I think I'll just cut the release soon anyway.

One fortunate aspect of the delayed release was a message I got from David Smith about how some resources aren't being closed in the TQL REST interface (when subqueries are used). He's sent me a patch, but I need to spend some time figuring out why I got this wrong, else I could end up hiding the real problem. That's a job for the morning... right after the SPARQL Working Group meeting. Once all of that is resolved, I'll get a release out, and try to figure out what I can do to speed up the server migration.

Oh, and I need to update Topaz to take advantage of some major performance improvements in Mulgara, and then I need to find even more performance improvements. Hopefully I'll be onto some of that by the afternoon, but I don't want to promise the moon only to come back tomorrow night and confess I got stuck on the same thing all day.

Saturday, April 17, 2010

SPARQL API

Every time I try to use SPARQL with Java I keep running into all that minutiae that Java makes you deal with. The HttpComponents from Apache make things easier, but there's still a loto f code that has to be written. Then after you get your data back, you still have to process it, which means XML or JSON parsing. All of this means a lot of code, just to get a basic framework going.

I know there are a lot of APIs out there for working with RDF engines, but there aren't many for working directly over SPARQL. I eventually had a look and found SPARQL Engine for Java, but this seemed to have more client-side processing than I'd expect. I haven't looked too carefully at it, so this may be incorrect, but I thought it would be worthwhile to take all the boilerplate I've had to put together in the past, and see if I can glue it all together in some sensible fashion. Besides, today's Saturday, meaning I don't have to worry about my regular job, and I'm recovering from a procedure yesterday, so I couldn't do much more than sit at the computer anyway.

One of my inspirations was a conversation I had with Henry Story (hmmm, Henry's let that link get badly out of date) a couple of years ago about a standard API for RDF access, much like JDBC. At the time I didn't think that Sun could make something like that happen, but if there were a couple of decent attempts at it floating around, then some kind of pseudo standard could emerge. I never tried it before, but today I thought it might be fun to try.

The first thing I remembered was that when you write a library, you end up writing all sorts of tedious code while you consider the various ways that a user might want to use it. So I stuck to the basics, though I did add in various options as I dealt with individual configuration options. So it's possible to set the default-graph-uri as a single item as well as with a list (since a lot of the time you only want to set one graph URI). I was eschewing Eclipse today, so I ended up making use of VIM macros for some of my more tedious coding. The tediousness also reminded me again why I like Scala, but given that I wanted it to look vaguely JDBC-like, I figured that the Java approach was more appropriate.

I remember that TKS (the name of the first incarnation of the Mulgara codebase) had attempted to implement JDBC. Apparently, a good portion of the API, was implemented, but the there were some elements that just didn't fit. So from the outset I avoided trying to duplicate that mistake. Instead, I decided to cherry pick the most obvious features, abandon anything that doesn't make sense, and add in a couple of new features where it seems useful or necessary. So while some of it might look like JDBC, it won't have anything to do with it.

I found a piece of trivial JDBC code I'd used to test something once-upon-a-time, and tweaked it a little to look like something I might try to do with SPARQL. My goal was to write the library that would make this work, and then take it from there. This is the example:
    final String ENDPOINT = "http://localhost:8080/sparql/";
Connection c = DriverManager.getConnection(ENDPOINT);

Statement s = c.createStatement();
s.setDefaultGraph("test:data");
ResultSet rs = s.executeQuery("SELECT * WHERE { ?s ?p ?o }");
rs.beforeFirst();
while (rs.next()) {
System.out.println(
rs.getObject(1).toString() + ", " +
rs.getObject(2) + ", " +
rs.getObject(3));
}
rs.close();
c.close();


My first thought was that this is not how I would design the API (the "Statement" seems a little superfluous), but that wasn't the point.

Anyway, I've nearly finished it, but I'm dopey from pain medication, so I thought I'd write down some thoughts about it, and pick it up again in the morning. So if anyone out there is reading this (which I doubt, given how little I write here) these notes are more for me than for you, so don't expect to find it interesting. :-)

Observations


The first big difference to JDBC is the configuration. A lot of JDBC is either specific to a particular driver, or to RDBM Systems in general. This goes for the structure of the API as well as the configuration. For instance, ResultSet seems to be heavily geared towards cursors, which SPARQL doesn't support. I was momentarily tempted to try emulating this functionality through LIMIT and OFFSET, but that would have involved a lot of network traffic, and could potentially interfere with the user trying to use these keywords themselves. Getting the row number (getRow) would have been really tricky if I'd gone that way too.

But ResultSet was one of the last things I worked on today, so I'll rewind.

The first step was making the HTTP call. I usually use GET, but I've recently added in the POST binding for SPARQL querying in Mulgara , so I made sure the client code can do both. For the moment I'm automatically choosing to do a POST query when the URL gets above 1024 characters (I believe that was the URL limit for some version of IE), but I should probably make the use of POST vs. GET configurable. Fortunately, building parameters was identical for both methods, though they get put into difference places.

Speaking of parameters, I need to check this out, but I believe that graph URIs in SPARQL do not get encoded. Now that's not going to work if they contain their own queries (any why wouldn't they), but most graphs don't do that, so it's never bitten me before. Fortunately, doing a URL-Decode on an unencoded graph URI is usually safe, so that's how I've been able to get away with it until now. But as a client that has to do the encoding I needed to think more carefully about it.

From what I can tell, the only part that will give me grief is the query portion of the URI. So I checked out the query, and if there wasn't one, I just sent the graph unencoded. If there was one, then I'd encode just the query, add it to the URI, and then see if decoding got me back to the original. If it does, then I send that. Otherwise, I just encode the whole graph URI and send that. As I write it down, it looks even more like a hack than ever, but so far it seems to work.

So now that I have all the HTTP stuff happening, what about the response? Since answers can be large, my first thought was SAX. Well, actually, my first thought was Scala, since I've already parsed SPARQL response documents with Scala's XML handling, and it was trivial. But I'm Java so that means SAX or DOM. SAX can handle large documents, but possibly more importantly, I've always found SAX easier to deal with than DOM, so that's the way I went.

Because SAX operates on a stream, I thought I could build a stream handler, but I think that was just the medication talking, since I quickly remembered that it's an event model. The only way I could do it as a stream would be if I buit up a queue with one thread writing at one end and the consuming thread taking data off at the other. That's possible, but it's hard to test if it scales, and if the consumer doesn't get drain the queue in a timely manner, then you can cause problems for the writing end as well. It's possible to slow up the writer by not returning from the even methods until the queue has some space, but that seems clunky. Also, when you consider that a ResultSet is supposed to be able to rewind and so forth, a streaming model just doesn't work.

In the end, it seemed that I would have to have my ResultSets in memory. This is certainly easier that any other option I could think of, and the size of RAM these days means that it's not really a big deal. But it's still in the back of my mind that maybe I'm missing an obvious idea.

The other thing that came to mind is to create an API to provides object events in the same way that SAX provides events for XML elements. This would work fine, but it's nothing like the API I'm trying to look like, so I didn't give that any serious thought.

So now I'm in the midst of a SAX parser. There's a lot of work in there that I don't need when working with other languages, but it does give you a comfortable feeling knowing that you have such fine-grained control over the process, Java enumerations have come in handy here, as I decided to go with a state-machine approach. I don't use this very often (outside of hardware design, where I've always liked it), but it's made the coding so straightforward it's been a breeze.

One question I have, is if the parser should create a ResultSet object, or if it should be the object. It's sort of easy to just create the object with the InputStream as the parameter for the constructor, but then the object you get back could be either a boolean result or a list of variable bindings, and you have to interrogate it to find out which one it is. The alternative is to use a factory that returns different types of result sets. I initially went with the former because both have to parse the header section, but now that I've written it out, I'm thinking that the latter is the better way to go. I'll change it in the morning.

I'm also thinking of having a parser to deal with JSON (I did some abstraction to make this easy), but for now I'll just take one step at a time.

One issue I haven't given a lot of time to yet is the CONSTRUCT query. These have to return a graph and not a result set. That brings a few questions to mind:
  • How do I tell the difference? I don't want to do it in the API, since that's something the user may not want to have to figure out. But short of having an entire parser, it could be difficult to see the form of the query before it's sent.
  • I can wait for the response, and figure it out there, but then my SAX parser needs to be able to deal with RDF/XML. I usually use Jena's parser for this, since I know it's a lot of work. Do I really want to go that way? Unfortunately, I don't know of any good way to move to a different parser once I've seen the opening elements. I could try a BufferedInputStream, so I could rewind it, but can that handle really large streams? I'll think on that.
  • How do I represent the graph at the client end?
Representing a graph goes way beyond ResultSet, and poses the question of just how far to go. A simple list of triples would probably suffice, but if I have a graph then I usually want to do interesting stuff with it.

I'm thinking of using my normal graph library, which isn't out in the wild yet, but I find it very useful. I currently have implementations of it in Java, Ruby and Scala. I keep re-implementing it whenever I'm in a new language, because it's just so useful (it's trivial to put under a Jena or Mulgara API too). However, it also goes beyond the JDBC goal that I was looking for, so I'm cautious about going that way.

Anyway, it's getting late on a Saturday night, and I'm due for some more pain medication, so I'll leave it there. I need to talk to people about work again, so having an active blog will be important once more (even if it makes me look ignorant occasionally). I'll see if I can keep it up.