Monday, April 26, 2010

Multitasking


At the moment I feel like I have too many things on the boil. I'm regularly answering OSU/OSL about porting data over from Topaz and Mulgara, I'm supposed to be getting work done on SPARQL Update 1.1 (which suffered last week while I looked around for a way to stay in the USA), I'm trying to track down some unnecessary sorting that is being done by Mulgara queries in a Topaz configuration, I'm trying to catch up on reading (refreshing my memory on some important Semantic Web topics so that I keep nice and current), I'm trying to find a someone who can help us not get kicked out of the country (long story), I'm responding to requests on Mulgara, and when I have some spare time (in my evenings and weekends) I'm trying to make jSPARQLc look more impressive.

So how's it all going?

OSU/OSL


Well, OSU/OSL are responding slowly, which is frustrating, but also allows me the time to look at other things, so it's a mixed blessing. They keep losing my tickets, and then respond some time later apologizing for not getting back to me. However, they're not entirely at fault, as I have sudo access on out server, and could do some of this work for myself. The thing is that I've been avoiding the learning curves of Mailman and Trac porting while I have other stuff to be doing. All the same, we've made some progress lately, and I'm really hoping to switch the DNS over to the new servers in the next couple of days. Once that happens I'll be cutting an overdue release to Mulgara.

SPARQL Update 1.1


I really should have done some of this work already, but my job (and impending lack thereof) have interfered. Fortunately another editor has stepped up to help here, so with his help we should have it under control for the next publication round.

The biggest issues are:
  1. Writing possible responses for each operation. In some cases this will simply be success/failure, but for others it will mean describing partial success. For instance, a long-running LOAD operation may have loaded 100,000 triples before failing. Most systems want that data to stay in there, and not roll back the change, and we need some way to report what has happened.
  2. Dealing with an equivalent for FROM and FROM NAMED in INSERT/DELETE operations. Using FROM in a DELETE operation looks like this is the graph that you want to remove data from, whereas we really want to describe the list of graphs (and/or named graphs) that affect the WHERE clause. The last I read, the suggestion to use USING and USING NAMED instead was winning out. The problem is that no one really likes it, though they don't like every other suggestion even more. :-)


I doubt I'll get much done before the next meeting, but at least I did a little today, and I've been able to bring the other editor up to speed.

Sorting


This is a hassle that's been plaguing me for a while. A long time back PLoS complained about queries that were taking too long (like, up to 10 minutes!). After looking at them, I found that a lot of sorting of a lot of data was going on, so I investigated why.

From the outset, Mulgara adopted "Set Semantics". This meant that everything appeared only once. It made things a little harder to code, but it also made the algebra easier to work with. In order to accomplish this cleanly, each step in a query resolution removed duplicates. I wasn't there, so I don't know why the decision wasn't made to just leave it to the end. Maybe there was a good reason. Of course, in order to remove these duplicates, it had to order the data.

When SPARQL came along, the pragmatists pointed out that not having duplicates was a cost, and for many applications it didn't matter anyway. So they made duplicates allowable by default, and introduced the DISTINCT keyword to remove them if necessary, just like SQL. Mulgara didn't have this feature (though the Sesame-Mulgara bridge hacked it to work by selecting all variables across the bridge and projecting out the ones that weren't needed), but given the cost of this sorting, it was obvious we needed it.

The sorting in question came about because the query was a UNION between a number of product terms (or a disjunction of conjunctions). In order to make the UNION in order, each of the product terms was sorted first. Of course, without the sorting, a UNION can be a trivial operation, but with it the system was very slow. Actually, the query in question was more like a UNION between multiple products, with some of the product terms being UNIONS themselves. The resulting nested sorting was painful. Unfortunately, the way things stood, it was necessary, since there was no way to do a conjunction (product) without having the terms sorted, and since some of the terms could be UNIONS, then the result of a UNION had to be sorted.

The first thing I did was to factor the query out into a big UNION between terms (a sum-of-products). Then I manually executed each one to find out how long it took. After I added up all the times, the total was about 3 seconds, and most of that time was spent waiting for Lucene to respond (something I have no control over), so this was looking pretty good.

To make this work in a real query I had to make the factoring occur automatically, I had to remove the need to sort the output of a UNION, and I had to add a query syntax to TQL to turn this behavior on and off.

The syntax was already done for SPARQL, but PLoS were using TQL through Topaz. I know that a number of people use TQL, so I wasn't prepared to break the semantics of that language, which in turn meant that I couldn't introduce a DISTINCT keyword. After asking a couple of people, I eventually went with a new keyword of NONDISTINCT. I hate it, but it also seemed to be the best fit.

Next I did the factorization. Fortunately, Andrae had introduced a framework for modifying a query to a fixpoint, so I was able to add to that for my algebraic manipulation. I also looked at other expressions, like differences (which was only in TQL, but is about to become a part of SPARQL) and Optional joins (which were part of SPARQL, and came late into TQL). It turns out that there is a lot that you can do to expand a query to a sum-of-products (or as close to as possible), and fortunately it was easy to accomplish (thanks Andrae).

Finally, I put the code in to only do this factorization if a query was not supposed to be DISTINCT (the default in SPARQL, and if the new keyword is present for TQL). Unexpectedly, this ended up being the trickiest part. Part of the reason was because some UNION operations still needed to have the output sorted if they were embedded in an expression that couldn't be expanded out (a rare, though possible situation, but only when mixing with differences and optional joins).

I needed lots of tests to be sure that I'd done things correctly. I mean, this was a huge change to the query engine. If I'd got it wrong, it would be a serious issue. As a consequence, this code didn't get checked in and used in the timeframe that it ought to have. But finally, I felt it was correct, and I ran my 10 minute queries against the PLoS data.

Now the queries were running at a little over a minute. Well, this was an order of magnitude improvement, but still 30 times slower than I expected. What had happened? I checked where it was spending it's time, and it was still in a sort() method. Sigh. At a guess, I missed something in the code that allows sorting when needed, and avoids it the rest of the time.

Unfortunately, the time taken to get to that point had led to other things becoming important, and I didn't pursue the issue. Also, the only way to take advantage of this change was to update Topaz to use SELECT NONDISTINCT but that keyword was going to fail unless being run on a new Mulgara server. This meant that I couldn't update Topaz until I knew they'd moved to a newer Mulgara, and that didn't happen for a long time. Consequently, PLoS didn't see a performance change, and I ended up trying to improve other things for them rather than tracking it down. In retrospect, I confess that this was a huge mistake. PLoS recently reminded me of their speed issues with certain queries, but now they're looking at other solutions to it. Well, it's my fault that I didn't get it all going for them but that doesn't mean I should never do it, so I'm back at it again.

The problem queries only look really slow when executed against a large amount of data, so I had to get back to the PLoS dataset. The queries also meant running the Topaz setup, since they make use of the Topaz Lucene resolvers. So I updated Topaz and built the system.

Since I was going to work on Topaz, I figured I ought to add in the use of NONDISTINCT. This was trickier than I expected, since it looked like the Topaz code was not only trying to generate TQL code, it was also trying to re-parse it to do transformations on it. The parser in question was Antlr which is one that I've limited experience with, so I spent quite a bit of time trying to figure out what instances of SELECT could have a NONDISTINCT appended to it. In the end, I decided that all of the parsing was really for their own OQL language (which looks a lot like TQL). I hope I was right!

After spending way to long on Topaz, I took the latest updates from SVN, and compiled the Topaz version of Mulgara. Then I ran it to test where it was spending time in the query.

Unfortunately, I immediately started getting regular INFO messages of the form:
MulticastKeepaliveHeartbeatSender> Unexpected throwable in run thread. Continuing...null
java.lang.NullPointerException
at net.sf.ehcache.distribution.MulticastKeepaliveHeartbeatSender$MulticastServerThread.createCachePeersPayload(MulticastKeepaliveHeartbeatSender.java:180)
at net.sf.ehcache.distribution.MulticastKeepaliveHeartbeatSender$MulticastServerThread.run(MulticastKeepaliveHeartbeatSender.java:137)
Now Mulgara doesn't make use of ehcache at all. That's purely a Topaz thing, and my opinion to date has been that it's more trouble than it's worth. This is another example of it. I really don't know what could be going on here, but luckily I kept open the window where I updated the source from SVN, and I can see that someone has modified the class:
  org.topazproject.mulgara.resolver.CacheInvalidator
I can't guarantee that this is the problem, but I've never seen it before, and no other changes look related.

But by this point I'd reached the end of my day, so I decided I should come back to it in the morning (errr, maybe that will be after the SPARQL Working Group meeting).

Jetty


Despite that describing a good portion of my day (at least, those parts not spent in correspondence), I also got a few things done over the weekend. The first of these was a request for a new feature in the Mulgara Jetty configuration.

One of our users has been making heavy use of the REST API (yay! That time wasn't wasted after all!) and had found that Jetty was truncating their POST methods. It turns out that Jetty restricts this to 200,000 characters by default, and it wasn't enough for them. I do have to wonder what they're sticking in their queries, but OK. Or maybe they're POSTing RDF files to the server? That might explain it.

Jetty normally lets you define a lot of configuration with system parameters from the command line, or with an XML configuration file, and I was asked if I could allow either of those methods. Unfortunately, our embedded use of Jetty doesn't allow for either of these, but since I was shown exactly what was wanted I was able to track it down. A bit of 'grepping' for the system parameter showed me the class that gets affected. Then some Javadoc surfing took me to the appropriate interface (Context), and then I was able to go grepping through Mulgara's code. I found where we had access to these Contexts, and fortunately the Jetty configuration was located nearby. Up until this point Jetty's Contexts had not been configurable, but now they are. I only added in the field that had been requested, but everything is set up to add more with just two lines of code each - plus the XSD to describe the configuration in the configuration file.

jSPARQLc


My other weekend task was to add CONSTRUCT support to jSPARQLc. Sure, no one is using it yet, but Java needs so much boilerplate to make SPARQL work, that I figure it will be of use to someone eventually – possibly me. I'm also finding it to be a good learning experience for why JDBC is the wrong paradigm for SPARQL. I'm not too worried about that though, as the boilerplate stuff is all there, and it would be could easy to clean it up to something that doesn't try to conform to SPARQL. But for the moment it's trying to make SPARQL look like JDBC, and besides there's already another library that isn't trying to look like JDBC. I'd better stick to my niche.

I've decided that I'm definitely going to go with StAX to make forward-only result sets. However, I'm not sure if there is supposed to be a standard configuration for JDBC to set the desired form of the result set, so I haven't started on that yet.

The result of a CONSTRUCT is a graph. By default we can expect a RDF/XML document, though other formats are certainly possible. I'm not yet doing content negotiation with jSPARQLc, though that may need to be configurable, so I wanted to keep an open mind about what can be returned. That means that standard SELECT queries could return SPARQL Query Result XML or JSON, and CONSTRUCT queries could result in RDF/XML, N3, or RDF-JSON (Mulgara supports all but the last, but maybe I should add that one in. I've already left space for it).

Without content negotiation, I'm keeping to the XML formats for the moment, with the framework looking for the other formats (though it will report that the format is not handled). Initially I thought I might have to parse the top of the file, until I cursed myself for an idiot and looked up the content type in the header. Once the parameters have been removed, I could use the content type to do a "look up" for a parser constructor. I like this approach, since it means that any new content types I want to handle just become new entries in the look-up table.

This did leave me wondering if every SPARQL endpoint was going to fill in the Content-Type header, but I presume they will. I can always try a survey of servers once I get more features into the code.

Parsing an RDF/XML graph is a complex process that I had no desire to attempt (it could take all week to get it right - if not longer). Luckily, Jena has the ARP parser to do the job for me. However, the ARP parser is part of the main Jena jar, which seemed excessive to me. Fortunately, Jena's license is BSD, so it was possible to bring the ARP code in locally. I just had to update the packages to make sure it wouldn't conflict if anyone happens to have their own Jena in the classpath.

Funnily enough, while editing the ARP files (I'm doing this project "oldschool" with VIM). I discovered copyright notices for Plugged In Software. For anyone who doesn't know, Plugged In Software was the company that created the Tucana Knowledge Store (later to be open sourced as Kowari, then renamed to Mulgara). The company changed its name later on to match the software, but this code predated that. Looking at it, I seem to recall that the code in question was just a few bugfixes that Simon made. But it was still funny to see.

Once I had ARP installed, I could parse a graph, but into what? I'm not trying to run a triplestore here, just an API. So I reused an interface I came up with when I built my SPARQL test framework when I needed to read a graph. The interface isn't fully indexed, but it lets you do a lot of useful things if you want to navigate around a graph. For instance, it lets you ask for the list of properties on a subject, or to find the value(s) of a particular subject's property, or to construct a list from an RDF collection (usually an iterative operation). Thinking that I might also want to ask questions about particular objects (or even predicates) I've added in the other two indexes this time, but I'm in two minds about whether they really need to be there.

The original code for my graph interface was in Scala, and I was tempted to bring it in like this. But one of the purposes of this project was to be lightweight (unfortunately, I lost that advantage when I discovered that ARP needs Xerces), so I thought I should try to avoid the Scala JARs. Also, I thought that the exercise of bringing the Scala code into Java would refresh the API for me, as well as refresh me on Scala (which I haven't used for a couple of months). It did all of this, as well as having the effect of reminding me why Scala is so superior to Java.

Anyway, the project is getting some meat to it now, and it's been fun to work on in my evenings, and while I've been stuck indoors on my weekends. If anyone has any suggestions for it, then please feel free to let me know.

Reblog this post [with Zemanta]

1 comment:

dragisak said...

EHcache is used by FilterResolverFactory. It removes entries in distributed second-level cache used by Topaz client.

When data gets modified on server, in order to keep caches coherent the database (sic) talks back to client !