Saturday, April 17, 2010

SPARQL API

Every time I try to use SPARQL with Java I keep running into all the minutiae that Java makes you deal with. The HttpComponents from Apache make things easier, but there's still a lot of code that has to be written. Then after you get your data back, you still have to process it, which means XML or JSON parsing. All of this adds up to a lot of code just to get a basic framework going.

I know there are a lot of APIs out there for working with RDF engines, but there aren't many for working directly over SPARQL. I eventually had a look and found SPARQL Engine for Java, but this seemed to have more client-side processing than I'd expect. I haven't looked too carefully at it, so this may be incorrect, but I thought it would be worthwhile to take all the boilerplate I've had to put together in the past, and see if I can glue it all together in some sensible fashion. Besides, today's Saturday, meaning I don't have to worry about my regular job, and I'm recovering from a procedure yesterday, so I couldn't do much more than sit at the computer anyway.

One of my inspirations was a conversation I had with Henry Story (hmmm, Henry's let that link get badly out of date) a couple of years ago about a standard API for RDF access, much like JDBC. At the time I didn't think that Sun could make something like that happen, but if there were a couple of decent attempts at it floating around, then some kind of pseudo-standard could emerge. I'd never tried it before, but today I thought it might be fun to have a go.

The first thing I remembered was that when you write a library, you end up writing all sorts of tedious code while you consider the various ways that a user might want to use it. So I stuck to the basics, though I did add a few conveniences as I worked through the individual configuration options. For instance, it's possible to set the default-graph-uri as a single item as well as with a list (since a lot of the time you only want to set one graph URI). I was eschewing Eclipse today, so I ended up making use of Vim macros for some of my more tedious coding. The tediousness also reminded me again why I like Scala, but given that I wanted the result to look vaguely JDBC-like, I figured that the Java approach was more appropriate.
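By way of illustration, the single-graph setter is just shorthand for the list form. This is only a sketch; the names here are mine, not necessarily what the final API will use:

    import java.util.Collections;
    import java.util.List;

    // Fragment of a hypothetical Statement class: setting one graph is just
    // shorthand for setting a list containing one graph.
    private List<String> defaultGraphs = Collections.emptyList();

    public void setDefaultGraph(String graphUri) {
      setDefaultGraphs(Collections.singletonList(graphUri));
    }

    public void setDefaultGraphs(List<String> graphUris) {
      defaultGraphs = graphUris;  // each entry becomes a default-graph-uri parameter
    }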

I remember that TKS (the name of the first incarnation of the Mulgara codebase) had attempted to implement JDBC. Apparently a good portion of the API was implemented, but there were some elements that just didn't fit. So from the outset I avoided duplicating that mistake. Instead, I decided to cherry-pick the most obvious features, abandon anything that doesn't make sense, and add in a couple of new features where they seem useful or necessary. So while some of it might look like JDBC, it won't have anything to do with it.

I found a piece of trivial JDBC code I'd used to test something once-upon-a-time, and tweaked it a little to look like something I might try to do with SPARQL. My goal was to write the library that would make this work, and then take it from there. This is the example:
    final String ENDPOINT = "http://localhost:8080/sparql/";
    Connection c = DriverManager.getConnection(ENDPOINT);

    Statement s = c.createStatement();
    s.setDefaultGraph("test:data");
    ResultSet rs = s.executeQuery("SELECT * WHERE { ?s ?p ?o }");
    rs.beforeFirst();
    while (rs.next()) {
      System.out.println(
          rs.getObject(1).toString() + ", " +
          rs.getObject(2) + ", " +
          rs.getObject(3));
    }
    rs.close();
    c.close();


My first thought was that this is not how I would design the API (the "Statement" seems a little superfluous), but that wasn't the point.

Anyway, I've nearly finished it, but I'm dopey from pain medication, so I thought I'd write down some thoughts about it and pick it up again in the morning. So if anyone out there is reading this (which I doubt, given how little I write here), these notes are more for me than for you, so don't expect to find them interesting. :-)

Observations


The first big difference to JDBC is the configuration. A lot of JDBC is either specific to a particular driver, or to relational database systems in general. This goes for the structure of the API as well as the configuration. For instance, ResultSet seems to be heavily geared towards cursors, which SPARQL doesn't support. I was momentarily tempted to try emulating this functionality with LIMIT and OFFSET, but that would have involved a lot of network traffic, and could potentially interfere with the user trying to use those keywords themselves. Getting the row number (getRow) would have been really tricky if I'd gone that way too.

But ResultSet was one of the last things I worked on today, so I'll rewind.

The first step was making the HTTP call. I usually use GET, but I've recently added the POST binding for SPARQL querying in Mulgara, so I made sure the client code can do both. For the moment I'm automatically choosing to do a POST query when the URL gets above 1024 characters (I believe that was the URL limit for some version of IE), but I should probably make the use of POST vs. GET configurable. Fortunately, building parameters was identical for both methods, though they get put into different places.
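Roughly, the selection looks like this. It's just a sketch against HttpClient 4.x; the 1024-character cutoff is as described above, while QUERY_URL_LIMIT and buildRequest are names I've made up here:

    import java.io.UnsupportedEncodingException;
    import java.util.List;
    import org.apache.http.NameValuePair;
    import org.apache.http.client.entity.UrlEncodedFormEntity;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.client.methods.HttpPost;
    import org.apache.http.client.methods.HttpUriRequest;
    import org.apache.http.client.utils.URLEncodedUtils;

    static final int QUERY_URL_LIMIT = 1024;

    HttpUriRequest buildRequest(String endpoint, List<NameValuePair> params)
        throws UnsupportedEncodingException {
      // the same parameter list serves both methods; only its location differs
      String query = URLEncodedUtils.format(params, "UTF-8");
      String fullUrl = endpoint + "?" + query;
      if (fullUrl.length() <= QUERY_URL_LIMIT) {
        return new HttpGet(fullUrl);             // short enough: parameters in the URL
      }
      HttpPost post = new HttpPost(endpoint);    // too long: parameters in the body
      post.setEntity(new UrlEncodedFormEntity(params, "UTF-8"));
      return post;
    }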

Speaking of parameters, I need to check this out, but I believe that graph URIs in SPARQL do not get encoded. Now that's not going to work if they contain their own query strings (and why wouldn't they?), but most graphs don't do that, so it's never bitten me before. Fortunately, doing a URL-decode on an unencoded graph URI is usually safe, so that's how I've been able to get away with it until now. But as a client that has to do the encoding, I needed to think more carefully about it.

From what I can tell, the only part that will give me grief is the query portion of the URI. So I checked for a query, and if there wasn't one, I just sent the graph unencoded. If there was one, then I'd encode just the query, add it back to the URI, and then see if decoding got me back to the original. If it did, I sent that. Otherwise, I just encoded the whole graph URI and sent that. As I write it down, it looks even more like a hack than ever, but so far it seems to work.
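Written as code, the heuristic is something like this (a sketch only; encodeGraphUri is my name for it, and the real code may differ):

    import java.io.UnsupportedEncodingException;
    import java.net.URLDecoder;
    import java.net.URLEncoder;

    String encodeGraphUri(String graph) throws UnsupportedEncodingException {
      int q = graph.indexOf('?');
      if (q < 0) return graph;                    // no query part: send unencoded
      // encode just the query portion, then check that decoding round-trips
      String candidate = graph.substring(0, q + 1)
                       + URLEncoder.encode(graph.substring(q + 1), "UTF-8");
      if (URLDecoder.decode(candidate, "UTF-8").equals(graph)) return candidate;
      return URLEncoder.encode(graph, "UTF-8");   // give up and encode the lot
    }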

So now that I have all the HTTP stuff happening, what about the response? Since answers can be large, my first thought was SAX. Well, actually, my first thought was Scala, since I've already parsed SPARQL response documents with Scala's XML handling, and it was trivial. But I'm in Java, so that means SAX or DOM. SAX can handle large documents, but possibly more importantly, I've always found SAX easier to deal with than DOM, so that's the way I went.

Because SAX operates on a stream, I thought I could build a stream handler, but I think that was just the medication talking, since I quickly remembered that it's an event model. The only way I could do it as a stream would be to build up a queue with one thread writing at one end and the consuming thread taking data off at the other. That's possible, but it's hard to test whether it scales, and if the consumer doesn't drain the queue in a timely manner, then it can cause problems for the writing end as well. It's possible to slow down the writer by not returning from the event methods until the queue has some space, but that seems clunky. Also, when you consider that a ResultSet is supposed to be able to rewind and so forth, a streaming model just doesn't work.
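For the record, the clunky back-pressure version would look something like this. Just a sketch with a bounded queue; the row type and emitRow are made up:

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // The writer is the SAX handler's thread. put() blocks whenever the
    // queue is full, which stalls the event method until the consumer
    // catches up: exactly the clunkiness described above.
    final BlockingQueue<String[]> rows = new ArrayBlockingQueue<String[]>(100);

    // called from an event method such as endElement, once per result row
    void emitRow(String[] bindings) throws InterruptedException {
      rows.put(bindings);   // doesn't return until there's space
    }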

In the end, it seemed that I would have to have my ResultSets in memory. This is certainly easier than any other option I could think of, and the size of RAM these days means that it's not really a big deal. But it's still in the back of my mind that maybe I'm missing an obvious idea.

The other thing that came to mind was to create an API that provides object events in the same way that SAX provides events for XML elements. This would work fine, but it's nothing like the API I'm trying to emulate, so I didn't give it any serious thought.

So now I'm in the midst of a SAX parser. There's a lot of work in there that I wouldn't need in other languages, but it does give you a comfortable feeling knowing that you have such fine-grained control over the process. Java enumerations have come in handy here, as I decided to go with a state-machine approach. I don't use this very often (outside of hardware design, where I've always liked it), but it's made the coding so straightforward it's been a breeze.
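To give an idea of the shape, the parser tracks where it is in the document with an enum. This is only an abbreviated sketch, not the real code; the element names come from the SPARQL XML results format, but the states and transitions are my own simplification:

    import java.util.ArrayList;
    import java.util.List;
    import org.xml.sax.Attributes;
    import org.xml.sax.SAXException;

    enum State { START, SPARQL, HEAD, RESULTS, RESULT }

    private State state = State.START;
    private final List<String> vars = new ArrayList<String>();

    public void startElement(String ns, String local, String qName,
                             Attributes attrs) throws SAXException {
      switch (state) {
        case START:   if ("sparql".equals(local)) state = State.SPARQL; break;
        case SPARQL:  if ("head".equals(local)) state = State.HEAD;
                      else if ("results".equals(local)) state = State.RESULTS;
                      break;
        case HEAD:    if ("variable".equals(local)) vars.add(attrs.getValue("name"));
                      break;
        case RESULTS: if ("result".equals(local)) state = State.RESULT; break;
        case RESULT:  break;  // binding/value elements handled here
      }
      // endElement (not shown) makes the reverse transitions
    }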

One question I have is whether the parser should create a ResultSet object, or whether it should be the object. It's sort of easy to just create the object with the InputStream as the parameter for the constructor, but then the object you get back could be either a boolean result or a list of variable bindings, and you have to interrogate it to find out which one it is. The alternative is to use a factory that returns different types of result sets. I initially went with the former because both have to parse the header section, but now that I've written it out, I'm thinking that the latter is the better way to go. I'll change it in the morning.
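The factory version would be along these lines. Everything named here is hypothetical; it's just the shape I have in mind:

    import java.io.IOException;
    import java.io.InputStream;

    // Parse the header first, then hand back whichever ResultSet type fits.
    // SparqlXmlParser, BooleanResultSet and BindingsResultSet are all
    // hypothetical names.
    static ResultSet createResultSet(InputStream in) throws IOException {
      SparqlXmlParser parser = new SparqlXmlParser(in);  // consumes the header
      if (parser.isBooleanResult()) {
        return new BooleanResultSet(parser.getBoolean());
      }
      return new BindingsResultSet(parser.getVariables(), parser.getRows());
    }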

I'm also thinking of having a parser to deal with JSON (I did some abstraction to make this easy), but for now I'll just take one step at a time.

One issue I haven't given a lot of time to yet is the CONSTRUCT query. These have to return a graph and not a result set. That brings a few questions to mind:
  • How do I tell the difference? I don't want to push it onto the API, since that's something the user may not want to have to figure out. But short of having an entire query parser, it could be difficult to see the form of the query before it's sent.
  • I can wait for the response and figure it out there, but then my SAX parser needs to be able to deal with RDF/XML. I usually use Jena's parser for this, since I know writing one is a lot of work. Do I really want to go that way? Unfortunately, I don't know of any good way to move to a different parser once I've seen the opening elements. I could try a BufferedInputStream, so I could rewind it (see the sketch after this list), but can that handle really large streams? I'll think on that.
  • How do I represent the graph at the client end?
Representing a graph goes way beyond ResultSet, and poses the question of just how far to go. A simple list of triples would probably suffice, but if I have a graph then I usually want to do interesting stuff with it.
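The BufferedInputStream idea from the second point would look roughly like this. A sketch only: peekRootElement, parseResults and parseRdfXml are hypothetical, and the mark limit is an arbitrary guess that a very long prolog would defeat:

    import java.io.BufferedInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    void dispatch(InputStream rawStream) throws IOException {
      BufferedInputStream in = new BufferedInputStream(rawStream);
      in.mark(64 * 1024);                  // remember the start of the stream
      String root = peekRootElement(in);   // read just far enough to see the root
      in.reset();                          // rewind to the mark
      if ("sparql".equals(root)) {
        parseResults(in);                  // a SELECT/ASK result document
      } else {
        parseRdfXml(in);                   // a CONSTRUCT result: hand off to Jena
      }
    }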

I'm thinking of using my normal graph library, which isn't out in the wild yet, but I find it very useful. I currently have implementations of it in Java, Ruby and Scala. I keep re-implementing it whenever I'm in a new language, because it's just so useful (it's trivial to put under a Jena or Mulgara API too). However, it also goes beyond the JDBC goal that I was looking for, so I'm cautious about going that way.

Anyway, it's getting late on a Saturday night, and I'm due for some more pain medication, so I'll leave it there. I need to talk to people about work again, so having an active blog will be important once more (even if it makes me look ignorant occasionally). I'll see if I can keep it up.

4 comments:

steen said...

Just to let you know that there is actually someone reading your blog, I would mention that you could also use a StAX parser as an alternative to SAX or DOM (the latter not really an alternative for parsing proper, is it?)

Paula said...

I can't stand using DOM so I agree with you on that. It seems unnecessarily complex, uses too much memory, is slow... well the list goes on.

I've had a quick look at StAX, and it looks to be exactly what I wanted. Bonus marks for it being in JDK 1.6, though this means that I'm now extremely embarrassed that I wasn't aware of it before now. In my defense, JDK 1.6 only made it into a production release on the Mac recently, but I still ought to have known.

I'm guessing that you could tell that I was after a "pull" based streaming parser, but I hadn't kept up enough with recent libraries to know about StAX. This is why it's worthwhile blogging: people tell you stuff like this (particularly important when you're not spending time in an office).
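A minimal sketch of what that pull style looks like with the javax.xml.stream API in JDK 1.6 (since the caller drives the parse, a ResultSet could fetch rows on demand instead of buffering everything):

    import java.io.InputStream;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamException;
    import javax.xml.stream.XMLStreamReader;

    void scan(InputStream in) throws XMLStreamException {
      XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(in);
      while (r.hasNext()) {
        if (r.next() == XMLStreamConstants.START_ELEMENT
            && "result".equals(r.getLocalName())) {
          // a new row of bindings starts here; pull it out and return to the caller
        }
      }
    }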

Rather than re-doing yesterday's work, I think I'll get my current iteration complete, and then come back to do a streaming version once it's all working. Thanks for the advice though, it looks perfect.

Alex said...

While you're at it, can you put in a JDBC-like driver layer that will let me plug in vendor-specific client libraries? Throw in your graph API and we'll have ourselves the beginnings of a Grand Unified Java API for RDF :-)

Paula said...

I had planned to keep it open for such drivers, but really, the point of SPARQL is to avoid the need for them!

The writing interfaces are still on a system-by-system basis, but SPARQL-Update is supposed to address that.

Perhaps if individual systems have their own peculiarities, then this can be encapsulated in a driver, but I believe that the plan is to avoid exactly that.