Monday, April 19, 2010

jSPARQLc


I spent Sunday and a little bit of this afternoon finishing up the code and testing for the SPARQL API. The whole thing had just a couple of typos in it, which surprised me no end, because I used VIM and not Eclipse. I must be getting more thorough as I age. :-)

Anyway, the whole thing works, though it's limited in scope. To start with, the only data-accessing methods I wrote were getObject(int) and getObject(String). Ultimately, I'd like to include many of the other get... methods, but these would require tedious mapping. For instance, getInt() would need to map xsd:int, xsd:integer, and all the other integral types to an integer. I've done this work in some of the internal APIs for Mulgara (particularly when dealing with SPARQL), so I know how tedious it is. I suppose I can copy/paste a good portion of it out of Mulgara (it's all Apache 2.0), but for the moment I wanted to get it up and running the way I said.
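To give a sense of the tedium, here's a rough sketch of what a getInt() has to deal with. To be clear, the Binding type and its accessors are placeholders I've made up for illustration, not the real code:

    import java.sql.SQLException;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    /** Placeholder for illustration: not the real jSPARQLc API. */
    interface Binding {
      String getDatatype();     // the full datatype URI of the bound literal
      String getLexicalForm();  // the literal value as text
    }

    public class IntMapping {
      static final String XSD = "http://www.w3.org/2001/XMLSchema#";

      // The XSD types that can reasonably be handed back as a Java int.
      static final Set<String> INTEGRAL_TYPES = new HashSet<String>(Arrays.asList(
          XSD + "int", XSD + "integer", XSD + "long", XSD + "short", XSD + "byte",
          XSD + "nonNegativeInteger", XSD + "positiveInteger",
          XSD + "nonPositiveInteger", XSD + "negativeInteger",
          XSD + "unsignedLong", XSD + "unsignedInt",
          XSD + "unsignedShort", XSD + "unsignedByte"));

      /** What a getInt() has to do before it can return anything. */
      public int getInt(Binding b) throws SQLException {
        if (!INTEGRAL_TYPES.contains(b.getDatatype())) {
          throw new SQLException("Not an integral type: " + b.getDatatype());
        }
        return Integer.parseInt(b.getLexicalForm());  // can still overflow for xsd:integer
      }
    }

And that's just one method; multiply by every accessor in ResultSet.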

If I were to do all of this, there are all sorts of mappings that could be done between data types. For instance, xsd:base64Binary would be good for Blobs. I could even introduce some custom data types to handle things like serialized Java objects, with a data type like java:org.example.ClassName. Actually, that looks familiar. I should see if anyone has done it.
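The Blob case at least is small. This is a sketch of just the decoding step, using the JAXB DatatypeConverter that ships with Java 6; the method itself is an assumption about what a getBytes() might look like, not anything I've written:

    import javax.xml.bind.DatatypeConverter;

    public class BlobMapping {
      static final String XSD = "http://www.w3.org/2001/XMLSchema#";

      /** Hypothetical accessor: decode an xsd:base64Binary literal to bytes. */
      public byte[] getBytes(String datatype, String lexicalForm) {
        if (!(XSD + "base64Binary").equals(datatype)) {
          throw new IllegalArgumentException("Not base64Binary: " + datatype);
        }
        return DatatypeConverter.parseBase64Binary(lexicalForm);
      }
    }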

Anyway, as I progressed, I found that while it was straightforward enough to get basic functionality in, the JDBC interfaces are really inappropriate.

To start with, JDBC usually accesses a "cursor" at the server end, and this is accessed implicitly by ResultSet. It's not absolutely necessary, but moving backwards and forwards through a result that isn't entirely in memory really does need a server-side cursor. Since I'm doing everything in memory right now, I was able to do an implementation that isn't TYPE_FORWARD_ONLY, but if I were to move over to using StAX (as suggested in the comments on my last entry) then I'd have to fall back to that.
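The upshot for anyone using it is to treat every result as forward-only, regardless of what the current implementation allows. A loop like this one (plain JDBC calls, nothing jSPARQLc-specific) will keep working even if the internals move to streaming:

    import java.sql.ResultSet;
    import java.sql.ResultSetMetaData;
    import java.sql.SQLException;

    public class ForwardOnly {
      /** Print every row using only forward movement through the results. */
      public static void printAll(ResultSet rs) throws SQLException {
        int columns = rs.getMetaData().getColumnCount();
        // No previous(), first() or absolute() here: a streaming (StAX-backed)
        // implementation would have to throw on anything but next().
        while (rs.next()) {
          StringBuilder row = new StringBuilder();
          for (int i = 1; i <= columns; i++) {
            row.append(rs.getObject(i)).append('\t');
          }
          System.out.println(row);
        }
      }
    }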

The server-side cursor approach also makes it possible to write to a ResultSet, since SQL results are closely tied to the tables they represent. However, this doesn't really apply to RDF, since statements can never be updated, only added and removed. SPARQL Update is coming (I ought to know, as I'm the editor for the document), but there is no real way to convert the update operations on a ResultSet back into SPARQL Update operations over the network. It might be theoretically possible, but it would need a lot of communication with the server, and it doesn't even make sense. After all, you'd be trying to map one operation paradigm to a completely different one. Even if it could be made to work, it would be confusing to use. Since my whole point in writing this API was to simplify things for people who are used to JDBC, it would be self-defeating.

So if this API were to allow write operations as well, then it would need a new approach, and I'm not sure what that should be. Passing SPARQL Update operations straight through might be the best bet, though it's not offering a lot of help (beyond doing all the HTTP work for you).
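For what it's worth, the pass-through needn't be much more than an HTTP POST. Here's a rough sketch; the update= form parameter follows the SPARQL 1.1 protocol drafts, and the endpoint URL is just the one from my tests, so treat both as assumptions about your own setup:

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.net.URLEncoder;

    public class PassThroughUpdate {
      public static void main(String[] args) throws Exception {
        String update = "INSERT DATA { GRAPH <test:data> "
            + "{ <test:subject> <test:predicate> \"value\" } }";
        URL endpoint = new URL("http://localhost:8080/sparql/");
        HttpURLConnection c = (HttpURLConnection) endpoint.openConnection();
        c.setRequestMethod("POST");
        c.setDoOutput(true);
        c.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        // Send the operation exactly as written: no translation layer at all.
        OutputStream out = c.getOutputStream();
        out.write(("update=" + URLEncoder.encode(update, "UTF-8")).getBytes("UTF-8"));
        out.close();
        if (c.getResponseCode() != HttpURLConnection.HTTP_OK) {
          throw new RuntimeException("Update failed: HTTP " + c.getResponseCode());
        }
      }
    }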

The other thing that I noticed was that a blind adherence to the JDBC approach created a few classes that I don't think are really needed. For instance, the ResultSetMetaData interface only contains two methods that make any sense from the perspective of SPARQL: getColumnCount() and getColumnName(). The data comes straight out of the ResultSet, so I would have put them there if the choice were mine. The real metadata is in the list of "link" elements in the result set, but this could be encoded with anything (even text), so there was no way to make that metadata fit the JDBC API. Instead, I just let the user ask for the list of links directly (creating a new method to do so).
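In use, the metadata boils down to this (standard JDBC calls, except for the commented-out getLinks(), which stands in for the new method I mentioned; the SparqlResultSet name is illustrative):

    import java.sql.ResultSet;
    import java.sql.ResultSetMetaData;
    import java.sql.SQLException;

    public class Metadata {
      /** The only two ResultSetMetaData calls that mean anything for SPARQL. */
      public static void describe(ResultSet rs) throws SQLException {
        ResultSetMetaData md = rs.getMetaData();
        for (int i = 1; i <= md.getColumnCount(); i++) {
          System.out.println("variable: " + md.getColumnName(i));
        }
        // The real metadata comes from the "link" elements, via a new method:
        // List<String> links = ((SparqlResultSet) rs).getLinks();  // illustrative name
      }
    }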

Another class that didn't make too much sense to me was Statement. It's a handy place to record some state about what you've been doing on a Connection, but other than that, it just seems to proxy the Connection it's attached to. I see there are some options for caching (which I've never used myself when working with JDBC), so I suspect that it does more than I give it credit for, but for the moment it just appears to be an inconvenience.

Anyway, I thought I'd put it up somewhere, and since I haven't tried Google's code repository before, I've put it up there. It's a Java SPARQL Connectivity library, so for lack of anything better I called it jSPARQLc (maybe JRC for Java RDF Connectivity would have been better, but there are lots of JRC things out there, while jSPARQLc didn't return any hits from Google, so I went with that). It's very raw and has very little configuration, but it passes its tests. :-)

Speaking of tests, if you want to try it, then the connectivity tests won't pass until you've done the following:
  • Start a SPARQL endpoint at http://localhost:8080/sparql/ (the source code in the test needs to change if your endpoint is elsewhere).
  • Create a graph with the URI <test:data>
  • Load the data in test.n3 into it (this file is in the root directory)
I know I shouldn't have hardcoded some of this, but it was just a test on a 0.1 level project. If it seems useful, and/or you have ideas for it, then please let me know.
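
For the curious, the intended usage is plain JDBC style. This sketch assumes you've already created a Connection with the library (I've left that step out, since the entry point is jSPARQLc-specific; check the source), and it queries the test graph from the setup above:

    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class Usage {
      /** Dump every statement in the test graph, JDBC style. */
      public static void dumpGraph(Connection conn) throws Exception {
        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery(
            "SELECT ?s ?p ?o FROM <test:data> WHERE { ?s ?p ?o }");
        while (rs.next()) {
          System.out.println(rs.getObject(1) + " " + rs.getObject(2) + " " + rs.getObject(3));
        }
        rs.close();
        stmt.close();
      }
    }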

Mulgara


Other than this, I have some administration I need to do to get both Mulgara and Topaz onto another server, and this seems to slow everything down as I wait for the admins to get back to me. It's also why there hasn't been a Mulgara release recently, even though it's overdue. However, I just got a message from an admin today, so hopefully things have progressed. Even so, I think I'll just cut the release soon anyway.

One fortunate aspect of the delayed release was a message I got from David Smith about how some resources aren't being closed in the TQL REST interface (when subqueries are used). He's sent me a patch, but I need to spend some time figuring out why I got this wrong, else I could end up hiding the real problem. That's a job for the morning... right after the SPARQL Working Group meeting. Once all of that is resolved, I'll get a release out, and try to figure out what I can do to speed up the server migration.

Oh, and I need to update Topaz to take advantage of some major performance improvements in Mulgara, and then I need to find even more performance improvements. Hopefully I'll be onto some of that by the afternoon, but I don't want to promise the moon only to come back tomorrow night and confess I got stuck on the same thing all day.
