Sunday, September 04, 2011

SPARQL JSON

After commenting the other day that Fuseki was ignoring my request for a JSON response, I was asked to submit a bug report. It's easy to cast unsubstantiated stones in a blog, but a bug report is a different story, so I went back to confirm everything. In the process, I looked at the SPARQL 1.1 Query Results JSON Format spec (I can't find a public copy of it, so no link, sorry. UPDATE: Found it.) and was chagrined to discover that it has its own MIME type of "application/sparql-results+json". I had been using the JSON type of "application/json", and this indeed does not work, but the corrected type does. I don't think it's a good idea to ignore "application/json" so I'll report that, but strictly speaking it's correct (at least, I think so. As Andy said to me in a back channel, I don't really know what the + is supposed to mean in the subtype). So Fuseki got it right. Sorry.

When I finally get around to implementing this for Mulgara I'll try to handle both. Which reminds me... I'd better get some more SPARQL 1.1 implemented. My day job at Revelytix needs me to do some Mulgara work for a short while, so that may help get the ball rolling for me.

separate

I commented the other day that I wanted a Clojure function that takes a predicate and a seq, and returns two seqs: one that matches the predicate and the other that doesn't match. I was thinking I'd build it with a loop construct, to avoid recursing through the seq twice.

The next day, Gary (at work) suggested the separate function from Clojure Contrib. I know there's some good stuff in contrib, but unfortunately I've never taken the time to fully audit it.

The implementation of this function is obvious:
(defn separate [f s]
[(filter f s) (filter (complement f) s)])
I was disappointed to learn that this function iterates twice, but Gary pointed out that there had been a discussion on exactly this point, and the counter argument is that this is the only way to build the results lazily. That's a reasonable point, and it is usually one of the top considerations for Clojure implementations. I don't actually have cause for complaint anyway, since the seqs I'm using are always small (being built out of query structures).

This code was also a good reminder that (complement x) offers a concise replacement for the alternative code:
  #(not x %)
By extension, it's a reminder to brush up on my idiomatic Clojure. I should finish the book The Joy of Clojure (which is a thoroughly enjoyable read, by the way).

Thursday, September 01, 2011

ASTs


After yesterday's post, I noticed that a number of references came to me via Twitter (using a t.co URL). After looking for it, I realized that I'm linked to from the Planet Clojure page. I'm not sure why, though I guess it's because I'm working for Revelytix, and we're one of the larger Clojure shops around. Only my post wasn't about Clojure - it was about JavaScript. So now I feel obligated to write something about Clojure. Besides, I'm out of practice with my writing, so it would do me good.

So the project I spend most of my time on (for the moment) is called Rex. It's a RIF based rules engine written entirely in Clojure. It does lots of things, but the basic workflow for running rules is:
  1. Parse the rules out of the rule file and into a Concrete Syntax Tree. We can parse the RIF XML format, the RIF Presentation Syntax, and the RIF RDF format, and we plan on doing others.
  2. Run a set of transformations on the CST to convert it into an appropriate Abstract Syntax Tree (AST). This can involve some tricky analysis, particularly for aggregates (which are a RIF extension).
  3. Transformation of the AST into SPARQL 1.1 query fragments.
  4. Execute the engine, by processing the rules to generate more data until the capacity to generate new data has been exhausted. (My... that's a lot of hand waving).
It's step 3 that I was interested in today.

The rest of this post is about how Rex processes a CST into an AST using Clojure, and about some subsequent refactoring that went on. You have been warned...

When I first wrote the CST to AST transformation step, it was to do a reasonably straight forward analysis of the CST. Most importantly, I needed to see the structure of the rule so that I could see what kind of data it depends on, thereby figuring out which other rules might need to be run once a given rule was executed. Since the AST is a tree structure, this made for relatively straight forward recursive functions.

Next, I had to start identifying some CST structures that needed to be changed in the AST. This is where it got more interesting. Again, I had to write recursive functions, but instead of simply analyzing the data, it had to be changed. It turns out that this is handled easily by having a different function for each type of node in the tree. In the normal case the function then recurses on all of its children, and constructs an identical node type using the new children. The leaf nodes then just return themselves. The "different function" for each type is actually accessed with the same name, but dispatches on the function type. In Java that would need a visitor pattern, or perhaps a map of types to functors, but in Clojure it's handled trivially with multimethods or protocols. Unfortunately, the online resources for describing the multi-dispatch aspects of Clojure protocols are not clear, but Luke VanderHart and Stuart Sierra's book Practical Clojure covers it nicely.

As an abstract example of what I mean, say I have an AST consisting of Conjunctions, Disjunctions and leaf nodes. Both Conjunctions and Disjunctions have a single field that contains a seq of the child nodes. These are declared with:
(defrecord Conjunction [children])

(defrecord Disjunction [children])
The transformation function can be called tx, and I'll define it with multiple dispatch on the node type using multimethods:
(defmulti tx  class)


(defmethod tx Disjunction [{c :children}]
(Disjunction. (map tx c)))

(defmethod tx Conjunction [{c :children}]
(Conjunction. (map tx c)))

(defmethod tx :default [n] n)
This will reconstruct an identical tree to the original, though with all new nodes except the leaves. Now, the duplication in the Disjunction and Conjunction methods should be ringing alarm bells, but in real code the functions have more specific jobs to do. For instance, the Conjunction may want to group terms that meet a certain condition (call the test "special-type?") into a new node type (call it "Foo"):
;; A different definition for tx on Conjunction

(defmethod tx Conjunction [{c :children}]
(let [new-children (map tx c)
special-nodes (filter special-type? new-children)
other-nodes (filter (comp not special-type?) new-children)]
(Conjunction. (conj other-nodes (Foo. special-nodes)))))
Hmmm... while writing that example I realized that I regularly run into the pattern of filtering out everything that meets a test, and everything that fails the test. Other than having to test everything twice, it seems too verbose. What I need is a function that will take a seq and a predicate and return a tuple containing a seq of everything that matches the predicate, and a second seq of everything that fails the predicate. I'm not seeing anything like that right now, so that may be a job for the morning.

I should note, that there is no need for a function to return the same type that came into it. There are several occasions where Rex returns a different type. For example, a conjunction between a Basic Graph Pattern (BGP) and the negation of another BGP becomes a MINUS operation between the two BGPs (a Basic Graph Pattern comes from SPARQL and is just a triple pattern for matching against the subject/predicate/object of a triple in an RDF store).

Overall, this approach works very well for transforming the CST into the full AST. As I've needed to incorporate more features and optimizations over time, I found that I had two choices. Either I could expand the complexity of the operation for every type in the tree processing code, or I could perform different types of analysis on the entire tree, one after another. The latter makes the process far easier to understand, making the design more robust and debugging easier, so that's how Rex has been written. It makes analysis slightly slower, but analysis is orders of magnitude faster than actually running the rules, so that is not a consideration.

Threading

The first time through, analysis was simple:
(defn analyse [rules]

(process rules))
Adding a new processing step was similarly easy:
(defn analyse [rules]

(process2 (process1 rules)))
But once a third, then a fourth step appeared, it became obvious that I needed to use the Clojure threading macro:
(defn analyse [rules]

(-> rules
process1
process2
process3
process4))
So now it's starting to look nice. Each step in the analysis process is a single function name implemented for various types. These names are then provided in a list of things to be applied to the rules, via the threading macro. There's a little more complexity (one of the steps picks up references to parts of the tree, and since each stage changes the tree, then these references will be pointing to an old and unused version of the tree. So that step has to be last), but it paints the general picture.

Partial

Another thing that I've glossed over is that each rule is actually a record that contains the AST as one of its members. The rules themselves are a seq which is in turn a member of a record containing a "Rule Program". So each process step actually ended up being like this:
(defn process1 [rule-program]

(letfn [(process-rule [rule] (assoc rule :body (tx (:body rule))))]
(assoc rule-program :rules (map process-rule (:rules rule-prog)))))
Did you follow that?

The bottom line is replacing the :rules field with a new one. It's mapping process-rule onto the seq of rules, and storing the result in a new rules-program, which is what gets returned. The process-rule function is defined locally as associating the :body of a rule with the tx function applied to the existing body. This creates a new rule that has has the tx applied to it.

This all looked fine to start with. A new rule program is created by transforming all the rules in the old program. A transformed rule is created by transforming the AST (the :body) in the old rule. But after the third analysis it became obvious that there was duplication going on. In fact, it was all being duplicated except for the tx step. But that was buried deep in the function. What was the best way to pull it out?

To start with, the embedded process-rule function came out. After all, it was just inside to hide it, and not because it had to pick up a closure anywhere. This function then accepts the kind of transformation that it needs to do as a parameter:
(defn convert-rule

[convert-fn rule]
(assoc rule :body (convert-fn (:body rule))))
Next, we want a general function for converting all the rules, which can accept a conversion function to pass on to convert-rule. It does all the rules, so I just pluralized the name:
(defn convert-rules

[conversion-fn rule-prog]
(assoc rule-prog :rules (map #(convert-rule conversion-fn %) (:rules rule-prog)))))
That works, but now the function getting mapped is looking messy (and messy leads to mistakes). I could improve it by defining a new function, but I just factored a function out of this function. Fortunately, there is a simpler way to define this new function. It's a "partial" application of convert-rule. But I'll still move it into a let block for clarity:
(defn convert-rules

[conversion-fn rule-prog]
(let [conv-rule (partial convert-rule conversion-fn)]
(assoc rule-prog :rules (map conv-rule (:rules rule-prog)))))
So now my original process1 definition becomes a simple:
(defn process1 [rule-program]

(convert-rules tx rule-program))
That works, but the rule-program parameter is just sticking out like a sore thumb. Fortunately, we've already seen how to fix this:
(def process1 (partial convert-rules tx))
Indeed, all of the processing functions can be written this way:
(def process1 (partial convert-rules tx))

(def process2 (partial convert-rules tx2))
(def process3 (partial convert-rules tx3))
(def process4 (partial convert-rules tx4))
It may seem strange that a function is now being defined with a def instead of a defn, but it's really not an issue. It's worth remembering that defn is just a macro that uses def to attach a symbol to a call to (fn ...).

Documenting

Of course, my functions don't appear as sterile as what I've been typing here. I do indeed use documentation. That means that the process1 function would look more like:
(defn process1 [rule-program]

"The documentation for process1"
(convert-rules tx rule-program))
One of the nice features of the defn macro is the ease of writing documentation. This isn't as trivial with def since it's a special form, rather than a macro, but it's still not too hard to do. You just need to attach some metadata to the object, with a key of :doc. Unfortunately, I couldn't remember the exact syntax for this today, and rather then go trawling through books or existing code, Alex was kind enough to remind me:
(def ^{:doc "The documentation for process1"}

process1 (partial convert-rules tx))


Framework

The upshot of all of this is a simpler framework for adding new steps to the existing analysis system. Adding a new analysis step just needs a new function thrown into the thread. I could put a call to (partial convert-rules ...) directly into the thread, but by using a def I get to name and document that step of the analysis. the only real work of the analysis is then done in the single multi-dispatch function, which is just as it should be.

So right now my evening "hobby" has been JavaScript, while my day job is Clojure. I have to tell you, the day job is much more fun.

Robusta

I was just posting this into Google+ when I realized I was typing more than I'd intended. It read more like a short blog post. Which reminded me.... "Oh yeah. I have a blog out there somewhere! Maybe I should write this in there."

After spending a lot of recent nights on JavaScript I think I'm starting to get a feel for it. I started with Douglas Crockford's "JavaScript: The Good Parts", and am now plowing through David Flanagan's "JavaScript: The Definitive Guide". (I met David when he came to Brisbane to run a short course on Java programming back in 1996. Nice guy. He said that he'd preferred to have called his book "Java in a Demitasse", but O'Reilly wanted it in their "Nutshell" series).

After using languages like Ruby, Erlang, Scala and Clojure, I'm finding it a little frustrating, but it's hard to argue with the ubiquity of the platform. Fortunately it has closures and first class functions, though the variable scoping is bizarre. I've been enjoying the callback approach to asynchronous function calls, though the syntax tends to make the resulting code confusing to read. I'm mostly sticking to Crockford's subset of the language, and this does make things a little more sensible. Flanagan's book has been filling in the gaps for me, but it's especially useful for documenting libraries like HTML5 Canvas and the File API.

As with all first attempts at using a language, mine is a little messy and inconsistent. However, it can't hurt to put it out there. I've built a simple tool for working with SPARQL endpoints (specifically aimed at Jena/Fuseki, but it should mostly work on others too). The important piece is the SPARQL connection object that it comes with (found in sparql.js). I'm hoping that this will be a useful object for more general application. It can even convert XML responses into a SPARQL-JSON structure (I wrote this after discovering Fuseki was ignoring my Content-type settings on queries).

Unfortunately, it's missing one important piece, which is the ability to upload a file from a browser. In general, it's possible to upload a file using a form submission, but that encodes all the parameters into the request body, and the SPARQL HTTP protocol requires that the graph URI appear as a parameter in the URL of the request. In an attempt to get the graph parameter out of the body and into the URL, I even tried dynamically constructing the URL for the form submission, but the browser "cleverly" saw what I was doing and pushed the parameter back into the body. So I can't use form submission. Alternatively, JavaScript makes it easy to submit an HTTP POST operation with everything set the way you want it. However, the only way to read a local file is through the form submission process, which means I still can't do a file upload. In the end, I just used Fuseki's file upload servlet, but this has the problem of being non-standard, and it also doesn't like URIs that aren't http (yes Andy, that's why I asked you about this restriction - though I'd already run into it at work).

The resulting system needed a name, so I called it Robusta. Everyone seems to be enamored with Arabica beans, but no one ever talks about Robusta beans. Don't get me wrong.... if I had to choose between the two I'd definitely go for the Arabica. But by blending in a small portion of Robusta beans you add a richness to the flavor of your coffee (it's also used as a cheap "filler" and promotes crema in espresso, but I like the flavor aspect). At the time, I came up with the name because I really needed some caffeine, and Arabica was too obvious. But in retrospect, I like the name, since adding a bit of SPARQL to your scripts can really enhance a system (OK, that's tacky. I probably need another one of those coffees). It wasn't after I'd had some coffee that I thought to look for other projects with the same name, but by that point it was already up there.

Robusta is still a work in progress, and it's mostly a late night project that fits around everything else that I'm doing. But I'm using it at work, and it makes my life easier. I'd like to know if anyone has ideas for it, or can point out errors, inefficiencies, or potential improvements. It's posted at GitHub as a part of the Revelytix project group, at:
  http://github.com/revelytix/robusta

Thursday, October 21, 2010

Clojure Experiences

I've been learning Clojure recently, as my new job uses it frequently. I'm happy with this (it has me attending (first clojure-conj) right now), but I'm still learning a lot about the language. I'm comfortable with functional programming, and of course, there's next to nothing to learn about syntax and grammar, but there is a steep learning curve to effectively use the libraries. Not only are there a lot of them, but they are not really well documented.

OO languages tell you which functions belong to a class, but since Clojure functions don't really belong to anything it can sometimes be hard to find everything that is relevant to the structures you are dealing with. I've had a few people tell me that there are good websites out there that provide the sort of documentation I'd like, so I'll go looking for them soon.

Meanwhile, I find myself looking at the libraries all the time. Other than the general principle of code reuse, the dominant paradigm in Clojure is to use the core libraries whenever possible. This is because those libraries are likely to be much more efficient than anything you can come up with on your own. They're also "lazy", meaning that the functions do very little work themselves, with all of the work being put off until you actually need it. Laziness is a good thing, and should show up in Clojure code as much as possible. Using the core libraries is the best way to accomplish that.

Clojure Maps


Part of my current task is to take a data structure that represents an RDF graph, and process it. The structure that I have now maps resource identifiers (sometimes URIs, but mostly internal blank node IDs) to structures representing all of the property/values associated with that resource. Unfortunately, the values are simple identifiers too, so to find their data I have to go back to the structure and get the resources for that object. This has to be repeated until everything found has no properties or is a literal (which has no properties). Also, the anonymous resources in RDF lists are showing up as data structures, which makes them difficult to work with as well.

Following this kind of structure is easy enough in an imperative language, though quite tedious. However, in Clojure it is simply too inelegant to work with. So I want some way to make this to a nice nested structure that represents the nested data from the RDF graph (fortunately, this kind of graph is guaranteed to not have any cycles). In particular, I'd this new structure to be lazy wherever I can get it.

The basic approach is simple: get the property/values for a given resource. Then convert the values (resources) that aren't literals to the actual data that they represent. This is done by getting the property/values of those resources, which is what I started with, so now I know I need recursion.

Now there are a few caveats. First off, if a resource is actually the head of a list, I don't want to just get it's properties. Instead, I want to traverse the list (lazily, of course) and get all the values attached to the nodes. That's easy enough, and is done with a function to identify resources that are lists, and another one to construct the list (a recursive function that uses lazy-seq). Second, there is one property/value that I want to preserve, and that is the :id property. This property shows the URI or identifier for the original node, so the value here needs to be kept, and not expanded on (it would recurse indefinitely otherwise).

After identifying these requirements, I figured I needed some way to convert all the key/value elements of a map into a new map. This sounds very much like converting all the elements in a sequence (a "seq") into a new sequence, a task which is done using the map function. Is there a similar operation to be applied to a map? (The noun and the verb get confusing here, sorry).

It turns out that Alex Miller had been dealing with exactly this problem just recently, and has built a new function called mapmap to do this operation. However, I wasn't aware of this at the time, so I had to pursue my own solution. I'm reasonably happy with my answer, and I learnt a bit in the process, so I'm glad I solved it myself, though it would have been nice to get that sleep back.

Morphing a Map


There have been a few times when I've needed to create a new map, and the first time I needed to I went to clojure.core to see what kind of map-creating functions were available to me. At that point I found hash-map, which takes a sequence containing an even number of items, and converts it into a map using the even-numbered offsets as the keys and the odd-numbered offsets as the values (with the first offset being "0"). So I could create a map of numbers to their labels using the following:
  (hash-map 1 "one" 2 "two" 3 "three")
So was there a way to create this list from my existing map? (Well, yes, but was it easy?)

I wanted to iterate over my existing map "entries" and create new entries. If you treat a map as a seq, then it appears as a sequence of 2-element vectors, with the first element being the key and the second element being the value. So if I want to morph the values in a map called "mymap" with a function called "morph", I could convert these pairs into new pairs with:
  (map (fn [[k v]] [k (morph v)]) mymap)
For anyone unfamiliar with the syntax, my little anonymous function in the middle there is taking as it's single parameter a 2 element vector, and destructuring it into the values k and v, then returning a new two element vector of k and (morph v). So let's apply this to a map of numbers to their names (incidentally, I don't code anything like this, but I do blog this way)....
  (def mymap {1 "one", 2 "two", 3 "three"})
(def morph #(.toUpperCase %))
(map (fn [[k v]] [k (morph v)]) mymap)
The result is:
([1 "ONE"] [2 "TWO"] [3 "THREE"])
Looks good, but there is one problem. This result is a sequence of pairs, while hash-map just takes a sequence. That's OK, we can just call flatten before passing it in:
  (apply hash-map (flatten (map (fn [[k v]] [k (morph v)]) mymap)))
{1 "ONE", 2 "TWO", 3 "THREE"}
The "apply" was needed because hash-map needed lots of arguments (6 in this case) while the result of flatten would have been just 1 argument (a list).

This is looking good. It's what I've done in the past, and it's hardly worthy of a blog post. But this time I had a new twist. My "values" were also structures, and often these structures were seqs as well. For instance, say I have a map of numbers to names in a couple of languages:
(def langmap {1 ["one" "uno"], 2 ["two" "due"], 3 ["three" "tre"]})
My morph method needs to change to accept a seq of strings instead of just a string, but that's trivial. So then the map step looks good, giving an output of:
([1 ("ONE" "UNO")] [2 ("TWO" "DUE")] [3 ("THREE" "TRE")])
But this fails to be loaded into a hash-map, with the error:
java.lang.IllegalArgumentException: No value supplied for key: TRE (NO_SOURCE_FILE:0)
This is exactly what bit me at 1am.

The problem is simple. flatten is recursive. It dives into the sub-sequences, and brings them up too. The result is trying to map 1 to "ONE", mapping "UNO" to 2, and so on, until it gets to "TRE" at which point it doesn't has a value to map it to, so it gives an error. Fortunately for me I had an odd number of items when everything was flattened, or else I would have been working with a corrupted map and may not have discovered it for some time.

My first thought was to write a version of flatten that isn't recursive, but that's when I took a step back and asked myself if there was another way to construct a map. Maybe I shouldn't be using hash-map. So I looked at Clojure's cheatsheet, and discovered zipmap.

zipmap


zipmap wasn't quite what I wanted, in that instead of taking a sequence of pairs, it wants a pair of sequences. It then constructs a map where the nth key/value comes from the nth entry in the first sequence (the key) and the nth entry from the second sequence (the value). So I just needed to rotate, or transpose my data. Fortunately, the day before one of my co-workers had asked me exactly the same question.

In this case, he had a sequence of MxN entries, looking something like:
[[a1 a2 a3] [b1 b2 b3] [c1 c2 c3]]
And the result that he wanted was:
[[a1 b1 c1] [a2 b2 c2] [a3 b3 c3]]
My first thought was some wildly inefficient loop-comprehension, when he asked me if the following would work (for some arg):
  (apply map list arg)
Obviously he didn't have a repl going, so I plugged it in, and sure enough it worked. But why?

I struggled to understand it until I looked up the documentation for map again, and remembered that it can take multiple sequences, not just one. I had totally forgotten this. When provided with n sequences, map requires a function that accepts n arguments. It then goes through each of the sequences in parallel, with the output being a sequence where the nth entry is the result of calling the function with the nth entry of every source sequence. In this case, the function was just list, meaning that it creates a list out of the items from each sequence. The apply just expanded the single argument out into the required lists that map needed. This one line is a lovely piece of elegance, and deserves its own name:
  (defn rotate [l] (apply map list l))

So the final composition for creating a new map with values modified from the original map is:
  (apply zipmap (rotate (map (fn [[k v]] [k (morph v)]) langmap)))
{3 ("THREE" "TRE"), 2 ("TWO" "DUE"), 1 ("ONE" "UNO")}


Now the actual function that I use inside map isn't nearly so simple. After all, I need to recursively traverse the RDF graph, plus I need to find RDF lists and turn them into sequences, etc. But this zipmap/rotate/map idiom certainly makes it easy.

Incidentally, I could do exactly the same thing with Alex's mapmap:
  (mapmap key (comp morph val) langmap)
Oh well. I'm still happy I discovered zipmap/rotate.

Thursday, May 13, 2010

Feedburner


Something reminded me that I had RSS going through Feedburner, so I tried to look at it. Turns out that after the Google move they want to update everyone's account, and I hadn't done it yet. So I gave them all my details, and the system told me that it didn't recognize me.

So then I said I'd forgotten my password, at which point it recognized my email address and asked a "secret question". I know the answer to the question, and it's not even that secret, since I'm sure that most of my family could figure it out, but Feedburner claims I'm wrong. Could someone have changed this? Maybe, but there are only a handful of people who would know that answer, and I trust them not to do something like that.

Not a problem, I'll just submit a report explaining my problem. Only it seems that Feedburner is exempt from that kind of thing. The closest you can get to help is an FAQ. Good one Google.

So now what? Well, I guess I just create a new Feedburner link and take it from there. Sorry, but if you've been using RSS to follow this blog, then would you mind changing it please? It's annoying, I know.

Wrongful Indexing


Some years ago I commented on the number and type of indexes that can be used for tuples. At the time, I pointed out that indexing triples required 3 indexes, and there were 2 appropriate sets of indexes to use. Similarly, quads can be indexed with 6 indexes, and there are 4 such sets. In both cases (5-tuples get silly, requiring 10 indexes and there are 12 possible sets). In each case, I said that each set of indexes would work just as well as the others, and so I always selected the set that included the natural ordering of the tuples.

So for RDF triples, the two sets of indexes are ordered by:
  subject,predicate,object
predicate,object,subject
object,subject,predicate
and
  object,predicate,subject
predicate,subject,object
subject,object,predicate
For convenience I have always chosen the first set, as this includes the natural ordering of subject/predicate/object, but it looks like I was wrong.

In using these indexes I've always presumed random 3-tuples, but in reality the index is representing RDF. Whenever I thought about the data I was looking for, this seemed OK, but that's because I tended to think about properties on resources, and not other RDF structures. In particular, I was failing to consider lists.

Since first building RDF indexes (2001) and writing about them (2004) I've learnt a lot about functional programming. This, in turn, led to an appreciation of lists, particularly in algorithms. I'm still not enamored of them in on-disk structures, but I do appreciate their utility and elegance in many applications. So it was only natural that when I was representing RDF graphs with Scala and I needed to read lists, then I used some trivial recursive code to build a Scala list, and it all looked great. But then I decided to port the Graph class to Java to avoid including the Scala Jars for a really lightweight library.

I'd like to point out that I'm talking about a library function that can read a well-formed RDF list and return a list in whatever programming language the library is implemented in. The remainder of this post is going to presume that the lists are well formed, since any alternatives can never be returned as a list in an API anyway.

Reading a list usually involves the subject/predicate/object (SPO) index. You start by looking up the head of the list as a subject, then the predicates rdf:first for the data at that point in the list, and rdf:rest for the rest of the list. Rinse and repeat until rdf:rest yields a value of rdf:nil. So for each node in the list, there is a lookup by subject, followed by two lookups by predicate. This is perfect for the SPO index.

However, it's been bugging me that I have such a general approach, when the structure is predetermined. Why look up these two predicates so generally, when we know exactly what we want? What if we reduce the set we're looking in to just the predicates that we want and then go looking for the subjects? That would mean looking first by predicate, then subject, then object, leading to a PSO index. So what does that algorithm look like?

First, look up the rdf:rest predicate, leading to an index of subject/object containing all list structures. Next, look up the rdf:rest predicate, retrieving subject/objects containing all the list data. Now to iterate down the list no longer involves finding the subject followed by the predicate, in order to read the next list node, but rather it just requires finding the subject, and the list node is in the corresponding object. Similarly with the data stored in the node. We're still doing a fixed number of lookups in an index, which means that the overall complexity does not change at all. Tree indexes will still give O(log(N)) complexity, and hash indexes will still give O(1) complexity. However, each step can involve disk seeks, so it's worth seeing the difference.

To compare more directly, using an SPO index requires every node to:
  • Lookup across the entire graph by subject.
  • Lookup across the subject (2 or 3 predicates) for rdf:first.
  • Lookup across the subject (2 or 3 predicates) for rdf:rest.

For the PSO index we have some initial setup:
  • Lookup across the entire graph for the rdf:first predicate.
  • Lookup across the entire graph for the rdf:rest predicate.
Then for every node:
  • Lookup the rdf:first data for the value.
  • Lookup the rdf:rest data for the next node.

It's important to note a few things, particularly for tree indexes. Trees are the most likely structure used when using a disk, so I'm going to concentrate on them. The number of subjects in a graph tends to scale up with the size of the graph, while the number of predicates is bounded. This is because predicates are used to express a model, with each predicate indicating a certain relationship. Any system trying to deal with the model needs some idea of the concepts it is dealing with, so it's almost impossible to deal with completely arbitrary relationships. If we know what the relationships are ahead of time, then there must be a fixed number of them. In contrast, subjects represent individuals, and these can be completely unbounded. So if we look up an entire graph to find a particular subject, then we may have to dive down a very deep tree to find that subject. Looking across the entire graph for a given predicate will never have to go very deep, because there are so few of them.

So the first algorithm (using the SPO index) iteratively looks across every subject in the graph for each node in the list. The next two lookups are trivial, since nodes in a list will only have properties of rdf:first, rdf:rest and possibly rdf:type. The data associated with these properties will almost certainly be in the same block where the subject was found, meaning that there will be no more disk seeks.

The second algorithm (using the PSO index) does a pair of lookups across every predicate in the graph. The expected number of disk seeks to find the first predicate is significantly fewer than for any of the "subject" searches in the first algorithm. Given how few predicates are in the system, then finding the second predicate may barely involve any disk seeks at all, particularly since the first search will have populated the disk cache with a good portion of the tree, and the similarities in the URIs of the predicates is likely to make both predicates very close to each other. Of course, this presumes that the predicates are even in a tree. Several systems (including one I'm writing right now) treat predicates differently because of how few there are. Indeed, a lot of systems will cache them in a hashtable, regardless of the on-disk structure. So the initial lookup is very inexpensive.

The second algorithm then iterates down the list, just like the first one does. However, this time, instead of searching for the nodes out of every subject in the list, it will now be just searching for these nodes in the subjects that appear as list nodes. While lists are commonly used in some RDF structures, the subjects in all the lists typically form a very small minority out of all the subjects in a graph. Consequently, depending on the type and depth of trees being used, iterating through a list with the second algorithm, could result in a 2 or three (or more) fewer disk seeks for each node. That's a saving that can add up.

Solid State Disks


I've been talking about disk seeks, but this is an artificial restriction imposed by spinning disk drives. Solid State Disks (SSDs) don't have this limitation.

People have been promoting solid state drives (SSDs) for some years now, but I've yet to use them myself. In fact, most people I know are still using traditional spinning platters. The prices difference is still a big deal, and for really large data, disk drives are still the only viable option. But this will change one day, so am I right to be concerned about disk seeks?

Disk seeks are a function of data locality. When data has to be stored somewhere else on a disk, the drive head must physically seek across the surface to this new address. SSDs don't require anything to move, but there are still costs in addressing scattered data.

While it is possible to address every bit of memory in a device in one step, in practice this is never done. This is because the complexity of the circuit grows exponentially as you try to address more and more data in one step. Instead, the memory is broken up into "banks". A portion of the address can now be used to select a bank, allowing the remaining bits in the address to select the required memory in just that bank. This works well, but it does lead to some delays. Selecting a new bank requires "setup", "hold" and "settling" times, all leading to delays. These delays are an order of magnitude smaller than seek delays for a spinning disk, but they do represent a limit on the speed of the device. So while SSDs are much faster than disk drives, there are still limits to their speed, and improvements in data locality can still have a significant impact on performance.

Tuesday, May 04, 2010

Web Services Solved


It's a been a long couple of days, and I really want to relax instead of write, but it's been a few days and I've been promising myself that I'd write, so I figured I need to get something written before I can open a beer.

First of all, the web services problem was trivial. I recently added a new feature that allowed ContextHandlers in Jetty to be configured. Currently the only configuration option I've put in there is the one that was requested, and that is the size of a form. Apparently this is 200k by default, but if you're going to load large files then that may not be enough. Anyway, the problem came about when my code tried to read the maximum form size from the configuration. I wasn't careful enough to check if the context was being configured in the first place, so an NPE was thrown if it was missing.

Fortunately, most people would never see the problem, since the default configuration file includes details for contexts, and this ends up in every build by default. The reason I was seeing it is because Topaz replaces the configuration with their own (since it describes their custom resolvers), and this custom configuration file doesn't have the new option in it. Of course, I could just add it to Topaz, but the correct solution is to make sure that a configuration can't throw an NPE – which is exactly what I told the Ehcache guys, so it's fitting that I have to do it myself. :-)

Hosting


Since I'm on the topic of Topaz, it looks like te OSU/OSL guys and I have both the Topaz and Mulgara servers configured. They wouldn't typically be hosting individual projects (well, they do occasionally), but in this case it's all going in under the umbrella of Duraspace. Of course this has taken some time, and in the case of Topaz I'm still testing that it's all correct, but I think it's there. I'll be changing the DNS for Topaz over soon, and Mulgara was changed last week. Mulgara's DNS has propagated now, so I'm in the process of cutting a long-overdue release.

One thing that changed in the hosting is that I no longer have a Linux host to build the distribution on. Theoretically, that would be OK, since I ought to be able to build on any platform. However, Mulgara is still distributed as a build for Java 1.5 (I've had complaints when I accidentally put out a release that was built for 1.6). This is easy to set up on Linux, since you just change the JAVA_HOME environment variable to make sure you're pointing to a 1.5 JDK. However, every computer I have here is a Mac. Once upon a time that didn't change anything, but now all JDKs point to JDK 1.6. That means I need to configure the compiler to output the correct version. It can be done, but Mulgara wasn't set up for it.

If you read the Ant documentation on compiling you'll see that you can set the target to any JDK version you like. However, that would require editing 58 files (I just had to run a quick command to see that. Wow... I didn't realize it was so bad). I'm sure I'd miss a <javac> somewhere. Fortunately, there is another option, even if the Ant documents discourage it. There's a system parameter called ant.build.java.target which will set the default value globally. I checked to make sure that nothing was going to be missed by this (ie. that nothing was manually setting the target) and when it all looked good I changed the build script to set this to "1.5". I didn't change the corresponding script on Windows, but personally I only want this for distributions. Anyone who needs to set it up on Windows probably has the very JDK they want to run Mulgara on anyway.

Well, that's my story, and I'm sticking to it.

Semantic Universe


What else? Oh yes. I wrote a post for Semantic Universe. It's much more technical that the other posts I've seen there, but I was told that would be OK. I'm curious to know how it will be received.

I was interested in how it was promoted on Twitter. I wrote something that mixes linked data and SPARQL to create a kind of federated query (something I find to be very useful, BTW, and I think more people should be aware of it). However, in the process I mentioned that this shouldn't be necessary, since SPARQL 1.1 will be including a note on federated querying. Despite SPARQL 1.1 only being mentioned a couple of times, the tweet said, that I discussed "how/why SPARQL 1.1 plans to be a bit more dazzling". Well, admittedly SPARQL 1.1 will be more dazzling, but my post didn't discuss that. Perhaps it was a hint to talk about that in a future post.

Miscellanea


Speaking of future posts, I realized that I've been indexing RDF backwards, at least for lists. It doesn't affect the maximum complexity of iterating a list, but it does affect the expected complexity. I won't talk about it tonight, but hopefully by mentioning it here I'll prompt myself to write about it soon.

This last weekend was the final weekend that my in-laws were visiting from the other side of the planet, so I didn't get much jSPARQLc done. I hope to fix that tomorrow night. I'm even wondering if the Graph API should be factored out into it's own sister project. It's turning out to be incredibly useful for reading and working with RDF data when you just want access to the structure and you don't need a full query engine. It would even plug directly into almost every query engine out there, so there's a lot of utility to it.

I'm also finally learning Hadoop, since I've had more pressure to consider a clustered RDF store, much as BBN have created. I've read the MapReduce, GFS and BigTable papers, so I went into it thinking I'd be approaching the problem one way, but the more I learn the more I think it would scale better if I went in other directions. So for the moment I'm trying to avoid getting too many preconceived notions of architecture until I've learnt some more and applied my ideas to some simple cases. Of course, Hive tries to do the same thing for relational data, so I think I need to look at the code in that project too. I have a steep learning curve ahead of me there, but I've been avoiding those recently, so it will do me some good.

Other than that, it's been interviews and immigration lawyers. These are horribly time consuming, and way too boring to talk about, so I won't. See you tomorrow.

Friday, April 30, 2010

Topaz and Ehcache


Don't ask what I did 2 days ago, because I forget. It's one of the reasons I need to blog more. I also forgot because my brain got clouded due to yesterday's tasks.

I didn't get a lot done yesterday for the simple reason that I was filling in paperwork for an immigration lawyer. For anyone who has ever had to do this mind numbing task, they probably know that you end up filling in mostly the same things that you filled in 12 months ago, but just subtly different, so there's no possibility of using copy/paste. They will also know that getting all of the information together can take half a day. Strangely, the hardest piece of information was my mother's birthday (why do they want this? I have no idea). There is a bizarre story behind this, that I won't go into right now, but with my mother asleep in Australia there was no way to ask her. Fortunately, I had two brothers online at the time: one lives here in the USA, and the other is a student in Australia (who was up late drinking with friends, and decided to get online to say hi before going to bed). Unfortunately, neither of them new either (the well-oiled brother being completely unaware of why we didn't know).

But I finally got it done, cleared my immediate email queue (only 65 to go!) and got down to work.

My first task was to get the Topaz version of Mulgara up and running the same way it used to run 10 months ago. I had already tried going back through the subversion history for the project (ah, that's one of the things I did two days ago!), but with no success. However, I had been able to find out that others have had this error with Ehcache. No one had a fix, since upgrading the library normally made their problem go away. Well I tried upgrading it myself, but without luck. Evidently the problem was in usage, but I didn't know if it was a problem in the code talking to Ehcache, or the XML configuration file that is uses. Since everything used to work without error, I figured that the code was probably OK, and that it was the configuration at fault. The complexity of the configuration file only deepened my suspicion.

I didn't want to learn the ins-and-outs of an Ehcache configuration, so my first non-lawyer related task yesterday was to look at the code where the exception was coming from (thank goodness the Java compiler includes line numbers in class files by default). So it turned out that Terracotta (the company who provides Ehcache) have a nice navigable HTML versions of all their opensource code, which made this task much more pleasant than having to get it all from Subversion. This led me to the line that was throwing the exception, which looked like:
List localCachePeers = cacheManager.getCachePeerListener("RMI").getBoundCachePeers();
Great, a compound statement. OK, so I use them myself, but they're annoying when you debug. Was it cacheManager that was null or was it the return value from getCachePeerListener("RMI")?

At this point I jumped around in the code for a bit (I quite like those hyperlinks. I've seen them before too. I should figure out which project creates these pages), looking for what initialized cacheManager. I didn't find definitive proof that it was set, but it looked pretty good. So I looked at getCachePeerListener("RMI") and discovered that it was a lookup in a Hashmap. This is a prime candidate for returning null, and indeed the documentation for the method even states that it will return null if the scheme is not configured. Since the heartbeat code was making the presumption that it could perform an operation on the return value of this method, then the "RMI" scheme is evidently supposed to be configured in every configuration. The fact that it's possible for this method to return null (even if it's not supposed to) means that the calling code is not defensive enough (any kind of NullPointerException is unacceptable, even if you catch it and log it). Also, the fact that something is always supposed to be configured for "RMI" had me looking in the code to discover where listeners get registered. This turned out to come from some kind of configuration object, which looked like it had been built from an XML file.

So the problem appears to be the combination of something that's missing from the configuration file, and a presumption that it will be there (i.e. the code couldn't handle it if the item was missing). At this point I joined a forum and described the issue, both to point out that the code should be more defensive, and also to ask what is missing. In the meantime, I tried creating my own version of the library with a fix in it, and discovered that the issue did indeed go away. Then this morning I received a message explaining what I needed to configure, and also that the code now deals with the missing configuration. It still complains on every heartbeat (in 5 second intervals), but now it tells you what's wrong, and how to fix it:
WARNING: The RMICacheManagerPeerListener is missing. You need to configure
a cacheManagerPeerListenerFactory with
class="net.sf.ehcache.distribution.RMICacheManagerPeerListenerFactory"
in ehcache.xml.
Kudos to "gluck" for the quick response. (Hey, I just realized – "gluck" is from Brisbane. My home town!)

Incidentally, creating my own version of Ehcache was problematic in itself. It's a Maven project, and when I tried to build "package" it attempted to run all the tests, which took well over an hour. Coincidentally, it also happened to be dinner time, so I came back later, only to discover that not all of the tests had passed, and that the JAR files had not been built. Admittedly, it was an older release, but it was a release, so I found this odd. In the end, I avoided the tests by removing the code, and running the "package" target again.

With all the errors out of the way I went back to the Topaz system again and run it. As I said earlier, it was no longer reporting errors. But then when I tried to use queries against it, it was completely unresponsive. A little probing found that it wasn't listening for HTTP at all, so I checked the log, and sure enough:
EmbeddedMulgaraServer> Unable to start web services due to: null [Continuing]
Argh.

Not only do I have to figure out what's going on here, it also appears that someone (possibly me) didn't code this configuration defensively enough! Sigh.

At that point it was after dinner, and I had technical reading to do for a job I might have. Well, I've received the offer, but it all depends on me not being kicked out of the country.

Monday, April 26, 2010

Multitasking


At the moment I feel like I have too many things on the boil. I'm regularly answering OSU/OSL about porting data over from Topaz and Mulgara, I'm supposed to be getting work done on SPARQL Update 1.1 (which suffered last week while I looked around for a way to stay in the USA), I'm trying to track down some unnecessary sorting that is being done by Mulgara queries in a Topaz configuration, I'm trying to catch up on reading (refreshing my memory on some important Semantic Web topics so that I keep nice and current), I'm trying to find a someone who can help us not get kicked out of the country (long story), I'm responding to requests on Mulgara, and when I have some spare time (in my evenings and weekends) I'm trying to make jSPARQLc look more impressive.

So how's it all going?

OSU/OSL


Well, OSU/OSL are responding slowly, which is frustrating, but also allows me the time to look at other things, so it's a mixed blessing. They keep losing my tickets, and then respond some time later apologizing for not getting back to me. However, they're not entirely at fault, as I have sudo access on out server, and could do some of this work for myself. The thing is that I've been avoiding the learning curves of Mailman and Trac porting while I have other stuff to be doing. All the same, we've made some progress lately, and I'm really hoping to switch the DNS over to the new servers in the next couple of days. Once that happens I'll be cutting an overdue release to Mulgara.

SPARQL Update 1.1


I really should have done some of this work already, but my job (and impending lack thereof) have interfered. Fortunately another editor has stepped up to help here, so with his help we should have it under control for the next publication round.

The biggest issues are:
  1. Writing possible responses for each operation. In some cases this will simply be success/failure, but for others it will mean describing partial success. For instance, a long-running LOAD operation may have loaded 100,000 triples before failing. Most systems want that data to stay in there, and not roll back the change, and we need some way to report what has happened.
  2. Dealing with an equivalent for FROM and FROM NAMED in INSERT/DELETE operations. Using FROM in a DELETE operation looks like this is the graph that you want to remove data from, whereas we really want to describe the list of graphs (and/or named graphs) that affect the WHERE clause. The last I read, the suggestion to use USING and USING NAMED instead was winning out. The problem is that no one really likes it, though they don't like every other suggestion even more. :-)


I doubt I'll get much done before the next meeting, but at least I did a little today, and I've been able to bring the other editor up to speed.

Sorting


This is a hassle that's been plaguing me for a while. A long time back PLoS complained about queries that were taking too long (like, up to 10 minutes!). After looking at them, I found that a lot of sorting of a lot of data was going on, so I investigated why.

From the outset, Mulgara adopted "Set Semantics". This meant that everything appeared only once. It made things a little harder to code, but it also made the algebra easier to work with. In order to accomplish this cleanly, each step in a query resolution removed duplicates. I wasn't there, so I don't know why the decision wasn't made to just leave it to the end. Maybe there was a good reason. Of course, in order to remove these duplicates, it had to order the data.

When SPARQL came along, the pragmatists pointed out that not having duplicates was a cost, and for many applications it didn't matter anyway. So they made duplicates allowable by default, and introduced the DISTINCT keyword to remove them if necessary, just like SQL. Mulgara didn't have this feature (though the Sesame-Mulgara bridge hacked it to work by selecting all variables across the bridge and projecting out the ones that weren't needed), but given the cost of this sorting, it was obvious we needed it.

The sorting in question came about because the query was a UNION between a number of product terms (or a disjunction of conjunctions). In order to make the UNION in order, each of the product terms was sorted first. Of course, without the sorting, a UNION can be a trivial operation, but with it the system was very slow. Actually, the query in question was more like a UNION between multiple products, with some of the product terms being UNIONS themselves. The resulting nested sorting was painful. Unfortunately, the way things stood, it was necessary, since there was no way to do a conjunction (product) without having the terms sorted, and since some of the terms could be UNIONS, then the result of a UNION had to be sorted.

The first thing I did was to factor the query out into a big UNION between terms (a sum-of-products). Then I manually executed each one to find out how long it took. After I added up all the times, the total was about 3 seconds, and most of that time was spent waiting for Lucene to respond (something I have no control over), so this was looking pretty good.

To make this work in a real query I had to make the factoring occur automatically, I had to remove the need to sort the output of a UNION, and I had to add a query syntax to TQL to turn this behavior on and off.

The syntax was already done for SPARQL, but PLoS were using TQL through Topaz. I know that a number of people use TQL, so I wasn't prepared to break the semantics of that language, which in turn meant that I couldn't introduce a DISTINCT keyword. After asking a couple of people, I eventually went with a new keyword of NONDISTINCT. I hate it, but it also seemed to be the best fit.

Next I did the factorization. Fortunately, Andrae had introduced a framework for modifying a query to a fixpoint, so I was able to add to that for my algebraic manipulation. I also looked at other expressions, like differences (which was only in TQL, but is about to become a part of SPARQL) and Optional joins (which were part of SPARQL, and came late into TQL). It turns out that there is a lot that you can do to expand a query to a sum-of-products (or as close to as possible), and fortunately it was easy to accomplish (thanks Andrae).

Finally, I put the code in to only do this factorization if a query was not supposed to be DISTINCT (the default in SPARQL, and if the new keyword is present for TQL). Unexpectedly, this ended up being the trickiest part. Part of the reason was because some UNION operations still needed to have the output sorted if they were embedded in an expression that couldn't be expanded out (a rare, though possible situation, but only when mixing with differences and optional joins).

I needed lots of tests to be sure that I'd done things correctly. I mean, this was a huge change to the query engine. If I'd got it wrong, it would be a serious issue. As a consequence, this code didn't get checked in and used in the timeframe that it ought to have. But finally, I felt it was correct, and I ran my 10 minute queries against the PLoS data.

Now the queries were running at a little over a minute. Well, this was an order of magnitude improvement, but still 30 times slower than I expected. What had happened? I checked where it was spending it's time, and it was still in a sort() method. Sigh. At a guess, I missed something in the code that allows sorting when needed, and avoids it the rest of the time.

Unfortunately, the time taken to get to that point had led to other things becoming important, and I didn't pursue the issue. Also, the only way to take advantage of this change was to update Topaz to use SELECT NONDISTINCT but that keyword was going to fail unless being run on a new Mulgara server. This meant that I couldn't update Topaz until I knew they'd moved to a newer Mulgara, and that didn't happen for a long time. Consequently, PLoS didn't see a performance change, and I ended up trying to improve other things for them rather than tracking it down. In retrospect, I confess that this was a huge mistake. PLoS recently reminded me of their speed issues with certain queries, but now they're looking at other solutions to it. Well, it's my fault that I didn't get it all going for them but that doesn't mean I should never do it, so I'm back at it again.

The problem queries only look really slow when executed against a large amount of data, so I had to get back to the PLoS dataset. The queries also meant running the Topaz setup, since they make use of the Topaz Lucene resolvers. So I updated Topaz and built the system.

Since I was going to work on Topaz, I figured I ought to add in the use of NONDISTINCT. This was trickier than I expected, since it looked like the Topaz code was not only trying to generate TQL code, it was also trying to re-parse it to do transformations on it. The parser in question was Antlr which is one that I've limited experience with, so I spent quite a bit of time trying to figure out what instances of SELECT could have a NONDISTINCT appended to it. In the end, I decided that all of the parsing was really for their own OQL language (which looks a lot like TQL). I hope I was right!

After spending way to long on Topaz, I took the latest updates from SVN, and compiled the Topaz version of Mulgara. Then I ran it to test where it was spending time in the query.

Unfortunately, I immediately started getting regular INFO messages of the form:
MulticastKeepaliveHeartbeatSender> Unexpected throwable in run thread. Continuing...null
java.lang.NullPointerException
at net.sf.ehcache.distribution.MulticastKeepaliveHeartbeatSender$MulticastServerThread.createCachePeersPayload(MulticastKeepaliveHeartbeatSender.java:180)
at net.sf.ehcache.distribution.MulticastKeepaliveHeartbeatSender$MulticastServerThread.run(MulticastKeepaliveHeartbeatSender.java:137)
Now Mulgara doesn't make use of ehcache at all. That's purely a Topaz thing, and my opinion to date has been that it's more trouble than it's worth. This is another example of it. I really don't know what could be going on here, but luckily I kept open the window where I updated the source from SVN, and I can see that someone has modified the class:
  org.topazproject.mulgara.resolver.CacheInvalidator
I can't guarantee that this is the problem, but I've never seen it before, and no other changes look related.

But by this point I'd reached the end of my day, so I decided I should come back to it in the morning (errr, maybe that will be after the SPARQL Working Group meeting).

Jetty


Despite that describing a good portion of my day (at least, those parts not spent in correspondence), I also got a few things done over the weekend. The first of these was a request for a new feature in the Mulgara Jetty configuration.

One of our users has been making heavy use of the REST API (yay! That time wasn't wasted after all!) and had found that Jetty was truncating their POST methods. It turns out that Jetty restricts this to 200,000 characters by default, and it wasn't enough for them. I do have to wonder what they're sticking in their queries, but OK. Or maybe they're POSTing RDF files to the server? That might explain it.

Jetty normally lets you define a lot of configuration with system parameters from the command line, or with an XML configuration file, and I was asked if I could allow either of those methods. Unfortunately, our embedded use of Jetty doesn't allow for either of these, but since I was shown exactly what was wanted I was able to track it down. A bit of 'grepping' for the system parameter showed me the class that gets affected. Then some Javadoc surfing took me to the appropriate interface (Context), and then I was able to go grepping through Mulgara's code. I found where we had access to these Contexts, and fortunately the Jetty configuration was located nearby. Up until this point Jetty's Contexts had not been configurable, but now they are. I only added in the field that had been requested, but everything is set up to add more with just two lines of code each - plus the XSD to describe the configuration in the configuration file.

jSPARQLc


My other weekend task was to add CONSTRUCT support to jSPARQLc. Sure, no one is using it yet, but Java needs so much boilerplate to make SPARQL work, that I figure it will be of use to someone eventually – possibly me. I'm also finding it to be a good learning experience for why JDBC is the wrong paradigm for SPARQL. I'm not too worried about that though, as the boilerplate stuff is all there, and it would be could easy to clean it up to something that doesn't try to conform to SPARQL. But for the moment it's trying to make SPARQL look like JDBC, and besides there's already another library that isn't trying to look like JDBC. I'd better stick to my niche.

I've decided that I'm definitely going to go with StAX to make forward-only result sets. However, I'm not sure if there is supposed to be a standard configuration for JDBC to set the desired form of the result set, so I haven't started on that yet.

The result of a CONSTRUCT is a graph. By default we can expect a RDF/XML document, though other formats are certainly possible. I'm not yet doing content negotiation with jSPARQLc, though that may need to be configurable, so I wanted to keep an open mind about what can be returned. That means that standard SELECT queries could return SPARQL Query Result XML or JSON, and CONSTRUCT queries could result in RDF/XML, N3, or RDF-JSON (Mulgara supports all but the last, but maybe I should add that one in. I've already left space for it).

Without content negotiation, I'm keeping to the XML formats for the moment, with the framework looking for the other formats (though it will report that the format is not handled). Initially I thought I might have to parse the top of the file, until I cursed myself for an idiot and looked up the content type in the header. Once the parameters have been removed, I could use the content type to do a "look up" for a parser constructor. I like this approach, since it means that any new content types I want to handle just become new entries in the look-up table.

This did leave me wondering if every SPARQL endpoint was going to fill in the Content-Type header, but I presume they will. I can always try a survey of servers once I get more features into the code.

Parsing an RDF/XML graph is a complex process that I had no desire to attempt (it could take all week to get it right - if not longer). Luckily, Jena has the ARP parser to do the job for me. However, the ARP parser is part of the main Jena jar, which seemed excessive to me. Fortunately, Jena's license is BSD, so it was possible to bring the ARP code in locally. I just had to update the packages to make sure it wouldn't conflict if anyone happens to have their own Jena in the classpath.

Funnily enough, while editing the ARP files (I'm doing this project "oldschool" with VIM). I discovered copyright notices for Plugged In Software. For anyone who doesn't know, Plugged In Software was the company that created the Tucana Knowledge Store (later to be open sourced as Kowari, then renamed to Mulgara). The company changed its name later on to match the software, but this code predated that. Looking at it, I seem to recall that the code in question was just a few bugfixes that Simon made. But it was still funny to see.

Once I had ARP installed, I could parse a graph, but into what? I'm not trying to run a triplestore here, just an API. So I reused an interface I came up with when I built my SPARQL test framework when I needed to read a graph. The interface isn't fully indexed, but it lets you do a lot of useful things if you want to navigate around a graph. For instance, it lets you ask for the list of properties on a subject, or to find the value(s) of a particular subject's property, or to construct a list from an RDF collection (usually an iterative operation). Thinking that I might also want to ask questions about particular objects (or even predicates) I've added in the other two indexes this time, but I'm in two minds about whether they really need to be there.

The original code for my graph interface was in Scala, and I was tempted to bring it in like this. But one of the purposes of this project was to be lightweight (unfortunately, I lost that advantage when I discovered that ARP needs Xerces), so I thought I should try to avoid the Scala JARs. Also, I thought that the exercise of bringing the Scala code into Java would refresh the API for me, as well as refresh me on Scala (which I haven't used for a couple of months). It did all of this, as well as having the effect of reminding me why Scala is so superior to Java.

Anyway, the project is getting some meat to it now, and it's been fun to work on in my evenings, and while I've been stuck indoors on my weekends. If anyone has any suggestions for it, then please feel free to let me know.

Reblog this post [with Zemanta]