Working notes: 09/01/2011

Sunday, September 04, 2011

SPARQL JSON

After commenting the other day that Fuseki was ignoring my request for a JSON response, I was asked to submit a bug report. It's easy to cast unsubstantiated stones in a blog, but a bug report is a different story, so I went back to confirm everything. In the process, I looked at the SPARQL 1.1 Query Results JSON Format spec (I can't find a public copy of it, so no link, sorry. UPDATE: Found it.) and was chagrined to discover that it has its own MIME type of "application/sparql-results+json". I had been using the JSON type of "application/json", and this indeed does not work, but the corrected type does. I don't think it's a good idea to ignore "application/json" so I'll report that, but strictly speaking it's correct (at least, I think so. As Andy said to me in a back channel, I don't really know what the + is supposed to mean in the subtype). So Fuseki got it right. Sorry.

When I finally get around to implementing this for Mulgara I'll try to handle both. Which reminds me... I'd better get some more SPARQL 1.1 implemented. My day job at Revelytix needs me to do some Mulgara work for a short while, so that may help get the ball rolling for me.

separate

I commented the other day that I wanted a Clojure function that takes a predicate and a seq, and returns two seqs: one that matches the predicate and the other that doesn't match. I was thinking I'd build it with a loop construct, to avoid recursing through the seq twice.

The next day, Gary (at work) suggested the separate function from Clojure Contrib. I know there's some good stuff in contrib, but unfortunately I've never taken the time to fully audit it.

The implementation of this function is obvious:

(defn separate [f s]
  [(filter f s) (filter (complement f) s)])

I was disappointed to learn that this function iterates twice, but Gary pointed out that there had been a discussion on exactly this point, and the counter argument is that this is the only way to build the results lazily. That's a reasonable point, and it is usually one of the top considerations for Clojure implementations. I don't actually have cause for complaint anyway, since the seqs I'm using are always small (being built out of query structures).

This code was also a good reminder that (complement x) offers a concise replacement for the alternative code:

  #(not x %)

By extension, it's a reminder to brush up on my idiomatic Clojure. I should finish the book The Joy of Clojure (which is a thoroughly enjoyable read, by the way).

Thursday, September 01, 2011

ASTs

After yesterday's post, I noticed that a number of references came to me via Twitter (using a t.co URL). After looking for it, I realized that I'm linked to from the Planet Clojure page. I'm not sure why, though I guess it's because I'm working for Revelytix, and we're one of the larger Clojure shops around. Only my post wasn't about Clojure - it was about JavaScript. So now I feel obligated to write something about Clojure. Besides, I'm out of practice with my writing, so it would do me good.

So the project I spend most of my time on (for the moment) is called Rex. It's a RIF based rules engine written entirely in Clojure. It does lots of things, but the basic workflow for running rules is:

Parse the rules out of the rule file and into a Concrete Syntax Tree. We can parse the RIF XML format, the RIF Presentation Syntax, and the RIF RDF format, and we plan on doing others.
Run a set of transformations on the CST to convert it into an appropriate Abstract Syntax Tree (AST). This can involve some tricky analysis, particularly for aggregates (which are a RIF extension).
Transformation of the AST into SPARQL 1.1 query fragments.
Execute the engine, by processing the rules to generate more data until the capacity to generate new data has been exhausted. (My... that's a lot of hand waving).

It's step 3 that I was interested in today.

The rest of this post is about how Rex processes a CST into an AST using Clojure, and about some subsequent refactoring that went on. You have been warned...

When I first wrote the CST to AST transformation step, it was to do a reasonably straight forward analysis of the CST. Most importantly, I needed to see the structure of the rule so that I could see what kind of data it depends on, thereby figuring out which other rules might need to be run once a given rule was executed. Since the AST is a tree structure, this made for relatively straight forward recursive functions.

Next, I had to start identifying some CST structures that needed to be changed in the AST. This is where it got more interesting. Again, I had to write recursive functions, but instead of simply analyzing the data, it had to be changed. It turns out that this is handled easily by having a different function for each type of node in the tree. In the normal case the function then recurses on all of its children, and constructs an identical node type using the new children. The leaf nodes then just return themselves. The "different function" for each type is actually accessed with the same name, but dispatches on the function type. In Java that would need a visitor pattern, or perhaps a map of types to functors, but in Clojure it's handled trivially with multimethods or protocols. Unfortunately, the online resources for describing the multi-dispatch aspects of Clojure protocols are not clear, but Luke VanderHart and Stuart Sierra's book Practical Clojure covers it nicely.

As an abstract example of what I mean, say I have an AST consisting of Conjunctions, Disjunctions and leaf nodes. Both Conjunctions and Disjunctions have a single field that contains a seq of the child nodes. These are declared with:

(defrecord Conjunction [children])

(defrecord Disjunction [children])

The transformation function can be called tx, and I'll define it with multiple dispatch on the node type using multimethods:

(defmulti tx  class)



(defmethod tx Disjunction [{c :children}]

    (Disjunction. (map tx c)))



(defmethod tx Conjunction [{c :children}]

    (Conjunction. (map tx c)))



(defmethod tx :default [n] n)

This will reconstruct an identical tree to the original, though with all new nodes except the leaves. Now, the duplication in the Disjunction and Conjunction methods should be ringing alarm bells, but in real code the functions have more specific jobs to do. For instance, the Conjunction may want to group terms that meet a certain condition (call the test "special-type?") into a new node type (call it "Foo"):

;; A different definition for tx on Conjunction

(defmethod tx Conjunction [{c :children}]

  (let [new-children (map tx c)

        special-nodes (filter special-type? new-children)

        other-nodes (filter (comp not special-type?) new-children)]

    (Conjunction. (conj other-nodes (Foo. special-nodes)))))

Hmmm... while writing that example I realized that I regularly run into the pattern of filtering out everything that meets a test, and everything that fails the test. Other than having to test everything twice, it seems too verbose. What I need is a function that will take a seq and a predicate and return a tuple containing a seq of everything that matches the predicate, and a second seq of everything that fails the predicate. I'm not seeing anything like that right now, so that may be a job for the morning.

I should note, that there is no need for a function to return the same type that came into it. There are several occasions where Rex returns a different type. For example, a conjunction between a Basic Graph Pattern (BGP) and the negation of another BGP becomes a MINUS operation between the two BGPs (a Basic Graph Pattern comes from SPARQL and is just a triple pattern for matching against the subject/predicate/object of a triple in an RDF store).

Overall, this approach works very well for transforming the CST into the full AST. As I've needed to incorporate more features and optimizations over time, I found that I had two choices. Either I could expand the complexity of the operation for every type in the tree processing code, or I could perform different types of analysis on the entire tree, one after another. The latter makes the process far easier to understand, making the design more robust and debugging easier, so that's how Rex has been written. It makes analysis slightly slower, but analysis is orders of magnitude faster than actually running the rules, so that is not a consideration.

Threading

The first time through, analysis was simple:

(defn analyse [rules]

  (process rules))

Adding a new processing step was similarly easy:

(defn analyse [rules]

  (process2 (process1 rules)))

But once a third, then a fourth step appeared, it became obvious that I needed to use the Clojure threading macro:

(defn analyse [rules]

  (-> rules

      process1

      process2

      process3

      process4))

So now it's starting to look nice. Each step in the analysis process is a single function name implemented for various types. These names are then provided in a list of things to be applied to the rules, via the threading macro. There's a little more complexity (one of the steps picks up references to parts of the tree, and since each stage changes the tree, then these references will be pointing to an old and unused version of the tree. So that step has to be last), but it paints the general picture.

Partial

Another thing that I've glossed over is that each rule is actually a record that contains the AST as one of its members. The rules themselves are a seq which is in turn a member of a record containing a "Rule Program". So each process step actually ended up being like this:

(defn process1 [rule-program]

  (letfn [(process-rule [rule] (assoc rule :body (tx (:body rule))))]

    (assoc rule-program :rules (map process-rule (:rules rule-prog)))))

Did you follow that?

The bottom line is replacing the :rules field with a new one. It's mapping process-rule onto the seq of rules, and storing the result in a new rules-program, which is what gets returned. The process-rule function is defined locally as associating the :body of a rule with the tx function applied to the existing body. This creates a new rule that has has the tx applied to it.

This all looked fine to start with. A new rule program is created by transforming all the rules in the old program. A transformed rule is created by transforming the AST (the :body) in the old rule. But after the third analysis it became obvious that there was duplication going on. In fact, it was all being duplicated except for the tx step. But that was buried deep in the function. What was the best way to pull it out?

To start with, the embedded process-rule function came out. After all, it was just inside to hide it, and not because it had to pick up a closure anywhere. This function then accepts the kind of transformation that it needs to do as a parameter:

(defn convert-rule

  [convert-fn rule]

  (assoc rule :body (convert-fn (:body rule))))

Next, we want a general function for converting all the rules, which can accept a conversion function to pass on to convert-rule. It does all the rules, so I just pluralized the name:

(defn convert-rules

  [conversion-fn rule-prog]

  (assoc rule-prog :rules (map #(convert-rule conversion-fn %) (:rules rule-prog)))))

That works, but now the function getting mapped is looking messy (and messy leads to mistakes). I could improve it by defining a new function, but I just factored a function out of this function. Fortunately, there is a simpler way to define this new function. It's a "partial" application of convert-rule. But I'll still move it into a let block for clarity:

(defn convert-rules

  [conversion-fn rule-prog]

  (let [conv-rule (partial convert-rule conversion-fn)]

    (assoc rule-prog :rules (map conv-rule (:rules rule-prog)))))

So now my original process1 definition becomes a simple:

(defn process1 [rule-program]

  (convert-rules tx rule-program))

That works, but the rule-program parameter is just sticking out like a sore thumb. Fortunately, we've already seen how to fix this:

(def process1 (partial convert-rules tx))

Indeed, all of the processing functions can be written this way:

(def process1 (partial convert-rules tx))

(def process2 (partial convert-rules tx2))

(def process3 (partial convert-rules tx3))

(def process4 (partial convert-rules tx4))

It may seem strange that a function is now being defined with a def instead of a defn, but it's really not an issue. It's worth remembering that defn is just a macro that uses def to attach a symbol to a call to (fn ...).

Documenting

Of course, my functions don't appear as sterile as what I've been typing here. I do indeed use documentation. That means that the process1 function would look more like:

(defn process1 [rule-program]

  "The documentation for process1"

  (convert-rules tx rule-program))

One of the nice features of the defn macro is the ease of writing documentation. This isn't as trivial with def since it's a special form, rather than a macro, but it's still not too hard to do. You just need to attach some metadata to the object, with a key of :doc. Unfortunately, I couldn't remember the exact syntax for this today, and rather then go trawling through books or existing code, Alex was kind enough to remind me:

(def ^{:doc "The documentation for process1"}

  process1 (partial convert-rules tx))

Framework

The upshot of all of this is a simpler framework for adding new steps to the existing analysis system. Adding a new analysis step just needs a new function thrown into the thread. I could put a call to (partial convert-rules ...) directly into the thread, but by using a def I get to name and document that step of the analysis. the only real work of the analysis is then done in the single multi-dispatch function, which is just as it should be.

So right now my evening "hobby" has been JavaScript, while my day job is Clojure. I have to tell you, the day job is much more fun.

Robusta

I was just posting this into Google+ when I realized I was typing more than I'd intended. It read more like a short blog post. Which reminded me.... "Oh yeah. I have a blog out there somewhere! Maybe I should write this in there."

After spending a lot of recent nights on JavaScript I think I'm starting to get a feel for it. I started with Douglas Crockford's "JavaScript: The Good Parts", and am now plowing through David Flanagan's "JavaScript: The Definitive Guide". (I met David when he came to Brisbane to run a short course on Java programming back in 1996. Nice guy. He said that he'd preferred to have called his book "Java in a Demitasse", but O'Reilly wanted it in their "Nutshell" series).

After using languages like Ruby, Erlang, Scala and Clojure, I'm finding it a little frustrating, but it's hard to argue with the ubiquity of the platform. Fortunately it has closures and first class functions, though the variable scoping is bizarre. I've been enjoying the callback approach to asynchronous function calls, though the syntax tends to make the resulting code confusing to read. I'm mostly sticking to Crockford's subset of the language, and this does make things a little more sensible. Flanagan's book has been filling in the gaps for me, but it's especially useful for documenting libraries like HTML5 Canvas and the File API.

As with all first attempts at using a language, mine is a little messy and inconsistent. However, it can't hurt to put it out there. I've built a simple tool for working with SPARQL endpoints (specifically aimed at Jena/Fuseki, but it should mostly work on others too). The important piece is the SPARQL connection object that it comes with (found in sparql.js). I'm hoping that this will be a useful object for more general application. It can even convert XML responses into a SPARQL-JSON structure (I wrote this after discovering Fuseki was ignoring my Content-type settings on queries).

Unfortunately, it's missing one important piece, which is the ability to upload a file from a browser. In general, it's possible to upload a file using a form submission, but that encodes all the parameters into the request body, and the SPARQL HTTP protocol requires that the graph URI appear as a parameter in the URL of the request. In an attempt to get the graph parameter out of the body and into the URL, I even tried dynamically constructing the URL for the form submission, but the browser "cleverly" saw what I was doing and pushed the parameter back into the body. So I can't use form submission. Alternatively, JavaScript makes it easy to submit an HTTP POST operation with everything set the way you want it. However, the only way to read a local file is through the form submission process, which means I still can't do a file upload. In the end, I just used Fuseki's file upload servlet, but this has the problem of being non-standard, and it also doesn't like URIs that aren't http (yes Andy, that's why I asked you about this restriction - though I'd already run into it at work).

The resulting system needed a name, so I called it Robusta. Everyone seems to be enamored with Arabica beans, but no one ever talks about Robusta beans. Don't get me wrong.... if I had to choose between the two I'd definitely go for the Arabica. But by blending in a small portion of Robusta beans you add a richness to the flavor of your coffee (it's also used as a cheap "filler" and promotes crema in espresso, but I like the flavor aspect). At the time, I came up with the name because I really needed some caffeine, and Arabica was too obvious. But in retrospect, I like the name, since adding a bit of SPARQL to your scripts can really enhance a system (OK, that's tacky. I probably need another one of those coffees). It wasn't after I'd had some coffee that I thought to look for other projects with the same name, but by that point it was already up there.

Robusta is still a work in progress, and it's mostly a late night project that fits around everything else that I'm doing. But I'm using it at work, and it makes my life easier. I'd like to know if anyone has ideas for it, or can point out errors, inefficiencies, or potential improvements. It's posted at GitHub as a part of the Revelytix project group, at:
http://github.com/revelytix/robusta

Working notes