Thursday, September 01, 2011


After yesterday's post, I noticed that a number of referrals were coming to me via Twitter (using a URL). After looking into it, I realized that I'm linked to from the Planet Clojure page. I'm not sure why, though I guess it's because I'm working for Revelytix, and we're one of the larger Clojure shops around. Only my post wasn't about Clojure - it was about JavaScript. So now I feel obligated to write something about Clojure. Besides, I'm out of practice with my writing, so it would do me good.

So the project I spend most of my time on (for the moment) is called Rex. It's a RIF-based rules engine written entirely in Clojure. It does lots of things, but the basic workflow for running rules is:
  1. Parse the rules out of the rule file and into a Concrete Syntax Tree (CST). We can parse the RIF XML format, the RIF Presentation Syntax, and the RIF RDF format, and we plan on supporting others.
  2. Run a set of transformations on the CST to convert it into an appropriate Abstract Syntax Tree (AST). This can involve some tricky analysis, particularly for aggregates (which are a RIF extension).
  3. Transformation of the AST into SPARQL 1.1 query fragments.
  4. Execute the engine, by processing the rules to generate more data until the capacity to generate new data has been exhausted. (My... that's a lot of hand waving).
It's step 3 that I was interested in today.

The rest of this post is about how Rex processes a CST into an AST using Clojure, and about some subsequent refactoring that went on. You have been warned...

When I first wrote the CST to AST transformation step, it was to do a reasonably straightforward analysis of the CST. Most importantly, I needed to see the structure of the rule so that I could tell what kind of data it depends on, thereby figuring out which other rules might need to be run once a given rule was executed. Since the AST is a tree structure, this made for relatively straightforward recursive functions.

Next, I had to start identifying some CST structures that needed to be changed in the AST. This is where it got more interesting. Again, I had to write recursive functions, but instead of simply analyzing the data, they had to change it. It turns out that this is handled easily by having a different function for each type of node in the tree. In the normal case the function recurses on all of its children, and constructs an identical node type using the new children. The leaf nodes just return themselves. The "different function" for each type is actually accessed with the same name, but dispatches on the node type. In Java that would need a visitor pattern, or perhaps a map of types to functors, but in Clojure it's handled trivially with multimethods or protocols. Unfortunately, the online resources describing the multi-dispatch aspects of Clojure are not clear, but Luke VanderHart and Stuart Sierra's book Practical Clojure covers it nicely.

As an abstract example of what I mean, say I have an AST consisting of Conjunctions, Disjunctions and leaf nodes. Both Conjunctions and Disjunctions have a single field that contains a seq of the child nodes. These are declared with:
(defrecord Conjunction [children])

(defrecord Disjunction [children])
The transformation function can be called tx, and I'll define it with multiple dispatch on the node type using multimethods:
(defmulti tx class)

(defmethod tx Disjunction [{c :children}]
  (Disjunction. (map tx c)))

(defmethod tx Conjunction [{c :children}]
  (Conjunction. (map tx c)))

(defmethod tx :default [n] n)
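As a quick sanity check, here is that example as a self-contained sketch (repeating the definitions so the snippet stands alone), showing that tx rebuilds a tree that is equal to, but not identical with, the original:

```clojure
;; The records and the tx multimethod, as above.
(defrecord Conjunction [children])
(defrecord Disjunction [children])

(defmulti tx class)

(defmethod tx Disjunction [{c :children}]
  (Disjunction. (map tx c)))

(defmethod tx Conjunction [{c :children}]
  (Conjunction. (map tx c)))

(defmethod tx :default [n] n)

;; Leaf nodes (keywords here) fall through to :default unchanged;
;; branch nodes are rebuilt.
(def tree (Conjunction. [(Disjunction. [:a :b]) :c]))

(= tree (tx tree))          ; => true  (records compare by value)
(identical? tree (tx tree)) ; => false (the branch nodes are all new)
```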
This will reconstruct an identical tree to the original, though with all new nodes except the leaves. Now, the duplication in the Disjunction and Conjunction methods should be ringing alarm bells, but in real code the functions have more specific jobs to do. For instance, the Conjunction may want to group terms that meet a certain condition (call the test "special-type?") into a new node type (call it "Foo"):
;; A different definition for tx on Conjunction

(defmethod tx Conjunction [{c :children}]
  (let [new-children (map tx c)
        special-nodes (filter special-type? new-children)
        other-nodes (filter (comp not special-type?) new-children)]
    (Conjunction. (conj other-nodes (Foo. special-nodes)))))
Hmmm... while writing that example I realized that I regularly run into the pattern of filtering out everything that meets a test, and everything that fails the test. Other than having to test everything twice, it seems too verbose. What I need is a function that will take a seq and a predicate and return a tuple containing a seq of everything that matches the predicate, and a second seq of everything that fails the predicate. I'm not seeing anything like that right now, so that may be a job for the morning.
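One way to sketch such a helper is with juxt, which applies several functions to the same arguments and collects the results in a vector. (The name split-by is my own invention here, not something in clojure.core.)

```clojure
;; Hypothetical helper: returns [matches non-matches] for one predicate.
;; Note it still walks the seq twice; it just hides the duplication.
(defn split-by [pred coll]
  ((juxt filter remove) pred coll))

(split-by odd? [1 2 3 4 5]) ; => [(1 3 5) (2 4)]
```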

I should note that there is no need for a function to return the same type that came into it. There are several occasions where Rex returns a different type. For example, a conjunction between a Basic Graph Pattern (BGP) and the negation of another BGP becomes a MINUS operation between the two BGPs. (A Basic Graph Pattern comes from SPARQL, and is just a triple pattern for matching against the subject/predicate/object of a triple in an RDF store.)

Overall, this approach works very well for transforming the CST into the full AST. As I've needed to incorporate more features and optimizations over time, I found that I had two choices. Either I could expand the complexity of the operation for every type in the tree processing code, or I could perform different types of analysis on the entire tree, one after another. The latter makes the process far easier to understand, making the design more robust and debugging easier, so that's how Rex has been written. It makes analysis slightly slower, but analysis is orders of magnitude faster than actually running the rules, so that is not a consideration.


The first time through, analysis was simple:
(defn analyse [rules]
  (process rules))
Adding a new processing step was similarly easy:
(defn analyse [rules]
  (process2 (process1 rules)))
But once a third, then a fourth step appeared, it became obvious that I needed to use the Clojure threading macro:
(defn analyse [rules]
  (-> rules
      process1
      process2
      process3
      process4))
So now it's starting to look nice. Each step in the analysis process is a single function name implemented for various types. These names are then provided in a list of things to be applied to the rules, via the threading macro. There's a little more complexity (one of the steps picks up references to parts of the tree, and since each stage changes the tree, then these references will be pointing to an old and unused version of the tree. So that step has to be last), but it paints the general picture.
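As a sketch of what the threading macro is actually doing (on a hypothetical three-step pipeline), macroexpand-1 shows that it simply nests the calls, innermost first:

```clojure
;; -> rewrites a flat list of steps into nested calls:
(macroexpand-1 '(-> rules process1 process2 process3))
;; => (process3 (process2 (process1 rules)))
```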


Another thing that I've glossed over is that each rule is actually a record that contains the AST as one of its members. The rules themselves are a seq which is in turn a member of a record containing a "Rule Program". So each process step actually ended up being like this:
(defn process1 [rule-program]
  (letfn [(process-rule [rule] (assoc rule :body (tx (:body rule))))]
    (assoc rule-program :rules (map process-rule (:rules rule-program)))))
Did you follow that?

The bottom line is replacing the :rules field with a new one. It maps process-rule onto the seq of rules, and stores the result in a new rule program, which is what gets returned. The process-rule function is defined locally: it associates the :body of a rule with the tx function applied to the existing body, creating a new rule that has had tx applied to it.
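If the assoc calls are the sticking point, here's a tiny sketch with a plain map (records behave the same way). The key thing is that assoc never mutates anything; it returns a new structure:

```clojure
;; A stand-in for a rule program, as a plain map.
(def prog {:name "demo" :rules [:r1 :r2]})

(assoc prog :rules [:x :y]) ; => {:name "demo", :rules [:x :y]}
prog                        ; => {:name "demo", :rules [:r1 :r2]} (unchanged)
```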

This all looked fine to start with. A new rule program is created by transforming all the rules in the old program. A transformed rule is created by transforming the AST (the :body) in the old rule. But after the third analysis it became obvious that there was duplication going on. In fact, it was all being duplicated except for the tx step. But that was buried deep in the function. What was the best way to pull it out?

To start with, the embedded process-rule function came out. After all, it was only defined inside to keep it hidden, not because it needed to close over anything. This function then accepts the kind of transformation that it needs to do as a parameter:
(defn convert-rule
  [convert-fn rule]
  (assoc rule :body (convert-fn (:body rule))))
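To see convert-rule in isolation, here it is applied to a hypothetical rule (a plain map standing in for the real rule record, with a made-up body string), using clojure.string/upper-case as a stand-in conversion function:

```clojure
(require '[clojure.string :as str])

(defn convert-rule
  [convert-fn rule]
  (assoc rule :body (convert-fn (:body rule))))

;; A new rule comes back with the converted body; the rest is untouched.
(convert-rule str/upper-case {:name :r1 :body "p(x) :- q(x)"})
;; => {:name :r1, :body "P(X) :- Q(X)"}
```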
Next, we want a general function for converting all the rules, which can accept a conversion function to pass on to convert-rule. It does all the rules, so I just pluralized the name:
(defn convert-rules
  [conversion-fn rule-prog]
  (assoc rule-prog :rules (map #(convert-rule conversion-fn %) (:rules rule-prog))))
That works, but now the function being mapped is looking messy (and messy leads to mistakes). I could improve it by defining yet another function, but I only just finished factoring one out. Fortunately, there is a simpler way to define this new function: it's a "partial" application of convert-rule. I'll still move it into a let block for clarity:
(defn convert-rules
  [conversion-fn rule-prog]
  (let [conv-rule (partial convert-rule conversion-fn)]
    (assoc rule-prog :rules (map conv-rule (:rules rule-prog)))))
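For anyone who hasn't met partial before, it fixes the leading arguments of a function and returns a new function that waits for the rest:

```clojure
;; (partial f a) returns a function g where (g b) is (f a b).
(def add5 (partial + 5))

(add5 3)                    ; => 8
(map (partial * 2) [1 2 3]) ; => (2 4 6)
```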
So now my original process1 definition becomes a simple:
(defn process1 [rule-program]
  (convert-rules tx rule-program))
That works, but the rule-program parameter is just sticking out like a sore thumb. Fortunately, we've already seen how to fix this:
(def process1 (partial convert-rules tx))
Indeed, all of the processing functions can be written this way:
(def process1 (partial convert-rules tx))

(def process2 (partial convert-rules tx2))
(def process3 (partial convert-rules tx3))
(def process4 (partial convert-rules tx4))
It may seem strange that a function is now being defined with a def instead of a defn, but it's really not an issue. It's worth remembering that defn is just a macro that uses def to bind a name to an (fn ...) form.
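You can see this for yourself with macroexpand-1. The exact expansion varies a little between Clojure versions, but it is always a def of an fn:

```clojure
(macroexpand-1 '(defn greet [n] (str "hi " n)))
;; => roughly (def greet (clojure.core/fn ([n] (str "hi " n))))
```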


Of course, my functions don't appear as sterile as what I've been typing here. I do indeed use documentation. That means that the process1 function would look more like:
(defn process1
  "The documentation for process1"
  [rule-program]
  (convert-rules tx rule-program))
One of the nice features of the defn macro is the ease of writing documentation. This isn't as trivial with def, since def is a special form rather than a macro, but it's still not too hard to do. You just need to attach some metadata to the symbol, with a key of :doc. Unfortunately, I couldn't remember the exact syntax for this today, and rather than go trawling through books or existing code, Alex was kind enough to remind me:
(def ^{:doc "The documentation for process1"}
  process1 (partial convert-rules tx))
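And the documentation can be read back off the var's metadata. A quick sketch with a hypothetical var:

```clojure
;; The ^{:doc ...} metadata on the symbol ends up on the var.
(def ^{:doc "Doubles its argument"} twice (partial * 2))

(:doc (meta #'twice)) ; => "Doubles its argument"
(twice 21)            ; => 42
```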


The upshot of all of this is a simpler framework for adding new steps to the existing analysis system. Adding a new analysis step just needs a new function thrown into the thread. I could put a call to (partial convert-rules ...) directly into the thread, but by using a def I get to name and document that step of the analysis. The only real work of the analysis is then done in the single multi-dispatch function, which is just as it should be.

So right now my evening "hobby" has been JavaScript, while my day job is Clojure. I have to tell you, the day job is much more fun.
