Thursday, October 21, 2010

Clojure Experiences

I've been learning Clojure recently, as my new job uses it frequently. I'm happy with this (it has me attending (first clojure-conj) right now), but I'm still learning a lot about the language. I'm comfortable with functional programming, and of course, there's next to nothing to learn about syntax and grammar, but there is a steep learning curve to effectively use the libraries. Not only are there a lot of them, but they are not really well documented.

OO languages tell you which functions belong to a class, but since Clojure functions don't really belong to anything it can sometimes be hard to find everything that is relevant to the structures you are dealing with. I've had a few people tell me that there are good websites out there that provide the sort of documentation I'd like, so I'll go looking for them soon.

Meanwhile, I find myself looking at the libraries all the time. Other than the general principle of code reuse, the dominant paradigm in Clojure is to use the core libraries whenever possible. This is because those libraries are likely to be much more efficient than anything you can come up with on your own. They're also "lazy", meaning that the functions do very little work themselves, with all of the work being put off until you actually need it. Laziness is a good thing, and should show up in Clojure code as much as possible. Using the core libraries is the best way to accomplish that.

Clojure Maps

Part of my current task is to take a data structure that represents an RDF graph, and process it. The structure that I have now maps resource identifiers (sometimes URIs, but mostly internal blank node IDs) to structures representing all of the property/values associated with that resource. Unfortunately, the values are simple identifiers too, so to find their data I have to go back to the structure and get the resources for that object. This has to be repeated until everything found has no properties or is a literal (which has no properties). Also, the anonymous resources in RDF lists are showing up as data structures, which makes them difficult to work with as well.

Following this kind of structure is easy enough in an imperative language, though quite tedious. However, in Clojure it is simply too inelegant to work with. So I want some way to make this to a nice nested structure that represents the nested data from the RDF graph (fortunately, this kind of graph is guaranteed to not have any cycles). In particular, I'd this new structure to be lazy wherever I can get it.

The basic approach is simple: get the property/values for a given resource. Then convert the values (resources) that aren't literals to the actual data that they represent. This is done by getting the property/values of those resources, which is what I started with, so now I know I need recursion.

Now there are a few caveats. First off, if a resource is actually the head of a list, I don't want to just get it's properties. Instead, I want to traverse the list (lazily, of course) and get all the values attached to the nodes. That's easy enough, and is done with a function to identify resources that are lists, and another one to construct the list (a recursive function that uses lazy-seq). Second, there is one property/value that I want to preserve, and that is the :id property. This property shows the URI or identifier for the original node, so the value here needs to be kept, and not expanded on (it would recurse indefinitely otherwise).

After identifying these requirements, I figured I needed some way to convert all the key/value elements of a map into a new map. This sounds very much like converting all the elements in a sequence (a "seq") into a new sequence, a task which is done using the map function. Is there a similar operation to be applied to a map? (The noun and the verb get confusing here, sorry).

It turns out that Alex Miller had been dealing with exactly this problem just recently, and has built a new function called mapmap to do this operation. However, I wasn't aware of this at the time, so I had to pursue my own solution. I'm reasonably happy with my answer, and I learnt a bit in the process, so I'm glad I solved it myself, though it would have been nice to get that sleep back.

Morphing a Map

There have been a few times when I've needed to create a new map, and the first time I needed to I went to clojure.core to see what kind of map-creating functions were available to me. At that point I found hash-map, which takes a sequence containing an even number of items, and converts it into a map using the even-numbered offsets as the keys and the odd-numbered offsets as the values (with the first offset being "0"). So I could create a map of numbers to their labels using the following:
  (hash-map 1 "one" 2 "two" 3 "three")
So was there a way to create this list from my existing map? (Well, yes, but was it easy?)

I wanted to iterate over my existing map "entries" and create new entries. If you treat a map as a seq, then it appears as a sequence of 2-element vectors, with the first element being the key and the second element being the value. So if I want to morph the values in a map called "mymap" with a function called "morph", I could convert these pairs into new pairs with:
  (map (fn [[k v]] [k (morph v)]) mymap)
For anyone unfamiliar with the syntax, my little anonymous function in the middle there is taking as it's single parameter a 2 element vector, and destructuring it into the values k and v, then returning a new two element vector of k and (morph v). So let's apply this to a map of numbers to their names (incidentally, I don't code anything like this, but I do blog this way)....
  (def mymap {1 "one", 2 "two", 3 "three"})
(def morph #(.toUpperCase %))
(map (fn [[k v]] [k (morph v)]) mymap)
The result is:
([1 "ONE"] [2 "TWO"] [3 "THREE"])
Looks good, but there is one problem. This result is a sequence of pairs, while hash-map just takes a sequence. That's OK, we can just call flatten before passing it in:
  (apply hash-map (flatten (map (fn [[k v]] [k (morph v)]) mymap)))
{1 "ONE", 2 "TWO", 3 "THREE"}
The "apply" was needed because hash-map needed lots of arguments (6 in this case) while the result of flatten would have been just 1 argument (a list).

This is looking good. It's what I've done in the past, and it's hardly worthy of a blog post. But this time I had a new twist. My "values" were also structures, and often these structures were seqs as well. For instance, say I have a map of numbers to names in a couple of languages:
(def langmap {1 ["one" "uno"], 2 ["two" "due"], 3 ["three" "tre"]})
My morph method needs to change to accept a seq of strings instead of just a string, but that's trivial. So then the map step looks good, giving an output of:
([1 ("ONE" "UNO")] [2 ("TWO" "DUE")] [3 ("THREE" "TRE")])
But this fails to be loaded into a hash-map, with the error:
java.lang.IllegalArgumentException: No value supplied for key: TRE (NO_SOURCE_FILE:0)
This is exactly what bit me at 1am.

The problem is simple. flatten is recursive. It dives into the sub-sequences, and brings them up too. The result is trying to map 1 to "ONE", mapping "UNO" to 2, and so on, until it gets to "TRE" at which point it doesn't has a value to map it to, so it gives an error. Fortunately for me I had an odd number of items when everything was flattened, or else I would have been working with a corrupted map and may not have discovered it for some time.

My first thought was to write a version of flatten that isn't recursive, but that's when I took a step back and asked myself if there was another way to construct a map. Maybe I shouldn't be using hash-map. So I looked at Clojure's cheatsheet, and discovered zipmap.


zipmap wasn't quite what I wanted, in that instead of taking a sequence of pairs, it wants a pair of sequences. It then constructs a map where the nth key/value comes from the nth entry in the first sequence (the key) and the nth entry from the second sequence (the value). So I just needed to rotate, or transpose my data. Fortunately, the day before one of my co-workers had asked me exactly the same question.

In this case, he had a sequence of MxN entries, looking something like:
[[a1 a2 a3] [b1 b2 b3] [c1 c2 c3]]
And the result that he wanted was:
[[a1 b1 c1] [a2 b2 c2] [a3 b3 c3]]
My first thought was some wildly inefficient loop-comprehension, when he asked me if the following would work (for some arg):
  (apply map list arg)
Obviously he didn't have a repl going, so I plugged it in, and sure enough it worked. But why?

I struggled to understand it until I looked up the documentation for map again, and remembered that it can take multiple sequences, not just one. I had totally forgotten this. When provided with n sequences, map requires a function that accepts n arguments. It then goes through each of the sequences in parallel, with the output being a sequence where the nth entry is the result of calling the function with the nth entry of every source sequence. In this case, the function was just list, meaning that it creates a list out of the items from each sequence. The apply just expanded the single argument out into the required lists that map needed. This one line is a lovely piece of elegance, and deserves its own name:
  (defn rotate [l] (apply map list l))

So the final composition for creating a new map with values modified from the original map is:
  (apply zipmap (rotate (map (fn [[k v]] [k (morph v)]) langmap)))
{3 ("THREE" "TRE"), 2 ("TWO" "DUE"), 1 ("ONE" "UNO")}

Now the actual function that I use inside map isn't nearly so simple. After all, I need to recursively traverse the RDF graph, plus I need to find RDF lists and turn them into sequences, etc. But this zipmap/rotate/map idiom certainly makes it easy.

Incidentally, I could do exactly the same thing with Alex's mapmap:
  (mapmap key (comp morph val) langmap)
Oh well. I'm still happy I discovered zipmap/rotate.