Wednesday, August 22, 2012

Clojure DC

Tonight was the first night for Clojure DC, which is a meetup group for Clojure users. It's a bit of a hike for me to get up there, but I dread getting too isolated from other professionals, so I decided it was worth making the trip despite the distance and traffic. Luckily, I was not disappointed.

Although I was late (have you ever tried to cross the 14th Street Bridge at 6pm? Ugh) not much had happened beyond som pizza consumption. The organizers, Matt and Chris, had a great venue to work with, and did a good job of getting the ball rolling.

After introductions all around, Matt and Chris gave a description of what they're hoping to do with the group, prompted us with ideas for future meetings, and asked for feedback. They suggested perhaps doing some Clojure Koans, and in that spirit they provided new users with an introduction to writing Clojure code (not an intro to Clojure, but an intro to writing code), by embarking on a function to parse tnetstrings. I'd never heard of these, but they're a similar concept to JSON, only the encoding and parsing is even simpler.

This part of the presentation was fun, since Matt and Chris had a banter that was reminiscent of Daniel Friedman and William Byrd presenting miniKanren at Clojure/Conj last year. While writing the code they asked for feedback, and I was pleased to learn a few things from some of the more experienced developers who'd shown up (notably, Relevance employee Craig Andera, and ex-Clojure.core developer, and co-author of my favorite Clojure book, Michael Fogus). For instance, while I knew that maps operate as functions where they look up an argument in themselves, I did not know that they can optionally accept a "not-found" parameter like clojure.core/get does. I've always used "get" to handle this in the past, and it's nice to know I can skip it.

While watching what was going on, I decided that a regex would work nicely. So I ended up giving it a go myself. The organizers stopped after parsing a string and a number, but I ended up doing the lot, including maps and arrays. Interestingly, I decided I needed to return a tuple, and after I finished I perused the reference Python implementation and discovered that this returned the same tuple. Always nice to know when you're on the right track. :-)

Anyway, my attempt looked like:

(ns tnetstrings.core)

(def type-map {\, identity
               \# #(Integer/parseInt %)
               \^ #(Float/parseFloat %)
               \! #(Boolean/parseBoolean %)
               \~ (constantly nil)
               \} (fn [m] (loop [mp {} remainder m]
                            (if (empty? remainder)
                              mp
                              (let [[k r] (parse-t remainder)
                                    [v r] (parse-t r)]
                                (recur (assoc mp k v) r)))))
               \] (fn [m] (loop [array [] remainder m]
                            (if (empty? remainder)
                              array
                              (let [[a r] (parse-t remainder)]
                                (recur (conj array a) r)))))})

(defn parse-t [msg]
  (if-let [[header len] (re-find #"([0-9]+):" msg)]
    (let [head-length (count header)
          data-length (Integer/parseInt len)
          end (+ data-length head-length)
          parser (type-map (nth msg end) identity)]
      [(parser (.substring msg head-length end)) (.substring msg (inc end))])))

There are lots of long names in here, but I wasn't trying to play "golf". The main reason I liked this was because of the if-let I introduced. It isn't perfect, but if the data doesn't start out correctly, then the function just returns nil without blowing up.

While this worked, it was bothering me that both the array and the map forms looked so similar. I thought about this in the car on the way home, and I recalled the handy equivalence:

(= (assoc m k v) (conj m [k v]))

So with this in hand, I had another go when I got home:

(ns tnetstrings.core)

(defn embedded [s f]
  (fn [m] (loop [data s remainder m]
            (if (empty? remainder)
              data
              (let [[d r] (f remainder)]
                (recur (conj data d) r))))))

(def type-map {\, identity
               \# #(Integer/parseInt %)
               \^ #(Float/parseFloat %)
               \! #(Boolean/parseBoolean %)
               \~ (constantly nil)
               \} (embedded {} (fn [m] (let [[k r] (parse-t m)
                                             [v r] (parse-t r)]
                                         [[k v] r])))
               \] (embedded [] (fn [m] (let [[a r] (parse-t m)]
                                         [a r])))})

(defn parse-t [msg]
  (if-let [[header len] (re-find #"([0-9]+):" msg)]
    (let [head-length (count header)
          data-length (Integer/parseInt len)
          end (+ data-length head-length)
          parser (type-map (nth msg end) identity)]
      [(parser (.substring msg head-length end)) (.substring msg (inc end))])))

So now each of the embedded structures is based on a function returned from "embedded". This contains the general structure of:

  • Seeing if there is anything left to parse.
  • If not, then return the already parsed data.
  • If so, then parse it, and add the parsed data to the structure before repeating on the remaining string to be parsed.
In the case of the array, just one element is parsed by re-entering the main parsing function. The result is just the returned data. In the case of the map, the result is a key/value tuple, obtained by re-entering the parsing function twice. By wrapping the key/value like this we not only get to return it as a single "value", but it's also in the form required for the conj function that is used on the provided data structure (vector or map).

The result looks a little noisy (lots of brackets and parentheses), but I think it abstracts out the operations much better. Exercises like this are designed to help you think about problems the right way, so I think it was a great little exercise.

Other than this code, I also got the chance to chat with a few people, which was the whole point of the trip. It's getting late, so I won't go into those conversations now, but I was pleased to hear that many of them will be going to Clojure/Conj this year.

5 comments:

hugoestr said...

Thanks so much for the summary. I couldn't make it there yesterday, and it is good to at least get a sense of what happened.

Craig Andera said...

Nice. After a quick look, the only improvement that jumps out at me is the use of (last blah) instead of (nth blah end).

Hope to see you at the next one?

Paul Gearon said...

Thanks Craig.

Did you mean using (last msg) as a direct replacement, or did you mean I should move the call to .substring up into the let and call last on the substring? (The former wouldn't work, of course, as it would pick up the last character of the full string, rather than just the part being parsed. So I'm guessing you mean the latter).

Looking at it, you have a good point here. The call to .substring is already finding the end of the string, so by using nth I'm going through the string a second time.

This is why I like exercises like last night's. Interacting with other people to write code can both teach you new tricks and improve how you think about problems.

Craig Andera said...

Thinking about it some more, I'm not sure that last would be an improvement in any event. It would depend on how it's implemented - if it traverses the string as a seq, then it may be less efficient than nth, which could be implemented to randomly seek into the string.

But those sorts of micro-optimizations are pretty silly. I think last just reads better, if what you're trying to say is that you want the last character in the string.

Paul Gearon said...

Hi Craig,

I agree that micro-optimizations aren't usually helpful. But in a purely intellectual exercise like this one I think they can be helpful, as they can occasionally help you see some patterns of optimization while coding in the future.

However, I revisited this again, and I may have been too hasty in agreeing with you earlier. I thought that by doing the substring first then I could just say last on that string to get the key for the parser. However, if I did it that way, I'd also have to call substring again to truncate that final character before parsing (it wouldn't matter for integers, floats, booleans or nil, but strings would parse through to the final comma). Calling .substring twice just to get the final character doesn't seem like the right approach.

Finally, I pasted my code straight out of my repl where it was working fine. What this missed was that the type-map contains a reference to parse-t without it being defined yet. My first iteration at the repl didn't attempt to do maps nor arrays, so there was no forward reference, and my subsequent attempt saw that parse-t was already defined. So the code should have the following near the top:
(def parse-t)

I found this by writing some tests for the parser (naturally). Maybe I should write something about that in the blog too.