Saturday, March 15, 2008

Writing

I've been trying to sit down and write for over a week, but each time I try I end up writing code instead. I've even fallen behind reading Slashdot. I've been getting a lot of messages from people wanting to know what happened last week, what our plans are for Mulgara, etc, but I just haven't been able to respond. That's what happens when a developer tries to work in the real world. I handle the real world, and I can handle code, but not at the same time. :-(

For the moment, I have priorities with work that I have to see to, so I'll be concentrating on technical things for a while. However, there are a few things happening with Mulgara, so I'll try to mention them as I go. In the meantime, I'm working on SPARQL queries.

SPARQL

The two main features that we're missing now are OPTIONAL and FILTER. Looking at OPTIONAL some time ago I realized that it's a hybrid between ConstraintConjunction (the inner join aspect), and ConstraintDisjunction (matches on the left side leaving unbound columns). I worked on something similar when I did ConstraintDifference a few years ago, so I know that this is easy. Hence, I put this part off until last.

In the last week or so (in between the meeting in San Francisco, and getting a nasty virus) I've been on filters. Right now I'm down to some classes to represent the operator definitions for all the functions like bound(), isIRI() and regex(). I already have the functionality implemented, but you still need to represent it in an abstract syntax if you're going to construct expressions at query time. So it's all just some boiler plate code to represent the parameters and pass the context on down to any variables that need resolving. After that, I'm on to the unit tests. In an ideal world, I'd test everything but in reality I have less time than that. Many of the functions are so similar that I'll just be testing a good sample of each of them.

Looking at the list in the SPARQL definition, you might think that there aren't too many functions at all, but you would be wrong. For a first approximation, many of the functions have to be reimplemented for each type of parameter. I've even gone to the effort of making sure that working on an <xsd:int> returns an <xsd:int> (when appropriate), and that an <xsd:short> returns an <xsd:short>. Since I was already trying to keep floating point numbers and integers apart, then this seemed to be a natural extension. Then I have to consider the types of numbers typed into the SPARQL query, literal numbers typed in to the query, and variables that get bound to numbers during processing. This raises the complexity considerably.

My first attempt had me doing largish methods that have copious "if (value instanceof ...)" statements in them. This is clunky and brittle. The moment I went to do it a second time, I decided to throw it out, and do it all with maps to functors (where are closures?!?). This actually worked well, and has the advantage of giving short and simple functions, and consistent patterns to follow in implementations. I'd have liked to use generics a little more, but they are really suited for interpreting code you are writing, rather than code that is being structured from a parser. Consequently, in one class I ended up writing a little Ruby script to write the series of functor classes I needed for arithmetic operations! Scary, I know, but it works quite well. It was either that or a series of if/then/else blocks taking me down dark passages I never want to enter.

The frustrating thing is that via autoboxing, you can write the same arithmetic over and over again, and have it do different things. For instance, the expression:
x * y
can result it totally different return types depending on whether x and y are Doubles, Floats, Integers, etc. This is common when programming in Java using the native types (like double and int) but this must be established at compile time, not when processing query. That means you want to have access to every combination of parameters at run time. This can be done with autoboxing, and defining classes with interfaces that return java.lang.Numbers. Then the code x*y can be written over and over, and it means something different each time. Java generics are nice, but they are a long way short of C++ templates, a fact especially obvious when you want to use them on native types (along with a hundred other reasons). But Generics + Autoboxing can sometimes get you some of the way.

OK, so that gave me access to each combination of parameters, but surely there's a better way to do it dynamically? Well, not in Java. The only approaches I've seen in the past either use heuristics to work out which version of arithmetic to run, or else it promotes everything into a standard type (like Double). The latter has arithmetic problems, and gives an inappropriate type for the result. The former can just be complex to read, write, and verify.

The problem comes back the CPU having different instructions for the different forms of arithmetic. A compiler has no problems selecting which one to use, but that is because it has access to the entire library of instructions. Conversely, a parser is not expected to have access to all instructions, leading to the problems I'm talking about. So you either choose a subset of instructions to work with (ie. upcast everything), or else you provide all instructions in a library, and then map the parameters into the correct instruction - either with the heuristic tree or something like a hash map.

Dynamic languages have a much easier time of it. For a start, they usually have all instructions at their disposal in the interpreter. Many (though not all) of them also simplify their numeric types to only a couple of types. Whatever they use, the poor programming schmuck writing his own interpreter (that would be me) need only write x*y and let the dynamic language developer work out what he wanted. At the very least, we can emit it in a string and do an eval().

Oh well, I shouldn't complain. I have all the functions written out (via Ruby) and a hash map that lets me get what I need trivially. With the exception that there is a lot of machine generated code that looks like the same thing over and over, the whole system comes down to just a few lines of easily verifiable code - which is what I like to see. Following the code path you'll see that any kind of operation just goes through a few steps and it's done.

No comments: