Tuesday, March 18, 2008


Whew! I've finally finished filter functions.

I was just about done when I had two issues show up for me. First off, I realized that each parameter of regex takes an expression that resolves to a simple literal. In other words, it is possible to calculate a different pattern and/or flag for every line <shudder/>. OK, so I wouldn't do it, but the spec says it, so I did it. Not that it was hard. It just seems obtuse.

While I'm on it, the flags for regex don't quite match the flags in Java. Granted, they're ALMOST the same, but if I want to be a stickler about this things, then it's not quite there. The most apparent difference is that the "x" character is not the same as enabling the COMMENTS flag in Java - though it's similar. In fact, in Java 5, the COMMENTS flag does not even appear as an option in the Javadoc, though a quick scan of the library source shows that it is.

Once I found small differences (which frankly I expected to find) I decided not to look for any more. The point is that I am not going to implement my own regex engine. Sure, it would be a great learning experience (I know that suffix trees get me part of the way - but I'd have to learn some more to get all of it), but it would take me months, and for no useful purpose. I'm surprised they didn't just choose a standard engine and say "use a standards-compliant regex engine, like XXX". As it is, it looks like everyone will be nearly there, but never quite make it.

The next problem was that I hadn't looked carefully enough at the definition of equal. I was mostly right, but it turns out that if you compare two literals that are different, then you don't return false: you throw a type exception. That just feels broken. Yes, I understand the semantics, but it's a perfectly common thing to do to check that two literals are the same. Having unexpected data throw an exception from a perfectly formed query might make the type theoreticians happy, but from the perspective of a software developer it looks like bad judgement.

Ironically, you CAN choose to return true for two different literals if you have a specific extension that handles direct comparisons between their types. For instance, you can check if "5"^^xsd:integer is equal to "5"^^xsd:long. Or perhaps you want to compare "5"^^temp:celsius and "41"^^temp:fahrenheit. If you want to get the same lexical form, then you use the sameterm() function, so that case is covered. But what if you want to compare two literals to have the same semantic value, and simply return false if they don't? Maybe I need to re-read this spec, because it doesn't work for me. Still, I've implemented it as asked, if it was more annoying to do so.

So now I have a lot of unit tests to write. Yes, I know the TDD purists will be out to get me, but the exact implementation and interfaces were still floating a little when I started, and besides, it is faster to write code with the tests written after. This is mostly because you don't have to change the tests if you realize you need to change the interfaces. And time is something I'm working hard against at the moment.


Andrae had a go at me for looking to make filters annotations on the constraints in the AST for the query. I didn't see a problem with this (and there is no operational difference) until Andrae pointed out that it would have a big impact on the optimizer and query re-writer, since each node can have more that one type: a filtered version and an unfiltered version.

He was suggesting that I use the conjunction code to apply filters (and the concrete syntax of SPARQL almost seems to imply that FILTER is added in as a conjunction - though this might just be to allow alternative syntaxes) but I pointed out that this will get awkward as the BOUND() function requires that variables not be guaranteed to be pre-bound. This led to a discussion of the use of BOUND(), and I was able to show that it is often used in conjunction with NOT and OPTIONAL to emulate subtraction functionality. When he saw what I meant, he was quite congratulatory of SPARQL for taking a log(n) operation and making it linear in n.
(For any non-Australians reading this.... yes, that was sarcasm)

At least this conversation made me realize that filtering the output of each Tuple would be a mistake (good thing I haven't written this yet). Instead I'll be implementing FILTER in the AST as a new constraint element that wraps another constraint (this makes it easy for the optimizer and transformer to ignore) and to create a new operation akin to MINUS that will do the work. Currently MINUS removes elements on the left that match (via variable bindings) elements on the right. The new code will remove them based on failing the FILTER test. Simple. :-)

No comments: