Saturday, April 14, 2007

SPARQL

When I needed a break from the DL Handbook, I read the SPARQL Query Language for RDF. I've browsed this enough to have an idea of how to write SPARQL, but I'd never gone through the entire document in detail before now. It's unfortunate that I didn't have a pen with me, as I found a few small errors, mostly where the text disagreed with sample data (probably due to inconsistent revisions of the document).

I found it interesting to see elements which look a lot like TQL in there. Admittedly, I don't know which elements of TQL we borrowed from other languages (Simon would be the best one to describe that), but I'm pretty sure that some of what we included was original. FROM NAMED and GRAPH look like the TQL FROM and IN clauses. OFFSET and LIMIT also look to have come straight out of TQL. There are many other similarities, but I know that these look a lot like RDQL as well, so they could have come from anywhere. I should ask Simon and Tom just how much influence they had.

Filters

From a performance perspective I've never been too happy about filters. Simple filtering in Mulgara is done more effectively using appropriate resolvers. However, I was never too clear on exactly how expressive filters could be. Consequently, I always wondered if we could perform filtering using Mulgara resolvers. The short answer is that we shouldn't (in my opinion). I know that others (such as Andrae) have also looked at this, and probably have a better idea than I have, but I'll put forward my ideas anyway.

Unfortunately, SPARQL defines filters as "constraints" (which they are), and a pattern to be matched as a TriplesBlock. This is at odds with Mulgara terminology, which defines both concepts as "constraints". This is because the pattern in a TriplesBlock is also a constraint on the data to be returned. I want to refer to the internal structure of Mulgara queries, so I'll say TriplesBlock when I'm talking about the SPARQL construct, Constraint when talking about the Mulgara internal construct, and Filter when talking about SPARQL filters (which are SPARQL constraints).

A filter could be passed into a Mulgara query as a constraint, to be resolved against the current group. This has several advantages. The first is that it can be very efficient for some functions (such as those that return a set, which is the case for type tests like isBlank and isLiteral). It also works within the current query resolution system.

The first problem I have with this approach is that it requires a mapping of semantics from filtering the output to an inner join against some theoretical set of tuples returned by a resolver. I'm confident that this mapping works, but fiddling with semantics does seem inappropriate. All the same, it is an idea worth thinking through.

Constraints of this type would get mapped to a resolver that knows how to handle filter functions. The main problem is that resolver constraints usually get passed in using some kind of triple pattern, which tends to limit the width of the data. This might work for a simple filter:
 e.g. "?x < 5" could map to "?x mulgara:lessThan '5'^^xsd:integer"
But it starts to fail for complex expressions like:
 ?x > 5 && ?x < 10 && regex(str(?y), "foo$")
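
To make the simple case concrete, here is a minimal sketch of the rewrite in Java. Everything here is invented for illustration: Node and Triple are simplified stand-ins for the real query model, and FilterRewrite is not a Mulgara class.

  // Hypothetical rewrite of "?x < 5" into a triple-pattern constraint.
  class Node {
    final String label;
    Node(String label) { this.label = label; }
  }

  class Triple {
    final Node s, p, o;
    Triple(Node s, Node p, Node o) { this.s = s; this.p = p; this.o = o; }
  }

  class FilterRewrite {
    static final String MULGARA = "http://mulgara.org/mulgara#";
    static final String XSD_INT = "http://www.w3.org/2001/XMLSchema#integer";

    // "?x < 5"  ==>  ?x mulgara:lessThan '5'^^xsd:integer
    static Triple lessThan(String var, int limit) {
      return new Triple(
          new Node("?" + var),
          new Node(MULGARA + "lessThan"),
          new Node("'" + limit + "'^^" + XSD_INT));
    }
  }

The compound expression above involves three operations across two variables, so there is simply nowhere to put it all in a single subject/predicate/object pattern.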

I also considered the idea of packaging the entire filter into a string that gets passed to the resolver as a literal. Filter parsing is pretty much a separate stage in SPARQL already, so I don't think it is an issue to push it off into a resolver. The main problem with this is that there is no real mechanism to pass more than a couple of bound variables into a resolver, whereas a filter can refer to an arbitrary number of variables. In theory, the resolver could ignore already-bound variables and instead return a set of data to be joined against with a constraint conjunction, but for some functions these sets would need to be infinite (the set of every value satisfying ?x < 5, say, is unbounded)!
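
Continuing the FilterRewrite sketch from above (the mulgara:filter predicate here is pure invention), the packaging approach would look something like:

  // Hypothetical packaging of a whole filter expression as an opaque literal.
  // Only one variable fits in the remaining constraint position.
  static Triple packageFilter(String var, String filterExpr) {
    return new Triple(
        new Node("?" + var),               // the single variable we can carry
        new Node(MULGARA + "filter"),      // invented predicate
        new Node("'" + filterExpr + "'")); // the filter, passed as a literal
  }

  // packageFilter("x", "?x > 5 && ?x < 10 && regex(str(?y), \"foo$\")")
  // parses fine in the resolver, but has nowhere to carry a binding for ?y.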

It might be possible to break down a filter expression into a constraint expression (constraint conjunctions and disjunctions), but some filter expressions simply don't break down that way. For instance:
 regex(?z, str(?x + ?y))
This also ignores external functions, which might take more than two parameters. The geographic distance example shown in section 11.6 demonstrates the importance of supporting such extensions.
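
To make the decomposition idea concrete, here is a rough sketch. The filter AST classes and the rewrite function are hypothetical; ConstraintConjunction and ConstraintDisjunction stand for the existing Mulgara constraint operations, with two-argument constructors assumed. Both the regex example and multi-parameter external functions fall into the branch where no decomposition exists.

  // Hypothetical filter AST, just enough to show the shape of the rewrite.
  abstract class FilterExpr { }
  class And extends FilterExpr { FilterExpr lhs, rhs; }
  class Or extends FilterExpr { FilterExpr lhs, rhs; }
  class Comparison extends FilterExpr { }    // e.g. "?x < 5"
  class FunctionCall extends FilterExpr { }  // e.g. regex(?z, str(?x + ?y))

  abstract class FilterDecomposer {
    // Maps the simple case ("?x < 5") to a resolver constraint, as above.
    abstract ConstraintExpr comparisonToConstraint(Comparison c);

    ConstraintExpr rewrite(FilterExpr f) {
      if (f instanceof And) {
        And a = (And)f;
        return new ConstraintConjunction(rewrite(a.lhs), rewrite(a.rhs));
      }
      if (f instanceof Or) {
        Or o = (Or)f;
        return new ConstraintDisjunction(rewrite(o.lhs), rewrite(o.rhs));
      }
      if (f instanceof Comparison) {
        return comparisonToConstraint((Comparison)f);
      }
      // regex(?z, str(?x + ?y)) and external functions land here: their
      // variables are entangled inside a single call, so nothing can be split.
      throw new UnsupportedOperationException("filter does not decompose");
    }
  }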

There are other reasons to avoid using a resolver, but the restriction on bound variables is a real show stopper.

Filtered Constraints

After deciding not to use the usual constraint conjunction method for filtering, how should solutions be filtered? SPARQL defines filters to operate on a group of patterns. For groups with more than one constraint, this corresponds to a constraint conjunction in TQL. For groups of just one constraint, this is just a simple constraint. For groups containing a GraphPatternNotTriples, this becomes either a constraint disjunction or one of several other constraint operations (I'll get to these later). In effect, every constraint expression we have could find itself as a "group" which requires filtering. So the ConstraintExpr interface will need to include a filter property.
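
In sketch form, the change to the interface is tiny. FilterExpr is again a placeholder for whatever the parsed filter becomes, and the real ConstraintExpr obviously carries more than just this:

  // Sketch: every constraint expression can carry an optional filter.
  // A null filter means "no filtering on this group".
  interface ConstraintExpr {
    FilterExpr getFilter();
    void setFilter(FilterExpr filter);
  }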

When it comes time to resolve a constraint expression with a filter attached, we can wrap the resolution in a FilteredResolution object, based on the filter expression. Of course, construction of such an object should build some sort of functor which will perform the required operation. In essence, this means that we compile the expression (an obvious thing to do), rather than interpret it on every iteration <shudder>.
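
A minimal sketch of what I mean, with Row as a placeholder for a set of variable bindings (the real code would wrap Tuples, not an Iterator):

  import java.util.Iterator;
  import java.util.NoSuchElementException;

  interface Row { }  // placeholder for one row of variable bindings

  // The "compiled" filter: built once from the expression tree,
  // then evaluated cheaply against each row.
  interface BooleanFunctor {
    boolean test(Row row);
  }

  // Wraps an inner resolution, skipping rows the functor rejects.
  class FilteredResolution implements Iterator<Row> {
    private final Iterator<Row> inner;
    private final BooleanFunctor filter;
    private Row pending;

    FilteredResolution(Iterator<Row> inner, BooleanFunctor filter) {
      this.inner = inner;
      this.filter = filter;
      advance();
    }

    // Pull from the inner resolution until a row passes the filter.
    private void advance() {
      pending = null;
      while (inner.hasNext()) {
        Row row = inner.next();
        if (filter.test(row)) {
          pending = row;
          return;
        }
      }
    }

    public boolean hasNext() { return pending != null; }

    public Row next() {
      if (pending == null) throw new NoSuchElementException();
      Row result = pending;
      advance();
      return result;
    }

    public void remove() { throw new UnsupportedOperationException(); }
  }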

The biggest problem with filtering like this is that there is no way to count your results without iterating over the entire set. This is highly undesirable, as it blows away many of the time/space efficiencies we get from lazy evaluation. Fortunately, the definition of Tuples.getRowCount() has always been an upper bound on the size of the tuples. I believe this was so lazy evaluation could guess at the size of a conjunction, though I seem to recall that the concept of filtering was kept in mind for this decision as well.
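
Which means a filtered tuples can keep honouring that contract with a one-liner. This is a sketch only, assuming the wrapper holds the unfiltered tuples as innerTuples and that getRowCount() throws TuplesException:

  // Upper bound only: filtering can remove rows but never add them,
  // so the unfiltered count is a valid (if loose) answer.
  public long getRowCount() throws TuplesException {
    return innerTuples.getRowCount();
  }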
