Sunday, April 15, 2007

Distributed Resolver

TQL has always allowed multiple models to be mentioned in a query, either in an expression in the from clause, or on individual constraints, via an in clause. This lets you perform selections on expressions describing unions and intersections of models, with arbitrary complexity. However, all the models described must be on the same server.
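
For example, a selection over the union of two models (model and server names invented for illustration) might look something like:
 select $s $p $o from <rmi://example.com/server1#modelA> or <rmi://example.com/server1#modelB> where $s $p $o;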

After 2 years of procrastinating, I finally implemented a resolver for distributing queries. This means that models on other servers (on the same machine, or across the network) can also be accessed. It's not an optimal approach, but it works quite well.

I actually did something similar 2 and a half years ago, but this was for TKS and not Kowari. This means it could never be released as open source code, and I lost access to it once Tucana was sold. It was not particularly difficult to implement in the first place, so I knew I could do it again, but I really didn't want to. I kept putting it off, as it was demoralizing to have to do something from scratch that I'd already done before, even if I had forgotten the specifics. At least this time I think I did a better job (2.5 years of extra experience will do that for you). I also have the feeling that I may have implemented more Resolver methods than I needed to back then.

One thing that is missing is blank node resolution. At the moment, blank nodes from 2 separate servers may be given the same temporary identifiers. This is a bug. The solution I want is to have blank nodes from another server use extended string representations which include the server IP address, rather than the simple _:## format currently in use. This will require a blank node factory that returns the required type of blank node as needed. The factory will need to be given to the Answer as it is created, so that it can return the correct type of blank nodes as they arrive. I'll have to check more carefully, but I think this can happen in AnswerWrapperRemoteAnswer.
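
To make that concrete, here is a rough Java sketch of the sort of thing I mean. All class and method names here are invented for illustration; this is not the actual Mulgara API.

  // A blank node type that carries its server of origin, so that node 42
  // from two different servers can never collide.
  public class ForeignBlankNode {
    private final String serverAddress;  // e.g. an IP address or host name
    private final long nodeId;           // the server's internal blank node id

    public ForeignBlankNode(String serverAddress, long nodeId) {
      this.serverAddress = serverAddress;
      this.nodeId = nodeId;
    }

    /** Extended representation, instead of the simple _:## format. */
    public String toString() {
      return "_:" + serverAddress + "#" + nodeId;
    }

    public boolean equals(Object o) {
      if (!(o instanceof ForeignBlankNode)) return false;
      ForeignBlankNode f = (ForeignBlankNode)o;
      return serverAddress.equals(f.serverAddress) && nodeId == f.nodeId;
    }

    public int hashCode() {
      return serverAddress.hashCode() * 31 + (int)(nodeId ^ (nodeId >>> 32));
    }
  }

  // The factory handed to the Answer as it is created, so that rows can be
  // rewritten to the correct type of blank node as they arrive.
  interface BlankNodeFactory {
    ForeignBlankNode newBlankNode(long localId);
  }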

For the moment, the approach is naïve, and does not take into account how much data may try to move across the network. A more scalable approach will get done eventually, but this may be part of a commercial offering from my employer Fourth Codex. I'm expecting to have to add some of the required infrastructure to Mulgara, but keep the "secret sauce" in an optional external module. Of course, this will depend on my time and resources at work.

Optional

One SPARQL feature that we need to implement is the OPTIONAL keyword. It seemed relatively easy to implement, and would work similarly to existing constraint operations.

While a constraint disjunction expands both sides out to contain the union of variables, it does nothing to join the results on both sides. This can leave any created variables unbound. A constraint conjunction does this join, but limits the results to those which have bound values. The optional operation needs to do both. In a sense, it reminds me of when I had to introduce the minus operation to provide a new kind of join operation that was previously impossible.
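
To illustrate the semantics, here is a self-contained Java sketch of optional as a left outer join, written over plain maps and lists rather than the real Tuples machinery. Every left-hand solution is kept: compatible right-hand solutions are merged in, and where none exist the optional variables simply stay unbound.

  import java.util.*;

  public class OptionalJoin {
    // A solution is a map from variable names to values.
    static List<Map<String,String>> optionalJoin(
        List<Map<String,String>> left, List<Map<String,String>> right) {
      List<Map<String,String>> result = new ArrayList<Map<String,String>>();
      for (Map<String,String> l : left) {
        boolean matched = false;
        for (Map<String,String> r : right) {
          if (compatible(l, r)) {
            Map<String,String> merged = new HashMap<String,String>(l);
            merged.putAll(r);
            result.add(merged);
            matched = true;
          }
        }
        if (!matched) result.add(l);  // keep the row, optionals left unbound
      }
      return result;
    }

    // Two solutions are compatible when they agree on all shared variables.
    static boolean compatible(Map<String,String> l, Map<String,String> r) {
      for (Map.Entry<String,String> e : r.entrySet()) {
        String v = l.get(e.getKey());
        if (v != null && !v.equals(e.getValue())) return false;
      }
      return true;
    }
  }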

I'll probably try to introduce this operation shortly, but it won't be of much use until we can use it in a query language. Since SPARQL is still some way off (Inderbir is looking at the grammar parser for me right now), I should probably introduce it as a language feature in TQL. I guess it would appear like an AND operator, right?
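
Perhaps something like this (entirely hypothetical TQL syntax, just to show the shape):
 select $x $name $email from <rmi://example.com/server1#model> where $x <foo:hasName> $name and optional($x <foo:hasEmail> $email);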

Saturday, April 14, 2007

SPARQL

When I needed a break from the DL Handbook, I read the SPARQL Query Language for RDF. I've browsed this enough to have an idea of how to write SPARQL, but I'd never gone through the entire document in detail before now. It's unfortunate that I didn't have a pen with me, as I found a few small errors, mostly where the text disagreed with sample data (probably due to inconsistent revisions of the document).

I found it interesting to see elements which look a lot like TQL in there. Admittedly, I don't know which elements of TQL we borrowed from other languages (Simon would be the best one to describe that), but I'm pretty sure that some of what we included was original. FROM NAMED and GRAPH look like TQL FROM and IN clauses. OFFSET and LIMIT also look to have come straight out of TQL. There are many other similarities, but I know that these look a lot like RDQL as well, so they could have come from anywhere. I should ask Simon and Tom just how much influence they had.

Filters

From a performance perspective I've never been too happy about filters. Simple filtering in Mulgara is done more effectively using appropriate resolvers. However, I was never too clear on exactly how expressive filters could be. Consequently, I always wondered if we could perform filtering using Mulgara resolvers. The short answer is that we shouldn't (in my opinion). I know that others (such as Andrae) have also looked at this, and probably have a better idea than I have, but I'll put forward my ideas anyway.

Unfortunately, SPARQL defines filters as "constraints" (which they are), and a pattern to be matched as a TriplesBlock. This is at odds with Mulgara terminology, which defines both concepts as "constraints". This is because the pattern in a TriplesBlock is also a constraint on the data to be returned. I want to refer to the internal structure of Mulgara queries, so I'll say TriplesBlock when I'm talking about the SPARQL construct, Constraint when talking about the Mulgara internal construct, and Filter when talking about SPARQL filters (which are SPARQL constraints).

A filter could be passed into a Mulgara query as a constraint, which is to be resolved against the current group. This has several advantages. The first is that it can be made quite efficient for some functions (such as when the function returns a set, which is the case for type tests like isBlank and isLiteral). Such constraints also work within the current query resolution system.
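
For instance, a type test might map to a constraint against a set-returning resolver (the predicate here is invented, not a real Mulgara one):
 e.g. "isLiteral(?x)" could map to "?x mulgara:nodeType mulgara:Literal"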

The first problem I have with this approach is that it requires a mapping of semantics from filtering the output to an inner join against some theoretic set of tuples returned by a resolver. I'm confident that this semantic works, but fiddling with semantics does seem inappropriate. All the same, it is an idea worth thinking through.

Constraints of this type would get mapped to a resolver that knows how to handle filter functions. The main problem is that resolver constraints usually get passed in using some kind of triple pattern, which tends to limit the width of the data. This might work for a simple filter:
 e.g. "?x < 5" could map to "?x mulgara:lessThan '5'^^xsd:integer"
But it starts to fail for complex expressions like:
 ?x > 5 && ?x < 10 && regex(str(?y), "foo$")

I also considered the idea of packaging the entire filter into a string that gets passed to the resolver as a literal. Filter parsing is pretty much a separate stage in SPARQL already, so I don't think it is an issue to put it off into a resolver. The main problem with this is that there is no real mechanism to pass more than a couple of bound variables into a resolver, whereas a filter could refer to an arbitrary number of variables. In theory, the resolver could ignore already bound variables, and instead return a set of data to be joined against with a constraint conjunction, but for some functions these sets would need to be infinite!
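
For illustration, such a packaged constraint might have looked something like this (hypothetical predicate and syntax):
 $x <mulgara:filter> '?x > 5 && regex(str(?y), "foo$")'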

It might be possible to break down a filter expression into a constraint expression (constraint conjunctions and disjunctions), but it is still possible to construct difficult filter expressions that don't break down so easily. For instance:
 regex(?z, str(?x + ?y))
This also ignores external functions, which might take more than two parameters. The example of geographic distance, shown in section 11.6, demonstrates the importance of supporting such extensions.

There are other reasons to avoid using a resolver, but the restriction on bound variables is a real showstopper.

Filtered Constraints

After deciding not to use the usual constraint conjunction method for filtering, how should solutions be filtered? SPARQL defines filters to operate on a group of patterns. For groups with more than one constraint, this corresponds to a constraint conjunction in TQL. For groups of just one constraint, this is just a simple constraint. For groups containing a GraphPatternNotTriples, this becomes either a constraint disjunction, or one of several other constraint operations (I'll get to these later). In effect, every constraint expression we have could find itself as a "group" which requires filtering. So the ConstraintExpr interface will need to include a filter property.
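
As a sketch, that property might look like this. FilterExpr is an invented name standing in for a parsed filter expression; the real ConstraintExpr interface carries more than what is shown here.

  import java.util.Map;

  // A parsed filter expression, evaluated against one row of bindings.
  interface FilterExpr {
    boolean test(Map<String,Object> bindings);
  }

  // Every constraint expression (simple constraint, conjunction,
  // disjunction, ...) can carry an optional filter for its group.
  interface ConstraintExpr {
    FilterExpr getFilter();  // null when the group has no filter
  }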

When it comes time to resolve a constraint expression with a filter attached, we can wrap the resolution in a FilteredResolution object, based on the filter expression. Of course, construction of such an object should build some sort of functor which will perform the required operation. In essence, this means that we compile the expression (an obvious thing to do), rather than interpret it on every iteration <shudder>.

The biggest problem with filtering like this is that there is no way to count your results without iterating over the entire set. This is highly undesirable, as it blows away many of the time/space efficiencies we have with lazy evaluation. Fortunately, the definition of Tuples.getRowCount() has always been for an upper bound on the size of the tuples. I believe this was so lazy evaluation could guess at the size of a conjunction, though I seem to recall that the concept of filtering was kept in mind for this decision as well.
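
Putting the last two paragraphs together, a FilteredResolution could look roughly like the following. All the names are invented for illustration (reusing the FilterExpr sketch above), and the real implementation would work against Tuples rather than maps.

  import java.util.Iterator;
  import java.util.Map;

  // Wraps an underlying resolution, applying a filter compiled once at
  // construction time to each row as it is iterated.
  class FilteredResolution implements Iterable<Map<String,Object>> {
    private final Iterable<Map<String,Object>> inner;
    private final FilterExpr filter;  // compiled, not re-parsed per row

    FilteredResolution(Iterable<Map<String,Object>> inner, FilterExpr filter) {
      this.inner = inner;
      this.filter = filter;
    }

    public Iterator<Map<String,Object>> iterator() {
      final Iterator<Map<String,Object>> it = inner.iterator();
      return new Iterator<Map<String,Object>>() {
        private Map<String,Object> lookahead = advance();

        // Skip forward to the next row that passes the filter.
        private Map<String,Object> advance() {
          while (it.hasNext()) {
            Map<String,Object> row = it.next();
            if (filter.test(row)) return row;
          }
          return null;
        }

        public boolean hasNext() { return lookahead != null; }

        public Map<String,Object> next() {
          Map<String,Object> row = lookahead;
          lookahead = advance();
          return row;
        }

        public void remove() { throw new UnsupportedOperationException(); }
      };
    }

    // Counting the filtered results exactly would mean iterating the whole
    // set, so the best this class can report is the unfiltered size of the
    // wrapped resolution: an upper bound, consistent with the definition of
    // Tuples.getRowCount().
  }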

Sleep

All Thursday night and Friday I was doing a sleep study at Northwestern Memorial Hospital. If you are ever given the choice to do one of these, then try to avoid it. It's horribly boring, and you won't get as much sleep as you think you will. (Though you probably wouldn't be doing one if you were sleeping properly anyway).

Description Logic

Knowing that I'd have a lot of time on my hands, I brought some reading material that I've been procrastinating about for some time: The Description Logic Handbook and a printout of The SPARQL Query Language for RDF.

Some of my thoughts on SPARQL bear going into in detail, so I'll write about that in a separate post.

I have the Description Logic (DL) Handbook in electronic form, and I've only really browsed it in the past. However, I figured that I should really get into it, and have now discovered how vital it is to a proper understanding of description logics - including RDF-OWL. I already knew a lot about this area from all the papers I've read in the last couple of years, but this book is filling in a lot of the gaps, including those I never knew I had. You didn't think that those logic equations in previous posts were things I made up on my own, did you? They were from the start of Chapter 3, which is the chapter I was reading while in the hospital.

The book is useful in a few different ways. Other than providing a basic understanding of a lot of principles, it reads very much like a literature review on various topics in knowledge management. Fortunately, this means that it also cites references (since I need to cite original sources when I finally get my thesis written, and I could hardly cite a textbook).

Each chapter is written by someone who is an expert in that particular area, which is both good and bad. It's good in that each topic gets a complete treatment, but bad for consistency between the chapters. For instance, I found chapter 2 to be highly accessible, while chapter 3 was a real slog. Part of my difficulty with chapter 3 was that the topic is the complexity of reasoning, whereas I only possess a general understanding of complexity.

As an engineering undergraduate I learnt how to determine complexity for constant, linear, logarithmic and exponential algorithms, but that was about it. From cryptography (and its relevance to Shor's and Grover's algorithms in quantum computing) I learnt about NP-complete problems (and the question of P=NP: the solution of which would make a good techno-thriller in my opinion). In just the last couple of years I've learnt about more of the various complexity classes, especially in relation to logic reasoning. However, I've avoided getting a deep understanding of these, in favour of more pressing reading. I'm starting to wonder if I can really afford to take that approach. At the very least, I should probably read the various Wikipedia articles on each class, though I may need to devote the time to read Computational Complexity: A Modern Approach.

Even considering my shortcomings in complexity, I still found chapter 3 of the DL Handbook to be difficult. The proofs and lemmas often refer back to propositions made several pages before, after the context has been lost. This makes it look like variables and concepts have been introduced at random. For instance: Therefore G={....}, which is equivalent to C_T. Huh? What's C_T? Oh, here it is... 3 pages ago.

A few times I had to re-write the proof, so that the various substitutions could all be seen in the context of one another.

Many of the so-called proofs were also supposed to demonstrate a particular complexity, but what they really showed was the length of a chain in a structure. It was left up to the reader's own understanding of complexity to see that this would therefore lead to a particular complexity class. I understood that this was the intent, but was unable to see the relationship myself, due to my lack of background in complexity. Considering that this is a book on logic, and when compared to the other chapters, I found this one chapter to be quite obtuse. Hopefully I won't encounter many more like this one.

Thursday, April 12, 2007

Logic and Sleep

After revealing my confusion last night, I went to bed thinking I'd just exposed myself as an idiot. Nothing like that to motivate you to solve a problem for yourself! So after sleeping on it, I finally worked it out for myself.

I sort of understood role restrictions. While it's true that (∀(R|C).A) returns the same set as (∀R.(A⊔¬C)), the semantics of value and role restrictions are different. Value restriction means that every usage of a role must refer to a category, while role restriction is a selection of those usages which refer to that category.

However, I was wrong that it was the inability to see this difference that led me to false conclusions about those properties. My main problem was failing to take proper account of the universal quantifier used in the value restriction (and forgetting that this does NOT imply existence).

I finally "got it" when I thought about some example roles. Consider a town where there are a number of people named Smith and Jones (among other names). I could ask the question, "Who in town knows only the men in the Smith and Jones families?" Note that these people may also know males and females in other families, as this is not a restriction imposed by the question.

I could encode the question according to the original proposition with:
R = knows
C = Smith
D = Jones
A = Male

So then when I say (∀(R|C).A) this is now equivalent to (∀(knows|Smith).Male). Once I got that, I realized it was easier to think of a restricted role as a new role altogether (a sub-Property in RDFS parlance). So the new role might be called knows-a-Smith, and the expression (∀knows-a-Smith.Male) means the selection of people who only know Smiths that are male. Since I'm describing this in natural language, I should point out that the expression does not say that these people actually know anybody named Smith, only that if they do, then that Smith must be a male.

So my question of equivalence goes from:

(∀(R|C).A) ⊓ (∀(R|D).A)     ≡    ∀(R|(C⊔D)).A

To the expression:

(∀(knows|Smith).Male) ⊓ (∀(knows|Jones).Male)     ≡    ∀(knows|(Smith⊔Jones)).Male

Or even simpler:

(∀(knows-a-Smith).Male) ⊓ (∀(knows-a-Jones).Male)     ≡    ∀(knows-a-Smith-or-Jones).Male

This means that the set of people who know only male Smiths, and also know only male Joneses is the same as the set of people who know "Smiths or Joneses" who are only male. (These things can get hard to parse in English, so it's a good thing we have a syntax that is more exact - even if it is hard to read).

It's easy to see that the intersection of Smith-knowing-people and Jones-knowing-people falls into the union expression of people-who-know-Smiths-or-Joneses. But my problem was that the union includes people who only know Smiths (and not Joneses), and people who only know Joneses (and not Smiths), and I couldn't see how these people could be in the intersection expression.

An easy way to find a flaw in the proposition was to think of a counterexample, i.e. who could satisfy the union, but not the intersection? There's symmetry when considering Smith or Jones, so I only have to look at one of the names. The relevant interpretations include:
  1. People who know male Smiths and also know male Joneses
  2. People who only know male Smiths
  3. People who only know male Joneses
  4. People who don't know anyone named Jones or Smith
People who know female Smiths or Joneses are not in the intersection of (∀(knows-a-Smith).Male) ⊓ (∀(knows-a-Jones).Male). They are also not in the union form, (∀(knows-a-Smith-or-Jones).Male), since the value restriction (i.e. referring only to males) is not met if they know a female with those names.

When thinking of the "union" it is easy to see that all four of the listed groups are relevant. But for the intersection it wasn't instantly apparent to me that everyone was included:
(∀(knows|Smith).Male) ⊓ (∀(knows|Jones).Male)

For the equivalence to hold, we need all the relevant groups present on both sides of this intersection operator.

This is where I was getting caught up with the universal quantifier. Anyone in group 1 (knows both Smiths and Joneses) is on both sides of this intersection. Anyone in group 2 (knows Smiths, but not Joneses) is on the left hand side of this intersection... but how are they on the right? The point here is that anyone without any use of the role (knows|Jones) satisfies the condition of (∀(knows|Jones).Male), since there is a universal quantifier here, and NOT an existential quantifier. The (knows|Jones) relation is an empty set for these people (which threw me off), but an empty set doesn't violate the condition, because of the universal quantifier.
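
Spelling that out against the semantics from the handbook makes the vacuous case explicit:

 (∀(knows|Jones).Male)ᴵ = {x ∈ ∆ᴵ | ∀y. (x,y) ∈ (knows|Jones)ᴵ → y ∈ Maleᴵ}

If x knows nobody named Jones, then no y satisfies (x,y) ∈ (knows|Jones)ᴵ, so the implication holds trivially for every y, and x is in the set.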

Doh!

This same reasoning applies for groups 3 and 4, so I find there are no conflicts.

Wednesday, April 11, 2007

Description Logic

I've been having some issues with description logic recently. I can't tell the difference between value restriction (∀R.C) and role restriction (R|C). These have the semantics of:

 (∀R.C)ᴵ = {x ∈ ∆ᴵ | ∀y. (x,y) ∈ Rᴵ → y ∈ Cᴵ}
 (R|C)ᴵ = {(x,y) ∈ ∆ᴵ × ∆ᴵ | (x,y) ∈ Rᴵ ∧ y ∈ Cᴵ}

Value restriction (∀R.C) uses an "implies" operator (→) to say that any use of the role will refer to an instance of the class C. Implication means that there can be values for y which are not part of the role description, and are still instances of C. However, since we also have a universal quantifier for y when used in the role, we have no need to consider elements of C which are not used in the role.

Role restriction says that for any use of the role, then the value in that role (y here) will also be an element of C.

Given that value restriction has the universal quantifier on y, I don't see the difference between the implication here, and the conjunction used in role restriction.

So to me, value restriction defines a class where all uses of a role refer to elements of a specified class (C), and role restriction refers to a role that can only refer to elements of a class. They mean slightly different things, but then I don't understand how (∀(R|C).A) is any different to (∀R.(A⊔¬C)).

Maybe I'm reading it wrong.

Anyway, I think it's because I don't understand this difference that I can't get the following property:

(∀(R|C).A) ⊓ (∀(R|D).A)     ≡    ∀(R|(C⊔D)).A

My reading of it says that the right hand side should include an intersection of (C⊓D) and not the union (C⊔D).

I know I'm wrong here, but I just don't see it. Unfortunately, I don't know where to ask either. Can anyone help?

Distributed Queries

A few weeks ago I was having a chat with the people from Fedora about the things they'd like/need in Mulgara. One of the features that came up was distributed queries. I recall that it was pretty simple when I did this (the naïve way) a few years ago. So I thought I'd have another go at it.

I've had a lot of things to get done lately, but as of this afternoon I have it running again (yay). Unfortunately I haven't had time to set up tests which start 2 servers and query them together, but my manual tests are all working fine. I'm able to describe models from multiple servers in the FROM clause of my queries, and also in the IN clauses.

My first priority is to get the tests written. Then in the longer term I need to get an intelligent algorithm going to properly distribute the query around the network. I know what needs doing, but it requires some help in the query optimizer, and can't be done entirely in the resolver. So I'll be waiting until I'm given some significant time for this feature before I get stuck into it.

In the meantime, I don't actually have any use cases which need optimized network activity for distributed query resolution, so I'm pretty happy that I have the feature going.