Tuesday, September 21, 2004

Recovering
I've been away from blogging for a few days due to illness. I was (rather stupidly) trying to code with a fever of 38.4C at 9pm last Thursday night (Google tells me that's 101.12 in Fahrenheit). Anne accused me of being a geek as I was taking great delight in watching the fever go up and down every 10 minutes with our new digital thermometer.

I finally realised that I couldn't work any more, and went to bed without blogging. I spent the next 4 days trying to kick the thing, and today is my first day back in the salt mines. Of course, everything I'm supposed to be doing has changed again. Perhaps I should have taken another day of rest (I could have used it).

TQL
I had planned on getting each part of Volz's paper codified into TQL rules by Thursday night, and to start working on mapping triggers between each of the rules. I did not quite make it, but I came reasonably close. The only things missing are a couple of OWL Full rules from the end of the document.

Converting the DL rules was not always as straightforward as I'd hoped. This was partly due to some of the DL rules only expressing part of the problem, and partly because some of the rules were expressed in predicate logic instead. Fortunately, iTQL appears to be good enough to fulfill many of these requirements, but I still needed to put in a bit of thought to get them right.

For instance, the owl:someValuesFrom rule was represented as:

 ∀X∃Y : C(Y) ← P(X,Y) ∧ D(X)
This translated to an inconsistency test of:
select $x $p $c count(

select $y2
from @@SOURCE@@
where $x $p $y2 and $y2 <rdf:type> $c)
from @@SOURCE@@
where $r <rdf:type> <owl:Restriction>
and $r <owl:onProperty> $p
and $r <owl:someValuesFrom> $c
and $x $p $y1
having $k0 <http://tucana.org/tucana#occursLessThan>
'1.0'^^<http://www.w3.org/2001/XMLSchema#double>;
By "inconsistency test", I mean that any returned rows represent inconsistent data.

Some items in the above need explaining...

The "@@SOURCE@@" string gets replaced as required by a model expression, typically indicating the model for the ontology as well as the base data, and often including the inferred data. Also, Kowari only supports floating point numbers for the moment, so we can't yet use the nonNegativeInteger type.

The syntax for count() is still rather obtuse, but in this case it provides a bonus. Since the variables in use within the sub-query (inside the count function) are projected in from the set used in the outer "select" clause, we get an effective "group by" for free. This is essential, as this query is supposed to only return results where a given subject/predicate pair do not have any objects which meet the owl:someValuesFrom restriction. In other words, I'm looking for a count of 0.

The above code and predicate logic demonstrate a few of the difficulties in translation. First, the predicate logic makes no statement that the owl:someValuesFrom restriction applies to the predicate; this happened with several of Volz's DL statements as well. Instead, this rule is supposed to be applied to all predicates where such a restriction is known to hold. Similarly, the class D is understood to be the target class that the restriction refers to.

I expect that the external requirements on the predicate and class D would normally be implemented in some wrapping code, but this is not appropriate for a purely TQL solution. Fortunately the above TQL shows that it is possible to express these conditions easily, even if they are harder (or impossible) in DL or predicate logic.

The second issue is that rules like the above declare that X and Y are instances of classes D and C. However, the class C has no significance to owl:someValuesFrom, and can be dropped when generating the TQL. This happens a few times, where an instance declaration (such as C(Y)) is either superfluous or marked explicitly as "optional" in the RDF OWL mapping.

The hardest part of the translation here is that some operations, such as the existential quantifier, require a count operation, and the obscure syntax makes a purely mechanical translation very difficult.

Full Cardinality
In order to achieve full cardinality support I need to include a variable in the having clause. For instance, owl:maxCardinality can be handled with:
select $s $p $x count(

select $o
from @@SOURCE@@
where $s $p $o )
from @@SOURCE@@
where $r <rdf:type> <owl:Restriction>
and $r <owl:onProperty> $p
and $r <owl:maxCardinality> $x
and $s $p $obj
having $k0 <http://tucana.org/tucana#occursMoreThan> $x;
Support for variables in the having clause didn't exist in TQL, but AN was able to put it in yesterday. This is the kind of upgrade cycle that inferencing is supposed to drive for TQL, so I was pleased to see it happening.

What to Inference
A discussion with AN on Thursday showed that he is hoping to infer some OWL rules from data that is provided. One example is that he was hoping to be able to read the number of times a predicate is used on a subject, and infer cardinality. I really can't see how that can be done, as OWL is supposed to dictate the structure of data, rather than the other way around. More importantly, extra data could invalidate some inferences taken from the original data.

In the end AN has agreed that we can't really infer OWL from data, but he did have the suggestion of storing some data (like a cardinality of a predicate on each class instance) for optimisation of future consistency checks. That seems fine, but I'm not sure how we'd hold that information. Perhaps in a JRDFmem graph.

Inconsistency
Many of the DL rules from Volz describe the construction of an inconsistent statement. One example is:
  inconsistent(X,Y) :- sameIndividualAs(X,Y), differentIndividualFrom(X,Y).
Now these won't actually get generated as statements describing inconsistency, but any returned data will indicate an inconsistency with the data store. As a result, many of the "inferencing" queries are actually consistency-checking queries, where any returned data indicates a problem with that data. This seemed consistent with the way the DL is structured.
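For instance, the rule above might come out as something like the following (just a sketch, and untested; I'm also using owl:differentFrom here, since that is the predicate name in the OWL vocabulary, even though Volz's DL calls it differentIndividualFrom):

```
select $x $y
from @@SOURCE@@
where $x <owl:sameIndividualAs> $y
and $x <owl:differentFrom> $y;
```

As with the other consistency checks, any rows returned indicate a pair of individuals declared both the same and different, which is inconsistent.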

Discussing consistency with AN has shown that there are many more rules we can make that Volz did not include in the paper we are using. For instance, it is possible to count predicate/object pairs on an inverse functional property (IFP), and confirm that each pair only occurs once. We will probably develop a lot of these as we go.
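That IFP check might look something like this (an untested sketch following the same count/having pattern as the cardinality queries; the exact shape is my assumption, not a rule from Volz):

```
select $p $o count(

  select $s2
  from @@SOURCE@@
  where $s2 $p $o)
from @@SOURCE@@
where $p <rdf:type> <owl:InverseFunctionalProperty>
and $s $p $o
having $k0 <http://tucana.org/tucana#occursMoreThan>
'1.0'^^<http://www.w3.org/2001/XMLSchema#double>;
```

Any returned rows would show a predicate/object pair used by more than one subject on an IFP, which is inconsistent unless those subjects are declared to be the same individual.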

Because I am testing for inconsistency, many of the required operators are inverted: the checks are all based on the inverse of the operator the restriction describes. For example, minimum cardinality requires that a count be greater than or equal to a given number, so checking for inconsistency only requires less than. Similarly, only greater than is needed when checking maximum cardinality.

These restrictions are important, as the predicates which test like this are only valid for the having clause, and this clause does not allow for predicate expressions. This means that >= cannot be represented as a combination of > and =.
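To illustrate the inversion, a minimum cardinality inconsistency test would presumably mirror the maximum cardinality query shown earlier, with only the having predicate changed (a sketch, not yet run):

```
select $s $p $x count(

  select $o
  from @@SOURCE@@
  where $s $p $o )
from @@SOURCE@@
where $r <rdf:type> <owl:Restriction>
and $r <owl:onProperty> $p
and $r <owl:minCardinality> $x
and $s $p $obj
having $k0 <http://tucana.org/tucana#occursLessThan> $x;
```

Here a returned row means a subject uses the restricted predicate fewer times than the declared minimum, so occursLessThan alone is sufficient and no >= operator is needed.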

New Requirements
In order to make sure everything can be done right for the coming release, we are not going to include the rules engine for executing the TQL for each rule. Instead, we are going to make sure that the TQL used for OWL inferencing and consistency checking is complete, and properly documented.

This has a few advantages. First, it makes sure that all the relevant TQL has had adequate time for testing and documentation. Second, it provides the basis for anyone to do their own inferencing. Third, we don't know exactly what use cases there are, and this lets us test what customers actually want, as opposed to what we think they want. A final reason appeals to me personally: we won't need to include Drools with the project, especially as I hope to write our own rules engine instead.

With the completeness and stability of TQL in mind, I'm currently creating a virtual "model" which allows a query to restrict a variable to non-blank nodes, literals, or URIReferences. This was needed for full RDFS compliance, and before now I have not been able to restrict data like this. I'll be continuing work on this in the morning.

rdfs:Class vs. owl:Class
I've been meaning to ask AN about these two class types. OWL describes RDFS Classes as optional, but OWL classes as mandatory. This leads me to question what happens with inconsistencies. If an owl:Class does not exist, but an rdfs:Class does, then should I create the OWL version, or should I report an inconsistency? Do I repeat many of the owl:Class operations for the rdfs:Class as well?

These questions and others are probably answered somewhere, but I haven't found them yet. I'll ask AN in the morning.

TQL vs. iTQL
In the past I've often referred to iTQL when I should really have been saying TQL. The difference is that TQL (the TKS Query Language) is the language, while iTQL is the interactive program that lets you type TQL into it. It's a mistake many of us commonly make, and I'm probably being overly pedantic worrying about it. :-)
