Thursday, May 06, 2004

Transitivity
RDFS inference rules 5a and 8 are both transitive. Transitivity isn't defined in RDFS (unlike OWL), but at least this aspect of these predicates is defined in these 2 rules. Unfortunately, these rules, as they stand, will only infer transitivity for the first link beyond the one defined. In order to infer all possible statements it will be necessary to iterate (or follow recursively) over the links. This has implications for guaranteeing termination, but more importantly for my purposes, there is no facility for iteration or recursion in iTQL.

We've been discussing how to implement statements based on transitivity. We ended up with 2 ideas. As usual, the one from SR was mathematically correct, flexible enough to cope with any conceivable situation, expressed with verbose and difficult to learn syntax, and quite probably slow. He was able to argue for significant efficiencies, but it is based on binomial expansions of the generated data, and is likely to suffer accordingly. AN and I prefer to iteratively traverse the graph finding links as we go, and creating the inferred statements as we get there. Even SR agreed that this would be faster. The syntax would be much simpler, though we have yet to work out how we want to do it.

The iterative approach would have a syntax which basically has a where clause which refers to a triple with the predicate of interest defined, and to then announce that this predicate is transitive. This is unlikely to look like a normal constraint, and as a result it has a few here feeling uncomfortable about it. Maybe so, but at least it will be easy to write and read. Perhaps we can do it with a "special" constraint where we provide a triple stating that the triple has a Tucana property of "transitivity" (like the OWL property). This is certainly a hack, as it is really providing a definition in the where clause, and not a constraint like everything else is. However, it doesn't belong in the select clause, and putting it in the from clause would end up looking like the mathematical approach that SR was advocating. (He wanted to "define" a variable as a select statement which has a from clause which recursively refers to the variable being defined. This variable is then ORed with the model in the from clause of the statement which generates the infered statements).

Whatever syntax we use will be non-standard in regard to current iTQL. However, blindly trying to stick with standard syntax and structure can lead to extremely verbose and complex constructs, while a well defined exception to the rule can lead to something simple, concise and effective. Hence, we're definately going with the iterative approach. One thing we'll miss from SR's approach was a cute trick where all properties which were marked as OWL transitive, could have been automatically inferred. Oh well, we'll just have to do them manually, just like the current rules do.

I was concerned that the iterative approach would not be flexible enough for more complex inferencing rules, but a discussion with AN has assuaged my fears. An example of a more complex rule might be where 2 non-transitive properties become transitive when found together. This can be collapsed back into the original problem simply by defining a new transitive property which is a collection of the non-transitive properties, and applies to their subjects and predicates.

Another example is for a collection of objects which are related in pairs (or more) by a particular predicate. Say we refer to the first object as x then we'll call its paired mate x'. We can envisage a rule with the following:
if
   object is related to subject with a non-transitive predicate
and
   object' is related to subject' with the same non-transitive predicate
then
   the pair of predicates when taken together is transitive.
So the triples:
  a predicate b
  b predicate c
  a' predicate b'
  b' predicate c'
results in:
  a predicate c
  a' predicate c'

AN's solution to this is the same as for the other case. Create nodes which represent the pairs (so for instance, a new node gets made for the collection of a and a'). Then we have the above predicate operating twice on these nodes to connect them (the a -a' node is connected to the b-b' node). These predicates can be grouped together like in the previous example to form a single predicate between the nodes. Now we just have a chain of nodes which can be iteratively have transitivity rules applied to them. This all creates more nodes, but it's simple and covers the general case.

RDFS Rules
Some of the rule organisation in Sesame seem a little inconsistent. 5a refers to transitive properties, and 5b defines what a property is. So these are both grouped into "5". Conversely, rule 7a defines a class, and 7b defines a subclass... but then the transitivity of subclasses is put off into rule 8. I'd have thought it belonged closer to 7b, like 5b was kept close to 5a.

I'd also love to know why the last rule is named with roman numerals! It's the only one which defines a "pattern" for its variable, but I hardly think that's adequate reason.

Patterns in RDFS Rules
This final rule specifies the following premise:
<premise>
   <subject var="xxx"/>
   <predicate var="id" pattern="&rdf;_*" escape="\"/>
   <object var="yyy"/>
</premise>

This is fine if you want to slect everything in a database and filter the results, but it's going to blow scalability out of the water. If you have 20 million triples and only 10 triples in the rdf namespace, then filtering is no longer an option. You need to join this against a constraint that restricts the id to the pattern instead. That way you would only see those 10 triples you were interested in.

For our purposes I'll be changing this rule to have the following premises:
<premise>
   <subject var="xxx"/>
   <predicate var="id"/>
   <object var="yyy"/>
</premise>
<premise>
   <subject var="xxx"/>
   <predicate uri="tucana:startsWith"/>
   <object literal="&rdf;_"/>
</premise>

I should point out at this point that I haven't yet worked out what the escape="\" attribute is all about. :-)

Sure, the new set of premises could have been built internally from the original single premise, but that would not have been very clean, nor would it explain very well what was going on.

Lawrence Lessig
Lawrence Lessig was on TripleJ just now, so we all stopped to listen. He was discussing oft-mentioned piracy that established the music and movie industries in the first place. Heard it all before, but it's nice to hear it in a public arena like that. Maybe people will start to pay attention.

No comments: