Monday, February 21, 2005

To Collect, or not to Collect?
I've decided to come down in the middle and implement the full query structure in RDF, as well as a provision to express everything in iTQL. That way it will be possible to pick up subqueries and other more complex structures without having to deal with them up front, while the complete RDF gives me the full functionality I require in 90% of cases.

Now that I've determined that, I'm left in a quandary about using RDF collections. I really dislike collections. The whole rdf:_n sequence structure is unpleasant, but the rdf:first and rdf:rest cons list structures are a real pain. It's no big deal when described in RDF/XML, but in N3 the whole structure becomes painfully obvious. Collections make sense when it must be possible to duplicate an entry in a list, but otherwise they require the complexity of a trans operation to find all the elements, and that should not be necessary.
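
To see why, here is what a simple three-element list looks like when written out as triples in N3 (the URIs are invented for illustration, and the blank node labels are mine):

  @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

  <domain:myConjunction> <kowari:argList> _:l1 .
  _:l1  rdf:first <domain:constraint1> ;
        rdf:rest  _:l2 .
  _:l2  rdf:first <domain:constraint2> ;
        rdf:rest  _:l3 .
  _:l3  rdf:first <domain:constraint3> ;
        rdf:rest  rdf:nil .

Finding every element means chasing rdf:rest to an arbitrary depth, which is exactly the kind of query that requires trans.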

As a result, I don't want to use collections when describing groups of objects. For instance, a conjunction takes several different constraint arguments. While it is possible to describe this with a collection, it is just as easy to keep re-using a predicate like hasArg. This makes selecting all the arguments of a conjunction as simple as a constraint which says:

  [ <domain:myConjunction> <kowari:hasArg> $c ]
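The conjunction itself then becomes a flat set of repeated statements, which in N3 (again with invented URIs) is just:

  <domain:myConjunction> <kowari:hasArg> <domain:constraint1> .
  <domain:myConjunction> <kowari:hasArg> <domain:constraint2> .
  <domain:myConjunction> <kowari:hasArg> <domain:constraint3> .

No ordering, no blank nodes, and no need for trans to find the arguments.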
So what is the problem? Why don't I just use collections? Well, just take a look at OWL. The range of every predicate that can refer to a group of objects is a list. Since I was talking about conjunctions a moment ago, let's look at a logically analogous predicate, owl:intersectionOf:
  <rdf:Property rdf:ID="intersectionOf">
    <rdfs:label>intersectionOf</rdfs:label>
    <rdfs:domain rdf:resource="#Class"/>
    <rdfs:range rdf:resource="&rdf;List"/>
  </rdf:Property>
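In use, an intersection class looks compact in N3 (the class URIs here are made up):

  @prefix owl: <http://www.w3.org/2002/07/owl#> .

  <domain:TallPerson> owl:intersectionOf ( <domain:Tall> <domain:Person> ) .

But the parentheses are only syntactic sugar for exactly the rdf:first/rdf:rest structure shown earlier, so a triplestore still sees the full cons list.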
Obviously I should be trying for consistency, if not standards conformance (which I'm already doing poorly at, since I'm not using SWRL). So if systems like OWL use collections all the time, then shouldn't I? It would make the rules engine code more convoluted to write, and would take longer to run, but the overhead is a one-off expense, so it shouldn't be too big a deal.

In fact, I'd started using collections already. I'd been away from Kowari for too long over Christmas (and Hawai'i), and being back into RDF/XML, I'd forgotten that I don't like collections when they are stored as triples. Fortunately Andrae asked me about it, which made me pause to give it some thought.

At least now I've thought about it, regardless of the direction I choose. It is almost better to choose a poor direction after carefully considering the reasons than to stumble onto the correct one by accident.

In the end I decided to go with the non-collection approach (re-doing some of my previous work). It's cleaner, and as I've already said, I don't think collections should be used when they aren't needed, simply because of the poor predicate structures they entail.

Object Creation Order
I've been looking at the algorithm for object creation in the rule set. My initial instinct was to follow through the network and build things as I came to them. However, this would be fatally flawed.

The rule network contains many loops and interdependencies. If I only built objects as I came to them, then it would be impossible to set up all of the "triggers" links. Obviously I need to find all inter-linked objects, build them, and only then can I link all of the triggers together.

After considering this carefully, I started to think about the whole linking and triggering scheme. I remember that the RDFS entailment rules link almost every rule to trigger almost every other rule. Since a rule will avoid execution if its constraints have nothing to add, it almost makes sense to trigger everything all of the time. This would make writing rules much easier. For instance, I would not need to work out which OWL rules trigger which other OWL or RDFS rules.

The reason to avoid this is efficiency. The whole point of the network is to prevent unnecessary execution of each rule. If every rule could potentially trigger every other rule, then the only possible efficiency to be gained would be from constraint evaluation indicating that a rule was not needed. While this would work, there is some expense in doing constraint tests, so having rules trigger each other selectively is still valuable.

On the other hand, this consideration has me thinking that it may be useful to have a rule type which triggers all other rules, as sketched below. There are several rules which do this, and it would be useful to describe it in one fell swoop, rather than with dozens of "triggers" statements. It will certainly make the RDF for those nodes simpler and easier to read and write.
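
As a sketch, using hypothetical krule names for the rule vocabulary, the difference in N3 would be:

  # linking each triggered rule explicitly...
  <domain:ruleX> <krule:triggers> <domain:rule1> .
  <domain:ruleX> <krule:triggers> <domain:rule2> .
  # ...and so on for dozens of statements, versus declaring it once:
  <domain:ruleX> a <krule:TriggerAllRule> .

The engine would then treat membership of that class as a standing trigger link to every other rule in the network.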

Constraint Counting
Triggering and execution of all rules is controlled by semi-naïve evaluation on individual constraint nodes (which are effectively rule tests in the parlance of a rules engine): if the number of solutions to a constraint has not changed since the rule last ran, then nothing new has been added and the rule does not need to fire. This means that we need fast counting of a constraint's results.

While counting is done with a significant speed multiplier (~180 times, on average), it is still linear in complexity. DavidM was going to add subnode counts to the AVL tree to change this to O(log(n)): storing the size of each subtree in its node means that a count can be found by summing the sizes of whole subtrees while descending from the root, rather than by iterating over every entry. However, he put it off, since it was not going to be necessary with the new XA2 datastore.

Now that XA2 is not happening (at least, not any time soon), I may need to factor in some time to do this. After all, a system can't scale if any of its core operations are linear, no matter how small the constant factor is (a fast constant just hides the problem for small data sets).

Confidence
While looking for some OWL documentation today I saw the current list of OWL Reasoners (look for "Reasoners" on the main OWL page). I have to say, it does shake my confidence sometimes. There are some great systems in there, so why should I think I can do it any better?

At the very least, I know that Kowari needs a reasoner of its own. Kowari scales better than other RDF stores, so it is important that it has a reasoner that can scale with it. Also, Kowari works with set-at-a-time operations, and I don't believe that any of the listed reasoners do this (though there are one or two that I don't really know... I should look them up). Hence, a Prolog or DL system is not going to reason well over Kowari at all. They would be slow for a start, but more importantly, most DL and Prolog systems work in main memory, which means that they can't scale. Those which do use disk have their own data structures, meaning that data from Kowari would have to be duplicated.

So while I may not end up building the fastest engine available, I'll build the fastest engine that could be used on Kowari. Given the scalability of Kowari, I feel that this is still worthwhile.

Hey, if all else fails, at least I can stick on an OWL reasoner that is known to work! :-)

Thesis Structure
I finally sorted out a broad outline of what my thesis should look like. I still haven't done my confirmation, so I know I'm getting ahead of myself, but I'm pleased to have a plan anyway.

It should break down into the following:
  1. Introduction (obviously).
  2. Literature review: Along with the introduction, this will be built partly from my confirmation paper.
  3. RDF Indexing: This will describe the structure behind the indexes, starting with the mapping to tuples of nodes/sets and how this maps into the rectangular structures that Kowari uses. The evolution to 4-tuples, along with the subsequent discoveries on the number of required indexes, will be included here. This describes why we moved to 6 indexes, and how to expand the system if more than 4-tuples are needed. It will also describe the implementation, with the write-performance advantages of using AVL trees over B-trees. As I mentioned the other day, this is all new work, and I designed and analysed most of it, with some help from others at Tucana. I had some early concern over the relationship of this work to the subsequent OWL work, but since the subsequent sections rely heavily on the effectiveness of the indexing, it is worth going into.
  4. Transitivity: Describes the algorithms and structures used for the trans keyword. This appears to mirror the DISK_TC algorithm of Hirvisalo, Nuutila, and Soisalon-Soininen. The interesting thing here is that the choice of the index structure along with the on-disk tuples implementation (thanks Andrae!) leads naturally to efficient implementations like this.
  5. Set-at-a-time Rules: This will describe the set-at-a-time rules engine I am building. It borrows some ideas from Rete, along with several other features. Again, the choice of the indexes leads to efficient constraint resolution, allowing set operations and semi-naïve evaluation to really pay off here. It also appears that this operation may be analogous to the set-at-a-time XSB engine. A proof of equivalence would lend legitimacy to the technique, as well as prove that it is as efficient as a Prolog engine. Again, it demonstrates how the original index selection has led us naturally into an efficient implementation.
  6. OWL, and DL to iTQL Mapping: A formal description of how Description Logic maps into iTQL. When taken along with the OWL to DL mappings, this gives legitimacy to the operations which are done to implement OWL DL.
  7. OWL Full in iTQL: Introduce the extra features of iTQL which let it go beyond description logic, such as full cardinality. Also introduce the OWL Resolver, which does the OWL description work in a transparent way.
  8. OWL Implementation: Present the full OWL rule set as RDF.
  9. Conclusions.


The nice thing about this is that I should be able to get a few papers out of it. Nothing too fancy, but having read a lot of papers recently, I think the ideas are good enough to be published. I'm inclined to create separate papers for items 3, 4, and 5, while items 6, 7 and 8 would go together as an OWL/DL/iTQL paper.

While I'm here, I should note that I want to carefully compare our algorithm to DISK_TC, to see if there is anything that I've missed that would be useful. However, I believe that the basic structure is the same, meaning that any required modifications would just be a set of tweaks.

Similarly, I need to build a proof that the XSB engine is in fact equivalent to my rules engine design. Again, this proof may help me spot any shortcomings in the design.

So it's all coming together. The following is not all in order, but now I just need to:
  1. Write the code.
  2. Write my confirmation.
  3. Write the papers.
  4. Do the proofs.
  5. Write the thesis.
Easy, huh? :-)
