Thursday, July 22, 2004

Duplicate Variables
AN had a look at the problem of repeating a variable name in a select clause, and spotted a simple way around it. I know it isn't ideal from the perspective of AM, but personally I prefer it as it makes sense to be able to make a query like this. Consequently I was able to change rules 5b and 7b to match the other rules.

In a similar vein, I discovered that count statements also have a problem in select clauses. Count statements are used for "grouping" on variables (as in SQL). However, if a count is used without any accompanying variables then a RuntimeException is thrown from the server. Of course, this is appropriately caught and dealt with, but the fact that an exception like this can be thrown at all is a bug. The only exceptions thrown during a query should come from the iTQL interpreter, and no lower. The interpreter should not pass down any kind of construct that is illegal, and if a construct is legal then the code that executes the query should not throw an exception (except under exceptional circumstances, like a server being unreachable). If anything lower down the stack than the iTQL interpreter throws an exception due to a grammatical construct then it indicates a major problem.
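
To make the layering idea concrete, here is a minimal sketch of checking a select clause up front, before anything is handed down to the query engine. It assumes, purely for illustration, that a bare count is treated as illegal (it could equally be made legal and supported lower down). Every name in it (SelectClauseCheck, QuerySyntaxException, validate) is made up, and select elements are modelled as plain strings rather than anything from the real iTQL interpreter.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical names throughout; this is not the iTQL interpreter's API.
class SelectClauseCheck {

    // Checked exception raised at the interpreter level for a rejected query.
    static class QuerySyntaxException extends Exception {
        QuerySyntaxException(String message) {
            super(message);
        }
    }

    // Select elements are plain strings here (an assumption for the sketch):
    // variables start with '$', aggregates start with "count(".
    static void validate(List<String> selectElements) throws QuerySyntaxException {
        boolean hasCount = false;
        boolean hasVariable = false;
        for (String element : selectElements) {
            if (element.startsWith("count(")) {
                hasCount = true;
            } else if (element.startsWith("$")) {
                hasVariable = true;
            }
        }
        // Reject the construct here, so nothing lower down has to throw.
        if (hasCount && !hasVariable) {
            throw new QuerySyntaxException("count used without an accompanying variable");
        }
    }

    public static void main(String[] args) {
        try {
            validate(Arrays.asList("$s", "count(...)"));
            System.out.println("accepted: variable plus count");
            validate(Arrays.asList("count(...)"));
            System.out.println("accepted: bare count");
        } catch (QuerySyntaxException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The point is only where the exception originates: the caller sees a query-level error from the top layer, and the execution code never receives a construct it cannot handle.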

In the meantime I've logged the count problem on Sourceforge.

Tweaking
I tweaked the rules a bit today in order to make them faster. There was a little re-coding, and I re-ordered the tests in the .drl file. It seemed to help, but I'm still not completely happy with it.

Given the way that the system is currently built, the biggest problem is that all of the rules objects have the same type. Drools needs to test everything in the working memory by label in order to make sure that it has the objects required for a rule. This is really because Drools was designed to work on data which is in its working memory, rather than on rules objects which in turn work on the data. It's made me think that it might be a little too heavyweight a framework to provide what we need, but I'm sticking with it for now. After all, it is working.

As I mentioned yesterday, it may be possible to make Drools operate a little more efficiently if each rule is given its own type. However, I don't really like that idea, as it doesn't scale to new rules at all. I'm also thinking that Drools might simply find its variables via instanceof statements, meaning that a label comparison is almost as efficient anyway.

The other change I could make would be to have the rules objects operate on Kowari/TKS at a lower level than iTQL. However, this means that each rule would need to be coded in Java, and would not really be configurable. So long as efficient iTQL commands are available, I don't see that much would be gained by this approach. For the moment, not all of the rules have efficient iTQL at their disposal, so it's a tempting avenue to take, but I think it should be avoided as long as possible.

Shortcomings
Rule XI is desperately in need of prefix matching to be made available from the Kowari/TKS string pool. This would be relatively easy, given that strings are stored in lexical order, but no one has the time to implement it.
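
The reason lexical ordering makes this easy is that every string sharing a prefix sits in one contiguous range. Here is a minimal sketch of that idea, using a java.util.TreeSet as a stand-in for the string pool. PrefixLookup and withPrefix are made-up names, the sample strings are invented, and nothing here reflects how the Kowari/TKS string pool is actually implemented.

```java
import java.util.Arrays;
import java.util.NavigableSet;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical stand-in: a sorted set of strings, not the real string pool.
class PrefixLookup {

    // Returns every string in the sorted pool that starts with the given prefix.
    static SortedSet<String> withPrefix(NavigableSet<String> pool, String prefix) {
        // Matches sort at or after the prefix itself, and before the prefix
        // followed by the largest char value. (A string whose next char is
        // exactly \uFFFF would be missed; ignored in this sketch.)
        String upper = prefix + Character.MAX_VALUE;
        return pool.subSet(prefix, true, upper, false);
    }

    public static void main(String[] args) {
        NavigableSet<String> pool = new TreeSet<>(Arrays.asList(
            "http://example.org/a#_1",
            "http://example.org/a#_2",
            "http://example.org/a#type",
            "http://example.org/b#_1"));
        // Prints only the strings beginning with the given prefix.
        System.out.println(withPrefix(pool, "http://example.org/a#_"));
    }
}
```

Both ends of the range are found by ordinary ordered lookups, so the cost is two searches plus a walk over the matches, rather than a scan of every string in the pool.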

While on the string pool, another problem is a lack of types. This has a major implication for finding anonymous nodes. Since statements are being inserted into the inference graph in bulk, the only way to remove anonymous nodes is to go through after the insert and remove them one at a time. This is the second worst possible solution (the worst would be to filter the inferred statements before inserting them one at a time). Types would let us select all statements containing a "resource" in the appropriate position, skipping the anonymous nodes.

The other immediate problem is tautologies, that is, inferred statements which already exist in the base data. It's a shame that inferred data has to go into a separate model, as redundant insertions into the base model would get silently and efficiently dropped.
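
In set terms, the statements worth keeping are the inferred set minus the base set. The following is only an in-memory illustration of that definition, with statements modelled as three-element string lists. TautologyFilter and newStatements are made-up names, and this says nothing about how the filtering would be done efficiently against separate Kowari models.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration; a statement is a [subject, predicate, object] list.
class TautologyFilter {

    // Returns the inferred statements that are not already in the base data.
    static Set<List<String>> newStatements(Set<List<String>> base,
                                           Set<List<String>> inferred) {
        Set<List<String>> result = new HashSet<>(inferred);
        result.removeAll(base); // drop anything the base data already states
        return result;
    }

    public static void main(String[] args) {
        Set<List<String>> base = new HashSet<>();
        base.add(Arrays.asList("ex:Dog", "rdfs:subClassOf", "ex:Animal"));
        base.add(Arrays.asList("ex:rex", "rdf:type", "ex:Dog"));

        Set<List<String>> inferred = new HashSet<>();
        inferred.add(Arrays.asList("ex:rex", "rdf:type", "ex:Dog"));    // already in base
        inferred.add(Arrays.asList("ex:rex", "rdf:type", "ex:Animal")); // genuinely new

        System.out.println(newStatements(base, inferred));
    }
}
```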

Testing and Integration
With RDFS going I'll need to check the tests on the RDF site. Unfortunately I'll need to build a translation layer, as Kowari doesn't yet allow for N3 output (which these tests rely on).

I talked with AN about integrating the inferencing rules with the rest of the system, but he hasn't yet worked out where he'd like them inserted. I know that he doesn't want to make a decision that makes it hard to use the rules for backward chaining, but I think that the choice of Drools is never going to work for that anyway. Hopefully we can discuss that in the morning.

Extending the Rules
RDFS is only the first step on a long road of inferencing. Next will be OWL Lite. I have yet to determine what the rules will look like there, but I know that many of them won't be like the ones for RDFS.

Apparently Jena tried to go with simple rules for OWL, but had to incorporate a lot of Java code to do it instead. I think I'll go through the list of available inferences first, and see which ones map easily to iTQL-based rules. Then I'll see what is needed for the remaining ones (the obvious one that comes to mind is cardinality). These should individually be easy enough to implement, but a generic framework would certainly be better.

Non-Rules Solutions
The whole rules-based structure has not left me feeling impressed with its efficiency. It is possible to make each individual rule more efficient (as we did with the trans statement), but the overall structure has not received all that much attention. I asked Bob what he knew of alternatives, and he suggested something called tableaux. I know nothing of these yet, but I'm thinking I should check into them shortly.
