Saturday, May 28, 2005

SMOC
Coding over the last couple of days has gone well. I've been using the standard XP cycle of incrementally adding a bit, and running it to see that it works as intended. At this stage "working" is really a matter of making sure that the logs contain the data they are supposed to contain.

It's all been going well, but I'm only part way through. As Andrae likes to say, it's just a Small Matter Of Coding (SMOC). Meaning that the design is done, but the rest of the work is going to take some time. Now that I have the classpath issues resolved, I'm getting through a few hundred lines of code per day, so I'm happy with the current pace. I just hope I don't run into any other major snags.

One issue that I've hit while building the Query objects has been the difference between a krule:Variable and a URIReference. This was bothering me when I wrote the RDFS, and the parsing of it made me consider it again.

Every time I run into a constraint element I can ask for its type. In the case of a variable, I can then use the name to create a new variable. Easy. But when I get a krule:URIReference I need to look for a krule:refersTo property to find the URI to construct the object. I hate if() constructs in code when I can avoid them. :-)
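To make that concrete, here's roughly the shape of that branch. None of these names come from the real parser: the ConstraintElement interface, the getType()/getName()/getRefersTo() accessors, and the krule namespace URI are all placeholders I'm using for illustration, so treat it as a sketch rather than the actual code.

import java.net.URI;

// Hypothetical stand-in for whatever the parser hands back for a constraint element.
interface ConstraintElement {
  URI getType();       // the rdf:type of the element
  String getName();    // only meaningful for a krule:Variable
  URI getRefersTo();   // only meaningful for a krule:URIReference
}

class ConstraintElementBuilder {
  // Placeholder namespace; the real krule URIs live elsewhere.
  static final URI VARIABLE = URI.create("http://example.org/krule#Variable");
  static final URI URI_REFERENCE = URI.create("http://example.org/krule#URIReference");

  Object buildElement(ConstraintElement element) {
    URI type = element.getType();
    if (type.equals(VARIABLE)) {
      // A variable carries everything it needs in its own name.
      return new Variable(element.getName());
    } else if (type.equals(URI_REFERENCE)) {
      // A URI reference needs the extra hop through krule:refersTo.
      return new UriReference(element.getRefersTo());
    }
    throw new IllegalArgumentException("Unexpected element type: " + type);
  }
}

class Variable {
  Variable(String name) { /* ... */ }
}

class UriReference {
  UriReference(URI uri) { /* ... */ }
}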

On the other hand, there is no real need for a variable to pick up a new attribute. If it did, then it should really be rdf:value.

This made me think about the krule:refersTo property that I've used for krule:URIReference, and so I looked up rdf:value again. I had thought that this was a datatype property (to use OWL parlance) instead of an object property, but the range is rdfs:Resource. Also compelling is the comment that the use of this property is encouraged to help define a common idiom. All of this was enough to make me change krule:refersTo to rdf:value on krule:URIReference.

You'd think that would be enough to convince me to add a similar value to the variable, but I haven't. :-) The reason for this is usability and readability of the RDF for rules. Ideally I'd not have an extra indirection on the URI for URIReference either, but that has semantic consequences. I'd end up saying that some arbitrary URI has an rdf:type of krule:URIReference, which is just plain wrong.

The problem with having this difference between values and URI references is that each of them uses a different set of conjunctions to get all the data for construction. This means that I need two separate queries. To minimise the queries I'm doing, I'm initialising the whole rule reading procedure with a method that reads in everything of type krule:URIReference along with their values, and maps one to the other. This map is then available to any method that sees a krule:URIReference and needs to get the URI it refers to.
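A rough sketch of that initialisation step follows. The RefValuePair row type and the idea of handing the rows in as an Iterable are assumptions for illustration; the real code would populate this from the resolver's own query results.

import java.net.URI;
import java.util.HashMap;
import java.util.Map;

class RuleReader {

  /** Maps each krule:URIReference node to the URI it refers to. */
  private final Map<String, URI> uriReferences = new HashMap<String, URI>();

  /** Hypothetical query row: a URIReference node and its rdf:value. */
  static class RefValuePair {
    final String referenceNode;
    final URI value;
    RefValuePair(String referenceNode, URI value) {
      this.referenceNode = referenceNode;
      this.value = value;
    }
  }

  /** Run once, up front: read every URIReference together with its value. */
  void initialiseUriReferences(Iterable<RefValuePair> rows) {
    for (RefValuePair row : rows) {
      uriReferences.put(row.referenceNode, row.value);
    }
  }

  /** Later, any method that meets a krule:URIReference just looks it up. */
  URI resolve(String referenceNode) {
    URI uri = uriReferences.get(referenceNode);
    if (uri == null) {
      throw new IllegalStateException("Unknown URIReference: " + referenceNode);
    }
    return uri;
  }
}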

Entailment and Consistency
Inferencing falls into two areas: entailment and consistency. I'm going to need to handle both. I haven't yet given much thought to consistency checking, but as it turns out that hasn't been a problem.

Early on, I had thought that there would be a lot of consistency tests to perform, but after my learning exercise with cardinality I realise that consistency is not as strict as I'd thought. In the general case, if there is any possible interpretation that can make the data and its ontology correct, then the data is consistent. That interpretation can rely on the most unlikely of statements, ones which are not even in the datastore. It is only when there are direct conflicts that an inconsistency becomes evident. Conflicting cardinality values, sameAs/differentFrom pairs, and the like are the most obvious examples. Fortunately, many of the less obvious examples will show up in the simpler tests after entailment has been performed, so the number of tests to be performed is not as daunting as it may first appear.

In the meantime, I've been learning more about logic at Guido's Wednesday sessions, which has also given me a better understanding of what I need to do.

As a first iteration, I'll be performing consistency checks only after entailment is complete. This will be more efficient, as it will only need to be done once, and will wait until all possible conflicting statements have been generated.

The problem with this approach is that it may be difficult to tell where the conflicting data came from. If entailed data conflicts with other entailed data, then the original data may be difficult to find (this is especially the case when the entailed data was entailed from other entailed data). The only way I can see around this is to perform checks after each entailment operation. This could get very expensive. Maybe it could be set as a debugging flag by the user?
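To make the trade-off concrete, here's the rough shape of what such a flag might control. All of the class names here are invented for the sketch; it only illustrates the control flow of checking after every entailment step versus checking once at the end.

import java.util.List;

class InferenceSession {

  interface Rule { void apply(); }                  // an entailment rule
  interface ConsistencyCheck { boolean holds(); }   // true if no conflict is found

  /** Hypothetical debugging flag: check after every rule, or only at the end. */
  private final boolean checkEachStep;

  InferenceSession(boolean checkEachStep) {
    this.checkEachStep = checkEachStep;
  }

  void run(List<Rule> rules, List<ConsistencyCheck> checks) {
    for (Rule rule : rules) {
      rule.apply();
      // Expensive, but pinpoints which entailment introduced the conflict.
      if (checkEachStep) assertConsistent(checks);
    }
    // The cheap default: a single pass once all entailments are in place.
    assertConsistent(checks);
  }

  private void assertConsistent(List<ConsistencyCheck> checks) {
    for (ConsistencyCheck check : checks) {
      if (!check.holds()) {
        throw new IllegalStateException("Inconsistency detected by " + check);
      }
    }
  }
}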

Another thing to try during debugging will be more complicated consistency checks. Further entailments may show an inconsistency with simple tests, but it would be ideal if inconsistencies could be found and acted upon immediately. However, I believe that this approach could be of arbitrary complexity, bounded only by the amount of work that the developer wishes to put into it. As a result, I don't think I'll be pursuing this for the time being. It would only be of use for debugging anyway, so the case to implement more complex tests would have to be very strong (and involve a lot of $$$). :-)

Even with simple tests, I still need to define the consistency tests. The current rules are for entailment only, so I need to expand the vocabulary to handle consistency as well. The structure will be very similar to the entailment rules, so it should be easy to extend my current system to handle both. The main differences are when to run the rules, and that a consistency rule tests for the existence of any result rows rather than inserting results back into the model.
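In terms of the rule structure, that difference might look something like this. The Query and Model interfaces are placeholders for whatever the engine already provides, so this is only a sketch of the shape of the two rule types, not the real classes.

import java.util.List;

interface Query { List<Object[]> execute(); }           // placeholder for the real query type
interface Model { void insert(List<Object[]> rows); }   // placeholder for the target model

/** An entailment rule inserts its result rows back into the model. */
class EntailmentRule {
  void run(Query query, Model model) {
    List<Object[]> rows = query.execute();
    if (!rows.isEmpty()) model.insert(rows);
  }
}

/** A consistency rule only needs to know whether any rows exist at all. */
class ConsistencyRule {
  boolean check(Query query) {
    // Any matching row means the data directly conflicts with the ontology.
    return query.execute().isEmpty();
  }
}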
