Thursday, September 30, 2004

I've spent pretty much the whole day documenting, so there's not a lot to say here. Somehow the following ended up lengthy anyway, and given that it's late I won't be proofreading it until later. Read at your own risk. :-)

The early part of the day was spent modifying the camera ontology from XFront to better demonstrate the OWL-Lite features we needed. The original camera ontology has moved on their site since it was first published, but thanks to Google I discovered that it's still there.

I then had to create an RDF file which declared cameras with various properties. Short on inspiration, I went and found real specs for cameras from Olympus and Pentax. I even discovered what a "Large-Format" camera is, and included several models from Sinar as well. It's enough to make you interested in photography.

We now have valid cameras, and several invalid cameras. The invalid ones violate restrictions on cardinality, someValuesFrom and allValuesFrom. We also have the documentation to explain how to validate the RDF using iTQL. This iTQL is the basis for the consistency and entailment rules in Kowari, and will eventually be incorporated into the rules engine (which works, but is not integrated and not all of the rules are written).

One problem I had was with namespaces. Now I'm the first to confess that I really don't know much about XML. I just stumble around in it, mostly copying bits from other sources as I need them. Learning it all seemed too time consuming when I have so many other things to be reading.

To start with, for convenience I declared an entity:

<!ENTITY camera "">
Note the # on the end? I watched AN use this namespace to declare an instance in the following way:
<rdf:Description rdf:about="&camera;Olympus-OM-10">

Not too pretty, but when I queried the database for this node I got:
Which was what I expected.

However, taking a lead from XFront (and some half-remembered method of declaring a namespace for a node) I tried to use it this way:
<rdf:Description rdf:about="Olympus-OM-10" xmlns="&camera;">

This had the unexpected effect of losing the #:
Now I know this will be my fault, but I really don't understand how the # got lost here. Maybe I really do need to learn XML.

A New Type Predicate
After sleeping on the OWL description problem, I realised that it is not such a difficult problem. What we need is another "magical" predicate that lets us define "type". At the moment we can say:
  $instance <rdf:type> <ns:Class>
This will pick up all simple class types, and after a round of using trans($x owl:subClassOf $y) it will also pick up every superclass that was explicitly declared. However, this will not track down if a complex class expression is applicable to an object instance.
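
As a sketch of what that trans() expansion computes, here's the superclass closure in Python. This is a toy with hypothetical class names, nothing to do with the Kowari implementation:

```python
def all_superclasses(cls, direct_supers):
    """Transitive closure over a {class: set of direct superclasses} map."""
    seen = set()
    stack = [cls]
    while stack:
        for parent in direct_supers.get(stack.pop(), set()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)   # follow the chain upwards
    return seen

hierarchy = {"X": {"Y"}, "Y": {"Z"}}
print(all_superclasses("X", hierarchy))  # the set {'Y', 'Z'}
```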

Adapting some terminology I picked up in a recent paper, I believe that we can solve the problem if we introduce a predicate which describes if an expression subsumes a node. For instance, we may have a node of type X, and we need to see if it is subsumed by an expression E. A use case for this is to see if the node meets an owl:allValuesFrom restriction on a predicate that a subject uses to refer to it. If X is a subclass of Y, which is a subclass of Z, and E is a union between T and Z, then the instance will meet the criteria, and E can be said to subsume the instance. This is a much more complex expression than would normally be described in iTQL, and a single operation which could return true or false on the matter would be very useful.

The operation to test if x was subsumed by description D could go as follows:
1. Get the types of x. If this is an empty list then return "unknown" since we don't know anything about x and can't say anything about it. (Throw an exception to do this?)
2. If any item in the types list is equal to D then return true (this could be skipped, as the list gets checked again in step 4, but it does allow for a quick exit).
3. Get all the supertypes for x and add them to the list of types.
4. If any item in the current types list is equal to D then return true.
5. If D is an owl:Restriction then test the instance with the restriction and return the result.
6. If D is an owl:unionOf then recurse for each description in the union, returning the ORed results.
7. If D is an owl:intersectionOf then recurse for each description in the intersection, returning the ANDed results.
8. If D is an owl:complementOf then recurse on the description contained in the expression, and return the complement (using a logical NOT).
9. If D is an owl:oneOf then test if x is equal to (or declared the owl:sameIndividualAs) any of the elements in the list, returning true if it is, or false otherwise.
10. At this point throw an exception, because the description D was not well formed.
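
The steps above can be sketched in Python. Everything here is my own toy representation, with descriptions written as strings (named classes) or tagged tuples (expressions), and not based on any actual Kowari code; `superclasses_of` is assumed to return all (transitive) supertypes:

```python
def subsumed_by(x, d, types_of, superclasses_of, check_restriction):
    """Toy version of the subsumption test: does description d subsume x?"""
    def recurse(e):
        return subsumed_by(x, e, types_of, superclasses_of, check_restriction)

    types = set(types_of(x))
    if not types:                       # step 1: nothing known about x
        raise ValueError("no type information for %r" % (x,))
    if isinstance(d, str):              # steps 2-4: d is a named class
        if d in types:                  # step 2: quick exit
            return True
        for t in list(types):           # step 3: add all supertypes
            types |= superclasses_of(t)
        return d in types               # step 4
    kind, args = d
    if kind == "restriction":           # step 5: test against the restriction
        return check_restriction(x, args)
    if kind == "union":                 # step 6: OR over the members
        return any(recurse(e) for e in args)
    if kind == "intersection":          # step 7: AND over the members
        return all(recurse(e) for e in args)
    if kind == "complement":            # step 8: logical NOT
        return not recurse(args)
    if kind == "one_of":                # step 9: enumeration membership
        return x in args
    raise ValueError("malformed description: %r" % (d,))  # step 10

# The example from above: x of type X (here "Camera", a subclass of
# "Device"), and E a union containing "Device".
types_of = lambda x: {"olympus-om10": {"Camera"}}.get(x, set())
superclasses_of = lambda t: {"Camera": {"Device", "Thing"}}.get(t, set())
no_restrictions = lambda x, r: False    # stand-in for the restriction test

expr = ("union", (("one_of", ("pentax-k1000",)), "Device"))
print(subsumed_by("olympus-om10", expr,
                  types_of, superclasses_of, no_restrictions))  # True
```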

This can be done in a straightforward manner, with the execution attached to a predicate like <tucana:subsumedBy>. I could even use a <tucana:type> predicate, as it is conceptually similar to an <rdf:type>. Using a predicate like this I could modify the iTQL for the incomplete owl:someValuesFrom test:
select $s $p $x count(
    select $o
    from <rmi://>
    where $s $p $o
    and $o <rdf:type> $x)
from <rmi://>
where $r <rdf:type> <owl:Restriction>
and $r <owl:onProperty> $p
and $r <owl:someValuesFrom> $x
and $c <rdfs:subClassOf> $r
and $s <rdf:type> $c
and $s $p $obj
having $k0 <tucana:occursLessThan> '1'^^<xsd:nonNegativeInteger>;
This would be changed to the slightly shorter:
select $s $p $x count(
    select $o
    from <rmi://>
    where $s $p $o
    and $o <rdf:type> $x)
from <rmi://>
where $r <rdf:type> <owl:Restriction>
and $r <owl:onProperty> $p
and $r <owl:someValuesFrom> $x
and $s <tucana:subsumedBy> $r
and $s $p $obj
having $k0 <tucana:occursLessThan> '1'^^<xsd:nonNegativeInteger>;
Thinking about this just now made me realise that this iTQL may not even be strictly necessary. Having a subsumption operation as I just described takes away a lot of the work of the restriction consistency tests, as the operation does the consistency tests internally. In this case, it may just be necessary to see if an instance is subsumed by the description that an rdf:type declaration refers to. In other words, if a statement says:
  <ns:instance>  <rdf:type>  <ns:Type>
Then consistency could be tested simply by checking if ns:instance is subsumed by ns:Type. Of course, this would necessitate skipping ns:Type in the list when executing steps 2 and 4, or else ns:instance would be considered consistent simply because it was declared to be of type ns:Type. I should think about this a little more, but I'm pretty sure this last idea will work.

Of course, this means that I need some way to test an owl:Restriction in the general case, but this is not all that hard. There are essentially 3 types of restrictions, and each is placed on a predicate. All that is needed is to take the instance in question, and find all statements which use that instance as a subject with the given predicate:
1. For owl:allValuesFrom and owl:someValuesFrom test subsumption of the objects from the found statements. For owl:someValuesFrom return true the moment an appropriate object is found, or false if none are found. For owl:allValuesFrom return false the moment an inappropriate object is found or true if all objects are correct.
2. For the cardinality restrictions, count the objects found, and return true if the count is at least owl:minCardinality, at most owl:maxCardinality, or exactly equal to owl:cardinality, as appropriate.
3. For owl:hasValue, test each object, returning true as soon as one is found to be equal to the given value, or false if nothing in the list matches.
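
These three cases can be sketched roughly as follows. The dict-based representation of a restriction, and all of the names here, are my own invention for illustration; `subsumes(obj, description)` stands in for the subsumption operation described earlier:

```python
def check_restriction(instance, restriction, statements, subsumes):
    """Test one restriction against all statements using `instance` as
    subject with the restricted predicate. `statements` is a list of
    (subject, predicate, object) triples."""
    objs = [o for s, p, o in statements
            if s == instance and p == restriction["onProperty"]]
    kind = restriction["kind"]
    if kind == "someValuesFrom":    # true as soon as one object qualifies
        return any(subsumes(o, restriction["values"]) for o in objs)
    if kind == "allValuesFrom":     # false as soon as one object fails
        return all(subsumes(o, restriction["values"]) for o in objs)
    if kind == "minCardinality":
        return len(objs) >= restriction["n"]
    if kind == "maxCardinality":
        return len(objs) <= restriction["n"]
    if kind == "cardinality":
        return len(objs) == restriction["n"]
    if kind == "hasValue":          # true as soon as one object matches
        return restriction["value"] in objs
    raise ValueError("unknown restriction kind: %r" % (kind,))

statements = [
    ("camera1", "hasLens", "lens1"),
    ("camera1", "hasLens", "lens2"),
]
exactly_one = {"kind": "cardinality", "onProperty": "hasLens", "n": 1}
print(check_restriction("camera1", exactly_one, statements,
                        lambda o, d: True))  # False: two lenses, not one
```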

Looking at it, this takes a lot of work out of writing iTQL for consistency checking, and the code to do it all will be quite straightforward. Once we have the current release out of the way I'll present it all to AN and TJ and confirm that it's as effective as I believe.

Wednesday, September 29, 2004

The majority of today was spent on documentation. This started with some updates to the docs for the "having" clause, which I wrote last night, and moved onto the "node type" models. I tried to structure this similarly to the documentation for the XSD data type models, though I provided more examples.

Once I'd finished with this, I moved onto working on the example data AN needs for the demonstration iTQL used to implement OWL-Lite rules. He had hoped to use the Wine ontology, but it turned out to have a number of OWL Full features, which made it incompatible with the queries he wanted to perform.

Initially I suggested removing the statements which are incompatible with OWL-Lite, but the size of the file made me reconsider this. So now I'm looking at fleshing out the Camera ontology to use a few more features.

The whole morning was spent looking at the iTQL required to make owl:allValuesFrom work. It didn't take long before discovering that there was a problem in our ability to perform the query.

Using iTQL it is quite easy to find a predicate that has a restriction placed on it, and in the case of owl:allValuesFrom, the class of the range for that predicate is also easy to pick up. Passing these predicates and the associated range into a subquery, it is quite straightforward to find those statements which use that predicate and have objects which are instances of the range class. With an "exclude" operation (as of today, "not" is named "exclude" to make the semantics of this operation on constraints much clearer) it is a simple matter to find usages of the predicate with objects which declare a type outside of the given range.

This sounds fine, until you realise that some objects declare that they have more than a single type. A predicate might be used on an instance of the appropriate class, but if this is also an instance of another class, then our query will pick it up as being of the incorrect class.

AN and I spent hours trying to find some way of taking the list of instances of classes which are of the wrong type, and removing those objects which are also instances of the correct type. Every time I thought of a different way of approaching the problem, I always ended up with a set of data from which I needed to extract some other data, and yet there seemed to be no way to do that.

In the end, AN believed that he had found a way of doing it, using "exclude" and 3 levels of subqueries. Unfortunately, it exploits a corner case for "exclude" which has not been properly implemented, so he is going to work on coding that tomorrow. However, the partial query he already has contains enough information to make him believe that he has it right. I'll confess that I didn't follow it properly when he showed it to me, so I'm going to take his word for it until I see it in action.

AN claims that he and AM determined that it is possible to get any type of result using the "and" and "or" operators in conjunction with "exclude" and subqueries. I'm not so sure about this, but even if it is true, the complexity of the subqueries in order to obtain some results is ridiculous.

Instead of this I believe that we need a "difference" operator. This would work internally in a manner similar to the "and" inner join operator: it would match the two constraints' variables, and wherever rows from both constraints match, those rows would be removed from the first constraint. This would allow numerous operations which we have not been able to do in the past, including this morning's owl:allValuesFrom query.
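
A rough sketch of the semantics I have in mind, treating constraint solutions as lists of variable-binding rows. These are my own toy structures, not the eventual implementation:

```python
def difference(minuend, subtrahend):
    """Remove from `minuend` every row that agrees with some row of
    `subtrahend` on the variables the two constraints share."""
    if not minuend or not subtrahend:
        return list(minuend)
    shared = sorted(set(minuend[0]) & set(subtrahend[0]))
    drop = {tuple(row[v] for v in shared) for row in subtrahend}
    return [row for row in minuend
            if tuple(row[v] for v in shared) not in drop]

wrong_type = [{"s": "a"}, {"s": "b"}]   # subjects flagged as the wrong type
right_type = [{"s": "b"}]               # ...but "b" is also the correct type
print(difference(wrong_type, right_type))  # [{'s': 'a'}]
```

This is exactly the operation we were missing this morning: take the instances that appear to violate the range, and subtract those which also declare the correct type.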

Some years ago I thought that we might need an operation like this, but somehow we have always managed to avoid it. Once I discussed this with others I discovered that several others have had similar thoughts. SR had once proposed such an operation, and DM had also come up with a join operation which he called "and-not". In each case the idea was to perform an operation just as I've described here. It looks like it might be time to finally implement it. Fortunately SR looked at how it would be accomplished, and he thinks that it will be very similar to constraint conjunctions, and just as efficient.

I finally discovered the difference between owl:allValuesFrom and rdfs:range, and of course, now I feel really silly.

The mistake I made was that I thought that the restriction was applied to the predicate globally. I have indeed seen that owl:Restrictions are used by having owl:Class types inherit from them, but I had conveniently forgotten this. When applied correctly, the restriction works on a particular class.

The application of this is that a restriction on a predicate is only applicable when that predicate is being used on a subject of the type that subclasses the restriction. Any types which do not inherit from the restriction can use that predicate with impunity. This is different to rdfs:range, which declares the object type for a predicate no matter what the type of the subject is.

An example could be a "ns:hasParent" property. If this property is applied to a subject of type "ns:Human" then I want to ensure that the object is also of type "ns:Human". However, I don't want to declare the range of this predicate to be ns:Human, because when applied to subjects of type ns:Dog then the object will also need to be of type ns:Dog.
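
As a toy illustration of that difference (entirely made-up names, and ignoring subclass inheritance of the restriction for brevity): a class-local allValuesFrom check only fires for subjects whose class carries the restriction, whereas an rdfs:range would constrain every use of the predicate.

```python
# class -> required object class for ns:hasParent under its restriction
restrictions = {
    "Human": "Human",
    "Dog": "Dog",
}

def parent_ok(subject_class, object_class):
    """True if this use of ns:hasParent satisfies the subject class's
    allValuesFrom restriction (or if no restriction applies)."""
    required = restrictions.get(subject_class)
    return required is None or object_class == required

print(parent_ok("Human", "Human"))  # True
print(parent_ok("Dog", "Human"))    # False
print(parent_ok("Robot", "Human"))  # True: no restriction inherited
```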

Actually, this demonstrates some of the distinct advantages that OWL has over simpler systems like RDFS. Once I realised this, I also realised that an owl:allValuesFrom restriction based on a union of classes could start to get complex to enforce. Fortunately, unions are an OWL DL construct, but I will definitely be looking at this during the course of my research.

Just looking at the syntax now, I've realised that it is possible to have a restriction of the following form:

<owl:Class rdf:ID="Riesling">
  <rdfs:subClassOf>
    <owl:Restriction>
      <owl:onProperty rdf:resource="#hasFlavor" />
      <owl:allValuesFrom>
        <owl:Class>
          <owl:unionOf rdf:parseType="Collection">
            <owl:Class rdf:about="#Sweet" />
            <owl:Class rdf:about="#Dry" />
          </owl:unionOf>
        </owl:Class>
      </owl:allValuesFrom>
    </owl:Restriction>
  </rdfs:subClassOf>
</owl:Class>
This would make the owl:allValuesFrom iTQL much more complex. In fact, since things like OWL descriptions are recursively defined in terms of class IDs, restrictions, unions, intersections, complements and one-ofs, then we probably need to find a more general way of representing a description in iTQL. This will be needed if we ever hope to implement OWL DL, since a "description" in a restriction can be arbitrarily complex, and iTQL as it stands will need to be just as complex to traverse the entire length of the description.

Tonight I'm probably going to have nightmares about restrictions which are owl:allValuesFrom of an owl:unionOf of classes, one of which is another owl:Restriction with an owl:someValuesFrom on an owl:intersectionOf of an owl:complementOf..... and so on.

Now that I think about it, we can't even contemplate OWL DL until we've worked out how to traverse and represent descriptions in iTQL. And it's called "DL" because it's supposed to be computable! :-)

I think I need to learn more Prolog.

Tuesday, September 28, 2004

TJ wanted to get everything as stable as possible today, so rather than implement the BlockCacheLine.reset() method I've been concentrating on documenting the iTQL needed for OWL. I did get to speak to AM on the phone tonight, so I have an idea of what I will need to do if I get to this before Monday, when AM gets back.

In the meantime, the JXUnit tests for the node type code needed to be checked in. To ensure that the created HybridTuples never gets large enough to be backed by a file, I put a step into the test which drops all the models. Unfortunately the test still failed because the HybridTuples was still too large.

I put some more steps into the test, this time counting how large the string pool was. The result was a number somewhat larger than 222,000. That was a little larger than the 4 I expected! I added a step to print out the contents of the # model, and this showed that there was a Wordnet model loaded. That would certainly explain the size of the string pool.

After looking at the step that dropped all the other models (including the Wordnet model) I realised that it must not continue to drop models once a single model fails to be dropped. Since the Wordnet model was near the end of the list, this would explain the problem. So I created a single step which dropped just the Wordnet model. This fixed the problem, and the tests all started running well.

The only problem with this "fix" for the tests, is that the Wordnet model now needs to be re-loaded by another test which needs it. This slows the whole test suite down by about 10 minutes on my machine.

Yes, I'm back to calling it iTQL even though that's the shell and not the language. Everyone else has fallen into the habit, so it's just easier.

I'm working on documenting the iTQL required to implement OWL, restricting the entailments to those which do not require a rules engine. These require some example data, so we are using the Wine ontology as a framework for this.

I've also provided some documentation for the "having" clause, and I will need to write something which describes how to use the new "node type" models.

I had a meeting with Bob today, and we discussed a number of technical issues based on what I've been learning recently. This was probably more out of a sense of convincing him that I'm making progress, rather than actually getting the benefit of him as a supervisor. However, I did find it useful to talk to someone who has a good technical knowledge of this stuff, as it helped to clarify a few things for me.

After talking with SR, and today with Bob, I now know that Kowari is capable of being a complete 2nd order predicate logic system. However, there may not be a lot of point to that. For instance, is there really a need to express the existence of a statement which uses a particular predicate, without actually having such a statement? As such, I'm starting to understand the advantages of a 1.5 order system.

One thing I mentioned was that it has struck me that many of the DL and predicate logic equations used to describe consistency checks and entailments for OWL are incapable of describing the property of the predicate which defines the rule to be run. For instance, the consistency check for owl:someValuesFrom is a predicate logic expression:

 ∀X∃Y : C(Y) ← P(X,Y) ∧ D(X)
Note that nowhere in this expression does it state that the predicate P must have a restriction of [owl:someValuesFrom Y] declared on it. The mechanism which finds the predicates to apply this to is separate from the equation. I have already mentioned some of this in an earlier blog.

Describing attributes of predicates in this way is only possible in 2nd order logic. This is one of the important features that iTQL supplies for us, and one of the advantages of a 1.5 order system. However, in order to gain a better understanding of what a 2nd order (or 1.5 order, if I'm to look at the limited case) system entails, I will need to learn a bit more about higher order logics. And here I haven't even finished learning predicate logic!

The most useful part of my meeting with Bob came at the end. He pointed out that he would like to see my confirmation paper by early next year. Given that I'm enrolled part time, I thought that he might have given me longer, but I'm actually pleased to be made to think about it earlier.

My only apprehension about my confirmation is that the more I learn about this stuff, the stupider I feel. I now know that everything I have learnt in the course of Kowari is really just a re-invention of earlier work, albeit with different structures, efficiencies, etc. If I'd had any doubt of that, then Kunii's book would have laid the matter to rest. Given this, then what do I possibly have to contribute?

Fortunately, this is where only doing an MPhil has its benefits. I'm not really expected to expand the field of knowledge significantly. Instead, Bob suggested that my work to implement OWL in a database (like Kowari) could be the basis of my thesis. Implementing a system like OWL in a new way is apparently a decent enough project, so long as I accompany it with a solid theoretical foundation of the work.

This actually suits me. I have to do this OWL work at Tucana anyway. This will involve both the entailments and the consistency checks. It will also require interaction with a rules engine. While I don't strictly need a solid theoretical understanding of this process (with tight deadlines we are often engineering solutions which work, even if we don't have a complete grasp on the problem) having one will be quite valuable. More to the point, the papers I write to do this, and the final thesis, will all contribute to the body of public work about Kowari. This last one has direct benefit for DW as he spreads the word about Kowari, particularly in academic circles.

I suppose this leaves me with one significant problem in all of this. People who know what they're talking about will be reading this stuff. That means I can't fake it. That's a little scary. I'd better get it right!

Monday, September 27, 2004

Appending Tuples
I started the day by looking at Friday's problems with the Type pseudo-model. Rather than using a memory profiler, it occurred to me that I might see the JXUnit tests work if I moved all other tests aside. If things started to work at that point then I could add the other tests back in, one at a time, until things broke again. The idea was to keep things a little more controlled, and hopefully avoid those OutOfMemoryExceptions.

It was a tactic that paid off. With just the node-type tests running, everything worked as expected. The first tests that I added back in were the "advanced_queries" tests, and these immediately caused incorrect data to be returned. Armed with this I was able to duplicate the problem within an interactive iTQL session.

After a little while I narrowed down the problem, and found that it was the presence of a second model which caused the query to fail. Selecting all the literals from the string pool showed that the two literals which were supposed to be returned were now separated by a large list of other literals. This made me reconsider how I was selecting the literals from the string pool.

The literals from the string pool come out in three lumps: strings, doubles, and dates. The next version of the string pool will have many more types, but for at least the next week we are still restricted to these three. Each type is returned as a GNodeList, and gets wrapped in a thin wrapper class which provides a Tuples interface. In order to return the three tuples as a single Tuples object, they have to be appended.

At this point DM was able to advise me that the TuplesOperations.append() method which I was using was supposed to take a list of ordered tuples as arguments. While the tuples from the string pool are ordered numerically, alphabetically, and by date, none of these was the ordering required by the append() method. This method creates an object which holds its list of constituent tuples, and shuffles forward through them according to which one has the next consecutive item to be returned. If any of these tuples is out of order, then its next item will never be considered the "next" item to be returned by the whole group. Hence, the list of tuples was being cut short.

The package holding TuplesOperations also holds an internally used class called UnorderedAppend which can do an append using unordered tuples. Unlike the ordered case, this class does not hold references to the constituent tuples. Instead, it creates a new tuples object and populates it with the contents of its arguments. DM suggested that I change the scope of this class to public so it could be used externally. With a lot of this code changing soon anyway, SR had no objections to this change.
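
A toy model of why the ordered append cut the results short. This is my own simplification of the behaviour described above, not the actual Kowari code: the ordered merge assumes each input is globally sorted, so an out-of-order item can never become the "next" item and simply gets dropped, while the unordered version just concatenates everything and sorts afterwards.

```python
def ordered_append(*inputs):
    """Merge sorted inputs by repeatedly emitting the smallest eligible
    head. An unsorted input can cut the result short."""
    lists = [list(i) for i in inputs]
    pos = [0] * len(lists)
    out = []
    last = None
    while True:
        # heads that would keep the output in nondecreasing order
        heads = [(lists[k][pos[k]], k)
                 for k in range(len(lists))
                 if pos[k] < len(lists[k])
                 and (last is None or lists[k][pos[k]] >= last)]
        if not heads:
            break
        value, k = min(heads)
        out.append(value)
        last = value
        pos[k] += 1
    return out

def unordered_append(*inputs):
    """The UnorderedAppend approach: copy everything, then sort."""
    out = []
    for i in inputs:
        out.extend(i)
    out.sort()
    return out

print(ordered_append(["b", "a"], ["c"]))    # ['b', 'c'] -- 'a' is lost
print(unordered_append(["b", "a"], ["c"]))  # ['a', 'b', 'c']
```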

The result was a correct set of answers. Before checking everything in to CVS, I started running through a more thorough set of tests to confirm that everything was working well.

Failed Tests
Initially everything appeared to be working, and I started looking at removing some of the hacks I'd put in to help me debug the problem. One of these hacks had been a step in the node-type tests which explicitly dropped every other model being used by the full suite of JXUnit tests. The idea of this step was to help clean out the list of literals, reducing it to just the literals found in the data needed for the test.

Once I allowed other data to stay in the database during the node-type tests, the tests suddenly failed with the following output:

ItqlInterpreter error - org.kowari.query.QueryException: Couldn't resolve local query

Caused by: (QueryException) Couldn't resolve local query
Caused by: (TuplesException) BlockCacheFile.reset(prefix) untested.
Grepping for this error string led me to the BlockCacheLine class (the name "BlockCacheFile" seems to be something left-over from an earlier version of the file). The lines in question were something like:
if (prefix == null) {

} else {
  nextTuple = -1; //!! FIXME
  throw new TuplesException("BlockCacheFile.reset(prefix) untested.");
}
Now code like this is a real worry! Evidently I had enabled a new (and unimplemented) code path which had not been used until now. The code was written by AM who is away on holidays this week, so if I want to get this method working correctly I'll have to work out the inner workings of HybridTuples for myself. (BlockCacheLine is an inner component of HybridTuples).

TJ is very keen for me to help AN with TQL for OWL, so with time being an issue I decided to see if I could try something else, and avoid this code path altogether. My first thought was to attempt to use the TuplesOperations.append() method which I had incorrectly been using at the start of all this. This required a sort() to be applied to each tuples object before appending. The consequence of this would be a whole new sorted tuples object being created for each block of nodes from the string pool, and to wrap these new objects in the appending tuples class. The original tuples would then get closed. Not quite as efficient as letting all the sorting get done in an UnorderedAppend object, but still acceptable.

Implementing this was easy and straightforward, and didn't help.

Upon reflection, I realised that the problem was exactly the same. Using the UnorderedAppend code had created a new HybridTuples object to hold the sorted tuples in. Using the ordered version of append() had not created a new HybridTuples object, but each of the tuples which it wrapped had been a HybridTuples. Each of these was created in order to sort the original tuples.

Since the missing code was only when the BlockCacheLine.reset() method was called with a null or empty prefix parameter, TJ suggested I see why this parameter was being passed in, and if it could possibly be avoided.

Tracking this was remarkably easy. The BlockCacheLine.reset() method is only called from HybridTuples.beforeFirst(), which has its own prefix parameter. This prefix parameter is passed straight through to the BlockCacheLine.reset() method. I confirmed with SR that it is the constraint resolver's join() method which calls HybridTuples.beforeFirst() with the prefix. He also explained that this is fundamental to how join operations are performed.

In other words, BlockCacheLine.reset() must be able to accept the prefix parameter. So first thing in the morning I'll be trying to dissect the HybridTuples code to work out how to do this. Where is AM when you need him?

When tuples need to be resorted, it is necessary to create a new object which can store the newly sorted data. To ensure that the tuples can be of any size, it is necessary to allow this object to be backed by a file, otherwise a query could run out of memory. However, a file-backed object will be quite slow for small data sets, which is just as unacceptable as running out of memory.

The purpose of HybridTuples is to provide a fast and scalable Tuples object. When the data set is small, the entire object is held in memory. When the data set gets to a particular size, HybridTuples creates a backing file in which to store its data.

In the past Kowari had an in-memory Tuples implementation separate from the disk-based FileTuples object. This worked, but required the code instantiating the tuples to know when to switch from using one to the other. HybridTuples, on the other hand, is supposed to do this automatically. It normally performs this well, but it seems that one piece of code required for join operations has not been implemented.
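
The idea can be sketched in a few lines of Python. This is a deliberately naive version with made-up names; the real class has to deal with far more than this:

```python
import tempfile

class HybridList:
    """Rows live in memory until `threshold` is crossed, then the whole
    list is transparently moved to a temporary backing file. Rows are
    plain strings here to keep the sketch simple."""

    def __init__(self, threshold=1000):
        self.threshold = threshold
        self.memory = []
        self.file = None            # no backing file while small

    def append(self, row):
        if self.file is None and len(self.memory) >= self.threshold:
            # crossed the threshold: spill everything to disk
            self.file = tempfile.TemporaryFile(mode="w+")
            self.file.writelines(r + "\n" for r in self.memory)
            self.memory = None
        if self.file is None:
            self.memory.append(row)
        else:
            self.file.write(row + "\n")

    def __iter__(self):
        if self.file is None:
            return iter(self.memory)
        self.file.seek(0)
        return (line.rstrip("\n") for line in self.file)

h = HybridList(threshold=3)
for r in ["a", "b", "c", "d"]:
    h.append(r)
print(list(h))  # ['a', 'b', 'c', 'd'] -- now file-backed
```

The point is that callers never need to know which representation is in use, which is exactly what the old in-memory/FileTuples split forced them to know.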

SR, DM, and I were all perplexed how this code path could have been missed being run in the past. After some thought SR appeared to have the answer. The optimiser works hard to make sure that the tuples resulting from each constraint are in the natural order found in one of the index files. This is almost always possible. If it is absolutely required that one of the tuples be resorted in order to perform a join, the optimiser tries to ensure that this happens to the smallest possible tuples.

Since joins on constraints usually serve to reduce the size of the data, and it is usually this joined data that the optimiser is forced to re-order, it appears that the HybridTuples the system resorts to performing joins on are always small enough to avoid being disk based. Perhaps certain results have been large enough to go over the size threshold for HybridTuples, but they have not been joined against, otherwise we would have seen this error before now.

It's probably fortunate that the pseudo-model code has triggered this bug, as a sufficiently complex query on a significantly large data set would probably have also triggered it.

As the basis for some of the testing I've done with OWL, I've often used the Camera ontology example supplied with Jena. This file contains only 2 literals. The first is a string, which is a comment on the ontology, and the second is a cardinality constraint of 0.

While debugging the code which appended the contents of the different parts of the string pool, I expected to see a string and a double being appended into a single tuples object. Instead I found that the string pool contained 2 strings. This was because the number 0 had been erroneously encoded as a string.

When I looked in the camera.owl file I found this:
I normally prefer working with N3, so AN was kind enough to point me to a file (which I'd written!) to get the correct syntax:
<owl:cardinality rdf:datatype="&xsd;double">0</owl:cardinality>
Of course, the correct data type is xsd:nonNegativeInteger, but we won't be supporting that for a week or so. I also needed to declare the xsd entity.

This fixed the type problem for the data. Hopefully the original camera.owl in Jena has been, or will be, replaced with an updated example which has the correct data type in it. I didn't feel like downloading the whole Jena package to check for a simple update on a single example file. At least, not today.

Graph Data Model
The last time I saw Bob, I said something that made him think of a book he owned, called "Graph Data Model and Its Data Language" by Hideko S. Kunii. He'd met the author some years ago at a conference, and she recommended the book to him. He thought there was something familiar in my description of Kowari, and suggested that I try reading it.

I picked the book up at the UQ library yesterday, and made a start on it last night. The first surprise was that the cover has pictures of several constellations on it. This was the foundation of the company name and original logo of Tucana Technologies. Then as I started reading, I realised that the book describes a very similar system to Kowari.

The book describes a database called the Graph Data Model (GDM), and a data access language called the Graph Data Language (GDL). Kunii describes GDM as:

The data structure of GDM is a labeled directed graph. Records are represented as nodes characterized by attributes and holding values. Links, which are interpreted as binary relationships between records, are represented by directed arcs. Links serve as logical access paths. GDL is based upon algebraic operations on the graphs.

The principal difference between Kowari and GDM/GDL is that the core of Kowari is completely symmetric about the three components of a subject-predicate-object tuple, while GDM considers predicates to be distinct from the nodes they connect. This distinction is familiar from RDF, and demonstrates some of the principles of indexing that I've seen in other systems. However, while the philosophy may have been different, the eventual implementation may not have been all that different. From the chapter on implementation:

A link occurrence file contains pairs of URI-tuples. We can choose either a B-tree, B*-tree or indexed sequential file structure for a link occurrence file. Since the data structure of the link occurrence file does not differ from one link type to another, we can store multiple link types in a single physical file.

In a sense, this sounds very much like the predicate-object-subject index in Kowari. This index is a balanced tree, grouped into sections based on predicate. The contents of each section is a set of tuples of object-subject pairs.

Of course, the implementation of much of Kowari is different to GDM. Kowari also has a meta node for each tuple. More importantly, Kowari is symmetrically indexed about all 4 of its nodes, with 6 index files providing every possible direction of indexing. However, some of the mathematical foundations are identical, and definitely worth pursuing. In particular, I'm interested in the scope and structure of GDL. It appears that Kunii has done a thorough analysis of the requirements of a data access language of this type, and I'm interested to see if there are operations missing from TQL.

As I read more, I started recognising elements from our own recent work in TQL. Early on in the description of GDL, Kunii describes "Link Operators":

The Graph Data Model provides extensive link traversal operations. Because of this facility, the model can express many types of relationship between record types. There are basically two types of link operators: numerical link operators, and transitive link operators. Existential link operators and universal link operators are defined as variants of numerical link operators.

It was only last week that I started to realise that the universal and existential operators can be realised with numerical operators. Seeing the things I've only just started to understand laid out as a natural consequence of this kind of data structure only serves to remind me that there is a lot of theory that I should have already read. I wonder how many people in the DAWG have read up on this kind of work?

Friday, September 24, 2004

JXUnit Tests
I expected that converting the tests for the new pseudo-model into a set of scripted JXUnit tests would be trivial, but this was not the case.

It didn't take too long to create the XML for the tests. The XML results took a little longer, but a set of vi macros applied to the results from the iTQL GUI helped here. However, when the tests were run, the expected results were not obtained.

The first test looked for all statements with subjects of type URIReference. This worked fine. The second test looked for all statements with objects of type rdfs:Literal, but here is where the strangeness started.

The query looks like this:

select $s $p $o
from <rmi://servername/server1#camera>
where $s $p $o
and $o <rdf:type> <rdfs:Literal>
in <rmi://servername/server1#type>
order by $s $p $o;
When executed against the camera ontology, the results look like:
  _node60 <> "

Camera OWL Ontology

Author: Roger L. Costello
Acknowlegements: Many thanks to the following people for
their invaluable input:
Richard McCullough, Yuzhong Qu,
Leo Sauermann, Brian McBride and
Jim Farrugia.
"
  _node159 <> 0
However, when run inside the JXUnit framework the results were merely:
  _node159 <> 0
After working on this for some time I started asking if anyone else might have ideas. DM pointed out that he had left a bug in the string pool which could cause a single-element list (of literals, for instance) to show up as an empty list. This didn't appear to be the problem, and his bug fix confirmed that it wasn't.

Since the original code for the pseudo-models was from SR, I started to question him on a few possibilities. The first thing he suggested was to do a query from just the pseudo-model like this:
select $l
from <rmi://servername/server1#camera>
where $l <rdf:type> <rdfs:Literal>
in <rmi://servername/server1#type>;
In iTQL it returned:

Camera OWL Ontology

Author: Roger L. Costello
Acknowlegements: Many thanks to the following people for
their invaluable input:
Richard McCullough, Yuzhong Qu,
Leo Sauermann, Brian McBride and
Jim Farrugia.
These values were expected, with the first two coming from the camera ontology, and the second two coming from the system model. However, when I put the same query into the JXUnit tests the output file reported an OutOfMemoryException. Of course, exceptions of this nature don't come with a stack trace, so I had no real indication of the specific cause of the problem. The JXUnit code does not set up any logging either, so I'm unable to get any debug statements from the system.

I thought that perhaps the string pool was returning too much data for "node type" code to deal with. There should not be a size problem here, but if there were, then having a lot of literals in the string pool could be enough to trigger it. Based on this idea, I went and tracked down every model created in the JXUnit tests, and put in TQL to drop all of them. Unfortunately, this had no effect.

While looking at the specifics of the implementing code, SR pointed out that it could use a shortcut if the variable were already bound. Rather than retrieving all nodes of a requested type, the value bound to the variable could be looked up directly in the string pool. If the value is present, then the "unconstrained" tuples can be returned, else the "empty" tuples would be returned. This is much faster, and would not use any memory for operations when the other constraints return fewer results than the "type" constraint would return.

I went and implemented this suggestion, and demonstrated that it worked by testing against an interactive iTQL. This was an easy test, as all the variables could be removed from the TQL:
select 'dummy'
from <rmi://servername/server1#camera>
where '7'^^<> <rdf:type> <rdfs:Literal>
in <rmi://servername/server1#type>;
So with this working, the cost of a large number of literals in the string pool is avoided whenever another constraint returns a limited set of nodes.

With these adjustments in place I tried running the JXUnit tests again. Once more, the required query returned only the second line, and the query for all literals still threw an OutOfMemoryException (the latter was not surprising, as the old code path should still have been executed).

Since the queries all work fine from an interactive iTQL session, and with no logging available when running the JXUnit tests, I'm in a quandary for my next move. SR has suggested running a memory profiler during the tests to track down the OOM. This is the best suggestion I've had so far, so I'll be trying that on Monday morning.

Thursday, September 23, 2004

I'm a few days overdue in writing this, but I should be reasonably accurate.

First thing in the morning, AN said that he needed to change the owl:someValuesFrom query in order to make it work. This wasn't completely surprising, as I'd spent most of my time working on the conversion, with very little testing of the actual TQL on real data. Most of the queries were trivial, so this wasn't really a big deal. However, checking consistency on the more advanced restrictions required thorough testing, which is what AN was doing. Concerned that I may have originally misunderstood the problem, I sat down for some time to work through it with him.

I was initially confused about the problem, until we started looking at AN's sample data and the results he was getting. It turned out that we had differing ideas on the meaning of owl:someValuesFrom and owl:allValuesFrom. This was probably a good thing, as it made me go back to the W3C documents and confirm the exact meaning of each. I was finally able to confirm that the TQL I'd done for owl:someValuesFrom was actually correct. This was fortunate, as it validated my previous blog on the matter. :-)

Since we were looking at these queries it was natural to go straight onto looking at owl:allValuesFrom. I had made a mistake for this one, but not because of my interpretation of the operation. Instead, I hadn't quite grasped the nature of the not() operation in TQL. And it turns out that I'm not alone in this.

People perceive the not() operation to have one of several different semantics, depending on the nature of their current requirements. In many cases people are getting this wrong. I was one of them. For this reason SR and AM have been arguing that we should rename the operation. AM's preferred name here would be "excludes". Hearing this helped explain some of the semantics of not() for me, but I think I'll need a little more practice with it before I try to document it here.

The result of all of this was AN doing a rewrite of the owl:allValuesFrom code. It is now based on a subquery, so that each subject-predicate pair can be queried for the types of its objects, making sure that none come from outside the required class.

Now I know I can probably look this up, but could someone reading this please save me the time and explain to me the difference between owl:allValuesFrom being applied as a restriction on a predicate, and rdfs:range being declared on that predicate? It seems that they perform almost the same job, but I'm guessing that there is some sort of completeness aspect that I'm missing where predicates can be of a type owl:Class in OWL Full.

Much of the day was spent testing the pseudo-model for node types. It required a little debugging, but was all working by the end of the day.

Since the problems from Wednesday could be shown to be in unrelated JRDF code, TJ was happy for me to ignore them (fine by me!). I was able to get straight into debugging and testing the new model type.

To start with I wrote a few tests looking for literals and URIs from the camera.owl file supplied with Jena. Once I had the tests and expected results lined up, I tried running it for the first time. Of course, it didn't work straight away.

For a start, I didn't realise that these pseudo-models require creation before they can be used. I had been under the mistaken impression that I could simply modify a constraint with:

  in <>
SR was able to explain that this is a model type, and that an actual model of this type must first be created. Consequently, a model creation statement must appear first:
  create <rmi://servername/server1#type> <>;
And the modifier becomes:
  in <rmi://servername/server1#type>
This is a little disappointing, as it requires at least 2 TQL statements, while we have worked hard to keep all of the individual inference operations to a single statement. I asked TJ about this, and he agreed that it might be worth creating default instances of the special pseudo-models on creation of the system model.

Once I had my syntax correct, I encountered a NullPointerException in the String Pool code. This was entirely DM's code, so I asked for his input. After an hour or so he had it working again. At that point the tests all produced exactly the output required, so I started converting them to JXUnit tests. This took me until the end of the day.

While waiting for DM to look at the string pool, I started to look at some of the statements that AN was considering for OWL. The first one to look at was owl:backwardCompatibleWith.

I hadn't really considered versioning before this point. If anything, I expected that it was about attaching version information to the OWL data, with few consequences for the data. Looking at instructions like owl:imports it quickly became apparent that these instructions perform actions of real consequence.

Considering the owl:backwardCompatibleWith statement, I realised that it needs to participate in consistency checks during inferencing. However, it is unlike most other OWL statements, in that there is no single TQL statement which can enforce it. The only way to confirm the consistency of this statement is to confirm the consistency of all the statements from the older ontology against those of the ontology containing the owl:backwardCompatibleWith statement. This means not only confirming each rule, but running a complete rules engine to confirm the consistency of the entire document.

The owl:imports statement would have to operate similarly to this, only it would also need to perform the importation of the specified ontology. While the consistency checks would need a rules engine, the importation operation could be done with a single TQL statement. However, there is currently no TQL statement which will do this. While the load command does something along these lines, there is no way to select out the object from an owl:imports statement and use that as the source of a load or an insert statement.

I suggested that to get around some of these restrictions we could introduce variables into the from clause. It turned out that SR has been suggesting this for a while, so he was happy with this suggestion. Perhaps with the requirements of the OWL versioning instructions, we may have the impetus to make these changes soon.

Wednesday, September 22, 2004

Virtual Graphs
Today was spent coding up a new "in" option for "where" clauses. This will let us add an option in TQL to restrict the type of a node using a constraint like the following:

  $x <rdf:type> <rdfs:Literal> in <tucana:typeModel>
The original idea was to provide functionality to restrict a variable to literals, URI references, or blank nodes. However, it became quickly apparent that this was akin to sorting a phone book and finding all listed home numbers, listed business numbers, and unlisted numbers. The first two are possible, but the last one is not (unless you want to search for all unused numbers, and then try dialing each one).

When I thought about use cases for this requirement I realised that we don't have any need to find blank nodes. We only need to know how to avoid them. Since literals and URI references cover the rest of the possible node types, this is easy to do. TJ agreed with this, so my requirements dropped to only needing to find nodes which are not blank.

This was all done on the old string pool. DM has finished with the new string pool, but it has yet to be checked in to the main branch of Kowari. This is because of continued testing and because it has not yet been optimised to be as fast as the old string pool. These changes should be checked in soon, and when they are it will be possible to restrict nodes by other criteria as well, including prefix matching, which is needed for RDFS. It won't be a quick merge however, as the API has changed, but the additional functionality will make this worthwhile.

In the meantime I finished the coding for the "virtual" type model, and have started on testing. I didn't get too far, as the main branch is in a state of flux, and a few unit tests are not passing. I tried my changes (which should not have affected any running code) and had only one unit test fail. Then I tried a clean checkout and ran the tests again, to make sure that the failures were the same. But the clean checkout had 4 failures, and they were in a different set of tests. I don't get this, as my current work could not have fixed these 4 failures, and should not have introduced the new failure. In all cases the failures were in JRDF code, and none of my code has any influence on anything related to JRDF.

Finding the source of these problems may be interesting.

Tuesday, September 21, 2004

I've been away from blogging for a few days due to illness. I was (rather stupidly) trying to code with a fever of 38.4C at 9pm last Thursday night (Google tells me that's 101.12 in Fahrenheit). Anne accused me of being a geek as I was taking great delight in watching the fever go up and down every 10 minutes with our new digital thermometer.

I finally realised that I couldn't work any more, and went to bed without blogging. I spent the next 4 days trying to kick the thing, and today is my first day back in the salt mines. Of course, everything I'm supposed to be doing has changed again. Perhaps I should have taken another day of rest (I could have used it).

I had planned on getting each part of Volz's paper codified into TQL rules by Thursday night, and to start working on mapping triggers between each of the rules. I did not quite make it, but I came reasonably close. The only things missing are a couple of OWL Full rules from the end of the document.

Converting the DL rules was not always as straightforward as I'd hoped. This was partly due to some of the DL rules only expressing part of the problem, and partly because some of the rules were expressed in predicate logic instead. Fortunately, iTQL appears to be good enough to fulfill many of these requirements, but I still needed to put in a bit of thought to get them right.

For instance, the owl:someValuesFrom rule was represented as:

 ∀X∃Y : C(Y) ← P(X,Y) ∧ D(X)
This translated to an inconsistency test of:
select $x $p $c count(
  select $y2
  from @@SOURCE@@
  where $x $p $y2 and $y2 <rdf:type> $c )
from @@SOURCE@@
where $r <rdf:type> <owl:Restriction>
and $r <owl:onProperty> $p
and $r <owl:someValuesFrom> $c
and $x $p $y1
having $k0 <>
By "inconsistency test", I mean that any returned rows represent inconsistent data.

Some items in the above need explaining...

The "@@SOURCE@@" string gets replaced as required by a model expression, typically indicating the model for the ontology as well as the base data, and often including the inferred data. Also, Kowari only supports float for the moment, so we can't yet use a numeric type of non-negative integer.

The syntax for count() is still rather obtuse, but in this case it provides a bonus. Since the variables in use within the sub-query (inside the count function) are projected in from the set used in the outer "select" clause, we get an effective "group by" for free. This is essential, as this query is supposed to only return results where a given subject/predicate pair do not have any objects which meet the owl:someValuesFrom restriction. In other words, I'm looking for a count of 0.

The above code and predicate logic demonstrate a few of the difficulties in translation. First, the predicate logic makes no statement that the owl:someValuesFrom restriction applies to the predicate. The same was true of several of Volz's DL statements. Instead, this rule is supposed to be applied on all predicates where such a restriction is known to hold. Similarly, the class D is understood to be the target class that the restriction refers to.

I expect that the external requirements on the predicate and class D would normally be implemented in some wrapping code, but this is not appropriate for a purely TQL solution. Fortunately the above TQL shows that it is possible to express these conditions easily, even if they are harder (or impossible) in DL or predicate logic.

The second issue is that rules like the above declare that X and Y are instances of classes D and C. However, the class C has no significance to owl:someValuesFrom, and can be dropped when generating the TQL. This happens a few times, where an instance declaration (such as C(Y)) is either superfluous or marked explicitly as "optional" in the RDF OWL mapping.

The hardest part in translation here is that some operations, such as the existential quantifier, require a count operation, and the obscure syntax makes this very difficult to have a purely mechanical translation.

Full Cardinality
In order to achieve full cardinality I need to include a variable in the having clause. For instance, owl:maxCardinality can be done with:
select $s $p $x count(
  select $o
  from @@SOURCE@@
  where $s $p $o )
from @@SOURCE@@
where $r <rdf:type> <owl:Restriction>
and $r <owl:onProperty> $p
and $r <owl:maxCardinality> $x
and $s $p $obj
having $k0 <> $x;
This didn't exist in TQL, but AN was able to put it in yesterday. This is the kind of upgrade cycle that inferencing is supposed to provide for TQL, so I was pleased to see it happening.

What to Inference
A discussion with AN on Thursday showed that he is hoping to infer some OWL rules from data that is provided. One example is that he was hoping to be able to read the number of times a predicate is used on a subject, and infer cardinality. I really can't see how that can be done, as OWL is supposed to dictate the structure of data, rather than the other way around. More importantly, extra data could invalidate some inferences taken from the original data.

In the end AN has agreed that we can't really infer OWL from data, but he did have the suggestion of storing some data (like a cardinality of a predicate on each class instance) for optimisation of future consistency checks. That seems fine, but I'm not sure how we'd hold that information. Perhaps in a JRDFmem graph.

Many of the DL rules from Volz describe the construction of an inconsistent statement. One example is:
  inconsistent(X,Y) :- sameIndividualAs(X,Y), differentIndividualFrom(X,Y).
Now these won't actually get generated as statements describing inconsistency, but any returned data will indicate an inconsistency with the data store. As a result, many of the "inferencing" queries are actually consistency-checking queries, where any returned data indicates a problem with that data. This seemed consistent with the way the DL is structured.

Discussing consistency with AN has shown that there are many more rules we can make that Volz did not include in the paper we are using. For instance, it is possible to count predicate/object pairs on an IFP predicate, and confirm that each pair only occurs once. We will probably develop a lot of these as we go.
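A sketch of such an IFP check, following the same count() pattern as the cardinality queries above (this is my own illustration; the <occursMoreThan> predicate in the having clause is a placeholder name, not a confirmed Kowari predicate):

```
select $p $o count(
  select $s
  from @@SOURCE@@
  where $s $p $o )
from @@SOURCE@@
where $p <rdf:type> <owl:InverseFunctionalProperty>
and $s2 $p $o
having $k0 <occursMoreThan> '1.0'^^<xsd:float>;
```

Any rows returned identify a predicate/object pair shared by more than one subject, which is exactly the situation an inverse functional property forbids.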

Because I am testing for inconsistency, the required comparison operators are all inverted. For example, minimum cardinality describes a number as greater than or equal to a given number. Checking for inconsistency therefore only requires less than. Similarly, there is only a need for greater than when considering maximum cardinality.

These restrictions are important, as the predicates which test like this are only valid for the having clause, and this clause does not allow for predicate expressions. This means that >= cannot be represented as a combination of > and =.
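For symmetry, a minimum cardinality check would mirror the maximum cardinality query above, with the comparison inverted to less than (again a sketch of my own; <occursLessThan> is a placeholder name for the having-clause predicate):

```
select $s $p $x count(
  select $o
  from @@SOURCE@@
  where $s $p $o )
from @@SOURCE@@
where $r <rdf:type> <owl:Restriction>
and $r <owl:onProperty> $p
and $r <owl:minCardinality> $x
and $s $p $obj
having $k0 <occursLessThan> $x;
```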

New Requirements
In order to make sure everything can be done right for the coming release, we are not going to include the rules engine for executing the TQL for each rule. Instead, we are going to make sure that the TQL used for OWL inferencing and consistency checking is complete, and properly documented.

This has a few advantages. First, it makes sure that all the relevant TQL has had adequate time for testing and documentation. Second, it provides the basis for anyone to do their own inferencing. Third, we don't know exactly what use cases there are, and this lets us test what customers actually want, as opposed to what we think they want. A final reason appeals to me personally, and that is that we won't need to include Drools with the project, especially as I hope to do our own rules engine instead.

With the completeness and stability of TQL in mind, I'm currently creating a virtual "model" which allows a query to restrict a variable to non-blank nodes, literals, or URIReferences. This was needed for full RDFS compliance, and before now I have not been able to restrict data like this. I'll be continuing work on this in the morning.

rdfs:Class vs. owl:Class
I've been meaning to ask AN about these two class types. OWL describes RDFS Classes as optional, but OWL classes as mandatory. This leads me to question what happens with inconsistencies. If an owl:Class does not exist, but an rdfs:Class does, then should I create the OWL version, or should I report an inconsistency? Do I repeat many of the owl:Class operations for the rdfs:Class as well?

These questions and others are probably answered somewhere, but I haven't found them yet. I'll ask AN in the morning.

TQL vs. iTQL
In the past I've often referred to iTQL when I should really have been saying TQL. The difference is that TQL (the TKS Query Language) is the language, while iTQL is the interactive program that lets you type TQL into it. It's a mistake many of us commonly make, and I'm probably being overly pedantic worrying about it. :-)

Wednesday, September 15, 2004

Something I forgot to mention earlier was consistency checking. The DL I've been translating describes quite a few rules for determining if a system is inconsistent. These are just as easy to translate into iTQL as any other rule, only I don't know what to do with the results.

In all other cases, the results become statements which get inserted into the datastore as inferred data. It is not appropriate (though entirely possible) to treat consistency checks in the same way. If I did then I'd need to define a new "inconsistent" class, which would be very messy, and could be easily missed.

I believe that inconsistency should be reported as some kind of exception, but I don't really know how and when to do it. It will probably depend on how we integrate the rules engine into Kowari. I'll have to ask AN's opinion in the morning.

My health has been terrible today, so my writing here will be necessarily brief.

Today was spent converting the OWL Lite Horn clauses into iTQL. Mostly easy, but it came up with a couple of surprises.

First of all, it seems that it is quite easy to convert from simple DL (description logic) statements into iTQL. An informal set of rules to do this results in something like:

  • The head of a DL statement gets translated directly to a "select" clause in iTQL.
  • Property statements, which are binary functions, are the only function type allowed. Functions with three or more parameters are not permissible.
  • Instance declarations and variable restrictions are the only other allowable constructs.
  • Instance declarations get converted into RDF "type" statements.
  • Property statements get converted into RDF statements of "subject,property,object".
  • Variable restrictions are converted into RDF statements using iTQL "magic" predicates.
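To illustrate (my own example, with "source" standing in for a model expression, and <C> and <D> as hypothetical class URIs): a simple clause like D(x) :- C(x) has its head become the select clause, and its instance declarations become RDF type statements:

```
select $x <rdf:type> <D>
from source
where $x <rdf:type> <C>;
```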
Following these rules is actually quite simple, and I was able to get through many of them today. However, there were a couple of occasions when it is better to step outside of these rules.

A transitive predicate like rdfs:subClassOf can find a set of inferred statements with an iTQL query like:
select $x <rdfs:subClassOf> $z
from source
where $x <rdfs:subClassOf> $y and $y <rdfs:subClassOf> $z;
This will require the rules engine to iteratively insert the results of this query until the query generates no new statements. To optimise this into a single operation we have created the trans() function in iTQL:
select $x <rdfs:subClassOf> $y
from source
where trans($x <rdfs:subClassOf> $y);

Almost all of the conversions that I mentioned above could be done automatically. This would be great, as it would let us create a restricted Horn clause interpreter for Kowari. However, finding special cases for optimisation, like transitivity, could get very difficult in an automated interpreter. Failure to find these special cases would not even show up as a bug, as the program would execute correctly, just non-optimally.

Another problem is in the limitations of DL. For instance, the owl:TransitiveProperty type can be applied to properties in OWL, but this is illegal in DL. The DL for owl:TransitiveProperty is:
  P(X,Z) :- P(X,Y), P(Y,Z).
This is fine, as far as it goes, but it needs the system to apply to every possible "P" in the datastore. This would occur outside of the DL framework. What would be ideal would be something like:
  P(X,Z) :- P(X,Y), P(Y,Z), transitiveProperty(P).
But this is illegal, as "P" is a property, not a class.

Fortunately, iTQL doesn't have any such restriction, and the code can be created accordingly:
select $x $p $z
from source
where $x $p $y and $y $p $z and $p <rdf:type> <owl:TransitiveProperty>;
Of course, since this is a transitive property (by definition) and we have a backward chaining optimisation for that kind of operation, it should be possible to further optimise this particular example:
select $x $p $z
from source
where trans($x $p $z) and $p <rdf:type> <owl:TransitiveProperty>;
However, trans() wasn't really written with the idea of a variable predicate in mind, so I'll have to lower its priority during execution to make sure that the predicate variable gets bound before it gets resolved. I think this will make it work correctly. SR agrees that it should.

Language Restrictions
While I'm discussing restrictions on DL, I should point out an observation that SR made. I mentioned that Kowari does not support the existential quantifier (as DL systems are unable to express this). SR pointed out that while RDF does not support this, Kowari is quite capable of it. For instance, to declare that a node has a property, all that is needed is to include a statement which has that node as the subject, and a blank node as the property. It may be possible later to resolve the blank node to be the "sameAs" another property, but in the meantime it provides existential quantifier functionality.

Unfortunately, the interfaces into Kowari do not permit a blank node to be a property (or predicate). This is simply due to the standard RDF class hierarchy which we are using. Similarly, it is not possible to use a literal as a subject, although the underlying database supports it. I'll confess that this last restriction can be frustrating, as it prevents the provision of types on literals using RDF statements.

Because there are some useful things to be done with Kowari when these restrictions are not in place, SR, AN, and I had a short discussion about providing a less restricted version of the interfaces. With such a system, and a supporting language, it may be possible to come very close to a complete second order predicate logic system.

This then brought me back to Monday's discussion with Bob, when he suggested that I work out the differences between iTQL, descriptive logic, and second order predicate logic (description logic and second order predicate logic have some overlap, but some features are exclusive). Working out where iTQL overlaps with these systems can help us work out what iTQL needs to have added. It can also help us work out the tradeoffs involved with some features, particularly when working out which features will be impossible when another feature gets added. It should also help me to follow the DAWG a little better.

The more I look at iTQL the more I think that it may be possible to get a very good degree of overlap between the two competing logic systems. I don't pretend that we can squeeze in every feature of both, but I've already been shown that some features which I thought were solely the domain of one system or the other can in fact be expressed in iTQL.

Tuesday, September 14, 2004

I've now checked the existing Drools inferencing code into Kowari. It is still an external application with no integration into the Kowari application, but it is included and built as a part of the Kowari package now.

I should create a set of test files so inferencing can be tested from Ant, but I won't be working on that until I get a little more time. However, I do have Ant correctly building the code, and this is a significant portion of the work, since the most difficult part of making Drools work as a Kowari client was getting the classpath correct.

In the meantime I've moved onto getting inferencing working for OWL, and not just RDFS. I'm basing the work on one of the OWL Lite papers by Raphael Volz. He has done several, but this one seemed to be the most convenient.

As I've already commented, it seems to be possible to take all of the Horn clauses required for OWL Lite and convert them into iTQL. I already have the required statements for RDFS, so now I just need to convert the OWL statements as well, and add them to the Drools configuration file. I started on this this afternoon, and I'll be continuing for the next couple of days.

To start with, I thought I'd confirm the RDFS iTQL against the Horn clauses for RDFS. This way I could make sure I was doing things as I expected, and run a check on the work already done. Since this is the first time I've considered the specifics of the conversion, I hadn't realised that the Sesame entailment code was written to perform much of its inferencing iteratively rather than directly.

An example of this iterative approach can be shown with rdfs:subClassOf. Given a statement of:

  C rdfs:subClassOf D
Then the resulting Horn clause is:
  D(x) :- C(x)
In natural language, this clause says that if an object is an instance of class C, then it is also an instance of class D.

Both sides of this clause make statements:
 1. C is of type rdfs:Class
 2. D is of type rdfs:Class
 3. if x is of type C then x is of type D

However, the RDFS entailment code from Sesame does not bother with the first two statements. If D is of type rdfs:Class then this code will successfully create a statement that C is also of type rdfs:Class via rule 9:
  select $aaa <rdf:type> $yyy from @@SOURCE@@
  where $aaa <rdf:type> $xxx and $xxx <rdfs:subClassOf> $yyy;
Also, if a predicate mentions that its domain or range includes objects of type rdfs:Class then this will create type statements as well, via rules 2 and 3:
  select $xxx <rdf:type> $zzz from @@SOURCE@@
  where $xxx $aaa $yyy and $aaa <rdfs:domain> $zzz;

  select $uuu <rdf:type> $zzz from @@SOURCE@@
  where $xxx $aaa $uuu and $aaa <rdfs:range> $zzz;

Unfortunately, none of the above will create the statement that D is of type rdfs:Class, which is explicit in the Horn clause. The statement that C is an rdfs:Class may or may not be created, depending on whether D is mentioned.
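To make the iterative behaviour concrete, here is a small sketch (plain Java, invented for illustration; it has nothing to do with the actual Kowari or Sesame code) of rule 9 being applied to a fixpoint. The triple representation and class names are made up:

```java
// Hypothetical sketch (not Kowari code): rule 9 applied to a fixpoint,
// showing how subClassOf entailment derives new rdf:type statements.
import java.util.*;

public class Rule9 {
  static final String TYPE = "rdf:type", SUBCLASS = "rdfs:subClassOf";

  // Returns the set of triples closed under rule 9:
  //   (x rdf:type C), (C rdfs:subClassOf D)  =>  (x rdf:type D)
  static Set<List<String>> entail(Set<List<String>> triples) {
    Set<List<String>> all = new HashSet<>(triples);
    boolean changed = true;
    while (changed) {               // naive fixpoint; a real engine would do this semi-naively
      changed = false;
      Set<List<String>> derived = new HashSet<>();
      for (List<String> t : all) {
        if (!TYPE.equals(t.get(1))) continue;
        for (List<String> sc : all) {
          if (SUBCLASS.equals(sc.get(1)) && sc.get(0).equals(t.get(2))) {
            derived.add(Arrays.asList(t.get(0), TYPE, sc.get(2)));
          }
        }
      }
      changed = all.addAll(derived);
    }
    return all;
  }

  public static void main(String[] args) {
    Set<List<String>> base = new HashSet<>(Arrays.asList(
        Arrays.asList("om10", TYPE, "Camera"),
        Arrays.asList("Camera", SUBCLASS, "Device")));
    Set<List<String>> closed = entail(base);
    // the instance statement (om10 rdf:type Device) gets derived, but no rule here
    // ever asserts that "Device" is an rdfs:Class
    System.out.println(closed.contains(Arrays.asList("om10", TYPE, "Device")));  // prints true
  }
}
```

Running it shows the instance-level consequence appearing while the class-typing statements from the Horn clause never do, which is exactly the gap in the Sesame rules.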

For completeness' sake it may be worth adding a new rule to create a type statement for any resource appearing in an rdfs:subClassOf statement. However, I may be going overboard, as it is normal to declare classes explicitly, particularly at the head of a hierarchy. Assuming usual practice in the creation of an ontology, a rule like this may not pick up anything new.

I'll leave it for the moment, but I might refer back here if I have some time.

While I was at it, I noticed that I'm not using trans in these rules. That makes the system completely forward chained, which isn't right. I may have fixed this before, but since I had nowhere to check the code in before today, the fix has possibly been mislaid. Fortunately, it's easy enough to fix this part of the code.

Monday, September 13, 2004

It turned out that the iterators for JRDFmem were not removing items correctly. This first showed up as entries in the index with a subject pointing to a set of predicates, with one of the predicates pointing to nothing.

The problem was due to the "super" iterators being moved on immediately after the lowest level iterator (the item iterator) was moved. While this worked fine for normal iteration, it does not work for removal, as the super iterator will have moved on before remove() can be called on it.

Early in the day I thought I could fix the problem by storing the old position and using this old value for deletion, but this was not carefully thought out, and it kept failing on corner cases. In the end I realised that I could only move the super iterators just as the item iterator was about to be moved. This required some restructuring of OneFixedIterator and TwoFixedIterator. Ironically, the code ended up being simpler and smaller. Then again, perhaps that isn't so ironic: the correct solution is often simpler and smaller.
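The shape of the fix can be sketched with a toy two-level iterator (not the actual OneFixedIterator code): the outer ("super") iterator is only advanced inside next(), immediately before a fresh inner ("item") iterator is needed, so remove() still refers to the entry that produced the last item:

```java
// Hypothetical sketch of the fix: the outer iterator is advanced lazily in next(),
// never eagerly, so remove() can still act on the entry the last next() came from.
import java.util.*;

public class LazyPairIterator implements Iterator<String[]> {
  private final Iterator<Map.Entry<String, Set<String>>> outer;
  private Map.Entry<String, Set<String>> entry;   // current outer position
  private Iterator<String> inner;                 // current item iterator

  public LazyPairIterator(Map<String, Set<String>> index) {
    outer = index.entrySet().iterator();
  }

  public boolean hasNext() {
    return (inner != null && inner.hasNext()) || outer.hasNext();
  }

  public String[] next() {
    // move the outer iterator only when the inner one is spent
    while (inner == null || !inner.hasNext()) {
      entry = outer.next();                       // throws NoSuchElementException when done
      inner = entry.getValue().iterator();
    }
    return new String[] { entry.getKey(), inner.next() };
  }

  public void remove() {
    inner.remove();                               // safe: outer still points at this entry
    if (entry.getValue().isEmpty()) outer.remove();
  }

  public static void main(String[] args) {
    Map<String, Set<String>> index = new HashMap<>();
    index.put("s1", new HashSet<>(Arrays.asList("o1", "o2")));
    index.put("s2", new HashSet<>(Collections.singletonList("o3")));
    for (Iterator<String[]> it = new LazyPairIterator(index); it.hasNext(); ) {
      String[] pair = it.next();
      if (pair[0].equals("s2")) it.remove();      // removal mid-iteration now works
    }
    System.out.println(index.containsKey("s2"));  // prints false: emptied entry was cleaned up
  }
}
```

Because the outer iterator never runs ahead of the inner one, removing the last item of an entry can also clean up the entry itself without the two iterators disagreeing about their position.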

Unfortunately, this took me until 11:30pm to get right.

There's lots to be done in only a short time for Kowari and TKS. Looks like I'll be leaving the problem of using non-existent URIs and instead getting distributed queries going with the (now working) resolvers. Hopefully I'll get some sleep in the next few weeks...

I had a conversation with Bob today about 1.5 order predicate logic. We started out by discussing what is possible with Prolog versus OWL. He demonstrated how OWL is incapable of describing a "Square" as a subclass of "Rectangle", while this is easy in description logic:

  Square(s) :- Rectangle(l,w), l = w, l = s
Conversely, existential quantifiers are expressible in OWL Full, but not in description logic.
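For what it's worth, the clause above is trivial to express in an ordinary programming language, where the subclass relationship becomes a predicate over instances rather than a type declaration (a toy illustration, not anything from OWL tooling):

```java
// Hypothetical illustration of the clause above in plain Java:
// a "square" is just a rectangle constrained so that length equals width,
// which is exactly what OWL cannot state about two property values.
public class Shapes {
  static class Rectangle {
    final double l, w;
    Rectangle(double l, double w) { this.l = l; this.w = w; }
  }

  // Square(s) :- Rectangle(l, w), l = w
  static boolean isSquare(Rectangle r) { return r.l == r.w; }

  public static void main(String[] args) {
    System.out.println(isSquare(new Rectangle(2, 2)));  // prints true
    System.out.println(isSquare(new Rectangle(2, 3)));  // prints false
  }
}
```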

In the course of this discussion he attempted to show me an example of a class system, and proceeded to write up a set of tables which were capable of holding both the class definitions (eg. domain and range on a property) as well as instances of the classes. While these tables were complete, I was nonetheless struck by how well RDF is able to express the same thing. When I re-wrote the data in RDF it became far more intelligible.

Bob wanted to show me what was possible with first order predicate logic, and what is possible with second order. The idea is that a very useful subset of second order predicate logic exists which can be used to express almost everything from OWL Full. This is often referred to as 1.5 order predicate logic.

It was while going through the kinds of information required from a graph that I was able to demonstrate that iTQL is already able to do everything from first order logic, and many things from second order. Since there are a couple of things that we have yet to nail down for iTQL, Bob has suggested that I look at exactly what is required for second order logic (or at least, the 1.5 order subset required to express OWL Full), and compare that to what iTQL is capable of. For those constructs which are still missing, it would be worthwhile considering how they could be incorporated into the language: if they cannot be added, what would be lost (it may be insignificant), and if they can be added, what the tradeoffs would be. Obviously this has implications for the DAWG as well. If I can pull it off effectively then it will help us to evaluate the DAWG language and compare it to what we have in iTQL. It will also help us nail down just what iTQL needs.

This doesn't sound too hard. After all, I have 2 years, part time, and I don't have to actually implement the language. Yet Bob thinks that it would make a decent Masters project. Obviously there's more to it than I realise!

Note that I've been talking about OWL Full here. Bob pointed out that just because a system is undecidable does not mean that it is useless. Turing complete languages (such as Java or C) are not guaranteed to terminate (while (true) {...}), and yet these languages are still quite useful. Programmers just have to be careful that their constructs terminate. Conversely, decidable systems are not always useful. Databases are guaranteed to be decidable, and yet they may take a week to execute a query (I spent many nights as a student being paid to monitor a machine that was doing just that), and might run out of memory before the query is complete.

For this reason, Bob thinks that an OWL Full implementation would be quite useful. Given that I've already been thinking along these lines, I was only too happy to agree. I'll just need to convince database administrators that an undecidable data access language is a good idea.

Sunday, September 12, 2004

New Format
Returning readers will notice that things look a little different here. Hopefully it looks nicer.

For some time I've been annoyed at the template I've been using for this blog. I chose that last style because I was after a "clean" look, but it was very light and I find that dark pages are easier on the eyes. This new template also offers a smaller set of fonts, which again is appealing.

I know that I could sit down and tweak the template to be "just right", but I'd rather spend the time on more productive pursuits.

Since the template wants to put in my personal description, I updated it to be a little more descriptive of me. (Though, how does one distill a personality down into 2 sentences?)

Friday, September 10, 2004

Horrifying Accounts
I just read the latest blurb from Telstra on their online services. One of them is online space to set up your own web page. If I want to set up a web page with them then I can have a space allocation ranging from 5MB for $2.95 a month, up to 20MB for $19.95 a month.

To prevent people from putting up large files (like home movies) and having people download them all the time, Telstra have imposed limits on the amount of data that may be downloaded (100MB for the 5MB account, up to 300MB for the 20MB account). And in the fine print at the bottom of the page: "A per-MB fee will be charged if this allowance is exceeded."

Great. So if I don't like someone all I have to do is create a script to repeatedly download pages from their site. Give me long enough, and I could rack up a massive Telstra bill for someone.

I think that it is always a mistake to charge customers based on actions which they have no control over.

Kowari Tests
Everything was working smoothly today, and so it was all checked in. I spent some time writing some documentation for the code, and also on the ideas I have for writing entailment Horn clauses as RDF. The more I think about the idea, the more I like it. A set of ontology inferences can be written as an RDF document, and stored as a model. Then one of these ontology models can be applied to a model of base statements to create a model of inferred statements. It all seems rather clean.

The one thing I'm trying to decide is how much structure should be in the iTQL used to select the data, and how much should be in RDF. For instance, a constraint conjunction could be a collection of constraints with a "Conjunction" property, while disjunctions could be stored in a similar collection having a "Disjunction" property. The clause itself would be an anonymous node, which would then describe the body property using collections like this, and the head as a separate property.

This would work fine, but we already have a parser which can construct WHERE clauses in this way from an iTQL string. Is there really a good reason to avoid storing the entire iTQL string as an RDF literal? The most compelling argument I have is that it is far from elegant to mix data types in this way. A file should hold one format of data only, and not embed another format within it. This is one of the reasons I dislike the current Drools file. It holds a significant amount of iTQL, which is in turn embedded in a string in some Java code, which is really part of an XML element. Nasty.
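For comparison, this is roughly what the structured approach might look like as RDF/XML. The krule vocabulary here is entirely invented for illustration; it is not a real Kowari schema:

```xml
<?xml version="1.0"?>
<!-- Hypothetical sketch: "krule" is a made-up vocabulary, not a real Kowari schema -->
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:krule="http://example.org/krule#">
  <krule:Rule rdf:about="http://example.org/rules#rule9">
    <!-- body: a conjunction of constraints, each a triple pattern over variables -->
    <krule:body>
      <krule:Conjunction>
        <krule:constraint rdf:parseType="Resource">
          <krule:subject>$aaa</krule:subject>
          <krule:predicate>rdf:type</krule:predicate>
          <krule:object>$xxx</krule:object>
        </krule:constraint>
        <krule:constraint rdf:parseType="Resource">
          <krule:subject>$xxx</krule:subject>
          <krule:predicate>rdfs:subClassOf</krule:predicate>
          <krule:object>$yyy</krule:object>
        </krule:constraint>
      </krule:Conjunction>
    </krule:body>
    <!-- head: the statement to insert for each binding of the body -->
    <krule:head rdf:parseType="Resource">
      <krule:subject>$aaa</krule:subject>
      <krule:predicate>rdf:type</krule:predicate>
      <krule:object>$yyy</krule:object>
    </krule:head>
  </krule:Rule>
</rdf:RDF>
```

The literal-string alternative would collapse the whole body into one property value holding iTQL text, which is exactly the format-mixing being argued against.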

Back to TKS
I was waiting to talk to DM about distributed queries for TKS, but he was unavailable today. Since distributed querying is a TKS feature, I decided to run the full test suite for TKS to confirm that everything was at a stable point for me to move forward.

Unfortunately the TKS tests are failing in all sorts of places. So I agreed to see if there was anything in there that I could hopefully fix.

Shortly before leaving this afternoon I spotted something interesting. One of the exception traces complained about an I/O stream which could not be opened for a fake URL. It turned out that this domain is in a Node from a statement, and retrieving this node causes the system to try to open the URL and read it in as an RDF file. SR likened this to Lucene code, but this code is not in the Lucene areas.

I hope it won't take me too long to work it out on Monday.

Last night I managed to get SANE and the TWAIN-SANE driver going for Anne's and my Macs. I have SANE running on my Linux box with the scanner attached, so now we have network access to the scanner.

It wasn't as easy as I'd hoped. While SANE works fine on my Linux server, I was initially unable to see it with my Mac. I started using large amounts of logging for the "net" driver, and I was able to see that the server reported no scanners to be available when being accessed through the net. Everything seemed to be set up correctly, so I started looking through the SANE FAQs. It turned out that saned needed to have permission to access the scanner. This meant changing the group and permissions on the device node for the scanner to match those of saned. This also needed an update to the /etc/hotplug/usb/libusbscanner file to set the correct group.

Now it works, but there have been 2 problems. The first is that it is not quite stable. Because the shared library is loaded into another program (like Photoshop or Word), a failure there causes the host program to quit, an event which has occurred several times now. The second is the preview function. It lets me zoom in on a preview and set a selection, but it seems to be impossible to zoom back out. Even when a program is restarted, the zoomed-in area is remembered, and the larger picture cannot be seen.

I'm thinking of giving up and using a different application front end. So far I haven't been able to get xsane to compile for OSX, so I'm considering checking out Scanlite, as it is Java based.

In an effort to conserve hard drive space, I've been converting my saved videos recorded by MythTV over to a DivX format. This gives a space saving of about 80% even when I have a very high bit rate set. Another advantage is that the files are readable by more decoders. I'm using Mencoder, which is my favourite transcoder.

Since source files are stored with file names based on channel and date, the MySQL database used by MythTV needs to be checked for this info. Similarly, when I go to clear out the disk space of the original file, I need to remove it from the data store.

My code to do this is a major hack, and written heavily for my own system. But if anyone is interested, then please feel free to use it. Just remember to make sure you have the MySQL DBI installed for Perl, and set up the transcoding line for your own system.

Everything Old is New Again
My IEEE subscription just told me that Cold Fusion is back. Very cool.

The most interesting aspect for me was that the Naval scientists who knew and respected Fleischmann were quite happy to ignore the bad publicity he and Pons received, and looked at the case on its own merits. It is yet another demonstration of how eagerly people give credence to the stories put forward in the media (particularly those as sensational as cold fusion), with no sound reasoning behind their decisions. Fortunately there were scientists out there willing to ignore the hype.

It would be nice to see this research produce something. It would be much cheaper than the traditional idea of fusion, and projects like Sandia's Z-machine seem to be quiet at the moment.

Thursday, September 09, 2004

JRDF Removal
I wasn't well today, and I was also in and out of the office a bit, so the day didn't end up as productive as I'd have liked. However, I did finally get the JRDF removal problem fixed.

I did a clean build this morning before I thought about what I was doing. This removed the logs which described the location of the problem code. So I started out by running the time consuming tests again and it took me a little while to trace out the code I needed to change.

Once I found the problem code in ThingJRDFModel, I was still stuck trying to find which code to actually modify as this code only uses general interfaces. The iterator to be modified was an instance of the ClosableIterator interface, with no clue as to which of the numerous implementations would be the appropriate one to change. Fortunately, re-running the tests left a stack trace which showed the class to be ClosableIteratorImpl.

My first attempt to implement the remove() method took the session that the iterator used, retrieved the database, and deleted the requisite triple. This didn't work for two reasons. First, removing a triple in this way was actually going to remove the triple from all models (or graphs), and not just the one referred to by the iterator. More importantly, this deletion mechanism operates outside of any current transactions, so it was not visible to the transaction in which the iterator was being used. Hence, even though the triples were being removed, the results of this were not being seen in the tests.

A little help from DM helped track all of this down. The code which had worked for the Kowari graph prior to this, was using a method with a signature of Graph.remove(Triple). Looking up how this code worked internally showed that it was using the Session.delete(URI,SubjectNode,PredicateNode,ObjectNode) method. Obviously, it should work if I duplicated this call inside the iterator's remove method.

The ClosableIteratorImpl class did not have access to the model URI for the graph, which it was going to need to do the deletion. This meant modifying the constructor of ClosableIteratorImpl to take an extra parameter. Rather than working out which type of Graph implementation was creating the iterator, I tried to side-step tracing the code path completely and checked for all instances of "new ClosableIteratorImpl" in the code. I got lucky, and found it in only 2 of the find() methods of AbstractDatabaseSession (isn't it weird how everything ends up in there eventually?).

Fortunately, each find() method had access to the URI of the requested model, so it was easy to pass this through to a modified constructor for ClosableIteratorImpl. The next() method was changed to save the returned triple, and the remove method was implemented trivially by calling:

  session.delete(graphURI, latestTriple.getSubject(), latestTriple.getPredicate(), latestTriple.getObject());
As with the in-memory implementation of JRDF, this was able to throw an exception, and again I resorted to the inappropriate IllegalStateException to throw out of the Iterator.remove() method. I'm such a sell out. At least I didn't use a raw RuntimeException.
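The resulting iterator can be sketched like this (a hypothetical reconstruction with a stubbed-out Session interface, not the actual Kowari source):

```java
// Hypothetical reconstruction, not the real Kowari code: a closable iterator
// that remembers the last triple returned so remove() can delegate to the session.
import java.net.URI;
import java.util.*;

public class ClosableIteratorSketch implements Iterator<String[]> {
  // minimal stand-in for Kowari's Session.delete(URI, Subject, Predicate, Object)
  interface Session { void delete(URI model, String s, String p, String o); }

  private final Session session;
  private final URI modelURI;        // the extra constructor parameter the fix added
  private final Iterator<String[]> results;
  private String[] latestTriple;     // saved by next() for use in remove()

  ClosableIteratorSketch(Session session, URI modelURI, Iterator<String[]> results) {
    this.session = session;
    this.modelURI = modelURI;
    this.results = results;
  }

  public boolean hasNext() { return results.hasNext(); }

  public String[] next() {
    latestTriple = results.next();   // remember it so remove() knows what to delete
    return latestTriple;
  }

  public void remove() {
    if (latestTriple == null) throw new IllegalStateException("next() not yet called");
    // deleting through the session keeps the removal inside the current transaction,
    // and scoped to the one model the iterator came from
    session.delete(modelURI, latestTriple[0], latestTriple[1], latestTriple[2]);
  }

  public static void main(String[] args) {
    List<String> log = new ArrayList<>();
    Session session = (model, s, p, o) -> log.add(s + " " + p + " " + o);
    List<String[]> found = new ArrayList<>();
    found.add(new String[] {"camera1", "rdf:type", "Camera"});
    ClosableIteratorSketch it = new ClosableIteratorSketch(
        session, URI.create("rmi://localhost/server1#model"), found.iterator());
    while (it.hasNext()) { it.next(); it.remove(); }
    System.out.println(log);         // the delete was routed through the session
  }
}
```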

Wednesday, September 08, 2004

I finished the remove code for the ClosableIterator classes today. I have yet to write unit tests, but the Kowari tests which needed this method are now passing, so it looks promising.

Part of the operation of a remove involves calling methods which can throw a GraphException. Unfortunately, the remove method's signature does not permit throwing any new exception types, so I was forced to resort to a RuntimeException. After much angst I opted for an IllegalStateException, as the exception will only be thrown if the current iterator's index can remove a statement, but a different index cannot.

When I first ran the Kowari tests I discovered that nothing had changed. This is because the code in question was using the following idiom:

  Iterator ci = graph.find(...);
  while (ci.hasNext()) {
    Triple triple = (Triple) ci.next();
    graph.remove(triple);
  }
This was not using the Iterator.remove() method, and it should be evident that it was going to cause a ConcurrentModificationException when used on a graph that is based on the java.util collections framework. The change to using Iterator.remove() was trivial, and allowed the test to pass, making KA happy:
  Iterator ci = graph.find(...);
  while (ci.hasNext()) {
    ci.next();
    ci.remove();
  }
However, this caused the Kowari graph implementation to fail the same test, due to an unimplemented Iterator.remove() method. KA was not too happy about that part, and asked if I could fix it.

I mentioned the problem to DM, who pointed out that he went to a lot of effort to make sure that removing triples from a graph with a current iterator would work perfectly. This is where Kowari makes heavy use of all of its micro-phases. Since I have no plans on making JRDFmem completely transactional, with micro-phases to protect each little operation, then I'm happy to let it throw a ConcurrentModificationException instead.

To get it working, I'll probably make use of DM's hard work, and have the Iterator.remove() method call the Graph.remove(Triple) method.

KA helped me set up Eclipse for a Kowari project. It certainly has a lot of features. However, with great features comes great complexity. Even KA was stumped setting much of it up, and he had to refer to his own setup for many items.

By the end of it, it seemed to be building within Eclipse, but there were still a few problems being reported. One of them was a complaint that a file was missing, so I went and tracked it down and added that file. Then I started getting a report saying that file was included more than once. KA didn't know exactly what was going on, so I'm sure I'll be working on this for a while.

Tuesday, September 07, 2004

Short Day
Anne had an operation today, and so I took half the day off to act as chauffeur and babysitter.

In the couple of hours available for me to program I worked on the remove method on the ClosableIterators. I've finished the bulk of the coding on the ThreeFixedIterator, TwoFixedIterator and OneFixedIterator, though I haven't compiled or tested them yet. I still need to work on the GraphIterator, plus any cleaning which I've missed.

It shouldn't take me too long to finish it all tomorrow, and hopefully it won't take too long to test (and possibly debug).

Monday, September 06, 2004

Proof Reading and Blogger Bugs
Well, I came back and proofread this entry, only now I can't publish it. Publishing keeps timing out with the following message:

    001 Connection timed out
It's very frustrating. Anyway, I'm having another go, with a new header (ie. this paragraph) in the hopes that it will start working.

Holiday Preparation
I'm back from 2 weeks break. Not all that relaxing, as it was mostly spent visiting relatives to introduce Luc. It was nice to see everyone, but I'm glad I'm back so I can relax. :-)

My last couple of days of work were the 19th and 20th. I had hoped to make a couple of blog entries for those days, but I wasn't able to keep up with everything, so it never happened. Those two days were spent showing AN what I'd been doing on inferencing with DROOLS, so he could finish off what I'd done, and working on the HAVING clause.

The HAVING clause needed to be done in 2 parts. The first part was the syntax, and the data structures created by it. The second part was the final filtering and grouping based on these data structures. I had enough time to get the first part done, but I only just started on the second before leaving. A lot of this time was spent talking about the design of the changes, and also for the hand over of what I'd done.

So then I got to have my time off.

Not a lot of time for work or study over the last 2 weeks, but I thought I'd mention a couple of things which I did do.

JRDF and Today's Work
During my first week I saw a message from AN about the in-memory implementation of JRDF. It turned out that there were a couple of little bugs, starting with comparing URIReference objects from 2 separate models. This didn't really surprise me, as I hadn't actually considered multiple models when I designed it. After all, I really just wrote it over the course of a few evenings, simply as an exercise to get me familiar with the interface, so I had some practice before doing a disk based JRDF in C.

It turned out that having an in-memory model like this was useful, and ML was using it. Only he was pushing it further than I'd ever really considered, and hence the bugs showed up. So I started thinking about the design, and I realised that the simple bug fixes that were being made only hid some more fundamental problems with the design. For instance, there could be a real problem if a model were serialized with all of its node IDs and a new model were to re-use those IDs. Node IDs had to be local to a model (meaning a URIReference which appeared in more than one model could have multiple IDs), or else they would be global, but only for the lifetime of a session. The latter would mean big changes to serialization to avoid the internal node IDs, but that seemed reasonable.

I mentioned these problems to AN, and he wrote back to tell me that he fixed it all. Whew. I didn't want to work on it all again.

When I got back this morning I learned that my little evening hobby is now being shipped as part of the production Kowari system. Eeek! :-) Oh well. I guess that's what open source code is about.

There is still a remaining problem with the in-memory JRDF. The "remove" methods on the iterators are not implemented (I always threw an UnsupportedOperationException as I had no need to implement them before), but without them it is not possible to remove triples while iterating over a set of results. If Graph.remove() is called while iterating then a ConcurrentModificationException will be thrown. This is to be expected, but it's annoying for anyone who might want to do something like this. So this is what I spent part of today implementing. I have it all done on two of the ClosableIterator implementations, but I still have another two to go.

When remove is called it is a simple enough matter to pass the call down to the wrapped iterator, but that will only remove the final node (and hence the triple) for that index. The other two indexes also need to have their statement removed. Fortunately there was a private method for that on GraphImpl so I've promoted that to protected visibility, and I'm able to call it from the remove method on the ClosableIterators.

Actually, I just remembered that I missed something. If I pass the remove call down to the underlying map, then I need to check if it was the only entry in that map. If it was, then I should remove the entire map. This check will then need to propagate up the maps to the head of the index if need be. I'll add this in the morning.
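That cleanup might look something like the following toy index (invented for illustration; JRDFmem's real indexes map internal node IDs, not strings):

```java
// Hypothetical sketch of the cleanup described above: a three-level index
// (subject -> predicate -> set of objects) where removing the last object
// propagates upward, deleting the now-empty inner maps.
import java.util.*;

public class TripleIndex {
  private final Map<String, Map<String, Set<String>>> index = new HashMap<>();

  void add(String s, String p, String o) {
    index.computeIfAbsent(s, k -> new HashMap<>())
         .computeIfAbsent(p, k -> new HashSet<>())
         .add(o);
  }

  void remove(String s, String p, String o) {
    Map<String, Set<String>> preds = index.get(s);
    if (preds == null) return;
    Set<String> objs = preds.get(p);
    if (objs == null || !objs.remove(o)) return;
    if (objs.isEmpty()) preds.remove(p);      // no objects left: drop the predicate entry
    if (preds.isEmpty()) index.remove(s);     // no predicates left: drop the subject entry
  }

  boolean containsSubject(String s) { return index.containsKey(s); }

  public static void main(String[] args) {
    TripleIndex idx = new TripleIndex();
    idx.add("om10", "rdf:type", "Camera");
    idx.remove("om10", "rdf:type", "Camera");
    // the subject entry is gone too, not left pointing at an empty map
    System.out.println(idx.containsSubject("om10"));  // prints false
  }
}
```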

While discussing today's work, I should note that the HAVING clause was never finished while I was away. Once I have this "remove" code sorted out I'll get back onto that.

AUUG 2004
On my second week we went to Melbourne, where the AUUG 2004 conference was on. Though I was unable to afford the conference, I decided to take advantage of the free tutorial that Apple put on (thanks Apple!). When I showed up, someone had made a mistake and put me down as a full conference attendee! If I had the time I could have attended anything I wanted, plus I'd have been fed for free.

I didn't do anything that I shouldn't, but I did score a nice bag, with a decent T-shirt and the conference proceedings in it. I also got to reacquaint myself with a few people I haven't seen in some time.

I finally wrote out that paper on inferencing for Bob, just before I went to see him. He wasn't too concerned about the specific content, but had wanted me to take the time to think things through. It gave me a clearer idea on what I mean by inferencing, and how I think it can be done. When I get time I'll put it on a web site somewhere.

While I was in at UQ I was given access to the ITEE network, a key for 24/7 access to the ITEE building, and a key to an office where I share a desk with another part time student. I probably won't use it much, but it was fun to make Anne jealous with it, as she was never given an office when she did her Masters in Landscape Design. ;-)

I didn't do as much reading as I'd hoped, but I did get a little done. I've borrowed a copy of Bob's book to read a few useful chapters, and I'm hoping to get through that, along with some material on Magic Sets. This came up because I finally re-read "A Survey of Research on Deductive Database Systems", and I now understand a lot more of what it is talking about. It's very useful, but it is now 11 years old, so I will have to do quite a lot of my own research to get a more up to date view of the state of research in this area.

The most interesting thing about this paper was learning that we have re-implemented a number of things in Kowari which others have been doing for years. This is hardly very surprising, but it was strange to see. It is also useful to be able to put a name to things like semi-naive evaluation which is exactly the trick I'd used in the DROOLS code. The survey paper pointed out that several people had independently proposed this. The fact that I came up with it myself demonstrates the obviousness of this solution.

The only technology from the paper that I really know nothing about is Magic Sets, hence my desire to look into them a little more.

The single biggest thing I got from this is how big the field is, and how little I know about it. I'll confess that it has me a little intimidated.

I also noticed that all of the surveyed projects, going back to the 60's, are based on predicate logic. They expect their data to be in a predicate form, and yet almost all databases are in the standard tabular form, requiring a mapping from one form to another. I realise now that RDF is one of the few forms of data to which predicate logic can be applied directly. Given the huge body of inferencing research in this area, this makes RDF seem like an ideal tool for inferencing. I haven't seen this written down anywhere, but I suppose this might be one of the reasons RDF was designed in the way that it is.

Inferencing Rules
While doing my inferencing paper for Bob, and again during some of my reading, it occurred to me that almost all inferencing is done as a set of rules, often described with Horn clauses. These all boil down to: "If this set of statements, then this new statement."

OWL inferencing is a specific set of statements from this general structure. So if I consider the general structure then I can cover OWL. Incidentally, I've noticed that I can't find much on OWL inferencing. I've seen validation on OWL, and I've seen RDFS inferencing (or "entailment") but I've yet to see anything that is inferred with OWL.

Many projects refer to Intensional Databases as those which store these rules for inferencing. It occurred to me that we also want to store such rules in the database. The closest I came was putting the rules into an iTQL form and putting them into a script used by DROOLS, but that wasn't particularly flexible.

I'm thinking that we want to define a model (or a number of models) which hold inferencing rules. A rule can have a property which contains iTQL which defines the body and head of the rule (the head makes up the resulting statements). While the whole lot can be defined in a single iTQL statement, testing and insertion could be separated if the head and body were stored separately. When being executed the rule would be given a destination, and a source for the body to be applied to (ie. a FROM clause).

The result would be models which defined the rules (eg. an RDFS entailment model, an OWL inferencing model, or a domain specific inferencing model), and a new iTQL command to apply a model-of-rules to another model, inserting the results into a destination model.

I'll have to speak to AN about this in more detail, but it appeals to me, particularly due to its flexibility, and because it can be mapped directly from Horn clauses.

File Systems
Several people have already pointed out that the properties proposed for WinFS look surprisingly like RDF, with a few things missing (hi Danny, if you're reading). I've been thinking for a while that this seems like a strange restriction, when one could always tack RDF straight into the file system.

I've let it slide for a little while now, but I've been thinking that if I get back to this C implementation of an RDF store, then I should be able to build a filesystem with RDF built in. (How do I implement the interface for that? ioctl() calls? It should be an interface that makes sense and doesn't break anything.)

Now that MS has announced that they want to push WinFS back a couple of years I'm even more keen to do something with this. Once I've sorted out the RDF store, and the interface, then it should be reasonably easy to build it into a user-space filesystem for Linux. I've been meaning to have a look at how to do a user space system, and this would be a great opportunity to learn.

Besides, it would be terrible for MS to have something that Linux doesn't. :-) While I'm at it, I should see how filesystems for Macs work. Macs are already doing something in this space, but portability is still a desirable trait in any system. There is an ext2fs module for the Mac, so I should be able to use that as a template.

First things are first though. I'll have to get this RDF store working before I try and patch it into something like a filesystem. Still, I'm pleased to have a purpose for the code, so it isn't just a mental exercise.