Thursday, March 31, 2005

More Cardinality
After prototypo's comment I thought that I should try to be clearer in my criticism of owl:minCardinality.

A system is always in one of two states along each of two axes: it can be valid or invalid, and it can be consistent or inconsistent. A valid system is true under every possible interpretation, while an invalid system has at least one interpretation where it is not true. An inconsistent system is false under every possible interpretation, while a consistent system has at least one interpretation where it is true.

With the open world assumption there will always be many interpretations open to us. We cannot tell a user that the data has a problem until we know that the data cannot possibly be right. This means that we should only report a problem when the system is inconsistent. Invalid systems are fine, as they could well evaluate to true, depending on future data. Of course, valid systems are also OK, as they always evaluate to true.

If we can prove inconsistency, then we can report the problem, and Kowari will be done.

The problem is that owl:minCardinality statements will always be consistent in an open world. On occasion they will even be valid (when all the objects are owl:differentFrom each other). So what is the point of statements with the owl:minCardinality predicate? They can't ever be false, so they don't tell you anything! It's kind of like a belt and braces when your belt is unbreakable.

If owl:minCardinality is not useful in the open world, then what is the point? I have to infer that it takes on the more intuitive meaning, and gets applied in a closed world sense. In that case it becomes useful to a database like Kowari. It is for this reason that I think that the contributor of this statement wasn't thinking clearly about the open world assumption. Otherwise, why include a statement of no value?

OK, I should concede that a system based on an owl:minCardinality statement can be false... but only when the range of the predicate is a class with a restricted set of instances, where the size of that set is less than the cardinality. However, the Description Logic Handbook actually defines consistency as ABox data which meets the criteria of TBox data. This example demonstrates inconsistent TBox data, which isn't quite the same thing. In other words, it's a bad ontology, not bad data being described by the ontology. It's still inconsistent, but it doesn't meet the inconsistency definition put forward by the DL Handbook.
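To make the closed-world reading concrete, here is a minimal sketch of what a closed-world owl:minCardinality check could look like. The function name and the triple layout are illustrative only, not Kowari's API, and a real check would also need to consider class members with zero relevant statements:

```python
# Closed-world owl:minCardinality check: flag subjects whose number of
# values for a property falls below the declared minimum. Under the open
# world assumption this is never an error, since more statements may
# arrive later; under a closed world it is reportable.
def min_cardinality_violations(statements, prop, min_card):
    """statements: iterable of (subject, predicate, object) triples."""
    counts = {}
    for s, p, o in statements:
        if p == prop:
            counts[s] = counts.get(s, 0) + 1
    return sorted(s for s, n in counts.items() if n < min_card)
```

For example, a subject with only one value for a property restricted to a minimum cardinality of 2 would be reported, while the same data passes untouched under an open-world reading.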

Testing on OS X is still impossible. The JVM always ends up dying somewhere in the process. I don't know any way around this, so for the moment I have to run all my tests on a Linux server. That's a little annoying, as I need to manually sync the files between my PowerBook and the server.

I'd love to find a small subset of tests that will run just the "store" tests, but so far I can't find one. At the moment I'm running the full suite in a scrollable window, so I can try to work out a containing target that might do the job for me. I haven't been able to do that before now, because I've been using "screen" on the Linux server, which loses the whole output buffer. And anyone who has ever looked inside Kowari's build.xml (and the per-package build.xml files) will know that it is not possible to work out which targets are needed for a single set of tests.

The test code also ended up more complex than I expected. This was because I hadn't really thought about every possibility for the data.

To start with, I wanted to check when the variables in the minuend and subtrahend had a null intersection, and also when they had intersections of 1 and 2 variables (and hope by induction that 3 or more would also work). I also wanted to check when the intersection equaled the minuend, and equaled the subtrahend, and then equaled both.

Once these parameters were set up, I then needed to consider the bindings. I needed to check when individual statements are removed, runs of statements are removed, statements at the beginning are removed, and statements at the end. I also needed to check prefix matching on data that exists and does not exist, for most of the above conditions.

Finally, I ran into an interesting problem while comparing these tests to the UnboundJoin tests. This code makes a lot of use of unbound variables and the effect that they have on the results. I've tried to include several such tests for these as well, though their validity is going to depend on how well prefix matching works on the subtrahend.

Anyway, I've spent some of Tuesday, all of Wednesday and part of Thursday writing these tests. Last night and today has been spent fixing some of the bugs shown up, but I still have a few to go. While I'm still stuck running the full test suite on a remote server I'm spending a lot of time just waiting for it... hence this entry. I can't continue having a turnaround at this pace though, so I hope to improve the process on the next iteration.

Wednesday, March 30, 2005

OWL Cardinality
After posting my problems with cardinality in an open world, I've been giving it some further thought. Coincidentally, Andrew made a comment which closely reflected the opinion that I have come to.

If minimum cardinality means nothing in an open world, then there was very little point in including it in the OWL language. Whoever did so probably forgot about the open world assumption at the time. I can't blame them, as I find it very easy to forget. So I think I should assume that the intent was for this to operate in a closed world assumption. Similarly, I have to assume the same about the intent of maximum cardinality.

I'm still a little uncomfortable about switching from an open world assumption over to a closed world assumption without warning. I'm wondering if there should be some kind of user-settable flag which would allow cardinality to be determined in a closed world. Whether I do this or not, I should at least look at the iTQL structure for both types of queries. Once I have them both it would not be too much work to switch between the two paradigms.

Unfortunately, owl:sameAs makes the closed world assumption difficult as well.

For either type of cardinality, any statements whose objects are declared owl:sameAs each other will need to be counted as a single statement. In this case, we need to count all objects in the relevant statements (relevant here meaning that the predicate is the one the cardinality constraint applies to) which are also owl:sameAs other objects in the relevant statements. If the number of these objects is n, then we need to reduce the count of the total relevant statements by (n-1).

An alternate method of counting would be to find and count all relevant statements whose objects are not declared owl:sameAs anything. Then count all the groups of owl:sameAs objects which appear in relevant statements, and add this to the total.
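A sketch of the first scheme, using a union-find structure to merge owl:sameAs values so that each group counts only once (the (n-1) reduction described above). The function and data layout are hypothetical, not Kowari code:

```python
def distinct_values(values, same_as_pairs):
    """Count the values of a property, treating owl:sameAs values as one.
    same_as_pairs: iterable of (a, b) meaning a owl:sameAs b."""
    parent = {v: v for v in values}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in same_as_pairs:
        if a in parent and b in parent:
            parent[find(a)] = find(b)  # merge the two sameAs groups
    # Each distinct root corresponds to one "real" individual.
    return len({find(v) for v in parent})
```

So three values with one sameAs pair among them count as two individuals: 3 - (2 - 1) = 2, matching the reduction rule above.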

Unfortunately, both methods here require arithmetic operations (addition and subtraction). iTQL does not support this, though one day it may have to. While I need it here, I'm not sure that I want to add it just yet. I may have to consider a non-iTQL approach. I'll give it further thought before I commit to anything.

I should note here that, because of inferencing, we can be sure that all owl:sameAs statements will be fully symmetric by the time constraints are being checked. Similarly, inferencing means that if one object appears in a statement, then all other objects which are owl:sameAs it will also appear in similar statements. At least this allows for a certain consistency when dealing with these statements.

Tuesday, March 29, 2005

I've finished the difference operator, and am now part way into the tests for it. This operation has been frustrating, as it required a disproportionate amount of work for a relatively small amount of code. The problem was understanding all of the join code, and understanding the ramifications of apparently simple things. It took quite a lot of effort, including input from Simon and Andrae. It was also difficult making the mental context switch from the algebraic level (higher up in the abstraction stack) to the low level tuple-at-a-time interfaces.

Unfortunately I wrote at length about this last week before losing what I'd typed, and tonight I'm disinclined to go into all the details again. I'll just gloss over things, else I'll never write about it.

I had to look the terminology up, but the left and right sides of a subtraction are called the minuend and subtrahend respectively, and I've used these terms in the code. The idea of the difference operation is to remove all tuples (or "rows") from the minuend, where there is a corresponding row in the subtrahend. In this case, corresponding means that all the matching variables contain the same values.

An efficient way to do this involves sorting of the subtrahend. The code would then iterate over the minuend, and perform binary searches on the subtrahend to determine if each row should be removed.

This efficiency can be improved if the minuend is also sorted in the same order as the subtrahend. If a match is found on a row in the minuend (meaning that row is to be discarded) then a sorted minuend can use a binary search to jump over all the subsequent rows with the same variable values.
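The approach can be sketched like this, with rows as plain tuples and the matching variables given as column indices for each side (all names here are illustrative, not Kowari's interfaces):

```python
from bisect import bisect_left

def difference(minuend, subtrahend, m_cols, s_cols):
    """Remove minuend rows whose values at the matching variables appear
    in the subtrahend. m_cols/s_cols index those variables in the
    minuend and subtrahend rows respectively."""
    # Sort the subtrahend's match keys once...
    keys = sorted(tuple(row[i] for i in s_cols) for row in subtrahend)

    def present(key):
        # ...then binary-search per minuend row.
        j = bisect_left(keys, key)
        return j < len(keys) and keys[j] == key

    return [row for row in minuend
            if not present(tuple(row[i] for i in m_cols))]
```

The real implementation works lazily over Tuples rather than materialising a list, but the shape is the same: one sort of the subtrahend, then a search per minuend row.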

The problem with some of the steps in efficiency is the initial cost. Small data sets don't really matter, given that they will never make it to the disk, so it's only the large ones which need to be considered. Sorting large sets will be on disk where it is expensive. This complexity has to be traded off against the complexity of leaving the data unsorted.

In the case of the subtrahend, its tuples will need to be searched for every row of the minuend. If this were performed against unsorted data, then the complexity would be quadratic, which is unacceptable. Sorting can also be quadratic (we use the quicksort algorithm), but that is a worst-case condition with very low probability. Quicksort actually averages n.log(n), which is pretty good. The data will often be sorted already, so a general algorithm which always uses sorted data and only sorts when it needs to will perform reasonably well.

Also, the nature of subtracting means that our use cases typically have a small subtrahend, in relation to the minuend. This further increases the benefits of sorting the subtrahend.

Conversely, sorting the minuend is not so clear cut. For a start, sorting is only of benefit when there are a lot of rows with the requisite variables set to the same values. This does happen, but not for the majority of rows. Performing a search after every row to find where that set of variable values ends changes that part of the operation from linear complexity up to n.log(n). There are ways to avoid the check every time (move on to the next row, and only do the search if it matches the last row), but there is still an expense. Given the minor improvement that is likely, the initial cost of the sort would not appear to be worth it. As a result, the minuend is not being sorted.

Ideally we could develop a heuristic to determine when the minuend should be sorted, but this is not trivial, and smacks of premature optimisation. If there is a problem in the future then this could be revisited.

Sorting is not as easy as I think it should be. For a start, I wanted to know when I needed to sort and when I didn't. The only indication of the sort order for a tuples is the RowComparator, so I thought I would be working with that.

After looking at this interface, I realised that there is no way to query the order that it imposes. There is also no way to compare two separate instances of RowComparator except by reference. While I understand the philosophy of making this an opaque object, it should at least be able to compare itself to other comparators so they can determine if they impose the same ordering.
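The missing capability might look something like this sketch: a comparator that remembers the variable order it imposes, so that two instances can be tested for equivalence. This is a hypothetical interface, not the real RowComparator:

```python
class VariableOrderComparator:
    """A comparator that exposes the variable order it imposes, so two
    instances can be compared for equivalence rather than only by
    reference. Purely illustrative; Kowari's RowComparator is opaque."""

    def __init__(self, variables):
        self.variables = tuple(variables)

    def key(self, row):
        # row: mapping of variable name -> value; the sort key is the
        # values in this comparator's declared variable order.
        return tuple(row[v] for v in self.variables)

    def imposes_same_order(self, other):
        return self.variables == other.variables
```

With this, code that needs sorted input could skip a redundant sort whenever the incoming tuples' comparator reports the required ordering.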

Looking at the other comparators in the system, I realised that we only have two types. The first is used for ordering the AVL trees in the native quad store, while the second is used for ordering globalised results alphabetically. Neither of these was useful for ordering by a list of variables, based on local node IDs.

Andrae helped me to discover that I needed to use the TuplesOperations.project() method to get the ordering that I need. There is a sort method available, but it will sort by all the variables in the tuples (I don't mind which order the variables are in, which is good, because the sort method does not accept an order). So the unwanted variables must be removed before the sort can take place. This is done lazily with the project operation.

This operation can result in duplicate rows, so internally the method sorts the results to make duplicates appear consecutively. In this case, the project will do all of the work.

The main problem with using project blindly is when the data is already sorted in the correct way. In this case, no operation is necessary, and the expense of the projection should be avoided. This is easy to spot when the tuples to be projected only contains the required variables and no others, but we currently have no way of seeing the sort order when the tuples has other variables as well. This is a failing in the general case for join operations, and will need to be addressed eventually.

In the meantime, the sorting of the subtrahend is determined by a couple of simple steps. If the data has the correct number of variables, then it is checked for a sort order. If there is an order, then it is assumed to be correct; otherwise a sort is performed. If the number of variables is incorrect (i.e. there are extra variables which are not found in the minuend) then a project is performed. This unnecessarily reduces the number of variables, but also performs the necessary sort.

After sorting the subtrahend, the implementation of the Difference class is relatively straightforward. Like many of these operation classes, the operation is performed lazily.

Almost all methods are passed on to the minuend, as this class represents that parameter, minus a few rows. The one major exception is the next method. In this case, the minuend's next method is called, but then the current row on the minuend is tested by checking for those variables in the subtrahend. If the row exists, then the minuend is moved on. This continues until a row is not matched in the subtrahend, or the minuend runs out of rows.
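The next() logic amounts to a lazy filter over the minuend. A toy version as a generator, with the subtrahend's match keys pre-computed into a set in place of the search described above (names are illustrative):

```python
def lazy_difference(minuend, subtrahend_keys, key_fn):
    """Lazily yield minuend rows that have no match in the subtrahend.
    minuend: an iterator of rows; subtrahend_keys: a set of match keys;
    key_fn: extracts the matching-variable values from a row."""
    for row in minuend:
        # Keep advancing past rows matched in the subtrahend; only
        # unmatched rows are surfaced to the caller.
        if key_fn(row) not in subtrahend_keys:
            yield row
```

Each call to next() advances the underlying minuend until an unmatched row is found or the minuend is exhausted, which is exactly the behaviour described above.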

Describing this in hindsight makes it all look simple, but getting here has been anything but. I'm looking forward to getting the tests complete and moving back to the higher level work.

Friday, March 25, 2005

OWL and Cardinality
I've been giving thought to OWL cardinality in the open world model. My general rule of thumb for the open world assumption is that a model is valid if it is possible to add new statements which make everything consistent. In other words, I'm assuming that all of those statements could exist, but they just haven't been entered yet. A model can only be invalid if there is an apparent problem, and there are no possible statements which can fix it. For instance, the following model has an inconsistency which cannot be fixed with a new statement:

  <ns:A> <owl:differentFrom> <ns:B>
  <ns:B> <owl:sameAs> <ns:A>

This has big ramifications for cardinality in OWL.

Consider owl:minCardinality. This restriction says that there must be a certain minimum number of times a predicate gets used with a subject. However, if there are not enough statements to meet this restriction, then it is always possible to add them later. Hence the open world assumption means that it is generally not possible to check owl:minCardinality for consistency.

The only way that owl:minCardinality might be violated is if there simply are not enough options available for a new statement to be legitimately added. For instance, the range of a predicate may be a class which was defined using owl:oneOf. However, I can't see how this could cause a problem unless the list of available objects was smaller than the minimum cardinality. That would certainly be inconsistent, but it would be inconsistent TBox information, rather than inconsistent ABox data. This means that the owl:minCardinality would not be violated, but the OWL would be inconsistent. A different error, but one I should still check for.

owl:maxCardinality is a more complex story. At first glance, the following might appear to be inconsistent:
  <owl:Class rdf:ID="myClass">
    <rdfs:subClassOf>
      <owl:Restriction>
        <owl:onProperty rdf:resource="#myProperty"/>
        <owl:maxCardinality rdf:datatype="&xsd;nonNegativeInteger">1</owl:maxCardinality>
      </owl:Restriction>
    </rdfs:subClassOf>
  </owl:Class>

  <myClass rdf:about="#someObject">
    <myProperty rdf:resource="#firstPropertyObject"/>
    <myProperty rdf:resource="#secondPropertyObject"/>
  </myClass>
This is only partial RDF, and I haven't validated it, but it should demonstrate a property being used twice when it has a maximum cardinality of 1.

The difficulty in checking this for consistency is that an <owl:sameAs> statement can make it all valid:

<ns:firstPropertyObject> <owl:sameAs> <ns:secondPropertyObject>

It would seem that the only statements which can be validly checked for cardinality are those which refer to objects with explicit owl:differentFrom statements to differentiate them. Moreover, it is only possible to work with a set of objects which are all differentiated from each other. To confuse things, there can be more than one such set, with intersections between the sets. Each set has to be considered completely separately, and it is only sets larger than the cardinality constraint that will count.

I tried to post the iTQL query for this yesterday, but that was mostly to think the syntax through. It isn't all that difficult to find the owl:maxCardinality restrictions, and to conjoin this with the set of nodes found in owl:differentFrom statements. The trick is that the owl:differentFrom statements have to be selected both ways and joined to the restricted predicate in order to be counted. This needs a constraint like:
  $subject $predicate $node1
  and $node1 <owl:differentFrom> $node2
  and $subject $predicate $node2
Careful use of the count() subqueries should let me pull these sets apart, and find how often the predicate was used on that set.
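As a sketch of the underlying counting problem: for a given subject and predicate, it is the largest set of values that are pairwise owl:differentFrom each other that must be compared against the maximum cardinality. A brute-force version (fine for the small value sets a single subject/predicate pair produces; the names are illustrative, not Kowari code):

```python
from itertools import combinations

def max_mutually_different(values, different_pairs):
    """Size of the largest subset of values that are pairwise
    owl:differentFrom each other. Brute force over subsets, which is
    acceptable for the small object sets involved."""
    diff = {frozenset(p) for p in different_pairs}
    best = 1 if values else 0
    for r in range(2, len(values) + 1):
        for subset in combinations(values, r):
            # Every pair in the subset must be explicitly differentFrom.
            if all(frozenset(p) in diff for p in combinations(subset, 2)):
                best = r
    return best
```

A maxCardinality restriction of N is then only demonstrably violated when this count exceeds N, since any values not known to be different could later be declared owl:sameAs.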

I can't test this approach just yet, as I'll need to build some good test data first. That will take a little while, as I need to get through my current tasks first.

Blogspot Lag
What's up with Blogger today? It is taking over 5 minutes for a page to come in, and often timing out (with a 500 at the server end). Blogs are quick enough to download, but the pages to edit them are unusable.

Are the administrators using the holiday to do some work on the server? I didn't see any warnings about this if they are.

Recent Blogging
OK, so it's later and I'm back to rewrite what I talked about last time.

I haven't been here for the last week for several reasons. The first is that virus I complained about recently. It hung around for a little while, and was very annoying. It wasn't enough to stop working, but I was very tired at night, and went to bed early every night instead of blogging. More sleep was probably better for improving my effectiveness at work than recording everything anyway.

At the same time, my parents took Anne and me to North Stradbroke last weekend, and now we're out at the family farm for Easter (no broadband out here, just 128kb ISDN). I've had the notebook with me the whole time (very sad, I know), but I've tried to use the time for working, and not blogging.

Since I've been doing a lot, I'll just try to mention the more important points from the last week.

I've also been writing some long posts, with lots of sub-sections. I'm now thinking that it would be useful to break these up into several smaller posts, with one topic per section. At the least, it will reduce the risk of losing a lot of typing. I'll try it today, and see how it goes.

I just spent several hours catching up on the last week. I hadn't finished, but I decided to publish anyway, and was just proof-reading it all before sending it in. I was only part way through when I left for a few minutes for a snack. When I got back I had a very attractive screen telling me that I had to reboot. I've rarely seen this on my Mac before, and on each occasion the notebook was extremely hot to touch. Perhaps a chip failed through heat? There's not a lot that an operating system can do about that.

So I've just lost a couple of thousand words. I'm a little frustrated. I haven't experienced this in years, as modern applications regularly save their position. However, the proliferation of web interfaces has led to a reversion in technology. I'm now thinking that I should do all of my editing in a text pad (that will auto save) and just use a web browser at the last moment in order to publish.

Hmm, maybe I should learn Firefox, just so I can put in a patch that will autosave text fields (particularly large ones), and re-load them on the next execution. The file can then be deleted as soon as a page is submitted.

On the other hand, I have so much to do at the moment, learning a new system just doesn't seem like an option.

Just so I don't forget, the things I wrote about were:

  • The problems with determining OWL cardinality in the open world model, with consideration given to owl:sameAs and owl:differentFrom.
  • Speculation on some specific cases with rdfs:range.
  • Difficulties in performing the subtraction operation, including some suggestions for making it possible to compare different RowComparators.
  • The problems where sort(Tuples) does not skip the operation when sorting is not needed. This needs to be fixed, and when it is, the join and subtract operations will both be improved.

I also need to write about bugs with variable predicates in the transitive function, the new insert/select bug, and a proposal for a simple open source semantic extractor (an important component that seems to be missing in the open source world - why is that?)

Anyway, I've been at this for hours, and I need a break. I'll work on it later.

Wednesday, March 16, 2005

Catching Up
Just a quick note to say that, yes, I am still working. Today is Anne's birthday, so last night I was getting a few things ready for it, and tonight we went out for dinner. It just makes it difficult to get any writing done at night. We're also going away this weekend, so I can't catch up then. Looks like Monday will be a late night of writing. :-)

Since it's now late, I can't really go through anything in detail. So I'll just give an overview and fill in the detail later.

I was still working through the join code yesterday. I hadn't picked up all of it, but I learnt enough to work out what was needed for the difference code, and have made a design for it.

I've decided not to order the left hand side of the operation, which is a compromise on complexity, but there are good reasons for this decision. It comes down to the fact that differences like this are rarely on large enough data sets for sorting to save any time, and because data to be removed is typically sparse, and not in consecutive node IDs.

I also got to chat with Simon in person today, and I've finally worked out the last missing pieces of the join code, to the extent that I can work out the complexity of the operation (a two operand join comes to n.log(N), where n is the size of the first operand, and N is the size of the database). The implementation of the difference code will have the same complexity, so this seems reasonable. It may be possible to improve this for large enough data sets if another algorithm is used, but it will want some heuristics to work out the changeover point between the two methods. For now, the current design should work just fine.

While I'm discussing optimisations like this, it may also be possible to improve certain types of joins for large enough data sets, but again it will require some heuristics to work out the changeover point between the two algorithms. There shouldn't be a need to go this far until we see really poor performance on our joins (which we've never seen).

So I now understand joins, I have a solid design for the difference operator, and I've written part of the class to do this.

Data Retrieval and Logic
Today the department held the first part of a two part seminar on data retrieval. Today's seminar covered indexing, and query resolution. It was amazing just how much of it was like learning how Lucene works, although there were a couple of new concepts.

The other part of today covered bi-lingual and multi-lingual querying. The assumption is that all documents are indexed according to words. This means that the best way to do a multi-lingual query is to translate the query into other languages, and try to match to documents, rather than indexing different translations of the documents.

However, since I've been looking at semantic analysers recently, I started wondering about indexing on concept (an idea that was mentioned, but not pursued).

The whole thing gave me an idea for a simple document concept analyser, based on using groups of words in a thesaurus as a pseudo-concept. It has its problems, but there are almost NO open source concept analysers out there, so maybe it would be worthwhile implementing anyway. If it's useful then people might even be prepared to work on it to overcome the deficiencies. :-) I'll have to describe it soon.

After the seminar I was going to catch up with Simon for lunch, but it turned out that he had a "Logic Group" meeting with his supervisor, Guido. I need to improve my logic (for the theoretical component of the work I'm currently doing), so I asked Guido if I could sit in. I found it really valuable, as it answered a number of questions I have about notation and some fundamentals of logic syntax.

With the exception of a short discussion with Simon on the operation of "join", the rest of the day was spent on code.

Monday, March 14, 2005

Today was one of those days where you feel like you've been working hard all day, but then there is very little to report.

Today I was looking at the join code... again. I've pulled the file aside, and I've been annotating it with comments. I'm tempted to check this back in, as comments for other people to read. However, I'm not sure that is appropriate, as I've over-commented the code, and the code itself is likely to get lost in the noise.

Either way, I understand it a little better now. Once again, I was fortunate enough to have some input from Andrae as I went through a part of it.

I think I should have a better understanding of this code for a couple of reasons. The most pressing is that subtractions will use very similar operations, though the more I read the more I doubted this. Funnily enough, now that I know a lot more, I'm back to believing this again.

The next best reason is because until now, only Andrae and Simon have understood this part of Kowari. I understood the architecture, and the principles in use, but this is a far cry from being able to follow the code. Indeed, even knowing what the code was trying to do, I still needed to annotate it extensively before I could follow any of it.

Having another person understanding this code makes it more likely to be maintained. It also means that there is yet another person who can add features to it in the future, if required.

The first thing I noticed yesterday was that Andrew had refactored the UnboundJoin and IndexedJoin code, moving all of their common code into a base class JoinTuples. At face value this makes sense, but on further investigation it was a waste of his time:

$ find src -name \*.java | grep -v Test | xargs grep -l IndexedJoin
The command returns nothing: this class is not used anywhere (except in the tests). I believe it became redundant due to the big join refactoring that went on last year.

This refactor is in now, so I guess it doesn't hurt to leave it. Maybe one day I'll refactor it back the way it was. :-)

The most striking thing about this operation is that there is no sorting done on any of the input data. This stunned me, as it means that the algorithm is guaranteed to be inefficient when the data is unsorted. I asked Andrae about this, and it turns out that the answer isn't as simple as I'd have thought.

The TuplesOperations.join() method does all of the work of setting up the parameters for the UnboundJoin constructor. This is so the join code can just do a join, and not worry about optimising its input. I've discussed a lot of this set up work already in the last couple of days.

The last operation in the setup is to call the sortOperands() method. This method does two things. It sorts the constraints according to their result size (which is relatively cheap to obtain), from smallest to largest. This results in fewer iterations during the join. The other thing this method does is to re-order the results of a Tuples, using the defineIndex() method on the Tuples. This is only available to Tuples which are the direct result of a constraint, and are therefore based directly on an index. The idea is to use a different index once some variable bindings have been established on other arguments in the join, and a different index is therefore available.

The result is that the operands for a constraint are usually always in the required order for the join to proceed efficiently. The only way this will not happen is when one of the operands is not based directly on an index, but is the result of an append operation.

While this situation is plausible, it almost never happens in practice. Typically, a query is built as a sum of products, and the products are the joins. While it is possible to perform an append somewhere down the line, and join its results, this just never seems to come up.

If it becomes a problem then there would be two ways around it. The first is to reorganise the query into a sum of products form. This would perform some joins redundantly, but quickly. The second way is to find any compound parameters in a join, and wrap them in a HybridTuples for re-ordering. This would wear an initial cost, but would end up being much faster during the join operation.

An appropriate feature to add in this situation would be an algorithm choice based on complexity analysis of the unordered join vs. preordering. Plugging in some of the numbers obtained from tuples sizes would tell the algorithm which way to proceed.

Fortunately, there are no use cases for this at the moment, so there is no pressing need to perform any of this.

The details of the join implementation have definitely provided me with a lot of clues as to how I should go about the subtraction procedure. I still have a few things I want to sort out, so I'll probably work on this a bit overnight before trying to write up a description.

Saturday, March 12, 2005

It's overdue, but I should write a short post about Friday. The idea was to take this day off to get over the bug, but I spent a bit of time looking at TuplesOperations.join anyway. In particular, I was fortunate enough to have Andrae explain what "unification" meant.

The term comes from Prolog interpreters, and is used to simplify certain joins. If one of the constraints in a join returns just one row, then it can be the subject of unification. In this case, a constraint which contains some of the variables from the constraint with the single row solution can be modified. The constraint can have its variables set to this pre-calculated value, and looked up again.

As an example, consider two constraints A and B, where A looks like:
(<ns:foo> <ns:bar> $x)
and constraint B looks like:
($x <ns:pred> $y)

Now if A were to result in just one line:
<ns:foo> <ns:bar> <ns:obj>
Then B can be modified to look up:
(<ns:obj> <ns:pred> $y)

The result is that A can be dropped, and B can be re-evaluated (trivial), rather than having to go through the more expensive operation of joining A to B.
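As a sketch of this optimisation (the class and method names here are mine, not Kowari's), the substitution step looks something like this:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the unification step described above. If constraint A resolves
// to a single row, its variable bindings can be substituted directly into
// constraint B, so the join of A and B collapses into a simple
// re-evaluation of the modified B.
public class UnifySketch {

  // Substitute any bound variables into a constraint, represented here as a
  // simple {subject, predicate, object} array of strings, where variables
  // start with '$'.
  public static String[] unify(String[] constraint, Map<String, String> bindings) {
    String[] result = constraint.clone();
    for (int i = 0; i < result.length; i++) {
      if (result[i].startsWith("$") && bindings.containsKey(result[i])) {
        result[i] = bindings.get(result[i]);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // A = (<ns:foo> <ns:bar> $x) resolved to the single row <ns:obj> for $x.
    Map<String, String> bindings = new HashMap<>();
    bindings.put("$x", "<ns:obj>");

    // B = ($x <ns:pred> $y) becomes (<ns:obj> <ns:pred> $y).
    String[] b = {"$x", "<ns:pred>", "$y"};
    System.out.println(String.join(" ", unify(b, bindings)));
  }
}
```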

The only trick here is that the first column of B must still be labeled "$x". The label is needed for any subsequent joins, and for the select clause to get to the value. The iTQL way of setting a variable to a fixed value is with the magical predicate kowari:is:
($x <kowari:is> <ns:obj>)

So the equivalent iTQL for this is:
($x <kowari:is> <ns:obj>) AND
($x <ns:pred> $y)

This dredges up some issues I have with some of the syntax in iTQL. I might just mention them here, so I have a reference point in future.

The kowari:is syntax is a "neat" way of doing it, in that it fits iTQL trivially, but it is a syntax I dislike. The reason is that it isn't a "constraint" in the usual sense, but actually changes other constraints in the query. I'd have preferred a syntax which makes it easier for a user to understand what they are doing, rather than fitting in with the pre-existing syntax. The kowari:is syntax is also painful to work with in the query layer, as it gets parsed as a constraint, and gets put into the constraint expression as an individual entity. This means that these magical predicates have to be found, and their effect passed on to all the other constraints that they are relevant for. I don't think this algorithm is particularly elegant.

There are several other approaches which could have been tried. The first to come to mind is simply:
($x=<ns:obj> <ns:pred> $y)

This has the advantage of not introducing an AND when no join operations are going to occur. It can be argued that this is not a consideration: after all, the unification optimisation can result in a join being dropped when an AND is present. However, that is an implementation optimisation, and the join operation can occur, depending on the data. The kowari:is predicate on the other hand cannot result in a join.

On the other hand, this syntax has the disadvantage of getting lost in all the other constraints which include the variable $x. It would be redundant to set the variable for each constraint, but if you don't then this syntax would have one constraint modifying all of the others: a situation I would prefer to avoid.

So this kind of syntax would simplify some queries, but would possibly be confusing for others. It may be possible to create something that is distinct from a constraint and which applies to all of the constraints, but then the kowari:is constraint almost does that anyway (if you're willing to accept that it is not a normal constraint). There are always tradeoffs. I won't be changing it any time soon, but I certainly won't be averse to having a go if someone comes up with a better idea.

Thursday, March 10, 2005

I'm writing up Thursday's news and views on Friday morning because I couldn't really do much typing last night. I've picked up a stomach bug, which has slowed me down a bit (the joys of having a child in daycare), so I probably won't get a lot done today. That's not so bad, because over the last few weeks I've been working late every day, and a lot on weekends as well... a point that Anne has been making recently. :-) A day to recuperate might be nice.

Subtractions and Joins
The join code caused me a lot of angst yesterday. I haven't worked in this layer very much, and not for a long time. So even though I understand the principles it was unfamiliar. It turns out that it is unfamiliar for Simon as well, though he originally wrote the join code.

Since the principles of a subtraction are very similar to those of a join, I thought I should start there. This starts with TuplesOperations.join(). Of course, none of this is documented in the least, so working through it is a bit of a chore. I'll start with a cut down listing:

public static Tuples join(List args) {
  List operands = flattenOperands(args);

  List unified = unifyOperands(operands);

  List sorted = sortOperands(unified);

  switch (sorted.size()) {
    case 0:
      return empty();

    case 1:
      return (Tuples)sorted.get(0);

    default:
      Tuples result = new UnboundJoin((Tuples[])sorted.toArray(new Tuples[0]));
      return result;
  }
}
This has all the debug and exception code removed.

The function takes a list of Tuples. This takes advantage of the associative property of AND, so that groups of AND operations like (a AND b AND c) can be calculated as a single operation, rather than the pair of operations ((a AND b) AND c).

The flattenOperands() method deals with the associativity. I wasn't initially sure what the "operands" were supposed to be, though I guessed that they were the incoming Tuples in the args list. To confirm this I looked at the method:
private static List flattenOperands(List operands) throws TuplesException {
  List result = new ArrayList();
  Iterator i = operands.iterator();
  while (i.hasNext()) {
    // flatten each operand and add the results to the flattened list
    result.addAll(flattenOperand((Tuples)i.next()));
  }
  return result;
}
So it iterates through the argument list, and flattens each Tuples, before adding it to the flattened result. OK, so what does flattening a Tuples do?
private static List flattenOperand(Tuples operand) throws TuplesException {
  List operands = new ArrayList();
  if (operand instanceof UnboundJoin) {
    // an UnboundJoin contributes each of its operands individually
    Iterator i = operand.getOperands().iterator();
    while (i.hasNext()) {
      operands.add(i.next());
    }
  } else {
    operands.add(operand);
  }
  return operands;
}
So if the Tuple is an UnboundJoin then each of its "operands" is added to the flattened list; otherwise the original Tuple is added.

So what is a Tuple's list of operands? The Tuple argument for this function is called operand, so are a Tuple's operands other Tuples which are related to it in some way? Checking out the Tuple interface tells us:
/**
 * Return the list of operands to this tuples.
 * This is intended to allow debugging traversal of an unevaluated tuples hierarchy.
 * Be aware that the tuples returned from this method are not cloned, and should
 * be considered immutable.
 */
public List getOperands();
Of course, the word "operand" is not mentioned anywhere else in the interface.

At this point I could either go hunting for the implementations of getOperands, or make a guess based on what I know of the system, and check it with someone online... and Simon happened to be online.

We have a few Tuple types which are virtual, instead of being "materialised" on disk. These Tuples are based on other Tuples, and perform merges dynamically. For instance, the UnorderedAppend simply holds its component Tuples, and while being iterated over, it will internally iterate over each component Tuples in turn. Joins work similarly to this, but they require the underlying Tuples to be ordered according to the variables being used in the join, so they can be iterated upon appropriately. If the Tuples are not in the correct sort order, then they have to be materialised (i.e. given their own internal structure that represents the data, rather than relying on other Tuples to do it for them; if this is large enough it gets pushed to disk) and then re-sorted.

I reasoned that a Tuple's "operands" might be the underlying Tuples which a virtual Tuples uses. This makes sense when viewed in the context of the flattenOperand method, as it asks for the operands, and adds them to the list, instead of the original Tuple. Note also that the operands of a Tuple are only retrieved when the Tuple is of type UnboundJoin. In other words, it only flattens the operation for joins, and not appends.

Appends could also be flattened in this way, and they possibly are, but this code is specifically for flattening out joins. Any equivalent code for append operations must be elsewhere.

I asked Simon about all these operations, and he wasn't sure either, but he agrees with all of the above, and offered a couple of pointers for me along the way.

While struggling through it all, Simon pointed out that I'm looking at the most obscure code in the system. Any documentation that it has is most likely out of date, and method names are often unintuitive, so it can be very difficult to decipher. At least I don't feel like an idiot after he told me that.

Unifying and Sorting
The next method in join() is unifyOperands(), followed by sortOperands(). The first one I don't quite get. Simon's explanation seemed to involve a partial resolution of the join, and I know that is supposed to happen later on (if Andrae ever reads this then perhaps he can put a comment in below to explain it). However, the sort operation is a part of the optimiser. This method puts the list into an order that will result in the minimal processing of Tuples. For instance, if the first Tuples has 1000 rows, and the second has a matching 1000 rows, and the third has only 2 rows, then the join will involve matching the first 1000 against the second 1000, and then trying (unsuccessfully) to match against the remaining 2 rows. If the order were reversed, then when the first 2 rows are matched against the 1000 rows, the result is likely to be very small, maybe 10 rows. This can then be matched against the next 1000 row tuples to come up with the final result.
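To put rough numbers on this, here is a toy cost model of my own (purely illustrative, not the actual optimiser's arithmetic): charge each pairwise join left × right comparisons, and assume the stated intermediate result sizes.

```java
// Toy illustration of why operand ordering matters in a multi-way join.
// The cost model (left * right comparisons per join, with assumed
// intermediate result sizes) is my own invention for this example.
public class JoinOrderSketch {

  // Total comparisons for joining the operand sizes left to right, given
  // the size of the intermediate result produced after each join.
  public static long cost(long[] sizes, long[] intermediates) {
    long total = 0;
    long current = sizes[0];
    for (int i = 1; i < sizes.length; i++) {
      total += current * sizes[i];        // comparisons for this join
      current = intermediates[i - 1];     // rows carried into the next join
    }
    return total;
  }

  public static void main(String[] args) {
    // 1000 x 1000 first, then the 1000-row result against 2 rows:
    System.out.println(cost(new long[]{1000, 1000, 2}, new long[]{1000, 10}));
    // 2 x 1000 first (assume a 10-row result), then against 1000 rows:
    System.out.println(cost(new long[]{2, 1000, 1000}, new long[]{10, 10}));
  }
}
```

With the small operand last, the work is over a million comparisons; with it first, only twelve thousand.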

This ordering of tuple joins has a significant impact on performance. DavidM and I built a heuristic analyser that found the most common cases and optimised for them, but we needed time that we were never given to build a really good optimiser. A couple of years later Simon refactored it, and improved the system significantly. More recently, Andrae has cleaned it up, resulting in what we have today. Apparently, the unifyOperands method is involved with this process, but I don't quite understand how.

However, all of these operations are for optimising joins, based on two properties of join operations: joins are commutative and associative. Neither property is applicable to a difference operation, so they can all be skipped.

Next, the join goes through the switch statement. This part is actually pretty easy. If there are no items in the list of tuples to be joined, then the result is empty. I believe that this could probably be replaced with an assertion, as I don't think it is possible to be trying to "join" nothing. Similarly, if the list is only one item long, then that item can be returned with no further processing. Again, I wonder if it is even possible to get here when a single argument is passed through.

Finally, if there is a valid list of Tuples to be joined, then these are passed off to the constructor for an UnboundJoin object. This is the class that does the actual joining, and all of the preceding code simply optimises its arguments so it will be executed as efficiently as possible.

I've started to go through this class, but I'm despairing of being able to borrow the code I'd hoped to use. It really is very specific to joining. This makes sense because of the optimisation work that has gone into it. At least it gives me some ideas for working on the subtraction code.

It's a good thing I budgeted a good amount of time here.

Wednesday, March 09, 2005

Expanding and Resolving
I started out today trying to implement the ConstraintDifference class, but from the outset I ran into difficulties.

A couple of months ago Andrew implemented the visitor pattern for constraint resolution, but this had a few problems. The most obvious hassle was that every constraint type needed its own methods declared for it. This is a maintenance problem, and does not deal with new constraints very well. While not an issue at the time, it also can't deal with runtime additions to the constraint system.

Regardless of the problems, this is how the system was built, and I had to make ConstraintDifference work with it. I started looking at the dispatch functions, but this is where things went a little strange. While some of the classes full of functions were there, I couldn't find everything I expected. In particular, I could not find anything calling the accept() method for any of the constraints.

I was able to write a little bit of code to work in with what I could find, but after spending hours pulling my hair out I resorted to asking Andrae for help. I was hoping that he could enlighten me, as I felt like I'd forgotten everything I knew about Kowari.

It turned out that Andrae was exactly the right person to ask. His first comment was that the accept() method no longer exists. This didn't make sense to me as it was only introduced late last year, and I'd used it just before Christmas. Also, every class still has this method. Andrae then explained that this method had been refactored out completely.

I expressed dismay about this development when nothing had made it onto the Kowari mailing list, but Andrae said that it all happened while everyone was still at Tucana (and internal development like this didn't usually make it onto the list). I was confused at this, as I had heard nothing about it, so Andrae explained that it happened in the final weeks before Christmas.... which was when I was on leave.

So a major change to the constraint resolution happened, and somehow it slipped between the verbal discussions that were held at Tucana, and the emailed discussions that came about after the company was closed. Hopefully that was a one-off event, and it won't be repeated.

Andrae pointed me to LocalQueryResolver where all the work is being done now. This class uses a map from constraint classes (the actual class object, not an instance) to an inline class which implements a single method called "expand". It is this method that gets called to resolve the constraint.
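A minimal sketch of this dispatch style (all of the names here are mine, not the real LocalQueryResolver code) looks something like this:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of map-based constraint dispatch: the map keys are the constraint
// Class objects themselves (not instances), and the values are small
// handler objects with a single method, replacing the visitor pattern's
// accept() calls.
public class DispatchSketch {

  interface Constraint {}
  static class ConstraintConjunction implements Constraint {}
  static class ConstraintDifference implements Constraint {}

  interface Handler {
    String resolve(Constraint constraint);
  }

  static final Map<Class<? extends Constraint>, Handler> HANDLERS = new HashMap<>();
  static {
    HANDLERS.put(ConstraintConjunction.class, c -> "resolved conjunction");
    HANDLERS.put(ConstraintDifference.class, c -> "resolved difference");
  }

  // Dispatch on the runtime class of the constraint; no accept() needed,
  // and new constraint types just add a map entry.
  public static String resolve(Constraint constraint) {
    Handler handler = HANDLERS.get(constraint.getClass());
    if (handler == null) {
      throw new IllegalArgumentException("No handler for " + constraint.getClass());
    }
    return handler.resolve(constraint);
  }

  public static void main(String[] args) {
    System.out.println(resolve(new ConstraintDifference()));
  }
}
```

One advantage of this shape over the visitor pattern is visible immediately: supporting a new constraint class only means adding one entry to the map.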

I was confused while trying to read this code, as the syntax was all about "expanding" expressions. Since most constraints get expanded before resolution (e.g. conjunctions and disjunctions), I presumed that this was what the code was supposed to do, but I could not work out where constraint resolution was happening.

After a bit more reading, and some frustrated words sent Andrae's way, I finally figured out that this was the resolution code. It was simply a matter of unintuitive names. I explained my difficulties with the name, and Andrae agreed, stating that he had initially intended to perform a bottom-up resolution after performing a grammar expansion, but that this ended up being just the resolution task without any explicit expansion. Hence, the name is misleading, and he was going to change it, but the shutdown of Tucana meant that he never got to it.

With the time I had left today, I decided that the first thing I should do was a minor refactor where I changed all the references to "expansion" into "resolution", while carefully avoiding calling anything a "resolver", due to the obvious name clash. I finished this, and also included the mappings for the new ConstraintDifference class.

Everything is now compiling, but I have only really set aside the space for the difference operator. Tomorrow I hope to make headway into the body of the implementation.

Tuesday, March 08, 2005

I lost a bit of time this morning when I received a phone call from Luc's daycare to tell me he'd had an accident. So I went back to see Luc and make sure he was all right. When I finally got back to UQ there was no parking available, so I had to go home to work.

I had a few things to carry today, so I needed the car. However, I have to ask if it is worth it, given that I spent 2 hours and 20 minutes travelling back and forth over a distance that I can cycle in under 10 minutes. Even if I want to use the bike there is a problem, as the cycle lockup in this building is full, and the rate of bicycle theft around here is too high to leave it chained up outside.

In the end I had to make up the time at night, which cut into my time for doing my confirmation writing. All in all, a frustrating experience.

The majority of my coding was spent on the grammar for SableCC.

I now realise that the various definition levels create the levels of precedence. Hence, the following extract from the definition makes the AND operation bind more tightly than OR:

  constraint_expression =
    {term} constraint_term |
    {or} constraint_expression or constraint_term ;

  constraint_term =
    {factor} constraint_factor |
    {and} constraint_term and constraint_factor ;
The names "constraint_expression" and "constraint_term" are rather arbitrary. However, you can see that an AND-term is built from either a "factor" (constraint_factor) or an AND-term with a factor. Similarly, an OR-expression is built from either an AND-term or an OR-expression with an AND-term.

Going bottom up, this means that the factors will get found first, followed by the AND-terms, followed by the OR-expressions. This gives AND greater precedence than OR, as desired.

I decided to put MINUS below AND, not for any good reason, but because the WHERE clause evaluation could remain unchanged at the top level. I also have an idea that MINUS should be evaluated first, and that any complex statements for either the minuend or the subtrahend should be enclosed in brackets.

(I knew there were terms like "minuend" and "subtrahend", but I couldn't remember them and had to look them up. They may seem pretentious, but I thought they were better than LHS and RHS).

Consequently I created a new level between constraint_term and constraint_factor. Without any inventive ideas, I decided that this must be a "difference" term, so I called it constraint_dterm. After seeing the Java code for this I'm a little less sure of myself, as it is not so easy to see the difference in names when looking at constraint_term vs. constraint_dterm.

My first attempt tried to use a constraint_expression on either side of the MINUS keyword. This failed (and I see why now), with the following error from SableCC:
shift/reduce conflict in state [stack: PSelectClause TWhere PConstraintTerm *] on TAnd in {
[ PConstraintExpression = PConstraintTerm * ] followed by TAnd (reduce),
[ PConstraintTerm = PConstraintTerm * TAnd PConstraintDterm ] (shift)
I saw Simon online, and along with a little bit of help from the SableCC author, I was able to see where the ambiguity was coming from.

I expected that if SableCC saw an expression such as:
A - B - C
Then it would evaluate from left to right, i.e.:
(A - B) - C
However, the compiler sees this as an ambiguity between both possible evaluations:
(A - B) - C
A - (B - C)
If you want left-to-right evaluation then you have to explicitly declare it. This makes sense, as a compiler should have the right to define this.

Anyway, after a couple of attempts I finally got it right. I also have all of the parser code implemented, so now I'm working on the actual difference code.
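For reference, the shape of the working grammar is something like the following (reconstructed from the description above; the exact production names in the real grammar may differ). The left recursion in the {minus} alternative is what explicitly declares the left-to-right evaluation:

```
  constraint_term =
    {dterm} constraint_dterm |
    {and} constraint_term and constraint_dterm ;

  constraint_dterm =
    {factor} constraint_factor |
    {minus} constraint_dterm minus constraint_factor ;
```

With this arrangement, A - B - C can only be derived as (A - B) - C, so the ambiguity goes away.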

Monday, March 07, 2005

Today was a bit of consolidation, going over the RDF I've already built, and tweaking it a little. I also worked on the Java/iTQL code, though I'm still trying to do more with iTQL, and less with Java.

One question at the moment is whether I should create a recursive Java function to go down the Constraint trees, or use the iTQL walk function. I'm loath to execute queries indiscriminately, but since this part of the system does not necessarily need to scale, it may not hurt. There is only a little to be gained by trying to expand queries in iTQL, but since I'm considering scalability at every other level it seems strange not to do it here.

While I've experimented with some use of walk I'll probably just continue with the Java code for now.

My main problem is actually the lack of a difference operator. For instance, when I select all of the queries' selection variables I would start by finding all the properties of the selection clause:

select $rule $select_clause $predicate $object
from <rmi://localhost/server1#rules>
where $rule <rdf:type> <krule:Rule>
and $rule <krule:hasQuery> $query
and $query <krule:selectionVariables> $select_clause
and $select_clause $predicate $object;
For rule rdfs1 this returns:
[ krule:rdfs1, _node99, rdf:type, rdf:Seq ]
[ krule:rdfs1, _node99, rdf:_1, krule:var_a ]
[ krule:rdfs1, _node99, rdf:_2, krule:ref_type ]
[ krule:rdfs1, _node99, rdf:_3, krule:ref_property ]
Note that while this is an actual result, I've abbreviated the domains. Hmmm, that might be a useful feature to automate, at least for the iTQL shell. I'll have to think about implementing that.

To get the selection elements, all that is needed are the nodes referenced by the sequence predicates. In this case there are two solutions. The first is to create a resolver that will let me just select those predicates. I expect to implement that code shortly. The other option is to just remove the rdf:type statement using a difference operation. That wouldn't work for every circumstance, but it will here, since the sequence will have no properties other than rdf:type and rdf:_#. It is also essential for certain other operations, so this would seem to be a reasonable trigger for me to implement it now.
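As a sketch of the intended semantics (my own illustration, not Kowari code): the difference keeps each row of the minuend that has no subtrahend row agreeing with it on their shared variables. Using the selection-clause rows from the query above:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class DifferenceSketch {

  // Rows are variable->value maps. A subtrahend row matches a minuend row
  // if they agree on every variable they share. (Note: a subtrahend row
  // sharing no variables with a minuend row would match it vacuously here;
  // a real implementation needs to decide how to treat that case.)
  public static List<Map<String, String>> difference(
      List<Map<String, String>> minuend, List<Map<String, String>> subtrahend) {
    List<Map<String, String>> result = new ArrayList<>();
    for (Map<String, String> row : minuend) {
      boolean matched = false;
      for (Map<String, String> sub : subtrahend) {
        boolean agrees = true;
        for (Map.Entry<String, String> e : sub.entrySet()) {
          if (row.containsKey(e.getKey()) && !row.get(e.getKey()).equals(e.getValue())) {
            agrees = false;
            break;
          }
        }
        if (agrees) { matched = true; break; }
      }
      if (!matched) result.add(row);   // keep rows with no matching subtrahend row
    }
    return result;
  }

  public static void main(String[] args) {
    // The four ($predicate, $object) rows returned for rule rdfs1:
    List<Map<String, String>> minuend = List.of(
        Map.of("$predicate", "rdf:type", "$object", "rdf:Seq"),
        Map.of("$predicate", "rdf:_1", "$object", "krule:var_a"),
        Map.of("$predicate", "rdf:_2", "$object", "krule:ref_type"),
        Map.of("$predicate", "rdf:_3", "$object", "krule:ref_property"));
    // Subtract every row where $predicate is rdf:type:
    List<Map<String, String>> subtrahend = List.of(Map.of("$predicate", "rdf:type"));
    for (Map<String, String> row : difference(minuend, subtrahend)) {
      System.out.println(row.get("$predicate") + " " + row.get("$object"));
    }
  }
}
```

Only the three rdf:_# rows survive, which is exactly the set of selection elements wanted here.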

A difference operation can be achieved with a subquery, but at that point all operations become tuple-at-a-time. This is provably less efficient. If I had to resort to a tuple-at-a-time solution, then I'd be better off saving time by putting a logic engine into Kowari and be done with it.

The RDF for the rules engine can be queried without scalability concerns, but since OWL needs this as well, I really do need to get it implemented. A compelling, though less vital, reason is the fact that the subquery syntax is very obtuse.

Like AND and OR, the difference operator will need an infix notation. For that reason I shouldn't use a word like "difference". Saying "A difference B" just doesn't sound right. Another alternative that was proposed by DavidM was NAND as it is a little like a constraint conjunction with the inverse set of the second operand. However, I don't like this, as NAND is usually used to describe a commutative operation, while the difference is noncommutative.

For these reasons I'm thinking of just using the word "minus". That way it will read quite naturally.

After some of the previous debate, I'm almost expecting opposition to my implementation of this. Some people seemed to have a problem with the idea when it was first proposed, as I think that they believed it was some kind of hack. This was why I decided to formalise the operation algebraically back in October. Even with an acceptance of the need for the operation, I just know that there will be a problem with the syntax.

Fortunately I'm now working "independently" on an open source project. If no one else wants to have a go with this implementation, then I get to decide how it is done. :-)

If someone else decided that they really hated my implementation, then they can always rip it out of Kowari, and leave me alone to run my patch. I doubt that would happen though, because the functionality is needed. OWL won't work without it, and I doubt anyone will intentionally break something just because they disagree with syntax. There is also no harm in leaving something (useful) that other people don't use. They won't be using it if they don't like it. The biggest change I can see would be if someone thought they had a better way, and then implemented that... but that would take someone willing to make the effort, and no one else has expressed an interest in that yet.

Anyway, I've been pulling out the code to ConstraintConjunction and perusing that, with the idea of using it as my template for a ConstraintDifference class. At the lowest level, I believe that it is simply a ConstraintConjunction with an inversion on the σ operator (selection). There are some extra efficiencies which can be gained, but I think they are almost the same as for an inner join, ie. a ConstraintConjunction. I'll have a go tomorrow and see how it goes.

These things always take longer than you expect, so I may well end up working on this all the way up until next week. Even if I don't, I'll still need to spend a day or two writing a complete set of tests for it.

I didn't get to write any of my confirmation of the weekend, due to family commitments. I'm about to start on that now. Hopefully I'll have something to show for it by morning.

Friday, March 04, 2005

Databases 101
I had a very interesting day today, though unfortunately I'm not able to blog all of it. I'll probably say something about this in a few weeks time.

In the meantime, I saw Bob in the lift this morning, which reminded me to return his book, plus a past thesis I'd borrowed from him. On my way up I met someone else who'd obviously been riding a bike. I asked where his bike is kept, and he explained that there is a lockup at the bottom of the building. Given the traffic and bus situation getting into the university, this might be a good option for me. I'd been worried about bringing a bike, as people will often cut security chains, and I really wouldn't want to lose my bike.

On returning Bob's book, I commented on how the structure of his chapter followed the history of the design (though not always the implementation) of Kowari. When first designing how a query should be evaluated, we essentially had a top-down evaluation pattern, but since Kowari constraints resolve sets, it was actually implemented as a bottom-up solution. Several of the efficiency techniques described there have been applied in the query engine over the years, with semi-naïve evaluation coming about partly in the query layer, and more recently in my own rule engine.

Bob suggested that I consider another chapter which describes magic sets. I've read about these, but only in generalities, with no one explaining how to implement them in practice. Bob explained that the implementation is rather easy, with the system creating temporary assertions which are used for further evaluations.

This is exactly what Andrew and I were discussing in order to make OWL reasoning more efficient. It would seem that I am doomed to reinvent the wheel, though in the context of a new type of database. At least some of my work has been original, though less of it than I initially thought. I guess this is no surprise, as I'd been pushing for more people at Tucana to read the literature on this and knew that I had to read more myself. It's just a shame that some of our "commercial realities" prevented the time for reading to be assigned, as it would have saved us in the long run.

One good thing to come from all of this is that all of the most advanced methods described in the literature have been independently implemented by us. This tells me that we were making the right decisions as we went. While we may have saved a lot of time by reading the literature and trying to implement what we found there, at least this way we really know what we are doing. Also, there were no guarantees that we would have converted the established techniques correctly, given the unusual nature of the Kowari structure.

OWL Discussion
Bob also wanted to ask me about blank nodes in RDF. I don't know if my explanation was adequate, but I gave it a go. My understanding of their need basically comes down to two things.

First, blank nodes are essentially an artifact of RDF/XML (and I think I've mentioned in the past how I feel about that particular encoding scheme). When writing XML it would be annoying and wasteful of space to have to find a unique name for EVERY unique tag in the document. Having the parser build some automatic identifiers (for internal use) for nodes which will not be re-used makes quite a bit of sense in this context.

The second reason comes down to my own experience with blank nodes. While every node in Kowari requires its own ID, only URI References and Literals take up space in the string pool. If every node were named, then the size of the string pool would blow out unmanageably. In fact, every possible implementation of an RDF database would also suffer from this. If you use unique names for every node, then these names will have to be stored somewhere.

There may be other reasons mentioned in the literature, but I haven't seen them yet.

This discussion led to the process of merging blank nodes, which everyone seems to agree is a bad idea (except when the same document gets re-loaded, but even then there is debate). The accepted method here seems to be to use the owl:sameAs predicate between the blank nodes to do the work.

That then led to a question of how to treat nodes in RDF which were declared to be the same as each other. Apparently, this question has come up on some UML mailing lists recently, and it demonstrates some of the problems of bringing pre-conceptions into a new domain.

The question started out with an assertion of two classes, A and B, and instances of each of these, a and b respectively. Class A declares n properties: Pa1...Pan, and class B has m properties: Pb1...Pbm. Instances a and b have assertions for each of these properties (in other words, a has n specific properties, which are all described by its class A. Similarly for b).

So what happens when we assert the statement: <A> <owl:sameAs> <B> ?

Some of the UML people expected that the properties of b should be copied onto a, and vice versa. I have no idea why they would have thought this, but anyone familiar with RDF will know that this is wrong. All m property declarations for B should be copied to A (and vice versa), but this does not affect the properties of any instances.

The main difference for the instances is that they each end up with new types:
<a> <rdf:type> <B>
<b> <rdf:type> <A>

After this is done, both a and b will be able to have properties of Pa1...Pan,Pb1...Pbm. This was already legal under the open world assumption of RDF, but now the properties are completely described.

UML has numerous diagrams as a part of its standard, including the class description and the component diagrams. However, about the only things I ever see discussed or presented are class descriptions, particularly when UML is being compared to OWL. Perhaps it was this narrow view which led to this confusion between ABox and TBox data.

Kowari Background
I took up Janet's offer to come back for coffee this morning, but I discovered that no one was there today. So I returned to my own building for a coffee.

While lining up I met one of the staff members, named Richard Cole. He asked what I am researching, and when I told him, he asked if I knew about "Kowari". Isn't it a small world?

Richard has a project which measures how "connected" classes in a program are. He would like to store a lot of this data using RDF. Until this week he had been using his own implementation of an RDF store, but he discovered that the complexity was getting too much, and went looking for existing implementations. I don't know his exact selection criteria, but he ended up settling on Kowari. A fine choice. :-)

Richard had several questions about how Kowari works, and how to implement his data structures. I was only too pleased to go through this with him. As is usually the case, describing anything helps me to clarify the concepts in my own mind, so I find this process useful. That's one of the reasons I try to describe so much in my blog.

Speaking of blogs, Richard told me about a number of things he had been learning about RDF and Kowari from someone's blog that he had found. It was only later in the day that I discovered that the blog he was talking about was mine. :-)

I like to encourage people who are using Kowari or RDF. This has two effects. The first is that Kowari gets used more, which means RDF is used more. The second effect is that RDF is being used more, even when a non-Kowari system is in use. This serves to expand the Semantic Web, giving it more functionality, and getting it closer to the mainstream. I'm interested in this, because the resulting technologies are very useful (both because they are more fun, and they make things easier) and because the more semantic web applications there are, the more Kowari will be used.

So the more Kowari is used, the more RDF is used. The more RDF is used, the more Kowari is used. See? It all comes a full circle. :-)

Today's lunch was spent attending the BioMOBY presentation by Mark Wilkinson that I described yesterday. I found the whole presentation to be quite interesting.

Bioinformatics suffers from a large number of databases which contain incompatible data. This data is all available on the net, but to take information from one database and compare it to data in another database is a difficult manual process, due to the disparity in the data formats. It can also be very difficult to find the type of data needed from all of the databases which are out there.

BioMOBY does a few things. First, the data from each database gets wrapped in some XML which describes what that data is. Note that the data is not pulled apart, so it would still need to be parsed by any system looking at it. However, this "wrapping" technique makes it relatively easy for a biologist to encode. Ease is important, as the participation of database owners is an important factor in this project.

The XML which wraps the raw data also provides a lot of metadata about the payload. This makes it possible for a system to recognise the type of data, where the data came from, what it might be transformable into, along with several other features, depending on the data.

Once a configuration is in place, and a server is capable of handing out this data, it then gets registered with the main BioMOBY servers. These servers act as a Yellow Pages index for all registered databases.

A BioMOBY client can query the index, asking for a particular type of data. The index will return a list of servers, and the types of data available from them. The client will connect to one of these, retrieve the information, and then ask the BioMOBY directory what it can do with this retrieved data. The index will respond with a number of operations along with servers which perform them, and again the client can select one of these from the list.

The consequence of this is an interface where the user can start with one type of data, and seamlessly move through various transformations, picking up relevant data on the way. The significance is that this is done over a series of unrelated servers across the net. In other words, it is a semantic web application.
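
The discovery loop described here could be sketched roughly as follows. This is a toy illustration only: the index contents, service names, data types, and function names are all invented, not BioMOBY's actual API.

```python
# A toy illustration of the BioMOBY-style discovery loop described above.
# The index contents, service names, and data types are all invented;
# the real BioMOBY API differs.

# The "Yellow Pages" index: which servers provide a type of data, and
# which operations (on which servers) can consume that type.
INDEX = {
    "DNASequence": ["genbank-mirror", "embl-mirror"],
}
OPERATIONS = {
    "DNASequence": [("translate", "translator-service")],
    "ProteinSequence": [("blast", "blast-service")],
}

def find_servers(data_type):
    """Ask the index which servers can hand out this type of data."""
    return INDEX.get(data_type, [])

def find_operations(data_type):
    """Ask the index what can be done with data of this type."""
    return OPERATIONS.get(data_type, [])

# A client session: start with a data type, find servers for it, then
# discover what the retrieved data can be transformed into next.
servers = find_servers("DNASequence")
next_steps = find_operations("DNASequence")
```

The point of the sketch is the shape of the interaction: the client never needs prior knowledge of the servers, only of the index.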

The key to making the data interoperate is describing it all with ontologies. The ontologies are all quite complex and broad, and have been built completely by the biology community. It makes sense that this was done by the community rather than the project itself, as databases can only be added when the owner writes the ontology to describe it, and then registers the database with the BioMOBY index.

The most surprising aspect of this project was that it was not using RDF. The model they use is a graph of nodes with directed, labeled arcs. Even the XML syntax duplicates a significant amount of RDF functionality. Similarly, the ontologies are in their own vocabulary. So I was surprised that they did not try to leverage existing RDF tools.

After the presentation, I asked Mark where the project stands in relation to RDF. He explained that the BioMOBY syntax was developed independently because RDF was not widely known back when BioMOBY started. The BioMOBY syntax is also simpler than RDF/XML, making it easier for biologists to encode their systems.

However, now that RDF has become the prevailing standard, the BioMOBY system has internally migrated to RDF. To maintain the ease of use for the biology community the BioMOBY syntax is still in use, and a set of scripts are used to convert this into RDF and back.

Given that the system is now using RDF, I asked about scalability issues. They are nonexistent at the moment, as the entire dataset on the index server can be measured in megabytes. As a result, the whole system is stored in MySQL.

A future plan is to move the served data out of large "blobs" of strings and into parsed structures in RDF. Once this occurs, the system will be able to provide significantly more functionality to process the data, and at that point scalability will become a concern.

Another important consideration at this point is the ethics associated with "human" data. For instance, a person may give permission for their cells to be used for cancer research, but under no circumstances may they be used for cloning research. This will need a complex security system which can provide authority information for each individual's record. The plan is to describe this with an ontology. After having worked on the security in Kowari, I know that this is a tough problem, and I thought that the use of an ontology was a very clever idea.

Once the system gets to this level of RDF complexity, scalability will become an issue. Mark then suggested that they would move onto Jena or something like it. I expressed concern at the scalability of Jena (sorry Brian, but it's true), and suggested Sesame or Kowari. Jena has the best RDF reasoning engine, I don't deny that, but scalability is just not there.

With the security requirements that Mark will be facing, he is concerned about scalable reasoning on ontologies. Of course, this was my opportunity to explain that this is exactly what I'm researching now.

I'd like the opportunity to contribute to a project like this, whether Kowari were involved or not. Mark left me his card, so I'll get in touch with him again and see where it goes.

In amongst all of this I've been building iTQL to express the rules as far as I can. I could take the easy way out and do simple queries, and join the results in Java, but I would end up making a large number of queries and writing a lot of Java code. It takes more effort to do all the work with iTQL, but with only a little thought a lot of work can be saved further down the track, and the program will be much faster.

Once I have that I'll be wrapping it all in Java code which will execute the queries through an ItqlInterpreterBean. That is coming next week.

It always makes me feel self-conscious to think about people actually reading my blog. After all, it's mostly for me. I know that others read it (often to my benefit), but it is convenient that I can usually forget that as I write.

Janet made a comment to me about reading my blog, and the captivating effect that a blog has when it concerns people you know. That had me thinking about my use of full names here. The long time reader will note that I used to use initials for the people I worked with. I figured it would make sense for anyone who knew those people directly, and it would serve as a useful label for anyone I didn't know while preserving some anonymity.

Since leaving Tucana there has been no context for these labels, and I am regularly interacting with people from all over. At this point initials became useless.

I debated whether I should write a person's whole name here, or if I should leave people with their anonymity. I thought it might be worthwhile explaining why I've opted to use individuals' full names here.

There are a few reasons. First, providing context may well be useful for people who read this. For instance, when I was talking about ontologies and abductive reasoning, it was probably useful for people to know that I was speaking with Peter Bruza. Second, almost all of these people have staff pages, personal web pages, etc, and their presence on the web already lacks anonymity (shades of Scott McNealy here?). Finally, I realised that few others seem to care about these issues and full names are regularly used in other blogs.

If it bothers anyone, then just let me know and I'll stop using your name. I can even track down old entries and modify them accordingly.

Thursday, March 03, 2005

I want to keep it brief tonight, but as usual I have quite a bit to talk about. I'll probably gloss over a heap. :-)

I finished the RDFS rules structure today, sans rule XI. The reason for this is because it requires prefix matching on strings. I don't have a problem with this, but I want to know what the internal structure of the query will be before I go ahead and encode it in RDF. I will probably start on a resolver to do prefix matching some time next week.

I also reverted some of the OWL in which I tried to describe the structure of a transitive query in painful detail. There were a few reasons for this. The first reason was because I was having to subclass everything: the constraints, the properties, and the element types. This was completely divorcing the structure from the rest of the class system, which was the opposite of what I was trying to achieve.

The other reason is that the Java classes in Kowari don't try to manage any of this either. Instead, they assert that the structure appears correct when it comes time to process it. Since the class structure that this RDF is supposed to represent doesn't care about this aspect of the structure, it appeared unnecessary to try and encode it in OWL. The final result is much cleaner looking RDF.

My challenge now is to load it all effectively, and build a Kowari class structure out of it. That will take some time, so I'm glad that I've budgeted accordingly. The reason I was thinking I'll get to the string prefix resolver is just so I can take a break from the code. :-)

DavidW asked for the numbers RDF generator yesterday, and I dredged it back up again. It's pretty small, but I decided to GPL it anyway, for the principle of it. I also considered the structure of the RDF (which was designed for me by Simon), and realised there were problems with it.

Way back when I first wrote this code, Simon said that he thought that people might balk at the idea of declaring a node to be owl:sameAs a literal value (in fact, Simon used the old owl:sameIndividualAs predicate). Well, I checked the OWL language guide, and he's right. Instead, I've opted to use owl:hasValue, which is valid for OWL Full and OWL DL, but not OWL Lite. Applying values like this meant that I needed nodes to refer to each other instead of directly to values, so I've had to give them all names. In this case I've opted for a name based on n, where n is the same as the value of the node. That ought to blow out the string pool. :-)

The resulting output validates fine as RDF, but is not valid OWL. The only reason for this is that I use a heap of undefined classes. This is easy to fix with a few headers, but I felt it unnecessary. The reason is that this generator is supposed to create an enormous quantity of RDF, for testing purposes. OWL is not a requirement here. However, the semantics of owl:hasValue are useful, so I've used it.
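
A minimal sketch of the kind of output this describes. The node naming scheme and entity references here are illustrative only; note also that strict OWL attaches owl:hasValue to an owl:Restriction, while this output uses it loosely, as described (valid RDF, not valid OWL).

```python
# A minimal sketch of the kind of RDF/XML such a generator might emit.
# Node names and entity references are illustrative only. Strict OWL
# attaches owl:hasValue to an owl:Restriction; this output uses it
# loosely, as described in the post (valid RDF, not valid OWL).

def number_node(n):
    """Emit one node named after n, carrying n via owl:hasValue."""
    return (
        f'<rdf:Description rdf:about="#node_{n}">\n'
        f'  <owl:hasValue rdf:datatype="&xsd;int">{n}</owl:hasValue>\n'
        f'</rdf:Description>'
    )

fragment = number_node(42)
```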

I had a couple of meetings today, which worked out very well for me. It started with a chance meeting with Janet again, and she invited me over to a morning coffee break with her group. Unfortunately, I ended up doing a lot of the talking, but I was well received and have been asked back. Janet had some useful observations to make as well, so I'm starting to appreciate how valuable it can be to have regular contact with other people who are familiar with a lot of your research issues.

Then at lunch I had a talk with Peter Bruza. He was a little confused as to my request for a meeting, but was kind enough to make the time anyway. I tried to explain my questions on the lack of ontology scope in languages like OWL, and how this restricts the kind of inferences I can make. He agreed with me, and then went on to discuss deductive, inductive, and abductive reasoning, and how he feels that abductive reasoning is what we really need.

I didn't understand abductive reasoning (and I'm a little hazy on inductive reasoning too) so he took the time to explain it to me. In my own words, it's a little like the following...

Deductive reasoning is where all of the facts are taken, and logic is applied to deduce a new fact. For instance:

  1. Confucius is a man
  2. All men are mortal
Therefore - Confucius is mortal.

Abductive reasoning does not follow this chain. Instead, it makes a "guess", and then compares it against the facts for consistency. For instance:
  1. All men are humans
  2. All women are humans
  3. Confucius is a human
  4. Confucius is not a woman
At this point we can guess that Confucius is a man. We can't know it to be true, because we have not been told if there is anything that is a human that is neither a man nor a woman. After making this guess, the statement can be tested for consistency against all the known statements, and if it passes, then it is a valid theory.
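
This abductive step can be sketched as a toy model: make a guess, then test it for consistency against the known statements. The triple representation and predicate names here are invented for illustration.

```python
# A toy model of the abductive step above: make a guess, then check it
# for consistency against the known facts. The triple representation and
# predicate names are invented for illustration.

facts = {
    ("Confucius", "is_a", "human"),
    ("Confucius", "is_not_a", "woman"),
}

def consistent(guess, known):
    """A guess fails only if the known facts contain its direct negation."""
    subject, _, obj = guess
    return (subject, "is_not_a", obj) not in known

# Guess that Confucius is a man: nothing contradicts it, so it stands as
# a theory (though it is not proven -- a human need not be man or woman).
man_ok = consistent(("Confucius", "is_a", "man"), facts)
woman_ok = consistent(("Confucius", "is_a", "woman"), facts)
```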

It's kind of like the "Formulate a theory" step in the scientific process. Automating this process would be a real breakthrough, so I can see why Peter is so interested in it.

I explained that the specific problems I'm looking at already have all of the data needed to perform a deduction, rather than an abduction, so in some ways my problem should be easier than his. My issue is an inability to describe the ontology sufficiently that these deductions can be made. I described this in detail, with some examples, and Peter agreed with my analysis. He also said that the problem is an interesting one, and a good topic for a PhD.

Damn. I didn't want to hear that. How am I going to find out now? (I can't afford to take time off work to do a PhD, as a PhD income is rather paltry). I guess I'll still work on it, but I was hoping that someone had already made a good start on it.

In the meantime, Peter has lent me one of his papers. His abduction work needs a way to encode his information (otherwise a computer can't work on it), so I'm hoping to find some insights in his methods.

I also keep running into the term "non-monotonic reasoning". I have an idea that this refers to reasoning where conclusions may have to be withdrawn as new facts arrive, but I need to find a clearer definition.

I then got a message from a friend about a seminar at lunch tomorrow in which a system called BioMOBY is being presented. BioMOBY is a "biological web service interoperability initiative". Interestingly, they appear to be using Semantic Web technologies. They make the claim that:
interoperability between bioinformatics web services can be largely achieved simply by ontologically specifying the data structures being passed between the services (syntax) even without rich specification of what those data structures mean (semantics).

This, along with other statements, has me thinking that Kowari may be able to work effectively with this data, a point that I'm sure was not lost on the friend who sent me the seminar abstract. So I'll probably be losing yet another lunch time to work. :-)

There's more to say, but it's been a long day, so I'll say it another time.

Wednesday, March 02, 2005

Bottom-up Datalog
I spent some time today reading chapter four from Bob's book, Deductive Databases and their Applications. Bob recommended it to me, as it touches on some ideas in set-based evaluation.

The first half of this chapter all makes intuitive sense to me, as it describes the processes behind a top-down evaluation of a datalog query. Once it moved into bottom-up semi-naïve evaluation, it started looking familiar, most especially because it describes the fixed-point termination as the "intrinsic" delta being reduced to zero. However, it was when queue-based Prolog interpreters and breadth-first searching were introduced that I discovered that these describe the implementation techniques that I'm using for the rules engine.

Coincidentally, one of the motivations described for breadth-first searching is to avoid infinite loops, as this had occurred to me as well. However, the real reason was to avoid re-running deeper rules when higher placed rules in the tree generate new statements which result in the deep rules being needed again. Still, it is interesting seeing the same ideas come out in a text like this, as it demonstrates that my thinking has been clear on the subject.
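
The bottom-up semi-naïve evaluation described here can be sketched in a few lines, using transitive closure as the example rule: each pass joins only the newly derived facts (the delta) against the base relation, and the fixed point is reached when the delta is reduced to zero.

```python
# A minimal semi-naïve bottom-up evaluation: each pass joins only the
# newly derived facts (the delta) against the base relation, and the
# fixed-point termination occurs when the delta is reduced to zero.

def transitive_closure(edges):
    """Compute the transitive closure of a binary relation, set-at-a-time."""
    closure = set(edges)
    delta = set(edges)
    while delta:
        # Join only the new facts against the base relation.
        derived = {(a, c) for (a, b) in delta for (b2, c) in edges if b == b2}
        delta = derived - closure     # keep only genuinely new statements
        closure |= delta              # terminates when the delta is empty
    return closure

chain = transitive_closure({(1, 2), (2, 3), (3, 4)})
```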

I also performed an update on the ontology to handle the sub-constraints of a transitive constraint. This just matched the RDF I'd already written, so no updates were needed there. That let me just move on to encoding some more RDFS rules. These rules are not too hard, but it would take longer to explain what to do than to just do it. I don't need these rules encoded just yet, but I will need them shortly, so it doesn't hurt to work on them now. It's nice to work on something easy for a change, especially since the OWL rules are not as easy to encode (the current requirement that some rules have for sub-queries gives me the shivers).

I still have a little RDFS to implement, but I might have that done by tomorrow.

There was a lunchtime meeting run by Assoc. Prof. Janet Wiles for postgrads in ITEE in order to demonstrate a new Wiki the department has put up. The idea is to accumulate a body of knowledge on how best to perform research. Part of this is because every supervisor has their own way of managing their students. It is also the case that every project can be different in unexpected ways, meaning that some students may not naturally encounter some information that would have been useful to them.

I was convinced to have a go, and have put in a few elements, including a definition under the title: "What is Science?" After finishing this last one, I realised that I had quite a bit of hubris presuming to tell the rest of the department what I think science is. :-) At least I've had some subsequent feedback to say that I didn't make a fool of myself.

During the meeting the word "Blog" was also mentioned. I pursued this, as I think it would be interesting to browse the blogs of others in the school. It could provide an insight on the processes that others follow, the progress they've made, and the directions they are considering in their research. I tried to put this forward, and I think that I may have convinced Janet to look into it.

Of course, in the middle of this discussion she asked to see a demonstration of my own blog. While I know that many other people can potentially read this blog at any time, I tend to ignore that, so I felt rather naked pulling it up in front of a room full of people. Fortunately the font was small for a projector. :-)

Time to get political... I was a little frustrated at the news today, as the next round of university reforms have been proposed.

The first item is that HECS is being authorised to go up again. This will never affect me, but it worries me about the future. Many young people I know are opting away from university to take on a trade. This is not because they are not suited for university, but because a degree will never make you a lot of money, and the debt incurred will wipe out any wage advantages the degree may provide.

A trade on the other hand costs nothing to get, and even pays a small amount while it is being obtained. All the tradesmen I know at my age make more money than I do. I don't worry about that, as I like what I do, but the notion of increasing fees on university payments seems wrong in that context. Americans may see this differently, as they all have to pay. However, we have a higher tax rate, and are supposed to get services commensurate with that.

The second item is an increase in full-fee places. I personally feel that this is a thin edge of a wedge. As more and more full-fee places are allowed, the government will have the credibility to reduce funding to universities (or maintain it while the system grows, which equates to the same thing) until almost all places are full-fee places. These things don't happen overnight, but the current trend looks to go this way.

The final item is a removal of the requirement that a university do both teaching and research. Australia has neither the money nor the forward-thinking attitude required to take an investment risk in research. This applies to both government and industry. The result will be that no university will be able to concentrate on research and drop its (full-fee paying) students.

Conversely, since the money is starting to flow through from teaching, it will not be uncommon for a university to reduce, and perhaps drop, its research. I believe that this would be disastrous for a teaching institute, though it is hard to define in a 30 second sound bite. The best example I can think of is the level of knowledge and experience that lecturers gain by being a part of cutting edge research. As a student, there is something inspiring in realising that you are being taught theories that are still in the process of being developed, and that the body of knowledge you are learning is an evolving thing. Reducing lecturers to a purely teaching role reduces the quality of the degree. You might as well learn it all from books.

Of course, the other problem with dropping research from various institutions is that less research will get done. I don't believe that I need to explain what I see wrong with that picture.

OK, that is enough of a rant. Perhaps not all of these recommendations will be implemented (though they generally are). I can always hold out some hope that things will not go this way.

Tuesday, March 01, 2005

I wanted to mention one other future plan.

Now that I'm working on the rules engine in earnest, I've been looking at languages like SWRL as well. While it won't suit the most efficient implementation of OWL for Kowari, I believe that it is still an important system. If anyone wants to build an inferencing system that isn't controlled by something like OWL (eg. they want to infer an uncle relationship) then they will probably be using a rules system, as this is the state-of-the-art in today's technology. With SWRL becoming a more popular standard, I can see a real need to implement this system in Kowari.

I'm expecting that I can get onto a SWRL implementation once I've finished OWL inferencing. Hopefully I'll be able to reuse some of the Kowari rules code for SWRL, so it will make sense to do the work straight away while it is all still fresh.

I love Open Source work. You can work on anything you feel like. :-) (assuming you have enough spare time - if I get paid to work on Kowari then I may end up getting dragged into something else, like ontology updates).

Rules OWL
The majority of today was spent encoding the RDFS rules into RDF, and modifying the OWL that describes them as I came across problems. Several times I found myself questioning some of the decisions I've made about this structure, but in the general case I stayed with what I was doing after thinking about it for a while.

The first thing I considered was how I am describing variables. Some example RDF for this is:
  <Variable>
    <name>a</name>
  </Variable>
If I want to reference this variable elsewhere (such as in the subject clause) then I need to give it a name:
  <Variable rdf:about="#var_a">
While I don't need to reference every variable from elsewhere, it finally occurred to me that I was being inconsistent, and I should just name all of them. Once I made that decision, I started wondering if I should bother with the "name" predicate, since I could just use the URI. However, I decided to keep the name, as the URI isn't a suitable name for a variable ("http://..." etc). I thought about having code which stripped the fragment off the end of the URI, but that seems more like a hack, and it's subverting the RDF (which is supposed to provide all the information I could want about a node). So the name stays.

The second thing I wondered about was the URIReference class. This is a class which refers directly to a URI, such as <rdf:type>. I use these in constraints, and also in the select clause, to indicate a specific URI. An example is:
  <URIReference rdf:about="#ref_type">
    <refersTo rdf:resource="&rdf;type"/>
  </URIReference>
The obvious question is whether I should use these nodes to point to the node with the required URI, or if I should just include that node. Later in the day I showed this to Simon, and it was the first question that he asked me. I found that amusing, as I'd spent ages wondering about it.

I think I've reached a conclusion here. I really want to be able to use an iTQL query to find everything I need in one fell swoop, and using a URIReference node will let me do that. By using this node I can ask for the rdf:type of all of the relevant nodes, and I should either get a "Value" or a "URIReference". If I use the nodes directly (without going through a URIReference) then I might or might not get a type (depending on the node declaration). I would then know which nodes were Values, but finding the other nodes would be annoying, as I'd have to consider the typed nodes which are not "Value", plus I'd have to track down those nodes which were not returned because they didn't have a type.

From a semantic perspective, I'm more comfortable using this indirection as well. When I select this node in the rule I really want the URI, I don't want what that URI represents. I was initially tempted to provide the URI as a literal, but decided that referring to the real node made sense, as it provides some meaning that way.

So the indirection stays.

Finally, I came across rule 5a from RDFS. This is the one which handles transitive closure. In this case we really want to use the TransitiveConstraint object, so it is necessary to describe this object in the RDF as well. Unfortunately, these constraints wrap two other constraints which look like simple constraints, but have to follow specific rules. ie. The first constraint must have a fixed predicate and a variable on either side. The second constraint must have the same predicate, but be fixed at one end, with a variable at the other. This gets tricky to enforce in OWL, but I think I can do it (not that I need to, since I can enforce it in code, but if I can't do a simple thing like this, then I've got bigger problems). I've written the RDF for rule 5a, and I'm halfway through editing the ontology for transitive constraints. I should finish that first thing in the morning.
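
The structural rule for the two sub-constraints can be sketched as a check like the following. This is a toy representation only (triples as tuples, not Kowari's actual classes), with variables marked by a leading "$" as in iTQL.

```python
# A toy check of the structural rule above, with constraints represented
# as (subject, predicate, object) triples and variables marked with a
# leading "$" as in iTQL. This is not Kowari's actual class structure.

def is_var(term):
    return term.startswith("$")

def valid_transitive(first, second):
    """First: fixed predicate, variables on either side.
    Second: same predicate, fixed at exactly one end."""
    s1, p1, o1 = first
    s2, p2, o2 = second
    if is_var(p1) or not (is_var(s1) and is_var(o1)):
        return False
    if p2 != p1:
        return False
    return is_var(s2) != is_var(o2)   # one end fixed, the other a variable

ok = valid_transitive(("$a", "rdfs:subPropertyOf", "$b"),
                      ("$a", "rdfs:subPropertyOf", "urn:base"))
```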

My plan is to get the RDFS rules and complete ontology written and validated by Thursday. I then have two pieces of code to write. The first piece will be the library that parses this RDF into a Query object, and does it efficiently. I don't anticipate that to be hard, but it will take some time.

The second piece of code is a resolver for handling prefix matching. I gave this some thought before I realised that the only thing I want to use prefix matching for is picking up all of the rdf:_n predicates. This is useful for anyone using a sequence (as I am), and is essential for rule XI in RDFS (which I am also doing). Other people may have a need for prefix matching, but given that this is necessary for correct RDF operation, I feel justified giving it a resolver of its own.

The resolver itself should be quite easy. It will essentially be a copy and paste of the Node Type resolver, only the selections from the string pool will not be all nodes of a particular type, but rather all nodes which start with "rdf:_". If it is so similar, then maybe I should investigate the use of generics to share the code between both resolvers? (Only kidding Andrew).
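
The prefix selection itself can be sketched like this, assuming a sorted string pool: every string sharing a prefix sits in one contiguous range, so two binary searches find it. The function names are illustrative, not Kowari's actual API.

```python
# A sketch of prefix selection over a sorted string pool: all strings
# sharing a prefix form one contiguous range, found by binary search.
# Function names are illustrative, not Kowari's actual API.
import bisect

def match_prefix(pool, prefix):
    """Return every string in the sorted pool starting with prefix."""
    lo = bisect.bisect_left(pool, prefix)
    # "\uffff" sorts after any expected continuation of the prefix.
    hi = bisect.bisect_right(pool, prefix + "\uffff")
    return pool[lo:hi]

pool = sorted(["rdf:_1", "rdf:_2", "rdf:_10", "rdf:type", "rdfs:label"])
sequence_predicates = match_prefix(pool, "rdf:_")
```

Note that the matches come back in lexical order ("rdf:_10" before "rdf:_2"), since the pool is sorted as strings, not numbers.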

Rules Paper
I also took a coffee break and spent it on Guido's paper. I like what it can offer, and it may even have some usefulness for Kowari one day. However, it doesn't have anything to offer the OWL implementation, so I skimmed the last few pages. The list of references may be a different story though, so I'll be checking some of them out.

While reading in the coffee room I introduced myself to Peter Robinson. He seemed quite nice, and later on I thought I'd check out his staff page to see what kind of work he does. It turns out that he's interested in logic programming, and theorem proving. I like coincidences like that. I'll have to have another chat with him.

In the late afternoon I went upstairs for a short meeting with Bob. This was mostly to give him an update on where I am and what I'm doing.

Bob seemed pretty happy about it all, but he pointed out that I'm now well beyond where I need to be in order to do my confirmation. I was already aware of that, but it was still good to be prodded a little. I've been so busy lately that it's easy to procrastinate on that item. My main problem is that I really need to do this writing at night (as I'm doing the implementation work during the day), and blogging plus reading means that I'm not really getting to it. I have to prioritize though, so I've asked Bob to check in on me by the end of the week to see that I've made a start on the write up.

I explained where I'm going with the set-at-a-time resolution, and he agreed that it sounded good. I mentioned the paper which proved that set-at-a-time is provably as efficient as tuple-at-a-time and he was surprised, as he thought it should be much more efficient. I agree with him there, so I'll probably be looking for more papers like that. I still believe that Kowari will be the only set-at-a-time OWL inferencer, so that will certainly make my project novel (an essential requirement).

Logic vs. Rules
I also asked about the difference between a Logic engine and a Rules engine. Perhaps I misunderstood Guido (Andrew can probably enlighten me here) but I thought that Guido considered the two systems to be distinct. Bob does not, and was wondering what Guido might have been getting at. As he rightly pointed out, a logic engine is usually implemented with a rules engine.

Bob asked what I believed the difference to be, and I'm afraid that I wasn't particularly articulate at the time. However, from what I can infer, a logic engine seems to perform algebraic manipulation to reduce the equations to their simplest form for efficiency, ideally eliminating elements of the equation which can be cancelled, and hopefully demonstrate the entire equation to be consistent or inconsistent. Once reduced in this way, the final result can be applied to the data to find all interpretations which make the equation consistent.

Conversely, a rules engine simply tests each statement against each rule. For those statements which match, the head of the rule is executed (this is typically an assertion of new data). This is highly flexible, and can be used to perform all First Order Logic (FOL) operations, as well as many operations which do not conform to FOL. Often the last stage in the "logic engine" process described above is implemented in this way.
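
A toy version of such a rules engine, using the "uncle" example mentioned earlier: match each rule's body against the statements, assert the head where it matches, and repeat until nothing new is produced. All names here are invented for illustration.

```python
# A toy forward-chaining rules engine in the style described above: each
# rule's body is matched against the facts, the head asserts new data
# where it matches, and the loop repeats until nothing new is produced.

def run_rules(facts, rules):
    """rules is a list of (body, head): body yields variable bindings
    from the fact set, and head turns a binding into a new fact."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            for binding in body(facts):
                fact = head(binding)
                if fact not in facts:
                    facts.add(fact)
                    changed = True
    return facts

# Rule: parent(x, y) and brother(z, x) implies uncle(z, y) -- the kind
# of relationship an ontology language alone cannot express.
def uncle_body(facts):
    return [(z, y)
            for (p, x, y) in facts if p == "parent"
            for (q, z, x2) in facts if q == "brother" and x2 == x]

def uncle_head(binding):
    z, y = binding
    return ("uncle", z, y)

base = {("parent", "Ann", "Tom"), ("brother", "Bob", "Ann")}
derived = run_rules(base, [(uncle_body, uncle_head)])
```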

I should check out some Prolog implementations, as well as systems like FaCT++ and Vampire to see if they conform to this view of logic engines.

Bob also pointed out that he has a chapter in one of his books on bottom-up evaluation for Datalog systems. This basically means that these systems are implemented as sets of internal rules. He's lent me the book tonight, so I'll be reading that chapter when I finish here.

I also asked Bob about some future ideas. I was careful to explain that I didn't plan on getting ahead of myself, but I'm really interested in where I'll be going after I've done my present course.

Rules engines are all very good, but they can only let you do things that you know about when you set up the rules. This limits their general applicability.

Ontology languages (such as OWL) offer a lot more scope. While they are implemented with rules, they let you describe a more flexible structure, and the underlying rules will then work within that. However, it is still difficult to get ontologies from different areas to work together, particularly for cross-inferencing between domains. At least OWL offers some form of portability of information, but even when nodes from different ontologies can be declared to be owl:sameAs each other, there is still very little beyond type information that can be inferred from this.

In order to do more complex inferencing between domains (without the prior knowledge needed to seed a rules engine), a method of describing ontologies and how they interact must be developed. This doesn't really exist yet.

I expressed this to Bob, and he agreed that this is definitely the next stage of work in the field. He thinks that systems like OWL will provide the building blocks for the next layer up, where this higher layer will allow for descriptions like what I'm discussing. In his opinion, in 5 to 10 years no one will be doing OWL: they will be working on a layer above it.

I'll be continuing to think on this over the next few months. I know that Peter Bruza at the DSTC is working in this area, so I've made an appointment to speak with him on Thursday of this week. I don't know what to say, except to pose the question described above (how do I infer across knowledge domains?) but I'd love to hear what he may have to tell me.

Anyway, that's enough speculating for tonight. I have some reading on Datalog to get to.