Working notes: 10/01/2004

Friday, October 29, 2004

New Build System
ML has spent the last 2 weeks re-writing the Ant build script for Kowari. It had been getting messy and difficult to modify, but it also occasionally missed some dependencies, meaning that we had to do clean builds most of the time. A full clean build took about 5 minutes on my system, so this was very wasteful.

ML's work was checked in on Thursday, so today I pulled in the new scripts, and started work on the NodeType resolver scripts to make sure everything was integrated. Unfortunately this took quite some time, and it wasn't until after lunch that I could compile again. Then ML told me that he needed a new method on ResolverSession, so I had to track down all of the implementing classes again in order to add the method.

I was starting to run my first tests by the end of the day, but hit a couple of initial hurdles. Initially the system would not recognise my model type. I had accidentally changed the type's URI, but this was not the issue. I realised that Kowari had not picked up my resolver factory, and so I spent a little time trying to remember how the factory was supposed to get registered. SR reminded me that this is done in the conf/kowari-config.xml file.

The next problem was that the newResolver() method was being called with a true value in the canWrite parameter. I was under the impression that this meant that the object was not going to be treated as read-only, so I was throwing an exception. SR explained that this was not the case, and that it was more like a hint for the system. If I really wanted to be read-only then I should just throw exceptions from any modifying methods. I'm already doing this, so I'm covered. In the meantime, I just have to remove the check on the canWrite parameter in order to proceed.

Thursday, October 28, 2004

Resolving
The resolve method came along reasonably well today. It needed quite a bit of pre-initialised information that I hadn't provided, so I spent some time getting all of that together. This included things like preallocated local nodes for the URIs of rdf:type, rdfs:Literal, and the URI objects and local nodes for the Kowari Node Type model URI, and the URI representing a URI Reference.

With this info in place I could get going on the resolve implementation, which is based mainly on the String Pool. Unfortunately, there turned out to be no easy option to get the String Pool, so this resulted in a half hour discussion about how it should be done.

The String Pools are stored internally in the ResolverSession object which is given to every Resolver instance. Ideally, we don't want anyone implementing a resolver to have unfettered access to this data, as it really is internal. It is only internal resolvers, such as the node type resolver, which need to see this information.

It is on occasions like this that I see the point of the "friend" modifier in C++ (a modifier I try to avoid... unless it is really called for). The closest analogy to this in Java is "package scope" which applies to anything with protected or default visibility. Unfortunately, this only applies to anything in the same package (not even a sub package) so it is of limited use.

Rant on Inheritance
Java does annoy me a little on inheritance. Both default and "protected" modifiers permit classes in the same package to see the data. However, there is no way to only allow inheritance, while hiding the same information from the rest of the package.

It would also be nice to permit subpackages to have the same access rights as package peers. That way, "internal" packages could see data that external packages can't, without having to pollute the parent package that everyone needs to see.

Back to String Pool Access
I suggested a class in the Resolver package which could see the internal data from the ResolverSession as this would let me get to the String Pool data that I need, without changing the ResolverSession interface. SR was against this, as he didn't see it providing much benefit in data hiding.

I finally ended up adding some of the String Pool methods to the ResolverSession interface. At least these methods were all about finding data in the string pools, so they were read-only operations. So even though external packages can now see the methods, they can't cause any damage.

In the course of the discussion, it turned out that ML also needs some access to the string pool as well. So instead of adding just a single method to the ResolverSession interface, I had to add two. The result was a modification of 6 classes which all implement this interface. Fortunately, only 4 of them were test classes which did not need to support all of the methods, so I just threw an UnsupportedOperationException.

Since each ResolverSession has two string pools, the methods I implemented actually performed the operation twice (on both the persistent and the temporary string pools), and then used tuples operations to concatenate the results. The NodeType resolver then calls the new findStringPoolType method twice (once for typed literals and again for untyped literals) and concatenates those results. So there's a bit of concatenation going on.

Finally, the results needed to be sent back as a Resolution object. Resolutions are just a Tuples interface along with 2 methods, called getConstraint and isComplete. The last time I implemented this class I was able to extend a TuplesImpl, but this time I don't necessarily know what type of tuples the data will be based on, so I had to wrap a Tuples object instead.

Wrapping an object with 23 public methods is a pain.

By the end of the day I had it all coded and compiling, but I hadn't run anything yet.

Wednesday, October 27, 2004

Node Type Resolver
Spent the day bringing over the URL Resolver into the Node Type package, and implementing all the fiddley bits. This means that I still have the main resolve() method to implement, but hopefully that will be very similar to the previous implementation.

Many of the other methods were trivial to implement, as the model is read-only.

The only issue I had with some of the methods was for model creation and deletion. I asked SR about these, and he told me about a class called InternalResolver which has several protected methods specifically for these kinds of operations. I had to extend NodeTypeResolver from this class, but this only required a few small changes. I also needed to promote an internal method called findModelType from private to protected so I could use it. It really is a utility method, so it was a valid change to make. I think it was only kept internal to the class as SR didn't need it for the two resolvers that used it.

Tuesday, October 26, 2004

Node Types
Bringing node types into the new resolver system had me stumped for most of the morning, but I eventually started to get a picture of it.

For a start, I will be implementing it as a new resolver, and registering it as "internal". It will then get found based on model type rather than protocol. As before, it will build its data from the string pool, only now it will be using more than one string pool, so it will be appending data.

The trick is to make sure that the node type resolver uses the same string pool, on the same session, as all the other resolvers. I was concerned abotu how to get this, but SR was able to reassure me that I can get it easily.

The other important requirement is that constraints to be resolved against a node type model will occur last. This is so all other resolvers will have already populated the string pools with data returned from their query. This is a little harder to guarantee.

At the moment, the order of resolution is based on the approximate size of the data that the constraint expects to return. One way to be executed last would be to return a size of Long.MAX_VALUE. Unfortunately, several other resolvers try to bias themselves to go last by doing this. In this case the resolver absolutely must go last, so it can't necessarily rely on this trick.

In the interim, SR has suggested that I try returning Long.MAX_VALUE as a size. If another resolver tries to get in the way then we can deal with it then. Since most resolvers play well, this should not be a real problem, at least not in the general case.

Armed with this design I've started coding up the new resolver. It will probably take me a day or two.

Monday, October 25, 2004

Remote Frustrations
I started early and caught up on the weekend's email, etc. Once I got going I started looking at the Remote Resolver issues once more, but these were really in AM's area. So when AM arrived I dumped it all on him, including my test data, and went off to look at the Node Type issues.

After a while, AM got back to me to say that he was getting a different kind of exception while running my queries. The exception was related to being unable to find a factory for a local statement store. This was unexpected, and he went looking for it. It turned out that I had used a method on SessionFactoryFinder called newSessionFactory which makes the assumption that the session is local to the current JVM. The problem with this is that when a connection fails, it tries to fall back to using a local session, which is incorrect behaviour when speaking to a remote session. It was just a problem of not understanding the interface. The solution is to use a different version of the same method which accepts a boolean to indicate if the session is local or remote.

The only reason this problem showed up was because AM hadn't completely adjusted the queries I'd sent him, and one of the model references still took him back to my computer. Since I wasn't running a Kowari server his connection failed, and tried to fall back to a local resolver. So he was supposed to be getting an exception... but it was supposed to be a different exception.

Node Types
I went searching for the Node Type code (completed only a few weeks ago) so I could update it for multiple resolvers. However, I could not find it no matter where I looked (and I couldn't remember the path). After failing some "grep" attempts I realised that the code was now gone, so I asked where it might be.

It turned out that it was not brought forward into the new code base. This is a problem because it means that I'll have to work out a way of hooking it in that is similar to the last method. Unfortunately, that will increase the amount of work to do it, and I'm not sure I've allocated enough time.

In the meantime I had to find the old code, so I asked AN. He told me a branch name to use in CVS and left me to it. Now I don't know much about CVS, so I did a "man cvs" and searched for the word "branch". I found the command "admin -b" which changes the default branch, and I thought that might be what I want.

For those of you who know anything about cvs, you'll know that this was not the command that I wanted, but I had no idea of this.

Of course, I ended up changing the default branch for everyone. It only took me a minute to realise this, and I went immediately to DJ for help. The command to fix it should have been the same one, with no branch label, but for some reason this did not work for either DJ or myself. In the end he simply restored a backup of the cvs respository, as the backup was only a couple of hours old, and no one had committed since then. So there was no harm done, but it certainly wasted some time.

At the moment I'm trying to find out how and where to insert this node typing code. It used to be in a module known as a "resolver", but since that has a new meaning in the current system I may need to find a different name.

Friday, October 22, 2004

Remote Tests
I discovered early on in the day that part of my problem came from the tests trying to use the local server as a remote server as well. This was fixed in two ways. First, I got Thursday's problem properly sorted out so that anything happening on the local server would not get sent off to the remote resolver. This meant that it was not possible to make the remote resolver work without running a second server. After stuffing around with the Kowari documentation for a little while, I was able to make a second server run, using the command:

java -Dshutdownhook.port=6790 -ea -jar dist/kowari-1.1.0.jar --servername server2 --rmiport 1299 --port 8081

The second part to making it work was trying to get the tests to handle a second server. I'm not sure how I was doing there, as it was going to involve some tricky Ant scripting. In the end I realised that I needed to make it work by hand first.

I built some simple test data which defined some nodes with literal "names" in one model, and some relationships between those names in the other model. A query which tried to get all nodes names for nodes meeting a particular relationship requirement failed, so I went to a simpler query.

The simpler query had a "where clause" pointing at one server, and a single constraint with an "in clause" pointing at the second server. This always returned no data.

Lots of logging code later, and I found that the correct data was being found, but there was a problem with it at the client end. It had the correct row count, but always returned no data when an iterator was used.

Finally I found that I had closed the session too early. This was because there was no other place to close it. After consulting with AM I was told that the transaction code does have to be implemented, and that SR was wrong in thinking that it doesn't.

For the moment I have code that works, but leaves a session open.

Then I tried the more complex query, and a new error showed up. This involved a problem with a "prefix" used in the beforeFirst method used in the join code on the client. This is not code that I know much about, so at the end of the day I had to leave it with AM.

Thursday, October 21, 2004

Late
OK, it's days late, but I've had other things to contend with for the last few days. So here's an overview of Thursday...

I thought I'd finish testing the remote resolver, but testing was very difficult. Whenever I attempted a remote query, it would deadlock. This turned out to be due to a bug in DatabaseSession.findModelResolverFactory(). This method looked in the system model for the model type of a URI given to it. If it was found, then it knew it was local, and it used a local resolver facotry. If it wasn't found then it checked the protocol for the type of resolver to look for.

The problem was that the code assumed that if the protocol was "rmi" then it should have been local, so it threw an exception. The fix was to compare the non-fragment part of the URI to the current server name, and then throw an exception if it matched (since a match meant that the model should have already been found in the system model). If there was no match, then the model appears on another server, and it returns a remote resolver factory.

After finding and fixing this, I then ran into deadlocking problems. This was because the local system was obtaining a lock, and then asking for data from a remote resolver, which was actually on the same machine and required the same lock. That was where the day finished, and I took the bug up the next day.

Tuesday, October 19, 2004

Test Port
I started the day by porting AM's unit test code for View resolvers over to the remote resolver. This was time consuming, but reasonably easy. I started by changing all the obvious items to the new resolver setup (for instance, I did a search and replace of RemoteResolver for ViewResolver), and then moved on to a method-by-method comparison of the old code with the View resolver test code. Strangely, the ViewResolverUnitTest tests had all been commented out. AM wasn't sure why he'd done that, but there were no problems putting them back in.

I was pleased to discover that many of my bug fixes and specific code to handle general cases were already in the template. This meant that I could delete some of what I'd written in the past. It bodes well for when someone gets the time to refactor this, and use inheritance to handle the common parts. We all wish we could do it now, but with everyone continuing to modify their own resolvers, it's difficult to find stable common ground between their tests.

Once the tests were happily compiling and running, the results were exactly the same. Wouldn't you know it? :-) At least I had a common base with which I could question AM.

The specific problem which was reported was:

  org.kowari.resolver.spi.LocalizeException: Unable to localize rmi://myserver.com/server1#dc - Couldn't localize node

This told me that the query engine was looking for a string in its string pool, and it wasn't available. Initially I thought that perhaps I had globalized nodes which had come back from the remote resolver, and the action of localizing them was failing. However, AM pointed out that it was actually happening in the Resolver.setModel method, which has no data returned from the resolver.

Resolver.setModel is called by DatabaseSession.setModel. Looking at this method showed that the problem was occuring when the local node number of the destination model was being looked up. The method being used here was:

systemResolver.lookupPersistent(new URIReferenceImpl(destinationModelURI));

This method looks up a node in the string pool, returning the node number if found, and throwing an exception if it isn't. Given that the destination model is on a different server, then the local systemResolver object will not know about that model URI, so the local node number for the model can't possibly be found.

The solution is to do exactly what the source model code was doing:

systemResolver.localize(new URIReferenceImpl(sourceModelURI));

The main difference here is that if a node is not found then it gets created.

At this point the power went off due to a storm that we had. With my test code all on a PC I couldn't test if anything worked. So I started looking at some more of the DatabaseSession code to see if this mistake had been made elsewhere as well. I found it in one other place, this time in the private doModify method.

While getting some advice from AM about the bugs I was seeing, I showed him the RemoteResolver.modifyModel method. While everything was fine, he pointed out that the set of statements could be a streaming object, and if I tried to send everything to the server in one go then I could run out of memory. I've since wrapped this in a loop to send pages of statements at a time. One more thing I should do is wrap this in a transaction, so it becomes an atomic operation on the server.

Spelling
For the last few days the spellcheck button on Firefox has been doing nothing (beyond briefly flashing a window up). I wonder why the behaviour has suddenly changed?

Resolver Tests
The test code class path problems were far reaching, and tedious. It seemed that this one test required every single resolver in the system to be available to it, which in turn required all the supporting libraries for each resolver type. I was trying to create a minimal set of jars to include, but with each addition I discovered yet another jar to throw into the mix. After spending way too much time on this, I added every remaining resolver jar that I hadn't yet included, and this finally allowed the tests to run.

The first problem with the running tests was an assertion when trying to resolve a model. Adding a string to the assertion showed that this was a "file://" model, which I don't handle. It turned out that 6 months ago I followed a comment a little too literally, and I explicitly registered this resolver as being able to handle file models.

Once the file registration was removed, a group of methods complained that there were no resolvers registered which could resolve a file model. Each test was taking a single element array of the resolver classes to register, and this only included the remote resolver. The simple fix was to declare a static array with two elements: the remote resolver, and the URL resolver (which can handle files). This still left 3 tests unable to handle file models, as it turned out that these methods used a different test mechanism and registered their own resolvers.

These fixes then got me to 2 passing tests, and 34 failures.

The first failure was due to an expected element not being found in the memory-based string pool that is used for temporary data. I don't believe that this reflects any problems with the remote resolver, but I'm not sure where the problem is coming from. I asked AM (who built the original framework) and he wasn't sure either.

AM pointed out that he has since refactored the whole test framework for resolvers. He suggested that I take the new test framework and apply my changes to that. I'm a little reluctant, as I had to make a lot of changes for the remote resolver. However, the problems appear complex, and I'll need to update it in order to have a consistent base in AM, simply so he can help debug it, if nothing else.

Monday, October 18, 2004

Resolvers and Tests
Same old, same old. Part of the day was spent finishing up with the build bugs for the remote resolver. Once this was all working fine I moved onto the unit tests. Like the resolver itself, I had implemented this already for the old interface, so it was more a matter of patching it up to work with the new interfaces.

While I went through the test code again I was reminded that it uses an in-memory model. The behaviour of these objects may have changed since this code was written, so any problems may stem from this area. I suppose I'll find out when I get there.

I also spoke to SR about the transaction interface. He hasn't determined exactly how to handle certain situations for distributed transactions, so for the moment there is no need to implement this code. He was pleased that I'm using the logging DummyXAResource class, as this will help him debug as he starts to implement this on the server.

My last step of the day was to try running the tests. The test code is particularly complex, so I didn't really expect it to work straight away, and it hasn't. For the moment the problem is a classpath issue in some reflection code. I'll probably spend tomorrow morning tracking this down.

Friday, October 15, 2004

Quiet
It was a quiet day with TJ away. I noticed AN and KA dealing with some minor crisis that needed TJ's involvement on his day off, but fortunately I was far removed from that.

There were quite a few minor changes to the RemoteResolver interface which needed attention before anything would compile properly, and I spent a lot of the day dealing with these. Many of the changes were straightforward, though time consuming (e.g. some data which was provided in local node ID form is now being provided in global URI form, which means that all references to it have to change). Changes like this take up a lot of the day, but don't leave much to report on.

The only thing worth mentioning is that the RemoteResolverFactory.newResolver method now takes an extra parameter called systemResolver. While it might be handy to have this object in some circumstances, I can't think of how it would be needed for the remote resolver. So after looking at the related code for some time I decided that I wouldn't need the system resolver, and have ignored it. On the other hand, I'm now concerned that I'm missing something important, so I'll need to have a word with SR about it on Monday.

The other thing about this method is that the signature changed from:

  public Resolver newResolver(ResolverSession resolverSession, boolean canWrite);

To the following similar form:

  public Resolver newResolver(boolean canWrite, ResolverSession resolverSession, Resolver systemResolver);

I've already mentioned the systemResolver parameter, but I'm a little lost as to why canWrite was moved, though it hardly matters. However, one thing that does matter is that the javadoc for this method in the interface was not updated... something I should really chastise AM about.

Thursday, October 14, 2004

Paths
Quiet day today. I finished going over the modifications to the remote resolver code, and got it to the point of trying to build.

I spoke to SR about the XAResource and where it will fit in. Not having used them much I thought that this was an object that would get used for managing the transaction, but SR is using it as a callback interface for each resolver. He hasn't implemented some of the finer details of distributed transactions, so for the moment he is happy for me to use the DummyXAResource class, as this just logs everything called on it.

I finished the day with failures in compilation. The compiler is complaining that it can't resolve classes that I have explicitly imported. This means that the compiler classpath must be at fault. This is annoying, as it can take a long time to find with jar is needed to get it right. Normally it would not be an issue, but Kowari is built as a single jar which in turn contains a large collection of jar files. Finding class files in jars, within jars can be very time consuming.

Jars within jars are not a common site, but it has certainly been handy, particularly when including third-party libraries. A few years ago TA wrote the code for this. From my understanding, it unzips the jar file into the temporary directory, and then uses a class loader to pick up all the new jar files. It's a bit tricky internally, but from the user's perspective it's ideal. This is why Kowari does not need a class path, and can just be launched with:
java -jar kowari-1.0.x.jar

Wednesday, October 13, 2004

Time and Resolvers
I'm trying to get to bed early, so I'll be brief.

I spent about an hour finishing off my timesheets today. It's amazing how slow Safari gets when it has to render as many widgets as the timesheet program creates. As a result, it is frustratingly slow to enter this data. So while waiting I killed a little time reading Slashdot.

The rest of the day was spent converting my old remote resolver code to the new SPI. Fortunately there appears to be very little which has changed.

My only concern is the method which requests a new transaction session. I'm not really sure of what to do with it, as no other methods use the XAResource object that is returned. For the moment I'll be returning a DummyXAResource (which was kindly supplied for me by SR) and using my own session objects. This will certainly work for tests, but then I will need to consider how the use these objects would impact on a resolver session.

Tomorrow I'll have to ask why we have a method that returns an XAResource object, but no methods which accept one.

Java 1.5
I took a little time to explain to AM some of the changes to the Java language in the new release. I didn't go into some of the non-language features (such as instrumentation) as I've yet to fully explore them.

Generics are nice (especially as I like the type safety of templates in C++), though he claims that some people have been complaining about some deficiencies in their implementation. Personally, I see the advantages of the new system outweighing any corresponding problems.

Autoboxing, for-each loops, and var-args are all syntactic conveniences, so they are not really anything to get excited over. They will certainly be handy though. Similarly, static imports will be useful, but again they just help tidy up the code.

The two standout syntactic improvements are annotations and enumerations.

The annotation system is very nice. The Sun example which annotates methods which are to be tested really demonstrates the power of this. Once this takes hold, JUnit testing will become much more powerful, and can eliminate all sorts of string comparisons used in reflection. I can see this extending into the definitions of MBeans for JMX, and other sorts of Java beans as well.

But my favourite addition is the new enumeration type. This allows the removal of the need to use integer identifiers, provides type safety, provides automatic toString methods for each element, and allows access to the entire range (and even subranges) of elements in the enumeration. It also allows for a user-defined constructor, and class specific methods to overload the behaviour of the type. It even permits the creation of and abstract method which each element implements to create custom behaviour. These are some very nice features, and will provide a significant saving in code when building up powerful enumerations types.

Unfortunately I won't be using much of this for some time. For a start, until Java 1.5 is also available on OSX then it can't be considered truly cross platform, and I won't be able to run it on my PowerBook anyway. Secondly, the wait until it is available on the Mac will be useful, as this should allow a significant enough period to have ironed out the bugs present in any x.0 release.

I didn't even have to mention these reasons to AM. I just said that there were two reasons I wouldn't be using it yet, and AM just named them without any input from me. Looks like I won't be the only one waiting then. I'm looking forward to it though.

Tuesday, October 12, 2004

Estimates
We had a discussion today about how long we expect each element of the changeover to take. Part of this process was discussing what each of us had learnt about the code in question, so we could determine the scope of the work. This let us describe what features each item may or may not have in the new system.

In order for trans and walk to work over multiple resolver backends, it will be necessary to perform these functions on each backend individually, and then perform the operation again on the full collection of results. In many cases the final operation would not find any similar nodes, and the result would simply be a union of each of the partial results, though if models from different sources referred to the same resource then this mechanism would find that as well.

The problem with performing a trans or walk over all collated results is that the partial results are in global space, making the operations expensive.

Given the difficulty and of the collation, and the rarity of common nodes across data sources which have transitive relationships, TJ decided that Kowari can do without it for the time being. At this stage we just need it running with all current features on the new codebase, so I agree with him here. Full transitivity can be performed later, as it is needed.

In the meantime I will be working on the distributed Node Type model, and on distributed queries. I've already started pulling the old distributed code into the new interface, and I should be working on that for the next few days.

Timesheets
A lot of today was spent updating my timesheets. I keep a time log in a text file, so it should be trivial to put these into the time keeping system. Unfortunately, the web pages used to input data have a lot of input fields. This makes each page painfully slow to render, even on a fast machine. Even scrolling can be slow. The result is a very slow process to enter trivial data.

I'd be better off just putting in each day's data as I finished the day, but when I'm about to leave I'm not normally interested in spending 5 extra minutes at work, especially when it is data that I should be able to enter in a matter of moments.

That said, I'll see if I can start putting the data in first thing in the morning. I've been getting in early for the last few days, so if I can keep it up I should be able to find the time.

MPhil
I saw Bob again today and reported on what I've been doing. Not a lot in terms of reading, which is something I should try to remedy next time.

He was interested that I've been looking more carefully at the documentation for OWL. This is something I should have already done, but who likes reading specifications? Bob agreed with me here. :-)

At least I ended up with a better knowledge of OWL. I've looked at some of my older posts, and have cringed at the naïvité of some of my statements.

While discussing some of the interesting restrictions I've found in the species of OWL, Bob suggested that I try and document them all, and pay particular attention to the ramifications of each difference. For instance, I can't see that there is a real decidability problem with cardinality, even though OWL DL restricts this to 0 or 1. The only reason for this restriction appears to be because description logic does not support counting, even though this is an easy operation in most systems.

Anyway, I'll be looking at these differences for the next few days. Once I have something done, I'll look to see who else has done this, and compare what we have found.

iTQL
Bob was also interested that iTQL has no RDF support in it. AN and I have been considering putting support in for some items such as collections, but to date we have been able to avoid it.

Now that I want to provide a mechanism to find descriptions in OWL, I'm starting to wonder where it fits. This should really become an OWL API, which is layered over the RDF API, but should this be reflected in iTQL? This would seem to break the separation of layers, but iTQL is currently the only user interface available. There is also the argument that we are building an ontology server, so it makes sense to have ontology support in the user interface.

For the moment we have only provided a set of complex, low level operations which will perform ontology operations when all applied together, but this does not yet make an ontology server. I suppose this makes me wonder if iTQL should start to support RDF and OWL, or if should be used as the building blocks for a higher level language which provides this functionality. However, this would mean writing a new language... and I really don't want to do that.

Monday, October 11, 2004

Estimates
Today started by trying to work out time estimates for the new "Resolver" work.

I still need to speak to TJ to see if we want trans to be able to operate over models from different sources. If this is to be permitted then it would require tuples to be translated to and from global space, forcing a performance hit. On the other hand, perhaps it would be possible to keep the current code for trans, to be executed only when models in the same data store are being queried.

The other area I looked at is how the special Node-Type model will work in the new system. The current model reflects the contents of a single string pool on a single server, a concept which does not translate well into the global system.

After much discussion with AN and SR, I've come to the conclusion that the query code which accesses the resolver interface will need to create the virtual model. Every node which comes in via the resolver interfaces is either available to the local string pools (including a temporary string pool which belongs to a query), or is returned in global format, and is then stored to the string pools during localisation. These string pools then contain all of the information needed for the types of all the nodes to be returned from the query. They will be missing much of the data pertaining to nodes which were not returned from the current query, but that is a benefit for the efficiency of the system.

The only shortcoming of this approach is that it will not be possible to query for all of the strings in a system, but this was never a requirement anyway. Until now, this feature has just been a useful debugging tool.

Hybrid Tuples
I spent some hours with TJ trying to find the problems associated with the Node-Types test. We have been able to show that there appears to be no problems when a moderate amount of data has been inserted into the database. The problems only manifest when a large amount of data is present.

Selecting all literals from the system has shown that all of the expected data is present, which means that something is going wrong in the join code. More specifically, the requirement of a large amount of data means that the problem stems from a join against a file-backed HybridTuples object. This is because HybridTuples is specifically designed to change from a memory-only mode to a file-based mode once a certain quantity of data has been detected.

I took AM through the code that I am using to sort and join the literals with, and he said that it all looked OK at first glance. It was at this point that I discovered that my code had been changed. It used to select the doubles, dates and strings from the string pool, sort and then append them all. Now it gets two groups called "UNTYPED_LITERAL" and "TYPED_LITERAL". Typing this just now I realised that DM changed this as a consequence of the new string pool. This is because the string pool is now capable of storing many more types than were previously available.

With sorting and appending the literals checked, it would seem that it is the join that is at fault. AM does not know where the problem might be, but he concedes that the code is not well tested, as it is difficult to create the correct conditions for it.

In the meantime, I have been trying to help AM by creating a smaller dataset to demonstrate this problem. We tried changing the size that HybridTuples switches from memory to disk, but the minimum allowable size seems to be an 8kB block. This is larger than the data that many of our test string pools return.

I thought to load a WordNet backup file, and then output it to RDF-XML where I could trim it to a large, but manageable size. Unfortunately, the write operation died with a NullPointerException. I'm not really surprised, given the size of the data, but I should report it as a bug in the Kowari system.

A couple of attempts with the N3 writer also received the same error, but I decided that it might be due to a problem with the state of the system. After a restart I was able to dump the whole lot to N3. I'll try and truncate this in the morning.

Excludes
My first item of the morning was a discussion with AN on the excludes operation. It turns out that he has recently implemented a new semantic for it. When used in a query with an empty select clause it changes its meaning. It now returns true if the statement does not exist in the database, and false if it does. While I don't like the syntax, the effect is useful.

If this is applied in a subquery, it provides the function of filtering out all statements which do not fit a requirement. This is essentially the difference operator I have needed. It is not a full difference operation, as it will not work on complex constraint operations in the subquery, but to date I have not needed that complete functionality. Besides, there may be some way to have a complex operation like that expressed in the outer query with the results passed in to the subquery for removal.

Pleased as I am that the operation is now available to me, I still have a couple of concerns with it. To start with, I dislike changing the semantics of a keyword like this. Amusingly, AN suggested changing the name back to "not", as the new usage has a meaning with a closer semantic to that word. I've realised lately that the word "not" has numerous semantics when applied to databases (every developer has a slightly different idea of what dataset should be returned), so using a word of "elastic meaning" in a place which can change semantic based on context seems appropriate.

I'm also uncomfortable that this operation is done as a subquery. While it works, it has two drawbacks. The first is the complexity of the resulting query. It is a simpler construct than many I've seen, but it is still ugly and likely to land users in trouble. The second is the fact that the query layer must be traversed with each iterative execution of the subquery. This is made especially bad as the data gets globalised and re-localised in the process. The result is an extremely inefficient piece of code. If it were implemented as a difference operator instead, then all of the operations could be performed on localised tuples. This is about as fast as it gets, so it offers real scope for improvement if we need it.

In the meantime, it works, so I'll work with what we have. If it proves to be too slow, or the syntax proves too troublesome, then we know we can fix it easily.

Friday, October 08, 2004

Directions
One of the big plans for Kowari is to introduce the "Resolver" SPI. This is a set of interfaces which is used to wrap any datasource to allow it to be a backend for the Kowari query engine.

Using appropriate implementations of this interface, it will be possible to perform queries on all sorts of different datasources just as we currently perform queries on the Kowari datastore. Some obvious example datasources are MS Exchange servers, Oracle databases, and RDF-XML files. With Kowari's ability for a query to access multiple models at the same time, the Resolver SPI will allow for very powerful realtime data access.

To make this work, the current query code has to be split up from the Kowari datastore, to channel all access through this new interface. This will make the Kowari datastore "just another" backend that can be plugged in. The main advantage of the Kowari datastore over any other is that it should be the fastest possible backend for the kind of data that the resolver interface uses, and it will support all possible features. Other interfaces will not necessarily be capable of supporting every feature. For instance, an RDF-XML file will require access to be restricted to read-only.

Changes like this are far reaching, so we need to momentarily concentrate all our efforts on these modifications. If we worked on anything else in parallel, it would most likely need to be repeated to work with the new changes.

With this in mind, we had a meeting today about the different subsystems which will be affected by the changes. After determining all the major sections, we each picked up a few, and will spend the next couple of days working out just what is involved to make the changes. This should give us enough information to estimate how long it will take.

Node Types
Late in the day TJ discovered that one of the machines did not run the full suite of JXUnit tests correctly. The failing test turned out to be the search for literals in the node type tests.

The test finds the one string and one number found in the Camera ontology. For some reason, the test fails by only returning the number. As with the earlier problems, this only occurs when the full WordNet database is loaded. This indicates that there is some problem with the use or implementation of the HybridTuples class.

There are two places this could go wrong. Either the string is not being returned by the node-type model, or the result is not being joined against correctly. This is easy to determine, as it is possible to just return the literals and not join the result against anything. If both the string and the number are in there, then the problem is not with the node-type data being returned, but with how it gets joined to the other data.

Of course, with WordNet loaded, the result of all literals is huge. Because of this, the appropriate method of query is via the command line interface, with the results being streamed to a "greppable" file. Unfortunately, at this point the network had to go down due to a UPS replacement for the file server (it started beeping at lunch time). Given how late it was on a Friday, we took that as our cue to leave.

Engineering
At lunch time I went along to UQ as one of the staff members in ITEE was retiring, and they were putting on a free lunch. This gentleman had shown remarkable patience with me when I was finishing my first undergraduate degree, and I've always been particularly grateful to him for it. Besides, who am I to turn down free food? (I skipped the beer though, as I had to return to work) :-(

While there I met one of my old lecturers who asked about my current study. At one point in the conversation he asked when I graduated, and when I responded he made the comment, "Oh, that was before they cut the course." This then led to a conversation about the engineering degree at UQ.

First a little background...

When I went through we had a lot of subjects, and they were really hard. They were harder and required significantly more work than any subjects in my physics degree. Engineering decreed a minimum of 32 hours a week (I forget what that translates to in credit points).

I don't know about the other departments, but when I was first studying I was aware that Electrical Engineering was in a quandary about what to teach the students. Institutions such as IEAust, the IEEE and the IEE lay down a minimum set of requirements for a degree to follow in order to be recognised.

Operating in opposition to this, the university set out a time period for the degree (4 years) and a maximum number of credit points that a student could enroll in. With the restrictions imposed by the university, and the requirements imposed by the institutions, the department created a series of subjects with an enourmous amount of information crammed into them.

There were also further constraints imposed, as mathematics, physics and computing subjects had to be provided by the Science faculty, but these subjects were not as "crammed" as the engineering subjects. Actually, I used to benefit from these subjects, as they did not require much work in comparison to engineering subjects and they allowed me to relax a little.

While Australian universities have been offering fast-tracked engineering degrees in 4 years, other countries take a different approach, taking 7 or more years. This is probably of benefit, as the students would retain more information, and fewer of them would burn out.

Back to more recent times...

In the late 90's, UQ decided that they wanted continuity among all of their courses. A Commerce degree had a different workload to an Arts degree, and this led to differing costs for HECS. Since many faculties did not support an extra workload, faculties such as Engineering were told to cut the work requirements of the students. As a consequence, the hours required of students dropped from 32 hours a week, to 20!

The natural consequence of this, was that subjects had to be dropped. Entire areas of knowledge are not taught to students. Control theory (a difficult topic) has proven to be one of the most valuable subjects I was taught, and yet it is no longer available to students.

If you can't learn these topics at university, then where can you learn them? Research degrees don't cover them. One solution proposed has been to put this information into a coursework Masters degree. This is certainly the approach taken in the US and UK (hence my previous post referring to Masters degrees from those countries not equating to the same qualification here). But this then means that undergraduate students at UQ are not getting the degree they thought they studied and paid for.

My question is now why the IEAust, IEEE and IEE have allowed this. They can't prevent UQ from changing its courses, but they can certainly drop their accreditation of UQ degrees. Having a degree from an unaccredited university is useless.

I'm at a loss as to why none of the institutions have acted on this. There are requirements for an electrical engineering degree, and UQ has suddenly stopped meeting them. At this time the IEE accepts the accreditations provided by IEAust, and I'm guessing that the IEEE do the same. So why haven't IEAust done something about it? Obviously it is political decision (how could they take away accreditation from one of Australia's top "Sandstone" universities), but in the meantime students are missing out, and industry is not getting the professionals it requires.

I will be raising this at the next IEE committee meeting. It has been raised in the past, but at the time a "wait and see" approach was taken. However, last time it was suggested that if things were bad enough then it would be possible to stop accepting IEAust accreditation. Perhaps that would make the IEAust wake up and do their job.

Thursday, October 07, 2004

All Different
I started on owl:allDifferent today, but was unable to finish.

This predicate uses owl:distinctMembers, which is a predicate used to refer to a collection. When this type of collection is turned into statements it becomes a linked list (actually, a cons list). This means that any query which needs the contents of this list has to be capable of walking down the list. Enter the walk() function.

Walk() allows the query to specify a starting point (the head of the list) and the predicate to traverse (the rdf:rest predicate). The starting point is fixed, so it will either be a set value, or else it will be a variable holding a single value. The way to get the variable set to a single value is to pass it in to a subquery and perform the walk in there.

In the case of a collection, the starting point is a blank node, so there is no way to use a set starting point, and it must be passed in as a variable. Even if lists did start with a named node, there are an unknown number of lists, so they all have to be found, and the heads of each list put into a variable. The only way to reach each of these values one at a time, is again with a subquery.

Unfortunately, when I tried to use a subquery there was an error stating that a variable could not be used as the starting point of a walk. The iTQL parsing code expects to see a fixed value for the starting point, even though the implementing code is able to deal with a variable (of a single value). Theoretically this is a simple fix, and AN said he would do this for me.

Intersections
One of the last predicates to deal with is owl:IntersectionOf. There are several others, but not in OWL Lite. Also, OWL Lite only uses owl:IntersectionOf for restrictions, so it should not be too difficult to work with.

I am still convinced that we need a simple method in iTQL to check if a node matches a description, as described by the OWL abstract syntax. I had an attempt at doing the iTQL for each of these, and the results did not satisfy me. In the case of owl:hasValue it is easy to perform a query for nodes which meet a restriction. However, for other restrictions, such as cardinality, it gets very difficult to check if a class instance meets the restriction, and the result is not of an easily usable form. For instance, finding if nodes of a type which is an intersection of a class and a cardinality restriction, will return a node, along with a tuples which will be empty if it does not meet the restriction, and will have a line of data if the restriction is met.

Once AN saw the query required for this, he agreed that just because we could do this in iTQL, the complexity demonstrated that we shouldn't. The only issue now is how to do it.

AN favoured an approach where we use a rules engine, and create statements which indicate if an instance meets a description. Performing this iteratively expands what can be determined. While I approve of anything that will work (and this will), I prefer the approach of developing a recursive iTQL function which can do it for us. If we like, this function could even create new statements on the way, to help with future processing.

I prefer this for two reasons. The first is that it avoids going through the iTQL layer over and over again. The second is that it makes iTQL statements far more powerful, and does not require the use of a rules engine in order to run.

The most significant detriment for a function that determines description is that it introduces knowledge of OWL into iTQL. iTQL is an RDF query language. It provides the facilities for performing OWL queries, but it is not itself OWL aware. A function like this would be easy to implement, and very valuable, but it would step outside of the domain of iTQL, which does make me a little uncomfortable.

Back To Intersections
I also had a question on what can be inferred from intersections in OWL Lite. Often, a class will describe that it is an intersection of another class and an intersection. This is demonstrated in the example camera ontology:

<owl:Class rdf:ID="Large-Format">

  <owl:intersectionOf rdf:parseType="Collection">

    <owl:Class rdf:about="#Camera"/>

    <owl:Restriction>

      <owl:onProperty rdf:resource="#body"/>

      <owl:allValuesFrom rdf:resource="#BodyWithNonAdjustableShutterSpeed"/>

    </owl:Restriction>

  </owl:intersectionOf>

</owl:Class>

Instances of the "Large-Format" class must be of type "Camera", and must have something of type "BodyWithNonAdjustableShutterSpeed" for its "body" property.

Obviously, to check if an instance of "Large-Format" is consistent it must also be of the "Camera" class, and if it has any "body" properties then these must all be instances of the "BodyWithNonAdjustableShutterSpeed" class. My question is this: If such an instance were found, then is it legal to infer that it is of type "Large-Format"? I'm guessing not.

Alternatively, the intersection is permitted without a class declaration:

<owl:intersectionOf rdf:about="#MyIntersection" rdf:parseType="Collection">

  <owl:Class rdf:about="#Camera"/>

    <owl:Restriction>

    <owl:onProperty rdf:resource="#body"/>

    <owl:allValuesFrom rdf:resource="#BodyWithNonAdjustableShutterSpeed"/>

  </owl:Restriction>

</owl:intersectionOf>

Any compliant instance will meet this. I'm wondering if this should be used in any inferencing, though I'm not sure how, at least for OWL Lite. The only thing I can think of where this might be useful is if it were used by a restriction, in which case the recursive procedure I described earlier will cover everything. Of course, it can also be used in class definitions, but as I said just now, I think that will only be used for consistency checking.

Wednesday, October 06, 2004

Bowtie Operator
I've just had a look at yesterday's blog using several different browsers on the Mac. IE, Firefox and Camino, all show unicode character 8904 (⋈) as a question mark (?). Safari was the only browser that rendered this unicode character properly. I have no idea what the rendering will look like on the various apps in Windows or Linux.

I keep IE only for those sites which insist upon it (fortunately there aren't many of those). I use Firefox for blog entries, as it provides the full XUL interface that Blogspot offers. Camino seems to offer all the advantages of Firefox, only with the lovely native Aqua widgets. Unfortunately, it does not render the blogger editor correctly, so I rarely use it.

My browser of choice is Safari. This is because it is fast, attractive, and 99.9% compatible with most sites. Importantly, it is fully integrated with the other OSX apps. This is the feature which has me using the OSX Mail program instead of Firebird (even though I feel that Firebird has a better interface, I just can't believe that it does not interface directly into the OSX address book).

It now seems that Safari is the only browser which can render certain unicode characters. This only encourages me to keep using it.

owl:differentFrom
Some of today was spent on owl:differentFrom. This was much easier than owl:sameAs as there is not a lot of entailment that can be made on this predicate.

While documenting the predicate I wanted to keep saying "different to" instead of "different from", so I thought I'd look it up. It seems that "different from" is the common expression in the US, while "different to" is the more common British (and hence, Australian) usage. Since most people reading this will be in the States I opted to stick to "different from". Besides, it matched the OWL predicate as well.

The only consistency check for this predicate was one that had already been done for owl:sameAs. While I was tempted to simply refer to the other section, I wrote the example out again, along with an explanation which was more appropriate for the context.

owl:allDifferent
I spent some time looking at the owl:allDifferent predicate, with some frustrating results. This predicate uses the very lists I hate in RDF. As a result, the only way to select each item in the list is with a walk(...) operator.

Unfortunately, the walk(...) operator requires a fixed end to start from, but the end in this case is a blank node. The only way to "fix" this end is to find it in an outer query and then pass it down via a variable into an inner query which performs the walk. This would seem OK, only the walk function does not accept a variable as its starting point (even one with a fixed value as would happen in a subquery). Late in the day AN said that he could fix this relatively easily, so I am hoping he can have it done early tomorrow.

Descriptions
I had a brief work with AN about the need to find out if a node meets a description as defined by the OWL documentation. He is not keen on the idea of a new function like I proposed, so we discussed some options.

The most prevalent option is to let the rules engine do the work here. The problem is that there are no OWL constructs which can describe that an object meets the conditions of an owl:IntersectionOf or many of the other constructs which can make up a description. AN pointed out that we can use an internally defined predicate to describe this type of thing, which would solve the problem.

If we took this route then we are still left with a couple of options. The first is to define the appropriate rules in the rules engine which can generate these statements. The engine would then be able to derive the entire set of description statements for all the objects in the system. While this would work, I believe that it would be slow, particularly as it would need to traverse the entire iTQL layer for each invocation of the rules.

The second option is to implement the function as I initially proposed it, but to insert the description statements as each recursion returns. This would then permit subsequent invocations to use these statements instead of the original algorithm, providing a significantly reduced workload. Other unrelated operations may also be able to take advantage of knowing which objects meet which descriptions.

AN also mentioned another option which involved performing a lot of intermediate operations with the use of iTQL. While this would obviously work, I do not like the idea of re-entrant iTQL. Besides, it suffers from the inefficiencies of the iTQL that I have already mentioned.

Whatever the outcome, we really need to see some way of letting an inferencing query determine if an object is of a given "type", without that query needing to do any difficult work in that regard. Many of the required inferencing queries are complex enough already without adding to their difficulty.

Another advantage of checking for description compatibility in some internal fashion is that it allows several steps of a rules engine to be skipped. With a check like this it is possible to immediately determine that an instance of a class is also an instance of a superclass (and all superclasses above that). With the purely rules-based approach this information will only be available after executing a number of rules, each of which involves another transition through the entire iTQL layer. While that approach will work, it would be significantly slower than simply giving a rule immediate access to all types for an object.

I suppose we still need to discuss this before any decisions can be made.

Tuesday, October 05, 2004

owl:sameAs
The majority of today was spent building examples for the owl:sameAs predicate, and documenting the results.

This ended up being a bigger task than most other predicates. Entailment has to reflect symmetry, transitivity, and reflexivity. It also has to copy every statement which refers to or is referred by a node that is the owl:sameAs another.

Then there is consistency checking. Instances of classes can only be the same as other instances, classes can only be the same as classes, and predicates can only be the same as predicates. Objects cannot be the same as objects which they are also different from. These all get expressed as separate queries, and all required example data to be created.

Adding to the complexity of these queries is the fact that they do not fall within a rules engine. This then means that there is no set of pre-inferred statements to connect Properties with each of its subtypes. As a result, the statements which constrain on having an <rdf:type> of <rdf:Property> also constrain on <owl:ObjectProperty>, <owl:DatatypeProperty>, <owl:FunctionalProperty>, <owl:InverseFunctionalProperty>, <owl:TransitiveProperty> and <owl:SymmetricProperty>.

I created the sample code while putting together the documentation. This had its advantages and problems. The advantage was that I was able to proceed with each item as I arrived at it. The disadvantage was that new example data caused modifications to the example data for previously completed items. This meant that I spent a lot of time re-issuing queries, which obviously took some time.

Difference Operator
On Monday night I spoke to SR about the operations fundamental to a difference operator. AN had been insisting that it could be done using a complement operator. However, since a simple inner join against a complement on a constraint does not accomplish this I decided to work out why.

It didn't take long to realise that the complement against the triplestore was not providing what we needed. I discussed some of this with SR, but his questioning demanded specifics. This was probably a good thing, as it forced me to make sure that I had everything right.

The inner join operator appears as: ⋈

This is the "bowtie" operator. I've included it here with the HTML of &x8904;

If the operator does not show up correctly... then sorry. You'll just have to bear with me. :-)

Take A as a set of rows n columns wide. B is a set of rows m columns wide. The columns of A are:
a₁ a₂ a₃ ... a_n
The columns of B are:
b₁ b₂ b₃ ... b_m

An inner join ⋈ is a combination of 3 operations.

If I do an operation of A ⋈ B, then this can be broken down into:
A x B (cartesian product)
σ(A x B) (selection)
π(σ(A x B)) (projection)

So:

A ⋈ B = π(σ(A x B))

The cartesian product ends up with a new set (which for convenience I'll call C). So:
C = A x B

The column names in C are:
c₁ c₂ c₃ ... c_n c_n+1 c_n+2 ... c_n+m

Which corresponds to:
c_i ≡ a_i (1 ≤ i ≤ n)
c_j+n ≡ b_j (1 ≤ j ≤ m)

The σ(...) operation selects those rows where two or more columns are equal. ie. For one column:
σ_Ci=Cj(C)

The projection operation removes duplicate columns (for a selection of σ_Ci=Cj(...) then this removes either c_i or c_j), and also removes columns where the values are fixed to a constant value. So the total number of columns at the end t, is given by:
t < n + m

Since the selection operation is guaranteed to provide at least one duplicate column (i.e. at least one column name is shared between A and B, else this would be a cross-product), this can be further restricted:
t < n + m - 1

The final projection operation is a mathematical convenience, as it does not change or add any information to the result. All it does is remove unnecessary information. As such, it is valid to consider the inner join without the projection operation for algebraic purposes.

This leaves us with a modified description of an inner join ⋈ which is defined as:
A ⋈ B = σ(A x B)

From here on, I'll use this definition of an inner join instead of the original. This is safe as since the missing π(...) operator simply removes redundant data that is not usually required.

Consider the universal sets of n and m columns respectively: U_n U_m.

("Universal" here refers to every possible value, and is an essential part of reasoning with an Open World Assumption)

By definition:
A ⊆ U_n
B ⊆ U_m

The cartesian product of A and B will then be in the space defined by the cartesian product of U_n and U_m.
A x B ⊆ U_n x U_m

If we consider a restricted universe of U'_n and U'_m, where:
A ⊆ U'_n
B ⊆ U'_m

(Since we don't have knowledge of every possible value in a space, we will use "Restricted Universal" to refer to all the values we know about. i.e. the contents of the database)

Then the product of A and B will also fall into this space:
A x B ⊆ U'_n x U'_m

The selection operator does not remove any columns (ie. does not modify the space). Also, it provides a restriction on a set, meaning:
σ(A x B) ⊆ A x B ⊆ U'_n x U'_m

Now since we have defined ⋈ as:
A ⋈ B = σ(A x B)

This means:
A ⋈ B ⊆ U'_n x U'_m

The definition of a complement S with respect to the universal set is:
S ∪ ¬S = U

Choosing an S that is a subset of the restricted universal set:

S ⊆ U'

Then let us define the restricted complement ¬' as being with respect to the restricted universal set:
S ∪ ¬'S = U'

With this definition we can say:
(A ⋈ B) ∪ ¬'(A ⋈ B) ⊆ U'_n x U'_m

It is important to note that ¬'(A ⋈ B) appears in the same space as U'_n x U'_m.

Finally, we can define a projection operator which selects out the A columns from the (A x B) cartesian product. Remember that (A x B) has columns:
c₁ c₂ c₃ ... c_n c_n+1 c_n+2 ... c_n+m

So we define π_n(S) as reducing the space of S from n + m columns down to n columns. This means that if S has n+m columns of:
c₁ c₂ c₃ ... c_n c_n+1 c_n+2 ... c_n+m
And S' has n columns of:
c₁ c₂ c₃ ... c_n

Then:
S' = π_n(S)

This operator loses information from the columns c_n+1 c_n+2 ... c_n+m. The result of this operator is within the same space as the original set A.

When applying this to a set in the space of (U'_n x U'_m) we get the columns which come from U'_n. As defined earlier, A ⊆ U'_n. Applying this operator to the cartesian product of A and B:
π_n(A x B) = A

Since:
σ(A x B) ⊆ A x B
σ(A x B) = A ⋈ B

Then:
A ⋈ B ⊆ A x B
π_n(A ⋈ B) ⊆ π_n(A x B)
π_n(A ⋈ B) ⊆ A

This means there are elements which are in A, but are not in π_n(A ⋈ B). So if we take ALL of the elements which are not in π_n(A ⋈ B) we will find some which are in A, and some which are not. In other words:
π_n(¬'(A ⋈ B)) ∩ A ≠ ∅

Let us refer to π_n(A ⋈ B) with the set D. To remove from A those elements which appear in both D and A we can write:
A ∩ π_n(¬'(A ⋈ B))

If we look at the construction of this, we will see that the result is a subset of A (since A ∩ S ⊆ A), and that all rows which the selection operator which would have matched A to B have been removed.

To confirm this, we can looking at the opposite construction:
A ∩ π_n(A ⋈ B) = π_n(A ⋈ B)
Since:
π_n(A ⋈ B) ⊆ A

Remember that by definition:
¬'(A ⋈ B) ∪ (A ⋈ B) = U'_n x U'_m
¬'(A ⋈ B) ∩ (A ⋈ B) = ∅

Which leads to:
π_n(¬'(A ⋈ B) ∪ (A ⋈ B)) = π_n(U'_n x U'_m)
π_n(¬'(A ⋈ B)) ∪ π_n((A ⋈ B)) = U'_n

(This step relies on the distributivity of π_n(...). This is easily provable, but I haven't done it here).

Similarly:
π_n(¬'(A ⋈ B)) ∩ π_n((A ⋈ B)) = ∅

Since we know that (A ⋈ B) is the inner join, then this defines all the rows where A and B match according to the selection operator. So π_n((A ⋈ B)) defines all those rows from A which the selection operator would match against B. Since:
π_n(¬'(A ⋈ B)) ∩ π_n((A ⋈ B)) = ∅

Then we know that π_n(¬'(A ⋈ B)) contains all the rows from A which did not match the selection operator, plus all the rows in U'_n which are not in A.

This then defines our difference operator:
A - B = A ∩ π_n(¬'(A ⋈ B))

Complement Operator
While the above definition is all quite sound, it assumes that the result of ¬'(...) is readily available. However, this complement is based on one or more cartesian products of the entire statement store against itself. When applied to database of decent size, this result can easily get too large to work with. Instead, a difference can be easily worked out by removing rows from the first set which match rows on the second set. This will be at least as fast and memory efficient as the current inner join.

So why have this pseudo-proof description if the operations described are not going to be implemented? Well I have three reasons. The first is that it provides a sound mathematical basis for the operation, and this helps us to recognise problems from the outset. The second reason is that it provides a full justification of the operation for anyone skeptical about its operation, such as SR. The final reason was to determine just what it is about the complement achieved with the excludes() operator that was not working for us.

Monday, October 04, 2004

Documentation
Today was just about the same as Friday. I spent the time putting together example OWL code, writing entailment queries in iTQL, and then documenting the whole thing. Most of what I got done today revolved around owl:equivalentClass and owl:equivalentProperty.

Strangely, the queries I had originally written for this had the wrong OWL constructs in them. I wasn't paying attention during the conversion, but Volz had used several constructs which did not belong to OWL, but rather to DAML. For instance, Volz described sameClassAs and samePropertyAs, when these are from the DAML namespace. Fortunately, this is a trivial problem, as these declarations have a direct mapping into OWL, translating into owl:equivalentClass and owl:equivalentProperty.

Set Operations
While considering how to perform an inverse on a constraint, ala the complement operator except, I realised what my problem has been all along.

I keep making the mistake of describing an inner join as a set intersection. While analogous, this is not quite right. A constraint defines a set of triples, and if this is joined via a variable to another set of triples, then the result can be up to 6 values wide (not that there is a reason to use more than 5, since 6 columns means an unbounded cartesian product).

In fact, the solution is all 6 columns of the cartesian product with a few transformations applied to it. For a start, the sigma operation (select) will define one or more columns to be equal. These repeated columns are redundant and are consequently dropped, though strictly speaking, they do still exist. Similarly, any columns of fixed values are redundant, so they are also dropped. This dropping of columns is a projection of the 6 column result down to just the columns with the useful information in them, ie. the variables.

So the sigma operator defines the elements to keep from the cartesian product, giving us the result, and the final projection just removes unneeded columns. Note that mathematically these columns do exist, they just contain known data so there is no need to keep them.

The point of this is that the inputs of the join are a pair of sets of triples, but the output is a set of 6-tuples, based on a cartesian product. This is most definitely not an intersection operation.

We had been considering a "difference" operation to be an intersection with a complement of a set, but now I've come to realise that this is not the case. Instead, it would be implemented by the normal selection operation:

  A - B = { x: x_i ∈ A ∧ x_j ∉ B }

Where i and j are the selection columns.

I think that it might also be possible to define a ¬B based on the cartesian product of the entire dataset with itself, but I'm a bit tired to check this properly. The idea is that we need the inverse of B, not in respect to the space that it occurs in, but in the space of the result. Since the result of an inner join is that of the cartesian product, this makes the complement of B also appear in that space.

Friday, October 01, 2004

Documents and More Documents
As expected, today was spent documenting, only it was a little more intense. I provided examples for owl:SymmetricProperty, owl:TransitiveProperty, owl:FunctionalProperty and owl:InverseFunctionalProperty, as well as testing each, and documenting them all.

If nothing else, I got more practice writing RDF-XML, which is something I sorely need. During this process, and with a little help from SR, I discovered some of the answers to the questions I had about namespaces yesterday.

When I was using xmlns in a tag, it was being lost by the fact that the tag was already declared as being in the rdf namespace. So it was doing nothing for me.

I also discovered that using rdf:about to identify nodes picks up the base namespace of the document, xml:base, and simply tacks the name onto the end. For this reason, all the names I use in this way have to be prepended by #. On the other hand, using rdf:ID tells the parser that this is a fragment, so it automatically uses a # to separate it from the namespace.

The rdf:resource attribute seems to interact with namespaces similarly to rdf:about, so it is necessary to either fully qualify its domain, or else prepend with a #.

Still, it can be confusing with so many ways to represent the same thing, and with me only having a half of an idea of how it works. For instance, why would I want to choose:

<shutter-speed rdf:parseType="Resource">

Instead of:

<rdf:Description rdf:about="#shutter-speed">

There's probably some important difference, but for the examples I'm doing it doesn't seem to make a difference.

On the subject of important differences, I recall a conversation explaining something about the differences between rdf:about and rdf:ID, but I haven't been able to find it (not that I've looked too hard yet). From memory, there is something more significant than namespaces. It might have had something to do with referencing things outside of the current domain. This might be why rdf:about is preferred for declarations, as their visibility may be affected.

Sets and Differences
I think I have a better idea of what we need in order to do a difference now, and why NOT (or SELECT) wasn't working for us. I don't have time just now, but I'll try and explain it on Monday night.

In the meantime, AN has been able to send me his 3 level query which he thinks will produce an effective difference query. I'm not convinced, but subqueries can be complicated little beasts, so I won't judge it until I've had time to examine it properly. Either way, I still think we need a proper difference operation. If it is not possible to do a difference with multiple levels of subquery, then we need to implement it just for the sake of completeness. If it is possible, then we still need to implement it for the sake of efficiency, and also because the queries end up so complicated that they are almost impossible to understand.

Working notes