Wednesday, August 18, 2004

Syntax
Today was spent doing very little coding, which I found quite frustrating. TJ won't be happy, but there was nothing I could do about it.

The constraint implementation described yesterday needed to create a new type of Tuples object so that it performs the correct type of transformation when joined against. This didn't seem quite right, and since the appropriate code belongs to others, I took the time to ask their advice. Of course, I ended up with 4 people offering 5 opinions, so there was a lot of discussion before we could reach consensus.

After considering it for some time I realised that cardinality needs 4 things:

  1. A constraint expression.
  2. A list of variables (usually just one) from the constraint expression.
  3. The type of integer comparison to perform ( > < = ).
  4. The scalar integer value to compare against.
The list of variables provides the grouping of item 1 needed for the counts. I couldn't come up with a use-case which would ever need more than one variable in this list, but it's normally best not to place artificial restrictions when they are not needed. Besides, as soon as we decide not to do something like this, we invariably get an email requesting that very feature, normally within 6 months.
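
To make that concrete, the four items amount to a parameter object shaped something like the following. The names here are invented for illustration; nothing like this exists in Kowari yet.

import java.util.List;

// Illustrative only: a holder for the four items listed above.
public class CardinalityParameters {

  // Item 3: the kind of integer comparison to perform.
  public enum Comparison { GREATER_THAN, LESS_THAN, EQUALS }

  private final Object constraintExpression;  // item 1 (a constraint expression in practice)
  private final List<String> variables;       // item 2: usually just one variable
  private final Comparison comparison;        // item 3
  private final int count;                    // item 4: the scalar value to compare against

  public CardinalityParameters(Object constraintExpression, List<String> variables,
      Comparison comparison, int count) {
    this.constraintExpression = constraintExpression;
    this.variables = variables;
    this.comparison = comparison;
    this.count = count;
  }
}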

Since item 1 is a tuples and the output is a tuples, this can really be described as a parameterized mapping from tuples to tuples. Hence it makes sense to consider this as a function (yes, I've been reading more from the text on Predicate Logic). The type of comparison found in item 3 would probably be best described by declaring 3 different functions, which in turn reduces the required parameters.

No one seemed to dispute any of this, but we could not get consensus on the syntax to express any of it. AN liked the idea of using a pure function notation, pointing out that we mostly do that already, particularly with trans() and walk(). He preferred a function-like syntax, with all the parameters appearing as 3-element statements, separated by "and" keywords. This is exactly what trans() and walk() do. SR came up with a syntax that was much more wordy, making it look like a statement. The first element of this statement was the integer from part 4, the second element was a predicate describing the scalar comparison from part 3, and the third element was like a subquery. Item 2 was the "select" clause of this subquery, and item 1 was the "where" clause.

After spending many hours getting consensus on the semantics of this, I was getting frustrated at the time taken to work out the syntax. I wondered if it might be possible to implement this function without finalizing the syntax, but I was reluctant to proceed, as a poor choice in the realization of some of the required items could make parsing the syntax into those items much harder than it needs to be.

Finally TJ entered the fray, and asked about using subqueries and count(). To start with this didn't appear to be useful to us, as it only provides a scalar integer as a result, but then TJ showed how it could be left in the results (with its implicit variable name of $k0) and then those lines which didn't match on $k0 could be dumped. After investigating this, it appears to work, and will only require a minimal change.

The required change is the introduction of a "having" clause which can use a constraint with a predicate of occurs, occursLessThan or occursMoreThan. This will be applied as a kind of filter at the last moment.

The advantage of doing things this way is that it involves a minimal change, and should be reasonably fast to implement. The problem with this method is that the result must come with the $k0 variable built in, so it couldn't be used for doing insertions. However, I don't think this kind of query needs that functionality, so it might be OK. Another problem is that the syntax is so incredibly obtuse that I barely remember it. Tracing through how it was going to work took me 5 minutes before I could see that it would do what we want. Even now I'm not sure I can remember it. For this reason alone I'm concerned about implementing the code in this way, but cardinality is already late, and I really need something done before I go on holidays at the end of this week.

I suppose the best outcome here would be to get it working, but in my absence for someone to discover a problem, and on my return I could do it in the original way... once someone has worked out a syntax!

Inferencing
In the meantime, I've been getting the RDFS inferencing paths working again. Unfortunately I've been using my Mac, which has 2 problems. First, it already has something bound to port 8080, which the HTTP server wants to use. Strangely, I couldn't easily figure out what was already using that port, as netstat did not report anything on it. I should just point my browser at it and see if I get a response.

The second problem appeared when I moved the server over to my desktop machine. It is named "chaos", and I have an entry for it in /etc/hosts. Unfortunately, Java is failing to find a server named "chaos" and so iTQL falls back to 127.0.0.1, which is never going to work, even if the server is on the same machine as the client.

I ended up running both the iTQL client and the server on my desktop machine (with the iTQL GUI displayed via X), which worked fine. However, running the inferencing rules had the same addressing problems, meaning I would need to move all of the Drools code up there as well. I took that as my cue to leave it until morning.

Tuesday, August 17, 2004

Cardinality
Cardinality implementation is an elusive beast. I've been working on it this evening, as I didn't get as far as I wanted today. After discussing things with AN this morning, we came up with a good plan for implementation, but turning this into code has not been as simple as I'd hoped for.

To start with, AN has refactored a lot of the constraint resolution for Kowari. I don't have a major problem with that per se, but it invalidates a lot of what I knew about the way things were done. Consequently, I spent quite a bit of time finding code that I knew existed, but I didn't always know where to look for it since it was refactored into a new home.

Cardinality constraints have to be done after everything else has been resolved. To force these constraints to be executed last, I've set the constraint's row count to Long.MAX_VALUE. AN suggested using the size of the graph plus one, but there is no graph for this kind of constraint, so I opted for a number that would be obviously larger than any other.
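
For illustration, the override amounts to no more than this. The method name is my guess at the relevant Tuples operation rather than a quote from the API.

// Illustrative sketch: report the largest possible size so that the join
// optimizer always schedules a cardinality constraint after everything else.
public long getRowCount() {
  return Long.MAX_VALUE;
}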

It may have been possible to tell the execution plan about the need for cardinality constraints to always come last, but I didn't like this idea. The code for the execution plan is clean and generic, so the last thing we want is for it to start making special cases for particular types of constraints. Conversely, setting the size of a constraint to a large number like this is really just subverting the whole execution plan anyway, so perhaps some special knowledge wouldn't be so bad after all. I'll leave it "as is" for the moment.

Essentially, cardinality resolution gets done in one of two ways:

  • If the variable of the cardinality constraint is not found in the tuples the constraint is being joined on, then the constraint will be applied to the entire tuples. First, the variable has to be associated with a column (this must be done with another joined constraint, otherwise the variable is completely unbound and will throw an exception). Once associated, that column can be searched for unique values, by joining on a singleton tuples with the variable associated with that column. The rows of the resulting tuples then get joined to the entire set one at a time. If the results are of the correct size, then they get appended to the final result.

  • If the variable is found in the tuples to which this constraint will be joined, then a join on that variable will provide a unique list of values. This list can be used one item at a time, to join against the original tuples, and if the sizes of the resulting tuples are correct then they will be appended to the result (a rough sketch of this case follows below).
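Sketching that second case with plain collections: every name here is hypothetical, and the real code would operate on Tuples and column indexes rather than maps.

import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.IntPredicate;

// Hypothetical sketch of cardinality filtering. Each "row" maps variable names
// to values; the real implementation would operate on Tuples, not maps.
public class CardinalitySketch {

  // Keep only the rows whose group (all rows sharing a value for the given
  // variable) has a size that satisfies the cardinality test.
  public static List<Map<String, Long>> filterByCardinality(
      List<Map<String, Long>> rows, String variable, IntPredicate cardinalityTest) {

    // A join on the variable yields the unique list of values it takes.
    Set<Long> uniqueValues = new LinkedHashSet<Long>();
    for (Map<String, Long> row : rows) {
      uniqueValues.add(row.get(variable));
    }

    List<Map<String, Long>> result = new ArrayList<Map<String, Long>>();
    for (Long value : uniqueValues) {
      // Join the single value back against the original rows.
      List<Map<String, Long>> group = new ArrayList<Map<String, Long>>();
      for (Map<String, Long> row : rows) {
        if (value.equals(row.get(variable))) {
          group.add(row);
        }
      }
      // Append the group only if its size passes the count comparison.
      if (cardinalityTest.test(group.size())) {
        result.addAll(group);
      }
    }
    return result;
  }
}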
From what I can tell, I'll need to implement this whenever a ConstraintConjunction is executed. Execution of constraints happens inside ExecutionPlan but the work gets delegated to the TuplesOperations.join() method. This creates a problem, since this method is only aware of tuples, and not of the constraints they belong to.

As a result, I'm thinking I need a new type of Tuples which can represent the need to constrain by cardinality. However, it's late, so I'll sleep on it, and get AN's opinion in the morning (maybe even SR, as he is back now).

Monday, August 16, 2004

Timeliness
Recently I've been tending to write these the day after the events described. Hopefully they won't be suffering too much from the lack of attention. It's partly due to needing to help with Luc in the evenings, and also because of my preparations for this year's Noosa.

Cloning
Friday's problem was tracked down through the extra logging I threw into the exception trace. My first problem was that I hadn't implemented cloning on the ConstraintIs class. As a result, the clone() method from the ConstraintImpl superclass was being called, which converted the constraint into a ConstraintImpl. This meant that the wrong constraint resolution method was being called, which was the basis of the problem.

Adding the clone() method fixed this, but not quite as well as I'd hoped, with a number of failures still occurring in the tests. At this point, I decided that I was running into problems associated with inheriting methods (such as clone) from ConstraintImpl, and so I removed that inheritance, and copied a lot of code from ConstraintImpl to ConstraintIs.

It was while copying the code that I found the problem. This meant that I could probably go back to inheritance, but AN expressed a preference to split the constraint classes up anyway, so I was happy to leave it. Perhaps an abstract constraint class could be created for them all to inherit from, rather than having them all inherit from ConstraintImpl as I had initially done.

The problem I found was fairly simple. I had implemented clone by cloning all the nodes except the KOWARI_IS node, as this will always be fixed. However, I hadn't anticipated that the constraint could be localized, meaning that the node would be changed to a local node, with only a long integer as an identifier. Once I realised this I changed the code to clone the predicate if it isn't equal to KOWARI_IS, and voila, another couple of tests passed.
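
The resulting clone ends up along these lines. This is a simplified sketch with placeholder types, not the real ConstraintIs source.

// Hypothetical sketch of the corrected clone() (simplified types; the real
// constraint elements are not plain Objects).
public class ConstraintIsCloneSketch implements Cloneable {

  static final Object KOWARI_IS = new Object();  // stands in for the global node

  Object subject;
  Object predicate;   // KOWARI_IS, or a local node after localization
  Object object;

  @Override
  public ConstraintIsCloneSketch clone() {
    try {
      ConstraintIsCloneSketch copy = (ConstraintIsCloneSketch) super.clone();
      // Only leave the predicate alone when it really is the global KOWARI_IS
      // node; a localized constraint has swapped it for a local node (a long
      // identifier) which must be cloned like the other elements.
      if (!KOWARI_IS.equals(predicate)) {
        copy.predicate = cloneElement(predicate);
      }
      return copy;
    } catch (CloneNotSupportedException e) {
      throw new AssertionError(e);
    }
  }

  private Object cloneElement(Object element) {
    return element;  // placeholder; the real code clones the node
  }
}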

Subqueries
The next problem was similar to the cloning issue. SubqueryAnswer was creating a new constraint with its own version of a KOWARI_IS predicate. In one instance the predicate was global, and in the other instance it was local. An existing constructor would work for the former, but the latter needed a new method created which would accept the local node. The "query" package that ConstraintIs lies within does not have access to the database, so this node has to be pre-allocated and passed in, rather than having the ConstraintIs class find the node for itself.

Given that subqueries only needed to allocate new ConstraintIs objects, I created a factory method for these new requirements, and put it directly on the ConstraintIs class.

Blank Nodes
The final two failing tests had me stumped for several hours. They were both in JRDFGraphUnitTest and should not have been impacted by any kind of changes to constraints.

I finally discovered that these tests were expecting that a new blank node allocated for a fresh graph with a predetermined number of statements should have an ID of "_node58". Since I had changed the database to no longer allocate nodes for some of the magic predicates, this number had changed.

I spoke with the culprit, and he confessed that even as he wrote the code he thought using hard-coded node IDs was lazy. I know he'll read this, so hopefully he won't be so lazy next time. :-)

The fix was to put markers into the array that is compared against the blank nodes, and replace those markers with the correct IDs as soon as the blank nodes are allocated. That way any further changes will not affect these tests.
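
The marker idea is simple enough to show in a few lines. The names here are made up, not the actual JRDFGraphUnitTest code.

import java.util.Map;

// Hypothetical sketch: expected statements use placeholder markers for blank
// nodes, and the markers are substituted once the real blank nodes have been
// allocated, so the test no longer depends on which IDs the node pool hands out.
public class BlankNodeMarkers {

  // Replace each %BLANK_n% marker with the ID actually allocated for it.
  public static String[] resolve(String[] expected, Map<Integer, String> allocatedIds) {
    String[] resolved = expected.clone();
    for (int i = 0; i < resolved.length; i++) {
      for (Map.Entry<Integer, String> entry : allocatedIds.entrySet()) {
        resolved[i] = resolved[i].replace("%BLANK_" + entry.getKey() + "%", entry.getValue());
      }
    }
    return resolved;
  }
}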

Friday, August 13, 2004

Visitor Pattern
After getting through some email this morning, my first task was to clean up my modifications (i.e. clean up code that is calling methods I've yet to write) and make sure it all works. During this period I started wondering about the use of the visitor pattern for resolving constraints, and whether it would work for a subclass of a constraint.

Binding the function to call is done at compile time, and my concern was that ConstraintIs might not be passed into the visit() method correctly. ConstraintIs extends the ConstraintImpl class. The problem happens because of an overloaded method (in the ConstraintResolver class) which takes a ConstraintImpl in one implementation, and a ConstraintIs in the other. If code in the resolver class takes a ConstraintImpl object and passes this object into the overloaded method, then the compiler will make a selection of the method to call based on the type of that parameter. The compiler has to choose one method or the other, and it has no idea that I might be passing in a ConstraintIs, so it chooses the method which takes a ConstraintImpl.

The only way that Java can select a method at runtime is when it is calling a method through an object reference. The reference to the object provides the indirection necessary to find the method. For instance, this is how I can call toString() on any object of a class derived from Object. Using overloaded methods on another class, and choosing to pass in differing objects at run time, provides no mechanism for Java to work out where to go.

Fortunately, my concerns were unfounded and demonstrated my lack of understanding of the use of the visitor pattern. This pattern is specifically designed to circumvent the very problem I was concerned about. By calling accept on the destination object (either a ConstraintImpl or a ConstraintIs), the call is dispatched at runtime to the correct class. The implementation of the accept method is a trivial call to visit(this) on the originating class, providing the correct type (via the this parameter) for the compiler to statically bind.
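
Boiled right down, the double dispatch looks like this (a simplified sketch, not the actual Kowari classes):

// Minimal sketch of the double dispatch. The accept() call is dispatched at
// run time to the real class of the constraint, and inside accept() the type
// of "this" is known exactly, so the compiler statically binds the right
// visit() overload.
interface ConstraintVisitor {
  void visit(ConstraintImpl constraint);
  void visit(ConstraintIs constraint);
}

class ConstraintImpl {
  public void accept(ConstraintVisitor visitor) {
    visitor.visit(this);   // binds to visit(ConstraintImpl)
  }
}

class ConstraintIs extends ConstraintImpl {
  // Forgetting this trivial override means the inherited accept() is used,
  // and the ConstraintImpl overload of visit() gets called instead.
  @Override
  public void accept(ConstraintVisitor visitor) {
    visitor.visit(this);   // binds to visit(ConstraintIs)
  }
}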

AM took me through much of this, and once I "got it" I realised that I hadn't implemented the trivial accept method on ConstraintIs. This meant that it would inherit the accept method from ConstraintImpl which was not what I wanted. It's always nice to find bugs before you run them.

I was still curious about the use of the visitor pattern, as I didn't see why the constraint resolver would want to have a different method to resolve each different type of constraint, and not call a method on a constraint to do the work for it. This would remove the method binding problem as well. It turns out that the forthcoming resolver interface will want to provide different mechanisms for resolution, meaning that we need numerous classes able to do the constraint resolution, on identical constraint objects. Sure, it can be done without the visitor pattern, but then the constraint objects will need to know far too much about the internal workings of the classes which are to do the resolving.

Cloning
After getting everything tidied up and compiling, I had a go at testing the new setup. This didn't work, and it was because of the tucana:is predicate.

Logging during tests is particularly bad, as most logs get swallowed up, with just an exception string being saved in the output. So I added a lot more information (including a stack trace) to the exceptions I was seeing, and this helped me track down the problem to the clone method on ConstraintImpl. Since I had ConstraintIs extending ConstraintImpl it was picking up the inherited clone method, and hence losing all of the type information. Reimplementing clone at the ConstraintIs level changed the error I was seeing, but didn't fix the problem. I'll be working on that on Monday.

Thursday, August 12, 2004

Constraint Resolution
Nothing more happened in terms of design today, but I was pretty happy with the progress nonetheless. I've changed the structure of the code back into a working system, and made a start on the code for cardinality.

Moving the recognition of the kowari:is predicate out into the parser has solved several problems and made the code a lot simpler. These benefits are carrying through to the other constraints, so the design change seems to be a good one.

Using a factory method to create the appropriate type of constraint is the only part of the code which is doing a lot of if()...else statements, and even this is reasonably simple. More importantly, because the predicates are being recognised at parse time instead of execution time, there is no need to have the magic predicates stored in the database on initialization just so they can be recognised while resolving the constraint. With the advent of different database back-ends with the new Resolver SPI, it won't always be possible to add magic predicates to a database, so this change turned out to be quite important anyway.

There are other advantages as well. Checking of the constraint parameters used to be done on the server while the constraint was being evaluated, but now this is done by the constructors of the constraints. This catches problems earlier, and simplifies the execution code on the server. Finally, the fact that AN has already changed the ConstraintResolver to use the visitor pattern means that new constraint types can be introduced trivially, simply by adding a visit() method with a parameter of the constraint type required.

Moving over to this arrangement removed 29 lines of kowari:is code from a visit() method which was handling all types of constraints, and added a one line visit() method which accepts a ConstraintIs class. It's simpler, easier to read, does not involve myriad if() statements, and is far more maintainable.

The new ConstraintIs class was trivial as well. It just extends the standard ConstraintImpl class, with a few minor changes. The constructor does not need the predicate, as this is defined already by the nature of the class. It also does checks on the types of the data. At this point we only permit kowari:is to be used with a variable subject and a fixed-value object, so the constructor can check for these as well. Finally, a couple of convenience accessor methods return the value and the variable without the need to cast.
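
In outline, the class looks something like this. The element types and accessor names are placeholders rather than the real declarations.

// Simplified sketch of the shape described above, not the actual ConstraintIs.
public class ConstraintIsShapeSketch {

  interface Variable {}       // placeholder for the query Variable type
  interface ValueNode {}      // placeholder for a fixed-value node

  private final Variable variable;  // subject: must be a variable
  private final ValueNode value;    // object: must be a fixed value

  public ConstraintIsShapeSketch(Variable variable, ValueNode value) {
    // No predicate parameter: the predicate is implied by the class itself.
    if (variable == null || value == null) {
      throw new IllegalArgumentException(
          "kowari:is requires a variable subject and a fixed-value object");
    }
    this.variable = variable;
    this.value = value;
  }

  // Convenience accessors save callers from casting constraint elements.
  public Variable getVariable() { return variable; }
  public ValueNode getValueNode() { return value; }
}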

Chang and Lee
At Bob's suggestion, I've been reading Chang and Lee's Symbolic Logic and Mechanical Theorem Proving. I'm only up to chapter 4, but it has proven to be an easy read so far.

Predicate logic is so applicable to our querying that I'm a little ashamed at not having read it already. Admittedly, so far the book has been about verifying the truth of a formula, rather than finding all "interpretations" (RDF statements) which fit it, but it is still talking about exactly the kind of things our queries need to do. SR was all over this stuff when the query layer was designed, but I realise now that I should have known it as well. Fortunately I was working on the datastore layer at that time, but I should still have learned about formal predicate logic when I first started working on optimizing queries. Oh well, better late than never.

Wednesday, August 11, 2004

Constraint Design
Started the rearrangements for constraints, but I've decided not to go "all the way" with them. This is basically because the poor design of constraints is too endemic to fix in only a few days.

The problem is clearly evident in ConstraintImpl. This class both extends the abstract class AbstractConstraintExpression and implements the Constraint interface. AbstractConstraintExpression implements (partially) the ConstraintExpression interface. Constraint extends the ConstraintExpression interface.

So the ConstraintImpl class is getting declarations of methods from one parent (Constraint), and implementations of those methods from another parent (AbstractConstraintExpression).

Constraints and ConstraintExpressions were once distinct, with Constraint being the class whose position ConstraintImpl now holds. Fortunately AN has already changed that, but it's a shame that it didn't end up as a parent of AbstractConstraintExpression, as that would have solved a few problems by creating a single inheritance path. It's a little problematic to do this, as the methods declared by Constraint and ConstraintExpression are unrelated, and some classes implement ConstraintExpression without implementing Constraint.

For the moment I've created a new ConstraintFactory class. This can be used by ConstraintExpressionBuilder instead of directly creating a new ConstraintImpl object. The point of this modification is to automatically create a different kind of constraint object when a magic predicate is used. By using different constraint object types, the constraint code can avoid the use of numerous "if" statements.
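
The factory itself is just a predicate test. Here's a sketch with placeholder classes standing in for ConstraintImpl and ConstraintIs, and a placeholder URI standing in for the shared kowari:is definition.

import java.net.URI;

// Hypothetical sketch only: choose the constraint class from the predicate at parse time.
public class ConstraintFactorySketch {

  static final URI KOWARI_IS = URI.create("urn:example:kowari-is"); // placeholder value

  static class PlainConstraint {
    final Object subject, predicate, object;
    PlainConstraint(Object s, Object p, Object o) { subject = s; predicate = p; object = o; }
  }

  static class IsConstraint extends PlainConstraint {
    // The predicate is implied by the type, so only subject and object are kept.
    IsConstraint(Object s, Object o) { super(s, KOWARI_IS, o); }
  }

  // Recognise magic predicates at parse time and build the matching constraint type.
  public static PlainConstraint newConstraint(Object subject, URI predicate, Object object) {
    if (KOWARI_IS.equals(predicate)) {
      return new IsConstraint(subject, object);
    }
    return new PlainConstraint(subject, predicate, object);
  }
}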

Magic Predicates
Magic predicates are described in AbstractDatabase. This class declares methods for getting the node numbers for each of the predicates, such as "KOWARI_IS". However, these do not describe the URI used for any of the predicates. Currently this is done in XADatabaseImpl, and would be duplicated in any "implementing" database.

XADatabaseImpl actually creates nodes in the datastore for each magic predicate. The generated node numbers appear to have no use except to recognize when a magic predicate is being used. This generation is done during the init() method, and it defines the string for the predicate inline. An example is:

 KOWARI_IS = initSession.preallocate(new URIReferenceImpl(new URI(Tucana.NAMESPACE + "is")));
This has the undesirable effect of tying the URI of the predicate directly to this particular implementation of the code. The only reason a query can be made which matches this predicate is because of user documentation. There is no connection at all between iTQL and these predicates. It's just asking for one of the predicates to start failing due to a typo when a change is made. To make things worse, these predicate URIs are duplicated in SubqueryAnswer.

As a first priority, I've started moving these URI definitions to a new class called SpecialPredicates. From here they can be accessed by any class which wants to refer to the URI of the predicate, making them consistent across the system. This can also be used by the iTQL parser to recognize when one of these predicates is used, so that an appropriate kind of constraint object can be created.
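
Something along these lines is all that's needed; the namespace value below is a placeholder rather than the real Tucana.NAMESPACE.

import java.net.URI;

// Minimal sketch of the idea: one class owns the predicate URIs, so the parser,
// the constraint classes and the database all refer to the same definitions.
public final class SpecialPredicates {

  private static final String NAMESPACE = "urn:example:tucana#";  // stand-in for Tucana.NAMESPACE

  public static final URI KOWARI_IS = URI.create(NAMESPACE + "is");
  // Further magic predicates (e.g. the cardinality ones) would be added here.

  private SpecialPredicates() { }  // constants only, no instances
}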

Once these new constraint objects are created, there should be no need for node numbers to be created. The only reason I could see to keep them might be to prevent a magic predicate from being explicitly added to the database. I should ask AN about this and see what he thinks.

Xcode
Well, I'm getting used to the new interface. It has its strong points, but also a lot of weaknesses.

As a Java editor, it seems to be somewhat lacking. For a start, the only awareness of code in the editor is a selection widget which lets you navigate to any method in the current class. While this is nice, there is no list of member variables, nor is there any code checking for declared members, methods, or any other context relevant information. More annoyingly, comments are not detected during paste operations, meaning that when I paste a header into a new file, it thinks it is seeing tags which need special indenting treatment.

Another feature I was looking for was a hierarchical class viewer. Xcode has one, but interfaces are kept separate from classes. This made it useless for tracking the genealogy of classes like ConstraintImpl, with its combination of interface and abstract class parents. I found it was faster and much easier to just look at the source code.

One useful feature I've found is CVS support, though it took me a little while to work out how to turn it on. It seems pretty good, but it appears to be a relatively simple wrapper around the "cvs" command.

I'm still discovering features, and I'm still struggling against my natural tendency to use vi commands, so I'll be giving it a bit longer yet. At this stage it looks like Xcode is really better for C and C++ development, and that I might want to consider something else for Java. KA has been using Eclipse, and I might give that a go shortly.

Luc
I wrote this a day late as Brisbane is having a public holiday for the Brisbane Exhibition, and I decided to have a restful night. I was happy to be home today, because I discovered Luc's first tooth has poked through. So all that lack of sleep has been worth it... I guess. Ask me again when he's two. :-)

Monday, August 09, 2004

Delays
I've been neglecting the blog over the last few days. Shouldn't bother anyone but me (it'll bother me when I come back later to figure out what I was doing on these days). The main reason for the lack of effort was my lack of progress to report on. The other reason was because Luc is on the verge of teething, and I'm suffering from a proportionate lack of sleep. :-)

Friday
Friday was spent trawling through code, and making few modifications. I did little bits and pieces, mostly fixing obvious bugs and documentation, but nothing substantial. The majority of the time was spent reading the ItqlInterpreter class and working out just what it is doing with Constraint objects. All too often I've found that I've started coding before I was really ready, and I'm trying not to this time.

Monday
Monday was spent reviewing the same code. Constraints are badly in need of refactoring. There are too many concrete implementation classes, which leads to some horrible workarounds for subclasses. It needs more interfaces, and the code which uses them needs to refer to these objects by interface references.

I also spent some time setting up Xcode. I'd been looking forward to using Xcode, but hadn't taken the time to set it up or work it out.

Xcode is fine for working with Java, but it doesn't really have full Ant integration. I followed a couple of pages which talk about using Ant, but really all they do is create a blank project, and then set the build command to call Ant instead. This didn't work all that well for Kowari, as it uses its own script called build.sh to launch Ant with a series of required parameters.

I resolved the problems by calling build.sh directly, and by modifying this script to suit my local system. I wouldn't normally want to modify a file in this way, but build.sh is very short, and mostly just manipulates environment variables. The first change was to tell it about my local JAVA_HOME when it's not already set. I do have it set in my bash environment, but it's not set globally where Quartz applications can pick it up. I've considered doing this with ANT_HOME as well, but since every other system is using the ant-1.6.1 jar in the lib directory of Kowari, I thought it might be safer to just go with that. Mac OSX comes with ant-1.5.3 as part of the J2EE packages, but I've since upgraded it to ant-1.6.2, which should be perfectly compatible with the Kowari version of Ant.

The other change was based on how the building script is called by Xcode. To do an ordinary build, the script is called with no arguments, and to do a "clean" then the script is called with a parameter of "clean". Since the ant script always needs a parameter I needed to test if none was passed to build.sh and if so then pass a parameter of "dist". While I could just change the default in the Ant build.xml file, the build.sh script is a much better place to change it. That's because this script almost never changes, while the build.xml file needs to be updated occasionally by myself and others. Keeping build.xml file untouched allows it to synchronize with CVS much better.

I then learned all about adding files into the project. I started by adding the files I needed manually, but these are so dispersed through the filesystem that I thought there must be a better way. I then found that by adding the directories out of "src" then it recursively picks up everything under it. Of course, if there'd been real Ant integration then it could have found all the files I needed, but this would be a huge amount of work on Apple's part.

I found that I couldn't go and include the "org" directories which are two layers down from src, as these are not merged. As a result I ended up with several "org" directories with no indication of their contents. Using the directories like "store" is a little annoying, as there are sometimes files in the same package under different directories, and also because I have to go down through directories such as "store/java" to find "org".

At least it seems to know about CVS, as the CVS directories were not imported. Since that is the case I should look to see if I can do updates and commits from the Xcode GUI.

I like the editing features I now have, particularly the persistent, named bookmarks. Finding files is pretty good too, once you get the hang of it. I'm sure that I'll learn more about the configuration later so I can make editing Kowari easier still. It will be a while before I'm proficient with Xcode though, as I'm still trying to unteach my fingers about vi.

Breakpoints, errors, and the like are unlikely to come naturally, as I think that Xcode is too far removed from the build system at the moment, but as I get more used to it I might see what kind of class path options I have, and whether or not I can tweak it into something more powerful than a fancy editor.

I've already downloaded Xcode 1.5, but I have not played with it yet. It seems to have better Ant support, but from what I understand this is so new projects can be built for Ant. I believe there is little support for importing existing Ant projects.

Thursday, August 05, 2004

Inferencing for OWL
This morning I had a discussion with AN about the next thing to move on to. Of course, it makes sense to fully integrate the inferencing code with the rest of the system, but AN pointed out that this should be done after the move to resolvers. I agree, as the selection of the source and destination models for the inferences will depend on how we talk to the models.

In the meantime I'm moving onto an iTQL construct that will allow us to select cardinality, which is an OWL requirement.

AN's document on Kowari/TKS inferencing describes cardinality as a "magic" predicate in the "tucana" namespace. This is similar to the <tucana:is> predicate used to constrain on equality. I can work with this, and so I went to find wherever tucana:is gets mentioned.

It's a little messy, with XADatabaseImpl and Subquery both defining the same predicate. I decided to ignore what subqueries do, as I needed the more general case. The string for this predicate was being put into a constant called KOWARI_IS, and I tracked down all instances of its use. This then led me to the getKowariIs() method, on the Database interface.

It was a simple matter to add the new cardinality predicates wherever I saw getKowariIs(). Then I needed to apply them when I found them.

I found that the application of these predicates was happening in ConstraintResolver. This class had been introduced to enable the use of the Visitor Pattern on constraint classes. This removed the need for lots of switch statements, as had been the case when everything used to be done in LocalQuery. Unfortunately, the magic predicates were still being done with if statements interrupting the normal flow of data. Any attempts to do the same with the cardinality predicates was going to get very messy.

Discussing the code with AN, we worked out that the current implementation of magic predicates is messy and unmaintainable in the query engine. This is because existing constraints have to behave differently if they see one of these predicates. By moving the code for handling magic predicates up into the language layer, then a different constraint object can be constructed altogether. This is more object oriented, and will definitely help with maintainability.

Unfortunately it won't be as quick to implement this as it would have been to put in a dodgy hack to get cardinality. It should be cleaner and easier though.

Wednesday, August 04, 2004

Proof Reading
Once again it's way too late to spend any further time on this to tidy it up. I still haven't proof read yesterday's blog! (addendum: I have now).

In the meantime, I hope that it's not too difficult to read.

N3 Once Again
The class path was resolved quickly this morning, allowing me to finish the N3 importing.

First of all, blank nodes were parsed correctly, but an exception occurred when creating a URIReference from them. This was because subject nodes were not being checked properly to see if they were blank. After that the load worked fine.

So then I started a series of tests. I took the Jena camera ontology and loaded it into a model called "camera". I wanted to look at blank nodes, and this RDF file contains several. Then I saved it out as N3 to a file named "camera.n3", and was able to manually confirm that everything looked fine.

Next, I created a new model and named it "c2". I used the new N3 parser to load camera.n3 into the c2 model. Then I exported it to disk as "c2.n3" and compared it to camera.n3. Unfortunately, they were identical.

At first glance it would be a good thing for these files to be the same, as they represent the same data. However, when blank nodes are loaded into a system, a fresh set of nodes must be allocated for them. The fact that the original nodes were being reused was a bug.

I mentioned the problem to AN, and this helped me to think about what the problem was. I realised that the caches which are used to map blank nodes must not be getting cleared out. A quick check confirmed this, and so I added calls to the clear methods on these caches within the startDocument method of the parser's event handler. This fixed it.
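
The change itself is tiny. A sketch (with made-up field names) of what the handler now does:

import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the fix: the cache mapping a document's blank node
// labels to allocated nodes is cleared whenever a new document starts, so a
// second load gets a fresh set of nodes.
public class N3EventHandlerSketch {

  private final Map<String, Long> blankNodeIdCache = new HashMap<String, Long>();

  public void startDocument() {
    blankNodeIdCache.clear();
  }
}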

After N3 importing and exporting was working I started downloading the entailment tests from the RDF test page. It was a little annoying downloading each in turn, and this prompted me to check further down the page to find all the tests in a zip file.

Inferencing Goals
While swimming tonight I gave some thought to Bob's suggestion that I write a couple of pages on a real world example of what I'd like to achieve. The trite answer is to base it on the sorts of things we would ultimately like to do at work. The general idea here is to be able to take disparate databases, and to infer new and interesting information.

Maybe we'd like to say that a person "A" who is considered a threat, is a known associate of person "B", who traveled on the same flight as person "C" who works for a weapons manufacturer. Frankly, for this kind of inference I'm really reaching, particularly as I don't even know exactly what security experts would really want to see. Of course, we'd like to be able to provide the ability to do inferences like this, but if I try to come up with inferences of this type, then I run a much greater risk of choosing something which isn't really useful.

To start with, I'd need to show that a particular ontology can be mapped into RDF. If not, then I need to pick a different system. I don't have to actually map such an ontology into RDF, but I need to show that in principle it can be. Once I am confident that the information I know is in RDF, then I need to determine if it would be possible to infer information that I didn't originally know.

So I started thinking of the kind of inferences that I've personally made in the past. My goal then would be to create an inferencing system which can infer the same things I was able to. The first thing to come to mind was the fact that the light of a supernova is preceded by a burst of neutrinos.

I was once describing the process of a supernova to a friend (in fact, it was AM, some years before he came to work at Tucana). I explained how a collapsing star has so much gravity that its inward rush is not stopped by electron degeneracy, but just continues on through it. Now if it continues to collapse through electron degeneracy, then the electrons simply can't continue to exist in their current form, due to the Pauli exclusion principle (electron degeneracy is the boundary of the Pauli exclusion principle where every possible electron state for the volume is being occupied). The way the star can continue collapsing is to merge its electrons with their corresponding protons (remember, stars are mostly hydrogen, which is just an electron-proton pair). Merging like this creates a neutron, and a neutrino. With all these neutrons being created, eventually neutron degeneracy is reached, and this is normally strong enough to make the whole collapsing star "bounce" back out. This is the explosion that we call a supernova. (If it kept going through neutron degeneracy then the star would become a black hole).

This process had been described to me in an astrophysics subject without the mention of neutrino generation. But as I described the electron-proton reaction part of the process to AM, I inferred the part about the emitted neutrino, as my knowledge of that reaction told me that neutrinos had to be generated. At this point I realised that during the collapse preceding a supernova, an enormous number of neutrinos must be created. The collapse takes place over several days (remember that stars are huge, so the speeds involved are still staggering), so neutrinos will be produced for the whole of this period. This reminded me of a media story I'd once heard about how the world's neutrino detectors would always show higher activity a few days before a supernova. Now I know why.

So the knowledge of the process of a supernova, the types of particles which make up a star, the rules governing the reactions of these particles, and the types of reactions which would occur during a supernova, all let me infer that a neutrino burst would precede a supernova by a few days. To a better physicist this may have been obvious, but for me it was a significant inference. If I can make a computer able to infer something like this, then I believe I'll have accomplished something significant. My immediate task is to convince myself that this is possible with RDF, with an inferencing system that is either OWL or an extension thereof.

Can I build the knowledge of the supernova process into RDF? I believe so.

I'd be describing a class called "Star" which is a subclass of "Stellar Body" (which is then a subclass of "Body"). It would have several properties, including size, which is the determining property for working out if it will go supernova (too small and it settles into a white dwarf, too large and it becomes a black hole). So the property of having a "supernova event" at some point in a star's existence can be inferred reasonably easily. The event itself has a sequence of sub events. These involve the collapse to electron degeneracy, the collapse to neutron degeneracy, and the bounce back. The collapse to neutron degeneracy has an associated electron-proton particle reaction. So I can trace this route of relationships down to stars of a certain size having a particle reaction as an event in their existence.

This then leads to an ontology for particle reactions. The electron-proton reaction has a neutrino produced. Linking the two ontologies shows that a star of a certain size will emit a neutrino at some point in its existence. In fact, going back to the size of the star, the number of electron-proton pairs can be inferred, and this can then lead to a number of neutrinos which can be emitted. If an ontology says that any emission over about 10^100 (I pulled that number out of the air here) can be called a "burst", then it will be possible to infer that the star with the right properties will emit a burst of neutrinos.

I need to break it down more than that, and then I need to work out what kind of operations would need to be done to allow this sort of inference, but intuitively I feel confident that it can be done.

Indexing Numbers
While building and testing Kowari today I started giving serious thought to the concept of our indexes as vectors in a space.

When Kowari started we had 3 indices. Our indices were actually orderings of statements, based on the first column, followed by the second, and finally the third. By having 3 indices we were able to define a constraint as 1, 2 or 3 nodes, and there would always be an index which would provide the resulting statements as a contiguous group. The start and end of this group can be found with a binary search, meaning that the data can always be found in O(log n) time. I've discussed this here in the past.

Moving on to 4 nodes (subject, predicate, object and meta) meant that we needed to use a minimum of 6 indexes to be able to quickly find a contiguous group of statements. But where did that number 6 come from? How many indexes do we need for a large number of columns? Does this represent some larger structure that we might be able to take advantage of? I've come up with some of the answers to these questions.

For a given system of N nodes in a single statement (N=4 for Kowari) then there will be a need to search when given a constraint of 1 node, 2 nodes, 3 nodes, and all the way up to N nodes (searching when you have a constraint of N nodes is the equivalent of doing an "exists" operation on a statement). It turns out that a search done with S nodes in the constraint will define a minimum number of indexes, which I'll call minS. Finding the maximum value of minS as S varies from 1 to N gives the minimum number of indexes required to be able to completely search the statement space. The indexes chosen to meet this figure will need to cover all the cases of other values of S.

Note that I have not proven that a set of indexes which meets the maximum minS will always be able to meet the index requirements for all other values of S. However, the selection of indexes to search on S constraints must always take into account the other values for S. It appears that this is always possible, and after gaining the experience of choosing a few sets of indexes it seems that this would indeed be provable, particularly when S is less than the S for the maximum minS.

The value for minS for statements of N nodes can be determined with some simple combinatorial algebra, and it comes down to:
  minS = N! / ((N-S)! * S!)
(where is MathML when you need it?)
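As a sanity check, this is just the binomial coefficient C(N, S): for N=4 and S=2 it gives 4!/(2! * 2!) = 6, matching the 6 indexes needed for quad statements, while N=3 gives 3, matching the original triple indexes.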

These maximum values are shown here for the first few values of N. The corresponding value for S is in brackets:

N    max minS   (S)
2        2      (1)
3        3      (2)
4        6      (2)
5       10      (3)
6       20      (3)
7       35      (4)
8       70      (4)
9      126      (5)
10     252      (5)
11     462      (6)
12     924      (6)

Hyperspace
Yesterday I discussed how a set of statements on N nodes forms an N-hypergraph.

Working with this supposition, one could assume that an index (of the type described already) forms a vector within an N-hyperspace. Perhaps category theory will show that this mapping into topology will let us determine something about our indices that we have not already seen.

Suppose that each index forms a unit vector, which is normal to a hyperplane. In 3-D (which is where normal RDF sits) these planes then form a cube. The three indices of:
  subject-predicate-object
  predicate-object-subject
  object-subject-predicate
form the normals of 3 sides of this cube. The opposite set of useful indices:
  object-predicate-subject
  predicate-subject-object
  subject-object-predicate
form the normals of the other 3 sides to the cube.

The dimensions here are the 3 labels (subject,predicate,object), and the index's vector is along the dimension of the first label. The curl of that vector corresponds to the remaining dimensions, and so subject-predicate-object is along the same line as subject-object-predicate, but in the negative direction. So the curl of a vector describes a path through the nodes of a statement. We can also see that the first three vectors are in the opposite direction to the second three vectors, and are functionally equivalent. This is why either set of indices can be used.

This is all fine for 3 dimensions, but does it hold for 4 dimensions and above? The answer seems to be yes.

In 3 dimensions we have 3 faces of a cube which define all our required indices. In 4 dimensions we have a hypercube, which has 24 faces. Since each face has 3 "opposite" faces, there are 4 functionally equivalent vectors normal to those faces. 24/4 = 6, which is the number of indices needed for statements of 4 nodes.

For 5 dimensions we no longer want to consider the number of faces on the hypercube. Going back to the maximum minS table, note that the S value for N=3 and N=4 was 2, but for N=5 then S=3. Remember that S corresponds to the number of nodes to constrain on, which gets mapped to the number of dimensions we are searching on. By moving up from 2 dimensions to 3 dimensions we are instead looking for the number of opposing 3D volumes defined by the hypercube, rather than the opposing 2D planes. In 5 dimensions the hypercube has 40 volumes. Dividing this by the number of opposing volumes (4) gives the 10 indexes required.

This pattern continues. For any hypercube of dimension N, take the S value for N from the table above, and use that to select the number of dimensions of the "sides" of the hypercube you are interested in, and find out how many of those "sides" the hypercube has. The required number of indexes shown above will divide evenly into that number, by a factor of a power of 2.

The number of sides of various dimensions can be easily determined by following the instructions here. The first few are shown below. The first column defines the vertices, the second is the edges, the third is the sides, the fourth is the volumes, the fifth is the hypervolumes, and so on:
N: vertices, edges, faces, volumes, ...
0: 1
1: 2 1
2: 4 4 1
3: 8 12 6 1
4: 16 32 24 8 1
5: 32 80 80 40 10 1
6: 64 192 240 160 60 12 1
7: 128 448 672 560 280 84 14 1
8: 256 1024 1792 1792 1120 448 112 16 1


Programs
In case you've made it this far: yes, the 2 tables above were generated with code that I wrote. Both are pretty trivial, but I might post them if I get some time.
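
In the same spirit, here are minimal sketches of the two calculations (not the original programs, just enough to reproduce the numbers). First, the maximum of minS for each N:

public class MaxMinS {

  // C(n, s) = n!/((n-s)!s!), computed with exact integer arithmetic.
  static long choose(int n, int s) {
    long result = 1;
    for (int i = 1; i <= s; i++) {
      result = result * (n - s + i) / i;
    }
    return result;
  }

  public static void main(String[] args) {
    for (int n = 2; n <= 12; n++) {
      long best = 0;
      int bestS = 0;
      for (int s = 1; s <= n; s++) {
        long value = choose(n, s);
        if (value >= best) {   // ties resolve to the larger S, as in the table
          best = value;
          bestS = s;
        }
      }
      System.out.println(n + " " + best + " (" + bestS + ")");
    }
  }
}

And the hypercube element counts:

public class HypercubeElements {

  static long choose(int n, int k) {
    long result = 1;
    for (int i = 1; i <= k; i++) {
      result = result * (n - k + i) / i;
    }
    return result;
  }

  public static void main(String[] args) {
    // An N-cube has C(N, k) * 2^(N-k) elements of dimension k.
    for (int n = 0; n <= 8; n++) {
      StringBuilder row = new StringBuilder(n + ":");
      for (int k = 0; k < n; k++) {
        row.append(' ').append(choose(n, k) * (1L << (n - k)));
      }
      row.append(" 1");   // the N-cube itself
      System.out.println(row);
    }
  }
}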

In the meantime I have to get to bed.

Tuesday, August 03, 2004

Caveat Emptor
I normally try and take the time to proof read what I write here. But this is a little lengthy, and it's now 12:45am. You'll just have to take the spelling and grammatical errors as they come until I get time to come back and fix them up.

Jars
The first order of the day was to establish my priorities again. While we know that better externalization will increase RMI speeds for AnswerPageImpl, TJ doesn't want to get bogged down in it for the moment. So I'm back onto the inferencing code. I'll probably try and do the externalization in my own time soon.

In the meantime I ran the full test suite for the current changes, and when they passed I checked everything in. That reminded me that I have a checkout that continues to fail the Lucene tests. I guess I should blow that checkout away.

Because RDFS tests are mostly defined with N3 files, I went back to trying to load N3 again. The first step was to rebuild everything with the latest checked out code. Well it turns out that there have been build changes again and now TKS isn't building with the latest Kowari core. The build is complaining that it can't find a method that appears to be there. I'm not going to get bogged down in a problem like this again when the code belongs to someone else, so I decided to do the work directly on Kowari.

Since this is all for inferencing, I thought I'd get the inferencing code working with the current checkout. This is because I need it working when I get off the N3 task, and because I thought I could try saving the inferred data in N3. Only this ended up being easier said than done. The latest Kowari updates have changed version numbers on the jar files, and this ruined all the class paths. Worse, the contents of the jars appear to be different. When I extracted all the inner jars from itql-1.0.5.jar it had a shorter list than the contents of itql-1.0.4.jar, and I can't get the inferencing code to run anymore. Just like the TKS code, I had no desire to go chasing down jars, so I decided to leave it until I need it running.

N3 Again
I finally ran a Kowari instance, loaded some data into it, saved it to N3, and tried to reload it. The first thing that happened was an exception which complained of bad tokens in the prolog of a file. A check confirmed that the file was being loaded with the RDF loader, which was definitely the wrong thing. It turned out that the RDFLoader.canParse method is erroneously returning true on N3 files. I should check out what this method does for its test.

Since my test for parsability of N3 is really only interested in the filename extension for the moment, I figured that it can't hurt to test for N3 before RDF/XML, and if it matches then attempt to use the N3 loader. I know it's a hack, and just writing about it now makes it seem worse. However, it works for the moment, and the correct loader is being run.

Unfortunately the Jena N3 parser wouldn't run initially because of missing Antlr classes. This seemed strange, as I'd explicitly added this jar into the path in the build.xml. A little checking showed that I'd only done it for TKS, so I ported it back over to Kowari. However, while I found many of the appropriate build targets, I had to ask AN before I could find the last one, which has since been moved out into an external XML file. After a fresh rebuild I discovered that the Antlr jar was still missing.

I'm now looking directly in the main jars, at the list of all the sub jars. The build code is definitely including Antlr, but the resulting jar files do not have it included. I'm stumped. I'm going to have to start running Ant with full verbose output, and maybe even run the jar program by hand. This sort of thing gets really annoying when you have coding you want to get back to. Yes, I'm overtired today, so I'm allowed to be irritable.

Masters
I had another meeting with Bob today. As usual, he had some great insights for me, and I have a couple of things I want to get accomplished in the next 2 weeks.

For a start, I need to sit down and come up with a concrete example of what I want to do, and why. "Scalable OWL Inferencing" doesn't cut it. I need to be able to say, "I have this data, and I want to be able to infer this other data, and OWL is what can provide it for me". Or maybe "OWL can't do it for me, but I need it in order to build the inferencing system that can do it for me". Does anyone have a suggestion of some useful, real world information that OWL can either derive or help me to derive? If not, then what's OWL good for? I have some ideas of my own, but I'd love some feedback.

The other thing that Bob pointed out is that I really need to look at a couple of texts, rather than just reading papers. The texts provide the foundation of knowledge, while the papers really only seek to expand on it. The first one I'm looking for is on First Order Predicate Calculus, by Chang and Lee. Apparently the field hasn't changed all that much since this book was first written. I've yet to confirm, but I think the book is "Symbolic Logic and Mechanical Theorem Proving".

Indexes and Hypergraphs
While thinking about the role of directed hypergraphs in modelling RDF, it occurred to me that I hadn't considered the "direction" enough. After all, it's a directed 3-hypergraph. The direction forms a loop around the triangle of subject, predicate and object, but I hadn't really considered a starting or ending point. That's when I realised that Kowari's indexes reflect this already, as they are quite symmetric, with no emphasis on any point defining a beginning or end:

  subject predicate object
  predicate object subject
  object subject predicate
I also noticed that if this loop were in the opposite direction then it would form the other set of 3 indexes which can be used for these statements:
  object predicate subject
  predicate subject object
  subject object predicate
Now each statement is a plane intersecting 3 nodes in the graph (in a 3 dimensional space). Each plane has 2 normals, each perpendicular to the plane but pointing in opposite directions. The direction of the loop that the statements form would correspond to the curl of those normals, with one normal defining the subject-predicate-object direction, and the other normal defining the object-subject-predicate direction.

I thought this was interesting, and so I speculated on whether it can be extended to higher dimensions. This is relevant, as Kowari moved from a 3-hypergraph to a 4-hypergraph some time ago. I had speculated that each normal of a 3 dimensional plane was corresponding to a direction around the nodes, and this in turn corresponds to a set of indices that can be used to do a search. I've been wondering how many "normals" there are to a 4-dimensional plane, and if they would each correspond to a set of indices that can be used to search the space. To do that I'd need to know how many normals there are to a 4 dimensional plane (a question I haven't checked the answer to yet) and how many sets of indices there are which can search "quad" statements instead of triples. We had found about a dozen 3 years ago in an ad hoc approach, but I needed to search the whole set of possibilities exhaustively so I knew I had them all correct.

Indexes Calculations
Thanks to AM for helping me on the following.

There are 24 possible indices on the 4 elements of a quad (that's 4!. Note that I'm saying "4 factorial" and not just adding emphasis). It doesn't take long to show that to be able to match any individual node will take a minimum of 4 indices, and a pair of nodes will need a minimum of 6 indices (these 6 subsume the first 4). Each of the indices must be selected from one of 6 groups of 4 indices. To allow searching on 3 nodes restricts the choice from these groups, as certain combinations are mutually exclusive. Searching on 4 nodes need not be considered, as any index will do this for you.

Searching for 2 nodes in quad statements proves that 6 indices are required (no I'm not going to put a proof here - it's after midnight). So we know we need at least 6 indices. It's actually pretty easy to select a set (and we did), but for this exercise I'm interested in finding out how many sets are valid.

If we consider all possible combinations of 6 indices from the 24 possible we get (24 * 23 * 22 * 21 * 20 * 19), or 24!/18! = 96909120. However, this includes a lot of index combinations which won't let you search for certain node combinations. It took a little while, but we eventually arrived at the final figure. The number of sets of indexes which can be used to do any kind of search on a 4-hypergraph is 3520. This was a LOT larger than I'd expected, but we went over it several times, and I'm now confident that it's correct.

At this point I decided that each set does NOT correspond to a normal to the hyperplane. I still think that some sets will correlate with 4 dimensional loops through the points, and perhaps there is a normal to describe this loop, but I no longer think that every possible set will correspond to a loop. Many of the index sets are almost identical, with only two of the indices swapping the order of their least significant nodes. They would seem to be parallel in their most significant dimensions, and hence demonstrate a lot of redundancy.

AM wants to see if we can extend this calculation to n nodes. After what we had to consider to find this number for just 4 nodes (a fixed number of nodes, for a start), that would be a major undertaking. In the meantime I'm thinking I should write up what we did to find this value. I'll try and make a start on that on Saturday.

Monday, August 02, 2004

Weekends
I thought that I might take some time over the weekend to do some more work on either the N3 loader or moving Answer pages over RMI. Between family, DVDs and fitness it never happened. Oh well, I guess we're supposed to clear our minds on the weekend. At least I have a new first aid certificate from St John's to show for it.

Paging All Answers
I went back to the externalization of AnswerPageImpl today. I planned to spend the whole day making all of the components of an answer externalizable. That's externalizable, as opposed to serializable, in order to avoid the metadata overhead. This is possible because there are no real parent classes to consider for any of the components of an answer.

In many cases it's possible to represent a component of an answer as a string and a datatype. Datatypes include URIReference, BlankNode and Literal (although Literals may require their own datatype). Representing components in this way has very little overhead, and should reduce the significant time we seem to be spending on the default serialization employed by RMI. At this stage I'm thinking of using the default serialization of a java.lang.String as there seems to be a lot of work in the JVM to handle this class, and I'm assuming that RMI is probably pretty good with strings as well. Besides that, if I tried to do it myself with a char array and a length, then I'd probably make some erroneous assumption about Unicode for non-latin alphabets.

In the first instance I pulled apart the AnswerPageImpl at a higher level. Other than a few integers and booleans, the entire dataset is a java.util.ArrayList of Object[], with each array being of fixed length. This was easy enough to encode as a pair of integer dimensions, followed by all the elements as an unstructured list. This didn't cover the expense of serializing each of those elements, but it did take two seconds off a 73 second transfer. Given that this only addresses the small amount of data in the wrapping objects, and does not consider the 3000 elements in each page of the answer, this modest improvement shows that I'm certainly on the right track.
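
In outline, the externalization of the wrapper looks like this. It's a simplified sketch with invented field names, not the actual AnswerPageImpl code.

import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of the idea: the page is written as two dimensions followed
// by a flat run of elements, instead of letting default serialization describe
// the ArrayList and every Object[] in it.
public class PageSketch implements Externalizable {

  private List<Object[]> rows = new ArrayList<Object[]>();
  private int columns;

  public PageSketch() { }   // required by Externalizable

  public void writeExternal(ObjectOutput out) throws IOException {
    out.writeInt(rows.size());
    out.writeInt(columns);
    for (Object[] row : rows) {
      for (int i = 0; i < columns; i++) {
        out.writeObject(row[i]);   // still default serialization per element, for now
      }
    }
  }

  public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException {
    int rowCount = in.readInt();
    columns = in.readInt();
    rows = new ArrayList<Object[]>(rowCount);   // the fields aren't restored for us
    for (int r = 0; r < rowCount; r++) {
      Object[] row = new Object[columns];
      for (int i = 0; i < columns; i++) {
        row[i] = in.readObject();
      }
      rows.add(row);
    }
  }
}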

However, testing this small change ended up being more difficult than expected. Because the tests involved a transfer between two computers, I've been using TJ's computer for a lot of the work. This is partly because TJ has been with me quite a bit during Friday and today to help out and see how I've been progressing. I checked for differences in the source code to CVS this morning, but I didn't realise that TJ decided to update the source code shortly afterwards. In the process it pulled down a modification that AN made to an RMI class, which resulted in a need for a new serialization ID. The consequence was a class that wasn't being rebuilt, but which needed to see the new ID, and that led to lots of RMI failures.

Since I had been making changes to the serialization of objects which are sent over RMI, I naturally assumed that the bugs were my own. I spent an inordinate period of time adding logging to my code, and seeing it all fail without a single log message. Finally TJ asked how I was going, and when I told him of my consternation he explained that he'd updated the source. After removing all compiled classes that were dependent on AN's new modifications, a rebuild got it all working again.

Well, it worked right up to the NullPointerException that I caused! It was that old gotcha that I keep forgetting... readExternal (and readObject) does not run the default constructor (yes, I know that's a C++ism). Hence, the ArrayList was not instantiated yet. Easy to find, easy to fix.

Once it was going it was easy to see that we'd gained a little in speed, and that I was on the right track. It is frustrating when these things take so much longer than they should though.

Compression Revisited
While restructuring the externalization code, the readExternal and writeExternal methods started getting unwieldy, so I started pulling sections out and into their own methods. While showing some of these to TJ, and describing parts of the compression/decompression code that he hadn't seen, I realised that I'd missed an obvious optimization.

As I mentioned in my last post, compression was slightly faster when it was performed on an entire data buffer at once, rather than streaming on an ObjectStream. On reflection, this made sense. Correspondingly, it made sense to decompress the entire buffer at once. This needed a length prepended to the byte stream, but otherwise didn't need too many changes.

The original code to get an ObjectStream was:

byte[] byteArray = (byte[])in.readObject();
InputStream byteStream = new ByteArrayInputStream(byteArray);
ObjectInputStream data = new ObjectInputStream(new InflaterInputStream(byteStream));
My first attempt to decompress the data in one hit appeared like this:
int uncompressedSize = in.readInt();
byte[] byteArray = (byte[])in.readObject();
InputStream byteStream = new ByteArrayInputStream(byteArray);
byteArray = new byte[uncompressedSize];
(new InflaterInputStream(byteStream)).read(byteArray, 0, byteArray.length);
byteStream = new ByteArrayInputStream(byteArray);
ObjectInputStream data = new ObjectInputStream(byteStream);
This code threw an exception which described an unexpected termination of the data stream. Logging what was going on showed that the read method was only returning 611 bytes, when I expected 55620. Since I knew that the buffer size of the InflaterInputStream is 512 (I'm assuming this number is in bytes, but there's nothing to confirm this) it appeared that the read method is restricted to the current buffer page. So it was easy enough to overcome by looping on the read method until all the data was read.
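
The loop ended up looking something like this (a sketch of the workaround rather than the exact code):

int uncompressedSize = in.readInt();
byte[] compressed = (byte[])in.readObject();
InputStream byteStream = new ByteArrayInputStream(compressed);
InputStream inflater = new InflaterInputStream(byteStream);

byte[] byteArray = new byte[uncompressedSize];
int offset = 0;
// read() only returns up to one internal buffer's worth of data at a time,
// so keep calling it until the full uncompressed length has been filled in.
while (offset < uncompressedSize) {
  int read = inflater.read(byteArray, offset, uncompressedSize - offset);
  if (read < 0) {
    throw new EOFException("Compressed stream ended early");
  }
  offset += read;
}
ObjectInputStream data = new ObjectInputStream(new ByteArrayInputStream(byteArray));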

The result of this was to knock 14 seconds off a compressed transfer that had taken 88 seconds with the previous implementation of compression. That brought it to within a few seconds of an uncompressed transfer.

Buffer Sizing
Since it had become apparent that InflaterInputStream.read() needed to be called numerous times because of the buffer size in the InflaterInputStream, it seemed reasonable to revisit the issue of buffer size.

I initially thought that using the same buffer size on both ends would be appropriate, but it quickly became apparent that this was not the case. Increasing the compression buffer size had a very small effect, but only when the page size was increased significantly, to about 8kB. I also tried up to 16kB, but this was just a little slower, so I stayed at 8. The result of this was about a half second improvement, which was almost lost in the noise.

The decompression buffer size had a much more significant effect. Increasing from 512 bytes to 1024 took 2 seconds off straight away. Going up to 8kB or higher started to slow it down again. In the end I settled on 4kB. This may not be exactly the perfect size, but it provided the greatest improvement for the tests which I'd tried. Different shapes of data may result in slightly different buffer sizes being the most appropriate, so I saw little point in doing a binary search to look for the "perfect" buffer size.

The final result is a compressed transfer which takes exactly the same length of time on our network as an uncompressed transfer. On slower or congested networks this should offer a good alternative. Hopefully I'll get even better speed improvements tomorrow with properly externalized answer elements.