Monday, September 27, 2004

Appending Tuples
I started the day by looking at Friday's problems with the Type pseudo-model. Rather than using a memory profiler, it occurred to me that I might see the JXUnit tests work if I moved all the other tests aside. If things started to work at that point, then I could add the other tests back in, one at a time, until things broke again. The idea was to keep things a little more controlled, and hopefully avoid those OutOfMemoryErrors.

It was a tactic that paid off. With just the node-type tests running, everything worked as expected. The first tests that I added back in were the "advanced_queries" tests, and these immediately caused incorrect data to be returned. Armed with this I was able to duplicate the problem within an interactive iTQL session.

After a little while I narrowed down the problem, and found that it was the presence of a second model which caused the query to fail. Selecting all the literals from the string pool showed that the two literals which were supposed to be returned were now separated by a large list of other literals. This made me reconsider how I was selecting the literals from the string pool.

The literals from the string pool come out in three lumps: strings, doubles, and dates. The next version of the string pool will have many more types, but for at least the next week we are still restricted to these three. Each type is returned as a GNodeList, and gets wrapped in a thin wrapper class which provides a Tuples interface. In order to return the three tuples as a single Tuples object, they have to be appended.
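The wrapper idea is simple enough to sketch. Both interfaces below are simplified stand-ins, not the real GNodeList or Tuples:

class GNodeListWrapperSketch {
  // Simplified stand-in for the real GNodeList interface.
  interface GNodeList { int size(); long get(int index); }

  // Minimal Tuples-style cursor over a node list: one column, one row per node.
  static class NodeTuples {
    private final GNodeList nodes;
    private int next = 0;
    NodeTuples(GNodeList nodes) { this.nodes = nodes; }
    boolean hasNext() { return next < nodes.size(); }
    long[] next() { return new long[] { nodes.get(next++) }; }
  }
}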

At this point DM was able to advise me that the TuplesOperations.append() method which I was using was supposed to take a list of ordered tuples as arguments. While the tuples from the string pool are ordered numerically, alphabetically, and by date, none of these orderings is the one required by the append() method. This method creates an object which holds its list of constituent tuples, and shuffles forward through them according to which one has the next consecutive item to be returned. If any of these tuples is out of order, then its next item will never be considered the "next" item to be returned by the whole group. Hence, the list of tuples was being cut short.
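To make the failure mode concrete, here's a minimal sketch of this kind of merging append in plain Java. The names and interfaces are illustrative stand-ins, not the real Kowari classes:

import java.util.*;

// Sketch of a merging ("ordered") append: every input must share one sort
// order, and the merge repeatedly emits the smallest available head row.
// This eager version would emit rows from a mis-sorted input out of sequence;
// the lazy cursor-based version described above skips past them entirely,
// which is exactly the truncation I was seeing.
class OrderedAppendSketch {
  static List<long[]> append(List<List<long[]>> inputs, Comparator<long[]> order) {
    List<Iterator<long[]>> its = new ArrayList<>();
    for (List<long[]> input : inputs) its.add(input.iterator());

    // Heap of {current row, source index}, keyed on the row.
    PriorityQueue<Object[]> heads =
        new PriorityQueue<>((a, b) -> order.compare((long[]) a[0], (long[]) b[0]));
    for (int i = 0; i < its.size(); i++) {
      if (its.get(i).hasNext()) heads.add(new Object[] { its.get(i).next(), i });
    }

    List<long[]> result = new ArrayList<>();
    while (!heads.isEmpty()) {
      Object[] head = heads.poll();
      result.add((long[]) head[0]);
      int src = (Integer) head[1];
      if (its.get(src).hasNext()) heads.add(new Object[] { its.get(src).next(), src });
    }
    return result;
  }
}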

The package holding TuplesOperations also holds an internally used class called UnorderedAppend which can do an append using unordered tuples. Unlike the ordered case, this class does not hold references to the constituent tuples. Instead, it creates a new tuples object and populates it with the contents of its arguments. DM suggested that I change the scope of this class to public so it could be used externally. With a lot of this code changing soon anyway, SR had no objections to this change.
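The unordered case is much simpler to sketch, since it just copies everything. (The real UnorderedAppend materialises its arguments into a new tuples object which can then be sorted; this shows only the copying step.)

import java.util.*;

// Sketch of an unordered append: no sort invariant is required, because
// every row from every argument is simply copied into a new container.
class UnorderedAppendSketch {
  static List<long[]> append(List<List<long[]>> inputs) {
    List<long[]> result = new ArrayList<>();
    for (List<long[]> input : inputs) result.addAll(input);
    return result;
  }
}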

The result was a correct set of answers. Before checking everything in to CVS, I started running through a more thorough set of tests to confirm that everything was working well.

Failed Tests
Initially everything appeared to be working, and I started looking at removing some of the hacks I'd put in to help me debug the problem. One of these hacks had been a step in the node-type tests which explicitly dropped every other model being used by the full suite of JXUnit tests. The idea of this step was to help clean out the list of literals, reducing it to just the literals found in the data needed for the test.

Once I allowed other data to stay in the database during the node-type tests, the tests suddenly failed with the following output:

ItqlInterpreter error - org.kowari.query.QueryException: Couldn't resolve local query

Caused by: (QueryException) Couldn't resolve local query
Caused by: (TuplesException) BlockCacheFile.reset(prefix) untested.

Grepping for this error string led me to the BlockCacheLine class (the name "BlockCacheFile" seems to be something left over from an earlier version of the file). The lines in question were something like:
if (prefix == null) {
  ...
} else {
  nextTuple = -1; //!! FIXME
  throw new TuplesException("BlockCacheFile.reset(prefix) untested.");
}

Now code like this is a real worry! Evidently I had enabled an unimplemented code path which had never been exercised before. The code was written by AM, who is away on holidays this week, so if I want to get this method working correctly I'll have to work out the inner workings of HybridTuples for myself. (BlockCacheLine is an inner component of HybridTuples.)

TJ is very keen for me to help AN with TQL for OWL, so with time being an issue I decided to try something else and avoid this code path altogether. My first thought was to use the TuplesOperations.append() method which I had incorrectly been using at the start of all this. This required a sort() to be applied to each tuples object before appending. The consequence would be a whole new sorted tuples object being created for each block of nodes from the string pool, with these new objects then wrapped in the appending tuples class. The original tuples would then get closed. Not quite as efficient as letting all the sorting get done in an UnorderedAppend object, but still acceptable.
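In outline, the change looked something like the following sketch, reusing the merging append from earlier. The real code goes through TuplesOperations.sort() and append(), whose exact signatures I'm glossing over here:

import java.util.*;

// Sketch of sort-then-append: sort a copy of each block first, then hand the
// sorted copies to the merging append. Outline only, not the real Kowari code.
class SortThenAppendSketch {
  static List<long[]> sortThenAppend(List<List<long[]>> blocks, Comparator<long[]> order) {
    List<List<long[]>> sorted = new ArrayList<>();
    for (List<long[]> block : blocks) {
      List<long[]> copy = new ArrayList<>(block);  // a whole new tuples per block
      copy.sort(order);                            // the sort the real code applies
      sorted.add(copy);                            // the original would now be closed
    }
    return OrderedAppendSketch.append(sorted, order);  // from the earlier sketch
  }
}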

Implementing this was straightforward. It didn't help.

Upon reflection, I realised that the problem was exactly the same. Using the UnorderedAppend code had created a new HybridTuples object to hold the sorted tuples. Using the ordered version of append() had not created a new HybridTuples object, but each of the tuples it wrapped was a HybridTuples, created in order to sort the original tuples.

Joining
Since the missing code is only reached when the BlockCacheLine.reset() method is called with a non-null prefix parameter, TJ suggested I find out why this parameter was being passed in, and whether it could be avoided.

Tracking this was remarkably easy. The BlockCacheLine.reset() method is only called from HybridTuples.beforeFirst(), which has its own prefix parameter. This prefix parameter is passed straight through to the BlockCacheLine.reset() method. I confirmed with SR that it is the constraint resolver's join() method which calls HybridTuples.beforeFirst() with the prefix. He also explained that this is fundamental to how join operations are performed.
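A toy nested-loop join shows the role the prefix plays. This isn't the resolver's actual code, just an illustration of why a seekable beforeFirst() matters:

import java.util.*;

// Toy join: for each row on the left, reposition the right-hand tuples at
// the rows whose leading column matches, using beforeFirst(prefix).
// Without a working reset-with-prefix underneath, this seek cannot happen.
class JoinSketch {
  interface SeekableRows {
    void beforeFirst(long[] prefix);  // position before the first row matching prefix
    long[] next();                    // next matching row, or null when exhausted
  }

  static List<long[]> join(List<long[]> left, SeekableRows right) {
    List<long[]> result = new ArrayList<>();
    for (long[] l : left) {
      right.beforeFirst(new long[] { l[0] });        // seek on the shared column
      for (long[] r; (r = right.next()) != null; ) {
        result.add(new long[] { l[0], l[1], r[1] }); // shared column plus both remainders
      }
    }
    return result;
  }
}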

In other words, BlockCacheLine.reset() must be able to accept the prefix parameter. So first thing in the morning I'll be trying to dissect the HybridTuples code to work out how to do this. Where is AM when you need him?

HybridTuples
When tuples need to be resorted, it is necessary to create a new object which can store the newly sorted data. To ensure that the tuples can be of any size, it is necessary to allow this object to be backed by a file, otherwise a query could run out of memory. However, a file-backed object will be quite slow for small data sets, which is just as unacceptable as running out of memory.

The purpose of HybridTuples is to provide a fast and scalable Tuples object. When the data set is small, the entire object is held in memory. When the data set gets to a particular size, HybridTuples creates a backing file in which to store its data.
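A toy version of that switching logic might look like the following. The real class also has to manage its block cache (the BlockCacheLine mentioned earlier) and sorting, so this shows only the spill decision:

import java.io.*;
import java.util.*;

// Minimal sketch of the hybrid idea (not the real HybridTuples): buffer rows
// in memory, and spill everything to a temporary file once a threshold is hit.
class HybridBufferSketch {
  private static final int THRESHOLD = 10_000;   // illustrative row limit
  private final List<long[]> memory = new ArrayList<>();
  private DataOutputStream disk;                 // null until we spill

  void add(long[] row) throws IOException {
    if (disk == null && memory.size() >= THRESHOLD) {
      File backing = File.createTempFile("tuples", ".dat");
      disk = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(backing)));
      for (long[] r : memory) write(r);          // migrate the in-memory rows
      memory.clear();
    }
    if (disk != null) write(row); else memory.add(row);
  }

  private void write(long[] row) throws IOException {
    for (long v : row) disk.writeLong(v);
  }
}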

In the past Kowari had an in-memory Tuples implementation separate from the disk-based FileTuples object. This worked, but required the code instantiating the tuples to know when to switch from using one to the other. HybridTuples, on the other hand, is supposed to do this automatically. It normally performs this well, but it seems that one piece of code required for join operations has not been implemented.

SR, DM, and I were all perplexed as to how this code path could have gone unexercised in the past. After some thought, SR appeared to have the answer. The optimiser works hard to make sure that the tuples resulting from each constraint are in the natural order found in one of the index files. This is almost always possible. If it is absolutely required that one of the tuples be resorted in order to perform a join, the optimiser tries to ensure that this happens to the smallest possible tuples.

Since joins on constraints usually serve to reduce the size of the data, and it is usually this joined data that the optimiser is forced to re-order, the HybridTuples objects that the system re-sorts for joins have always been small enough to avoid being disk based. Perhaps certain results have been large enough to go over the size threshold for HybridTuples, but they have not been joined against, otherwise we would have seen this error before now.

It's probably fortunate that the pseudo-model code has triggered this bug, as a sufficiently complex query on a sufficiently large data set would eventually have triggered it anyway.

Camera.owl
As the basis for some of the testing I've done with OWL, I've often used the Camera ontology example supplied with Jena. This file contains only two literals. The first is a string holding a comment on the ontology, and the second is a cardinality constraint of 0.

While debugging the code which appended the contents of the different parts of the string pool, I expected to see a string and a double being appended into a single tuples object. Instead, I found that the string pool contained two strings: the number 0 had been erroneously encoded as a string.

When I looked in the camera.owl file I found this:
<owl:cardinality>0</owl:cardinality>
I normally prefer working with N3, so AN was kind enough to point me to a file (which I'd written!) to get the correct syntax:
<owl:cardinality rdf:datatype="&xsd;double">0</owl:cardinality>
Of course, the correct data type is xsd:nonNegativeInteger, but we won't be supporting that for a week or so. I also needed to declare the xsd entity.
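For reference, the declaration is just the standard XML entity mechanism at the top of the file, along the lines of:

<!DOCTYPE rdf:RDF [
  <!ENTITY xsd "http://www.w3.org/2001/XMLSchema#">
]>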

This fixed the type problem for the data. Hopefully the original camera.owl in Jena has been, or will be, replaced with an updated example which has the correct data type in it. I didn't feel like downloading the whole Jena package to check for a simple update on a single example file. At least, not today.

Graph Data Model
The last time I saw Bob, I said something that made him think of a book he owned, called "Graph Data Model and Its Data Language" by Hideko S. Kunii. He'd met the author some years ago at a conference, and she recommended the book to him. He thought there was something familiar in my description of Kowari, and suggested that I try reading it.

I picked the book up at the UQ library yesterday, and made a start on it last night. The first surprise was that the cover has pictures of several constellations on it. This was the foundation of the company name and original logo of Tucana Technologies. Then as I started reading, I realised that the book describes a very similar system to Kowari.

The book describes a database called the Graph Data Model (GDM), and a data access language called the Graph Data Language (GDL). Kunii describes GDM as:

The data structure of GDM is a labeled directed graph. Records are represented as nodes characterized by attributes and holding values. Links, which are interpreted as binary relationships between records, are represented by directed arcs. Links serve as logical access paths. GDL is based upon algebraic operations on the graphs.

The principal difference between Kowari and GDM/GDL is that the core of Kowari is completely symmetric about the three components of a subject-predicate-object tuple, while GDM considers predicates to be different to the nodes they connect. This distinction is familiar from RDF, and demonstrates some of the principles of indexing that I've seen in other systems. However, while the philosophy may have been different, the eventual implementation may not have been all that different. From the chapter on implementation:

A link occurrence file contains pairs of URI-tuples. We can choose either a B-tree, B*-tree or indexed sequential file structure for a link occurrence file. Since the data structure of the link occurrence file does not differ from one link type to another, we can store multiple link types in a single physical file.

In a sense, this sounds very much like the predicate-object-subject index in Kowari. This index is a balanced tree, grouped into sections based on predicate. Each section contains a set of object-subject pairs.
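The access pattern is easy to picture with an ordered set and prefix range scans. This is purely illustrative and says nothing about Kowari's actual on-disk structures:

import java.util.*;

// Illustrative predicate-object-subject index: an ordered set of [p,o,s]
// triples, where a predicate "section" is just a prefix range scan.
class PosIndexSketch {
  private final NavigableSet<long[]> pos = new TreeSet<>(
      Comparator.<long[]>comparingLong(t -> t[0])
                .thenComparingLong(t -> t[1])
                .thenComparingLong(t -> t[2]));

  void add(long s, long p, long o) { pos.add(new long[] { p, o, s }); }

  // All (object, subject) pairs for one predicate: the section for p.
  NavigableSet<long[]> section(long p) {
    return pos.subSet(new long[] { p, Long.MIN_VALUE, Long.MIN_VALUE }, true,
                      new long[] { p, Long.MAX_VALUE, Long.MAX_VALUE }, true);
  }
}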

Of course, the implementation of much of Kowari is different to GDM. Kowari also has a meta node for each tuple. More importantly, Kowari is symmetrically indexed about all 4 of its nodes, with 6 index files providing every possible direction of indexing. However, some of the mathematical foundations are identical, and definitely worth pursuing. In particular, I'm interested in the scope and structure of GDL. It appears that Kunii has done a thorough analysis of the requirements of a data access language of this type, and I'm interested to see if there are operations missing from TQL.

As I read more, I started recognising elements from our own recent work in TQL. Early on in the description of GDL, Kunii describes "Link Operators":

The Graph Data Model provides extensive link traversal operations. Because of this facility, the model can express many types of relationship between record types. There are basically two types of link operators: numerical link operators, and transitive link operators. Existential link operators and universal link operators are defined as variants of numerical link operators.

It was only last week that I started to realise that the universal and existential operators can be realised with numerical operators. Seeing the things I've only just started to understand laid out as a natural consequence of this kind of data structure only serves to remind me that there is a lot of theory that I should have already read. I wonder how many people in the DAWG have read up on this kind of work?
