Wednesday, May 12, 2004

I love the little boy, but last night I got about 2 hours of sleep. I'll probably come back to this tomorrow and wonder what state of mind I was in when I wrote it. :-)

I know there will be spelling and grammar problems here tonight. I'll come back and fix them later.

iTQL Transitive Queries
Transitive constraints didn't quite go as planned today. I've successfully extracted the triples used in a transitive constraint, but where I thought I could simply extract the data I need and pass it down to the query layer, I ran into some problems.

The grammar defines the contents of a trans(...) element as a triple, optionally followed by a second triple. Note that this specifically avoids the Constraint defined in the grammar, because a Constraint allows an optional model to be specified, which we don't want here. So all I need to do is take the elements found in the triple and store them in a class that implements the ConstraintExpression interface, so it can be passed down to the query layer.

I've created the appropriate constraint class, called TransitiveConstraint, modelled on Constraint (both implement ConstraintExpression). Constraint takes 4 ConstraintElements (the three parts of a triple, plus a model) in its constructor. TransitiveConstraint needs the compulsory triple which describes the transitive predicate, and the optional triple which describes the anchor. So the plan was to use the 3 parts of a triple in the same manner that Constraint uses its four parts, and to provide the anchor as a complete Constraint object.
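The class relationships I have in mind look roughly like this. This is only a sketch using the names mentioned above; the real Kowari interfaces differ in detail:

```java
// Rough sketch only; the real interfaces carry more than this.
interface ConstraintExpression {}

interface ConstraintElement {}

// A normal constraint: three parts of a triple, plus a model.
class Constraint implements ConstraintExpression {
    final ConstraintElement subject, predicate, object, model;
    Constraint(ConstraintElement s, ConstraintElement p,
               ConstraintElement o, ConstraintElement m) {
        subject = s; predicate = p; object = o; model = m;
    }
}

// The transitive constraint: the compulsory triple describing the
// transitive predicate, plus the optional anchor as a Constraint.
class TransitiveConstraint implements ConstraintExpression {
    final ConstraintElement subject, predicate, object;
    final Constraint anchor;  // null when no anchor was given
    TransitiveConstraint(ConstraintElement s, ConstraintElement p,
                         ConstraintElement o, Constraint anchor) {
        subject = s; predicate = p; object = o; this.anchor = anchor;
    }
}
```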

The SableCC classes provide triples in an implementation class called ATriple. This contains all the information needed for the TransitiveConstraint. Unfortunately, neither the triple nor its elements can be passed down to the query layer, as that layer is not supposed to know anything about the SableCC grammar classes, so the elements of the ATriple class have to be converted to a usable form. This is done with an existing function called toNode(), which converts a PTripleElement to an org.jrdf.graph.Node, which is exactly what is required. After conversion, all the nodes can be passed to TransitiveConstraint and we can get on with the query code.

This is where I came unstuck. When the toNode method encounters a variable element it converts it to a BlankNode. A temporary HashMap is used to map variable names to BlankNodes so they will be re-used when the same name is encountered, but this mapping is lost as soon as the complete conversion is done, and there is no way to keep it around.
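The behaviour can be illustrated like this. This is not the real toNode() code, just a demonstration of the problem: the map hands back the same BlankNode for a repeated variable name, but the name is never stored on the node, so it is lost once the map is discarded:

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only, not the real toNode() code.
class BlankNode {}

class VariableConverter {
    // Temporary mapping from variable names to blank nodes; this
    // map does not survive the conversion, and the names go with it.
    private final Map<String, BlankNode> blankNodeMap = new HashMap<>();

    BlankNode toNode(String variableName) {
        // Re-use the node for a name seen before, else make a new one.
        return blankNodeMap.computeIfAbsent(variableName, n -> new BlankNode());
    }
}
```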

I don't want a BlankNode when I convert a variable to an org.jrdf.graph.Node. Even if I try to use one, I lose the variable's name, which makes it useless to me. Normal Constraint objects must get around this somehow, but I couldn't see how. After some time I asked SR and AN what they knew about it. SR didn't know, and after I'd spent half an hour with him, AN suddenly realised that I was looking at a problem he is currently working on himself. He assured me that it won't be long before he has a solution for keeping all the information about variables around.

AN commented that he found it strange that both ML and I would run into exactly the same issue within a day of each other, because ML and I are doing work that is completely unrelated. To date I haven't had occasion to discuss a single thing with him. That might be about to change!

I decided to move my efforts into the query layer until AN has finished what he's doing. If I finish there first, then I'll change the iTQL grammar to accept a full constraint inside trans(...), and use the code in ItqlInterpreter to enforce the extra conditions. After considering this for a while I realised that it has the advantage of letting me use the normal constraint code once I get down into the query layer, so it's starting to seem like a good idea. I can't easily do that while using a triple instead of a constraint, as the constraint objects are specifically written to be built from constraint grammar objects, not triple grammar objects. The one thing I don't like about it is that the grammar file won't be a true representation of the real grammar.

Maybe it will look clearer in the morning after a decent night's sleep.

Triplestore Disk Data Structure
DM and I discussed another option for storing triples on disk. It's worthwhile writing our options down so I have something to refer back to. I mentioned the first pattern last Saturday (May 08, 2004), but I'll try and flesh it out a little.

Both patterns are based on hash tables. We've done disk-based hash tables before, so we already have code for that. The hashing functions may take a little work, but they shouldn't be too hard. We'll try to come up with something reasonable, and then keep metrics on how many collisions we're getting, so it should be possible to test a variety of functions to find the best performance.

First Design
The first design uses 7 hashmaps as indexes. These maps are completely disk based, and when small enough they would definitely be candidates for memory mapping. When they get too full they are increased in size and rehashed. This is where C has an advantage over Java: an existing memory map of a file can be unmapped and mapped again to the larger file, but Java doesn't allow this to be done explicitly. The best we have been able to come up with is to close the file, set the reference to the MappedByteBuffer to null, and then loop on the garbage collector until it picks up the buffer and unmaps it. We can tell when this has happened if we keep a WeakReference to the buffer and test it after each garbage collection. This all works, but it's painful, and it has major performance implications, particularly under Windows.
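The workaround looks something like this. This is my sketch of the pattern rather than any production code, and the loop is bounded here because System.gc() makes no guarantees:

```java
import java.lang.ref.WeakReference;
import java.nio.MappedByteBuffer;

// Java has no explicit unmap, so drop the only strong reference to
// the buffer and poll the garbage collector until a WeakReference
// reports that the buffer (and hence its mapping) has been freed.
class Remapper {
    private MappedByteBuffer buffer;
    private final WeakReference<MappedByteBuffer> ref;

    Remapper(MappedByteBuffer buffer) {
        this.buffer = buffer;
        this.ref = new WeakReference<>(buffer);
    }

    // Returns true once the buffer has been collected; only then is
    // it safe to reopen and map the enlarged file.
    boolean unmap() throws InterruptedException {
        buffer = null;  // drop the strong reference
        for (int i = 0; i < 100 && ref.get() != null; i++) {
            System.gc();       // request a collection; no guarantees
            Thread.sleep(10);  // give the collector a chance to run
        }
        return ref.get() == null;
    }
}
```

Note that this only works if no other strong reference to the buffer survives anywhere, which is part of what makes the technique so fragile.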

The hashmaps take hashes of the following parts of triples as keys:

  1. (Object)
  2. (Object,Subject)
  3. (Predicate)
  4. (Predicate,Object)
  5. (Subject)
  6. (Subject,Predicate)
  7. (Subject,Predicate,Object)

These indexes let us look up triples based on any combination of Subject/Predicate/Object that we may have. All we need is a decent hashing function that takes the appropriate parts of the triple.
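For illustration, here is the kind of hashing function that might be tried first. The mixing constant and structure are my guesses, exactly the sort of thing the collision metrics mentioned above would validate or reject, and node IDs are assumed to be 64-bit longs:

```java
// Candidate hashing scheme for 1, 2, or 3 node IDs; illustrative only.
class TripleHash {
    // Mix a single value using a multiply-shift-xor step.
    static long mix(long v) {
        v *= 0x9E3779B97F4A7C15L;  // a common 64-bit odd multiplier
        return v ^ (v >>> 32);
    }

    static long hash(long a)                 { return mix(a); }
    static long hash(long a, long b)         { return mix(mix(a) ^ b); }
    static long hash(long a, long b, long c) { return mix(mix(mix(a) ^ b) ^ c); }

    // Reduce a hash to a bucket index for a table of the given size.
    static int bucket(long hash, int tableSize) {
        return (int) Math.floorMod(hash, (long) tableSize);
    }
}
```

Nesting the mix steps keeps (Subject,Predicate) and (Predicate,Object) hashes distinct even when they involve the same node values in a different order.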

Each bucket in the hashmaps stores the complete key (to detect collisions) and an index into a large data file.

The first design uses just one large data file. Files can now be addressed with 64-bit addresses in all the important operating systems (Linux, OS X, Windows, Solaris), so this file can be as large as we like. Indeed, under Linux we could even access the raw device, though I doubt we'd gain much from doing that.

The elements of the data file are all identical. Each holds a complete triple, plus 6 indexes which point to other elements in the file. These indexes allow each triple to be part of 6 distinct linked lists. The first linked list contains all the triples which match on Object. The second matches on Object and Predicate. This continues as per the indexes above, with the exception of (Subject,Predicate,Object), since there is only ever one triple which contains that set of values.
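The element layout can be sketched as follows: nine 64-bit values, i.e. 72 bytes per element. The field names are mine, purely illustrative:

```java
// One element of the data file: a full triple plus the six
// linked-list pointers, all 64-bit values.
class TripleElement {
    static final int LONGS_PER_ELEMENT = 9;                      // 3 nodes + 6 pointers
    static final int BYTES_PER_ELEMENT = LONGS_PER_ELEMENT * 8;  // 72 bytes

    long subject, predicate, object;
    // Next-element pointers for the six lists; -1 terminates a list.
    long nextO, nextOS, nextP, nextPO, nextS, nextSP;
}
```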

A separate file is kept to list all unused elements in the data file, with the exception of the block of consecutive unused elements at the end of the file. This file is called the "Free list". A single index is also stored indicating the start of the unused space at the end of the data file.

Finding triples with this design leads to the following operations:
  • Take the known nodes, and hash them with the function appropriate for that number of nodes.
  • Go to the appropriate hashmap, and look up the index into the data file.
  • Return an iterator which has that index.

Iterating just means going to the data file to return the triple found at that index, and then setting the index to the linked list pointer appropriate for the data being traversed. This continues until the end of the list is reached.
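The iteration can be sketched like this, with the data file stood in by an in-memory array (illustrative only; each element holds a triple and the pointer for the list being traversed):

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Walks one linked list in a stand-in for the data file.
class TripleIterator implements Iterator<long[]> {
    private final long[][] elements;  // each row: {subject, predicate, object, next}
    private long index;               // -1 marks the end of the list

    TripleIterator(long[][] elements, long head) {
        this.elements = elements;
        this.index = head;
    }

    public boolean hasNext() { return index != -1; }

    public long[] next() {
        if (index == -1) throw new NoSuchElementException();
        long[] e = elements[(int) index];
        index = e[3];  // follow the pointer appropriate to this index
        return new long[] { e[0], e[1], e[2] };
    }
}
```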

Insertion starts with the full (Subject,Predicate,Object) index:
  • Take the nodes and hash them.
  • Go to the (Subject,Predicate,Object) hash table and insert an entry if it is not already there. If an entry already exists, then there is no need to continue, as the triple is already stored and the insertion is not valid.
  • Allocate a new element in the data file. This is done by checking if the free list has any elements listed. If not, then use the pointer to the beginning of the space at the end of the file, and increment that pointer to the next unused element.
  • Write the triple to the data file in the allocated element.
  • Remember this index for the subsequent indexes.

For the subsequent indexes the procedure is similar:
  • Take the appropriate nodes and hash them.
  • Go to the appropriate hash table and see if there is already an entry for those nodes. Insert an entry if there isn't one; otherwise, remember the existing index into the data file. Either way, set the entry to point at the element that was allocated above.
  • Go to the newly allocated data file element. If an entry was inserted in the last step, then there is no linked list yet, so set the appropriate pointer to -1 (or to the current element, for a circularly linked list). If there was a pre-existing entry, then set the appropriate pointer to the remembered old head, so that the new element forms the new head of the list.

Strictly speaking, the above steps are slightly out of order. The data element that forms the new head of a linked list should point to the old head before the hash file is changed to point to the new element. This maintains file integrity, though if that is important then flushes are needed at the correct times to prevent out-of-order writing to disk.
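The insertion steps can be sketched in memory like this. One map stands in for a single hash index, and everything here would really live on disk:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// In-memory stand-in for one hash index plus the data file.
class InsertSketch {
    final Map<Long, Long> index = new HashMap<>();  // key -> list head
    final long[][] data = new long[64][4];          // rows: {s, p, o, next}
    final Deque<Long> freeList = new ArrayDeque<>();
    long tail = 0;  // first unused element at the end of the "file"

    long allocate() {
        // Reuse a freed element if one exists, else take from the end.
        return freeList.isEmpty() ? tail++ : freeList.pop();
    }

    void insert(long key, long s, long p, long o) {
        long e = allocate();
        data[(int) e][0] = s;
        data[(int) e][1] = p;
        data[(int) e][2] = o;
        // The new element points at the old head before the hash
        // entry is updated, preserving the integrity ordering above.
        Long oldHead = index.get(key);
        data[(int) e][3] = (oldHead == null) ? -1 : oldHead;
        index.put(key, e);
    }
}
```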

Deletion works on each index in turn:
  • Take the appropriate nodes and hash them.
  • Go to the correct hashmap index and find the index to the data file.
  • Iterate along the linked list in the data file until the correct triple is found.
  • Change the pointers appropriate to the index around this element in order to remove it. If the triple was the first element in the list, then change the hashmap to point to the new head.
  • If the list had only the one triple in it, then remove the entry from the hashmap.
  • The first time through, remember the index to the triple being removed.

Once the triple is removed from all the indexes and linked lists, then add its index to the free list.
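The list surgery in the steps above looks like this on a single singly linked list (in-memory stand-in, with each row of data holding {value, next}). The need to walk to the element's predecessor is exactly why deletion costs a traversal per index:

```java
// Removes one element from a singly linked list; returns the new head.
class DeleteSketch {
    static long remove(long[][] data, long head, long target) {
        if (head == target) {
            return data[(int) head][1];  // list gets a new head
        }
        long prev = head;
        while (data[(int) prev][1] != target) {
            prev = data[(int) prev][1];  // walk to the predecessor
        }
        // Splice the target out by skipping over it.
        data[(int) prev][1] = data[(int) target][1];
        return head;
    }
}
```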

Discussion on First Design
Insertions are likely to be very fast with this scheme. Very little reading and writing occurs for each index, and the data file requires a single write with no reading (or two writes and a read if the linked lists are to be made circular).

Finds are very fast with only 2 disk reads. I have no problems with any overhead involved with iterating, as the calling code is often more complex. If it became an issue then a background thread could even be used to read ahead on the list.

As you can see, deleting takes a little effort, as it requires traversal of all 6 linked lists. This could be avoided completely if the lists were changed to be doubly linked, but then the size of each data file element would jump from 72 bytes to 120 bytes (all numbers are 64-bit, which is needed to make the node IDs scalable and to allow the indexes to refer to files of any size). This is a significant increase in size, and at this stage we don't believe we need it.

Slow deletions can be a problem in systems where entire groups of statements are to be deleted. For instance, in Kowari/TKS we usually see this as the dropping of an entire model. There seems to be little call to drop groups of statements by other criteria.

Fortunately this isn't a problem here, as we plan on using the file system to our advantage by keeping different models in different directories. That means that dropping a whole model just means dropping its directory.

To digress for a moment: in the past I've noticed a reluctance in proprietary software to use the filesystem's structure to store a program's data. In the rare cases where multiple directories are used, they often have obscure names, and the program stores information internally about what goes into which directory. For instance, iTunes (and the iPod) splits songs up into a collection of obscurely named directories. This allows it to find files relatively quickly by using a hash. People have worked the algorithm out, but it was non-trivial.

Programmers don't want users playing around in a program's internal data structures, and for good reason. Making things obscure discourages this. Otherwise, if I had a collection of data called "Foo" and I wanted to change its name to "Bar", then on seeing a directory called "Foo" I might think I could just rename it, unaware that the program needs to change some internal data as well. I'd deserve what I got, but people still do this, and programmers seem to think they should be preventing it.

Maybe programmers of proprietary software also think they are protecting some important secrets. I don't know. But this code is destined to be open sourced, so secrets are not an issue. It's also likely to be run only by those who know what they're doing, so I have no problem using a directory name to tell the program which model is held in it. Besides, it means that anyone changing a directory name or deleting a directory will see exactly what they expected to see. ;-)

Back to the design...

The main problem with this design is that the triples are not ordered. In the normal run of things this is irrelevant. The JRDF interface certainly does not need it. However, any efficient joins on the returned data will require sorting. One of the strengths of Kowari/TKS that makes it so scalable is that all of its data is sorted. Unfortunately this leads to loading RDF being slower. This design will load data faster because it doesn't need sorting at all, but scalable joins are still going to need sorted data.

Sorting can be done in 3 different ways:
  1. At load time by ordering the linked lists. This will slow loading down significantly, but it should still be reasonably fast in comparison to Kowari/TKS.
  2. At query time. This makes a certain amount of sense, as joined data often needs to be re-sorted anyway. If sorting is ever done, then it should be cached to reduce duplication of effort.
  3. In a background thread. The sorted data isn't available as soon as it would be if sorted at load time, but loading itself is a lot faster. Consistency becomes more important, and harder to maintain, if the data is to be modified or queried while it is being sorted.

The second and third options are the most likely at this point. I'd like to see what the speed difference is when compared against the first option, and I may end up doing tests to compare these.

Second Design
It's getting late, so I'll be terse here...

This is a design similar to the first, only now there are 7 data files, with each hashmap index referring only to its own data file. With the exception of the data file belonging to the (Subject,Predicate,Object) index, each data file contains only the first node from the set used to form its index. i.e. the data file for the (Subject,Predicate) index only has subject nodes in its elements, and the data file for the (Object,Subject) index only has object nodes in its elements.

The data file elements form a single linked list with other elements in the same index. They also hold an index into the next data file. This is done in the following pattern:
  Subject -> Subject,Predicate -> Full Triples
  Object -> Object,Subject -> Full Triples
  Predicate -> Predicate,Object -> Full Triples

Searching for a Subject gives a pointer to the first data file which tells us all subsequent subjects (via the linked list) and the start of the linked list in the Subject,Predicate index which matches our query. Each element in the Subject,Predicate index points to the next Predicate that we need (we already know the Subject), and also points to the head of a list in the FullTriples data file. The list in the FullTriples file will entirely match the Subject,Predicate which pointed to it.

All indexes end up pointing at the FullTriples file. The elements in this file contain the full triple, plus 3 pointers for finding the triple with the next Subject, the next Predicate, or the next Object. The indexes used to get here provide the rest of the needed data, so this is fine. It also means less overhead to make the lists doubly linked, allowing for faster deletions.
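The three levels of the chain might be laid out like this. All field names here are mine, purely to illustrate the pointer structure:

```java
// Illustrative layout of the three chained files in the second design.
class SecondDesign {
    // Element of the Subject index's data file.
    static class SubjectEntry {
        long subject;
        long nextSubject;  // linked list of all subjects
        long firstSP;      // head of this subject's (S,P) list
    }

    // Element of the (Subject,Predicate) index's data file; the
    // subject is already known from the level above.
    static class SubjPredEntry {
        long predicate;
        long nextSP;       // next predicate for the same subject
        long firstTriple;  // head of the matching FullTriples list
    }

    // Element of the FullTriples file: the triple plus pointers for
    // the next-Subject, next-Predicate, and next-Object lists.
    static class FullTriple {
        long s, p, o;
        long nextS, nextP, nextO;
    }
}
```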

This design would have a slower loading speed, but not dramatically. Reading would be slightly faster. The real speed benefits come from deletions.

I'm getting vague on the tradeoffs of this design, though I had them in mind 8 hours ago. I'll write more on this later when I've had some sleep.

String Pools
You'll note that there has been no mention of string pools. That's because one is not essential yet, but it will certainly be needed before this system can look at RDF. To start with we may borrow the StringPool structure from Kowari, but I'm keen to give a trie structure a go, as a fully indexed string pool would do wonders for us. I also want practice with trie structures, as I believe it would be possible to make them phasable, making them candidates for inclusion in Kowari.
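For the idea in miniature, a trie maps strings to node IDs with shared prefixes stored only once, which is what makes it appealing for a pool full of long, similar URIs. A minimal in-memory sketch (a disk-based, phasable version would be far more involved):

```java
import java.util.HashMap;
import java.util.Map;

// Minimal trie mapping strings to node IDs; shared prefixes share nodes.
class Trie {
    private final Map<Character, Trie> children = new HashMap<>();
    private long id = -1;  // -1 means no string ends at this node

    void put(String key, long nodeId) {
        Trie t = this;
        for (char c : key.toCharArray()) {
            t = t.children.computeIfAbsent(c, k -> new Trie());
        }
        t.id = nodeId;
    }

    long get(String key) {
        Trie t = this;
        for (char c : key.toCharArray()) {
            t = t.children.get(c);
            if (t == null) return -1;  // no such string in the pool
        }
        return t.id;
    }
}
```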
