Monday, June 07, 2004

Resolver
Everything builds, but the tests are not done yet. I'm using a set of tests built by AM, but they need a bit of adapting.

This resolver is to be used in front of a RemoteSession, making it the basis of the new distributed queries. For the moment, distributed queries are a part of TKS rather than Kowari, so none of this code will be open sourced in the foreseeable future. At this stage remote queries will provide simple functionality, but over time efficiency rules for remote joining and resolution will be added.
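
As a purely illustrative sketch of the idea (using hypothetical interfaces, not the real Kowari/TKS resolver SPI or session classes), the resolver just delegates each constraint to a session on the remote server, which is why it can form the basis of distributed queries:

```java
// Hypothetical interfaces only: the point is that resolution of a constraint
// becomes a remote call, with smarter join/resolution rules layered on later.
interface Constraint { }

interface Answer { }

interface RemoteSession {
  Answer query(Constraint constraint);   // evaluate the constraint on the remote server
}

final class RemoteResolver {
  private final RemoteSession session;

  RemoteResolver(RemoteSession session) {
    this.session = session;
  }

  // Resolving locally means asking the remote session to do the work.
  Answer resolve(Constraint constraint) {
    return session.query(constraint);
  }
}
```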

Because the code is to be part of TKS rather than Kowari, I had to get a TKS checkout and make package modifications to match the different directory structure. It is not part of the TKS build process yet, but it does provide a place to check the code in.

Trans Modifications
I looked over AN's shoulder a little today, trying to offer some help with his modifications to trans. He had a good grasp of it, but I figured that it doesn't hurt to know what the original author was trying to accomplish when he structured the code. I tried to keep the code relatively straightforward, but it was quite apparent today that there are some subtleties that aren't so obvious.

Nonetheless, AN seemed fine with the modifications, and improved some of the structure along the way. I was concerned about the overhead of testing for the existence of every triple to be generated, but we were able to determine that AN could search an existing set of tuples more efficiently than going back to the original graph.

Hashmap Indexes
I'm hoping to spend a little time on doing some honest-to-goodness coding on the hashmaps tonight. In the meantime I've been giving some consideration to the indexes and space usage.

The problem of 32 bit "next" pointers ends up being relatively easy. The end of each linked list will be indicated by a NULL pointer. Almost all other nodes will point to the next node in the list. If the offset of one node from its predecessor is too large to fit, then this will be indicated by setting the offset to a reserved value. When this value is seen, the code traversing the linked list will go back to the hashmap and ask for the key again, only now with an integer which I'll call the overflow count. When the hashmap finds the required key, it will pretend that it has instead found a collision, and will keep looking in subsequent hashbuckets. For each pretended collision the overflow count will be decremented and the search continued, until the key is found and the overflow count is equal to 0. Inserting the duplicate entries into the hashmap should be reasonably simple as well.
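
As a rough sketch of how the traversal might look (all of the class names here are made up for illustration, not actual Kowari/TKS code):

```java
// Minimal sketch of the overflow scheme: a reserved "next" value sends the
// walker back to the hashmap with an incremented overflow count.
final class Node {
  static final int NULL_NEXT = 0;      // marks the end of a linked list
  static final int OVERFLOW = -1;      // reserved value: real offset won't fit in 32 bits

  long position;    // absolute position of this node in the file
  int next;         // 32 bit offset from this node to the next one
}

interface NodeStore {
  Node read(long position);            // load the node stored at an absolute position
}

interface HashIndex {
  // Find the node for a key, skipping the first 'overflowCount' genuine matches
  // as though they were hash collisions.
  Node lookup(long key, int overflowCount);
}

final class ListWalker {
  private final HashIndex index;
  private final NodeStore store;

  ListWalker(HashIndex index, NodeStore store) {
    this.index = index;
    this.store = store;
  }

  // Returns the node after 'current', or null at the end of the list.
  // 'overflows' counts how many OVERFLOW markers have been crossed so far.
  Node next(long key, Node current, int overflows) {
    if (current.next == Node.NULL_NEXT) {
      return null;                                   // end of the list
    }
    if (current.next == Node.OVERFLOW) {
      // The gap was too large to store, so go back to the hashmap and ask for
      // the key again, skipping one more pretended collision than last time.
      return index.lookup(key, overflows + 1);
    }
    // Common case: the next node is within a 32 bit offset of this one.
    return store.read(current.position + (current.next & 0xFFFFFFFFL));
  }
}
```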

It is highly unlikely that a next pointer will overflow the 32 bit value, so this scheme won't burden the hashmap unduly.

My remaining concern is what to do about the limited size of a hashmap. I'm starting to think that a hierarchy of indexes might be useful. Unfortunately, I haven't yet worked out how to go about this. In the meantime I've been reading various web pages about indexes, and searching for others. If anyone knows of any good ones then I'd love to hear about them. Unfortunately my background hasn't really covered many of the data structures that most software engineers probably take for granted, so I still have a lot to learn.

I've noticed that most descriptions of indexes describe structures which work quite well in memory, but are not so appropriate for disk. For instance, the skip lists described on the page that AN pointed me to today are great for building data on disk, but deleting becomes quite messy due to the variable size of each node. This is not insurmountable, as a free list could allow for node reuse, but the problem does get harder. Similarly, many other indexing solutions describe structures which use variable sized data, with lots of jumping around in the data. The trick to efficient disk access for these structures is usually to find a fixed sized representation.
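
To make the "fixed sized representation" point concrete, here is a sketch (not Kowari code, and the maximum level is an assumption) of a skip list node written as a constant sized record, so that freed records can simply be reused from a free list:

```java
import java.nio.ByteBuffer;

// Every node occupies RECORD_SIZE bytes on disk, with the forward pointers
// padded out to an assumed maximum level.
final class DiskSkipNode {
  static final int MAX_LEVEL = 8;                        // assumed ceiling on node height
  static final int RECORD_SIZE = 8 + 4 + MAX_LEVEL * 8;  // key + level + forward pointers

  long key;
  int level;                             // how many forward slots are actually used
  long[] forward = new long[MAX_LEVEL];  // file offset of the next node at each level

  void writeTo(ByteBuffer record) {
    record.putLong(key);
    record.putInt(level);
    for (int i = 0; i < MAX_LEVEL; i++) {
      record.putLong(i < level ? forward[i] : 0L);       // pad the unused levels
    }
  }

  static DiskSkipNode readFrom(ByteBuffer record) {
    DiskSkipNode node = new DiskSkipNode();
    node.key = record.getLong();
    node.level = record.getInt();
    for (int i = 0; i < MAX_LEVEL; i++) {
      node.forward[i] = record.getLong();
    }
    return node;
  }
}
```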

According to DM, Lucene gets away with skip lists like this because it relies entirely on write-once lists, with deletions done as white-outs, and additions going into new files. Later optimization is done as a merging of the multiple files. This is the approach that he wants to take, and it looks like it might work well. The biggest advantage is that it will scale well.
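
I haven't seen the design itself, but the general shape of write-once storage with white-outs might look something like this sketch (hypothetical classes, nothing here is DM's actual code):

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Collections;
import java.util.List;

// Entries are never modified once written: a delete just sets a bit in a side
// structure, additions always go into a new segment, and a later optimise pass
// merges the segments and drops the deleted entries.
final class Segment {
  final long[] entries;     // written once, never changed
  final BitSet whiteOuts;   // bit i set means entries[i] has been deleted

  Segment(long[] entries) {
    this.entries = entries;
    this.whiteOuts = new BitSet(entries.length);
  }

  void delete(int i) {
    whiteOuts.set(i);
  }
}

final class SegmentStore {
  private final List<Segment> segments = new ArrayList<Segment>();

  void add(long[] sortedEntries) {
    segments.add(new Segment(sortedEntries));   // new data always starts a new segment
  }

  // Merge all segments into one, skipping the whited-out entries.
  void optimise() {
    List<Long> live = new ArrayList<Long>();
    for (Segment s : segments) {
      for (int i = 0; i < s.entries.length; i++) {
        if (!s.whiteOuts.get(i)) {
          live.add(s.entries[i]);
        }
      }
    }
    Collections.sort(live);
    long[] merged = new long[live.size()];
    for (int i = 0; i < merged.length; i++) {
      merged[i] = live.get(i);
    }
    segments.clear();
    segments.add(new Segment(merged));
  }
}
```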

Consequently, I'm considering a hierarchical structure with hashmaps at the top and skip lists underneath, but I've yet to work out the best place to split the data between the two levels.
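
In the abstract it would look something like the sketch below (hypothetical interfaces throughout): the hashmap is keyed on part of a tuple and hands back a skip list holding the remainder in sorted order. Exactly where to make that split is the part I haven't worked out.

```java
interface SkipList {
  boolean contains(long remainder);
  void insert(long remainder);
}

interface TopLevelHash {
  SkipList listFor(long keyPrefix);   // e.g. hash on the subject node of a triple
}

final class TwoLevelIndex {
  private final TopLevelHash top;

  TwoLevelIndex(TopLevelHash top) {
    this.top = top;
  }

  // Lookup goes through the hashmap first, then searches the skip list it found.
  boolean contains(long keyPrefix, long remainder) {
    SkipList list = top.listFor(keyPrefix);
    return list != null && list.contains(remainder);
  }
}
```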
