Tuesday, June 29, 2004

I've been so caught up with things tonight that I nearly forgot to post. Then I discovered that a recent crash of my Linux system (no idea why... uptime was about 3 months at that point) had corrupted Mozilla somehow. So more delays while I upgraded to 1.7. I'm missing the Pango port, so I might go searching for that soon and see if it's available in 1.7 yet.

Unfortunately the new installation of Mozilla didn't fix my problem of mozilla-xremote-client failing to do anything. It works when I ssh to this box with X forwarding on and run it, but it does nothing when run locally. There is almost no documentation for mozilla-xremote-client anywhere, and I've yet to see anyone explain what could make it silently do nothing like this. It's really annoying. Do I resort to strace? Do I download the code and debug it? I'd rather spend my time on other problems.

Resolver Paths
I spent a lot of today struggling with the modifications made over the last week. Once I had everything sorted for a clean build, I was back to my 11-day-old JUnit errors. While not immediately apparent, the cause turned out to be classpath problems again, this time in reflection code, so the exception information was often misleading.

I then spent some time in a cycle of running the test, finding the missing class in the exception traces, using jar/grep to find the library with the missing class, inserting it into the path in the ant script, and re-running the test. I ended up with a LOT of jar files in my path before things finally started to run.

Once running, I found that I had promoted all the errors up to failures, except for one which passed. The failing tests are describing a misconfiguration of a subsystem I'm not familiar with, so I may be on a steep learning curve in the morning. It would feel like progress if it weren't for the fact that there is still so much to be done.

I'd love to spend some time talking about this, but I have a thesis proposal to write, and I promised myself I'd finish it tonight. So I'll be short.

For a little while now I've been considering B-trees as an index for a triplestore. After all, every major database uses them, and they certainly scale well for searching. It was almost accidental that Kowari started using AVL trees, and I've been considering what the tradeoffs are between the two approaches. This led to a significant discussion on it this afternoon, so I figure it's about time I log it.

B-trees of any reasonable order are significantly shallower than AVL trees, because AVL nodes have only left and right children while B-tree nodes have many. While a tree can be mapped into memory its depth is of little real concern, but now that we are looking at indexes of several GB, depth starts to matter. Unfortunately, B-trees lose a little ground here, as they do not fill their nodes completely: each node holds "some number" of entries between N/2 and N, where N is the order of the tree. This slack space makes the index about 25% larger than it strictly needs to be, which pushes B-tree indexes out of the range that can be memory-mapped or cached much sooner.
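For a rough feel for the numbers, here's a quick back-of-the-envelope comparison. The figures are mine, not measured from Kowari, and the order of 256 is just a plausible example:

```java
// Illustrative arithmetic only, not Kowari code: compares the
// approximate depth of an AVL tree against a B-tree of order 256
// for an index of 100 million entries.
public class TreeDepth {
    public static void main(String[] args) {
        long entries = 100_000_000L;

        // Worst-case AVL depth is about 1.44 * log2(n).
        double avlDepth = 1.44 * (Math.log(entries) / Math.log(2));

        // A B-tree node holds between N/2 and N entries, so the depth
        // is at most log base N/2 of n (here, order N = 256).
        int order = 256;
        double btreeDepth = Math.log(entries) / Math.log(order / 2);

        System.out.printf("AVL depth:    ~%.0f%n", avlDepth);   // ~38
        System.out.printf("B-tree depth: ~%.0f%n", btreeDepth); // ~4
    }
}
```

Once each level costs a disk seek instead of a memory access, the difference between 38 levels and 4 is enormous.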

Inserting into a B-tree can be a very cheap operation if the target node is only partially full, or more complex than an AVL tree insertion if it is completely full. This is where Kowari differs from a standard AVL tree. Instead of storing a node for every triple, Kowari stores indexes into a block file. This file contains blocks of up to 512 sorted triples (8kB). To insert a triple, the correct AVL node is located and used to find the appropriate block. The block is read, the triple inserted, and the block written back to disk. If the block is already full, it is split into two blocks and a new AVL node is created to point to the second. Each AVL node also holds the minimum and maximum triples of its block, so comparisons can be done on the node without having to follow the indirection.
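To make that flow concrete, here's a rough sketch of the insertion path. Every name is invented for illustration, triples are modelled as single longs, and in-memory lists stand in for the AVL tree and the block file; this is not Kowari's actual code:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch of the described insertion path: locate a node by its
// min/max bounds, insert into its sorted block, split a full block.
public class BlockIndex {
    static final int BLOCK_SIZE = 4;  // Kowari blocks hold 512 triples

    // Stand-in for an AVL node: a sorted block plus its min and max,
    // so comparisons can be made without reading the block itself.
    static class Node {
        List<Long> block = new ArrayList<>();
        long min, max;
    }

    // Stand-in for the AVL tree: an ordered list of nodes.
    final List<Node> nodes = new ArrayList<>();

    public void insert(long triple) {
        if (nodes.isEmpty()) nodes.add(new Node());
        Node node = nodes.get(findNode(triple));
        if (node.block.size() == BLOCK_SIZE) {
            // Full block: split it in two and add a new node, just as
            // described above (and just as a B-tree splits a full node).
            Node right = new Node();
            right.block.addAll(node.block.subList(BLOCK_SIZE / 2, BLOCK_SIZE));
            node.block.subList(BLOCK_SIZE / 2, BLOCK_SIZE).clear();
            updateBounds(node);
            updateBounds(right);
            nodes.add(nodes.indexOf(node) + 1, right);
            if (triple >= right.min) node = right;
        }
        int pos = Collections.binarySearch(node.block, triple);
        node.block.add(pos < 0 ? -pos - 1 : pos, triple);
        updateBounds(node);
    }

    // Choose the first node whose maximum is not below the triple,
    // consulting only the bounds held in the node (no block read).
    private int findNode(long triple) {
        for (int i = 0; i < nodes.size(); i++) {
            if (triple <= nodes.get(i).max) return i;
        }
        return nodes.size() - 1;
    }

    private static void updateBounds(Node n) {
        n.min = n.block.get(0);
        n.max = n.block.get(n.block.size() - 1);
    }
}
```

With real 512-triple blocks the partially-full case dominates, which is why most inserts cost one node lookup plus a single block read and write.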

Having each node refer to a list of sorted data, and having full lists broken up on insertion, is almost exactly the process used by B-trees. The more I look at it, the more I realise that we have merged certain concepts from both B-trees and AVL trees, often gaining in the process. For instance, the index is orders of magnitude smaller than a standard AVL tree, due to each node's indirection into a block of triples, and it uses its space more efficiently than a standard B-tree, keeping the index smaller still.

Conversely, there are associated problems with the hybrid approach. The restriction to just 2 children makes the tree significantly deeper, which makes lookups slower once the index gets too large to keep in memory.

To be really scalable we do need a new index. DM will be working on using skip lists to allow for fast loads of large datasets, but this approach leads to read-only files. Future inserts create new files, and a background task will merge files when possible. This has the major advantage of exploiting the sequential read speed of a hard drive far more effectively, but it adds complexity and does not easily allow simple modifications of the graph.

For this reason I'm thinking I'd like to try B-trees, only with the current block files and the triple range comparators we currently use. This would end up storing lists of triples in the blocks, and lists of blocks in each node. Certainly not a groundbreaking design, but it would be interesting to see what difference it makes to performance. Making the order of the B-tree nodes a tunable parameter might provide some interesting results as well.
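If I do try it, the node layout might look something like this. It's purely hypothetical (none of these names exist in Kowari); the point is just that each entry references a block along with that block's minimum and maximum triples, so the existing range comparators can steer the search, and the order arrives as a parameter rather than being baked in:

```java
// Hypothetical layout for the proposed B-tree node over the existing
// block files. All names are invented for illustration.
class BTreeNode {
    final long[] blockIds;      // offsets into the existing block file
    final long[][] minTriples;  // minimum triple of each referenced block
    final long[][] maxTriples;  // maximum triple of each referenced block
    final long[] children;      // offsets of child nodes (order + 1 of them)
    int used;                   // how many entries are currently occupied

    BTreeNode(int order) {      // order is the tunable parameter
        blockIds = new long[order];
        minTriples = new long[order][];
        maxTriples = new long[order][];
        children = new long[order + 1];
    }
}
```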
