Thursday, January 24, 2008

Current Work

Since I can't do much at work during my "two weeks' notice", I've been asked to stay at home this week. I'm still being paid, and have to be available if I'm needed, but in reality it's just a holiday. With the visa interview next week I'm not as relaxed as I'd like, but it's been a good week. I've enjoyed spending more time with Anne and the boys, along with my mother-in-law, who left here on Tuesday. But after having a few days to clear my head, I'm trying to get back to Mulgara. Unfortunately, my new computer has not arrived yet, so I'm back to my old G4 PowerBook in the meantime. It's fine for use with VIM and even Safari, but it's choking whenever I try to do real work on it.

I've spent a couple of days trying to catch up on email, and now I'm looking at getting back to actual coding. I should be doing SPARQL (and I'm looking forward to that), but I allowed myself to get side-tracked on some performance code.

String Pools

Indy offered to do some profiling of a large load, and immediately came back to show me that we spend most of our time reading the string pool. This makes sense, as every triple that gets inserted needs to be "localized" into internal graph nodes (gNodes). This means searching for the URI or literal, and getting back the associated gNode, or creating one if it doesn't exist.
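
Roughly, the lookup-or-create step works like the toy sketch below. This is only an illustration with an in-memory map and invented names (LocalizeSketch, localize), not Mulgara's actual string pool code, which is disk-based.

```java
import java.util.HashMap;
import java.util.Map;

// A toy illustration of "localizing" a URI or literal to a gNode id.
// The map-based pool and these names are mine, not Mulgara's real API;
// the actual string pool is a disk-backed tree.
public class LocalizeSketch {
    private final Map<String, Long> pool = new HashMap<String, Long>();
    private long nextGNode = 1;

    /** Return the gNode for this value, allocating one if it is new. */
    public long localize(String uriOrLiteral) {
        Long gNode = pool.get(uriOrLiteral);      // search for an existing mapping
        if (gNode == null) {
            gNode = nextGNode++;                  // create a fresh gNode
            pool.put(uriOrLiteral, gNode);
        }
        return gNode;
    }

    public static void main(String[] args) {
        LocalizeSketch sketch = new LocalizeSketch();
        System.out.println(sketch.localize("http://purl.org/dc/elements/1.1/title")); // 1
        System.out.println(sketch.localize("http://purl.org/dc/elements/1.1/title")); // 1 again
    }
}
```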

There are two ways to index something like a string and map it to something else: you can use a hashmap, or a tree. Hashmaps are much faster, with constant-time lookups, but they have a number of problems. They take up a lot of disk space, they can be expensive when they need to expand, and they provide no ordering, making it impossible (well, not impossible, but you have to do tricky things) to get ranges of values. Trees don't suffer from any of these problems, but lookups are logarithmic, and they require a lot of seeking around the disk.
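
To illustrate the ordering point with Java's in-memory classes (a stand-in only; the string pool's structures live on disk), a sorted tree can answer a range or prefix query directly, while a hash map offers nothing beyond single-key lookups:

```java
import java.util.HashMap;
import java.util.TreeMap;

// Illustrating the ordering difference with in-memory maps. The URIs and
// gNode values here are made up; they just stand in for string-to-gNode
// mappings in the string pool.
public class RangeLookup {
    public static void main(String[] args) {
        TreeMap<String, Long> tree = new TreeMap<String, Long>();
        HashMap<String, Long> hash = new HashMap<String, Long>();
        String[] uris = {
            "http://example.org/a", "http://example.org/b", "http://other.org/c"
        };
        for (int i = 0; i < uris.length; i++) {
            tree.put(uris[i], (long) i);
            hash.put(uris[i], (long) i);
        }

        // One call returns every key in a given range, in sorted order.
        System.out.println(tree.subMap("http://example.org/", "http://example.org0"));

        // A hash map has no such operation; constant-time get() is all you have.
        System.out.println(hash.get("http://example.org/a"));
    }
}
```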

For the moment, the string pool maps URIs and literals to gNodes by storing them in a tree. It's an AVL tree, which keeps the rebalancing cost of a write down to O(1) (at most one rotation, possibly a double, per insert), though we don't store multiple values per node (unlike the triple indices), meaning the tree is very deep.

The code has many, many possibilities for optimization. We'll be re-architecting this soon in XA2 (there's going to be a big meeting about it in SF next month), but for the moment, we're working with what we have.

The first thing that Indy noticed was some code in Block.get(int,ByteBuffer). This was copying bytes from one buffer to another by iterating over them one at a time. This seems ludicrous, especially when the documentation for ByteBuffer.put(ByteBuffer) explicitly describes how the bulk operation is faster than doing the same thing in an iterative loop. A simple fix to this apparently sped up loads by 30%! I wasn't profiling this, so I can't vouch for it, but Indy seemed certain of the results.
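
For illustration, the shape of that change is something like the following. This is a simplified stand-in, not the actual Block.get code.

```java
import java.nio.ByteBuffer;

// A simplified before/after of the copy in question; this is not the real
// Block.get(int, ByteBuffer) code, just the shape of the change.
public class BufferCopy {
    // The old pattern: copy byte by byte in a loop.
    static void copySlowly(ByteBuffer src, ByteBuffer dst) {
        while (src.hasRemaining()) {
            dst.put(src.get());
        }
    }

    // The fix: one bulk put, which the ByteBuffer javadoc notes can be far
    // more efficient than an equivalent loop.
    static void copyQuickly(ByteBuffer src, ByteBuffer dst) {
        dst.put(src);
    }

    public static void main(String[] args) {
        ByteBuffer src = ByteBuffer.wrap("some localized data".getBytes());
        ByteBuffer dst = ByteBuffer.allocate(src.remaining());
        copyQuickly(src, dst);
        System.out.println(new String(dst.array()));
    }
}
```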

Initially I had thought that this couldn't have been code from David and myself, but I checked back in old versions of Kowari, and it's there too. All I can think of is that one of us must have been sick, and the other absent. At least it's fixed. I've also noticed a couple of other places where iterative copies seem to be happening. I'd like to fix them, but there may be no point. Instead I'll let the profiler guide me.

After thinking about it for a while, I started wondering why one buffer was being copied into another in the first place. The AVL trees in particular are memory mapped, whenever possible, and memory mapping is explicitly supposed to avoid copying between buffers. Buffer copies may seem cheap compared to disk seeks, but these are regularly traversed indices, so the majority of the work will be done in cached memory.
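
For reference, this is what memory mapping looks like at the Java level: the buffer returned by FileChannel.map is backed by the OS page cache, so reads come straight from the mapping without an intermediate copy. The file name below is made up, and this isn't Mulgara's block file code.

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Generic memory-mapping example; the file name is invented. Reads from the
// mapped buffer come straight out of the page cache, with no explicit copy
// into a separate buffer.
public class MapExample {
    public static void main(String[] args) throws Exception {
        RandomAccessFile file = new RandomAccessFile("avl_index.dat", "r");
        FileChannel channel = file.getChannel();
        MappedByteBuffer mapped =
                channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());

        // Work directly against the mapping rather than copying it out.
        if (mapped.remaining() >= 8) {
            System.out.println("first long in the index: " + mapped.getLong(0));
        }

        channel.close();
        file.close();
    }
}
```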

A little bit of inspection showed me what was going on. The first part of a URI or string (or any kind of literal) is kept in the AVL tree, while anything that overflows 72 bytes is kept in another file. The code that does comparisons loads the first part of the data into a fresh buffer, and appends the remainder if it exists, before working with it. However, much of the time the first part is all that is needed. When this is the case there is no need to concatenate two separate buffers together, meaning that the original (hopefully memory-mapped) buffer can be used. I fixed this in one place, but I think it's appearing in other areas as well. I'll have to work through this, but again I shouldn't go down any paths that the profiler doesn't deem necessary.
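
The shape of the fix is roughly the sketch below. The names and structure are invented (only the 72-byte prefix figure comes from above), and the real comparison code is more involved, but the point is that the overflow data is only touched when the prefix can't settle the comparison.

```java
import java.nio.ByteBuffer;

// A rough sketch of the idea, with invented names; the real comparison code is
// more involved. The AVL node holds the first 72 bytes of the value, and the
// overflow file is only consulted when that prefix cannot settle the ordering.
public class PrefixCompare {

    static int compare(byte[] key, ByteBuffer prefix, boolean hasOverflow,
                       OverflowReader overflow) {
        int len = Math.min(key.length, prefix.remaining());
        for (int i = 0; i < len; i++) {
            int diff = (key[i] & 0xFF) - (prefix.get(i) & 0xFF);
            if (diff != 0) {
                return diff;              // decided from the (ideally mapped) prefix
            }
        }
        if (!hasOverflow) {
            return key.length - prefix.remaining();   // the whole value fit in the node
        }
        if (key.length <= prefix.remaining()) {
            return -1;    // the key ran out, but the stored value continues on disk
        }
        // Equal across the whole prefix and both values continue: only now do we
        // pay for reading the remainder from the overflow file and comparing it.
        return overflow.compareRemainder(key, len);
    }

    /** Stand-in for the code that reads and compares the overflow data. */
    interface OverflowReader {
        int compareRemainder(byte[] key, int offset);
    }
}
```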

Dublin Core PURL

After making my adjustments, I tried running the tests, and was upset to see that 5 had failed. This seemed odd, since I'd worked on such fundamental code that a failure in my implementation should have prevented anything from working at all, rather than stopping only 5 tests. I had to go out this morning, so it bothered me for hours until I could check the reason.

It turned out that the problem was coming from tests which load RDF from a URL: http://purl.org/dc/elements/1.1. Purl.org is the home of persistent URLs, so if a document ever changes location, the URL for it does not need to be changed. So using this URL in a test seems appropriate, providing you can assume an internet connection while testing. But unexpectedly, the contents of this file changed just last week, which led to the problems.

Given that this is a document describing a standard for Dublin Core, and given that it has a version associated with it, I am startled to see that the contents of the file changed. Shouldn't the version number increase? While I find it bizarre, at least I found it before people started complaining about it. It will be in the next release.

Moving On

Now that I've addressed the initial profiling black spots, I can move on to the things I ought to be doing, namely SPARQL (being an engineer I would prefer to be squeezing every last drop of performance out of this thing, but I have to manage priorities). I have to talk to a few people about Mulgara tomorrow, but aside from that, I'll be working in the JavaCC file most of the day.... I hope.

1 comment:

Anonymous said...

ooh, meeting! lemme know, would love to participate :)