Friday, June 04, 2004

Resolvers Still
Well the basic structure is all written, but it's far from working. There are missing jar files, and all sorts of other problems with the build process, so I'm left sorting out which problems are in the code and which are in the configuration.

I think I'm close, but I was called away a little early tonight, leaving me unable to finish it until later.

Transitive Queries
AN spoke to me today about the trans statement. Since I got pulled onto resolvers, he has picked up the inferencing work where I left off. He's making progress, and I'm sure it will be fine.

In the meantime it feels a little frustrating, because I know that if I were to make a change here and restructure something there then I could make the necessary changes reasonably easily. The only reason I didn't do so before was because I needed to speak with AN before I could commit to making those changes. Now that we've spoken I'm not getting the chance to work on it.

On the brighter side, I'm learning about a new subsystem that I hadn't had a lot of contact with before.

The meeting today went well. Yes, UQ is very interested in this kind of work. They are happy for me to apply, and Assoc. Prof. Colomb is prepared to be my supervisor. I now need to put my application into the university administration. Fingers crossed that they are happy to let me in, though ITEE thinks I should be fine.

As a part of the application I need to write a proposal of what I hope to do. I'm thinking something along the lines of "Scalable Ontology Server", but that sounds like hubris. :-) However, it does provide a general direction for my study, and I've been assured that my original proposal is a "write-only" document which is filed away, never to be seen again.

I have a LOT of reading ahead of me.

DM is making progress on his new triplestore design. Taking his inspiration from Lucene, he has based all of his index plans on skip lists. New data gets put into new files, so nothing needs to be moved, and an optimise method will merge files together.

The principle of this design is simple: Use sequential access on disk to its greatest advantage. Hard drives operate at speeds which are orders of magnitude faster when data can be read sequentially.

So while skip lists mean that more data has to be read by the process, the overall performance should actually be faster. Looking at it I can see his point. Making all writes append operations on new files, with file merging on the fly, also sounds like a lot of work, but again, it's completely sequential access.

The other advantage of using skip lists is that the data size is effectively unbounded. This has made me reconsider the hashtables I've been planning, and my opinion is now not so favourable.
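DM's structure isn't something I've coded against yet, but the core idea is easy to sketch in memory. Here is a minimal skip list over long keys (all the names are mine, not his); the on-disk version he describes would serialise each level into sequentially written files rather than chasing heap pointers, which is where the sequential-access win comes from.

```java
import java.util.Random;

/** Minimal in-memory skip list over long keys (illustrative only). */
class SkipList {
    static final int MAX_LEVEL = 16;

    static final class Node {
        final long key;
        final Node[] next;          // one forward pointer per level
        Node(long key, int levels) { this.key = key; this.next = new Node[levels]; }
    }

    private final Node head = new Node(Long.MIN_VALUE, MAX_LEVEL);
    private final Random rnd = new Random(42);

    /** Each node gets level i+1 with probability 1/2, giving O(log n) search. */
    private int randomLevel() {
        int lvl = 1;
        while (lvl < MAX_LEVEL && rnd.nextBoolean()) lvl++;
        return lvl;
    }

    void insert(long key) {
        Node[] update = new Node[MAX_LEVEL];
        Node x = head;
        for (int i = MAX_LEVEL - 1; i >= 0; i--) {
            while (x.next[i] != null && x.next[i].key < key) x = x.next[i];
            update[i] = x;          // last node before the insert point at level i
        }
        Node n = new Node(key, randomLevel());
        for (int i = 0; i < n.next.length; i++) {
            n.next[i] = update[i].next[i];
            update[i].next[i] = n;
        }
    }

    boolean contains(long key) {
        Node x = head;
        for (int i = MAX_LEVEL - 1; i >= 0; i--)
            while (x.next[i] != null && x.next[i].key <= key) x = x.next[i];
        return x.key == key;
    }
}
```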

The most expensive index from my previously discussed design will be based on the hash of the full triple. Let's assume 32 bit identifiers. 64 bit identifiers are actually more useful for a large system, but since this design might not scale that far, 32 bits are worth considering. Each element of the hash table will be 20 bytes long: 3 x 4 byte identifiers, plus a 64 bit file index. A 2GB file can therefore only hold about 107M entries. Presuming that I let the hash table get heavily loaded, say to a load factor of 0.75, that cuts me back to about 80M triples.
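A quick sanity check of the capacity figures (noting that 2GB is 2^31 bytes, and each slot holds three 32 bit IDs plus an 8 byte file offset; the class and method names are just mine for illustration):

```java
/** Back-of-envelope sizing for a memory-mapped hashed-triple index. */
class HashIndexSizing {
    static final long ENTRY_BYTES = 3 * 4 + 8;  // three 32-bit IDs + 64-bit offset = 20
    static final long FILE_BYTES  = 1L << 31;   // 2 GB

    /** Raw slot count in a 2GB file. */
    static long maxEntries() { return FILE_BYTES / ENTRY_BYTES; }

    /** Usable entries once the table is limited to a given load factor. */
    static long usableEntries(double loadFactor) {
        return (long) (maxEntries() * loadFactor);
    }

    public static void main(String[] args) {
        System.out.println(maxEntries());         // about 107.4M slots
        System.out.println(usableEntries(0.75));  // about 80.5M triples
    }
}
```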

One of the big advantages of using a hashmap is that I could just map it into memory and look it up with a memory indirection. A 2GB file is more than I can really map on a non-64bit system, and this has not even considered the other indexes. In other words, I would be lucky to get to 100M triples if I use mapped hashtables. That isn't to say that I can't use them, but things may not be as rosy as they first appear.

On the other hand, I have had some ideas for the data file. As triples are written, they will simply get appended to the file. It would be feasible to set aside a large buffer for these triples, and to wait until it is full before writing it out. This prevents lots of small writes, and more importantly, it may help to prevent numerous writes to the indexes. For instance, if the two triples:
  [23, 45, 34]
  [23, 45, 87]

were written, then the (subject,predicate) index would normally get updated twice for the (23,45) entry. In fact, so would the (subject) and (predicate) indexes for (23) and (45) respectively. Postponing the index writes means that only the latter entry would need to be written.
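To make the saving concrete, here is a hypothetical sketch of that buffering scheme (class and method names are all mine): triples accumulate in memory, and on flush only the last buffered position for each (subject,predicate) key is pushed to the index, so the two triples above cost one index write instead of two.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative write buffer that coalesces (subject,predicate) index updates. */
class TripleWriteBuffer {
    private final List<long[]> buffer = new ArrayList<>();
    int indexWrites = 0;   // counts simulated (s,p) index updates

    /** Appending is cheap: just queue the triple in memory. */
    void append(long s, long p, long o) { buffer.add(new long[]{s, p, o}); }

    /** On flush, the last buffered position wins for each (s,p) key. */
    void flush() {
        Map<String, Integer> latest = new LinkedHashMap<>();
        for (int pos = 0; pos < buffer.size(); pos++) {
            long[] t = buffer.get(pos);
            latest.put(t[0] + "," + t[1], pos);  // later entries overwrite earlier ones
        }
        indexWrites += latest.size();            // one index write per distinct key
        buffer.clear();
    }
}
```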

Of course, writeback filesystems might allow the first writes to be postponed until they are invalidated, but this is not guaranteed, and it still involves another system call.

Unfortunately the final data file will contain linked lists with all of the indexes pointing backwards, to places in the file with lower offsets. This is the complete opposite of what is needed here. One possibility is to take the Lucene idea of running "optimise" to rewrite the file. In that case the whole file could get rewritten backwards, which would make traversing the linked lists significantly faster. I believe that it would be fine to leave the blocks themselves alone: all operating systems read whole blocks at a time, so back-indexing within a block just goes to data that has already been read and is guaranteed to be in the same page of buffer cache.

It's a shame that I can't write a block to the front of a file.

Data Size
The final advantage of DM's idea is that the triples (or quads) being stored are just that: Triples only.

The data I'm planning on storing will be triples and linked list pointers. Each element will be 3 identifiers and 6 "next" pointers. Presuming that the linked lists can always use a 32 bit relative offset (which they may not, without elements being moved), this is already 36 bytes per triple. With full 64 bit file offsets it moves up to 60 bytes. If back pointers are included then it goes up again, to 108 bytes.
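Those per-triple sizes fall out of a simple calculation, which a small helper makes explicit (the method is mine, purely illustrative): three 32 bit identifiers, six link pointers whose width varies, and optionally six back pointers of the same width.

```java
/** Per-triple on-disk record sizes for the linked-list index design. */
class RecordSizes {
    /**
     * @param pointerBytes  width of each link pointer (4 = relative, 8 = full offset)
     * @param backPointers  whether each "next" pointer is paired with a "prev"
     */
    static int recordBytes(int pointerBytes, boolean backPointers) {
        int ids = 3 * 4;                 // three 32-bit identifiers
        int links = 6 * pointerBytes;    // six "next" pointers
        if (backPointers) links *= 2;    // six matching "prev" pointers
        return ids + links;
    }

    public static void main(String[] args) {
        System.out.println(recordBytes(4, false)); // 36: 32-bit relative offsets
        System.out.println(recordBytes(8, false)); // 60: full 64-bit offsets
        System.out.println(recordBytes(8, true));  // 108: 64-bit offsets + back pointers
    }
}
```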

One of the big killers for Kowari's performance with huge data sets is when it moves out of buffer cache and into greater reliance on the hard drive. Obviously, this occurs when the data being manipulated is too large to fit into RAM. The performance degrades as cache misses become more frequent with the growing dataset.

The easiest way to put off this diminishment in performance is to use smaller data elements. An arrangement with 12 bytes for a triple will perform much better than one with 108 bytes.

I'm thinking I'll code this anyway, so I can see and evaluate its performance. However, I'll be planning on throwing it away once I've seen what it can and can't do well. In the meantime I'll be using my new student card (hopefully) to borrow books on indexing from the UQ library.
