Monday, May 17, 2004

More Transitive Queries
Anyone reading this must be sick of me being in the grammar and query code by now. Put it down to distractions from other parts of the system I get involved with: not that I do any coding there, but I do get pulled into design discussions.

Did most of my coding in the LocalQuery class today. That's where all ConstraintExpression objects get resolved, so I'm finally into the nitty-gritty of it. After looking at the way the code is structured I decided that I was needlessly missing out on sharing some code by having TransitiveConstraint extend ConstraintExpression rather than Constraint. So I changed it. It wasn't all that bad, as all of the Constraint behaviour that I don't want is being overridden.

I've realised that anchored transitive statements need a separate set of code from non-anchored statements. That's a little annoying, as it means I end up doing it twice. Unfortunately the complexity of trying to sift out the common code would make it unreadable. OTOH, I'll keep my mind open, and if I get to the end of it and it looks like I was wrong I'll try to pull them back together again. What I've done so far doesn't look like it, though.

I was under the impression that I was going to be using a lot of find() method calls, but it seems that using Constraints will be as low in the APIs as I need to go. This lets me leverage the merge code already used for constraints. I'm still in the process of building up "where clause" constraint objects based on answers from previous constraint resolutions. I expect to be able to use this method to iteratively resolve the inferred statements.
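To make the idea concrete, here's a toy Python sketch (entirely my own, with made-up names; the real code resolves Kowari Constraint objects, not tuples) of the two resolution styles. The non-anchored case iterates a self-join to a fixed point, while the anchored case is a simple reachability walk from one node, which is part of why the two end up as separate code.

```python
def transitive_closure(pairs):
    """Non-anchored: full closure of (subject, object) pairs for one predicate."""
    closure = set(pairs)
    while True:
        # Join the accumulated result against the base statements.
        new = {(s, o2) for (s, o) in closure for (o1, o2) in pairs if o == o1}
        if new <= closure:
            return closure        # fixed point reached: nothing new inferred
        closure |= new

def reachable(anchor, pairs):
    """Anchored: every node reachable from a fixed starting subject."""
    seen, frontier = set(), {anchor}
    while frontier:
        frontier = {o for (s, o) in pairs if s in frontier} - seen
        seen |= frontier
    return seen
```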

I was left wondering whether the query should return only the inferred statements, or the original statements as well. To be "strictly" correct I should probably include the original statements, but I think that providing only the inferred statements is more useful. I'll be asking for opinions in the morning.

Simultaneous Writes
An interesting question came up about simultaneous writes, and I spent some time discussing it with DM.

Now it's very rare that any system really does simultaneous writes. If a computer has only one CPU or one hard drive then at some point it has to be serialised. It is feasible to avoid it if you have a multithreaded database with RAID storage, but even that doesn't guarantee parallel writing, as there are always locks involved somewhere. The issue is where the serialisation occurs. The lower down, the better. This is because an operating system can do things like read-write reordering, and there is little point in trying to duplicate this functionality at a higher level. Probably the best way to guarantee non-serialised writing is to use a cluster, but that was getting away from the question we were posed. We just need to consider writing from multiple clients at once.

Normal table-based databases often allow table, page, and row level locking. Table locking is pretty useless these days and does not allow for scaling. Page level locks are pretty good, but the best performance in scalable systems normally comes from row level locking.

Kowari and TKS don't have tables, but it got me wondering what the parallels are, and if we can apply them. (The short answer is no, though we do have solutions).

When I was discussing Kowari indexes in the last couple of days I mentioned that the graph was stored as a set of 6 indexes. These indexes form our "tables". Unfortunately, changing a single node in a statement means that statement must be moved in 5 indexes. It can't simply be "locked" and modified in position (except in one of the indexes).
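As a toy illustration (my own sketch; the orderings below are made up for the example, not Kowari's actual six), each index holds every statement in a different sort order, so changing one node means deleting the statement from every index and re-inserting it, rather than modifying it in place:

```python
from bisect import insort

# Illustrative orderings only: the six permutations of (subject, predicate, object).
ORDERINGS = [(0, 1, 2), (1, 2, 0), (2, 0, 1), (0, 2, 1), (1, 0, 2), (2, 1, 0)]

class SixIndexGraph:
    def __init__(self):
        self.indexes = [[] for _ in ORDERINGS]   # six sorted lists

    def insert(self, stmt):
        for idx, order in zip(self.indexes, ORDERINGS):
            insort(idx, tuple(stmt[i] for i in order))

    def delete(self, stmt):
        for idx, order in zip(self.indexes, ORDERINGS):
            idx.remove(tuple(stmt[i] for i in order))

    def change_object(self, stmt, new_obj):
        # The reordered tuple sorts to a different position in the indexes,
        # so the statement moves: delete everywhere, insert everywhere.
        self.delete(stmt)
        self.insert((stmt[0], stmt[1], new_obj))
```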

As an aside, this touches on where the Kowari/TKS dual phase commit happens. The database relies on all the indexes holding the same data. The dual phase commit lets us update all 6 indexes in the first phase of the commit and confirm it in the second phase. If one of the indexes fails to update in the first phase, then the data in all indexes gets rolled back. The second phase is atomic on the file system, so it either happens or it doesn't. This is done internally, so there is no API for it, but it is certainly happening. We couldn't guarantee integrity of the data if it didn't work this way.
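A minimal sketch of that commit shape (my own toy version with hypothetical names; the real thing works on index files, not Python sets):

```python
class ToyIndex:
    def __init__(self, fail=False):
        self.current, self.pending, self.fail = set(), None, fail

    def prepare(self, stmts):
        # Phase one: write the new data, which may fail.
        if self.fail:
            raise IOError("index failed to update")
        self.pending = self.current | stmts

    def rollback(self):
        self.pending = None

    def confirm(self):
        # Phase two: the atomic "adopt the new data" flip.
        self.current, self.pending = self.pending, None

def commit(indexes, stmts):
    done = []
    try:
        for ix in indexes:          # phase one: all indexes, or none
            ix.prepare(stmts)
            done.append(ix)
    except IOError:
        for ix in done:             # any failure rolls every index back
            ix.rollback()
        return False
    for ix in indexes:              # phase two: confirm everywhere
        ix.confirm()
    return True
```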

Back to considering locking in Kowari: it would theoretically be possible to lock the source and destination nodes in each index's AVL tree when performing a move of a statement. It would be very messy if a source node were nearly empty and needed removal, or if the destination were full and needed splitting. Splitting nodes (ie. adding a new node) and removal in AVL trees isn't usually so bad, but they can have repercussions all the way to the root of the tree! (This reminds me of a little rant I have about AVL tree rebalancing. See below.)

Unfortunately, locking nodes in a Kowari/TKS write phase could be very problematic, particularly if we were trying to share the phase. Kowari and TKS use a phase-tree to handle transactions. This fits in perfectly with the AVL tree structure. As soon as a node is to be modified, that node and all parent nodes above it get copied. This means that we can make as many modifications as we want without disturbing the original data. Committing a phase simply means adopting the root of that phase's tree as the new root of the index (an atomic, near-instantaneous operation). It also means that we can have numerous transactions which all have their own unique view of the data.
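The copy-on-write behaviour is the standard path-copying trick for persistent trees. A minimal Python sketch (using a plain unbalanced binary tree for brevity; Kowari's real trees are file-backed AVL trees):

```python
class Node:
    __slots__ = ("key", "left", "right")
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def insert(root, key):
    # Path copying: only the nodes from the root down to the insertion
    # point are copied; everything off the path is shared with the old phase.
    if root is None:
        return Node(key)
    if key < root.key:
        return Node(root.key, insert(root.left, key), root.right)
    if key > root.key:
        return Node(root.key, root.left, insert(root.right, key))
    return root

def contains(root, key):
    while root is not None:
        if key == root.key:
            return True
        root = root.left if key < root.key else root.right
    return False

phase1 = None
for k in (5, 3, 8):
    phase1 = insert(phase1, k)
phase2 = insert(phase1, 4)   # a new phase; phase1 is left untouched
```

Committing is then just adopting phase2's root as the index root; any reader still holding phase1 continues to see the old data.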

We only want to be writing to the most recent phase tree, as only one tree's root can be made the "new" root. DM had the clever idea of giving each writing thread its own tree, and keeping a transaction log of all modifications. Then the first transaction phase to be committed gets to make the root of its tree the new root. After the first transaction is committed, all other phases will be dropped, but their transaction logs can then be replayed against the newly committed phase. Naturally, these transactions will be rolled back if there is a conflict during playback.

While this works, it does involve the overhead of the second (and subsequent) committed phases having to write statements to their own trees, and then abandoning that work to replay it against the newly committed tree. I believe that some of this can be eliminated if all executing transactions are informed of a commit, so they can be played against the new phase. This means that any conflicting transaction can be aborted early. More importantly, it also means that each transaction can get a new phase tree before the partial playback. Consequently, whichever transaction is the next to be committed will simply get to make its phase tree the new root, meaning that all of its new modifications get kept, and don't require playback. DM suggested a further modification which only updates phases like this after a minimum time period, to prevent too much overhead accumulating on long-running transactions from numerous short transactions.
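A toy sketch of the basic replay scheme (my own simplification, with made-up names; statements are modelled as set elements, and real conflict detection and rollback are elided):

```python
class Store:
    def __init__(self):
        self.committed = frozenset()   # the current phase root

class Txn:
    def __init__(self, store):
        self.store = store
        self.base = store.committed    # snapshot: this writer's own phase
        self.view = set(self.base)
        self.log = []                  # replayable log of modifications

    def add(self, stmt):
        self.view.add(stmt)
        self.log.append(stmt)

    def commit(self):
        if self.store.committed is self.base:
            # Our phase is built on the current root: adopt it directly.
            self.store.committed = frozenset(self.view)
        else:
            # Someone committed first: abandon our tree and replay the log
            # against the new root (a real system would abort on conflict).
            view = set(self.store.committed)
            for stmt in self.log:
                view.add(stmt)
            self.store.committed = frozenset(view)
```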

Next comes the node pool. At this point it is a single point of contention, though there is not much overhead associated with it. Still, it will eventually need attention.

The simplest idea is to have a node pool manager that operates like our current node pool. Any transaction which needs to allocate nodes can ask the manager for its own private node pool. The manager can hand out a slab of node numbers (in groups of, say, 1000) and can also provide a list of nodes from the free list that it holds. Like the current node pool, these temporary node pools will allocate from their free lists first, and then hand out nodes from an incrementing counter. Once a temporary node pool is exhausted, the transaction can request a new one. If a transaction is committed or rolled back then all the unused nodes will go onto the node pool manager's free list.
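In sketch form (hypothetical names, and a tiny slab size for illustration):

```python
class TempNodePool:
    def __init__(self, free, lo, hi):
        self.free = free               # nodes recycled from the manager's free list
        self.lo, self.hi = lo, hi      # a private slab of numbers [lo, hi)

    def allocate(self):
        if self.free:
            return self.free.pop()     # free list first, like the real pool
        if self.lo < self.hi:
            n = self.lo
            self.lo += 1
            return n
        raise IndexError("pool exhausted: ask the manager for a new one")

    def remaining(self):
        return self.free + list(range(self.lo, self.hi))

class NodePoolManager:
    def __init__(self, slab=1000):
        self.next = 1                  # next unallocated node number
        self.free = []                 # nodes returned by finished transactions
        self.slab = slab

    def new_pool(self):
        free, self.free = self.free, []
        lo, self.next = self.next, self.next + self.slab
        return TempNodePool(free, lo, lo + self.slab)

    def release(self, pool):
        # On commit or rollback, unused nodes return to the manager's free list.
        self.free.extend(pool.remaining())
```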

This idea will cycle many more nodes through the free lists, but they should also get used up pretty quickly, so the free list won't grow without bounds.

I also considered the idea of using interleaved node pools. This involves a fixed set of n node pools, each of which hands out nodes which are all equal mod n. Then instead of asking a single node pool for the next available node, a transaction can ask one of n node pools. Unfortunately, while this works for up to several times n threads, it won't scale for large numbers of writing threads. However, it might be worth considering just because there may never be that many writing threads, and it does provide some extra options. For a start, there need be no centralised free list. To determine which free list a deleted node should be returned to, a thread need merely take that node number mod n. It also means that starting new transactions would not incur any extra overhead from creating a temporary node pool.
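A sketch of the interleaved scheme (again my own toy version; node numbers here start at zero, so pool i hands out numbers congruent to i mod n):

```python
class InterleavedPools:
    def __init__(self, n):
        self.n = n
        self.free = [[] for _ in range(n)]   # one free list per pool
        self.next = list(range(n))           # pool i allocates i, i+n, i+2n, ...

    def allocate(self, i):
        # A writing thread picks one of the n pools and allocates from it.
        if self.free[i]:
            return self.free[i].pop()
        node, self.next[i] = self.next[i], self.next[i] + self.n
        return node

    def release(self, node):
        # No central free list: the node number itself says which pool owns it.
        self.free[node % self.n].append(node)
```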

I only mention this as an idea, because I really feel that the scaling problems rule it out.

Rant About AVL Deletions
By the way... has anyone ever noticed that all these sites out there that discuss rebalancing AVL trees seem to always get it wrong when deleting nodes? Everyone knows how to rebalance after inserting, but I've read a lot of sites which discuss this, and every single one of them got removal wrong. Many of these were course notes from university subjects! Oh, they MOSTLY work, but throw a heap of random data at them, and start adding and removing nodes at random, and after about your millionth modification of the tree you'll suddenly discover that it is out of balance.

It was a couple of years ago now, but it took DM and me days to find a problem we had with removals. That was the last time I trusted a university lecturer's algorithm. ;-) Surprisingly though, we did a search and found lots of implementations, and they were all subtly wrong. We had to work it out for ourselves.

Hmmmm, it's been a while now, but maybe I should write up a web page on how to remove nodes from AVL trees. OTOH, maybe everyone realised the error of their ways (or maybe they just downloaded Kowari) and now has it right, but I somehow doubt it.
