Monday, February 13, 2006

With work and the move keeping me busy it's been a while since I last looked inside Kowari. To kind of ease myself back into it, I thought I should finally get around to writing some notes on the Kowari file system.

At all levels of abstraction inside the Kowari file system, the concept of phases keeps appearing. Phases are a way to keep track of a constant state in the system, regardless of any write activity that may be occurring. They are also used to serialize write access from different connections. While this does enforce consistency in the data on disk, the inability to allow multiple parallel writers is a current weakness in Kowari. (This is the principle reason the XA2 file system has been proposed).

Phases were borrowed from the TUX2 filesystem. In this approach, all data is laid out in a tree, with pointers going down from parent nodes to child nodes. This is a common pattern for data storage. (When I get OmniGraffle working again I'll put in a simple diagram of a tree here).

When a change needs to happen to a node in the tree, then that node is first copied to one side. The new copy contains pointers to the same children as the original node. Then each ancestor of this node is copied recursively, all the way up to the root of the tree. Each new parent node will refer to the copied child, but all the other children will be the same as the original node. The result is a pair of roots to two trees which are indistinguishable from each other, and which share many of the same nodes. (This is where I really need the diagram).

At this point, the new nodes can be modified without affecting the old tree. If modifications are required on any nodes in the old tree, these nodes can also be copied, along with all the parents up to an already copied parent. As soon as a node is found that has already been copied, then the existing copy can be adjusted to point to the new child node.

Once the write operation to the tree is complete, the old root will refer to the tree before the write, while the new root will refer to the tree after the write. The old tree was never touched, so it is guaranteed to be completely consistent. Once the new tree has been completely written, and is in a consistent state, then both trees will be completely consistent. That means that it is safe to use either root of the tree, without fear of power blackouts, OS crashes, etc.

At this point, a single number which holds the address for the current root of the tree can be updated. This can be performed atomically on any hard drive. Even if the power fails, every hard drive available today has enough capacitance to complete the write operation of a single block.

To guarantee an atomic write like this (ie. to guarantee a consistent tree on disk) the host operating system has to allow a force operation to disk. This is supported in Java, which in turn relies on the file system API of the underlying operating system. It is concerning that some operating systems have been known to postpone "force" operations (eg. Mac OSX), but this is a matter that is out of the control of user level code. Fortunately, the conditions required for this to cause a problem are far less likely to occur than the likelihood of hardware failure, so the risk is manageable.

Note also that the new root of the tree can be abandoned at any time with no consequences for the original tree. All that is required then is the ability to clean up or recover all the copied nodes created for the abandoned tree.

A new root to a tree as described here is called a new phase. Selecting a phase to be the new primary root of the tree, is known as committing the phase. Committed phases are always in a consistent state, removing the need to journal modifications.

Another feature of phases is that at any point in time a phase may be "kept" for reading. Any future modifications will start up a new phase (creating a new root to the tree). This means that it is possible to keep multiple phases active at any time, with each phase representing a "snapshot" of the data in the tree at the time that phase was kept. This is important for multiple readers to be able to get the latest available data while a long write operation is continuing. To ensure that the snapshots are kept consistent, they must not be modified. This means that only the most recent phase may be written to.

This describes Kowari's inability to allow multiple writers. However, this strategy is done to ensure that Kowari has a completely robust filesystem that is efficiently capable of withstanding any error, including the sudden removal of power. While speed and robustness are important, enterprise systems also require multiple concurrent writers. This has led to the plans for XA2.

XA2 will allow for multiple writers, but this in turn will require a form of journaling. Journaling is less efficient than phase trees, but we expect to gain in other areas of efficiency, making the overall performance even better. The system is more complex than the phase trees discussed here, so I will leave discussion of this for another time.

No comments: