Tuesday, March 07, 2006

BlockFile

BlockFile is an interface for accessing files full of identically sized "blocks" of data. This interface is like IntFile, in that they both describe the abstract methods for accessing their respective files, and both have two concrete implementations: one for mapped file access, and the other for normal I/O access. In the case of BlockFile these are MappedBlockFile and IOBlockFile.

Unlike IntFile, the concrete implementations for BlockFile share a significant portion of their code. This is shared between the two classes by having them both extend an abstract class called AbstractBlockFile, which in turn implements BlockFile. So MappedBlockFile and IOBlockFile do not implement BlockFile directly.

Redundant Interfaces

This arrangement makes BlockFile superfluous, since AbstractBlockFile can be used in every context where BlockFile might have been. However, using an interface in this way is a well established technique in Kowari. It allows for the rapid development of alternative implementations as the need arises. It also permits an object that fulfils a number of roles to implement more than one interface. These advantages both help the codebase stay flexible for ongoing development.

Finally, these interfaces, redundant or not, provide a consistent way of documenting the method of using a class (its platonic "interface", as opposed to the grammatical construct in the Java language).

Records

The point of a BlockFile is to store a series of records of identical structure in a file. Each record will be exactly the same size. The data structure within that record is left up to the caller. Kowari uses BlockFiles to store numerous data types, including strings (of defined lengths), numbers, tree nodes, and linked lists.
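
To make this concrete, here is a hypothetical 16 byte record packed into a buffer. This is not Kowari's actual layout, just an illustration of how a fixed record size lets a record's position in the file be computed directly:

import java.nio.ByteBuffer;

// Hypothetical 16-byte record: an 8-byte node ID followed by two 4-byte ints.
// Every record occupies exactly RECORD_SIZE bytes, so record i always starts
// at byte offset i * RECORD_SIZE and can be located without scanning the file.
public class RecordLayoutSketch {
  static final int RECORD_SIZE = 16;

  public static void main(String[] args) {
    ByteBuffer record = ByteBuffer.allocate(RECORD_SIZE);
    record.putLong(42L);   // node ID (hypothetical field)
    record.putInt(7);      // left child (hypothetical field)
    record.putInt(9);      // right child (hypothetical field)

    long offsetOfRecord = 1000L * RECORD_SIZE;  // record 1000 is always at this offset
    System.out.println("Record 1000 starts at byte " + offsetOfRecord);
  }
}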

Files which hold records like this are essential to scalable systems, as data from anywhere in the system can be found and modified rapidly. Variable length structures can be more space efficient, but they require that a file be traversed before data can be found, and even when they are indexed for rapid retrieval, modifying them in place is often impractical.

As a side note, the expected replacement for the Kowari file system will use variable length structures, but with inherent indexing to make traversal faster than it currently is. The problem with modifying data will still exist, but this will be managed in a linear annotation queue, which will be merged back into the main data silently in the background. The result should be much faster read and write operations, particularly for terabyte scale files.

Methods

setNrBlocks() and getNrBlocks() are used to set and get the length of the block file. Since the file may only be viewed as a collection of blocks, its length is only measured in units of blocks.

Like all the other file-manipulating classes, clear() is used to remove all data from the file, and set it to zero length. force() is used to block the process until all data is written to disk. unmap() attempts to remove any existing file mappings, close() closes the file, and delete() closes the file and removes it from the disk. See IntFile for a more detailed explanation.
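
For reference, the methods described so far look roughly like the following as a Java interface. The signatures are guesses based on the descriptions above, not a copy of the real Kowari interface:

// A sketch of the basic file methods, with guessed signatures.
public interface BlockFileSketch {
  void setNrBlocks(long nrBlocks) throws java.io.IOException;  // resize, measured in blocks
  long getNrBlocks();                                          // current length, in blocks
  void clear() throws java.io.IOException;    // remove all data and truncate to zero length
  void force() throws java.io.IOException;    // block until all data is written to disk
  void unmap();                               // remove any existing file mappings
  void close() throws java.io.IOException;    // close the file
  void delete() throws java.io.IOException;   // close the file and remove it from disk
}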

The remaining methods are all for manipulating the blocks within this file.

Blocks

The Block class was created to abstract away access to the records stored within a BlockFile. A block is always tied back to its location on disk, so any operations on that block apply to that specific location. This is important for memory mapped files, as it means that a write to a block is immediately reflected on disk (though disk I/O can be delayed by the operating system) without performing any explicit write operations with the block.
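
A stripped-down sketch of the idea looks like this. The names are illustrative only; the essential point is that the buffer stays associated with a block ID, and for a mapped file that buffer would be a slice of the mapping itself:

import java.nio.ByteBuffer;

// Illustrative only: a buffer tied to one position in a BlockFile.
public class BlockSketch {
  private long blockId;             // position of this block within the file
  private final ByteBuffer buffer;  // the block's data

  public BlockSketch(long blockId, ByteBuffer buffer) {
    this.blockId = blockId;
    this.buffer = buffer;
  }

  public long getBlockId() { return blockId; }

  public void setBlockId(long blockId) { this.blockId = blockId; }  // rebind to another position

  public long getLong(int offset) { return buffer.getLong(offset); }

  public void putLong(int offset, long value) { buffer.putLong(offset, value); }
}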

Implementations of BlockFile are also expected to manage Blocks through an object pool. This was particularly important in earlier JVMs, where object allocation and garbage collection were expensive operations. Fortunately, recent versions of Java no longer suffer these effects, and an object pool can even slow a process down.

The other reason to use an object pool for blocks is to prevent the need to re-read any blocks still in memory. This is not an issue for memory mapped files, but if the blocks are accessed with read/write operations then they need to be initialized by reading from disk. The object pool for blocks will re-use a block associated with a particular file address whenever one is available. If one is not available, then the standard pool technique of re-initializing an existing object is used. With Java 1.5, it is this last operation which may be discarded.
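
The pooling behaviour just described can be sketched as follows, reusing the BlockSketch class from earlier. This is only an illustration of the technique, not Kowari's ObjectPool class:

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Prefer a free block already bound to the requested blockId (its data may
// still be valid); otherwise hand back any free block for re-initialization.
public class ObjectPoolSketch {
  private final Map<Long, BlockSketch> byBlockId = new HashMap<>();
  private final Deque<BlockSketch> free = new ArrayDeque<>();

  public BlockSketch acquire(long blockId) {
    BlockSketch exact = byBlockId.remove(blockId);
    if (exact != null) {
      free.remove(exact);          // it was also on the general free list
      return exact;                // still holds the data for this blockId
    }
    BlockSketch any = free.poll(); // may be null: the caller then allocates a new Block
    if (any != null) {
      byBlockId.remove(any.getBlockId());  // it will be rebound to a new position
    }
    return any;
  }

  public void release(BlockSketch block) {
    byBlockId.put(block.getBlockId(), block);
    free.push(block);
  }
}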

As a consequence of the above, the remaining BlockFile methods all accept an ObjectPool as a parameter. This may appear unnecessary, since a pool would usually be associated with an instance of the BlockFile class, but in this case there are several pools which may be used, depending on the needs of the calling code.
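
Guessing at the signatures, the block-manipulating methods look something like this. The ObjectPool parameter reflects the point above; the parameter order, types and exceptions are assumptions drawn from the descriptions in the next section:

// Guessed signatures only; not the real Kowari interface.
public interface BlockMethodsSketch extends BlockFileSketch {
  BlockSketch allocateBlock(ObjectPoolSketch pool, long blockId) throws java.io.IOException;
  BlockSketch readBlock(ObjectPoolSketch pool, long blockId) throws java.io.IOException;
  void releaseBlock(ObjectPoolSketch pool, BlockSketch block);
  void writeBlock(BlockSketch block) throws java.io.IOException;
  void copyBlock(BlockSketch block, long newBlockId) throws java.io.IOException;
}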

Block Methods

allocateBlock() is used to get a Block for a particular file position, defined by a blockId parameter, without reading the file for any existing data. This is intended for new positions in the file, where the data was previously unused: there is no reason to read the data at this point, and the read operation would be time consuming. Writing the resulting block will put its contents into the file at the given block ID.

readBlock() is similar to allocateBlock(), but the block is expected to already contain data. The Block object is allocated and associated with the appropriate point in the file, and the data is read into it.

Since read operations are never explicit in memory mapped files, the allocateBlock() and readBlock() methods are identical in the MappedBlockFile class.
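
To see why no explicit read is needed, here is a standalone demonstration of slicing a block out of a memory mapped file. It is not Kowari's MappedBlockFile (the file name and block size are made up), but it shows that a block is just a view onto the mapping, so "allocating" and "reading" a block amount to the same thing:

import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedSliceSketch {
  public static void main(String[] args) throws Exception {
    int blockSize = 256;
    try (RandomAccessFile raf = new RandomAccessFile("blocks.dat", "rw")) {
      raf.setLength(blockSize * 4);  // room for 4 blocks
      MappedByteBuffer map = raf.getChannel()
          .map(FileChannel.MapMode.READ_WRITE, 0, raf.length());

      long blockId = 2;
      ByteBuffer dup = map.duplicate();
      dup.position((int) (blockId * blockSize));
      dup.limit((int) ((blockId + 1) * blockSize));
      ByteBuffer block = dup.slice();  // a view over block 2: its bytes are the file's bytes

      block.putLong(0, 99L);  // lands in the file without any explicit write() call
    }
  }
}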

Any blocks obtained from allocateBlock() and readBlock() should be passed to releaseBlock() once the caller has finished with them. This allows the block to be put back into the object pool. While not strictly necessary, it should improve performance.

writeBlock() writes the contents of a Block's buffer to that Block's location on disk. This performs a write() in IOBlockFile and is an empty operation (a no-op) in MappedBlockFile.
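
Putting these calls together, a caller might use the sketched interfaces from above along the following lines. The flow (allocate a new block, fill it, write, release; later read it back and release) follows the descriptions here, while the exact signatures remain guesses:

// Hypothetical caller, built on the sketch types defined earlier.
public class BlockUsageSketch {
  static void appendAndReread(BlockMethodsSketch file, ObjectPoolSketch pool)
      throws java.io.IOException {
    long newId = file.getNrBlocks();                      // next unused block position
    file.setNrBlocks(newId + 1);                          // grow the file by one block

    BlockSketch fresh = file.allocateBlock(pool, newId);  // new position: nothing to read
    fresh.putLong(0, 12345L);                             // fill in the record
    file.writeBlock(fresh);                               // explicit write (a no-op when mapped)
    file.releaseBlock(pool, fresh);                       // hand the Block back to the pool

    BlockSketch existing = file.readBlock(pool, newId);   // existing data, so it is read in
    long value = existing.getLong(0);                     // 12345 again
    file.releaseBlock(pool, existing);
  }
}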

The copyBlock() method sets a new block ID on a given Block, so the same data now belongs to a new block. On IOBlockFile this is a simple operation, as the buffer remains unchanged. However, on MappedBlockFile the contents of the original block have to be copied into the new block's position. This is unusual, since MappedBlockFile normally does less work than IOBlockFile in its methods.


modifyBlock() and freeBlock() should be removed from this class, as they are operations designed for ManagedBlockFile, a class which provides a layer of abstraction over BlockFile and will be described later. The only implementations of these methods are in AbstractBlockFile, where an UnsupportedOperationException is thrown.

Block File Types

The BlockFile interface also includes a static inner class which defines an enumeration of the types of BlockFile. This is the standard enumeration pattern used before Java 1.5.

The defined types are:
MAPPED: The file is memory mapped. Reading and writing are implicit, and the read/write methods perform no function.
EXPLICIT: The file is accessed with the standard read/write mechanism.
AUTO: The file is accessed by memory mapping if the operating system supports a 64-bit address space. Otherwise file access is explicit.
DEFAULT: This is set to AUTO, unless overridden on the command line by defining tucana.xa.defaultIOType as mapped, explicit or auto.

The code for determining this setup is in the static initializer of the enumeration class.

This enumeration class is used by the factory methods in AbstractBlockFile to determine which concrete implementation should be instantiated.
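
For anyone unfamiliar with the pre-1.5 pattern, it looks roughly like the following, including the system property check for DEFAULT. The property name comes from Kowari; the rest of the class is a simplified sketch rather than the real code:

// Typesafe enum pattern, as used before Java 1.5 introduced the enum keyword.
public final class BlockFileTypeSketch {
  public static final BlockFileTypeSketch MAPPED   = new BlockFileTypeSketch("mapped");
  public static final BlockFileTypeSketch EXPLICIT = new BlockFileTypeSketch("explicit");
  public static final BlockFileTypeSketch AUTO     = new BlockFileTypeSketch("auto");
  public static final BlockFileTypeSketch DEFAULT;

  static {
    // DEFAULT is AUTO unless overridden with -Dtucana.xa.defaultIOType=mapped|explicit|auto
    String io = System.getProperty("tucana.xa.defaultIOType", "auto");
    if (io.equalsIgnoreCase("mapped")) {
      DEFAULT = MAPPED;
    } else if (io.equalsIgnoreCase("explicit")) {
      DEFAULT = EXPLICIT;
    } else {
      DEFAULT = AUTO;
    }
  }

  private final String name;

  // Private constructor: the static fields above are the only instances.
  private BlockFileTypeSketch(String name) {
    this.name = name;
  }

  public String toString() {
    return name;
  }
}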
