I was going to complain cathartically tonight, but Anne's done it for me. I've started going to the gym again tonight, so that's helped me feel a bit better, despite my lack of catharsis.
We have a chain on the door now, so hopefully we'll have no more incidents. I'm still feeling paranoid though.
There is a lot to be said for describing a system from the top down. The overall structure comes into view very quickly, and there is a concentration on the general architectural concepts which are so important for understanding and modifying the code.
However, I have found that describing Kowari at a high level always seems to lead to questions about the details. There are concepts that people seem to struggle with until they see how all the details are managed. This encourages me to approach the system from the ground up. An obvious advantage of this approach is that there are no dependencies on unexplained systems. Conversely, a lot of ground has to be covered before the overall structure comes into view.
There are also a number of operations at the bottom levels which are created to support higher level concepts. While the operations are easy to follow, the need for them may be unclear until the higher level code is viewed.
Possibly the best compromise is to start at the top, explain the requisite lower-level details as needed, and return to the top whenever possible. My main problem here is that the theme of the discussion jumps around a lot. I may also end up discussing details which people didn't need, while skipping those which were more interesting.
In the end I've decided to go bottom-up. At least there I can be consistent with this approach, and I don't have to be concerned about missing details. More importantly, once I start getting higher up in the architecture, it will be possible to refer back and forth as required (that's an advantage of written explanations over spoken ones). The one piece of architecture that I feel is important to know at this level is the Phase Tree design. This was covered in my last entry.
The descriptions I'll provide for the lower level classes should be considered a supplement to the Javadoc. If anyone would like to see more info on anything, then please let me know.
I'll start in the
This class provides the semantics of an array of 64 bit
long values, backed by a file. It has operations to set the size of the array (
setSize(int)), and to set and get values from any position from 0 to the end of the array (
putLong(long, long) and
A less used set of options is the getters and setters which can be used to treat the structure as an array of 32 bit
int values (
putInt(long, int) and
getInt(long)), or 8 bit
putByte(long, byte) and
This class also provides storage for an array of "unsigned" 32 bit integers. This is handy for file access, and managing some data types. However, unsigned values are not directly supported in Java (with the exception of the
char type). To manage them, they are passed around in 64 bit
longs so they get handled correctly, but they are still stored on disk in a 32 bit pattern. The methods here are called
putUInt(long, long) and
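The unsigned handling described above comes down to a mask on the way out and a narrowing cast on the way in. Here is a minimal sketch of that idea, assuming an in-memory ByteBuffer in place of the file. The class and method names (UnsignedIntSketch, storeUnsigned, loadUnsigned) are made up for illustration and are not Kowari's.

```java
import java.nio.ByteBuffer;

/** Hypothetical helper for illustration only; this is not Kowari's IntFile code. */
final class UnsignedIntSketch {
    private static final long UINT_MASK = 0xFFFFFFFFL;

    /** Stores the value as a 32 bit pattern at the given int index. */
    static void storeUnsigned(ByteBuffer buf, int index, long value) {
        if (value < 0 || value > UINT_MASK) {
            throw new IllegalArgumentException("not an unsigned 32 bit value: " + value);
        }
        // The narrowing cast keeps the low 32 bits; only the bit pattern is stored.
        buf.putInt(index * Integer.BYTES, (int) value);
    }

    /** Reads the 32 bit pattern and widens it to a long without sign extension. */
    static long loadUnsigned(ByteBuffer buf, int index) {
        return buf.getInt(index * Integer.BYTES) & UINT_MASK;
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(2 * Integer.BYTES);
        storeUnsigned(buf, 0, 4_000_000_000L);   // too large for a signed int
        System.out.println(buf.getInt(0));       // the raw signed view of the same bits
        System.out.println(loadUnsigned(buf, 0)); // the value as it was stored
    }
}
```

Without the mask, the widening conversion from int to long would sign-extend, and any value with the top bit set would come back negative.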
IntFile is an abstract class, and implements minimal functionality. The real work is done in a pair of concrete classes called MappedIntFile and ExplicitIntFile.
MappedIntFile uses the New IO (NIO) functionality introduced in Java 1.4. It memory maps a file, allowing the underlying operating system to manage reading and writing of the required memory buffers. This leverages the efficient management of disk and memory that is essential in modern operating systems, and avoids the traditional copy to or from additional memory buffers.
Unfortunately, memory mapping files in this way is restricted by the size of the address space of the process. On a 32 bit system, this creates a maximum limit of 4GB, though the program and the operating system must use some of this space. 64 bit systems don't do much better, as many 64 bit operating systems impose limits within their address space for various operational reasons. In Java, the addresses for NIO are all 32 bit int values. Since this is a signed value (the sign using up one bit), the maximum size for a single file mapping is 2GB.
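To make the mapping idea concrete, here is a minimal sketch of treating a memory-mapped file as an array of longs. It is only an illustration of the NIO calls involved, not the MappedIntFile code: the class name MappedLongArraySketch and its roundTrip method are invented for the example, and it uses try-with-resources, which arrived well after Java 1.4.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.LongBuffer;
import java.nio.channels.FileChannel;

/** Hypothetical example class; not part of Kowari. */
final class MappedLongArraySketch {

    /** Maps the file as an array of {@code length} longs, writes one value and reads it back. */
    static long roundTrip(File file, int length, int index, long value) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
            long byteSize = (long) length * Long.BYTES;
            raf.setLength(byteSize); // size the file before mapping it

            // A single mapping is limited to Integer.MAX_VALUE bytes.
            LongBuffer longs = raf.getChannel()
                    .map(FileChannel.MapMode.READ_WRITE, 0, byteSize)
                    .asLongBuffer();

            longs.put(index, value);  // absolute put: no read()/write() calls involved
            return longs.get(index);  // absolute get from the same mapped memory
        }
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("mapped-sketch", ".dat");
        file.deleteOnExit();
        System.out.println(roundTrip(file, 16, 3, 42L));
    }
}
```

The puts and gets are plain memory accesses as far as the program is concerned. The operating system decides when the pages are read from or written to the file.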
All this means that if a Kowari instance needs more space, it has to revert to using standard read/write operations for accessing the files. This is the purpose of the ExplicitIntFile class. This class's implementation of each of the put/get methods calls down to read/write methods on the underlying java.nio.channels.FileChannel object. This implementation is not as fast as MappedIntFile, but it can operate on much larger files, particularly when address space is at a premium.
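For comparison, here is a rough sketch of the explicit style, where every put or get becomes a positioned write or read on a FileChannel. Again, this is an assumption about the general shape rather than the ExplicitIntFile source, and the names ExplicitLongArraySketch, writeLong and readLong are hypothetical.

```java
import java.io.EOFException;
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

/** Hypothetical example class; not part of Kowari. */
final class ExplicitLongArraySketch {

    /** Writes one long at the given array index using a positioned channel write. */
    static void writeLong(FileChannel channel, long index, long value) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(Long.BYTES);
        buf.putLong(0, value); // absolute put leaves position at 0 and limit at 8
        long base = index * Long.BYTES;
        while (buf.hasRemaining()) {
            channel.write(buf, base + buf.position());
        }
    }

    /** Reads one long from the given array index using a positioned channel read. */
    static long readLong(FileChannel channel, long index) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(Long.BYTES);
        long base = index * Long.BYTES;
        while (buf.hasRemaining()) {
            if (channel.read(buf, base + buf.position()) < 0) {
                throw new EOFException("index " + index + " is past the end of the file");
            }
        }
        return buf.getLong(0);
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("explicit-sketch", ".dat");
        file.deleteOnExit();
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
            FileChannel channel = raf.getChannel();
            writeLong(channel, 5, 123456789L);
            System.out.println(readLong(channel, 5));
        }
    }
}
```

Each access here costs a system call and a copy through an intermediate buffer, which is where the speed difference against the mapped version comes from. In exchange, the positions are 64 bit longs, so the file size is not tied to the process address space.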
The constructors for each of the concrete implementations of IntFile all have default (package) access, so they are not public. Instead, the IntFile class contains factory methods which instantiate the appropriate concrete class according to requirements.
The factory methods are all called open(...), and accept either a String or a java.io.File object. They will open an existing file when one exists, or create a new file otherwise.
First of all, the system properties are checked for a value called "tucana.xa.forceIOType". If it doesn't exist, then it defaults to using MappedIntFile. Otherwise, it makes a choice based on the expected values of "mapped" or "explicit". Any other value will fall back to MappedIntFile and give a warning in the log.
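The selection logic just described could be sketched roughly as follows. Only the property name and the two expected values come from the description above. The IoTypeSketch class, its IoType enum, the case-insensitive comparison and the use of java.util.logging are all assumptions made for the example.

```java
import java.util.logging.Logger;

/** Hypothetical example class; the real choice is made inside IntFile's factory methods. */
final class IoTypeSketch {

    enum IoType { MAPPED, EXPLICIT }

    static final String PROPERTY = "tucana.xa.forceIOType";

    // java.util.logging is a stand-in here for whatever logger the real code uses.
    private static final Logger LOG = Logger.getLogger(IoTypeSketch.class.getName());

    /** Picks the IO type from the system property, defaulting to mapped IO. */
    static IoType chooseIoType() {
        String forced = System.getProperty(PROPERTY);
        if (forced == null) {
            return IoType.MAPPED; // property not set: use the default
        }
        // Ignoring case is an assumption of this sketch.
        if (forced.equalsIgnoreCase("mapped")) {
            return IoType.MAPPED;
        }
        if (forced.equalsIgnoreCase("explicit")) {
            return IoType.EXPLICIT;
        }
        LOG.warning("Unrecognised value for " + PROPERTY + ": '" + forced + "', using mapped IO");
        return IoType.MAPPED;
    }

    public static void main(String[] args) {
        System.out.println(chooseIoType());
    }
}
```

With a check like this in place, running the JVM with -Dtucana.xa.forceIOType=explicit would select explicit IO without any code changes.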
Unfortunately, we have found a bug that occasionally manifests in ExplicitIntFile. DavidM tracked it down, but it is tough to reproduce. I don't have the details, but on hearing this I audited this class, and it all appears correct. Until we have a solution to this problem, the factory method is temporarily using MappedIntFile in all cases (this has been a problem for some users, and needs fixing).
Forcing, Clearing, and Deleting
The remaining methods on IntFile are for managing the file, rather than the data in it.
force() is used to ensure that any data written in the putXXX(long, XXX) operations has been written to disk. This relies on the operating system for the guarantee of completeness. This is relevant to both mapped and explicit IO, as both are affected by write-behind caching. This operation is essential for data integrity, when the system needs to know that all files have reached some fixed state.
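In NIO terms, forcing looks slightly different for the two styles of IO: a mapped buffer has its own force() method, while explicit writes are flushed with force(boolean) on the channel. The following is a minimal sketch of both calls side by side. The class ForceSketch and its writeAndForce method are hypothetical and only illustrate the calls, not how IntFile arranges them.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

/** Hypothetical example class; not part of Kowari. */
final class ForceSketch {

    /** Writes value via a mapping and value + 1 via explicit IO, forcing each to disk. */
    static void writeAndForce(File file, long value) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
            raf.setLength(2 * Long.BYTES);
            FileChannel channel = raf.getChannel();

            // Mapped IO: force() on the buffer asks the OS to write the mapped region out.
            MappedByteBuffer mapped = channel.map(FileChannel.MapMode.READ_WRITE, 0, Long.BYTES);
            mapped.putLong(0, value);
            mapped.force();

            // Explicit IO: force() on the channel flushes writes made through it.
            ByteBuffer buf = ByteBuffer.allocate(Long.BYTES);
            buf.putLong(0, value + 1);
            while (buf.hasRemaining()) {
                channel.write(buf, Long.BYTES + buf.position());
            }
            channel.force(false); // false: file content only, not metadata
        }
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("force-sketch", ".dat");
        file.deleteOnExit();
        writeAndForce(file, 7L);
        System.out.println(file.length());
    }
}
```

In both cases the guarantee is only as strong as the one the operating system and the storage device provide.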
clear() leaves the file open, but truncates it to zero length, thereby deleting the entire contents of the array.
close() simply releases the resources used for managing the file, and is primarily used during shutdown of the system.
delete() also releases all the resources, but then deletes the file as well.
The final method to consider is unmap(). This method is only relevant to the MappedIntFile class, but it must be called regardless of the implementing class used. This is because any calling code cannot know which implementation of IntFile is being used (this is intentionally hidden, so it can be swapped with ease).
When unmap() is called on MappedIntFile, all the references used for the mapping are explicitly set to null. This allows the Java garbage collector to find the mappings and remove them, thereby freeing up address space.
This is an important operation, but it is difficult to enforce. Java does not permit explicit unmapping of files, since allowing it would permit access to memory which is not allocated (a General Protection Fault on Windows, a segfault on x86 Linux, and a bus error on Sparc Linux and Mac OS X). The closest we can come to forcing this behavior is to make the mapping available to be garbage collected, and run the garbage collector several times until the mapping has been cleaned up. This actually works on most operating systems, but needs to be iterated much more on Windows before it works.
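The "null the references and keep running the collector" approach can be sketched like this. It is a simplified illustration under my own assumptions, not the MappedIntFile code: the UnmapSketch class, the WeakReference used to detect collection, the attempt limit and the short sleep are all invented for the example.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.lang.ref.WeakReference;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

/** Hypothetical example class; not part of Kowari. */
final class UnmapSketch {
    private MappedByteBuffer mapping;

    UnmapSketch(File file, int size) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
            raf.setLength(size);
            mapping = raf.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, size);
        }
    }

    boolean isMapped() {
        return mapping != null;
    }

    /**
     * Drops the only strong reference to the mapping and repeatedly asks for a
     * garbage collection. Returns true if the buffer object was collected within
     * maxAttempts; the JVM releases the address space at or shortly after that point.
     */
    boolean unmap(int maxAttempts) throws InterruptedException {
        WeakReference<MappedByteBuffer> probe = new WeakReference<>(mapping);
        mapping = null; // make the mapping unreachable
        for (int attempt = 0; attempt < maxAttempts && probe.get() != null; attempt++) {
            System.gc();      // only a request; the JVM may ignore it
            Thread.sleep(10); // give the collector a moment to run
        }
        return probe.get() == null;
    }

    public static void main(String[] args) throws Exception {
        File file = File.createTempFile("unmap-sketch", ".dat");
        file.deleteOnExit();
        UnmapSketch sketch = new UnmapSketch(file, 4096);
        System.out.println("collected: " + sketch.unmap(50));
    }
}
```

Since System.gc() is only a hint, the return value tells the caller whether the attempt limit was high enough, which is the knob that would need to be larger on a platform that is slow to release mappings.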
There are some specific details in MappedIntFile that need to be addressed, but I'll have to leave here and get some sleep. Hopefully I won't be woken by the Police tonight...
Tuesday, February 21, 2006