Friday, August 26, 2005

Final Week
Once making it to the final week, the plan was to go through the remaining layers of the storage code, and use any remaining time to go through examples and questions.

There are three main areas components of functionality in Kowari's storage layer: The node pool, the string pool, and the statement store. Once upon a time these three operated together, but now the statement store is off on its own behind the new Resolver interface, while the node and string pools are accessible to the whole system. However, their unified functionality has not changed.

All three components handle transactions autonomously, managing the phases of all their underlying components. The overall Session class uses a two-phase commit operation to keep the transactions of each component in synch with the others. It is also in these top level components that the files are all managed. The files which are created and manipulated at this level are generally used by the classes at lower levels (for instance, the BlockFiles and IntFiles which are used by FreeList and AVLTree) but there are also other files which handle session locking, and persistence of phase information for the transactions.

Once I'd reached this level, I had all of the information needed to explain the data formats in each file. It was awkward to explain the structures before this stage, since several important structures (notably, the phase information) contain information from every layer. Trying to describe a single layer leaves large holes in the structure, and has led me into confusing conversations in the past when I try to skip over these holes. But at this point I was finally able to write up the file structures, one byte at a time (the values are typically 64-bit longs, but ints and bytes are occasionally used).

I'd like to properly document each of these components, along with the associated file formats, but for the moment I'll just give an overview.

Node Pool
The idea of the node pool is to allocate a 64-bit number to represent each resource in the RDF data. We call these numbers "graph nodes", or just gNodes. GNodes get re-used if they are freed. The re-use is for several reasons, the most notable being to prevent an numeric overflow (it also helps the string pool if there are few holes in the address space for nodes). However, a resource ID cannot be re-used if there are any old reading phases which still expect the ID to refer to the old data.

These requirements are exactly met by a FreeList, so the node pool just a FreeList along with all the code required for file management and transactions.

String Pool
The string pool holds all of the URI References and Literals in the database. When it was first written, the only literals we stored were strings, and since URI's are also represented with a string, we called the component the "string pool". The string pool stores lots of other data types as well, but the name has stayed.

The string pool provides a mapping from gNodes to data objects, and from those objects back to the gNode. It also provides a consecutive ordering for data so that it can be used to easily work with a range of values.

The mapping of a gNode to the data is done with a simple IntFile. Each data element can be represented with a buffer of fixed length (overflows for long data types such as string are stored at a location referred to in this buffer). To find the data buffer for a given gNode, the gNode number is multiplied by the record size of the buffer. This is why the string pool prefers the node pool to re-use gNodes, rather than just incrementing a counter. Given that these records are all the same length, I'm not sure why a BlockFile was not used instead of an IntFile, but the effect is the same.

The mapping of data back to the gNode is accomplished by putting all data into an AVLTree. The records in this tree are identical to the records in the IntFile, with an addition of the gNode to the end of the record. The tree also provides the ordering for the data. This allows strings to be searched for by prefix, URIs to be searched for by domain, and date or numeric ranges to be found.

One problem with this structure, is that it is impossible to search for strings by substring or regex. This is why we have a resolver for creating Lucene models. However, it's been in the back of my mind that I'd love to see if I could build a string pool based on a trie structure. (Maybe one day).

The data structure holds up to 72 bytes in the record. Anything longer than this (typically a string) has the remainder stored in a separate file. We have a series of 20 files to handle the overflow, each storing blocks twice the size of the blocks in the previous file. This lets us have random access to the data, while reducing fragmentation. It also allows us to store data objects of up to several GB, though we don't expect to ever need to handle anything that large.

When the string pool and node pool are combined, they provide a mechanism for storing and retrieving any kind of data and associating each datum with a numeric identifier.

Statement Store
The statement store is the heart of Kowari's approach to storing RDF.

Each RDF statement is stored as a quad of the form subject, predicate, object and model. We originally stored a triple of subject, predicate, object, but quickly realised that we needed the model (or graph) element as well. Again, the interfaces already existed, so the statements are referred to throughout the code as triples rather than quads.

These statements are stored in 6 different AVL trees, with each tree providing a different ordering for the statements. I've already discussed the reason for this at length. Ordering the statements like this allows use to treat the storage as an index.

Of course, the representation of the statements is with the gNode IDs for each resource, rather than the resources themselves. This means that the indexes contain numbers and nothing else.

While simple in principle, the code here is actually quite complex, as it has numerous optimisations for writing to multiple indexes at once. Unfortunately for me, several of these optimisations were introduced after I last had a hand in writing the code, so it needed a little effort for me to understand it sufficiently to explain it to others.

Each of the indexes is handled by a class called TripleAVLFile. This class knows about its required ordering, and manages its own AVLFile. The nodes in this tree actually represent a range of RDF statements, with a minimum, a maximum and a count. By handling blocks of statements like this, the overhead of maintaining the tree is reduced, and searching is increased by a significant linear amount (since it's linear it doesn't show up in a complexity calculation, but this is the real world we're talking about, so it matters). Once the correct node in the tree is found, then it contains a block ID for a block in a separate ManagedBlockFile which contains all of the RDF statements represented by the node.

The 6 TripleAVLFiles manage both their trees and the files full of blocks (the AVLFile and the ManagedBlockFile). This is simple enough when simply reading from the index, but takes some work when performing write operations. Trying to insert into a full block requires that block to be "split" in a similar way to node-splitting in B-trees, but with co-ordination between the AVL tree and the block file. Writes are also performed in a thread owned by the TripleAVLFile, so that multiple modifications to a single location in the index can be serialised rather than being interspersed with writes to the other 5 indexes.

The details of these and other optimisations makes this code a complex subject in itself, so I'll leave a full description for when I get around to proper documentation. I should comment that each of these optimisations were only adopted when they were proven to provide a benefit. Complexity can be the bane of performance, but DavidM did his work well here.

Reads and writes are managed by the statement store, with reads being directed to the appropriate index, and writes being sent to all 6 indexes. The other job of the statement store is to manage transactions, keeping all of the indexes reliably in synch.

A description of the storage layer is completed by describing how the Node Pool, the String Pool, and the Statement Store are all tied together.

When a query is received by the server, it must first be localized (I'm using the American "z" here, since the methods names use this spelling). This operation uses the String Pool to convert all URIs and literals into gNode numbers. If an insert is taking place, then the Node Pool is also used to allocate new gNodes where needed.

A query's principal components are a FROM clause (which is an expression containing UNIONS and INTERSECTIONS between models) and a WHERE clause (which is an expression of conjunctions and disjunctions of constraints). Each constraint in the WHERE clause may have a model specified, else it will operate on the expression in the FROM clause. To evaluate a query, the FROM and WHERE expressions need to be merged. This results in a more complex constraint expression, with each constraint having its model specified. The operations in the FROM clause get transformed into disjunctions and conjunctions in the new constraint expression, hence the increase in complexity for the expression, though the individual constraints are still quite simple.

The server then iterates over the constraints and works out which resolver should be used to evaluate each one. In most cases, this is the "System Resolver" as defined by the PersistentResolverFactory tag in the kowari-config.xml file. By default, this is set to the Statement Store described above.

Once the resolvers for each constraint are found, the constraints are sent off to be resolved. The Statement Store resolver tests the constraint for the location of variables, and uses this to determine which index is needed. It finds the extends of the solution, and returns an object containing this information and a reference to the index, so that the results can be extracted through lazy evaluation.

Next, the results are "joined" according to the structure of the constraint expression. These operations are done with lazy evaluation, so it is not quite as simple as it sounds, particularly when optimisation is considered, but the method calls at a higher level are all straightforward (as you'd expect).

The final results are then globalized by using the String Pool to convert all the numbers back into URIs and literals. The globalized answer then gets serialized and sent to the client.

Insertions and deletions use a similar process to what I've just described.

This really glosses over the details quite a bit, but it provides an outline of the process, and explains what I was doing for most of the week. When I get time I hope to document it properly, at which point this post should help by providing me with an outline.

No comments: