Sunday, April 13, 2008

Writing 2-columns

In my last post I described a scheme for representing 2 columns. But the moment I first thought of it, I decided it was too impractical. After all, each "triple" gets represented with 10 entries. If I want to include a graph identifier (i.e. a Quad store) then it goes up to 12 entries. If I want to cut down on disk seeking, then the idea seemed to be of little more than academic interest.

Then a little while ago I was explaining this scheme to a friend (Inderbir), and in the process I tried to explain why this was going to be impractical, but in the course of the discussion a few things occurred to me.

The statements to represent form a series of "doubles", which need to be indexed two ways: by each column. The data for a single statement will appear like this:
  Statement, _statement_x
SubjectIdentifier, _subject_x
PredicateIdentifier, _predicate_x
ObjectIdentifier, _object_x
_statement_x, _subject_x
_statement_x, _predicate_x
_statement_x, _object_x
_subject_x, my:subject
_predicate_x, my:predicate
_object_x, my:object
Where anything whose name starts with an underscore is a unique identifier. As I'd already mentioned, now that we use 64 bit identifiers in Mulgara, it makes sense to create these from an incrementing long value.

Given that each identifier only gets used for one statement, then the statement, subject, predicate, and object identifiers will all be allocated together, and will be consecutive. Indeed, if these identifiers are kept separate from the identifiers that will be allocated for the URIs and Literals of the statement, then the statement can be presumed to always be a multiple of 4, and the subject, predicate, and object identifiers will be 1, 2, and 3 greater, respectively. This means that the bottom two bits of the IDs can be used to represent the type of the ID, meaning that the first 4 statements in the above list can be inferred, rather than stored. Also, since the IDs for the subject, predicate, and object positions can be calculated by adding 1, 2, or 3, then the next three statements don't need to be stored either. Cutting the data down to 3 entries suddenly makes it look more interesting.

I should note at this point that I still expect to represent URIs and Literals with IDs that can be mapped to or from the data they represent. While the mechanism for doing this in Mulgara needs to be improved, it is still an important concept, as it reduces redundant storage of strings, and the comparison of Long values allows for faster joins. However, I do intend to return to this idea.

After reducing the data to be stored, we now have:
  _subject_x, my:subject
_predicate_x, my:predicate
_object_x, my:object
Indeed, since each of those IDs are consecutive, and always increasing, then in the index that is sorted by the first column, all three statements will go to the end of the file. This means that the file need not ever have a seek operation performed on it while it is being written to. Operating systems are usually optimized for append-only writing, so this is another bonus.

It is also worth noting that since the predicates are always consecutive, there is no need to write each of them either. Instead, the following can be written for each statement:
  _statement_x, my:subject, my:predicate, my:object
With this data, all of the above can be inferred. Indeed, the need for the statement to take up an ID on it's own can be dropped, and the subject, predicate, and object IDs are calculated by adding 0, 1, and 2 to the first ID. This leaves space for a fourth element, such as a graph identifier, before needing more than 2 of the low-order bits to give the type of the identifier.

On the other file, we will be storing the same data in reverse order:
  my:subject, _subject_x
my:predicate, _predicate_x
my:object, _object_x
In this case, the identifiers for the URIs, blank nodes, and literals of the subject, predicate and object will be all over the place (and will be regularly re-used), so there is no guarantee of ordering here. This means we have to go back to standard tree-based indexing of the data. However, we only have 3 search operations to go through here, which is significantly better than the searching we currently do in Mulgara.

Note that all of the above applies to statements with more than 3 elements as well. Each new element in a statement increases the size of the single write on the first index by one more long value, and adds one more seek/write operation to the second index. This is far less expensive than expanding the size of the "complete" indexes used in Mulgara.

Retrieving

I'll stop for a moment, and take a look at what a read operation looks like.

The first index file is written to linearly. Each record is identical in size, and the ID that starts the record is monotonically increasing. If the store were write-once-read-many (WORM), then the ID could be skipped altogether as this information would be inferred from the offset within the file. This may be useful for some applications, but I'd prefer to delete information in place (rather than creating a white-out list for later merging), meaning that the ID is still required in this case.

For this kind of structure, the file can be searched using a binary search. Also, the largest offset that an ID can appear at is the value of that ID multiplied by the size of a record, meaning that the number of seeks required for a search can be greatly reduced.

The second index is a standard tree. B-Trees are well known for not seeking much, so for a first cut, I would suggest this (though Andrae have other plans further down the line).

To find all statements that match one element (say, the predicate), then this requires a search on the tree-index, to find the first time that URI appears. The associated predicate ID is paired with a set of IDs that represent the use of that URI in statements (sometimes as predicate, sometimes as subject or object). These IDs are in consecutive order, and so can be merged with the first index as a linear operation. Adding in another element to search by (say, we are looking for a given predicate/object pair) then this becomes another search on the second index, and another linear merge.

Linear merges aren't too bad here, as it is always a linear operation to go through all of the data anyway (meaning that it can't be avoided). The only case where this is an unnecessary expense is if the "count" of a set of statements is required.

Efficiency in the Tree

While considering the above structure, it occurred to me that this index is having to store identifiers for the RDF nodes over and over, even though they all appear next to one another. There are ways of compressing this, but it made me question the redundancy altogether. What if the item was just stored once, and the "satellite data" (to use the term for data associated with key) was instead it's own structure? I thought that maybe this could be a tree, but then it occurred to me that the data represents statement IDs, and will therefore always be inserted in increasing order. So a list is most appropriate.

So now I could have each entry in this tree point to a list of statements that this RDF node participates in. Since the list will always be appended to, it makes sense that this is kept in another file, using a linked list of blocks. However, to cut down on seeks, the first few elements of the list would do well to appear with the node in the original tree.

So what sort of satellite data should be stored? For reading, the head of the list has to be stored, though as just mentioned, I think that this should be inline with the satellite data. The tail of the list should also be stored, else it would require a linear seek to work out where to insert, and this is not scalable. To give some help with management of the list, the size should also be recorded. This also makes counting trivial.

Up until now there has been a presumption that the identifiers of elements in a statement follow a particular bit pattern. However, if the satellite data contains three lists instead of one, then the number of the list is enough to indicate which position the node is used in. For instance, the node of <rdf:type> may have a few entries in the list for subject (indicating that it is the "subject" in just a few statements), may have a few entries in the object list (indicating that there are a few statements which refer to this URI), but will have millions (or more) statements in the predicate list, because this URI indicates a commonly used predicate.

If the presence of a statement ID in one list or another indicates that this node is used in a particular capacity for that statement, then this means that the presumption of using the low order bits of the ID for this purpose is removed. That gives us a little more flexibility.

Data Pool

All of the above presumes that there exists a mechanism to map URIs, strings, and other literal data on to an ID, and to map those IDs back into the original data. Historically, Mulgara has referred to the the store that performed this operation as the "String Pool". Since URIs are encoded as a string, and the first iteration of Mulgara only stored literals in a lexical form, this name was accurate. However, with the inclusion of numbers, dates, and other datatypes, it might be more accurate to refer to this construct as a "Data Pool" instead.

Part of the data pool structure of Mulgara uses a tree containing some (or all) of the data as a key, and a long value as the ID it is mapped to. Storing entries that are keyed on strings or other data is a lot like the second index just mentioned. So now I started to reconsider the presumption of a separate data pool altogether.

Instead of writing to the linear file first, the idea is to write to the tree index first. This involves a search. If the data is found, then the statement ID will be appended to the end of the appropriate list (this updates the linked list block, possibly spilling over into a new block, and then rewrites the tail/size of this list in the tree). If the data is not found, then a new entry is placed in the tree, two lists are initialized to nil, and the third is given the allocated statement ID. The list is not yet long enough to spill into the file full of linked lists, so this isn't too expensive. For a B-Tree with space, this will require writing of just a single block!

Now it isn't feasible to store everything in the tree as a key, so only the head of the data would need to go directly into the tree. The remainder of the data is still needed, but rather than trying to manage this data re-usably, the ideas from last post about keeping all the data in the pool can be adopted. In this case the data can simply be appended to a third file. The offset of this append then becomes the ID of that data. This ID is stored along with the rest of the satellite data in the tree. It is also the ID that gets stored in the first linear index file which can now be written to.

More Efficiency

So now instead of a "Data Pool" and 2 files, the design is now for 4 files. Two of them are only ever appended to, one always has direct seeks before writing, and only one of them is a tree that requires searching before a write can happen. Given that this is the entire store, then that's not too shabby! It's a darn sight better than the 196 files in Mulgara, almost all of which need multiple seeks to do anything.

But can I do better?

Andrae had already been looking at reworking the string/data pool, and a lot of things are quite obvious to do. For a start, any data that can fit into 54 bits (or so) ought to have its value encoded into its ID, with the top bits used for type identification. That many bits lets you encode all bytes, chars, shorts, ints and floats, as well as the majority of long values (and possibly a lot of doubles as well). Any date within a century of now will also fit in. This means that many items that are not strings don't need any extra storage. So along with the type bits, there would be another bit to indicate whether or not the data is encoded in the ID, or if it is found in the data file. Anything that can be encoded into the ID won't have to go into the data file, though it would still go into the indexes so statements using it can still be found. The main difference is that any statements discovered to contain one of these IDs would not require the extra seek to get the remaining information.

Another significant change has already been proposed by Andrae over a year ago. In this case, the different types of the data will be stored in different indexes, which are each optimized to handle such data. This increases the number of files, but only one of these files will be accessed at a time. Also, since each of these types are literals, there is no need for lists describing subject or predicate statements.

Similarly, blank nodes will have their own file, only they will not require any extra data beyond the lists, and no predicate list will be required.

Getting back to the fundamental types of strings and URIs, Andrae pointed out that Tries are an appropriate structure for reducing space requirements. This is perfect for managing the plethora of URIs that appear in the same namespace (or that just start with "http://"), as common prefixes to strings are not repeated in this structure. Like other tree structures, this would let us store arbitrary satellite data, meaning they are perfectly adaptable to this structure.

Interestingly, if we expand the trie to become a suffix trie, then we can get full text searching, which is one of the most common requests that Mulgara gets.

Hashed Predicates

The example I gave above about how <rdf:type> mostly participates in statements as a predicate, is common of many predicates. In many situations, the list of predicates to be used is quite small. In particular, there are likely to be just a few predicates that will be used the majority of the time, such as <rdf:type>, <rdfs:domain>, <rdfs:range>, as well as many application specific values.

Since these URIs are going to be accessed all the time, there isn't a lot of point in burying them deep in the URI tree. Instead, the most common URIs could each be given their own file, which indicates the "predicate statement list" for those URIs. Those URIs can be included in the tree for their subject and object lists, but the code that searches for predicates would skip the tree and go directly to the file instead. Any operations which require iterating over all the predicates can insert these values in via the algorithm, rather than getting it from the tree structure.

However, which URIs would be stored this way? This may vary from one application to another. So instead of hard coding the values in, they could be placed in a configuration file. Then the application would know to map these values directly to their own files instead. Since the filenames can be allocated by the system, they can be created with a hashing algorithm, or possibly be placed in the configuration file along with the predicate URI list.

I'd still prefer to configure this rather than allow ALL predicates to be done this way, as any predicates that are not used so commonly will not take the resources of another file. It also allows the system to have an arbitrary number of predicates beyond the most commonly used. But by having these files dedicated to common predicates, any requests for statements with a given predicate will require a single seek to the start of that file, and will immediately give the list, along with its size.

Comparisons

The evening after I presented this to the Mulgara/Topaz developers back in March, I happened to attend a presentation given on applying columnar databases to RDF. This described storing subject/object pairs in files, with one file per predicate. This particular optimization is similar, but it has a good fallback for when you run out of files for your predicates (after all, searching in good-sized B-Tree typically only requires a couple of seeks). This scheme also provides the ability to search for statements on subject or predicate, which apparently is less efficient in the presented system.

A nice feature that is shared by both this scheme and the columnar scheme is that selecting statements always gives sorted values that can be joined with linear merge-joins.

However, given the flexibility of this structure, I've been encouraged to write it up and let people know about it. Well, I've started that, but I thought it would be good to get something out there straight away, hence this post.

In the meantime, in amongst my SPARQL work I'm trying to build a proof-of-concept. I've done the complexity calculations to see both the worst case and the expected case, but it doesn't take much effort to see that it involves a massive reduction in the seeking, reading and writing done by Mulgara at the present. I won't be including all the optimizations discussed here, but I still expect it to be around two orders of magnitude faster, and to take up a couple of orders of magnitude less space.

Final Notes

None of the above discusses deletions, transactions, or any of that other stuff needed to make a database useful in the real world. These issues haven't been forgotten, but in order to present the structure I wanted to concentrate on the minimalism in reading and writing to the structure.

My plan for deletions is to go through the various lists and mark them with invalid identifiers (e.g. -1). These will have to be skipped linearly during read operations, which means that removing data has little impact on speed (except that blank IDs will never need to be converted into URIs, Literals, etc). At a later time, either by an explicit cleanup operation, or a background task, a cleanup thread will compact the data by shifting it all down to fill the gaps. Of course, this will require some locking for consistency, though since everything is ordered, there may be a chance to minimize locking by skipping any data that repeats or appears out of order.

Andrae has also spent a lot of time working on a theoretic framework for concurrent write transactions in RDF. His work is quite detailed and impressive. Fortunately, the engineering application of this work is completely consistent with this framework, so we hope to eventually integrate the two. In the meantime, Andrae's work will form the basis for XA2, which in turn will be taking a few avenues to permit this scheme to be easily integrated at a later date.

So for now, I have to get SPARQL up and running, while also looking for time to finish the proof of concept and writing everything up. I suppose I should be doing that instead of blogging. :-)

1 comment:

awo101 said...

Your thoughts on URI/literal <-> ID conversion are very interesting. We ran into this issue when using 3store as a backend for mSpace - each mSpace query tends to require the return of quite a few literal strings for display (especially on earlier iterations), so having to undertake potentially several disk seeks to get a literal value for every single ID really was pretty crippling. It's certainly something that needs real attention if IDs are used to represent URI and literal values.

-Alisdair Owens