Tuesday, April 08, 2008

Indexing

I'm in the process of writing a number of things up at the moment, including the following description of RDF storage. But since academic papers take so long to write, and they're boring, I thought I'd blog the main bit of one of the things I'm writing about.

This all came about due to a description I wrote a few years ago about the number of columns needed to store data that was N columns wide. (Wow! Is it really 4 years?) It came down to a process and equation, of finding the minimum value of an expression, as S varies from 1 to N:
minS=1..N (N!/(N-S)!S!)
This gives a result of 3 indices for symmetrically storing triples, 6 indices for quads, 10 indices for quintuples, and so on. Note that this is the number of indices needed if you want to be able to use any search criteria on your tuples. This may indeed be the case for triples and quads, but if an element of the tuple becomes a unique ID (like it does for reification), then there is no need for symmetric indexing.

The rapid growth of this equation is a clear indicator that we want to keep the number of columns as low as possible. For expediency Mulgara moved from 3 columns to 4, so that we could encode graph identifiers with the triples, but that came at the expense of doubling the number of indices. This is really a big deal, as each index in Mulgara takes several files for managing the resources in the index, and for holding the index itself. Each piece of information that has to be read or written means another disk seek. This can be mitigated by read and write-back caching by the operating system, but as the amount of data exceeds what can be handled in memory, then these benefits evaporate. So keeping the number of indices down is a big deal.

Ronald Brachman's work in '77 shaped the future direction of description logics, including the use of the idea that everything can be represented using binary and unary predicates. RDF is defined using binary predicates, and unary predicates are simulated using the rdf:type predicate, which means that RDF is inherently capable of representing description logics, and indeed, any kind of knowledge representation. The issue is that it can be inefficient to represent certain kinds of structures.

The RDF representation of reification requires 3 statements for reification (plus one that can be inferred) and these are independent of the actual statement itself. An extra column can eliminate these 3 statements altogether, but the indexes grow accordingly. Graph membership can be accomplished using extra statements as well, and again, this can be trivially eliminated with an extra column. The question is, when do the extra columns (with the consequent factorial growth) become more expensive than adding in more statements? Should the number of indices be limited to 4? To 3?

2 Columns

I always found it interesting that the equation above has a solution for N=2. I considered this to be an artifact of the equation, but it bugged me all the same. So then a couple of years ago I gave it some thought, and realized that it is indeed possible to represent a triple using "doubles". Of course, once a triple can be represented, then anything can be represented. The question is efficiency.

If the indices were to contain only 2 columns, then this means that only unary predicates could be used. This implies that the predicates define a type. After some thought I realized that I could use unique types to identify each element of an RDF statement, and then a unique type to represent the statement itself. Of course, there is nothing new under the sun, and just recently I discovered that the CLASSIC system introduced unique atomic concepts for each individual in the system in a similar way.

To map the following triple:
  <my:subject> <my:predicate> <my:object<
to unary predicates, I used a scheme like the following:
  Statement(_statement_x)
SubjectIdentifier(_subject_x)
PredicateIdentifier(_predicate_x)
ObjectIdentifier(_object_x)
_statement_x(_subject_x)
_statement_x(_predicate_x)
_statement_x(_object_x)
_subject_x(my:subject)
_predicate_x(my:predicate)
_object_x(my:object)
Where each of _statement_x, _subject_x, _predicate_x and _object_x are unique identifiers, never to be used again. In fact, my use of underscores as a prefix here indicates that I was thinking of them as a kind of blank node: unique, but without a distinguishing label.

When I first came up with this scheme, I thought it a curiosity, but hardly useful. It seemed that significant work would need to be done to reconstruct a triple, and indexing so many items would require a lot of seeking on disk. I was also concerned about the "reckless" use of the address space for identifiers in creating unique IDs for so many elements.

Then recently I was describing this scheme to a friend, and I realized that when I considered some other ideas I'd been working on lately, then there was something to this scheme after all.

Disk Seeking

I've been very disappointed with Mulgara's loading speed on certain types of data recently. If the data has a lot of unique URIs and strings, then the size of the store was getting too large, and the length of time taken to store the data was too long. I was also surprised at the gigabytes of file storage being used when the data files were only a few hundred megabytes. Mulgara is supposed to be scalable, and this wasn't acceptable behavior.

Consequently, I've been doing more work with algorithms and data structures recently. I have not been trying to supplant Andrae's work but was instead hoping to tweak the existing system a little in order to improve performance.

The first thing that becomes apparent is that the plethora of files in Mulgara is a real bottleneck. Each file on its own may be efficient (not all are), but cumulatively they cause a disk to seek all over the place. Since this is probably the single most expensive action a computer can take (other than a network request), then reducing the seeks is a priority.

Profiling the code led to a couple of improvements (these have been rolled into the Mulgara 1.2 release), but also showed that the biggest issue is the String Pool (more properly called the "Data Pool" since it now stores any kind of data). This is a facility that maps any kind of data (like a URI or a string) to a unique number, and maps numbers into the data they represent. With a facility like this, Mulgara is able to store triples (or quads) as groups of numbers. We call these numbers "Graph Nodes", or gNodes.

The string pool was spending a lot of time just searching to see if a URI or string to be inserted into the graph was already mapped to a number, and inserted it if not. Some work was also being done to keep track of what had been allocated in a given transaction phase, so that any allocated resources (like disk blocks) could be freed and reallocated if the data were ever removed. However, items are rarely removed from the string pool. Removals mostly occur when an entire graph is dropped, and these graphs are often dropped just before a slightly modified version of the same data is to be inserted. In this case, the same data will be removed from the string pool, and then re-inserted. That's a lot of work for nothing. It makes much more sense to leave everything in the string pool, and only remove unused items when explicitly requested, or perhaps as a background task. (Unused items can be easily identified since they don't exist in the statement indices).

If the string pool were changed to be a write-once-read-many pool, then a lot of the structures that support resource reuse (Free Lists, which are a few files each) can be removed from the string pool. Of course, the reduced reading/writing involved with removing and re-inserting data would also benefit. So this looked promising.

Another idea is to take any data that fits into less than 64 bits (say, 58 bits) and store it directly in the ID number instead of in the pool. The top bits can then indicate the type of the value, and whether or not it is "stored" or if it is simply encoded in the ID. This covers a surprising range of required numbers, and most dates as well. This idea was mentioned to me in SF last year, and it sounded good, only I had completely forgotten that Andrae had already proposed it a year before (sorry Peter, you weren't first). But wherever the idea came from, it promised to dramatically help dates and numbers. In fact, it helps all the data, since the tree no longer has as many elements stored in it.

There were also other ideas, such as moving the tree type of the index. We mitigated the use of AVL trees in the indices by using pointers to large blocks of data. However, this becomes a subtraction of a constant in the complexity analysis, while a wider tree becomes a division by a constant. Constants don't usually mean much in complexity analysis, but when each operation represents a disk seek, then the difference becomes significant. While this is something that must be looked at, it didn't make sense when we knew that XA2 is coming, and that the trees will change anyway.

Address Space

You may have noticed that I'm talking a lot about resource reallocation, and 64 bits in the same breath. This shows some of the history of Mulgara. The system originally ran on 32 bits, where not reusing resources was a guaranteed way to wrap around in the number space and cause no end of problems. When the system was upgraded to 64 bits, it still made sense to manage resources for reallocation, as some resources were still limited. However, resources that represented IDs in an address space were not reconsidered, and they ought to have been. Looking at what literals could be encoded in a 64 bit value (and how many bits should be reserved for type data) was the impetus I needed to make me look at this again.

Given that every resource we allocated took a finite time that was often bounded by disk seeks, it occurred to me that we were not going to run out of IDs. If we only used 58 bits, then we could still allocate a new resource every microsecond and not run out of IDs for over 9000 years. A more reasonable design period is 100 years (yes, this is a wide margin of safety), and constant allocation of resources at a microsecond per resource means that we still only need 52 bits. So we're safe not reusing IDs, and indeed, we have over a byte of information we can use in this ID to do some interesting engineering tricks.

Structure

So I had a number of these lessons fresh in mind when I recently tried to describe just why a 2 column store was inefficient. During the course of the conversation I started seeing ways in which I could apply some of these techniques in a useful way. It took a while for it to come together, but I now have something that really shows some promise.

The details here are reasonably detailed, so it makes sense to take a break here, and write it all up in a fresh post in the next day or so. A little more sleep might also help prevent the rambling that I've noticed coming into this post. :-)

2 comments:

awo101 said...

On the topic of two-column RDF stores, you may find "Scalable Semantic Web Data Management Using Vertical Partitioning" by Daniel Abadi and Adam Marcus a good read, if you haven't seen it before - it's pretty well written, and won best paper at VLDB 2007.

http://www.vldb.org/conf/2007/papers/research/p411-abadi.pdf

-Alisdair

Quoll said...

Nope, haven't heard of it before now. I'll check it out when I get some time.