Tuesday, December 02, 2008

Dropping Indexes

One of the optimizations I'm making for XA 1.1 is the removal of 3 of our 6 statement indexes. The reason for this is pretty clear: they're almost never used. Why would I want to double our space, and double our contention for the hard drive on data structures that are superfluous?

To date, Mulgara's indexes have been completely symmetric. I still want to maintain this with respect to subject, predicate and object, but I don't really see the need for it with graphs. (That said, the 2-column index in XA2 will have optimizations around common predicates, but in general there will still be symmetry). I've had people say that they want to use millions of graphs, but in reality I've yet to see it. The query languages (TQL, SPARQL, etc) haven't really supported large numbers of graphs anyway.

The index orderings we've had to date have been:
  SPOG
  POSG
  OSPG
  GSPO
  GPOS
  GOSP
For G=Graph, S=Subject, P=Predicate, O=Object.

For anyone unfamiliar with these indexes, they permit a group of statements to be found given any possible pattern of 0, 1, 2, 3 or 4 elements.

The first 3 indexes allow for searching on statements that may occur in any graph. However, almost all queries identify the graphs to be searched in, meaning that we always end up binding the "graph" node before looking for statements. That means that the first 3 indexes are almost never used. However, it's the "almost" which is my problem at the moment.

Fortunately, the first 3 indexes can be easily emulated with our "System graph". This contains a list of all the known graphs, particularly the graphs stored with the "System Resolver" (this is the part of the system that uses the above indexes). Using this information, it is possible to pre-bind the graph node for every possible query. However, I really want to do this at the lowest possible level, so the interface on the resolver remains unchanged.

Dropping the first 3 indexes went smoothly, and 97% of the tests still work (OK, it's 96.99%, but who's counting?). However, the emulation of these indexes will probably take me a few days. That's a shame, as I'd prefer to get it all into the next release, but since I want to do a release before I go to Australia for Christmas (on Monday) then I'm pretty sure I can't do it in time (not if I want robust testing anyway).

Compromises

Emulating the indexes which allow unbound graphs means that I'll need to bind the graph to a series of ALL the graphs in the system. Then for each of those graphs, I'll need to re-execute the resolution of the graph pattern being resolved. That means that these types of queries will increase in complexity with the number of graphs in the system. This goes completely against what we want in Mulgara, but as I said, it's such a rarely used feature that the cost seems mitigated.

I had thought that I'd be doing a query to find the graphs, and then join this to the resolution of the graph pattern that we want, but that failed to take several things into account. First, resolutions from the resolver come back in a particular order, and the kind of join I was proposing was not going to be ordered the way we wanted (it would have been ordered by graph first, and then ordered within each graph). Reordering may have been prohibitively expensive (depending on context), so this was out.

It was while thinking this through that I realized I can create a new Tuples "append" operation. The new append will take arguments that all have the same variables and the same ordering, and will perform a streamed merge-sort. This should give me exactly what I want.
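
A minimal sketch of what that operation could look like, using plain Java iterators in place of Mulgara's Tuples (all names here are illustrative, not actual Mulgara API): keep a priority queue of the inputs, ordered by the next value each would return, and always emit from the smallest.

  import java.util.Iterator;
  import java.util.List;
  import java.util.PriorityQueue;

  /** Sketch only: a streamed merge over inputs that are already sorted identically. */
  class MergeAppend<T extends Comparable<T>> implements Iterator<T> {
    // Inputs are prioritized by the next value each would return.
    private final PriorityQueue<PeekingIterator<T>> queue =
        new PriorityQueue<>((a, b) -> a.peek().compareTo(b.peek()));

    MergeAppend(List<Iterator<T>> sortedInputs) {
      for (Iterator<T> input : sortedInputs) {
        if (input.hasNext()) queue.add(new PeekingIterator<>(input));
      }
    }

    public boolean hasNext() { return !queue.isEmpty(); }

    public T next() {
      PeekingIterator<T> head = queue.poll();  // input holding the smallest next value
      T value = head.next();
      if (head.hasNext()) queue.add(head);     // put it back if it still has data
      return value;
    }

    /** Wraps an iterator so its next value can be inspected without consuming it. */
    private static class PeekingIterator<T> implements Iterator<T> {
      private final Iterator<T> it;
      private T next;
      PeekingIterator(Iterator<T> it) { this.it = it; this.next = it.next(); }
      T peek() { return next; }
      public boolean hasNext() { return next != null; }
      public T next() {
        T value = next;
        next = it.hasNext() ? it.next() : null;
        return value;
      }
    }
  }

The usage would be to hand it the per-graph resolutions, all sorted on the same variables, and read the merged stream back out.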

So the next thing I need is the complete list of graphs to bind the "G" node to when querying the indexes. I had thought that I'd be doing a query of the system graph for this, but my thinking has moved on from there. To start with, in order to make this query, I'd need the local node values for <rdf:type>, for the URI of the "type" of graphs stored in the system resolver, and for the system graph itself (a relative URI of <#>). The creation of these occurs during bootstrapping, which is fortunately complete before any of my "unusual" queries can happen.

While thinking about getting the local node values for these URIs, it occurred to me that a similar mechanism can be used to record whenever a graph is created in the system graph. That means that I can store each of the graphs in a list (and re-populate this list on startup with a simple constraint resolution). This list then becomes the basis for the graph bindings when I'm trying to emulate the missing indexes.

My first concern was that this might take too much space, thereby limiting the number of graphs that someone can have (as I said, some people have proposed using millions), but then I realized that my merge-join was going to need to reference the same number of resolutions as the number of graphs, and this would take more RAM anyway. It's really a moot point, since the system would choke from performing a million lookups before you need to worry about an Out Of Memory condition. All this reminds me... I shouldn't worry too much about optimizations at such an early juncture. Premature optimization is the root of all evil.

Anyway, I'll probably spend a day on this, and may even get it all going, but I won't have it tested in time for a release before the weekend. I'd better let Amit (my boss) know that he won't get it until Christmas. :-)

Size

Disk usage is probably the second most common question I get about Mulgara, after speed. So to complement the plots from Monday, I've also plotted the disk usage for these "number" graphs.

The lowest line represents the space being used for URIs and Literals. The upper line is for the statements themselves. For convenience, the top line is the sum of the other two.

This storage mechanism is doing no compression on the data whatsoever. The current code in XA2 is already using an order of magnitude less space, both because of more intelligent storage, and also because many blocks will be gzip compressed in our structures. Andrae's reasoning for that is that while CPUs are getting faster all the time, disks are not. This means that any processing we do on the data is essentially free, since the CPU work can usually be done in less time than it takes to wait for a hard drive to return a result, even a solid state drive.

I should note that these graphs are all on a version of XA1.1 that is not yet released (it's in SVN, but not in the trunk yet). I've been hoping to get this into the next release, but because I'm doing a release by the end of this week, I'm thinking it will have to be in the release after (before Christmas).

Monday, December 01, 2008

Mulgara Stats

The one question everyone asks me about Mulgara is always some variation on "How does it scale?" It's never easy to answer, as it depends on the hardware you're using, the OS (Linux does memory mapping better than Windows), and of course, the data.

I wanted to put off benchmarking until XA2 was released (early next year). I've also been hoping to have decent hardware to do it with, though I'm less sure about when that might happen. However, I've improved things by releasing the XA1.1 system recently, and it doesn't hurt to see how things run on desktop hardware.

RDF data varies according to the number of triples, the number of distinct URIs and literals, and the size of the literals. Some data uses only a few predicates, a modest number of URIs, and enormous literals. Other data uses only a few short literals, and has lots of URIs. Then there is the number of triples being stored. As an example, doing complete RDFS inferencing will introduce no new resources, but can increase the number of triples in a graph by an order of magnitude.

There are various standard sets and I intend to explore them, but in the meantime I'm going with a metric that Kowari/TKS used back in the days of Tucana, when we had a big 64bit machine with fast RAID storage. Funnily enough, these days I have a faster CPU, but I still don't have access to storage that is as fast as that box had.

The data I've been working with is the "Numbers" data set that I first created about 4 years ago. I tweaked the program a little bit, adding a couple of options and updating the output. There are probably better ways to model numbers, but the point of this is just to grow a large data set, and it does that well. You can find the code here.

Hardware

The computer I've been using is my MacBook Pro, which comes with the following specs:
Mac OS 10.5.5
2.6 GHz Intel Core 2 Duo
4GB 667 MHz DDR2 SDRAM
4MB L2 Cache
HDD: Hitachi HTS722020K9SA00
186.31GB
Native Command Queuing: Yes
Queue Depth: 32
File System Journaled HFS+

Note that there is nothing about this machine that is even slightly optimized for benchmarking. If I had any sense, I'd be using Linux, and I wouldn't have a journaled filesystem (since Mulgara maintains its own integrity). Even if I couldn't have RAID, it would still be beneficial to use another hard drive. But as I said, this is a standard desktop configuration.

Also, being a desktop system, it was impossible to shut down everything else, though I did turn off backups, and had as few running programs as possible.

The Test

I used a series of files, generating numbers to each million mark, up to 30 million. The number of triples was approximately 8-9 times the largest number, with the numbers from 1 to 30 million generating 267,592,533 triples, or a little over a quarter of a billion triples.

Each load was done with a clean start to Mulgara, and was done in a single transaction. The data was loaded from a gzipped RDF/XML file. I ignored caching in RAM, since the data far exceeded the amount of RAM that I had.

At the conclusion of the load, I ran a query to count the data. We still have linear counting complexity, so this is expected to be an expensive operation (this will change soon).

Due to the time needed for larger loads, I skipped most of the loads in the 20 millions. However, the curve for load times is smooth enough that interpolation is easy. The curve for counting is all over the place, but you'll have to live with that.


The axis on the left is the number of seconds for loading, and the axis on the right is the number of seconds for counting. The X-axis is the number of triples loaded.

Counting was less than a second up to the 8 million mark (70.8 million triples). This would be because most of the index could fit into memory. While the trees in the indexes do get shuffled around as they grow, I don't think that explains the volatility in the counting times. I'm guessing that external processes had a larger influence here, since the total time was still within just a few minutes (as opposed to the full day required to load the quarter billion triples in the final load).

Overall, the graph looks to be gradually increasing beyond linear growth. From experience with tests on XA1, we found linear growth, followed by an elbow, and then an asymptotic approach to a new, much steeper gradient. This occurred at the point where RAM could no longer effectively cache the indexes. If that is happening here, then the new gradient is still somewhere beyond where I've tested.

My next step is to start profiling load times with the XA1 store. I don't have any real comparison here, except that I know that there is a point somewhere in the middle of this graph (depending on my RAM) where XA1 will suddenly turn upwards. I've already seen this from Ronald's tests, but I've yet to chart it against this data.

I'm also very excited to see how this will compare with XA2. I'm meeting Andrae in Brisbane next week, so I'll find out more about the progress then.

Thursday, September 18, 2008

Disclaimer

New babies are wonderful, but the resulting sleep patterns are far from optimal. Please excuse me if I stop making sense halfway through any of the ensuing sentences.

Hash Tables

Whenever I have a spare moment (and sometimes when I don't) I'm forever re-examining how I think RDF should be indexed. After all, I've already found a few different ways to do it, both in Mulgara and out of it, and each have their own pros and cons.

One of the most interesting indexes to consider is the Hash Table. Constant time reads and writes make for a compelling argument in terms of scalability. Re-hashing an index during expansion is painful on systems that should scale, but I was recently reminded that amortized complexity is still linear, so I shouldn't be too scared.

Years ago TKS (the forerunner to both Mulgara and Kowari) used a few on-disk hash tables, but they proved ineffective for us, and we moved to trees. But many of our assumptions back in 2000 no longer apply to modern systems, and I've already found several things worth re-examining for this reason. On top of that, Andy Seaborne was discussing using them for Jena, and while I was initially dubious, on further reflection I can see the reasoning.

Pros

It's O(1): That's kind of a trump card. Everything else I have to say here is a discussion as to what could possibly be more important than being O(1).

Opaque Data: Data stored in a hash is treated as an atom, meaning there is no ordering or other meaning on the data. While this creates problems (mentioned below) it also provides the opportunity to distribute the data across a cluster like Hadoop. That's a big deal these days.

Cons

Re-Hashing: The first problem I think about with on-disk hash tables is the cost of a re-hashing operation. These are expensive in memory, but on disk they are going to be frightful. Reading the original hash will be OK, as this is a linear scan through the file, but writing will be problematic, as the seeks are essentially random. That's a cost of N seeks, for N entries (ignoring seeks for reads, but they're amortized, and could even be on another drive). There may be some algorithms for clustering the writes, but if you're trying to scale on the size of your data, then this would be overwhelmed.

The best way to address this is to allocate as much space as you can, and to be generous when growing. That could be a problem for some systems, but if you're really in the business of scaling on data, then you'll be up for it.

Space: Hash tables require a lot of empty space to work, else you end up with a lot of hashing collisions, and those lovely O(1) properties go out the window (until you expand and re-hash, but I've already talked about that). I shouldn't really make a big deal out of this, especially when you consider that Mulgara was built using the idea that "disk is cheap", but it does still feel a little strange to be that lavish. Also, being extravagant with space can lead to speed issues as well, so it's always worth looking at with the critical eye, even if the final decision is to use the space.

No Ordering: Data in a hash table cannot be ordered. Well, OK, a linked hash table can do it, but then you can only link in insertion order, or else all your O(1) benefits are gone.

Hashes and SPARQL

Of the three cons I listed here, it's relatively easy to justify the concerns about re-hashing and space. In fact, once you decide that "space is no object", then re-hashing isn't such a big deal since you can just start with an enormous table that never (or almost never) gets rehashed.

The ordering issue bugged me for a while, and it was then that I realized that this actually works well for SPARQL. In fact, this looks like yet another case where the heritage of filtering is showing up again (though maybe it's a coincidence this time).

When you use the appropriate resolvers in Mulgara (in either TQL or SPARQL, since resolvers are just mapped onto graph names) then data can be selected by "range". This lets us select an ordered set of date/times, numbers, or strings that occur between a pair of boundaries. (Particularly useful for something like selecting events during a particular time window). It is even useful for selecting URIs based on namespace. These selections are then joined to the remainder of the query to create a result. The end effect is processing much less data than simply selecting it all, and FILTERing it down by the data that meets the given criteria. We always pursued this in Mulgara, as we found that filtering could slow down certain queries by orders of magnitude.

However, SPARQL was never designed for this kind of thing, and as a result it relies entirely on filtering to do its work. This usually bothers me, but for hash tables it actually works, since they don't provide the ability to select a range anyway, and hence require filtering if you want to use them.

To Tree or not to Tree?

I've been wedded to trees in Mulgara for so long that it feels weird just examining a system without them. Of course, I've already moved away from the use of trees with the new statement store design, but I still thought that the data pool had to be ordered, and hence, no hash tables.

Now I can see the utility of using hash tables in this part of the system, provided you are prepared to use filtering for your results. Jena was always designed around these principles (it's easy to use, it's easy to implement, and it's correct), so I understand why Andy would be attracted to it. However, I know that range queries are a big deal in Mulgara, so we really do need a tree somewhere.

But perhaps we can mitigate some of the expense of tree indexes?

Trees are really only needed for two types of query: ranges of data (meaning literals); and selecting strings or URIs by prefix. Neither of these are common operations, and are certainly not needed during the time-consuming load operation. So perhaps loads could be done entirely with a fast hash index, and afterwards a slow tree-based indexer could come through to order everything. Background indexing is nothing new, and even AllegroGraph does it, though I'm not sure how to manage a range query while waiting for an index to proceed.

Another possibility would be to do inserts into a tree index, and simultaneously index the tree node with a hash index. After all, the tree nodes are not being reclaimed, and while their position in a tree may change, their data does not. This would require another seek/write during writing, but would save on log(N) seeks when looking to see if a string or URI exists, which is the single most common operation during a load. That way there would be no background indexing to worry about waiting for, and the most common task drops from log(N) to a single seek. Now that has promise. I'll have to see if I can think of a decent Hadoop angle for it.
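
To make the idea concrete, here's a rough in-memory caricature (TreeMap and HashMap standing in for what would really be on-disk structures, and all names invented): the hash answers the constant-time "does this value already have an ID?" question during loads, while the tree is kept for range and prefix lookups.

  import java.util.HashMap;
  import java.util.SortedMap;
  import java.util.TreeMap;

  /** Caricature of a data pool with both a hash index and a tree index. */
  class DataPoolSketch {
    private final TreeMap<String, Long> tree = new TreeMap<>();  // ordered: ranges, prefixes
    private final HashMap<String, Long> hash = new HashMap<>();  // constant-time lookups
    private long nextId = 1;

    /** The common load-time operation: find or allocate the ID for a value. */
    long localize(String value) {
      Long id = hash.get(value);      // single lookup in the common case
      if (id != null) return id;
      id = nextId++;
      tree.put(value, id);            // the tree is written alongside the hash
      hash.put(value, id);
      return id;
    }

    /** The rarer operation that needs ordering, e.g. selecting URIs by namespace. */
    SortedMap<String, Long> byPrefix(String prefix) {
      return tree.subMap(prefix, prefix + Character.MAX_VALUE);
    }
  }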

So now I need to write the hash table file. We already have a few things that are close, so maybe I can leverage off one of those?

Other Stuff

There's a LOT to write about with Mulgara, but this year I've tended to do the work rather than write about it. I believe that this is a false economy, since writing about things provides me with an invaluable log of what I did and when, and also helps me work out just what I need to be doing.

On the other hand, late at night is not the time for me to be writing, especially when a baby is going to be waking me at various times between now and morning.

All the same, I'll mention that I now have a couple of cute little servlets that let me do "HTTP GET" requests with SPARQL protocol parameters, and I get back either XML or JSON (depending on the value of an optional parameter called out. Hmmm, maybe I should have called it format?). One of the servlets is for TQL queries, while the main one is for SPARQL.

These servlets also accept "HTTP POST" requests. In this case, the TQL servlet will allow commands that update data. The SPARQL servlet will eventually do this too, but not until I've implemented "SPARQL/Update". They will also accept MIME encoded files containing RDF data (RDF/XML, N3 and I think Turtle) and will load them into the default graph, which can be specified with the default-graph-uri parameter.

I haven't committed all of this code yet, since I ran into a bug when loading an RDF file. It turned out that this file finishes with the line:
</rdf:RDF>
This line does not finish with a newline character, and this is confusing the ARQ parser we are using. Of course, I could just wrap the InputStream object in something that appends a newline, but this is an unnecessary (and horrible) hack, so I decided to look for the source of the problem.

At this point I realized that we are still on Jena 2.1, while the world has moved on to 2.5.6. Hopefully a move to 2.5.6 will fix this issue, so I decided to upgrade the Jar. Of course, this led to updating 2 other jars (icu.jar and arq.jar), along with other tests failing (I think they were trying to compensate for a timezone bug, but this has been fixed now).

While trawling through the Mulgara XSD classes I found what I think is the problem (compensation code for Jena not handling 0 months, though now it should). While there, I also learnt that despite parsing everything needed, the same data was being sent to a Jena object for parsing. This seems quite redundant. It is also one of the few places that Jena classes are used (as opposed to just the ARP parser), so it would be great to drop this dependency if I can.

So now a simple bug fix (not handling a missing newline character) seems to be leading me into all sorts of updates. Story of my life.

OK, now I'm falling asleep between words, and have even caught myself starting to type something I started dreaming on 3 occasions. I think I've overstayed on my blog.

Thursday, July 31, 2008

SPARQL

Perpetual coding doesn't leave much time for blogging. I'm in the middle of a long-running set of tests, so I figured I should take the time to write, even if I'm too tired. :-)

SPARQL on Mulgara always seems to have more to do than I have time or mandate for. That should be OK, given that SPARQL is now available through the SAIL API, but it's never quite that simple.

To properly work with Sesame/SAIL we need to build (or at least deploy) Mulgara using Maven. Now I understand what Maven does... I've just never used it. On top of that, we have the horrible build scripts that go into Mulgara, making the whole notion of re-creating the build system a little daunting. All the same, I've learned about creating a pom.xml, along with modules and inheritance, but I still need to read more docs on the topic. I'd like to get to this soon, but there are so many other pressing things.

So working with SAIL isn't an out-of-the-box distribution yet, which is an impediment to using SPARQL. At this stage I think the Mulgara SAIL API is more of an advantage to Sesame than it is to us. Another reason why it would be good to get SPARQL going is because people are always asking me for it. So even if I don't get 100% conformance, I should try to get it close. Anyone who needs it perfect can use the SAIL API.

Web Services

The best way to get SPARQL compliance is to run the test suite. That means you need some way to issue the queries and check the results. Now, I could write the code to do this, but I know that other systems for running the test suite exist out there, and it would be better to use one of those if I could. However, those systems will all be using the SPARQL protocol for issuing queries, and that's one part I hadn't really touched yet.

Fortunately, the protocol is just a web service, and Mulgara is already running web services. The response is just in XML, and I've written some code to do that already (though it's not checked in anywhere yet). I just need to glue it to a web service.

Looking at it, the protocol is so simple that the service should be implementable with a relatively straightforward servlet. Servlets are quite easy to write, but deploying them is system dependent, so I thought I'd get the deployment part going first. I built a simple "hello world" servlet with the intent of expanding it into the real thing once it was integrated correctly.
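
For what it's worth, the "hello world" level version of such a servlet really is tiny. Here's a sketch using the standard javax.servlet API; the query execution is just a stub, since that's the Mulgara-specific part I haven't glued in yet.

  import java.io.IOException;
  import javax.servlet.ServletException;
  import javax.servlet.http.HttpServlet;
  import javax.servlet.http.HttpServletRequest;
  import javax.servlet.http.HttpServletResponse;

  /** Sketch of a SPARQL protocol endpoint: read the parameters, run the query, return XML. */
  public class SparqlServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
        throws ServletException, IOException {
      String query = req.getParameter("query");
      String[] defaultGraphs = req.getParameterValues("default-graph-uri");
      if (query == null) {
        resp.sendError(HttpServletResponse.SC_BAD_REQUEST, "Missing 'query' parameter");
        return;
      }
      resp.setContentType("application/sparql-results+xml");
      // executeQuery() is a placeholder, not a real Mulgara call.
      resp.getWriter().write(executeQuery(query, defaultGraphs));
    }

    private String executeQuery(String query, String[] defaultGraphs) {
      return "<sparql xmlns=\"http://www.w3.org/2005/sparql-results#\"/>";  // stub
    }
  }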

Servlets

To start with, I followed the directions given for deploying a servlet in the Quick Start guide, and it all worked fine. Then I went to Mulgara to see how this would work.

Now I'd been aware that Jetty hadn't been updated in Mulgara for a while, and I thought that this would be a good chance to update it. However the existing version was 4.2.19, while the latest (released) version is 6.1.19. Some of the APIs appeared to be completely incompatible, and while there was an upgrade guide from Jetty 5 to Jetty 6, there was nothing about Jetty 4. Obviously this task had been left for too long.

So the first order of the day was not to get a servlet deployed in Mulgara, but rather to upgrade Mulgara to use the latest Jetty. This also dovetailed with another task I've been wanting to do for some time, which was to clean up the file where all of the Jetty configuration happens: EmbeddedMulgaraServer.

Upgrade

I eventually want to completely remove EmbeddedMulgaraServer and replace it with a lightweight program that loads up configurable modules. This will give us the benefits of having those modules ready for other types of deployment (another request I often get) as well as letting people customize the server, which is currently monolithic and unwieldy. I don't have time to get all of that done right now, but at least I got to tidy the code up to the point where this will be less intimidating. It also gave me a better view of what was going on in there (it regularly confuses developers who look at it).

Mulgara had been deploying two sets of static pages and 2 web services in Jetty. The static pages included the documentation that is both obsolete (to be replaced by the gradually expanding Wiki), and available on the website. The other pages are all data files, which I believe are used for example scripts. I think it's a terrible idea to have these in the system, so I ripped them out. Moments later I thought better of it, and so I emailed the list to see what people thought. I was bemused to see that not only was this a welcome move, people wanted to get rid of the HTTP server altogether! (These people obviously want to access those individual modules I mentioned earlier). So then I created both an option in the config file, and a system property which can both disable the server (the system property takes precedence).

That just left me with the 2 web applications in Web ARchive files to deploy. This is where I came unstuck.

WAR Files

I could not find any documentation on how to deploy a WAR file using the APIs in Jetty. So I muddled through the JavaDocs, picking up anything that looked promising. After an entire night of this, I eventually got something I thought might work, replacing the WebApplicationContext class with WebAppContext and trying to translate the differences in their APIs. I immediately got back an IllegalStateException that occurred while the system was accessing the WAR file. While trying to work it out I delved into the Java libraries, and discovered that something had closed off the archive file while it was still in the process of reading it. It seemed too far down in the system to be anything I could have caused (or prevented), so I went searching online to see if anyone knew about it.

It didn't take me long to see people mentioning this bug in relation to Jetty 5 about 2 years ago. It seemed strange that there wouldn't be a more recent reference, but that was the best I could get. Unfortunately, the response at the time was that the problem was indeed a bug with some of the Apache libraries that were used for this, which meant I was out of luck (sure, I could fix it, but that won't get me a deployed version of those libs any time soon).

I saw Brian online (apparently traveling as a passenger in a car) and he told me that he'd heard of the problem, and suggested that I "expand" the archive to deploy it. I did this by manually pulling the WAR files into a temporary directory before pointing the WebAppContext at it. This avoided the IllegalStateException.

Class Paths

The deployment of these WAR files into Jetty 4 had a few things that didn't translate so well. The first was the configuration of something called a SocketListener, which I figured out was replaced by a Connector. The second was in setting up the class paths. The code for this used to be:
  HttpContext contexts[] = httpServer.getContexts();
  for (int i = 0; i < contexts.length; i++) {
    contexts[i].setParentClassLoader(this.getClass().getClassLoader());
  }
This seemed reasonable, though I wasn't sure why it was being done. I was about to learn.

Jetty 6 no longer has the Context.setParentClassLoader() method, though it is now possible to set the actual class loader for the context. However, the class loader I had available in that context (this.getClass().getClassLoader()) was the same one that was already being used by that class. So I wasn't sure what to replace this with. Unfortunately, I made the mistake of choosing to set the class loader here anyway.

When I tried running the program again, I was immediately being told of missing classes. Of course, neither these classes, nor any code for them existed on my system. I eventually worked out that these were classes that were generated from Java Servlet Pages (JSPs), which took me into the configuration for generating these pages.

I hadn't realized we had JSPs in the system (will the cruft never end?!?) and I'd eventually like to get rid of these, even if I keep the web applications they're a part of. But for the moment, I had to upgrade those libs, and then update various build scripts which were trying to refer to the libs by name, and not with a generic variable (which we do for everything else - this lets us change versions relatively easily). I also discovered a "Tag" library for accessing Mulgara from JSPs. We don't seem to use it anywhere ourselves, and it just seems to be provided as a utility for users. The presence of this has me feeling reluctant to remove JSPs, but I'm still considering it.

Embedded JARs

Once the JSPs were running, I started getting errors about missing libraries that I expected were already in the class path. However, when I checked, I found that those libraries had NOT been included. It used to work, so I kept searching, and it didn't take me long to find them in the WAR file.

So this was the reason for the fancy classloader stuff. The classloader was supposed to find these JARs in the WAR file, and include them in its search. Only there was no such class loader in place. Hence my error.

The Javadoc mentions a class called WebAppClassLoader, which looked like an obvious candidate. However, the documentation made it appear that this class may not do very much, as it just extended the standard library class URLClassLoader. All the same, I tried it, but it didn't seem to do anything. (This was my big mistake).

I finally started adding the sources for all my libraries into my Eclipse environment, so I could debug it and see exactly what was happening. While time-consuming, it finally got me over the line. I also had a nice side benefit of learning just how the architecture of Jetty 6 works.

Deployed At Last

Tracing through the program, I found that a WebAppContext calls configureClassLoader on a WebInfConfiguration that it creates. This explicitly checks if the class loader is a WebAppClassLoader, and if it is, then it goes through the lib/ directory of the application, and adds any JAR files that it finds into its classpath.

Since the configuration is checking for this specific class loader, then this is obviously the only way to do it, unless you write a class loader for yourself. The application never creates one for you, which seems strange. The creation of the object is also strange in that it needs to be provided the web application that it works on (so it knows where to find the classes and libs), and it has to be explicitly set as the class loader for that application. So you need to say something like:
  webapp.setClassLoader(new WebAppClassLoader(webapp));
I'm confused why WebAppContext doesn't automatically create a WebAppClassLoader for itself, giving it a this reference. You can always override it, but it would be rare to need to.
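
For the record, the deployment ends up looking roughly like this (written from memory of the Jetty 6.1 API, with made-up paths, so treat it as a sketch rather than the code I committed):

  import org.mortbay.jetty.Server;
  import org.mortbay.jetty.webapp.WebAppClassLoader;
  import org.mortbay.jetty.webapp.WebAppContext;

  public class DeploySketch {
    public static void main(String[] args) throws Exception {
      Server server = new Server(8080);

      WebAppContext webapp = new WebAppContext();
      webapp.setContextPath("/webui");
      // Point at an expanded directory rather than the .war itself,
      // avoiding the IllegalStateException mentioned earlier.
      webapp.setWar("/tmp/mulgara/webui");
      // The class loader that knows to scan WEB-INF/lib for embedded JARs.
      webapp.setClassLoader(new WebAppClassLoader(webapp));

      server.setHandler(webapp);
      server.start();
      server.join();
    }
  }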

Anyway, I now knew what to do, and so I did it. Of course, it still didn't work. More debugging. That was when I ran headlong into that class loader code I wrote back at the start of this process. After setting the class loader for the WebAppContext this code was setting it back to the normal system class loader. That'll teach me for including code blindly.

Threads

So now everything was running "error free". I decided to throw a web browser at the WebUI application. Only, it wouldn't respond at all. I got a connection to the server, but it just sat there doing nothing.

Finally, I tried duplicating what I was doing in a short application using a simple servlet. It all looked OK, so I went through step by step, making sure I had it exactly the same... and it locked up there too. So then I started changing settings one at a time until I found the one that was causing the problem.

On Jetty 4, two of the options we were setting on the SocketListener were minThreads and maxThreads, however neither of these were options for Connector. So I decided to make do with AbstractConnector.setAcceptors(int), which does a similar thing. However, I made the mistake of setting the number of acceptors to our previous maxThreads value, which was 255.

If the number of acceptors is set this high, then the server is guaranteed to lock up. So I looked for the threshold at which this occurred. It turned out that the maximum value I could use was 24. It consistently works fine right up to this level, but any more and the system just blocks indefinitely. I checked out the source code, and discovered that all the acceptors are Runnable objects that get invoked by threads in a thread pool, but there is nothing about the size of that pool or anything else I could see that would create this limit of 24.

It also doesn't seem to matter what kind of Connector I'm using either, as the Acceptors are always the same.

A New Servlet

I'm finally at a point where the system works as well as it did at the beginning of the week, only now it's doing it with Jetty 6. It needed to happen, but I wish it hadn't been so painful.

I have other things to get to now, but I'll be trying to write this new SPARQL servlet soon. At least I have a modern framework to do it with now.

Wednesday, June 18, 2008

TV

Know what would make the Apple TV a no-brainer for me? Allow it to share a DVD from your desktop machine, like the MacBook Air can, and start including Blu-ray as a shareable disc type.

But no. I bet that interferes with a business model somewhere. :-(

Tuesday, June 10, 2008

Thesis

I've finally started writing my thesis, so don't expect to see me blog much in the near term. I know I haven't been blogging much at all this year, but I'm guessing I'm about to get worse (or who knows? Maybe I'll procrastinate and blog more).

I'm still in the introductory chapters, so I'm reviewing everyone else's work. I have a stack of references from a few years ago, but need to update some of it, and finally read some of the papers I put off all that time ago.

One of the really startling things is reading about stuff that I had to discover for myself while implementing Mulgara. As a database developer you just do things because they seem pragmatic, and you figure that everyone must do it that way. Then you read a paper where someone formalizes your assumptions and gives a name to it. I can think of several here, but the first that comes to mind is "DL-safe rules".

DL-safe rules are simply rules where the variables in the head must also occur in the body. Well, building rules for OWL that meet this criterion seems obvious to me, but apparently it merited a couple of papers on the topic. For a start, I'm not sure how you'd even do this without making sure your variables in the head all come from the body. Second, the only way I know of to handle a head variable that doesn't come from the body is to start introducing blank nodes for existential statements... and that way lies madness.

For instance, if you define (somewhat informally):
  ∀x : Man(x) → ∃y : Man(y) ⋀ father(y,x)

Then simply by saying Man(fred) you have an infinite loop. Incidentally, this is a trivial demonstration of how hard it can be to model the real world. The simple solution is to somehow incorporate a new type, like Men-without-fathers, and put that in your rule (hmmm, doesn't the DL-Handbook mention something like that?). Whether you introduce an entity named adam or somehow model evolution (good luck there) is up to you.

Back to the example... Of course, in OWL you can just create a blank node for an unknown father, but if you're going to take it that far then you want to create a blank node for the father of the first blank node, etc. Maybe it's reasonable to simply create that first step, and not reason further on blank nodes, but now you're making a judgment call that:
a) May not prove to be as useful as you'd envisaged.
b) May have implications for your logic.

Besides, what's the point in inferring a new node that you can't perform further inferences on? You'd just have a node there not saying anything except that it's a "father". But if you want to include it in a rule for determining ancestor(x,y), then suddenly it can be re-inferred on again, and you run the risk of an infinite loop once more.

So DL-rules just make sense in OWL (at least, they do to me). It's strange to see people like Boris Motik take them so seriously.

Speaking of Boris, he basically wrote the thesis I was hoping to write (well, sort of - fortunately I have a few ideas of my own). I came to many of the same conclusions that he has, simply by virtue of implementing stuff for Mulgara (though by virtue of having another child, moving countries, interrupting my candidature, and holding down a full time job, I didn't publish anything in time). The difference between what I would have written and what Boris did write, is that he knows the theory way better than I'm ever going to have the time for. I mean, I can follow it all, but it would never have occurred to me to give such algebraic formalism to everything the way he did. It's a little humbling to see someone do something like that so much better than you would have done.

Oh well. I guess I'd better stop procrastinating and write some more.

Monday, May 26, 2008

Mulgara Alpha

My last few weeks were spent trying to get Mulgara's SPARQL interfaces ready before the Semantic Technology Conference 2008. I met the criteria Amit (from Topaz) and I had agreed to beforehand, which allowed me to get out an Alpha release for the next version of Mulgara. There are still a couple of things missing, but the basics are all there now.

The road to SPARQL took a couple of turns I hadn't expected.

Back in February we were approached by Aduna who asked if we would be willing to support a level of integration between Sesame and Mulgara. While none of the Mulgara developers had the time to work with them directly, we said that we would be very happy to try to support Aduna where we could. The majority of this work was done by James Leigh (a programmer who commands my respect more and more on a daily basis), and he was able to get it all done in remarkable time. Even more impressive was that his integration work is 100% SPARQL compliant, even though some of the underlying structure isn't quite there yet!

My own work was to:
  • Parse SPARQL queries.
  • Convert this into the Mulgara Algebra.
  • Write new algebraic operations in the Mulgara query engine.
The work by Aduna was going to remove the need for the first and second tasks, but I had already completed the first when we heard from them, and most of the remaining work was needed for both the SAIL interface and my own SPARQL implementation. Since this was the case, I decided to continue with my own interface, as there wasn't going to be much redundant work from that point onwards. Even with both interfaces working correctly, the SAIL API will be the one to use, as it also includes a SPARQL Protocol endpoint, which I haven't looked at yet.

While the SAIL integration may have appeared to be independent from my own work, it turned out that James's contribution was invaluable. His need to pass all the SPARQL tests drove a lot of my query engine work, pointing out both missing features and bugs I was unaware of. I still have a couple of things to go, but James has been able to work around them at the higher layers for the time being. This has a performance penalty, but these will be dealt with in the next couple of weeks.

Notable Feature Implementations

Language Tags

One missing feature that completely floored me was that Mulgara was not supporting language tags on untyped literals. It turns out that this was slated for addition just as Tucana was closed, which is why it never made it. Even so, I must admit that I was surprised that it took that long for this feature to be scheduled!

Fortunately, language tags were quick and easy to implement. The main issue was in the existing tests, as nearly half of our files use literals with language tags in them, and none of the "expected results" included them.

Repeating Variables

Another issue was in "basic graph patterns" that use a repeating variable. Mulgara already had some code to deal with this, but it was failing in most cases. Unfortunately, I responded to this as a "bug report", and fell into the trap of fixing the existing code. I got it working after a day, only to be told the next day that it still failed if the variable is repeated in the position of the graph name.

At that point I stepped back from the problem, and realized that the solution was actually quite easy. All you need do is replace the repeating variable with a set of unique names, and create a conjunction of the constraint repeated with the variables in rotating positions. After mentioning this to Andrae he informed me that he'd worked this out a few years before (even though someone else was implementing the code at the time), but he forgot to let me know. Oh well, at least I'm doing it correctly now.

While looking to implement this fix, I realized that the best way to perform this substitution would be via Andrae's query transformation SPI. This lets you search through a query structure, and replace elements with something more appropriate for the engine to work with. It was while working with this I realized that it provides me with a tool that will let me solve a problem I've had for some time.

Transitive

The trans feature in Mulgara is a mechanism that lets the user mark the predicate in a constraint as transitive. While it works really well, the syntax in TQL is ugly. However, the query transformer offers an alternative. Instead of wrapping a standard constraint in a trans(...) operator, the predicate can be typed as being transitive in a separate constraint. I was tempted to use the URI of owl:TransitiveProperty for this task, but this will interfere with declarations in ontologies, so a local URI will be much more appropriate (something like mulgara:TransitivePredicate). The really cool thing is that this will be sharable with SPARQL queries as well. That means we can start opening some of our functionality up to SPARQL users, while not needing to extend the syntax of that language. In fact, there are a few functions we can implement in this way, allowing us to do a lot in SPARQL without sacrificing the speed and functionality of TQL.

Date Times

One question I regularly received from James was about date times. Unfortunately, Mulgara stores these canonically (using UTC), and hence does not round-trip these values. The solution is to store the timezone offset along with the value. Another tricky thing is to record if a time of "midnight" is recorded as "00:00:00" or as "24:00:00", as both are valid, and both need to be returned as they were provided, and not in a normalized form. I haven't done this one yet, but I expect to get it done by the end of the week.

I had a comment from Andy Seaborne that despite timezones being described in hours and minutes, this only requires a resolution of quarter-hour intervals, so I can probably squeeze this into some existing storage somewhere. I appreciate the advice, but it leaves me wondering which timezone appears with a 15 minute offset from its nearest neighbors!

In the meantime, James got around the problem by removing the xsd:dateTime specific code from the version of Mulgara he is working with, so it gets treated as an unknown type. This modification can be removed as soon as I fix the issue (which I expect to be by the end of this week).

Memorial Day

There is still an enormous amount of information to cover on Mulgara, SPARQL, and especially the SemTech conference, but I'm falling asleep as I type. It's currently Memorial Day here in the USA, and since getting back from the conference on Friday night, I've had a huge weekend with my family. Yesterday I took both of the boys in a trailer for "Bike the Drive", which is a lot more cycling than I've done for a few months. Swimming and running have kept me relatively fit, but it still tired me out! Consequently I just can't think now, so I'll pick this up again later.

Sunday, April 13, 2008

Writing 2-columns

In my last post I described a scheme for representing RDF statements with just 2 columns. But the moment I first thought of it, I decided it was too impractical. After all, each "triple" gets represented with 10 entries. If I want to include a graph identifier (i.e. a Quad store) then it goes up to 12 entries. If I want to cut down on disk seeking, then the idea seemed to be of little more than academic interest.

Then a little while ago I was explaining this scheme to a friend (Inderbir), and in the process I tried to explain why this was going to be impractical, but in the course of the discussion a few things occurred to me.

The statements to represent form a series of "doubles", which need to be indexed two ways: by each column. The data for a single statement will appear like this:
  Statement, _statement_x
  SubjectIdentifier, _subject_x
  PredicateIdentifier, _predicate_x
  ObjectIdentifier, _object_x
  _statement_x, _subject_x
  _statement_x, _predicate_x
  _statement_x, _object_x
  _subject_x, my:subject
  _predicate_x, my:predicate
  _object_x, my:object
Where anything whose name starts with an underscore is a unique identifier. As I'd already mentioned, now that we use 64 bit identifiers in Mulgara, it makes sense to create these from an incrementing long value.

Given that each identifier only gets used for one statement, then the statement, subject, predicate, and object identifiers will all be allocated together, and will be consecutive. Indeed, if these identifiers are kept separate from the identifiers that will be allocated for the URIs and Literals of the statement, then the statement ID can be presumed to always be a multiple of 4, and the subject, predicate, and object identifiers will be 1, 2, and 3 greater, respectively. This means that the bottom two bits of the IDs can be used to represent the type of the ID, meaning that the first 4 statements in the above list can be inferred, rather than stored. Also, since the IDs for the subject, predicate, and object positions can be calculated by adding 1, 2, or 3, then the next three statements don't need to be stored either. Cutting the data down to 3 entries suddenly makes it look more interesting.
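
A sketch of the bit arithmetic this implies (the class and constants are illustrative, not actual Mulgara/XA code):

  /** Illustrative only: statement IDs are multiples of 4, positions live in the low two bits. */
  final class StatementIds {
    static final int STATEMENT = 0, SUBJECT = 1, PREDICATE = 2, OBJECT = 3;

    private long nextStatement = 0;  // simple incrementing counter of statements

    /** Allocate the base ID for a new statement (always a multiple of 4). */
    long newStatementId() { return (nextStatement++) << 2; }

    static long subjectId(long statementId)   { return statementId | SUBJECT; }
    static long predicateId(long statementId) { return statementId | PREDICATE; }
    static long objectId(long statementId)    { return statementId | OBJECT; }

    /** The low two bits say what kind of ID this is... */
    static int positionOf(long id) { return (int) (id & 3); }

    /** ...and masking them off recovers the statement it belongs to. */
    static long statementOf(long id) { return id & ~3L; }
  }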

I should note at this point that I still expect to represent URIs and Literals with IDs that can be mapped to or from the data they represent. While the mechanism for doing this in Mulgara needs to be improved, it is still an important concept, as it reduces redundant storage of strings, and the comparison of Long values allows for faster joins. However, I do intend to return to this idea.

After reducing the data to be stored, we now have:
  _subject_x, my:subject
  _predicate_x, my:predicate
  _object_x, my:object
Indeed, since each of those IDs are consecutive, and always increasing, then in the index that is sorted by the first column, all three statements will go to the end of the file. This means that the file need not ever have a seek operation performed on it while it is being written to. Operating systems are usually optimized for append-only writing, so this is another bonus.

It is also worth noting that since the predicates are always consecutive, there is no need to write each of them either. Instead, the following can be written for each statement:
  _statement_x, my:subject, my:predicate, my:object
With this data, all of the above can be inferred. Indeed, the need for the statement to take up an ID on its own can be dropped, and the subject, predicate, and object IDs are calculated by adding 0, 1, and 2 to the first ID. This leaves space for a fourth element, such as a graph identifier, before needing more than 2 of the low-order bits to give the type of the identifier.

On the other file, we will be storing the same data in reverse order:
  my:subject, _subject_x
  my:predicate, _predicate_x
  my:object, _object_x
In this case, the identifiers for the URIs, blank nodes, and literals of the subject, predicate and object will be all over the place (and will be regularly re-used), so there is no guarantee of ordering here. This means we have to go back to standard tree-based indexing of the data. However, we only have 3 search operations to go through here, which is significantly better than the searching we currently do in Mulgara.

Note that all of the above applies to statements with more than 3 elements as well. Each new element in a statement increases the size of the single write on the first index by one more long value, and adds one more seek/write operation to the second index. This is far less expensive than expanding the size of the "complete" indexes used in Mulgara.

Retrieving

I'll stop for a moment, and take a look at what a read operation looks like.

The first index file is written to linearly. Each record is identical in size, and the ID that starts the record is monotonically increasing. If the store were write-once-read-many (WORM), then the ID could be skipped altogether as this information would be inferred from the offset within the file. This may be useful for some applications, but I'd prefer to delete information in place (rather than creating a white-out list for later merging), meaning that the ID is still required in this case.

For this kind of structure, the file can be searched using a binary search. Also, the largest offset that an ID can appear at is the value of that ID multiplied by the size of a record, meaning that the number of seeks required for a search can be greatly reduced.
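
A sketch of that bounded binary search, assuming fixed-size records that each begin with their 8-byte ID (the record layout and size here are invented for illustration):

  import java.io.IOException;
  import java.io.RandomAccessFile;

  /** Binary search in a file of fixed-size records, each starting with a monotonically
   *  increasing long ID. Sketch only; the record size is illustrative. */
  class LinearIndexSearch {
    static final int RECORD_SIZE = 32;  // e.g. 4 longs: ID + subject + predicate + object

    /** Returns the record offset of the given ID, or -1 if it is not present. */
    static long find(RandomAccessFile file, long id) throws IOException {
      long records = file.length() / RECORD_SIZE;
      long low = 0;
      // An ID allocated from a counter can never appear beyond record #id,
      // so the upper bound of the search can be clamped accordingly.
      long high = Math.min(records, id + 1) - 1;
      while (low <= high) {
        long mid = (low + high) >>> 1;
        file.seek(mid * RECORD_SIZE);
        long found = file.readLong();
        if (found == id) return mid * RECORD_SIZE;
        if (found < id) low = mid + 1; else high = mid - 1;
      }
      return -1;
    }
  }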

The second index is a standard tree. B-Trees are well known for not seeking much, so for a first cut, I would suggest this (though Andrae has other plans further down the line).

To find all statements that match one element (say, the predicate), a search is done on the tree index to find the first time that URI appears. That URI is paired with a set of IDs that represent its use in statements (sometimes as predicate, sometimes as subject or object). These IDs are in consecutive order, and so can be merged with the first index as a linear operation. Adding in another element to search by (say, we are looking for a given predicate/object pair) means another search on the second index, and another linear merge.

Linear merges aren't too bad here, as it is always a linear operation to go through all of the data anyway (meaning that it can't be avoided). The only case where this is an unnecessary expense is if the "count" of a set of statements is required.

Efficiency in the Tree

While considering the above structure, it occurred to me that this index is having to store identifiers for the RDF nodes over and over, even though they all appear next to one another. There are ways of compressing this, but it made me question the redundancy altogether. What if the item was just stored once, and the "satellite data" (to use the term for data associated with a key) was instead its own structure? I thought that maybe this could be a tree, but then it occurred to me that the data represents statement IDs, and will therefore always be inserted in increasing order. So a list is most appropriate.

So now I could have each entry in this tree point to a list of statements that this RDF node participates in. Since the list will always be appended to, it makes sense that this is kept in another file, using a linked list of blocks. However, to cut down on seeks, the first few elements of the list would do well to appear with the node in the original tree.

So what sort of satellite data should be stored? For reading, the head of the list has to be stored, though as just mentioned, I think that this should be inline with the satellite data. The tail of the list should also be stored, else it would require a linear seek to work out where to insert, and this is not scalable. To give some help with management of the list, the size should also be recorded. This also makes counting trivial.

Up until now there has been a presumption that the identifiers of elements in a statement follow a particular bit pattern. However, if the satellite data contains three lists instead of one, then the number of the list is enough to indicate which position the node is used in. For instance, the node of <rdf:type> may have a few entries in the list for subject (indicating that it is the "subject" in just a few statements), may have a few entries in the object list (indicating that there are a few statements which refer to this URI), but will have millions (or more) statements in the predicate list, because this URI indicates a commonly used predicate.

If the presence of a statement ID in one list or another indicates that this node is used in a particular capacity for that statement, then this means that the presumption of using the low order bits of the ID for this purpose is removed. That gives us a little more flexibility.
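
An in-memory caricature of the node entry being described: on disk the three lists would be block-linked lists with head, tail and size stored as satellite data, but plain lists show the shape of the idea (all names invented).

  import java.util.ArrayList;
  import java.util.List;

  /** Caricature of the per-node satellite data: one statement list per position. */
  class NodeEntry {
    final List<Long> asSubject = new ArrayList<>();
    final List<Long> asPredicate = new ArrayList<>();
    final List<Long> asObject = new ArrayList<>();

    /** Statement IDs only ever increase, so appending keeps each list sorted. */
    void addUse(long statementId, int position) {
      switch (position) {
        case 0: asSubject.add(statementId); break;
        case 1: asPredicate.add(statementId); break;
        case 2: asObject.add(statementId); break;
        default: throw new IllegalArgumentException("bad position: " + position);
      }
    }

    /** Which list an ID appears in now says how the node is used, so the
     *  low-order "type" bits of the ID are no longer needed for that. */
    long count(int position) {
      return position == 0 ? asSubject.size()
           : position == 1 ? asPredicate.size()
           : asObject.size();
    }
  }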

Data Pool

All of the above presumes that there exists a mechanism to map URIs, strings, and other literal data on to an ID, and to map those IDs back into the original data. Historically, Mulgara has referred to the store that performed this operation as the "String Pool". Since URIs are encoded as a string, and the first iteration of Mulgara only stored literals in a lexical form, this name was accurate. However, with the inclusion of numbers, dates, and other datatypes, it might be more accurate to refer to this construct as a "Data Pool" instead.

Part of the data pool structure of Mulgara uses a tree containing some (or all) of the data as a key, and a long value as the ID it is mapped to. Storing entries that are keyed on strings or other data is a lot like the second index just mentioned. So now I started to reconsider the presumption of a separate data pool altogether.

Instead of writing to the linear file first, the idea is to write to the tree index first. This involves a search. If the data is found, then the statement ID will be appended to the end of the appropriate list (this updates the linked list block, possibly spilling over into a new block, and then rewrites the tail/size of this list in the tree). If the data is not found, then a new entry is placed in the tree, two lists are initialized to nil, and the third is given the allocated statement ID. The list is not yet long enough to spill into the file full of linked lists, so this isn't too expensive. For a B-Tree with space, this will require writing of just a single block!
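
Continuing the earlier caricature, the write path that hits the tree first might look like this (a TreeMap standing in for the on-disk B-tree, and reusing the hypothetical NodeEntry sketch from above):

  import java.util.TreeMap;

  /** Continuation of the sketch: the write path searches the tree index first. */
  class TreeIndexSketch {
    private final TreeMap<String, NodeEntry> tree = new TreeMap<>();

    /** Record that `value` is used in position `pos` of statement `statementId`. */
    void insert(String value, int pos, long statementId) {
      NodeEntry entry = tree.get(value);  // search the tree first
      if (entry == null) {
        entry = new NodeEntry();          // new node: three empty lists...
        tree.put(value, entry);
      }
      entry.addUse(statementId, pos);     // ...then append to the appropriate list
    }
  }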

Now it isn't feasible to store everything in the tree as a key, so only the head of the data would need to go directly into the tree. The remainder of the data is still needed, but rather than trying to manage this data re-usably, the ideas from last post about keeping all the data in the pool can be adopted. In this case the data can simply be appended to a third file. The offset of this append then becomes the ID of that data. This ID is stored along with the rest of the satellite data in the tree. It is also the ID that gets stored in the first linear index file which can now be written to.

More Efficiency

So now instead of a "Data Pool" and 2 files, the design is now for 4 files. Two of them are only ever appended to, one always has direct seeks before writing, and only one of them is a tree that requires searching before a write can happen. Given that this is the entire store, then that's not too shabby! It's a darn sight better than the 196 files in Mulgara, almost all of which need multiple seeks to do anything.

But can I do better?

Andrae had already been looking at reworking the string/data pool, and a lot of the improvements are quite obvious. For a start, any data that can fit into 54 bits (or so) ought to have its value encoded into its ID, with the top bits used for type identification. That many bits let you encode all bytes, chars, shorts, ints and floats, as well as the majority of long values (and possibly a lot of doubles as well). Any date within a century of now will also fit. This means that many items that are not strings don't need any extra storage. So along with the type bits, there would be another bit to indicate whether the data is encoded in the ID, or whether it is found in the data file. Anything that can be encoded into the ID won't have to go into the data file, though it would still go into the indexes so statements using it can be found. The main difference is that any statements discovered to contain one of these IDs would not require the extra seek to get the remaining information.
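
To make the bit juggling concrete, here is a rough sketch of the kind of layout being described. The field widths and type codes are my own guesses for illustration, not a settled format:
  // A sketch of encoding small values directly into a 64-bit ID: a flag bit saying
  // the value is inline, some type bits, and the value itself in the low bits.
  final class InlineId {
      static final long INLINE_FLAG = 1L << 63;   // set when the value lives in the ID itself
      static final int  TYPE_SHIFT  = 54;         // leaves ~54 bits for the value
      static final long TYPE_INT    = 1;          // example type code

      static long encodeInt(int value) {
          return INLINE_FLAG | (TYPE_INT << TYPE_SHIFT) | (value & 0xFFFFFFFFL);
      }

      static boolean isInline(long id)  { return (id & INLINE_FLAG) != 0; }
      static long    typeOf(long id)    { return (id & ~INLINE_FLAG) >>> TYPE_SHIFT; }
      static int     decodeInt(long id) { return (int) (id & 0xFFFFFFFFL); }
  }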

Another significant change was already proposed by Andrae over a year ago: the different types of data are stored in different indexes, each optimized to handle that kind of data. This increases the number of files, but only one of these files will be accessed at a time. Also, since all of these types are literals, there is no need for lists describing subject or predicate statements.

Similarly, blank nodes will have their own file, only they will not require any extra data beyond the lists, and no predicate list will be required.

Getting back to the fundamental types of strings and URIs, Andrae pointed out that tries are an appropriate structure for reducing space requirements. This is perfect for managing the plethora of URIs that appear in the same namespace (or that just start with "http://"), as common prefixes of strings are not repeated in this structure. Like other tree structures, tries can carry arbitrary satellite data, so they are perfectly adaptable to this scheme.
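
For anyone unfamiliar with tries, here is a minimal in-memory sketch; a real implementation would be disk-based and would hang the satellite data described above off the terminal nodes:
  import java.util.*;

  // A minimal character trie: URIs sharing a prefix (e.g. "http://example.org/...")
  // share the nodes for that prefix instead of repeating it.
  class UriTrie {
      private static class Node {
          final Map<Character, Node> children = new HashMap<>();
          Long id;   // the satellite data would hang off terminal nodes like this
      }

      private final Node root = new Node();

      void put(String key, long id) {
          Node n = root;
          for (char c : key.toCharArray()) {
              n = n.children.computeIfAbsent(c, k -> new Node());
          }
          n.id = id;
      }

      Long get(String key) {
          Node n = root;
          for (char c : key.toCharArray()) {
              n = n.children.get(c);
              if (n == null) return null;
          }
          return n.id;
      }
  }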

Interestingly, if we expand the trie to become a suffix trie, then we can get full text searching, which is one of the most common requests that Mulgara gets.

Hashed Predicates

The example I gave above, about how <rdf:type> mostly participates in statements as a predicate, is typical of many predicates. In many situations the set of predicates in use is quite small. In particular, there are likely to be just a few predicates that are used the majority of the time, such as <rdf:type>, <rdfs:domain> and <rdfs:range>, as well as many application specific values.

Since these URIs are going to be accessed all the time, there isn't a lot of point in burying them deep in the URI tree. Instead, the most common URIs could each be given their own file, which holds the "predicate statement list" for that URI. These URIs can still be included in the tree for their subject and object lists, but the code that searches by predicate would skip the tree and go directly to the file instead. Any operation that requires iterating over all the predicates can have these values inserted by the algorithm, rather than reading them from the tree structure.

However, which URIs would be stored this way? This may vary from one application to another. So instead of hard coding the values in, they could be placed in a configuration file. Then the application would know to map these values directly to their own files instead. Since the filenames can be allocated by the system, they can be created with a hashing algorithm, or possibly be placed in the configuration file along with the predicate URI list.

I'd still prefer to configure this rather than allow ALL predicates to be handled this way, so that predicates which are not used so commonly don't each take up the resources of another file. It also allows the system to have an arbitrary number of predicates beyond the most commonly used ones. But by having these files dedicated to common predicates, any request for statements with a given predicate requires just a single seek to the start of that file, which immediately yields the list along with its size.
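
A sketch of how such configuration-driven routing might behave. The property naming and the hashed file names here are invented purely for illustration:
  import java.util.*;

  // Routes "hot" predicates to dedicated statement-list files, and falls back
  // to the tree for everything else.
  class PredicateRouter {
      private final Map<String, String> dedicatedFiles = new HashMap<>();

      PredicateRouter(Properties config) {
          // e.g. predicate.file.1 = http://www.w3.org/1999/02/22-rdf-syntax-ns#type
          for (String key : config.stringPropertyNames()) {
              if (key.startsWith("predicate.file.")) {
                  String uri = config.getProperty(key);
                  // File name derived from a hash of the URI, as suggested above.
                  dedicatedFiles.put(uri, "pred_" + Integer.toHexString(uri.hashCode()) + ".stmts");
              }
          }
      }

      // Returns the dedicated file for this predicate, or null to use the tree.
      String fileFor(String predicateUri) {
          return dedicatedFiles.get(predicateUri);
      }
  }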

Comparisons

The evening after I presented this to the Mulgara/Topaz developers back in March, I happened to attend a presentation on applying columnar databases to RDF. This described storing subject/object pairs in files, with one file per predicate. The hashed-predicate optimization here is similar, but it has a good fallback for when you run out of files for your predicates (after all, searching a good-sized B-Tree typically only requires a couple of seeks). This scheme also provides the ability to search for statements by subject or object, which apparently is less efficient in the presented system.

A nice feature that is shared by both this scheme and the columnar scheme is that selecting statements always gives sorted values that can be joined with linear merge-joins.

However, given the flexibility of this structure, I've been encouraged to write it up and let people know about it. Well, I've started that, but I thought it would be good to get something out there straight away, hence this post.

In the meantime, in amongst my SPARQL work I'm trying to build a proof-of-concept. I've done the complexity calculations to see both the worst case and the expected case, but it doesn't take much effort to see that it involves a massive reduction in the seeking, reading and writing that Mulgara does at present. I won't be including all the optimizations discussed here, but I still expect it to be around two orders of magnitude faster, and to take up a couple of orders of magnitude less space.

Final Notes

None of the above discusses deletions, transactions, or any of that other stuff needed to make a database useful in the real world. These issues haven't been forgotten, but in order to present the structure I wanted to concentrate on the minimalism in reading and writing to the structure.

My plan for deletions is to go through the various lists and mark deleted entries with an invalid identifier (e.g. -1). These entries have to be skipped linearly during read operations, which means that removing data has little impact on speed (except that the blanked-out IDs never need to be converted into URIs, literals, etc). At a later time, either through an explicit cleanup operation or a background task, the data will be compacted by shifting it all down to fill the gaps. Of course, this will require some locking for consistency, though since everything is ordered, there may be a chance to minimize locking by skipping any data that repeats or appears out of order.
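
As a trivial sketch of the read side, with -1 standing in as the tombstone marker (illustrative only):
  import java.util.*;

  // Reading a statement list while skipping deleted entries, which are marked
  // with an invalid ID rather than physically removed.
  class StatementListReader {
      static final long DELETED = -1L;

      static List<Long> readLive(long[] rawList) {
          List<Long> live = new ArrayList<>();
          for (long id : rawList) {
              if (id != DELETED) live.add(id);   // tombstones are simply skipped
          }
          return live;
      }
  }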

Andrae has also spent a lot of time working on a theoretical framework for concurrent write transactions in RDF. His work is quite detailed and impressive. Fortunately, the design described here appears to be completely consistent with that framework, so we hope to eventually integrate the two. In the meantime, Andrae's work will form the basis for XA2, which in turn will take a few avenues to permit this scheme to be easily integrated at a later date.

So for now, I have to get SPARQL up and running, while also looking for time to finish the proof of concept and writing everything up. I suppose I should be doing that instead of blogging. :-)

Tuesday, April 08, 2008

Indexing

I'm in the process of writing a number of things up at the moment, including the following description of RDF storage. But since academic papers take so long to write, and they're boring, I thought I'd blog the main bit of one of the things I'm writing about.

This all came about due to a description I wrote a few years ago about the number of indexes needed to store data that is N columns wide. (Wow! Is it really 4 years?) It came down to a process and an equation: finding the maximum value of the following expression as S varies from 1 to N:
maxS=1..N (N!/((N-S)!S!))
This gives a result of 3 indices for symmetrically storing triples, 6 indices for quads, 10 indices for quintuples, and so on. Note that this is the number of indices needed if you want to be able to use any search criteria on your tuples. This may indeed be the case for triples and quads, but if an element of the tuple becomes a unique ID (like it does for reification), then there is no need for symmetric indexing.
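
As a quick sanity check, a few lines of Java that reproduce those numbers by taking the largest value of the expression over S (purely illustrative):
  // Prints the number of index orderings needed for 3, 4 and 5 columns: 3, 6 and 10.
  class IndexCount {
      static long choose(int n, int s) {
          long result = 1;
          for (int i = 1; i <= s; i++) result = result * (n - i + 1) / i;
          return result;
      }

      static long indicesFor(int n) {
          long max = 0;
          for (int s = 1; s <= n; s++) max = Math.max(max, choose(n, s));
          return max;
      }

      public static void main(String[] args) {
          for (int n = 3; n <= 5; n++) {
              System.out.println(n + " columns -> " + indicesFor(n) + " indices");
          }
      }
  }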

The rapid growth of this equation is a clear indicator that we want to keep the number of columns as low as possible. For expediency Mulgara moved from 3 columns to 4, so that we could encode graph identifiers with the triples, but that came at the expense of doubling the number of indices. This is really a big deal, as each index in Mulgara takes several files for managing the resources in the index, and for holding the index itself. Each piece of information that has to be read or written means another disk seek. This can be mitigated by the operating system's read and write-back caching, but as the amount of data exceeds what can be handled in memory, these benefits evaporate. So keeping the number of indices down is a big deal.

Ronald Brachman's work in '77 shaped the future direction of description logics, including the idea that everything can be represented using binary and unary predicates. RDF is defined using binary predicates, and unary predicates are simulated using the rdf:type predicate, which means that RDF is inherently capable of representing description logics and, indeed, any kind of knowledge representation. The issue is that it can be inefficient to represent certain kinds of structures.

The RDF representation of reification requires 3 extra statements (plus one that can be inferred), and these are independent of the actual statement being reified. An extra column can eliminate these 3 statements altogether, but the indexes grow accordingly. Graph membership can be accomplished using extra statements as well, and again, this can be trivially eliminated with an extra column. The question is: when do the extra columns (with the consequent factorial growth in indexes) become more expensive than adding more statements? Should the number of columns be limited to 4? To 3?

2 Columns

I always found it interesting that the equation above has a solution for N=2. I considered this to be an artifact of the equation, but it bugged me all the same. So then a couple of years ago I gave it some thought, and realized that it is indeed possible to represent a triple using "doubles". Of course, once a triple can be represented, then anything can be represented. The question is efficiency.

If the indices were to contain only 2 columns, then this means that only unary predicates could be used. This implies that the predicates define a type. After some thought I realized that I could use unique types to identify each element of an RDF statement, and then a unique type to represent the statement itself. Of course, there is nothing new under the sun, and just recently I discovered that the CLASSIC system introduced unique atomic concepts for each individual in the system in a similar way.

To map the following triple:
  <my:subject> <my:predicate> <my:object>
to unary predicates, I used a scheme like the following:
  Statement(_statement_x)
SubjectIdentifier(_subject_x)
PredicateIdentifier(_predicate_x)
ObjectIdentifier(_object_x)
_statement_x(_subject_x)
_statement_x(_predicate_x)
_statement_x(_object_x)
_subject_x(my:subject)
_predicate_x(my:predicate)
_object_x(my:object)
Where each of _statement_x, _subject_x, _predicate_x and _object_x is a unique identifier, never to be used again. In fact, my use of underscores as a prefix here indicates that I was thinking of them as a kind of blank node: unique, but without a distinguishing label.
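
As a code sketch, generating these pairs for an arbitrary triple might look like the following. The naming of the generated identifiers is purely illustrative, and a real store would allocate numeric IDs rather than strings:
  import java.util.*;

  // Maps one triple onto the 2-column (unary predicate) entries described above.
  class TwoColumnMapper {
      private long nextId = 0;

      private String freshId(String kind) { return "_" + kind + "_" + (nextId++); }

      // Returns the (type, individual) pairs for one triple.
      List<String[]> map(String subject, String predicate, String object) {
          String st = freshId("statement");
          String s  = freshId("subject");
          String p  = freshId("predicate");
          String o  = freshId("object");
          return Arrays.asList(
              new String[] {"Statement", st},
              new String[] {"SubjectIdentifier", s},
              new String[] {"PredicateIdentifier", p},
              new String[] {"ObjectIdentifier", o},
              new String[] {st, s},
              new String[] {st, p},
              new String[] {st, o},
              new String[] {s, subject},
              new String[] {p, predicate},
              new String[] {o, object});
      }
  }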

When I first came up with this scheme, I thought it a curiosity, but hardly useful. It seemed that significant work would need to be done to reconstruct a triple, and indexing so many items would require a lot of seeking on disk. I was also concerned about the "reckless" use of the address space for identifiers in creating unique IDs for so many elements.

Then recently I was describing this scheme to a friend, and I realized that when I considered some other ideas I'd been working on lately, then there was something to this scheme after all.

Disk Seeking

I've been very disappointed with Mulgara's loading speed on certain types of data recently. If the data has a lot of unique URIs and strings, then the store grows too large, and loading takes far too long. I was also surprised at the gigabytes of file storage being used when the data files were only a few hundred megabytes. Mulgara is supposed to be scalable, and this wasn't acceptable behavior.

Consequently, I've been doing more work with algorithms and data structures recently. I have not been trying to supplant Andrae's work but was instead hoping to tweak the existing system a little in order to improve performance.

The first thing that becomes apparent is that the plethora of files in Mulgara is a real bottleneck. Each file on its own may be efficient (not all are), but cumulatively they cause the disk to seek all over the place. Since a seek is probably the single most expensive action a computer can take (other than a network request), reducing the number of seeks is a priority.

Profiling the code led to a couple of improvements (these have been rolled into the Mulgara 1.2 release), but also showed that the biggest issue is the String Pool (more properly called the "Data Pool" since it now stores any kind of data). This is a facility that maps any kind of data (like a URI or a string) to a unique number, and maps numbers into the data they represent. With a facility like this, Mulgara is able to store triples (or quads) as groups of numbers. We call these numbers "Graph Nodes", or gNodes.

The string pool was spending a lot of time just searching to see if a URI or string being inserted into the graph was already mapped to a number, and inserting it if not. Some work was also being done to keep track of what had been allocated in a given transaction phase, so that any allocated resources (like disk blocks) could be freed and reallocated if the data were ever removed. However, items are rarely removed from the string pool. Removals mostly occur when an entire graph is dropped, and these graphs are often dropped just before a slightly modified version of the same data is inserted. In this case, the same data will be removed from the string pool, and then re-inserted. That's a lot of work for nothing. It makes much more sense to leave everything in the string pool, and only remove unused items when explicitly requested, or perhaps as a background task. (Unused items can be easily identified since they don't appear in the statement indices.)

If the string pool were changed to be a write-once-read-many pool, then a lot of the structures that support resource reuse (Free Lists, which take a few files each) could be removed from the string pool. Of course, the reduced reading and writing from no longer removing and re-inserting data would also be a benefit. So this looked promising.

Another idea is to take any data that fits into less than 64 bits (say, 58 bits) and store it directly in the ID number instead of in the pool. The top bits can then indicate the type of the value, and whether or not it is "stored" or if it is simply encoded in the ID. This covers a surprising range of required numbers, and most dates as well. This idea was mentioned to me in SF last year, and it sounded good, only I had completely forgotten that Andrae had already proposed it a year before (sorry Peter, you weren't first). But wherever the idea came from, it promised to dramatically help dates and numbers. In fact, it helps all the data, since the tree no longer has as many elements stored in it.

There were also other ideas, such as changing the type of tree used in the index. We mitigated the cost of AVL trees in the indices by having them point to large blocks of data. However, that amounts to subtracting a constant in the complexity analysis, while a wider tree divides by a constant. Constants don't usually mean much in complexity analysis, but when each operation represents a disk seek, the difference becomes significant. While this is something that must be looked at, it didn't make sense to pursue when we knew that XA2 was coming, and that the trees would change anyway.

Address Space

You may have noticed that I'm talking a lot about resource reallocation, and 64 bits in the same breath. This shows some of the history of Mulgara. The system originally ran on 32 bits, where not reusing resources was a guaranteed way to wrap around in the number space and cause no end of problems. When the system was upgraded to 64 bits, it still made sense to manage resources for reallocation, as some resources were still limited. However, resources that represented IDs in an address space were not reconsidered, and they ought to have been. Looking at what literals could be encoded in a 64 bit value (and how many bits should be reserved for type data) was the impetus I needed to make me look at this again.

Given that every resource we allocate takes a finite time that is often bounded by disk seeks, it occurred to me that we were not going to run out of IDs. If we only used 58 bits, then we could still allocate a new resource every microsecond and not run out of IDs for over 9000 years. A more reasonable design period is 100 years (yes, this is a wide margin of safety), and constant allocation of resources at one per microsecond means that we still only need 52 bits. So we're safe not reusing IDs, and indeed, we have over a byte's worth of bits in this ID to use for some interesting engineering tricks.
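
As a quick back-of-the-envelope check of those figures (illustrative only):
  // Allocating one ID per microsecond: how long does the address space last,
  // and how many bits does a 100 year design period actually need?
  class IdSpace {
      public static void main(String[] args) {
          double microsPerYear = 1e6 * 60 * 60 * 24 * 365.25;
          System.out.println("58 bits last " + (Math.pow(2, 58) / microsPerYear) + " years");   // ~9000
          System.out.println("100 years needs " + (Math.log(100 * microsPerYear) / Math.log(2))
              + " bits");                                                                       // ~52
      }
  }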

Structure

So I had a number of these lessons fresh in mind when I recently tried to describe just why a 2 column store was inefficient. During the course of the conversation I started seeing ways in which I could apply some of these techniques in a useful way. It took a while for it to come together, but I now have something that really shows some promise.

The details are reasonably involved, so it makes sense to take a break here, and write it all up in a fresh post in the next day or so. A little more sleep might also help prevent the rambling that I've noticed creeping into this post. :-)

Tuesday, April 01, 2008

Collections

So I'm trying to work out what is necessary in OWL, and what is necessary and sufficient. Actually, I just want "necessary and sufficient", but knowing the difference helps. :-)

Anyway, while working through this post, I worked it out. But it probably won't hurt to write it down anyway...

I had narrowed my problem down to the following:

If I had a Collection like:
  <rdf:Description rdf:about="http://example.org/basket">
<ex:hasFruit rdf:parseType="Collection">
<rdf:Description rdf:about="ex:banana"/>
<rdf:Description rdf:about="ex:apple"/>
<rdf:Description rdf:about="ex:pear"/>
</ex:hasFruit>
</rdf:Description>
Then this is translated to:
<ex:basket> <ex:hasFruit> _:l1 .
_:l1 <rdf:first> <ex:banana> .
_:l1 <rdf:rest> _:l2 .
_:l2 <rdf:first> <ex:apple> .
_:l2 <rdf:rest> _:l3 .
_:l3 <rdf:first> <ex:pear> .
_:l3 <rdf:rest> <rdf:nil> .
Now is this list open or closed? This is an important question for OWL, since collections are used to construct sets such as intersections.

If it's open, then I could add in another piece of fruit...
<ex:basket> <ex:hasFruit> _:l0 .
_:l0 <rdf:first> <ex:orange> .
_:l0 <rdf:rest> _:l1 .
This would work, but it implies that I can infer that every element of the list can be directly connected to the basket, i.e.:
<ex:basket> <ex:hasFruit> _:l0 .
<ex:basket> <ex:hasFruit> _:l1 .
<ex:basket> <ex:hasFruit> _:l2 .
<ex:basket> <ex:hasFruit> _:l3 .
Now this makes sense to me, but I don't recall seeing it anywhere in RDF. For instance, it's not in the semantics document for RDF or RDFS. The section on Collections does say that RDF does not require any well-formedness on the structure of the list (indeed, branched structures are explicitly mentioned), but since only OWL-Full allows arbitrary RDF structures, it isn't generally applicable to what I'm interested in.

I'd come to this question while I was checking that an owl:intersectionOf with "complete" modality was necessary and sufficient. I presumed that it was, but it doesn't hurt to check. After all, I've been caught out in the open world before. :-)

I first went to the abstract syntax for class axioms to find out how "partial" modalities were encoded, vs. "complete". The triples encoding of the abstract syntax shows that "partial" is simply a list of rdfs:subClassOf statements for each element in the intersection, while "complete" uses an RDF collection. Actually, the expression "SEQ" is used, but sequences are then described as being of type rdf:List, and not rdf:Seq (which, incidentally, are extensible, but no OWL aficionado will have anything to do with them, so I knew that wasn't a possibility).

Now to make sure that "complete" really is complete, I needed to ensure that lists couldn't be extended.

There is a hint that lists can't be extended in OWL-DL in the OWL Guide:
"If we wanted to add a new winery in some other ontology and assert that it was disjoint from all of those that have already been defined, we would need to cut and paste the original owl:AllDifferent assertion and add the new maker to the list. There is not a simpler way to extend an owl:AllDifferent collection in OWL DL. In OWL Full, using RDF triples and the rdf:List constructs, other approaches are possible."

That raises the intriguing possibility that in OWL-Full an intersection can never be complete. But since OWL-Full is undecidable anyway, I guess that's not something I need to worry about.

That then brought me back to the description for Set Operators, which I hadn't read in a while. And in reading this I realized that I was a moron for forgetting it...
The members of the class are completely specified by the set operation.

The text then goes on to describe that an individual that is a member of each element of an intersection is then a member of the intersection. In other words, membership in each element is a necessary and sufficient condition for membership in the intersection. Had lists been open, then membership would have merely been necessary, but not sufficient, since there could be another class in the intersection that has not been asserted (yet).

So complete is indeed "necessary and sufficient". But if I'd just looked at the Guide in the first place I could have saved myself a bit of time. Sometimes I feel like an idiot... and then I go and compound it by writing about my stupidity on my blog.

Oh well, this SPARQL implementation won't write itself. I'm down to OPTIONAL - which I expect to take about an hour, and the algebra integration. I'd better make that transformation clean, as I expect to be doing it again soon for the Sesame algebra.

Somehow I also need to find some time to finish writing that paper about 2 column RDF indexes. Did I mention that I think they're a cool idea? :-)

Tuesday, March 18, 2008

Functions

Whew! I've finally finished filter functions.

I was just about done when I had two issues show up for me. First off, I realized that each parameter of regex takes an expression that resolves to a simple literal. In other words, it is possible to calculate a different pattern and/or flag for every line <shudder/>. OK, so I wouldn't do it, but the spec says it, so I did it. Not that it was hard. It just seems obtuse.

While I'm on it, the flags for regex don't quite match the flags in Java. Granted, they're ALMOST the same, but if I want to be a stickler about these things, then it's not quite there. The most apparent difference is that the "x" character is not the same as enabling the COMMENTS flag in Java - though it's similar. In fact, in Java 5, the COMMENTS flag does not even appear as an option in the Javadoc, though a quick scan of the library source shows that it is there.

Once I found small differences (which frankly I expected to find) I decided not to look for any more. The point is that I am not going to implement my own regex engine. Sure, it would be a great learning experience (I know that suffix trees get me part of the way - but I'd have to learn some more to get all of it), but it would take me months, and for no useful purpose. I'm surprised they didn't just choose a standard engine and say "use a standards-compliant regex engine, like XXX". As it is, it looks like everyone will be nearly there, but never quite make it.

The next problem was that I hadn't looked carefully enough at the definition of equal. I was mostly right, but it turns out that if you compare two literals that are different, then you don't return false: you throw a type exception. That just feels broken. Yes, I understand the semantics, but it's a perfectly common thing to do to check that two literals are the same. Having unexpected data throw an exception from a perfectly formed query might make the type theoreticians happy, but from the perspective of a software developer it looks like bad judgement.

Ironically, you CAN choose to return true for two different literals if you have a specific extension that handles direct comparisons between their types. For instance, you can check if "5"^^xsd:integer is equal to "5"^^xsd:long. Or perhaps you want to compare "5"^^temp:celsius and "41"^^temp:fahrenheit. If you want to check for the same lexical form, then you use the sameterm() function, so that case is covered. But what if you want to compare two literals for the same semantic value, and simply get false when they differ? Maybe I need to re-read this spec, because it doesn't work for me. Still, I've implemented it as asked, even if it was more annoying to do so.

So now I have a lot of unit tests to write. Yes, I know the TDD purists will be out to get me, but the exact implementation and interfaces were still floating a little when I started, and besides, it is faster to write code with the tests written after. This is mostly because you don't have to change the tests if you realize you need to change the interfaces. And time is something I'm working hard against at the moment.

Filter

Andrae had a go at me for looking to make filters annotations on the constraints in the AST for the query. I didn't see a problem with this (and there is no operational difference) until Andrae pointed out that it would have a big impact on the optimizer and query re-writer, since each node could then have more than one type: a filtered version and an unfiltered version.

He was suggesting that I use the conjunction code to apply filters (and the concrete syntax of SPARQL almost seems to imply that FILTER is added in as a conjunction - though this might just be to allow alternative syntaxes) but I pointed out that this will get awkward as the BOUND() function requires that variables not be guaranteed to be pre-bound. This led to a discussion of the use of BOUND(), and I was able to show that it is often used in conjunction with NOT and OPTIONAL to emulate subtraction functionality. When he saw what I meant, he was quite congratulatory of SPARQL for taking a log(n) operation and making it linear in n.
(For any non-Australians reading this.... yes, that was sarcasm)

At least this conversation made me realize that filtering the output of each Tuple would be a mistake (good thing I haven't written this yet). Instead I'll be implementing FILTER in the AST as a new constraint element that wraps another constraint (this makes it easy for the optimizer and transformer to ignore), and creating a new operation akin to MINUS that will do the work. Currently MINUS removes elements on the left that match (via variable bindings) elements on the right. The new code will remove them based on failing the FILTER test. Simple. :-)
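
As a rough sketch of the idea, dropping rows that fail a filter test (this is not the actual Mulgara Tuples API, just an illustration):
  import java.util.*;
  import java.util.function.Predicate;

  // The MINUS-like operation: instead of removing rows that match bindings on
  // the right, remove rows that fail the filter test.
  class FilteredTuples {
      // Each "row" is a map of variable bindings.
      static List<Map<String, Object>> filter(List<Map<String, Object>> rows,
                                              Predicate<Map<String, Object>> test) {
          List<Map<String, Object>> result = new ArrayList<>();
          for (Map<String, Object> row : rows) {
              if (test.test(row)) result.add(row);   // rows failing the test are dropped
          }
          return result;
      }
  }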

Saturday, March 15, 2008

Writing

I've been trying to sit down and write for over a week, but each time I try I end up writing code instead. I've even fallen behind reading Slashdot. I've been getting a lot of messages from people wanting to know what happened last week, what our plans are for Mulgara, etc, but I just haven't been able to respond. That's what happens when a developer tries to work in the real world. I handle the real world, and I can handle code, but not at the same time. :-(

For the moment, I have priorities with work that I have to see to, so I'll be concentrating on technical things for a while. However, there are a few things happening with Mulgara, so I'll try to mention them as I go. In the meantime, I'm working on SPARQL queries.

SPARQL

The two main features that we're missing now are OPTIONAL and FILTER. Looking at OPTIONAL some time ago I realized that it's a hybrid between ConstraintConjunction (the inner join aspect), and ConstraintDisjunction (matches on the left side leaving unbound columns). I worked on something similar when I did ConstraintDifference a few years ago, so I know that this is easy. Hence, I put this part off until last.

In the last week or so (in between the meeting in San Francisco, and getting a nasty virus) I've been on filters. Right now I'm down to some classes to represent the operator definitions for all the functions like bound(), isIRI() and regex(). I already have the functionality implemented, but you still need to represent it in an abstract syntax if you're going to construct expressions at query time. So it's all just some boiler plate code to represent the parameters and pass the context on down to any variables that need resolving. After that, I'm on to the unit tests. In an ideal world, I'd test everything but in reality I have less time than that. Many of the functions are so similar that I'll just be testing a good sample of each of them.

Looking at the list in the SPARQL definition, you might think that there aren't too many functions at all, but you would be wrong. To a first approximation, many of the functions have to be reimplemented for each type of parameter. I've even gone to the effort of making sure that working on an <xsd:int> returns an <xsd:int> (when appropriate), and that an <xsd:short> returns an <xsd:short>. Since I was already trying to keep floating point numbers and integers apart, this seemed a natural extension. Then I have to consider the bare numbers typed into the SPARQL query, explicitly typed numeric literals, and variables that get bound to numbers during processing. This raises the complexity considerably.

My first attempt had me writing largish methods with copious "if (value instanceof ...)" statements in them. This is clunky and brittle. The moment I went to do it a second time, I decided to throw it out and do it all with maps to functors (where are closures?!?). This actually worked well, and has the advantage of giving short and simple functions, and consistent patterns to follow in implementations. I'd have liked to use generics a little more, but they are really suited to code whose types are known when it is written, rather than code being assembled from a parser at runtime. Consequently, in one class I ended up writing a little Ruby script to generate the series of functor classes I needed for arithmetic operations! Scary, I know, but it works quite well. It was either that or a series of if/then/else blocks taking me down dark passages I never want to enter.

The frustrating thing is that via autoboxing, you can write the same arithmetic over and over again, and have it do different things. For instance, the expression:
x * y
can result in totally different return types depending on whether x and y are Doubles, Floats, Integers, etc. This is common when programming in Java using the native types (like double and int), but there the choice is established at compile time, not while processing a query. That means you want access to every combination of parameter types at run time. This can be done with autoboxing, and defining classes with interfaces that return java.lang.Number. Then the code x*y can be written over and over, and it means something different each time. Java generics are nice, but they are a long way short of C++ templates, a fact especially obvious when you want to use them on native types (along with a hundred other reasons). But generics + autoboxing can sometimes get you some of the way.
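
As a sketch of the functor-map approach described above, using modern Java lambdas in place of the generated functor classes and a simple string key (illustrative, not the actual Mulgara code):
  import java.util.*;
  import java.util.function.BinaryOperator;

  // Dispatches arithmetic on the runtime types of the operands, using a map
  // keyed on the operand classes. The same "x * y" appears over and over,
  // autoboxed to a different Number type each time.
  class NumberOps {
      private static String key(Class<?> a, Class<?> b) { return a.getName() + "," + b.getName(); }

      private static final Map<String, BinaryOperator<Number>> MULTIPLY = new HashMap<>();
      static {
          MULTIPLY.put(key(Integer.class, Integer.class), (x, y) -> x.intValue() * y.intValue());
          MULTIPLY.put(key(Double.class,  Double.class),  (x, y) -> x.doubleValue() * y.doubleValue());
          MULTIPLY.put(key(Integer.class, Double.class),  (x, y) -> x.intValue() * y.doubleValue());
          MULTIPLY.put(key(Double.class,  Integer.class), (x, y) -> x.doubleValue() * y.intValue());
          // ... one entry per combination of numeric types (machine generated in practice)
      }

      static Number multiply(Number x, Number y) {
          BinaryOperator<Number> op = MULTIPLY.get(key(x.getClass(), y.getClass()));
          if (op == null) throw new IllegalArgumentException("Unsupported: " + x.getClass() + ", " + y.getClass());
          return op.apply(x, y);
      }
  }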

OK, so that gave me access to each combination of parameters, but surely there's a better way to do it dynamically? Well, not in Java. The only approaches I've seen in the past either use heuristics to work out which version of arithmetic to run, or else promote everything to a standard type (like Double). The latter has arithmetic problems, and gives an inappropriate type for the result. The former can just be complex to read, write, and verify.

The problem comes back to the CPU having different instructions for the different forms of arithmetic. A compiler has no problem selecting which one to use, because it has access to the entire library of instructions. Conversely, a parser is not expected to have access to all instructions, leading to the problems I'm talking about. So you either choose a subset of instructions to work with (i.e. upcast everything), or else you provide all instructions in a library, and then map the parameters onto the correct instruction - either with a heuristic tree or something like a hash map.

Dynamic languages have a much easier time of it. For a start, they usually have all instructions at their disposal in the interpreter. Many (though not all) of them also simplify their numeric types to only a couple of types. Whatever they use, the poor programming schmuck writing their own interpreter (that would be me) need only write x*y and let the dynamic language developer work out what he wanted. At the very least, we can emit it in a string and do an eval().

Oh well, I shouldn't complain. I have all the functions written out (via Ruby) and a hash map that lets me get what I need trivially. With the exception that there is a lot of machine generated code that looks like the same thing over and over, the whole system comes down to just a few lines of easily verifiable code - which is what I like to see. Following the code path you'll see that any kind of operation just goes through a few steps and it's done.