Saturday, February 24, 2007

Zarro Boogs Found

At the start of this week I tried to integrate several patches submitted to Jira. Unfortunately, there were two impediments to this.

The first problem was simply because XML output has now been configured to provide datatypes and language types to any literal that defines them. This makes a lot of sense to me, and I was happy to integrate it. Unfortunately all the scripted tests are looking for exact textual matches on the output, so many of them are suddenly failing. This is trivial but time consuming to fix. As I write this, I'm running tests over my latest attempt to squash all these bugs. I hope I finally got them all.

The other problem seemed to be easier, but was far more insidious.

I was getting an iTQL error that I didn't understand. The stack trace in the log did not take me far enough, so I added in some more logging. The next time through the tests, the section I was analyzing passed, but something else failed.

This went on for a whole night. Every time I thought I was getting closer, the failing code would move. This generally indicates threaded code with a race, but I couldn't imagine where. I finally tracked one instance of the bug down to a transaction exception, which was the point where I decided I needed to talk to Andrae.

String Pool

In the meantime, Andrae had decided to take on the hard task of the fixing a problem that had been reported with the string pool. I had so much on my plate that I wanted to avoid it, though I knew it was a significant issue, and would compromise confidence in the project.

Since I needed Andrae's help I offered some description on the structure of the string pool. However, I quickly realized how lazy I was being, so I decided to look for myself.

Andrae had managed to narrow down the problem to a very specific behavior. This made it much easier for me to think through exactly what was happening, and how the system had to get there.

The problem was due to an entry in the AVLTree, with no corresponding entry in the IntFile. To understand this, I need to explain some of the basics of the string pool.

String Pool

The string pool gets its name from the fact that it used to store strings in it. The strings were either URIs or string literals for RDF. However, it didn't take long before we needed to store other data types as well, starting with doubles, and eventually moving on to the entire set of XSD datatypes. By this point, we had a lot of code using the name "String Pool", and we were all used to referring to it by that name, so we kept calling it that, despite the fact that it holds much more than strings.

The purpose of the string pool is to map these data objects to a unique number representing a node in the RDF graph, which we call a graph node, or gNode. We also need to map from a gNode back to the data, so we know what that node is supposed to be representing.

The string pool is based on two main structures, with each structure providing one of the two mappings we need.

The first is an AVLTree, which is used to sort all entries (of all data types). This is used to look up data (strings, URIs, numbers, dates, etc) to map it back to a gNode (a "long"). It is also used to find ranges of data, such as all URIs beginning with the rdf domain, or all dates in a range. This tree is phased, in that it contains multiple root nodes, with each root corresponding to a phase. Only tree nodes uniquely belonging to the latest phase can be written to. Older phases are for reading data only, and each phase is associated with Answers obtained when that phase was the most recent. (This is why Answers must be closed - we can't mark nodes from an old phase as free until the Answer no longer needs them).

The other structure is called an IntFile. This is an array integers, which we use to hold an array of string pool entries. These entries each contain exactly the same information as the nodes in the AVLTree (laid out in the same way). The entries are indexed in the array by gNode value. Each entry is 72 bytes, so a gNode ID of 100 would correspond to data starting at byte 7200.

Importantly, this array is not phased. Once a gNode gets allocated in a phase, it represents the same data in all phases. The gNode can only be reused after the data has been deleted, AND all phases that have ever referred to that gNode have been closed.

So we have a tree which maps from data to gNode (the "localization" operation), and the array which maps from gNode to data (the "globalization" operation).

Out of Synch

Back to the problem: we have an entry in the tree that is not in the array. Now the methods that put data in the string pool put that information into both structures at the same time. The structure of this information is identical for both structures (the array doesn't have to store the gNode, since the location of the entry gives this). For a gNode to come up as a "blank node" means that the entry in the array is full of zeros, meaning it hasn't been written to.

The problem is trying to work out how the same data can be written into two files, and only show up in one of them.

Data written in a phase tree is really irrelevant until it is committed. Committing means writing the location of the new root to a separate file (called the metaroot file). So if we see data in the tree, then the tree MUST have been correctly committed to disk. The operation of committing means "forcing" all data to disk in a consistent way. We do this by forcing the tree to disk (if this fails partway through, then the old tree data is still valid), and only once this is finished can we save this new root to the metaroot file (making this new root valid in future). If you look at the prepare() method, then you'll see this data being forced to disk, and only then is the metaroot file updated and forced to disk.

So how can this AVL tree get ahead of the array in the IntFile? This could only happen if the IntFile were not forced to disk when the tree was forced to disk. This would mean that while the data was "written" to the array, it would not have arrived on the disk yet, even though it was committed to the tree. This would be bad, as nothing should be "committed" until it is completely consistent on disk.

So the solution was to find if there was any code path where the tree were forced to disk and the array were not. Looking in the string pool, it turns out that the tree is only forced to disk in one place, and the array is never forced to disk! So this must be the source of the problem.

Write behind at the OS level is generally a safe operation, even when a process is killed. However, it is possible to call write() and have things die before the data makes it to disk. (The libc libraries can do write-behind, the OS could die in some way, power gets pulled, etc). So the lack of synchronization is unlikely, but possible. This is why we've never seen it before now.

The solution is simply to ensure that the IntFile is completely written, before any data relying on that file is consistent enough to be committed. This means we just need the single line before updating the metaroot file:
 gNodeToDataFile.force();

Tests

I believe the bug could be reproduced by inserting a large number of unique literals into the store (sufficient to overflow the caches and force the JVM to buffer IO); followed by a commit(); and then kill -9'ing Mulgara immediately the commit returns.

Andrae's suggestion here is an attempt to fill the write-behind buffers, committing the data in an inconsistent state (since there is necessary data in the write-behind buffer), and then killing the write operation before these buffers can reach the disk.

Since we're trying to prevent the IO operations from operating correctly at the libc/OS level, it's not a problem that is suitable for running in JUnit!

So we're confident we've found the problem, but consistently reproducing it is not
trivial. This makes it similarly difficult to properly test that it's fixed.

I'm the one who understood the string pool structure enough to look for this and find it, but I wouldn't have even been looking in the right place had Andrae not taken me there. This story got played out in a similar way again the following day.

Transaction Exceptions

The String Pool bug was a vital one to find and fix, but it was extremely rare, and really hadn't affected us before this first bug report. It is a testament to how rare the bug is that this was our first report of it in the nearly 6 years that it has existed! However, I still had my transaction bug to find, and I was getting this one all the time.

I had managed to create a test case that was guaranteed to fail for me. My code performed a simple query (a duplicate of one that usually failed in our tests), and then iterated through the Answer, printing the results. However, before iterating, I did exactly what ItqlInterpreterBean does, and checked isUnconstrained() first (in which case the return value should be true).

This almost always failed, but I soon discovered that if I did the query (successfully) in the iTQL shell, then it would sometimes let my test code run correctly. My first approach was to see the difference between my code and the iTQL shell, and I finally realized that the shell would always call beforeFirst() on an Answer before check isUnconstrained(). Doing this in my sample code also fixed the problem. But why?

The exception I was getting was always a MulgaraTransactionException. The problem was happening when the transaction manager detected that an operation was being performed by a different thread to the one that had just registered itself for performing the operation. But I didn't know why this was happening.

While discussing what was happening, I started to gain an appreciation for what was happening inside the transaction manager. Andrae and I (mostly Andrae) even managed to spot some minor problems, which we could clean up. If nothing else, it was helping understand this new part of the system.

Transactions are the area Andrae just spent a couple of months working on, so I asked him to look at it. This was when I discovered that Andrae wasn't seeing these problems at all! I'd already established that a thread race was going on, and it seemed that my Core Duo (unfortunately, it's a 32 bit Core 1, and not the 64 bit Core 2) was demonstrating sufficient difference to a single cored machine to demonstrate the bug. So I gave Andrae an account on my machine, and went to bed.

RMI Threads

The next morning Andrae had some news for me. The transaction was being thrown correctly. For some reason an AnswerPage object was accessing an Answer using a different thread to the one that was asking for isUnconstrained() on the same Answer.

I was confused, as I thought that these two read-only operations would be perfectly safe operating in separate threads. Andrae explained that the semantics of an RMI connection like this required that only one thread be accessing the object at a time. Once he told me this, I immediately understood why, but it had never occurred to me before then.

The purpose of an Answer page is to package a page of Answer data, compress it, and send it over RMI to the client, all completely transparently to the client. This saves the client from making an expensive RMI call every time it needs the next row of an Answer. This is only ever performed on large result sets. The problem was that this would result in a pause for the client every time it ran out of the current page, and it needed to wait for the next one. The solution was to get the next AnswerPage in the background. In other words, they were supposed to operate in a separate thread. I have a blog entry on when I wrote this, back in 2004.

In general operation, using an Answer never resulted in two threads trying to access the server at once. The initial call to beforeFirst() would set the paging thread off, and pretty much all other calls would access local data. The only exceptions (like getRowCount()) were never going to be called while in the process of iterating over the Answer. So while I had not been aware of this issue with multi-threading over RMI when I wrote the code, the operation of the class would not have caused a problem anyway.

This all changed a few months after Answer paging was written, when isUnconstrained() was introduced. This indicates a "match" or true value for a query, even when there are no variable values to be returned. So before iterating over a result set, a client should always check for the value of isUnconstrained(). Unfortunately, the first thing a large Answer does after construction is to build start the paging thread. If isUnconstrained() is called immediately after construction of the Answer, then this will conflict on the server.

There are a few things to note here. A small Answer set will never be paged, so this bug will not manifest for Answers of less than 100 lines (or whatever mulgara.rmi.marshallsizelimit is set to by the user). Also, the form of a query will indicate if isUnconstrained() needs to be called, and only short, non-paged Answers will need this method called. Usually, the only method that will ever be called while paging Answers in the background is the next() method. However, ever since isUnconstrained() was introduced, generic code for handling and printing Answers, such as what is found in ItqlInterpreterBean, will always need to call isUnconstrained() no matter the size of the answer set. In fact, this must be the cause of the "Concurrent access exception" errors that have been reported, and I never understood.

Andrae's new transaction code was written specifically to prevent inappropriate access like this. So by throwing this exception it is doing its job perfectly. It's a godo thing he wrote it, or else this bug would have been around for a lot longer!

Once Andrae explained to me that we now had a situation where two threads could try to speak to the server at once, I realized I had to prevent both threads from operating on the server at the same time. The background thread is only ever started synchronously, in response to a call to the constructor, beforeFirst() or next(), and it ends asynchronously. On the other hand, the other methods which use RMI are all started and finished synchronously with respect to the rest of the client system. The way to handle this then is for any potentially conflicting methods to wait until a page returns.

Fortunately, the mechanism for this page return is already in place. The method is called waitForPrefetchThread(). This is the ideal method to use here, since it manages timeouts, throwing an exception if the server takes too long. All we had to do was put this method into the start of every method which makes an RMI call. It only took a few minutes to add the code, and then for the first time in a long time, every test passed!

Like the first one, this bug has been around for a while. It would also have taken much longer to be found if it were just Andrae or just myself working on it. It's a good thing we're both on it at the moment. I'll have to try to make more time available when we can both be online at once.

I hope Andrae is feeling good about these bug fixes, because I sure am!

No comments: