I've finished the difference operator, and am now part way into the tests for it. This operation has been frustrating, as it required a disproportionate amount of work for a relatively small amount of code. The problem was understanding all of the join code, and understanding the ramifications of apparently simple things. It took quite a lot of effort, including input from Simon and Andrae. It was also difficult making the mental context switch from the algebraic level (higher up in the abstraction stack) to the low level tuple-at-a-time interfaces.
Unfortunately I wrote at length about this last week before losing what I'd typed, and tonight I'm disinclined to go into all the details again. I'll just gloss over things, else I'll never write about it.
I had to look the terminology up, but the left and right sides of a subtraction are called the minuend and subtrahend respectively, and I've used these terms in the code. The idea of the difference operation is to remove all tuples (or "rows") from the minuend, where there is a corresponding row in the subtrahend. In this case, corresponding means that all the matching variables contain the same values.
An efficient way to do this involves sorting of the subtrahend. The code would then iterate over the minuend, and perform binary searches on the subtrahend to determine if each row should be removed.
This efficiency can be improved if the minuend is also sorted in the same order as the subtrahend. If a match is found on a row in the minuend (meaning that row is to be discarded) then a sorted minuend can use a binary search to jump over all the subsequent rows with the same variable values.
The problem with some of the steps in efficiency is the initial cost. Small data sets don't really matter, given that they will never make it to the disk, so it's only the large ones which need to be considered. Sorting large sets will be on disk where it is expensive. This complexity has to be traded off against the complexity of leaving the data unsorted.
In the case of the subtrahend, this tuples will need to be searched for every row of the minuend. If this were performed against unsorted data, then the complexity would be polynomial, which is unacceptable. Sorting can also be polynomial (we use the quicksort algorithm) but this is a worst-case condition, and has very low probability. Quicksort actually averages at n.log(n) which is pretty good. The data will often be sorted already, so a general algorithm which always uses sorted data and only sorts when it needs to, will perform reasonably well.
Also, the nature of subtracting means that our use cases typically have a small subtrahend, in relation to the minuend. This further increases the benefits of sorting the subtrahend.
Conversely, sorting the minuend is not so clear cut. For a start, sorting is only of benefit when there are a lot of rows with the requisite variables set to the same values. This does happen, but not for the majority of rows. Performing a search after every row to find which row that set of variables ends on, changes that part of the operation from a linear complexity up to n.log(n). There are ways to avoid the check every time (move on to the next row, and if it matches the last row then do the search), but there is still an expense. Given the minor improvement that is likely, then the initial cost of the sort would not appear to be worth it. As a result, the minuend is not being sorted.
Ideally we could develop a heuristic to determine when the minuend should be sorted, but this is not trivial, and smacks of premature optimisation. If there is a problem in the future then this could be revisited.
Sorting is not as easy as I think it should be. For a start, I wanted to know when I needed to sort and when I didn't. The only indication of the sort order for a tuples is the
RowComparator, so I thought I would be working with that.
After looking at this interface, I realised that there is no way to query the order that it imposes. There is also no way to compare two separate instances of
RowComparator except by reference. While I understand the philosophy of making this an opaque object, it should at least be able to compare itself to other comparators so they can determine if they impose the same ordering.
Looking at the other comparators in the system, I realised that we only have two types. The first is used for ordering the AVL trees in the native quad store, while the second is used for ordering globalised results alphabetically. Neither of these were useful for ordering by a list of variables, based on local node IDs.
Andrae helped me to discover that I needed to use the
TuplesOperations.project() method to get the ordering that I need. There is a sort method available, but it will sort by all the variables in the tuples (I don't mind which order the variables are in, which is good, because the sort method does not accept an order). So the unwanted variables must be removed before the sort can take place. This is done lazily with the
This operation can result in duplicate rows, so internally the method sorts the results to make duplicates appear consecutively. In this case, the
project will do all of the work.
The main problem with using
project blindly is when the data is already sorted in the correct way. In this case, no operation is necessary, and the expense of the projection should be avoided. This is easy to spot when the tuples to be projected only contains the required variables and no others, but we currently have no way of seeing the sort order when the tuples has other variables as well. This is a failing in the general case for join operations, and will need to be addressed eventually.
In the meantime, the sorting of the subtrahend is determined by a couple of simple steps. If the data has the correct number of variables, then it is checked for a sort order. If there is an order, then it assumed to be correct, otherwise a
sort is performed. If the number of variables is incorrect (ie. there are extra variables which are not found in the minuend) then a
project is performed. This unnecessarily reduces the number of variables, but also performs the necessary sort.
After sorting the subtrahend, the implementation of the
Difference class is relatively straight forward. Like many of these operation classes, the operation is performed lazily.
Almost all methods are passed on to the minuend, as this class represents that parameter, minus a few rows. The one major exception is the
next method. In this case, the minuend's
next method is called, but then the current row on the minuend is tested by checking for those variables in the subtrahend. If the row exists, then the minuend is moved on. This continues until a row is not matched in the subtrahend, or the minuend runs out of rows.
Describing this in hindsight makes it all look simple, but getting here has been anything but. I'm looking forward to getting the tests complete and moving back to the higher level work.
Tuesday, March 29, 2005