Monday, March 14, 2005

Joins
Today was one of those days where you feel like you've been working hard all day, but then there is very little to report.

Today I was looking at the join code... again. I've pulled the file aside, and I've been annotating it with comments. I'm tempted to check this back in, as comments for other people to read. However, I'm not sure that is appropriate, as I've been over commenting the code, and the code is likely to get lost in the noise.

Either way, I understand it a little better now. Once again, I was fortunate enough to have some input from Andrae as I went through a part of it.

I think I should have a better understanding of this code for a couple of reasons. The most pressing is that subtractions will use very similar operations, though the more I read the more I doubted this. Funnily enough, now that I know a lot more, I'm back to believing this again.

The next best reason is because until now, only Andrae and Simon have understood this part of Kowari. I understood the architecture, and the principles in use, but this is a far cry from being able to follow the code. Indeed, even knowing what the code was trying to do, I still needed to annotate it extensively before I could follow any of it.

Having another person understanding this code makes it more likely to be maintained. It also means that there is yet another person who can add features to it in the future, if required.

Refactoring
The first thing I noticed yesterday was that Andrew had refactored the UnboundJoin and IndexedJoin code. He moved all of their common code was moved into a base class JoinTuples. At face value this makes sense, but on further investigation it was a waste of his time:

$ find src -name \*.java | grep -v Test | xargs grep -l IndexedJoin
src/jar/tuples/java/org/kowari/store/tuples/IndexedJoin.java
src/jar/tuples/java/org/kowari/store/tuples/JoinTuples.java
In other words, this class is not used anywhere (except the tests). I believe this class became redundant due to the big join refactoring that went on last year.

This refactor is in now, so I guess it doesn't hurt to leave it. Maybe one day I'll refactor it back the way it was. :-)

Sorting
The most striking thing about this operation is that there is no sorting done on any of the input data. This stunned me, as it means that the algorithm is guaranteed to be inefficient when the data is unsorted. I asked Andrae about this, and it turns out that the answer isn't as simple as I'd have thought.

The TuplesOperations.join() method does all of the work of setting up the parameters for the UnboundJoin constructor. This is so the join code can just do a join, and not worry about optimising its input. I've discussed a lot of this set up work already in the last couple of days.

The last operation in the setup is to call the sortOperands() method. This method does two things. It sorts the constraints according to their result size (which is relatively cheap to obtain), from smallest to largest. This results in fewer iterations during the join. The other thing this method does is to re-order the results of a Tuples, using the defineIndex() method on the Tuples. This is only available to Tuples which are the direct result of a constraint, and are therefore based directly on an index. The idea is to use a different index once some variable bindings have been established on other arguments in the join, and a different index is therefore available.

The result is that the operands for a constraint are usually always in the required order for the join to proceed efficiently. The only way this will not happen is when one of the operands is not based directly on an index, but is the result of an append operation.

While this situation is plausible, it almost never happens in practice. Typically, a query is built as a sum of products, and the products are the joins. While it is possible to perform an append somewhere down the line, and join its results, this just never seems to come up.

If it becomes a problem then there would be two ways around it. The first is to reorganise the query into a sum of products form. This would perform some joins redundantly, but quickly. The second way is to find any compound parameters in a join, and wrap them in a HybridTuples for re-ordering. This would wear an initial cost, but would end up being much faster during the join operation.

An appropriate feature to add in this situation would be an algorithm choice based on complexity analysis of the unordered join vs. preordering. Plugging in some of the numbers obtained from tuples sizes would tell the algorithm which way to proceed.

Fortunately, there are no use cases for this at the moment, so there is no pressing need to perform any of this.

Subtraction
The details of the join implementation have definitely provided me with a lot of clues as to how I should go about the subtraction procedure. I still have a few things I want to sort out, so I'll probably work on this a bit overnight before trying to write up a description.

No comments: