Thursday, March 10, 2005

Friday
I'm writing up Thursday's news and views on Friday morning because I couldn't really do much typing last night. I've picked up a stomach bug, which has slowed me down a bit (the joys of having a child in daycare), so I probably won't get a lot done today. That's not so bad, because over the last few weeks I've been working late every day, and a lot on weekends as well... a point that Anne has been making recently. :-) A day to recuperate might be nice.

Subtractions and Joins
The join code caused me a lot of angst yesterday. I haven't worked in this layer very much, and not for a long time. So even though I understand the principles it was unfamiliar. It turns out that it is unfamiliar for Simon as well, though he originally wrote the join code.

Since the principles of a subtraction are very similar to a join, thought I should start there. This starts with TuplesOperations.join(). Of course, none of this is documented in the least, so working through it is a bit of a chore. I'll start with a cut down listing:

public static Tuples join(List args) {
List operands = flattenOperands(args);

List unified = unifyOperands(operands);

List sorted = sortOperands(unified);

switch (sorted.size()) {
case 0:
return empty();

case 1:
return (Tuples)sorted.get(0);

default:
Tuples result = new UnboundJoin((Tuples[]) sorted.toArray(new Tuples[0]));
closeOperands(sorted);
return result;
}
This has all the debug and exception code removed.

The function takes a list of Tuples. This is for optimising on the associative property of AND so that groups of AND operations like (a AND b AND c) can be calculated as a single operation, rather than the pair of operations ((a AND b) AND c).

The flattenOperands() method deals with the associativity. I wasn't initially sure what the "operands" were supposed to be, though I guessed that they were the incoming Tuples in the args list. To confirm this I looked at the method:
private static List flattenOperands(List operands) throws TuplesException {
List result = new ArrayList();
Iterator i = operands.iterator();
while (i.hasNext()) {
result.addAll(flattenOperand((Tuples)i.next()));
}
return result;
}
So it iterates through the argument list, and flattens each Tuples, before adding it to the flattened result. OK, so what does flattening a Tuples do?
  private static List flattenOperand(Tuples operand) throws TuplesException {
List operands = new ArrayList();
if (operand instanceof UnboundJoin) {
Iterator i = operand.getOperands().iterator();
while (i.hasNext()) {
operands.add(((Tuples)i.next()).clone());
}
operand.close();
} else {
operands.add(operand.clone());
}
return operands;
}
So if the Tuple is an UnboundJoin then each of its "operands" are added to the flattened list; otherwise the original Tuple is added.

So what is a Tuple's list of operands? The Tuple argument for this function is called operand, so are a Tuple's operands other Tuples which are related to it in some way? Checking out the Tuple interface tells us:
  /**
* Return the list of operands to this tuples.
* This is intended to allow debugging traversal of an unevaluated tuples Hirsch.
* Be aware that the tuples returned from this method are not cloned, and should
* be considered immutable.
*/
public List getOperands();
Of course, the word "operand" is not mentioned anywhere else in the interface.

At this point I could either go hunting for the implementations of getOperands, or make a guess based on what I know of the system, and check it with someone online... and Simon happened to be online.

We have a few Tuple types which are virtual, instead of being "materialised" on disk. These Tuples are based on other Tuples, and perform merges dynamically. For instance, the UnorderedAppend simply holds its component Tuples, and while being iterated over, it will internally iterate over each component Tuples in turn. Joins work similarly to this, but they require the underlying Tuples to be ordered according to the variables being used in the join, so they can be iterated upon appropriately. If the Tuples are not in the correct sort order, then they have to be materialised (ie. have their own internal structure that represents the Tuples, rather than relying on other Tuples to do it for them. If this is large enough it gets pushed to disk) and then re-sorted.

I reasoned that a Tuple's "operands" might be the underlying Tuples which a virtual Tuples uses. This makes sense when viewed in the context of the flattenOperand method, as it asks for the operands, and adds them to the list, instead of the original Tuple. Note also that the operands of a Tuple are only retrieved when the Tuple is of type UnboundJoin. In other words, it only flattens the operation for joins, and not appends.

Appends could also be flattened in this way, and they possibly are, but this code is specifically for flattening out joins. Any equivalent code for append operations must be elsewhere.

I asked Simon about all these operations, and he wasn't sure either, but he agrees with all of the above, and offered a couple of pointers for me along the way.

While struggling through it all, Simon pointed out that I'm looking at the most obscure code in the system. Any documentation that it has is most likely out of date, and method names are often unintuitive, so it can be very difficult to decipher. At least I don't feel like an idiot after he told me that.

Unifying and Sorting
The next method in join() is unifyOperands(), followed by sort(). The first one I don't quite get. Simon's explanation seemed to involve a partial resolution of the join, and I know that is supposed to happen later on (if Andrae ever reads this then perhaps he can put a comment in below to explain it). However, the sort operation is a part of the optimiser. This method puts the list into an order that will result in the minimal processing of Tuples. For instance, if the first Tuple has 1000 rows, and the second has a matching 1000 rows, and the third has only 2 rows, then the join will involve matching the first 1000 against the second 1000, and trying (unsuccessfully) to matching against the remaining 2 rows. If the order were reversed, then when the first 2 rows are matched against the 1000 rows, then the result is likely to be very small, maybe 10. This can then be matched against the next 1000 row tuple to come up with the final result.

This ordering of tuple joins has a significant impact on performance. DavidM and I built a heuristic analyser that found the most common cases and optimised for them, but we needed time that we were never given to build a really good optimiser. A couple of years later Simon refactored it, and improved the system significantly. More recently, Andrae has cleaned it up, resulting with we have today. Apparently, the unifyOperands method is involved with this process, but I don't quite understand how.

However, all of these operations are for optimising joins, based on two properties of join operations: joins are commutative and associative. Neither property is applicable to a difference operation, so they can all be skipped.

Finally, the join goes through the switch statement. This part is actually pretty easy. If there are no items in the list of tuples to be joined, then the result is empty. I believe that this can probably be replaced with an assertion, as I don't think that it is possible to be trying to "join" nothing. Similarly, if the list is only one item long, then that item can be returned with no further processing. Again, I wonder if it is possible to even get here if a single argument is passed through?

Finally, if there is a valid list of Tuples to be joined, then these are passed off to the constructor for an UnboundJoin object. This is the class that does the actual joining, and all of the preceding code simply optimises its arguments so it will be executed as efficiently as possible.

I've started to go through this class, but I'm despairing that I can borrow the code I'd hoped to use. It really is very specific to joining. This makes sense because of the optimisation work that has gone into it. At least it gives me some ideas for working on the subtraction code.

It's a good thing I budgeted a good amount of time here.

No comments: