Monday, April 04, 2005

Sorting
What is wrong with sorting?

Debugging the difference operation has been a frustrating process, not least because it appears to be working. To start with there were a few problems, but I tracked these down quickly, and resolved them all. But then the unit tests started to give me some strange results.

The first thing I found was that the result of a difference had a null set of variables. This didn't make sense, as it was just returning the variables of the minuend, and the difference operation would have thrown an exception if the minuend had no variables.

After trying to work out what was wrong with the Difference class for way too long, I suddenly realised that the object I was looking at was not the result of a difference, but the sorted result of a difference. So I looked at the unsorted result, and sure enough everything was correct. As far as I can tell, the Difference class is working perfectly well.

So where is the problem? Well sorting does not use a RowComparator (like the lexical sorter does), but is based on how the variables line up with columns from the underlying indexes. This will not work with a Difference class, as defined by its interface, as this technique is presupposing knowledge of the internal workings of the class.

So there are two choices here. Either I create a RowComparator and use the other sort method (the one used for lexical sorting of the output to the user), or I drop the principle of information hiding, and allow sort to have special knowledge of the Difference class. I fear that the latter is the way to go here, but it bothers me, as it will not work on a different (and more efficient) implementation of the difference algorithm.

For the moment, differences are calculated lazily by iterating on the minuend, and filtering based on what is found in the subtrahend (that's a simple description of a lot of work!). So structurally, the result should resemble the minuend in almost every respect. Many of the Difference methods reflect this, in that they simply delegate their work to the minuend.

Given this, I have not been able to work out why the sorting functionality does not just work on the minuend part of the Difference object. This won't work if the Difference implementation is ever updated to sort the minuend as well, but since that is not required in this implementation there should not be a problem.

A cursory glance at the sort didn't reveal anything, so I'll be going through it in detail tomorrow.

In the meantime, I altered the tests to check for the unsorted data first. So far everything has passed, which gives me confidence that it is all working (though I still need to check the unbound variables tests more carefully). Once "sorting" has been resolved I should be able to get out of this code.

No comments: