Tuesday, January 31, 2006

1 Day Processing
As expected, bureaucracy is alive and well at the US consulate. Since we were told that processing would take 1 day, we thought that flying 9 days later would be safe. Unfortunately, this was not the case.

The original plan had the movers packing for us tomorrow, moving us out on Thursday, and us flying on Friday. However, I've been putting off canceling utilities, paying for the flight, and confirming the movers until the visas arrived. We expected the paperwork yesterday (Monday), and accepted that they might get delayed a day (Today), but still nothing is forthcoming. So once 5pm rolled around we called everyone up and postponed until next week.

Other than the annoyance of waiting longer before we travel, the postponement did reduce some of the pressure initially (especially for Anne). However, it has caused a whole new set of problems. For instance, our mail starts being redirected this Friday, and changing that requires 5 days' notice (along with a fee). If the visas haven't arrived before the redirection starts, then they won't get here at all! This is because they are being sent via the Australia Post overnight courier service, which shows up with the mail.

I don't think Express Post items get redirected like normal mail does. Instead, they will be returned to the sender - in this case, the consulate. I think I'd have more luck getting the documents if they followed the redirection to Chicago.

The movers are booked solid next week, so they're coming tomorrow anyway. I'm not sure how we'll get by with everything packed up: plates, cutlery (or flatware), summer clothing, etc. But we'll have to make do.

In the meantime I'm trying to work, but my productivity is steadily declining this week (it wasn't too bad yesterday, wasn't too good today, and I expect it to be non-existent by tomorrow). As such, I don't think I'll be writing much on technical topics for a few days.

Friday, January 27, 2006

Visas
Anne, Nic and I flew to Sydney on Wednesday to visit the US consulate. This was to present our paperwork, and attend a 5-minute interview in order to get our visas. The rules said Anne had to be there, and Nic naturally has to be with Anne. Luc got to stay with his grandparents for the day. Refer to Anne's entry for his day. (That reminds me... I need to tell her to use JPG instead of PNG. Even with low compression, the photo is 1/4 the size.)

I can only presume everything went well, as they said we would have our paperwork back in a couple of days. This is sure to be delayed an extra day with the Australia Day holiday on the Thursday.

I can't say I enjoyed the process very much. We had to be up at 4:30am in order to get our flight, and we needed to go through security at the consulate early enough to make our "appointment", which was set at 10:30am. What really happened was that we cleared security with 5 minutes to spare, and then waited in line for several hours before being called up. At that point our documentation was inspected and filed, and we were asked to take a seat again, until the interview.

By this point we had been in there for several hours, having drunk a couple of coffees to help deal with the lack of sleep (Nic is a good sleeper, but he's still a 10 week old baby and needs feeding overnight). We were in a room where all the staff were behind thick glass (why are they so paranoid about security then?) and the only way in or out was a thick security door. The nearest accessible restrooms were 59 floors away, and on the other side of security. This made for a very unpleasant wait.

Babies being as they are, Nic needed changing after a couple of hours, so Anne took him downstairs and back through security again. Coffee being as it is, I was obliged to follow soon after. I was lucky, since I got back just in time for the interview. Five minutes and $520 later, and it was all over. We grabbed a bite to eat, caught a train back to the airport, waited on the tarmac while they fixed a faulty fuel pump, and came home.

Time from leaving the house to getting home again? 14 hours. What a waste of a day.

Monday, January 23, 2006

Syntactic Sugar
I really like some of the new language features in JDK 1.5. Generics, autoboxing, and annotations can make code easier to write and maintain. However, Java has implemented a number of these simply as syntactic sugar, rather than as a real change to the environment. This was important for maintaining compatibility, but sometimes it gets annoying.

While looking for a bug today (it turned out to be a typo that happened to compile) I decided to print out Unicode data held in various forms. These included a char[], a Character[], and a List<Character>. The code to print these used the new for-each looping construct:

  void print(char[] c) {
    for (char x: c) System.out.print(x);
    System.out.println();
  }

  void print(Character[] c) {
    for (Character x: c) System.out.print(x);
    System.out.println();
  }

  void print(java.util.List<Character> c) {
    for (Character x: c) System.out.print(x);
    System.out.println();
  }
Notice how the difference between each method is trivial? If Java used templates in a similar way to C++, then this would be one method. Yes, I know the compiler would still create 3 methods in the binary. But why would I care about code redundancy on that scale when Universal binaries make duplicates of every piece of executable code? These days we need space for data, and almost never for code (even embedded applications typically have megabytes available to use).
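
Generics do get partway there. A single generic method can cover the two object cases, but primitives are excluded from generics, so the char[] overload still has to be written by hand. A minimal sketch (the method names are my own):
  static <T> void print(Iterable<T> items) {
    for (T x: items) System.out.print(x);
    System.out.println();
  }

  static <T> void print(T[] items) {
    print(java.util.Arrays.asList(items));  // works for Character[], but not char[]
  }
So two of the three methods collapse into one, but char[] is still left out in the cold.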

So generics are nice, but when they can't be used in methods like these, they feel too restrictive.

The other thing that bothers me is that the only way to describe management of an array is to create static methods that take arrays as parameters, and/or return arrays. Java arrays have no inherent extensibility. Sometimes I find that it would be useful to complement certain classes with another class that holds an array of the first class. Unfortunately, these classes are not compatible with real arrays. At least lists are there to take up some of the slack, but ArrayList can only get you so far.

I'm probably just a little sensitive to the inflexibility of arrays, as I've been (finally) learning some Python, where life is definitely better for arrays. I therefore found it amusing that one of the responses to the sort/unique problem of the other day came from Stephen, who showed off his 1337 h4x0r skillz with Python to solve the problem in 2 lines using built-in language features. Maybe I should be moving to Jython. :-)

Family Life
Anne has recently decided to put up a web page on our domain. She's using iWeb and I've been impressed with how quickly she was able to put together a site that looks reasonably attractive. I just hope it doesn't look like the web site of everyone else who uses iWeb. :-)

This will only be of interest to people who know us personally, as it is very focused on our family. I have no control over content, though I have been encouraged to read it myself. I think Anne just wants someone to proofread. ;-)

Thursday, January 19, 2006

Unicode
In my last post I was converting arrays of char into arrays of java.lang.Character. However, under Java 1.5 this runs into a new problem.

In Java 1.4, the Character class was based on Unicode 3.0, which defined Unicode characters in a 16-bit space. In Java 1.5 this has been revised to Unicode 4.0, which defines Unicode in a 21-bit space. Unfortunately, the 1.4 implementation of Character wrapped the 16-bit char native type, and is not extensible enough to deal with Unicode 4.0.

Instead of creating a new class to deal with this, Character has gained a new set of static methods to work with the larger characters. The result is a little messy.

Character now works with 32-bit integers, which it refers to as CodePoints. Most of the static methods which take a char have been duplicated to accept an int as a CodePoint. This all works from the perspective of the static methods, but there is no longer a class that represents a complete Unicode character. The closest available is java.lang.Integer, which is fine for storing the value, but has no support for character operations.
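
For example, here are a few of the paired methods on the Java 5 Character class (a small sketch, using U+1D11E, the musical G clef, as an arbitrary supplementary character):
  char c = 'A';
  int codePoint = 0x1D11E;  // outside the 16-bit range

  boolean b1 = Character.isLetter(c);          // the old char form
  boolean b2 = Character.isLetter(codePoint);  // the new int (CodePoint) form

  char[] utf16 = Character.toChars(codePoint); // a 2-element surrogate pair
  int size = Character.charCount(codePoint);   // 2 for supplementary characters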

I first realized this problem when I tried to convert the result of String.toCharArray() into an array of Character. The new Character class has a static method for reading a Unicode character from a char[] that holds characters in UTF-16 format. This made me wonder if the char[] from String.toCharArray() was actually being returned in UTF-16. This isn't explained explicitly, but the documentation for String indicates that the result is in UTF-16.

UTF-16 refers to an array of char where most Unicode characters (or CodePoints) fit into a single char or Character. However, if a char falls into a particular range, then it is not a complete Unicode character on its own, but is instead merged with the following char to form a full 21-bit value. A pair of chars like this is called a surrogate pair.
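
To make that concrete, here is a small sketch using the same supplementary character as above:
  String s = "\uD834\uDD1E";  // U+1D11E, stored as a surrogate pair
  System.out.println(s.length());                                    // 2 chars, 1 character
  System.out.println(Character.isHighSurrogate(s.charAt(0)));        // true
  System.out.println(Character.isLowSurrogate(s.charAt(1)));         // true
  System.out.println(Character.toCodePoint(s.charAt(0), s.charAt(1)) == 0x1D11E);  // true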

Since an individual char is no longer guaranteed to represent a complete CodePoint, it is not possible to sort an array of char or Character as previously described. Instead, each character has to be decoded into a CodePoint first, and the array of CodePoints can then be sorted. Only there is no such class as CodePoint, so an array of Integer has to be used instead.

But what about the method to convert a char[] to a Character[]? This now needs to be updated to convert a UTF-16 char[] into an array of objects which are large enough to handle full CodePoints. There also needs to be a corresponding method to convert back to a UTF-16 char[]. With all this work, why not create a new class for CodePoint, which includes these methods, plus all the methods appropriate to manipulating CodePoints?

Again, all the iterative code can be encapsulated in methods to hide the details, but now these methods can be attached to the object in question. This time the code is more complex, as it needs to test each character to see if it is part of a surrogate pair.
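
The conversions might look something like this (a sketch only, using Integer in place of a real CodePoint class; the method names are hypothetical):
  // Decode a UTF-16 char[] into one Integer per Unicode CodePoint.
  static Integer[] toCodePoints(char[] utf16) {
    java.util.List<Integer> points = new java.util.ArrayList<Integer>();
    int i = 0;
    while (i < utf16.length) {
      int cp = Character.codePointAt(utf16, i);  // merges a surrogate pair if present
      points.add(cp);
      i += Character.charCount(cp);              // advance by 1 or 2 chars
    }
    return points.toArray(new Integer[points.size()]);
  }

  // Encode an array of CodePoints back into a UTF-16 char[].
  static char[] toUTF16(Integer[] codePoints) {
    StringBuilder sb = new StringBuilder();
    for (int cp: codePoints) sb.append(Character.toChars(cp));
    return sb.toString().toCharArray();
  }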

I've coded most of the CodePoint class, but I still have to bring over a few of the static methods from Character and make them instance methods on CodePoint. There are a LOT of methods, so I don't see myself writing tests for all of them. Maybe I'll have a go if other people are interested.

This may be overkill, given that only a few languages need characters which fall into the surrogate range. All Latin-based languages are completely covered by char and Character. However, all it would take would be a single extended character to break an application which didn't handle all CodePoints. We've had comments about appropriate Unicode support for Chinese characters in Kowari before, so it's a legitimate concern.

Loops Considered Harmful
Yesterday someone from work told me that he was trying to take all the characters in a java.lang.String, and return a string containing a single instance of each of those characters, in sorted order.

He had some straightforward code to do the work, but it used a boolean array sized for ASCII, so it couldn't handle Unicode characters. In my cleverness, I wanted to show him how to do the work in 2 lines of code, only I immediately started running into limitations within Java.

Why two lines? Well, it should be possible to put the contents of a string into an object that can sort them and remove duplicates, and then pull this data back out and put it into a new string. Learning Lisp, I've come to expect the sort of functionality needed to do this simply and easily. There shouldn't be a need to iterate over the characters, recording the ones in use, and manually sorting. The data being manipulated should be a single "entity", rather than arrays of elements that require iterative techniques to manipulate. This functionality should all be available, making the final function a trivial call to just a few methods. As Abelson and Sussman said, "Programs must be written for people to read, and only incidentally for machines to execute." Iteration, sorting, and other mechanical operations should all be services found in lower levels of the code, and abstracted away.

Incidentally, this is exactly the sort of thing that is done all the time with the STL in C++. Because all of the collection classes (and arrays) meet the required interfaces, almost any data type can be used in any position. All you need here are algorithms for sorting and removing duplicates, feeding in a string and getting one out the other side. Admittedly, C++ contains language features which allow significant abuse (and I've seen it abused quite badly), but it also allows for some very elegant code. Since C++/STL is an "unsafe" language offering some of the functionality you expect to see in Lisp, this brings to mind the quote from Philip Greenspun: Any sufficiently complicated C or Fortran program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp.

Back to Java: the obvious way to handle a problem like this is with a java.util.SortedSet (the only implementation of this in java.util.* is TreeSet). I just had to get the characters out of the string, insert them into the SortedSet, and get them back out again in String form. You'd think this would be easy, but I encountered several problems, exposing shortcomings in the Java libraries, and even in the Java language itself.

The String, the characters retrieved from it, and the sorted set of characters should all form "entities", which are simply different representations of exactly the same data. Since Java and its libraries expect each of these representations at different times, it should be trivial to go from one representation to another. However, this isn't the case.

Strings can provide their contents as an array of char. However, SortedSets need to take objects as their basic elements. (I understand why the language is built this way, but it still frustrates me that I can't declare a SortedSet<char>.) The object that wraps a char is java.lang.Character, so the array of chars needs to be converted to Characters before they can be put into a SortedSet. Also, since we want to treat the data as a single entity, we want to avoid iterating through the data, and instead insert the entire entity into the SortedSet. This means converting a char[] into a Character[].

Autoboxing
While Java 5 now supports autoboxing between primitive types and their wrapper objects, there is no way to convert an array of a primitive type into an array of the wrapper objects. Since arrays are a part of the language, and not an accessible class, there is no way to extend arrays to do this either. Instead, a loop must be written to iterate through one array, converting the elements into the types of the other array. This is an easy thing to do, but it's the kind of fundamental operation that should be found in a library, rather than being re-implemented by every developer.

So the first thing I had to do was write a method to autobox a char[] into a Character[] and another method to unbox it back again. These are quite trivial:

  private static final Character[] toCharacters(char[] chars) {
    Character[] characters = new Character[chars.length];
    for (int i = 0; i < chars.length; i++) characters[i] = chars[i];
    return characters;
  }

  private static final char[] toChars(Character[] characters) {
    char[] chars = new char[characters.length];
    for (int i = 0; i < characters.length; i++) chars[i] = characters[i];
    return chars;
  }
This creates an entity that holds all of the characters in java.lang.Character objects, rather than the char values that come from a String. Unfortunately, Java arrays are not compatible with the collections classes (unlike in other languages, such as Python, or C++ when using the STL), so the array needs further conversion still. This time there is a utility to convert an array into a collection-compatible object, called java.util.Arrays.asList(). So now I can finally put the string into a SortedSet:
  SortedSet<Character> charSet = new TreeSet<Character>(
      Arrays.asList(toCharacters(inputString.toCharArray())));
The java.util.Collection interface also specifies the toArray() method for converting the collection into an array, so it is possible to move back this way as well. However, the String class still needs to work with char rather than java.lang.Character, so the array unboxing method above is needed to convert this way:
  public static String sortedWithoutDuplicates(String inputString) {
    SortedSet<Character> charSet = new TreeSet<Character>(
        Arrays.asList(toCharacters(inputString.toCharArray())));
    return new String(
        toChars(charSet.toArray(new Character[charSet.size()])));
  }
It's still disappointing how ugly this looks at the end. The first line has two more method calls than it should need:
  1. toCharacters() to convert the char[] from inputString.toCharArray() to a Character[].
  2. Arrays.asList() to convert the Character[] to a List<Character>.
Instead, it would be nice to see constructors for collections accepting arrays. Since the language does not support array boxing/unboxing, it would also be good to see a method like String.toCharacterArray(). With just those two changes the first line would instead be:
  SortedSet<Character> charSet =
      new TreeSet<Character>(inputString.toCharacterArray());
Similarly, the second line has some ugliness to it as well:
  1. java.util.Collection<E> defines the methods:
    • <T> T[] toArray(T[] a)
    • Object toArray()
    We have to use the first method if we want to get back a Character[], but this means we need to create the array manually before calling the method.
  2. toChars() to convert the Character[] into a char[] for the String constructor.
It would be nicer if the Object toArray() method had been redefined as E[] toArray(). I can't see why this change wasn't made with the release of generics, since E[] is still a valid Object[], and the other method got redefined this way when generics came out. I'd also like to see String accept a Character[] in a constructor, in the same way I'm looking for a String.toCharacterArray() method. These changes would make the second line look like:
  return new String(charSet.toArray());
So my fictional extensions would turn the three methods above (toCharacters(), toChars() and the mess in sortedWithoutDuplicates()) into the following single method:
  public static String sortedWithoutDuplicates(String inputString) {
    SortedSet<Character> charSet =
        new TreeSet<Character>(inputString.toCharacterArray());
    return new String(charSet.toArray());
  }
As a postscript, I should explicitly point out that I don't think that the constructor for TreeSet<Character> should take a String, nor should a String constructor accept a SortedSet<Character>. These operations are too specific to the described task, and do not merit inclusion in a standard library. In particular, they would only apply if the wrapped type of the SortedSet is Character, and would break down on other types.

Why would I make suggestions about modifications to the standard libraries? While we have to work with them, we shouldn't just lie down and accept the shortcomings of Java. It doesn't happen all the time, but Sun do listen to suggestions on the language and libraries, particularly when suggestions extend the standard libraries without breaking them. Also, if they ever release Java as open source software, then we will have an opportunity to make contributions which may even get accepted.

Saturday, January 14, 2006

Universal Binaries
Now that Macs are finally available with Intel chips, it occurred to me to give serious consideration to Universal Binaries. I never considered it a big deal, since Xcode handles that for you. But all my recent binary code (non-Java) has been built with Makefiles so it can be compiled for multiple platforms. So I thought I'd better learn how to do it from the command line.

My first thought was that gcc might have some sort of "Mac-universal" option for the -arch flag. The man page says:

  -arch arch
      Compile for the specified target architecture arch. The allowable
      values are i386, ppc and ppc64. Multiple options work, and direct
      the compiler to produce ``universal'' binaries including object
      code for each architecture specified with -arch. This option only
      works if assembler and libraries are available for each architecture
      specified. (APPLE ONLY)
I naively thought that -arch i386 -arch ppc would work. However, the result was just a PPC file. The man page for ld had the answer for this:
  The link editor accepts ``universal'' (multiple-architecture) input
  files, but always creates a ``thin'' (single-architecture), standard
  Mach-O output file.
  ...
  Only one -arch arch_type can be specified.
So how were universal binaries made if the linker can't do it? Again, the man page for ld provided the answer:
  The compiler driver cc(1) handles creating universal executables by
  calling ld(1) multiple times and using lipo(1) to create a
  ``universal'' file from the results of the ld(1) executions.
So it looks like the build process starts in the same way that it always did, only now it repeats the process for the other architecture, and merges the final executables. This means that intermediate files either need to be in separate directories, or else they need to have different names (which doesn't sound like a good idea to me). It also includes the new lipo tool for merging. (I can't think of what this might be short for. I keep thinking "liposuction".)

I tried using this technique to build a universal binary for a "Hello World" program, but kept getting missing symbols for the linker stage on the i386 architecture. A look at the compiler log for an XCode project showed an option of -isysroot /Developer/SDKs/MacOSX10.4u.sdk. This isn't mentioned in the manual for gcc or ld, but it does the trick.

To illustrate the process, a simple program might be built from two files: first.c and second.c
  gcc src/first.c -c -o obj/first.o
  gcc src/second.c -c -o obj/second.o
  gcc obj/first.o obj/second.o -o bin/program
(I've used gcc to handle linker arguments automatically for me.)

The same build for a universal binary would look like:
  gcc src/first.c -c -arch ppc -o obj/ppc/first.o
  gcc src/second.c -c -arch ppc -o obj/ppc/second.o
  gcc obj/ppc/first.o obj/ppc/second.o -o bin/ppc/program
  gcc src/first.c -c -arch i386 -o obj/i386/first.o
  gcc src/second.c -c -arch i386 -o obj/i386/second.o
  gcc obj/i386/first.o obj/i386/second.o -arch i386 -isysroot /Developer/SDKs/MacOSX10.4u.sdk -o bin/i386/program
  lipo -create bin/ppc/program bin/i386/program -output bin/program
lipo can also be used with the -info flag to check a file type:
  $ lipo -info bin/ppc/program
  Non-fat file: bin/ppc/program is architecture: ppc
  $ lipo -info bin/i386/program
  Non-fat file: bin/i386/program is architecture: i386
  $ lipo -info bin/program
  Architectures in the fat file: bin/program are: ppc i386
This is all fine for Apples, but how should I write a portable Makefile that does all of this? Looking at a current Makefile, I really want to define the architecture in variables, and call into the same Makefile twice, followed by the lipo command to merge the results. However, Makefiles are not as easy to program as most languages. The biggest problem for me is how to call into another Makefile regardless of which target you are currently making. A test for "Mac OS X" at the top of the Makefile will not work, as commands can't be placed outside of a "target" block.

The only way I know to do this is to test for "Mac OS X" in every target, and duplicate the compilations. This would be tedious to write, be prone to errors, and be awkward to update (particularly for non-Mac users). I suppose I need to learn a lot more about Makefiles, but I'd rather not have to.

This could be easier in Ant. After all, the code I'm writing is actually a JNI library. Maybe I should write a gcc task?

Monday, January 09, 2006

CAS
The interfaces in UIMA have seemed rather opaque recently.

To process a document, an analysis module is given a CAS (Common Analysis System), and returns a CAS. The CAS contains a reference to the original document, and any annotations that have been made so far. A module can ask the CAS for the original document, perform its analysis, and add any annotations back into the CAS. At the end of the process, the UIMA framework returns a CAS object which can be checked for annotation properties.

Adding annotations to a CAS is an easy process for a module. An appropriate Annotation derivative is created for the current CAS (the CAS object is passed as a parameter to the Annotation's constructor). It is also easy to read these annotations after UIMA has finished. However, the CAS objects in each case are accessed through different classes, with different features. In an analysis module I have a JCas object, but the results of the UIMA run are returned in a TCAS object instead. TCAS is actually just a specialization of the standard CAS object for handling text, but I didn't see that immediately (I should have paid closer attention).
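
From a module's point of view the pattern looks something like this (a rough sketch based on the description above; "MyAnnotation" stands in for whatever annotation type your descriptor defines, and exact package names and signatures vary between UIMA versions):
  public void process(JCas jcas) {
    String document = jcas.getDocumentText();

    // Suppose the analysis found something in the first 10 characters.
    MyAnnotation note = new MyAnnotation(jcas);   // the CAS goes to the constructor
    note.setBegin(0);
    note.setEnd(Math.min(10, document.length()));
    note.addToIndexes();                          // makes the annotation visible in the CAS
  }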

Fortunately, a JCas object has a method called getCAS(), and a CAS has a getJCas() method, as these objects have a 1:1 correspondence. Why have two objects? I think it's to separate out the CAS concepts from the Java-specific information needed to manipulate a CAS. But don't quote me.

Separating these two classes and providing different types in different circumstances has tripped me up a bit. Maybe someone can explain the reasoning, but I've just found it annoying.

Speculation
The blogosphere has started picking up some of the hassles between Kowari and NGC, so I figured that it's my turn to comment.

Last year, NGC asked for the 1.2 release to be pushed back to the 9th of January. Since Kowari is an Open Source project, NGC could not force this postponement, even though they now own the copyright for the core of the code, and have initial contributor status (as defined in the MPL). But Open Source is new for NGC, and we wanted to support them supporting us, so we accepted the delay.

Then last week an NGC lawyer asked David to push it back further, citing "irreparable harm" to their company if the release proceeded as scheduled. To me, this speaks of a misunderstanding of the entire Open Source development process. The code and documentation are already out there for anyone to see. Just do a cvs update from Sourceforge. A "release" in this context simply means a marker on the code to indicate a certain level of features and stability. (We did a little more as developers, due to the code freeze, testing, and so on, but that doesn't influence users directly.)

With NGC giving inappropriate directives to David, and given the work he has done with them, he decided that his best course of action was to leave the Kowari project.

While I don't think that NGC are acting to squash the Kowari project, they have put some serious blocks in the path of administering it. This has stemmed from their not understanding Open Source development, nor the MPL licence that Kowari has been released under. This is understandable, given their corporate/defence background, but it is still frustrating.

Since NGC are not aware of how an Open Source project works, it is conceivable that they would try to interfere with users' ability to continue to use and extend Kowari. Fortunately, the MPL offers protection for everyone here. Any challenge to a user's rights under this licence would bring a number of large corporations to the defence of the MPL, among them AOL and IBM.

Similarly, the MPL protects developers. We still have copyright on personal contributions (which excludes those contributions made while working for the Initial Developer). Those bits can be used elsewhere. There are also rights to do other things like "forking" the project. Personally, I'd rather not do anything like that, as it means abandoning the reputation that Kowari has built up over time, and antagonising NGC in the process.

The situation is annoying for developers (and untenable for David), but it shouldn't mean big problems for users. Kowari, or something like it, should continue. In the worst case, a snapshot of the currently accessible source (January 10, 2006) will always be usable by the community.

I haven't done any Kowari work in the last month, and probably won't do much until I get to Chicago next month. After that, I plan to continue developing for Kowari, or a project like it.