Thursday, January 19, 2006

Unicode
In my last post I was converting arrays of char into arrays of java.lang.Character. However, under Java 1.5 this runs into a new problem.

In Java 1.4, the Character class was based on Unicode 3.0, which defined characters in a 16-bit space. In Java 1.5 this has been revised to Unicode 4.0, which defines characters in a 21-bit space. Unfortunately, the 1.4 implementation of Character wrapped the native 16-bit char type, and is not extensible enough to deal with Unicode 4.0.

Instead of a new class being created to deal with this, Character has gained a new set of static methods for working with the larger characters. The result is a little messy.

Character now works with 32-bit integers, which it refers to as CodePoints. Most of the static methods that take a char have been duplicated to accept an int as a CodePoint. This all works from the perspective of the static methods, but there is no longer a class that represents a complete Unicode character. The closest available is java.lang.Integer, which is fine for storing the value, but has no support for character operations.
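For example, the same test now exists in both forms, with the new int overload handling values beyond 16 bits (U+10400, a Deseret letter, is used here as a character that needs 21 bits):

    public class CodePointOverloads {
        public static void main(String[] args) {
            char c = 'A';            // fits in a single char
            int deseret = 0x10400;   // DESERET CAPITAL LETTER LONG I: beyond 16 bits

            System.out.println(Character.isLetter(c));        // true (char overload)
            System.out.println(Character.isLetter(deseret));  // true (int overload)
            System.out.println(Character.toLowerCase(deseret) == 0x10428);  // true
        }
    }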

I first realized this problem when I tried to convert the result of String.toCharArray() into an array of Character. The new Character class has a static method, codePointAt, for reading a Unicode character from a char[] that holds characters in UTF-16 format. This made me wonder if the char[] from String.toCharArray() was actually being returned in UTF-16. This isn't explained explicitly, but the documentation for String indicates that the result is in UTF-16.
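This is easy to check, since Character.codePointAt will read a complete character straight out of the array that toCharArray() returns:

    public class CodePointAtDemo {
        public static void main(String[] args) {
            // 'a', then U+10400 encoded as a surrogate pair, then 'b'
            char[] chars = "a\uD801\uDC00b".toCharArray();

            int first = Character.codePointAt(chars, 0);   // 0x61 ('a'), one char wide
            int second = Character.codePointAt(chars, 1);  // 0x10400, read from two chars
            System.out.println(Integer.toHexString(first));   // 61
            System.out.println(Integer.toHexString(second));  // 10400
            System.out.println(Character.charCount(second));  // 2: a surrogate pair
        }
    }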

UTF-16 refers to an array of char where most Unicode characters (or CodePoints) fit into a single char or Character. However, if a char falls into the high-surrogate range (U+D800 to U+DBFF), then it is not a complete Unicode character on its own, but is instead merged with the following char (a low surrogate, in the range U+DC00 to U+DFFF) to form a full 21-bit value. A pair of characters like this is called a surrogate pair.
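Character provides tests for each half of a pair, plus a method to merge the two halves into a single value:

    public class SurrogateDemo {
        public static void main(String[] args) {
            char high = '\uD801';  // high surrogate: U+D800 to U+DBFF
            char low = '\uDC00';   // low surrogate: U+DC00 to U+DFFF

            if (Character.isHighSurrogate(high) && Character.isLowSurrogate(low)) {
                int codePoint = Character.toCodePoint(high, low);
                System.out.println(Integer.toHexString(codePoint));  // 10400
            }
        }
    }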

Since an individual char is no longer guaranteed to represent a complete CodePoint, it is not possible to sort an array of char or Character as previously described. Instead, each character has to be decoded into a CodePoint first, and the array of CodePoints can then be sorted. Only there is no such class as CodePoint, so an array of Integer has to be used instead.
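So the sort ends up looking something like this sketch, with each CodePoint boxed into an Integer:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    public class SortByCodePoint {
        public static void main(String[] args) {
            String s = "b\uD801\uDC00a";  // 'b', U+10400, 'a'

            // Decode the UTF-16 chars into CodePoints, stepping over surrogate pairs.
            List<Integer> codePoints = new ArrayList<Integer>();
            for (int i = 0; i < s.length(); ) {
                int cp = s.codePointAt(i);
                codePoints.add(Integer.valueOf(cp));
                i += Character.charCount(cp);  // 2 for a surrogate pair, else 1
            }

            Collections.sort(codePoints);    // compares full 21-bit values
            System.out.println(codePoints);  // [97, 98, 66560]
        }
    }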

But what about the method to convert a char[] to a Character[]? This now needs to be updated to convert a UTF-16 char[] into an array of objects which are large enough to hold full CodePoints. There also needs to be a corresponding method to convert back to a UTF-16 char[]. With all this work, why not create a new class, CodePoint, which includes these methods, plus all the methods appropriate to manipulating CodePoints?

Again, all the iterative code can be encapsulated in methods to hide the details, but now these methods can be attached to the object in question. This time the code is more complex, as it needs to test each char to see if it is part of a surrogate pair.
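In outline, the two conversions look roughly like this sketch (the method names here are illustrative, not final):

    public class CodePoint {
        private final int value;

        public CodePoint(int value) { this.value = value; }

        /** Decode a UTF-16 char[] into an array of complete CodePoints. */
        public static CodePoint[] toCodePointArray(char[] chars) {
            int count = Character.codePointCount(chars, 0, chars.length);
            CodePoint[] result = new CodePoint[count];
            int index = 0;
            for (int c = 0; c < count; c++) {
                int cp = Character.codePointAt(chars, index);
                result[c] = new CodePoint(cp);
                index += Character.charCount(cp);  // skip both halves of a pair
            }
            return result;
        }

        /** Encode back to UTF-16, re-introducing surrogate pairs where needed. */
        public static char[] toCharArray(CodePoint[] codePoints) {
            StringBuilder buffer = new StringBuilder();
            for (int c = 0; c < codePoints.length; c++) {
                buffer.appendCodePoint(codePoints[c].value);
            }
            return buffer.toString().toCharArray();
        }
    }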

I've coded most of the CodePoint class, but I still have to bring over a few of the static methods from Character and make them instance methods on CodePoint. There are a LOT of methods, so I don't see myself writing tests for all of them. Maybe I'll have a go if other people are interested.

This may be overkill, given that only a few languages need characters which fall outside the 16-bit range and so require surrogate pairs. All Latin-based languages are completely covered by char and Character. However, all it would take would be a single extended character to break an application which didn't handle all CodePoints. We've had comments about appropriate Unicode support for Chinese characters in Kowari before, so it's a legitimate concern.

2 comments:

Burak Emir said...

Hi Paul, I found this very useful when trying to fix Scala's JSON parser for JDK 1.4 - keep it up. Publish that code point class!

Paula said...

OK, it's up here.