In my last post I was converting arrays of char into arrays of java.lang.Character. However, under Java 1.5 this runs into a new problem.
In Java 1.4, the Character class was based on Unicode 3.0, in which every assigned character fits in a 16 bit space. In Java 1.5 this has been revised to Unicode 4.0, which assigns characters across a 21 bit space. Unfortunately, the 1.4 implementation of Character wrapped the 16 bit char primitive type, and is not extensible enough to deal with Unicode 4.0.
Instead of creating a new class to deal with this, Character has gained a new set of static methods to work with the larger characters. The result is a little messy.
Character now works with 32 bit integers, which it refers to as CodePoints. Most of the static methods which take a char have been duplicated to accept an int as a CodePoint. This all works from the perspective of the static methods, but there is no longer a class that represents a complete Unicode character. The closest available is java.lang.Integer, which is fine for storing the value, but has no support for character operations.
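As a small illustration of the int-based static methods, here is a minimal sketch. The class name CodePointStatics and the choice of U+1D50A as a sample character are my own assumptions for the example; the Character methods themselves are the ones added in Java 1.5.

```java
// Minimal sketch of the int-based (code point) static methods on Character.
// CodePointStatics and SAMPLE are hypothetical names used only for illustration.
class CodePointStatics {
    // U+1D50A MATHEMATICAL FRAKTUR CAPITAL G: a letter outside the 16 bit range.
    static final int SAMPLE = 0x1D50A;

    // True if the code point needs two chars when encoded as UTF-16.
    static boolean needsTwoChars(int codePoint) {
        return Character.charCount(codePoint) == 2;
    }

    public static void main(String[] args) {
        // The int overloads accept the full code point; a char could not hold it.
        System.out.println(Character.isLetter(SAMPLE));                 // true
        System.out.println(Character.isSupplementaryCodePoint(SAMPLE)); // true
        System.out.println(needsTwoChars(SAMPLE));                      // true

        // Integer can store the value, but offers no character operations itself.
        Integer boxed = Integer.valueOf(SAMPLE);
        System.out.println(Character.isLetter(boxed.intValue()));       // true
    }
}
```

Every character operation on the boxed value has to go back through a static call on Character, which is the awkwardness described above.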
I first realized this problem when I tried to convert the result of String.toCharArray() into an array of Character. The new Character class has a static method for reading a Unicode character from a char[], which holds characters in UTF-16 format. This made me wonder whether the array returned by String.toCharArray() was actually in UTF-16. This isn't explained explicitly for the method, but the documentation for String indicates that the result is in UTF-16.
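A quick way to check this is to build a String containing one character outside the 16 bit range and compare the two lengths. This is only a sketch; the class name Utf16Check and the sample text are assumptions made for the example.

```java
// Sketch: String.toCharArray() returns UTF-16 code units, not whole characters.
// Utf16Check and TEXT are hypothetical names used only for illustration.
class Utf16Check {
    // "A" followed by U+1D50A, written as its two UTF-16 code units.
    static final String TEXT = "A\uD835\uDD0A";

    // Number of 16 bit chars in the string.
    static int charLength(String s) {
        return s.toCharArray().length;
    }

    // Number of complete Unicode code points in the string.
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println(charLength(TEXT));      // 3
        System.out.println(codePointLength(TEXT)); // 2
        // Reading from index 1 of the char[] consumes both halves of the pair.
        int cp = Character.codePointAt(TEXT.toCharArray(), 1);
        System.out.println(Integer.toHexString(cp)); // 1d50a
    }
}
```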
UTF-16 refers to an array of char where most Unicode characters (or CodePoints) fit into a single char. However, if a char falls into a particular range (0xD800 to 0xDFFF), then it is not a complete Unicode character on its own, but instead it is combined with the following char to form a full 21-bit value. A pair of chars like this is called a surrogate pair.
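The way the two halves combine can be written out by hand. The sketch below follows the standard UTF-16 arithmetic: the first (high) surrogate lies in 0xD800 to 0xDBFF and carries the top 10 bits, the second (low) surrogate lies in 0xDC00 to 0xDFFF and carries the bottom 10 bits, and 0x10000 is added to the result. The class and method names are my own; in practice Character.toCodePoint does the same job.

```java
// Sketch of the surrogate pair arithmetic; SurrogateMath is a hypothetical name.
class SurrogateMath {
    // Combine a high and a low surrogate into one code point.
    // Assumes the caller has already checked that the two chars form a valid pair.
    static int combine(char high, char low) {
        int top10 = high - 0xD800;   // 0 .. 0x3FF
        int bottom10 = low - 0xDC00; // 0 .. 0x3FF
        return 0x10000 + (top10 << 10) + bottom10;
    }

    public static void main(String[] args) {
        char high = '\uD835';
        char low = '\uDD0A';
        // 0x10000 + (0x35 << 10) + 0x10A = 0x1D50A
        System.out.println(Integer.toHexString(combine(high, low)));          // 1d50a
        // The library method gives the same answer.
        System.out.println(combine(high, low) == Character.toCodePoint(high, low)); // true
    }
}
```

The largest value this can produce is 0x10000 + 0xFFFFF = 0x10FFFF, which is why the result needs 21 bits.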
Since an individual char is no longer guaranteed to represent a complete CodePoint, it is not possible to sort an array of Character as previously described. Instead, the chars have to be decoded into CodePoints first, and the array of CodePoints can then be sorted. However, there is no CodePoint class, so an array of Integer has to be used instead.
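A minimal sketch of that decode-then-sort step, using Integer as the stand-in for a code point. The class and method names here are assumptions for the example, not code from the earlier post.

```java
import java.util.Arrays;

// Sketch: decode UTF-16 chars into code points, then sort the code points.
// CodePointSort and its method names are hypothetical.
class CodePointSort {
    // Decode a UTF-16 char[] into one Integer per code point.
    static Integer[] toCodePoints(char[] utf16) {
        int count = Character.codePointCount(utf16, 0, utf16.length);
        Integer[] result = new Integer[count];
        int out = 0;
        for (int i = 0; i < utf16.length; ) {
            int cp = Character.codePointAt(utf16, i);
            result[out++] = Integer.valueOf(cp);
            i += Character.charCount(cp); // skip 2 chars for a surrogate pair
        }
        return result;
    }

    // Sort the code points of a string into ascending order.
    static Integer[] sortedCodePoints(String s) {
        Integer[] codePoints = toCodePoints(s.toCharArray());
        Arrays.sort(codePoints);
        return codePoints;
    }

    public static void main(String[] args) {
        // U+1D50A (as a surrogate pair), U+FF21 and 'A'.
        for (Integer cp : sortedCodePoints("\uD835\uDD0A\uFF21A")) {
            System.out.println(Integer.toHexString(cp.intValue()));
        }
    }
}
```

Sorting the raw chars of the same string would place 0xD835 and 0xDD0A before 0xFF21, even though U+1D50A is the larger code point, so the decode step changes the ordering as well as keeping each pair together.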
But what about the method to convert a char[] to a Character[]? This now needs to be updated to convert a UTF-16 char[] into an array of objects which are large enough to handle full CodePoints. There also needs to be a corresponding method to convert back to a UTF-16 char[]. With all this work, why not create a new class for CodePoint, which includes these methods, plus all the methods appropriate to manipulating CodePoints?
Again, all the iterative code can be encapsulated in methods to hide the details, but now these methods can be attached to the object in question. This time the code is more complex, as it needs to test each char to see if it is part of a surrogate pair.
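To give a feel for the shape of such a class, here is a rough sketch. It is purely illustrative: the method names (fromChars, toChars, intValue, isLetter) and the handling of unpaired surrogates are assumptions made for this example, and it should not be read as the actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a CodePoint wrapper; all names are assumptions.
final class CodePoint implements Comparable<CodePoint> {
    private final int value;

    CodePoint(int value) {
        if (!Character.isValidCodePoint(value)) {
            throw new IllegalArgumentException("Not a Unicode code point: " + value);
        }
        this.value = value;
    }

    int intValue() {
        return value;
    }

    // Example of a static method from Character recast as an instance method.
    boolean isLetter() {
        return Character.isLetter(value);
    }

    // Decode UTF-16, testing each char to see whether it starts a surrogate pair.
    static CodePoint[] fromChars(char[] utf16) {
        List<CodePoint> points = new ArrayList<CodePoint>();
        int i = 0;
        while (i < utf16.length) {
            char c = utf16[i];
            if (Character.isHighSurrogate(c) && i + 1 < utf16.length
                    && Character.isLowSurrogate(utf16[i + 1])) {
                points.add(new CodePoint(Character.toCodePoint(c, utf16[i + 1])));
                i += 2;
            } else {
                // Assumption: an unpaired surrogate is kept as its own value.
                points.add(new CodePoint(c));
                i += 1;
            }
        }
        return points.toArray(new CodePoint[points.size()]);
    }

    // Encode back to UTF-16; supplementary code points become surrogate pairs.
    static char[] toChars(CodePoint[] points) {
        StringBuilder sb = new StringBuilder();
        for (CodePoint p : points) {
            sb.appendCodePoint(p.value);
        }
        return sb.toString().toCharArray();
    }

    public int compareTo(CodePoint other) {
        return value < other.value ? -1 : (value == other.value ? 0 : 1);
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof CodePoint && ((CodePoint) o).value == value;
    }

    @Override
    public int hashCode() {
        return value;
    }
}
```

Because the sketch implements Comparable, an array of these objects can be passed straight to Arrays.sort, and a round trip through fromChars and toChars returns the original char[] for well-formed input.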
I've coded most of the CodePoint class, but I still have to bring over a few of the static methods from Character and make them instance methods on CodePoint. There are a LOT of methods, so I don't see myself writing tests for all of them. Maybe I'll have a go if other people are interested.
This may be overkill, given that only a few languages need characters which require surrogate pairs. All Latin based languages are completely covered by Character. However, it would only take a single extended character to break an application which didn't handle all CodePoints. We've had comments about appropriate Unicode support for Chinese characters in Kowari before, so it's a legitimate concern.
Thursday, January 19, 2006