Unicode
In my last post I was converting arrays of char into arrays of java.lang.Character. However, under Java 1.5 this runs into a new problem.
In Java 1.4, the Character class was based on Unicode 3.0, which defined Unicode characters in a 16 bit space. In Java 1.5 this has been revised to Unicode 4.0, which defines Unicode in a 21 bit space. Unfortunately, the 1.4 implementation of Character wrapped the 16 bit char primitive type, and is not extensible enough to deal with Unicode 4.0.
Instead of creating a new class to deal with this, Character has gained a new set of static methods to work with the larger characters. The result is a little messy. Character now works with 32 bit integers, which it refers to as CodePoints. Most of the static methods which take a char have been duplicated to accept an int as a CodePoint. This all works from the perspective of the static methods, but there is no longer a class that represents a complete Unicode character. The closest available is java.lang.Integer, which is fine for storing the value, but has no support for character operations.
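To make the duplication concrete, here is a minimal sketch using a few of the int-based methods that Character has had since Java 1.5. The class name OverloadDemo, the helper encode and the sample character U+10400 are my own choices for illustration, not anything from the JDK or from the earlier post.

```java
final class OverloadDemo {
    // U+10400 lies outside the 16 bit range, so it fits in an int but not a char.
    static final int DESERET_LONG_I = 0x10400;

    // Encodes one code point as UTF-16: one char, or two for a supplementary character.
    static char[] encode(int codePoint) {
        return Character.toChars(codePoint);
    }

    public static void main(String[] args) {
        // The original char-based method still works for 16 bit values.
        System.out.println(Character.isLetter('a'));
        // The int overload accepts any code point, including supplementary ones.
        System.out.println(Character.isLetter(DESERET_LONG_I));
        // An Integer can hold the value, but every operation goes through Character.
        Integer stored = Integer.valueOf(DESERET_LONG_I);
        System.out.println(Character.isSupplementaryCodePoint(stored.intValue()));
        System.out.println(encode(stored.intValue()).length);
    }
}
```

Note that the Integer carries no character behaviour of its own: each question about the character has to be routed back through a static method on Character.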
I first realized this problem when I tried to convert the result of String.toCharArray() into an array of Character. The new Character class has a static method for reading a Unicode character from a char[], which holds characters in UTF-16 format. This made me wonder if the char[] from String.toCharArray() was actually being returned in UTF-16. This isn't stated explicitly for that method, but the documentation for String indicates that the result is in UTF-16.
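A quick way to check this is to build a String from a known surrogate pair and look at what comes back. The sketch below assumes the static method in question is Character.codePointAt(char[], int), which is one of the 1.5 additions. The class name ToCharArrayDemo and the helper firstCodePoint are made up for the example.

```java
final class ToCharArrayDemo {
    // Reads the code point that starts at index 0 of the string's UTF-16 char[].
    static int firstCodePoint(String s) {
        char[] utf16 = s.toCharArray();
        return Character.codePointAt(utf16, 0);
    }

    public static void main(String[] args) {
        // U+1D11E (MUSICAL SYMBOL G CLEF) written as its two UTF-16 code units.
        String clef = "\uD834\uDD1E";
        System.out.println(clef.toCharArray().length);                  // 2
        System.out.println(clef.codePointCount(0, clef.length()));      // 1
        System.out.println(Integer.toHexString(firstCodePoint(clef)));  // 1d11e
    }
}
```

The array has two elements for a single character, which is what you would expect if toCharArray() hands back raw UTF-16.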
UTF-16 refers to an array of char where most Unicode characters (or CodePoints) fit into a single char or Character. However, if a character falls into a particular range, then it is not a complete Unicode character on its own, but instead it is merged with the following char to form a full 21-bit value. A pair of characters like this is called a surrogate pair.
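The merging itself is simple arithmetic. As a rough sketch: the first char of a pair (the high surrogate, 0xD800 to 0xDBFF) and the second (the low surrogate, 0xDC00 to 0xDFFF) each carry 10 bits, and the combined 20 bits are offset by 0x10000. The class name SurrogateDemo and the method combine are hypothetical. Character.toCodePoint does the same job and is used here only as a cross-check.

```java
final class SurrogateDemo {
    // Combines a high and a low surrogate by hand into a single code point.
    static int combine(char high, char low) {
        if (!Character.isHighSurrogate(high) || !Character.isLowSurrogate(low)) {
            throw new IllegalArgumentException("not a surrogate pair");
        }
        // Each surrogate contributes 10 bits; the result starts at 0x10000.
        return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
    }

    public static void main(String[] args) {
        char high = '\uD834';
        char low = '\uDD1E';
        System.out.println(Integer.toHexString(combine(high, low)));                 // 1d11e
        System.out.println(combine(high, low) == Character.toCodePoint(high, low));  // true
    }
}
```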
Since an individual char is no longer guaranteed to represent a complete CodePoint, it is not possible to sort an array of char or Character as previously described. Instead, each character has to be decoded into a CodePoint first, and the array of CodePoints can then be sorted. Only there is no such object as a CodePoint, so an array of Integer has to be used instead.
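Here is a minimal sketch of that decode-then-sort approach, assuming plain numeric ordering of code points is what is wanted. The class CodePointSort and its methods are illustrative names only, not the code from the earlier post.

```java
import java.util.Arrays;

final class CodePointSort {
    // Decodes a UTF-16 string into boxed code points, one Integer per character.
    static Integer[] decode(String s) {
        Integer[] result = new Integer[s.codePointCount(0, s.length())];
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            result[n++] = cp;
            i += Character.charCount(cp); // advance by 1 or 2 chars
        }
        return result;
    }

    // Sorts the characters of a string by code point value and re-encodes them.
    static String sortByCodePoint(String s) {
        Integer[] codePoints = decode(s);
        Arrays.sort(codePoints);
        StringBuilder sb = new StringBuilder(s.length());
        for (Integer cp : codePoints) {
            sb.appendCodePoint(cp.intValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String input = "b\uD834\uDD1Ea";
        System.out.println(sortByCodePoint(input).equals("ab\uD834\uDD1E")); // true
    }
}
```

Sorting the raw char[] of a string that holds two or more supplementary characters would instead group all the high surrogates ahead of all the low ones, leaving no valid pairs behind.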
But what about the method to convert a char[] to a Character[]? This now needs to be updated to convert a UTF-16 char[] into an array of objects which are large enough to handle full CodePoints. There also needs to be a corresponding method to convert back to a UTF-16 char[]. With all this work, why not create a new class for CodePoint, which includes these methods, plus all the methods appropriate to manipulating CodePoints?
Again, all the iterative code can be encapsulated in methods to hide the details, but now these methods can be attached to the object in question. This time the code is more complex, as it needs to test each character to see if it is part of a surrogate.
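To give a feel for the shape of such a class, here is a cut-down sketch. It is not the actual CodePoint class discussed in this post: the method names fromCharArray, toCharArray, intValue and isLetter are assumptions made for illustration, and equals and hashCode are left out for brevity.

```java
final class CodePoint implements Comparable<CodePoint> {
    private final int value;

    CodePoint(int value) {
        if (!Character.isValidCodePoint(value)) {
            throw new IllegalArgumentException("not a code point: " + value);
        }
        this.value = value;
    }

    int intValue() {
        return value;
    }

    // Instance version of the static Character.isLetter(int).
    boolean isLetter() {
        return Character.isLetter(value);
    }

    public int compareTo(CodePoint other) {
        return value < other.value ? -1 : (value == other.value ? 0 : 1);
    }

    // Decodes a UTF-16 char[] into one CodePoint per character, merging surrogate pairs.
    static CodePoint[] fromCharArray(char[] utf16) {
        CodePoint[] result = new CodePoint[Character.codePointCount(utf16, 0, utf16.length)];
        int n = 0;
        for (int i = 0; i < utf16.length; ) {
            int cp = Character.codePointAt(utf16, i);
            result[n++] = new CodePoint(cp);
            i += Character.charCount(cp); // 2 if a surrogate pair was consumed
        }
        return result;
    }

    // Encodes CodePoints back into a UTF-16 char[].
    static char[] toCharArray(CodePoint[] codePoints) {
        StringBuilder sb = new StringBuilder(codePoints.length);
        for (CodePoint cp : codePoints) {
            sb.appendCodePoint(cp.value);
        }
        return sb.toString().toCharArray();
    }

    public static void main(String[] args) {
        char[] original = "a\uD834\uDD1Eb".toCharArray();
        CodePoint[] decoded = fromCharArray(original);
        System.out.println(decoded.length);                                           // 3
        System.out.println(java.util.Arrays.equals(original, toCharArray(decoded)));  // true
    }
}
```

Because the class implements Comparable, an array of these objects can be handed straight to Arrays.sort, which is the whole point of having a real object per character.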
I've coded most of the CodePoint class, but I still have to bring over a few of the static methods from Character and make them instance methods on CodePoint. There are a LOT of methods, so I don't see myself writing tests for all of them. Maybe I'll have a go if other people are interested.
This may be overkill, given that only a few languages need characters which fall into the surrogate range. All Latin-based languages are completely covered by char and Character. However, all it would take is a single extended character to break an application which didn't handle all CodePoints. We've had comments about appropriate Unicode support for Chinese characters in Kowari before, so it's a legitimate concern.
Thursday, January 19, 2006
2 comments:
Hi Paul, I found this very useful when trying to fix Scala's JSON parser for JDK 1.4 - keep it up. Publish that code point class!
OK, it's up here.