Thursday, January 19, 2006

Loops Considered Harmful
Yesterday someone from work told me that he was trying to take all the characters in a java.lang.String, and return a string containing a single instance of each of those characters, in sorted order.

He had some straightforward code to do the work, but he had used a boolean array of ASCII bytes, and couldn't handle UniCode chars. In my cleverness, I wanted to show him how to do the work in 2 lines of code, only I immediately started running into limitations within Java.

Why two lines? Well it should be possible to put the contents of a string into an object that can sort them and remove duplicates, and then pull this data back out and put it into a new string. Learning Lisp I've learnt to expect the sort of functionality needed to do this simply and easily. There shouldn't be a need to iterate over the characters, recording the ones in use, and manually sorting. The data being manipulated should be a single "entity", rather than arrays of elements that require iterative techniques to manipulate. This functionality should all be available, making the final function a trivial call to just a few methods. As Abelson and Sussman said, "Programs must be written for people to read, and only incidentally for machines to execute." Iteration, sorting, and other mechanical operations should all be services found in lower levels of the code, and abstracted away.

Incidentally, this is exactly the sort of thing that is done all the time with the STL in C++. Because all of the collection classes (and arrays) meet the required interfaces, then almost any data type can be used in any position. All you need here are algorithms for sorting and removing duplicates, feeding in a string and getting one out the other side. Admittedly, C++ contains language features which allow significant abuse (and I've seen it abused quite badly), but it also allows for some very elegant code. Since C++/STL is an "unsafe" language offering some of the functionality you expect to see in Lisp, then this brings to mind the quote from Philip Greenspun: Any sufficiently complicated C or Fortran program contains an ad-hoc, informally-specified bug-ridden slow implementation of half of Common Lisp.

Back to Java, the obvious way to handle a problem like this is with a java.util.SortedSet (the only implementation of this in java.util.* is TreeSet). I just had to get the characters out of the string, insert them into the SortedSet and get them back out again, in String form. You'd think this would be easy, but I encountered several problems, showing up problems in the Java libraries, and even the Java language.

The String, the characters retrieved from it, and the sorted set of characters should all form "entities", which are simply different representations of exactly the same data. Since Java and its libraries expect each of these representations at different times, it should be easy and trivial to go from one representation to another. However, this isn't the case.

Strings can provide their contents as an array of char. However, these SortedSets need to take objects as their basic elements. (I understand why the language is built this way, but it still frustrates me that I can't declare a SortedSet<char>.) The object that wraps a char is java.lang.Character, so the array of chars will need to be converted to Characters before they could be put into a SortedSet. Also, since we want to treat the data as a single entity, we want to avoid iterating through the data, and instead insert the entire entity into the SortedSet. This means converting a char[] into a Character[].

Autoboxing
While Java 5 now supports autoboxing between native types and their associated objects, there is no way to convert an array of a native type into an array of associated objects. Since arrays are a part of the language, and not an accessible class, there is no way to extend arrays into doing this either. Instead, a loop must be written to iterate through one array, converting the elements to the types in the other array. This is an easy thing to do, but it's the kind of fundamental operation that should be found in a library, rather than being re-implemented by every developer.

So the first thing I had to do was write a method to autobox a char[] into a Character[] and another method to unbox it back again. These are quite trivial:

  private static final Character[] toCharacters(char[] chars) {
Character[] characters = new Character[chars.length];
for (int i = 0; i < chars.length; i++) characters[i] = chars[i];
return characters;
}

private static final char[] toChars(Character[] characters) {
char[] chars = new char[characters.length];
for (int i = 0; i < characters.length; i++) chars[i] = characters[i];
return chars;
}
So this will create an entity that holds all of the characters in java.lang.Character objects, rather than the char values that come from a String. Unfortunately, Java arrays are not compatible with the collections classes (unlike other languages, like Python, or C++ when using the STL), so the array needs further conversion still. This time there is a utility to convert an array into a collection-compatible object, called java.util.Arrays.asList(). So now I can finally put the string into a SortedSet:
SortedSet charSet = new TreeSet<Character>(
Arrays.asList(toCharacters(inputString.toCharArray())));
The java.util.Collection interface also specifies the toArray() method for converting the collection into an array, so it is possible to move back this way as well. However, the String class still needs to work with char rather than java.lang.Character, so the array unboxing method above is needed to convert this way:
public static String sortedWithoutDuplicates(String inputString) {
SortedSet<Character> charSet =
new TreeSet<Character>(
Arrays.asList(toCharacters(inputString.toCharArray())));
return new String(
toChars(charSet.toArray(new Character[charSet.size()])));
}
It's still disappointing how ugly this looks at the end. The first line has two more method calls than it should need:
  1. toCharacters() to convert the char[] from inputString.toCharArray() to a Character[].
  2. Arrays.asList() to convert the Character[] to a List<Character>
Instead, it would be nice to see constructors for collections accepting arrays. Since the language does not support array boxing/unboxing, then It would also be good to see a method like String.toCharacterArray(). With just those two changes the first line would instead be:
  SortedSet<Character> charSet =
new TreeSet<Character>(inputString.toCharacterArray());
Similarly, the second line has some ugliness to it as well:
  1. java.util.Collection<E> defines the methods:
    • <T> T[] toArray(T[] a)
    • Object toArray()
    We have to use the first method if we want to get back a Character[], but this means we need to create the array manually before calling the method.
  2. toChars() to convert the Character[] into a char[] for the String constructor.
It would be nicer if the Object toArray() method had be redefined into: E[] toArray(). I can't see why this change wasn't made with the release of generics, since E[] is still a valid Object[], and the other method got redefined this way when generics came out. I'd also like to see String accept a Character[] in a constructor, in the same way I'm looking for a String.toCharacterArray() method. These changes would make the second line look like:
  return new String(charSet.toArray());
So my fictional extensions would turn the three methods above (toCharacters(), toChars() and the mess in sortedWithoutDuplicates()) into the following single method:
public static String sortedWithoutDuplicates(String inputString) {
SortedSet<Character> charSet =
new TreeSet<Character>(inputString.toCharacterArray());
return new String(charSet.toArray());
}
As a postscript, I should explicitly point out that I don't think that the constructor for TreeSet<Character> should take a String, nor should a String constructor accept a SortedSet<Character>. These operations are too specific to the described task, and do not merit inclusion in a standard library. In particular, they would only apply if the wrapped type of the SortedSet is Character, and would break down on other types.

Why would I make suggestions about modifications to the standard libraries? While we have to work with them, we shouldn't just lie down and accept the shortcomings of Java. It doesn't happen all the time, but Sun do listen to suggestions on the language and libraries, particularly when suggestions extend the standard libraries without breaking them. Also, if they ever release Java as open source software, then we will have an opportunity to make contributions which may even get accepted.

15 comments:

Andrew said...

So here's my Java mad skills:
1/ String sorted = new String( Arrays.sort(unsorted.toCharArray()));


2/ To convert an array of primitive to the Object array just have a utility method in a loop:
java.lang.reflect.Array.set(destArray, count, Array.get(srcArray, count));

As they can be objects (and arrays are objects) there's no problem.

From what you've described there probably is some unicode issues with such a simple sort. Staying within objects seems to be a better way to do it...hide things behind some class instead of having to pass and convert.

Andrew said...

Sorry for number two the variable should be "index" not "count".

Quoll said...

1/ Actually I looked at doing this at one point, and it is certainly useful. But it doesn't remove duplicates. (Did you remember that requirement?)

2/ What's the difference between this code and saying:
destArray[index] = srcArray[index];
???
Both versions do boxing/unboxing, so what's the difference (except the reflection overhead you are incurring)? Not sure if I'm missing something here...

...and yes, there is an issue with sorting if you just convert a string into chars (or Characters). You have to convert it into CodePoint integers to reliably sort.

Andrew said...

BTW, I think I might have lost a post.

1/ The remove duplicates I did forget - I got all sure that I knew a small way to do it. Basically, to remove duplicates, using something like a set or a list, you need to stick with a primitive collection. That's where things like trove4j and PCJ come in.

I think autoboxing just confuses people into thinking Java can do more than it really can. Primitives and Objects are disjoint and to me should be treated that way i.e. moving from one to another is a conversion. I guess boxing/unboxing is just a little hack to reduce how much code you write at the sake of confusion (at least for me).

2/ The only advantage is that you can stick it in a method. I did a Google on it. Examples: arraycopy(int[], Integer[])
arraycopy(int[], int[])
arraycopy(Integer[], Integer[])
arraycopy(Integer[], int[])

Method:
public static void arraycopy(Object srcArray, Object destArray) {...}
tp://forum.java.sun.com/thread.jspa?threadID=532625&messageID=2571860

Andrew said...

That link:

http://forum.java.sun.com/thread.jspa?threadID=532625&messageID=2571860

Quoll said...

1/ I'm a big fan of Trove, though not everyone is happy to add it into their projects.

Notice that Trove works the way it does by defining every combination of primitive (and Object) for each container type? This is indicative of the language problems I'm talking about.

Coming from a background in C++, it would be great to see a merge (somehow) between Trove and generics, so it would be just as natural to say List<char> as it is to say List<Character>

You're right... Autoboxing is just syntactic sugar, and can lead to misconceptions (Java Puzzlers has a couple of good examples of this). It's nice though.

2/ OK, I can see that arraycopy lets me go from several methods (like getChars and getCharacters) down to a single method. Thanks.

On the negative side, it loses type safety and also requires casts at the calling end. I can live with casts (though they make things messy), but type safety really bothers me.

Andrew said...

There seems to be a JSR that's been around since 2002 for support of primitives in Generics here. From a person at Sun, "workaround : Subclass/write your own collection for every primitive type or live without them." Pretty sad.

Stephen Thorne said...

Being a python troll, I feel compelled to present a python version.

def sortedUnique(seq):
"""
@param seq: An arbitary sequence of elements.
@return: a list of all unique elements from the sequence, in sorted order.
"""
return sorted(set(seq))

assert sortedUnique('abc') == ['a','b','c']
assert sortedUnique('cba') == ['a','b','c']

Anonymous said...

http://jerub.livejournal.com/73671.html?thread=164807#t164807

That works too.

Peter Moran said...

And in ruby:
my_string.squeeze

Ruby gets more and more convincing when I see posts like yours - thanks.

Quoll said...

It's possible to write a squeeze method in any language. But is it needed? It seems that this is an infrequent enough operation that no one outside of Ruby saw fit to include an implementation.

No matter which language/library you choose, there will ALWAYS be an operation that is not implemented by default. The fact that Ruby has an implementation for this operation has no bearing on this argument.

The real question is whether the language makes it easy or difficult to write this function (or one like it). An important part of this consideration is if the library routines facilitate this easily.

My point with Java is that the language lets you implement operations like this reasonably well (others are a little better, but Java is not too bad), but the libraries are missing some fundamental operations.

Brad said...

A relatively old post but...

If folks want Ruby, Python, whatever, then use those dynamically typed languages. That has little to do with the static typed, primitives of Java. If you want a scripting language, use one based on JSR-223 -- including python and ruby.

I can't recall the last time I wrote code with arrays of primitives. Since the compiler takes care of much of the conversion, it isn't as heavy as it would appear to be.

It really is a non-issue.

Quoll said...

Yes, it really is an old post. I forgot all about it.

I'd been spending a bit of time learning more modern languages, and then having to come back to Java for my day job. Most other languages would have let you do this operation in just a couple of steps at a higher conceptual level (insert all characters of a string into a sorted set, and then get the elements of the set back into a string). Since I'd been looking at functional languages, I was translating the approach into a series of functions, which isn't necessarily the best approach with Java.

The "Java" approach may not even have been to use functions like this, but rather to use loops over an array, since after all, you need an array of char[] at the end. But that feels too much like C, and it was certainly the opposite of the style of programming I was trying to get into at the time. Supposedly, programmers should use the std libs to do a lot of the work for them, but the dichotomy between native types and their equivalent classes got in the way here.

At the end of the day, what I get out of this is that regardless of the approach you adopt, Java takes a lot of code to do some reasonably simple things. Almost every language that has become popular in the last decade (Python, Ruby, Groovy, Scala, Clojure) would have done this with a fraction of the code. That means that using another language would be faster to write, and less prone to bugs.

(Yes, I know that Python predates the "last decade", but it still got popular in that timeframe)

Javin @ FIX Protocol Tutorial said...

Hi,
Autoboxing is a great feature provided by JAVA5 but using it blindly may result in very subtle problem which will take hours and hours to
debug and find. to read more see the link
What is the problem while using '==' in autoboxing world in Java 5 ?

Thanks
Javin

Subham @ Collections Tutorial said...

SortedSet is an interface that implements Set interface , TreeMap is a great option to sort the elements , but the problem will be that it does not remove the duplicates . So instead of TreeMap you can try TreeSet , I think that will work fine , Suppose
TreeSet treesetobj = new TreeSet();
treesetobj.add("gearon");
treesetobj.add("blogspot");
treesetobj.add("com");
You can more about HashSet , TreeMap and TreeSet here
Java Collections Interview Questions and Answers