Saturday, July 08, 2006

Forking Code

Of course, I can't really discuss work online, which has been a good reason to keep quiet lately (since I'm working so much of the time). The other reason is that my spare time is kept occupied with Mulgara.

Anyone looking at Mulgara recently would have some idea of what I've been doing instead of blogging. The short story is that NGC have come to a legal agreement acknowledging that Kowari is open source, and that they should not have stopped progress in the project.

There are some who hope we can now abandon the fork, and go back to working on Kowari. In one sense, I'd certainly like that. Kowari has a name for itself, and it's a name I've worked with for years. It helps people who have a commitment to using Kowari, and for whom a name change would be an annoying distraction.

On the other hand, NGC didn't have a legal leg to stand on when they started all of this, and yet they demonstrated that you don't need one to cause problems. The problems have all come down to misunderstandings, but Open Source Software has always been difficult for NGC to understand, and it is always possible they will have difficulty understanding it in the future. I know they also offended some of the Kowari developers while mendaciously describing our work, and that never goes over well with people.

So while it would be a terrible shame to lose the name Kowari, there are enough people agreeing that there are enough reasons to go ahead with Mulgara. Besides, having started it up and got it running now (many thanks to Luigi and Herzum Software for the support), I feel I have some ownership over the process in the new project.

The other thing people might notice is that I've been working on code. Alas, there are no new features yet, but I'm getting close to a point where they will be forthcoming.

Roughly, the Mulgara work falls into 3 categories:
  • Administration for the project.
  • Forking the Kowari sources.
  • Upgrading Mulgara from Java 1.4 to Java 1.5.

Administration means learning about and setting up Mailman, learning how to configure and use Subversion, learning the latest changes in Apache configurations (I think I last configured Apache at version 1.3), and setting up HTTPS and WebDAV access. Yuck.

I still haven't worked out how to get WebDAV to force logins over HTTPS. I can enforce logins, and I can get it over HTTPS, but not both. Oh well. I guess it just means that administration has to be done by Jesse or myself for the time being.

Forking the Kowari sources has had its own difficulties. We just about had it done a few weeks ago, but NGC started insisting that certain recent code belonged to them and should not have gone into CVS at Sourceforge. They've since retracted this claim, but in the meantime we went back to the 1st of August last year and forked from that point. So we had to repeat a lot of the work we'd already done. Very annoying.

One benefit of going back to August has been that we may get to avoid a bug that we think has been introduced since then. A couple of users have reported difficulty loading large RDF/XML files. Since the parsing has all changed since August (not as a fix, but in an attempt to avoid using Jena), this could be the cause of the problem. It will be a good opportunity to check it out anyway.

A number of bugs were introduced when all the packages, files and variables were renamed in the fork. In the course of tracking down all these bugs, I've used the opportunity to move to Java 1.5. It just created a few more bugs to track down. Fortunately, I knew where to find most of these problems from earlier personal attempts to convert to Java 1.5.

Java 1.5

The most obvious issue in moving to Java 1.5 has been the new keyword "enum". I've noticed in several other places that "enum" is a popular name for packages and variables. Kowari was no exception. This was easily remedied. Unfortunately, Apache has a package called enum which is referenced by code generated by the WSDL compiler. The workaround is to compile just this generated source with the -source 1.4 flag.
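As a small illustration of the keyword clash, here is a sketch only. The class, type, and variable names are hypothetical and are not taken from Kowari or from the generated WSDL code. An identifier named "enum" has to be renamed, because the word now introduces a type declaration:

```java
import java.util.ArrayList;
import java.util.List;

public class EnumKeywordDemo {
    // Legal under -source 1.4, rejected from 1.5 onwards because
    // "enum" is now a reserved word:
    //     Enumeration enum = table.elements();
    // The usual remedy is simply to rename the identifier, e.g. to "en".

    // What the keyword is for in Java 1.5: a type-safe enumeration.
    // (Hypothetical example type, not part of any real API.)
    enum Access { READ, WRITE }

    // Collects the constant names in declaration order.
    static List<String> names() {
        List<String> result = new ArrayList<String>();
        for (Access a : Access.values()) {
            result.add(a.name());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(names());
    }
}
```

Renaming is not an option for a package that someone else owns, which is why the generated source needs the older source level instead.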

The other problem I encountered was in the change to Unicode 4.0.

I never gave a lot of thought to Unicode before Kowari was released as Open Source Software. Until that point I'd been able to get away with ASCII. After all, I don't know any other spoken languages (I really ought to do something about that) and all my work had been in English.

After Kowari had been out for a little while we started getting bug reports and fixes from someone in China complaining about how Kowari's strings were not handling certain characters properly. This woke me up to how international OSS is. I was used to working with people in the United States, but by being involved with something like this on the internet we were exposed to a whole new international community.

This has made me a little more sensitive to the need to support Unicode in all projects. OSS projects need it for the reasons I mentioned above, and commercial projects need it if they ever hope to succeed internationally (a real concern for me when my employer has offices in Italy, the UK, Turkey, and France).

Now 16 bit characters are all well and good, but the latest version of Unicode defines code points up to U+10FFFF, which need as many as 21 bits. These extra characters (called "supplementary" characters) are not used very much, but I've already learnt that we can't ignore the things we are ignorant of. Unfortunately, the previous mechanism for managing these characters would not work in Java 1.5.

Previously we had used a regular expression to look for a pair of "surrogate" characters, which together form a single supplementary Unicode character. Unfortunately, the character classes which used to be supported for this in Java 1.4 are no longer available. Some research showed that the regex library now wants to treat these characters as a single character (which is a definite improvement), but checking the documentation, and finally the Java source code, showed that there is no "character class" for referring to the supplementary character set.
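To make the "single character" behaviour concrete, here is a minimal sketch. The class and method names are my own and hypothetical. It takes U+10000, written as its surrogate pair, and shows that it occupies two Java chars but counts as one code point, and that the regex "." consumes the whole pair:

```java
import java.util.regex.Pattern;

public class SupplementaryDemo {
    // U+10000, written in source as its surrogate pair (D800 followed by DC00).
    static final String FIRST_SUPPLEMENTARY = "\uD800\uDC00";

    // Number of 16 bit char units in the string.
    static int charLength(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string (new in Java 1.5).
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    // True when the entire string is matched by a single "." in a regex.
    static boolean matchesSingleDot(String s) {
        return Pattern.matches(".", s);
    }

    public static void main(String[] args) {
        System.out.println(charLength(FIRST_SUPPLEMENTARY));
        System.out.println(codePointLength(FIRST_SUPPLEMENTARY));
        System.out.println(matchesSingleDot(FIRST_SUPPLEMENTARY));
    }
}
```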

I'm short on time here, but for anyone interested, the old regular expression for this was:

The correct approach now is to take the numeric range for the surrogates, and use them to build a class of Unicode literals which span this region (U+10000 to U+10FFFF):

This seems to be the worst of all worlds. We have to use a pair of 16 bit codes to create a single supplementary code point (using the surrogate arithmetic to calculate the final value, e.g. the pair U+D800, U+DC00 gives U+10000). However, we can no longer refer to these codes as a "pair of surrogates", but only as a single character (even though we DEFINE that character with a 2-part code). It might have been nice if Sun had created a character class like this for us. Something like:

Well, at least it now works in Java 1.5. It just doesn't work in Java 1.4.
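For anyone wanting to experiment, the surrogate arithmetic and a range-based class of the kind described above can be sketched as follows. This is a minimal illustration under my own assumptions, not the code that went into Mulgara. The class and method names are hypothetical, and the pattern is simply one way of spanning U+10000 to U+10FFFF using surrogate pair literals:

```java
import java.util.regex.Pattern;

public class SurrogateRange {
    // Surrogate arithmetic: each surrogate carries 10 bits of payload,
    // and the combined value is offset by 0x10000.
    static int toCodePoint(char high, char low) {
        return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
    }

    // A class whose lower bound is the pair D800 DC00 (U+10000) and
    // whose upper bound is the pair DBFF DFFF (U+10FFFF). The compiler
    // turns the escapes into chars, and the regex library then reads
    // each pair as one code point.
    static final Pattern SUPPLEMENTARY =
        Pattern.compile("[\uD800\uDC00-\uDBFF\uDFFF]");

    // True if the string contains at least one supplementary character.
    static boolean containsSupplementary(String s) {
        return SUPPLEMENTARY.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(toCodePoint('\uD800', '\uDC00')));
        System.out.println(containsSupplementary("a\uD800\uDC00b"));
        System.out.println(containsSupplementary("plain ASCII"));
    }
}
```

The hand-written arithmetic should agree with Character.toCodePoint, which arrived in Java 1.5, so in practice the library call is the one to use.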