Friday, July 30, 2004

I still didn't get to testing the N3 import code today.

TJ has been concerned about the time taken to return large amounts of data over RMI. He has wanted to compress this data for some time, and so he started doing so today.

When the code first failed for him I came in to have a look, and discovered that he hadn't been looking at the right file. I offered a few points of advice, but then ended up staying for a bit longer. That bit longer stretched out until mid-afternoon, when TJ had a conference call, at which point I took over trying to get compression working.

Working from some standard examples, TJ was trying to compress all communication with an RMI object by making the object an extension of UnicastRemoteObject and providing his own socket factories in the constructor. This worked well, so long as he didn't try to wrap the data streams with a compression object.
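TJ's factory code isn't shown here, so the following is only a sketch of the general approach, with invented class names. It also uses the Java 7 syncFlush constructor (not available at the time) to sidestep the flushing problem described further down.

```java
import java.io.*;
import java.net.*;
import java.rmi.server.*;
import java.util.zip.*;

public class CompressedRmi {

    // Wrap raw socket streams in compression streams. syncFlush=true
    // (available from Java 7 on) makes flush() emit the deflater's
    // buffered data rather than just flushing the underlying stream.
    public static OutputStream wrapOut(OutputStream raw) {
        return new DeflaterOutputStream(raw, new Deflater(), 512, true);
    }

    public static InputStream wrapIn(InputStream raw) {
        return new InflaterInputStream(raw);
    }

    // Client-side factory: every connection to the server is compressed.
    public static class ClientFactory
            implements RMIClientSocketFactory, Serializable {
        public Socket createSocket(String host, int port) throws IOException {
            return new Socket(host, port) {
                public InputStream getInputStream() throws IOException {
                    return wrapIn(super.getInputStream());
                }
                public OutputStream getOutputStream() throws IOException {
                    return wrapOut(super.getOutputStream());
                }
            };
        }
    }

    // Server-side factory mirrors the client, using implAccept() so that
    // the accepted socket is our stream-wrapping subclass.
    public static class ServerFactory
            implements RMIServerSocketFactory, Serializable {
        public ServerSocket createServerSocket(int port) throws IOException {
            return new ServerSocket(port) {
                public Socket accept() throws IOException {
                    Socket s = new Socket() {
                        public InputStream getInputStream() throws IOException {
                            return wrapIn(super.getInputStream());
                        }
                        public OutputStream getOutputStream() throws IOException {
                            return wrapOut(super.getOutputStream());
                        }
                    };
                    implAccept(s);
                    return s;
                }
            };
        }
    }
}
```

The remote object then passes both factories up in its constructor: `super(0, new ClientFactory(), new ServerFactory());`.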

To start with, TJ had tried to use ZipInputStream and ZipOutputStream, but these are designed for zip archive compression, and the decompression call immediately complained about the item number. This was resolved by moving to the general-purpose compression streams, DeflaterOutputStream and InflaterInputStream.

As soon as the correct compression streams were in place, everything came to a halt. This was happening because the server was not sending back its response. I realised that this was due to buffering on the compression stream, and that it just needed to be flushed to push the data through. However, RMI normally works across a socket which doesn't need to be flushed, so it never bothers to flush.

After thinking about it, I decided that there was really only one time that data really needed to be compressed, and that was in the return of AnswerPageImpl objects across the network. If these could be serialized to a compressed form then most of the network traffic would be reduced. So I moved on to implementing readObject and writeObject. TJ pointed out that profiling had already shown significant time being spent in reflection during serialization, so I pulled back and implemented readExternal and writeExternal from the Externalizable interface instead. This is because a lot of class metadata is avoided when using Externalization instead of Serialization.
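A minimal sketch of the Externalizable shape this implies; the class contents here are invented, since AnswerPageImpl's actual fields aren't shown.

```java
import java.io.*;

// Hypothetical stand-in for AnswerPageImpl: the real class and its fields
// are not shown in this post.
public class AnswerPage implements Externalizable {
    private String[] rows;

    public AnswerPage() { }                   // public no-arg ctor required
    public AnswerPage(String[] rows) { this.rows = rows; }

    // writeExternal/readExternal replace the reflective machinery: we
    // write exactly the fields we need, and the stream carries no field
    // descriptors for the class.
    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeInt(rows.length);
        for (String r : rows) out.writeUTF(r);
    }

    public void readExternal(ObjectInput in)
            throws IOException, ClassNotFoundException {
        rows = new String[in.readInt()];
        for (int i = 0; i < rows.length; i++) rows[i] = in.readUTF();
    }

    public String[] rows() { return rows; }
}
```

Note that Externalizable requires the public no-arg constructor: deserialization creates the instance with it before calling readExternal.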

Now that I was writing to the compression stream myself, I was able to flush the data stream, but again the server would just sit there and not respond to the client. I talked with DM about this, and tried a few other things, before it occurred to me to close the stream instead of flushing it. This worked correctly. So now I know that flushing a DeflaterOutputStream achieves nothing: it won't send its data until the stream is closed.
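This behaviour is easy to reproduce outside RMI. A minimal sketch (note that Java 7 later added a syncFlush constructor flag that makes flush() genuinely flush; the default constructor used here still behaves as described):

```java
import java.io.*;
import java.util.zip.*;

// Shows that flush() on a default DeflaterOutputStream does not push the
// compressed data out; only close(), which finishes the deflater, does.
public class FlushDemo {
    public static int[] flushThenClose() throws IOException {
        byte[] payload = new byte[4096];          // zeros compress well
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        DeflaterOutputStream def = new DeflaterOutputStream(sink);

        def.write(payload);
        def.flush();                  // deflater keeps its buffered input
        int afterFlush = sink.size();

        def.close();                  // finishes the deflater: data lands
        int afterClose = sink.size();
        return new int[] { afterFlush, afterClose };
    }
}
```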

Compression Performance
Once compression on the returned data was working correctly TJ and I could start timing the responses. TJ loaded WordNet into the database, and performed a query which returned all the data to the client. The queries went over a 100Mb network, through 2 switches.

With compression we were able to return the data in about 1 minute 40 seconds (1:40), with a compression ratio of about 80%.

Without compression we retrieved the same data in 1:13. Something was obviously wrong here. The compression/decompression was being run on 2.8GHz P4 processors, and they should not have even blinked at this kind of operation.

This led to me checking all sorts of things, ranging from compression ratios and times of execution, through to buffer sizes and compression levels.

Double Buffering
I had initially created an object writing stream with the following code:

  ByteArrayOutputStream data = new ByteArrayOutputStream();
  ObjectOutputStream objectData = new ObjectOutputStream(new DeflaterOutputStream(data));
  ... // write to objectData
I could then use data.toByteArray() to get the resulting bytes. However, this wouldn't tell me how much compression we were achieving. So I changed it to look like this:
  ByteArrayOutputStream data = new ByteArrayOutputStream();
  ObjectOutputStream objectData = new ObjectOutputStream(data);
  ... // write to objectData
  ByteArrayOutputStream zData = new ByteArrayOutputStream();
  DeflaterOutputStream deflater = new DeflaterOutputStream(zData);
  deflater.write(data.toByteArray());  // compress the serialized form in one hit
  deflater.close();
The result of this was that data.toByteArray() could provide the raw, uncompressed data, while zData.toByteArray() provided the compressed data.

While the presence of both sets of bytes allowed me to obtain more information about the compression rates we'd achieved, I assumed that there was some cost, since an extra byte buffer was being unnecessarily created. I assumed incorrectly. When I compared the times of both methods, the code with the two ByteArrayOutputStream objects was consistently 3 seconds faster (over the ~1:40 time period). Having both objects available is handy, especially for metrics, so I was more than happy to get this result.

Why did the use of a second buffer speed things up? I can only guess that writing the entire data stream in one hit was faster than compressing it by parts, as the first method does.
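Put together, the two-buffer approach can be sketched end-to-end, with the ratio calculation added for the metrics mentioned above (the method name is mine, not the project's):

```java
import java.io.*;
import java.util.zip.*;

// Serialize first, then compress the finished byte array in one hit,
// keeping both forms so the compression ratio can be reported.
public class TwoBufferCompression {
    public static double ratio(Object payload) throws IOException {
        ByteArrayOutputStream data = new ByteArrayOutputStream();
        ObjectOutputStream objectData = new ObjectOutputStream(data);
        objectData.writeObject(payload);
        objectData.close();

        ByteArrayOutputStream zData = new ByteArrayOutputStream();
        DeflaterOutputStream deflater = new DeflaterOutputStream(zData);
        deflater.write(data.toByteArray());
        deflater.close();   // close(), not flush(): finishes the deflater

        // Fraction of the original size saved by compression.
        return 1.0 - (double) zData.size() / data.size();
    }
}
```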

Bad Compression
Out of frustration with the longer times taken to return compressed data, I started playing with the compression level and buffer size.

Looking at the DeflaterOutputStream code, I discovered that the default buffer size is 512 bytes. I was not able to get any real performance change out of varying this size.

The other thing to try was the compression level. Neither the documentation nor the provided source shows what the default level is. So I did some comparisons with compression set to 1 (minimal) and 0 (no compression). The results here surprised me.

Minimal compression still gave me a compression ratio of about 80%. This could mean that the default level is 1, or close to it.

Even more surprising, a compression level of 0 took almost exactly the same length of time to execute as a compression level of 1, i.e. ~1:40. This was surprising, as level 0 should have taken no time at all, and yet the code was still over 20 seconds slower than with no compression stream in the path. So there is significant overhead simply in calling the compression library; the compression work itself was not what was taking the time.
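The level experiment can be sketched like this (the method name is mine, not the project's). A Deflater constructed with level 0 still calls into the compression library, but emits "stored" blocks slightly larger than the input:

```java
import java.io.*;
import java.util.zip.*;

public class LevelExperiment {
    // Compress with an explicit level; 0 means no compression (stored
    // blocks), 1 is minimal, 9 is maximal.
    public static byte[] compress(byte[] input, int level) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DeflaterOutputStream def =
                new DeflaterOutputStream(out, new Deflater(level));
        def.write(input);
        def.close();
        return out.toByteArray();
    }
}
```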

So where is the overhead in compression coming from?

The Deflater classes use JNI calls to do the actual compression (even when the compression level is 0). So the problem has to lie either in crossing the JNI boundary, or in the compression code itself.

Java relies pretty heavily on JNI, and it is normally quite efficient. I've certainly never seen it behaving slowly in the past. That leads me to think that the compression code is at fault, possibly in scanning the data to be compressed (which would happen even with a compression level of 0). DM pointed out that Sun ignores the system zlib libraries on *nix platforms and ships its own implementation instead. Given the performance I've seen here, I'm guessing they have done a terrible job of it.

While it offers no benefit yet, compression may still be worthwhile across a slower network. I've introduced a system property for the compression level, and use it to enable compression when the level is above 0. If the level is undefined, or defined as 0, then the uncompressed byte array is used as the serialized data, no compression is performed, and there is no performance penalty.
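A sketch of how such a property switch might look. The property name here is invented, since the post doesn't give the real one:

```java
import java.io.*;
import java.util.zip.*;

public class PageSerializer {
    // Compress the serialized bytes only if a positive compression level
    // has been configured; otherwise hand back the raw array untouched.
    // "answerpage.compression.level" is a hypothetical property name.
    public static byte[] maybeCompress(byte[] raw) throws IOException {
        int level = Integer.getInteger("answerpage.compression.level", 0);
        if (level <= 0) {
            return raw;                   // no compression, no penalty
        }
        ByteArrayOutputStream zData = new ByteArrayOutputStream();
        DeflaterOutputStream def =
                new DeflaterOutputStream(zData, new Deflater(level));
        def.write(raw);
        def.close();
        return zData.toByteArray();
    }
}
```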

Fortunately, having the uncompressed byte array hanging around was speeding things up, so there was no performance penalty in using it.

For the moment, the system property is necessarily defined at the server side. This is because it is the server which sends the pages back over the wire, and it is at this point that the compression is performed. If compression is to be used seriously in the future then the client should be the one to set the level. This means that client requests for pages should include the desired compression level as a parameter. This is easy, but unnecessary for the moment.

Future Speedups
There are two ways I can speed up returning pages like this.

The first is to use a decent compression algorithm. The ones that come with Java 1.4 are obviously too inefficient to be of use. Most implementations are unlikely to be blindingly fast unless they are done in native code, in which case there is the major problem of portability.

On the other hand, I may be too quick to assume that a pure Java compression would not be as fast. This should be the subject for some future study.

The second option is to improve serialization of the AnswerPageImpl class. Profiling has shown that at least a third of the time taken to return a query is spent performing reflection to serialize this code. Since we know exactly the kind of data that can be found in a page, it would be much more efficient to write only what we need to the data stream. This would save on both the amount of data sent, and the reflection needed to find this data. It is this approach which I'll be taking in the near future.
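As a rough illustration of the metadata saving (class and field names invented, not AnswerPageImpl's real contents), default reflective serialization of a small class can be compared with an Externalizable twin that writes only its fields:

```java
import java.io.*;

public class SizeComparison {
    // Default serialization: the stream carries field names and type
    // signatures for every field, discovered by reflection.
    public static class ReflectivePage implements Serializable {
        String subject = "s", predicate = "p", object = "o";
    }

    // Externalizable twin: we write exactly the values, nothing more.
    public static class ExplicitPage implements Externalizable {
        String subject = "s", predicate = "p", object = "o";
        public ExplicitPage() { }
        public void writeExternal(ObjectOutput out) throws IOException {
            out.writeUTF(subject);
            out.writeUTF(predicate);
            out.writeUTF(object);
        }
        public void readExternal(ObjectInput in)
                throws IOException, ClassNotFoundException {
            subject = in.readUTF();
            predicate = in.readUTF();
            object = in.readUTF();
        }
    }

    public static int serializedSize(Object o) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(buf);
        out.writeObject(o);
        out.close();
        return buf.size();
    }
}
```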
