Friday, April 30, 2010

Topaz and Ehcache


Don't ask what I did 2 days ago, because I forget. It's one of the reasons I need to blog more. I also forgot because my brain got clouded due to yesterday's tasks.

I didn't get a lot done yesterday for the simple reason that I was filling in paperwork for an immigration lawyer. For anyone who has ever had to do this mind numbing task, they probably know that you end up filling in mostly the same things that you filled in 12 months ago, but just subtly different, so there's no possibility of using copy/paste. They will also know that getting all of the information together can take half a day. Strangely, the hardest piece of information was my mother's birthday (why do they want this? I have no idea). There is a bizarre story behind this, that I won't go into right now, but with my mother asleep in Australia there was no way to ask her. Fortunately, I had two brothers online at the time: one lives here in the USA, and the other is a student in Australia (who was up late drinking with friends, and decided to get online to say hi before going to bed). Unfortunately, neither of them new either (the well-oiled brother being completely unaware of why we didn't know).

But I finally got it done, cleared my immediate email queue (only 65 to go!) and got down to work.

My first task was to get the Topaz version of Mulgara up and running the same way it used to run 10 months ago. I had already tried going back through the subversion history for the project (ah, that's one of the things I did two days ago!), but with no success. However, I had been able to find out that others have had this error with Ehcache. No one had a fix, since upgrading the library normally made their problem go away. Well I tried upgrading it myself, but without luck. Evidently the problem was in usage, but I didn't know if it was a problem in the code talking to Ehcache, or the XML configuration file that is uses. Since everything used to work without error, I figured that the code was probably OK, and that it was the configuration at fault. The complexity of the configuration file only deepened my suspicion.

I didn't want to learn the ins-and-outs of an Ehcache configuration, so my first non-lawyer related task yesterday was to look at the code where the exception was coming from (thank goodness the Java compiler includes line numbers in class files by default). So it turned out that Terracotta (the company who provides Ehcache) have a nice navigable HTML versions of all their opensource code, which made this task much more pleasant than having to get it all from Subversion. This led me to the line that was throwing the exception, which looked like:
List localCachePeers = cacheManager.getCachePeerListener("RMI").getBoundCachePeers();
Great, a compound statement. OK, so I use them myself, but they're annoying when you debug. Was it cacheManager that was null or was it the return value from getCachePeerListener("RMI")?

At this point I jumped around in the code for a bit (I quite like those hyperlinks. I've seen them before too. I should figure out which project creates these pages), looking for what initialized cacheManager. I didn't find definitive proof that it was set, but it looked pretty good. So I looked at getCachePeerListener("RMI") and discovered that it was a lookup in a Hashmap. This is a prime candidate for returning null, and indeed the documentation for the method even states that it will return null if the scheme is not configured. Since the heartbeat code was making the presumption that it could perform an operation on the return value of this method, then the "RMI" scheme is evidently supposed to be configured in every configuration. The fact that it's possible for this method to return null (even if it's not supposed to) means that the calling code is not defensive enough (any kind of NullPointerException is unacceptable, even if you catch it and log it). Also, the fact that something is always supposed to be configured for "RMI" had me looking in the code to discover where listeners get registered. This turned out to come from some kind of configuration object, which looked like it had been built from an XML file.

So the problem appears to be the combination of something that's missing from the configuration file, and a presumption that it will be there (i.e. the code couldn't handle it if the item was missing). At this point I joined a forum and described the issue, both to point out that the code should be more defensive, and also to ask what is missing. In the meantime, I tried creating my own version of the library with a fix in it, and discovered that the issue did indeed go away. Then this morning I received a message explaining what I needed to configure, and also that the code now deals with the missing configuration. It still complains on every heartbeat (in 5 second intervals), but now it tells you what's wrong, and how to fix it:
WARNING: The RMICacheManagerPeerListener is missing. You need to configure
a cacheManagerPeerListenerFactory with
class="net.sf.ehcache.distribution.RMICacheManagerPeerListenerFactory"
in ehcache.xml.
Kudos to "gluck" for the quick response. (Hey, I just realized – "gluck" is from Brisbane. My home town!)

Incidentally, creating my own version of Ehcache was problematic in itself. It's a Maven project, and when I tried to build "package" it attempted to run all the tests, which took well over an hour. Coincidentally, it also happened to be dinner time, so I came back later, only to discover that not all of the tests had passed, and that the JAR files had not been built. Admittedly, it was an older release, but it was a release, so I found this odd. In the end, I avoided the tests by removing the code, and running the "package" target again.

With all the errors out of the way I went back to the Topaz system again and run it. As I said earlier, it was no longer reporting errors. But then when I tried to use queries against it, it was completely unresponsive. A little probing found that it wasn't listening for HTTP at all, so I checked the log, and sure enough:
EmbeddedMulgaraServer> Unable to start web services due to: null [Continuing]
Argh.

Not only do I have to figure out what's going on here, it also appears that someone (possibly me) didn't code this configuration defensively enough! Sigh.

At that point it was after dinner, and I had technical reading to do for a job I might have. Well, I've received the offer, but it all depends on me not being kicked out of the country.

1 comment:

Dragisa Krsmanovic said...

The cache invalidator in Topaz forces clients to use matching version of Ehcache. Otherwise you might end up with serialization errors due to class version mismatch. Any upgrade to Ehcache needs to happen both on Ambra and Topaz.

Makes me think that cache invalidator shouldn't be under Topaz since it's something that is specific to a particular implementation of the client.