Tuesday, December 13, 2005

Catch Up
I've been wanting to write for nearly a week now. Every time I've tried to sit down to do it, a work task, family needs, or packing has taken priority over this blog. I ended up having to write little notes to myself to remind me of what I wanted to blog about.

JNI and Linux
Having made the Link library work on Mac OS X using JNI, I figured it would be easy to get it working on Linux as well. Unfortunately, it didn't work out that way.

To start with, I got an error from the JVM saying that it could not find a symbol called "main" when loading the library. This sounded a little like dlopen loading an incorrectly linked file. I'm guessing that dlopen found what it thought was an executable, and therefore expected to see a main function. Googling confirmed this, but didn't really help me work out the appropriate linker flags to fix it.

I had compiled the modules for the library using -fPIC (position-independent code). I then used -Wl,-shared to make gcc pass a -shared flag through to the linker, in order to link the modules into a shared library. However, it turned out that I needed to give -shared directly to gcc instead. I have yet to work out exactly what the difference is, but that's not a big priority for me at the moment, since I have it working. According to DavidM there is something in the gcc man page about this, so at least I know where to look.
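For reference, the two build steps look roughly like this. The module below is a made-up stand-in (link_stub.c and link_stub_add are hypothetical names, not part of the Link library); the gcc flags in the comment are the ones described above.

```c
/* link_stub.c: hypothetical one-function module standing in for the
 * Link library sources. The file and function names are invented for
 * illustration only.
 *
 * The build that worked:
 *
 *   gcc -fPIC -c link_stub.c -o link_stub.o
 *   gcc -shared -o liblink.so link_stub.o
 *
 * The link step that produced the "main" error passed the flag to the
 * linker only:
 *
 *   gcc -Wl,-shared -o liblink.so link_stub.o
 */
int link_stub_add(int a, int b)
{
    return a + b;
}
```

One plausible reading, which I have not confirmed, is that without -shared the gcc driver still treats the link as an executable and pulls in its usual startup code, which refers to main.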

After linking correctly, the test code promptly gave a HotSpot error, due to a SIGSEGV. This meant that there was a problem with the C code. This had me a little confused, as it had run perfectly on OS X. Compiling everything in C and putting it all in a single executable demonstrated that the code worked fine on Linux, so I started suspecting that the problem might be across the JNI interface. This ended up being wrong. :-)

There are not many differences between the two systems, with the exception of the endianness of the CPUs. However, after looking at the problem carefully, I could not see how this could be the cause.

The initial error included the following stack trace:

C [libc.so.6+0xb1960]
C [libc.so.6+0xb4fcb] regexec+0x5b
C [libc.so.6+0xd0a98] advance+0x48
C [liblink.so+0x19f9d] read_dictionary+0x29
C [liblink.so+0x1d705]
C [liblink.so+0x1d914] dictionary_create+0x19
C [liblink.so+0x286c9] Java_com_link_Dictionary_create+0xc1
The only code I had real control of was in Java_com_link_Dictionary_create, dictionary_create and read_dictionary. I started by looking in Java_com_link_Dictionary_create and printing the arguments, but everything looked fine. So then I went to the other end and looked in read_dictionary.

I was a little curious about how read_dictionary was calling advance, as I hadn't heard of this function before. Then I discovered that the function being called was from the Link library, and has a signature of advance(Dictionary). This didn't really make sense, as my reading of the stack trace above said that advance came from libc and not the Link library (liblink). This should have told me exactly what was happening, but instead I tried to justify what I was seeing. I convinced myself that the function name at the end of each line described the function that had called into that stack frame. In hindsight, it was a silly bit of reasoning. I was probably just tired.

So to track the problem down I started putting printf() statements through the code. The first thing that happened was that the HotSpot errors changed, making the error appear a little later during execution. So that meant I had a stack smash. Obviously, one of the printf() invocations was leaving a parameter on the stack that helped the above stack trace avoid the SIGSEGV. OK, so now I was getting some more info on the problem.

It all came together when I discovered that I was seeing output from just before read_dictionary() called advance(), and from just after it, but not from any of the code inside the advance() function. At that point I realised that the above stack trace didn't need a strange interpretation, and that the advance() that I was calling was coming from libc and not the local library.

Unfortunately, doing a "man advance" on my Linux system showed up nothing. Was I wrong about this function? I decided to go straight to the source, and did a "nm -D /lib/libc.so.6 | grep advance". Sure enough, I found the following:
  000b9220 W advance
So what was this function? Obviously something internal to libc. I could download the source, but that wasn't going to make a difference to the problem or the solution. I just had to avoid calling it.

My first approach was to rename the function inside Link to advance_dict(). This worked perfectly, and showed that I'd found the problem. However, when the modules were all linked into a single executable it had all worked correctly, and had picked up the local function rather than the one found in libc. Why didn't the shared library do the same?

I decided that if I gave the compiler a hint that the function was local, then maybe that would be picked up by the linker. So rather than renaming the function to advance_dict(), I changed its declaration from:
  int advance(Dictionary dict)
to:
  static int advance(Dictionary dict)
I didn't know that this would work, but it seemed reasonable, and certainly cleaner since it's always a bad idea to presume that your name is unique (as demonstrated already). Fortunately, this solution worked just fine.
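To make the fix concrete, here is a minimal sketch of the pattern. The Dictionary type, the body of advance() and the count_chars() caller are all hypothetical stand-ins of mine (the real Link code is not shown in this post); only the static declaration itself matches what I changed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the Link library's Dictionary type. */
typedef struct dictionary_s {
    const char *input;  /* text being read */
    size_t pos;         /* index of the current character */
} *Dictionary;

/* 'static' gives advance() internal linkage: the name is not exported
 * from the object file, and calls made from this file are bound to
 * this definition instead of being looked up by name at load time. */
static int advance(Dictionary dict)
{
    assert(dict != NULL);
    if (dict->input[dict->pos] == '\0') {
        return 0;       /* nothing left to read */
    }
    dict->pos++;
    return 1;
}

/* Hypothetical caller in the same file, playing the role that
 * read_dictionary() plays in the stack trace. */
int count_chars(Dictionary dict)
{
    int n = 0;
    while (advance(dict)) {
        n++;
    }
    return n;
}
```

Because advance() is only visible inside this one file, any other module that needs it has to go through a non-static wrapper such as count_chars(). That restriction was fine in my case, but it is the trade-off compared with simply renaming the function.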

DavidM explained to me that static makes a symbol local to a compilation unit (which I knew) and was effectively a separate namespace (which I also knew). He also explained that this "namespace" has the highest priority... which I didn't know, but had suspected. So I learned something new. David and I also learned that libc on Linux contains a symbol called advance for which I could find no documentation. This is worth noting, given how common a name that is. As shown here, it is likely to cause problems for any shared library that wants to use that name.

There's more to write, but it's late, so I'll leave it for the morning.
