I've been wanting to write for nearly a week now. Every time I've tried to sit down to it, a work task, family needs, or packing has taken priority over this blog. I ended up having to write little notes to myself to remind me of what I wanted to blog about.
JNI and Linux
Having made the Link library work on Mac OSX using JNI, I figured it would be easy to get it working on Linux as well. Unfortunately it didn't work out that way.
To start with, I got an error from the JVM saying that it could not find a symbol called "main" when loading the library. This sounded a little like dlopen loading an incorrectly linked file. I'm guessing that the dlopen procedure found what it thought was an executable, and therefore expected to see a main method. Googling confirmed this, but didn't really help me work out the appropriate flags for linking to fix this.
I had compiled the modules for the library using -fPIC (position independent code). I then used a -Wl,-shared flag to tell gcc to pass a -shared flag to the linker, in order to link the modules into a shared library. However, it turned out that I really needed to just use -shared directly on gcc. I've still to work out what the exact difference is, but that's not a big priority for me at the moment, since I have it working. According to DavidM there is something in the gcc man page about this, so at least I know where to look.
After linking correctly, the test code promptly gave a Hotspot error, due to a sigsegv. This meant that there was a problem with the C code. This had me a little confused, as it had run perfectly on OSX. Compiling everything in C and putting it all in a single executable demonstrated that the code worked fine on Linux, so I started suspecting that the problem might be across the JNI interface. This ended up being wrong. :-)
There are not many differences between the two systems, with the exception of the endianness of the CPUs. However, after looking at the problem carefully, I could not see this being the problem.
The initial error included the following stack trace:
C [libc.so.6+0xb4fcb] regexec+0x5b
C [libc.so.6+0xd0a98] advance+0x48
C [liblink.so+0x19f9d] read_dictionary+0x29
C [liblink.so+0x1d914] dictionary_create+0x19
C [liblink.so+0x286c9] Java_com_link_Dictionary_create+0xc1
The only code I had real control of was in read_dictionary. I started by looking in Java_com_link_Dictionary_create and printing the arguments, but everything looked fine. So then I went to the other end and looked in read_dictionary.
I was a little curious about advance, as I hadn't heard of this function before. Then I discovered that the function being called was from the Link library, and has a signature of advance(Dictionary). This didn't really make sense, as my reading of the stack trace above said that advance came from libc and not the Link library (liblink). This should have told me exactly what was happening, but instead I tried to justify what I was seeing. I convinced myself that the function name at the end of each line described the function that had called into that stack frame. In hindsight, it was a silly bit of reasoning. I was probably just tired.
So to track the problem down I started putting printf() statements through the code. The first thing that happened was that the Hotspot errors changed, making the error appear a little later during execution. So that meant I had a stack smash. Obviously, one of the printf() invocations was leaving a parameter on the stack that helped the above stack trace avoid the sigsegv. OK, so now I'm getting some more info on the problem.
It all came together when I discovered that I was seeing output from just before advance(), and from just after it, but not from any of the code inside the advance() function. At that point I realised that the above stack trace didn't need a strange interpretation, and that the advance() that I was calling was coming from libc and not the local library.
Unfortunately, doing a "man advance" on my Linux system showed up nothing. Was I wrong about this method? I decided to go straight to the source, and did a "nm -D /lib/libc.so.6 | grep advance". Sure enough, I found the following:
000b9220 W advance
So what was this function? Obviously something internal to libc. I could download the source, but that wasn't going to make a difference to the problem or the solution. I just had to avoid calling it.
My first approach was to change the function inside Link to advance_dict(). This worked perfectly, and showed that I'd found the problem. However, when the modules were all linked into a single executable it had all worked correctly, and had picked up the local function, rather than the one found in libc. Why didn't the shared library do the same?
I decided that if I gave the compiler a hint that the method was local, then maybe that would be picked up by the linker. So rather than renaming the function to advance_dict(), I changed its signature from:
int advance(Dictionary dict)
to:
static int advance(Dictionary dict)
I didn't know that this would work, but it seemed reasonable, and certainly cleaner since it's always a bad idea to presume that your name is unique (as demonstrated already). Fortunately, this solution worked just fine.
DavidM explained to me that static makes a symbol local to a compilation unit (which I knew) and is effectively a separate namespace (which I also knew). He also explained that this "namespace" has the highest priority... which I didn't know, but had suspected. So I learned something new. David and I also learned that libc on Linux has an undocumented symbol in it called advance. This is worth noting, given how common a name that is. As shown here, it is likely to cause problems for any shared library that might want to use that name.
There's more to write, but it's late, so I'll leave it for the morning.