Thursday, September 29, 2005

Work
Since I'm working on commercial software now I'll be doing a lot of logging on the company's internal Wiki instead of here. I'll continue to talk about Kowari or study in here, but there are only so many hours in the day!

Remote Servers
As part of what I'm doing this week, I need to talk to a Wordnet RDF set. With my poor little notebook struggling on all the tasks I'm giving it, I figured that it made more sense to put the Kowari server on my desktop machine (named "chaos"). Unfortunately, I immediately hit a problem with RMI that had me stumped for some time today.

Starting Kowari on the desktop box worked fine. Querying it on that box also worked as expected. But as soon as I tried to access the server from my notebook I started getting errors. Here is what I got when I tried to create a model:

create <rmi://chaos/server1#wn>;
Could not create rmi://chaos/server1#wn
(org.kowari.server.rmi.RmiSessionFactory) couldn't create session factory for rmi://chaos/server1
Caused by: (ConnectException) Connection refused to host: 127.0.0.1; nested exception is:
java.net.ConnectException: Connection refused
Caused by: (ConnectException) Connection refused
My first response was confusion at the connection attempt to 127.0.0.1. Trying to be clear on this, I changed the request to talk directly to the IP address:
create <rmi://192.168.0.253/server1#wn>;
Could not create rmi://192.168.0.253/server1#wn
(org.kowari.server.rmi.RmiSessionFactory) couldn't create session factory for rmi://192.168.0.253/server1
Caused by: (ConnectException) Connection refused to host: 127.0.0.1; nested exception is:
java.net.ConnectException: Connection refused
Caused by: (ConnectException) Connection refused
I started to wonder if this was a problem with a recent change to Kowari's code (which was a scary prospect), and started looking more carefully at the code and the logged stack traces.

The clue came from the client trace:
Caused by: java.rmi.ConnectException: Connection refused to host: 127.0.0.1; nested exception is:
java.net.ConnectException: Connection refused
at sun.rmi.transport.tcp.TCPEndpoint.newSocket(TCPEndpoint.java:567)
at sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:185)
at sun.rmi.transport.tcp.TCPChannel.newConnection(TCPChannel.java:171)
at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:101)
at org.kowari.server.rmi.RemoteSessionFactoryImpl_Stub.getDefaultServerURI(Unknown Source)
at org.kowari.server.rmi.RmiSessionFactory.<init>(RmiSessionFactory.java:132)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:274)
at org.kowari.server.driver.SessionFactoryFinder.newSessionFactory(SessionFactoryFinder.java:188)
... 13 more
Caused by: java.net.ConnectException: Connection refused
So the client was trying to connect to the local system, which isn't running a Kowari instance, and the connection was refused. The error was occurring in the RmiSessionFactory constructor, but that code seemed OK, and the frames above and below it in the stack trace were all Sun code. So what was happening here?

The relevant code in the constructor looked like this:
Context rmiRegistryContext = new InitialContext(environment);
// Look up the session factory in the RMI registry
remoteSessionFactory =
    (RemoteSessionFactory) rmiRegistryContext.lookup(serverURI.getPath().substring(1));
URI remoteURI = remoteSessionFactory.getDefaultServerURI();
The failure happens on the last line here.

What is the process, and how is this failing? Well, it starts by looking up a name server to get an RMI registry context. The important thing to note here is that this works. Since the RMI registry is running on the server rather than the client, we know that it spoke to the remote machine and didn't try to use 127.0.0.1. So far, so good.

Next, it pulls apart the path from the server URI and looks for a service in the RMI registry with this name. In this case the name is the default "server1", and the service is a RemoteSessionFactory object. This also works.

The problem appears on the last line when it tries to access the object that it got from the registry. For some reason this object does not try to connect to the machine where the service is to be found, but instead tries to access the local machine. So somehow this object got misconfigured with the wrong IP address. How could that happen?

Since nothing had changed in how Kowari manages RMI, I started to look at my own configuration. Once I saw the problem, I realised how obvious it was. Isn't hindsight wonderful? :-)

Nameservers
Once upon a time I ran Linux full time on Chaos. This meant that I could run any kind of service that I wanted, with full time availability. One of those useful services was BIND, allowing me to have DHCP dynamically hand out IP addresses to any machine on my network, and address them all by name. Of course, BIND passed off any names it hadn't heard of to higher authorities.

However, obtuse hardware, Windows only software, and expensive VM software that suddenly stopped working one day (it died after the free support period ended, and no, I can't afford support), slowly took their toll. I finally succumbed and installed that other OS.

Once Chaos started rebooting, I could no longer rely on it for DHCP or BIND. DHCP was easily handled by my Snapgear firewall/router, but I was left without a local nameserver.

New computers to my network are usually visitors wanting to access the net. This doesn't require them to know my local machine names, nor do my other machines need to access them by name. So I figured I could just manually configure all of my local machines to know about each other and I'd be fine. This is where I came unstuck.

The problem was that I had the following line in /etc/hosts on Chaos:
127.0.0.1  localhost chaos
I thought this was OK, since it just said that if the machine saw its own name then it should use the loopback address. I've seen countless other computers also set up this way (back in the day when people still used host files). For anyone who doesn't know, 127.0.0.1 is called the "loopback address", and always refers to the local computer.

This confused RMI though. When a request came in for an object, the name service sent back a stub that was supposed to connect to a remote machine named "chaos". However, to prevent the stub from looking up the name server every time, it recorded the IP address of the server instead of the server's name. In this case the server looked in /etc/hosts and discovered that the IP for "chaos" was 127.0.0.1. The object stub then got transferred across the network to the client machine. Then when the client tried to use the stub, it attempted a connection to 127.0.0.1 instead of to the server.
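A quick way to check for this on the server is to ask the JVM how it resolves its own host name. As far as I understand it, RMI by default takes the stub's address from the same lookup that InetAddress.getLocalHost() performs. The following is only a sketch: the class name StubAddressCheck and the helper usableByRemoteClients are made up for illustration and are not part of Kowari or RMI.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class StubAddressCheck {

    // Hypothetical helper: a loopback address is useless to a client on
    // another machine, because it would connect back to itself.
    static boolean usableByRemoteClients(InetAddress addr) {
        return !addr.isLoopbackAddress();
    }

    public static void main(String[] args) throws UnknownHostException {
        // Resolves this machine's own name; on a typical Linux setup
        // /etc/hosts is consulted before any name server.
        InetAddress self = InetAddress.getLocalHost();
        String verdict = usableByRemoteClients(self)
            ? "ok for remote clients"
            : "loopback: remote clients would connect to themselves";
        System.out.println(self.getHostName() + " -> "
            + self.getHostAddress() + " (" + verdict + ")");
    }
}
```

Run on the server itself, a loopback verdict here would point at the same kind of /etc/hosts entry that caught me out.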

The fix was to modify /etc/hosts on Chaos to read:
127.0.0.1  localhost
192.168.0.253 chaos
So now the stub that gets passed to the client will be configured to connect to 192.168.0.253. This worked just fine.
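If editing /etc/hosts isn't an option, my understanding is that the standard java.rmi.server.hostname system property overrides the host written into exported stubs, provided it is set before anything is exported. I didn't need it here, so treat this as a sketch under that assumption rather than something Kowari does. The class name AdvertisedHost and the helper choose are hypothetical.

```java
import java.net.InetAddress;

public class AdvertisedHost {

    // Hypothetical helper: keep the resolved address unless it is loopback,
    // in which case use an address supplied by the administrator.
    static String choose(InetAddress resolved, String fallback) {
        return resolved.isLoopbackAddress() ? fallback : resolved.getHostAddress();
    }

    public static void main(String[] args) throws Exception {
        // 192.168.0.253 is the address of chaos on my network;
        // pass a different one as the first argument.
        String fallback = args.length > 0 ? args[0] : "192.168.0.253";
        String host = choose(InetAddress.getLocalHost(), fallback);

        // Assumption: this must run before any remote object is exported
        // or any registry is created in this JVM.
        System.setProperty("java.rmi.server.hostname", host);
        System.out.println("stubs will advertise " + host);
    }
}
```

The same property can be given on the command line as -Djava.rmi.server.hostname=192.168.0.253 when starting the server JVM.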

So now I know a little more about RMI. I also know that if I ever get any money, I really want a spare computer so I can boot up Windows and not have to take my Linux server offline to do it.

1 comment:

Anonymous said...

Thank you a lot for your post. I got the same problem as you and I spent hours recoding and configuring Rmis without success.
I really enjoyed your explanation, it's very clear and it's also interesting to know your reasoning.

Good job :)

Tibs