Working notes: 12/01/2005

Thursday, December 29, 2005

Domains
I've always thought that domain squatting was an unethical way to make money, but had only heard stories of it before now.

Anne suggested that it might be nice to pick up the gearon.com domain, if it was available. After all, I normally take up a lot of the first result page at Google when you type in my surname. (Just looked, and today I don't! I really need to blog more often.) OK, so I'm not a commercial entity, but everyone recognizes .com, while people often look at me funny when I say .org or .net. The .com thing has brand recognition.

So I had a look, and discovered that the domain is already registered, but it is for sale. It turns out that it can be purchased through the bidding process at Afternic.com. The minimum bidding price was way more than I'd have liked to spend, but it's my name, so why not?

A week later I discovered the bid was rejected. I asked Afternic what a decent price is supposed to be (according to their market analysis). Their response was that the current market value is $200. Still far too much, but it gave me confidence to ask the current domain holder how much they would like.

The answer? $2950. Quoting from the email:

Price is very low for a family name.

Huh? Whose family? The Rockefellers?

I didn't care about the domain all that much (it should probably go to a more commercial interest, like something run by Michael Gearon or Tierney Gearon), but registering a name and then charging to give it back to an owner of that name is a principle I find rather offensive. I suppose I should be grateful she was asking for $3000 and not $30,000.

I resolved it by registering gearon.org for $8.20.

UIMA
The other day I followed a link over to IBM's DeveloperWorks, and found that they have an RSS feed for their tutorials. I was pleased to find a simple Python tutorial that I'm using to finally introduce myself to that language. But more importantly, I found a tutorial for generating a UIMA annotator.

The UIMA docs are very verbose, and a tutorial like this has been great for cutting through the chaff. It's still full of stuff I don't need (mostly because I've already learnt it from the official UIMA docs), but it's still been a real help.

My biggest problem at the moment is that UIMA wants all my annotations in character offsets. Unfortunately the library I'm using is providing my information in word offsets. That's trivial to convert when words are separated by whitespace, but punctuation leads to all sorts of unexpected things, particularly since the grammar parser treats some punctuation as individual words, while others get merged into existing words.

I'm starting to wonder if I need to re-implement the parser so I know what the character offsets of each word will be. Either that, or I'll be doing lots of inefficient string searching. I don't find either prospect enticing. Maybe if I sleep on it I'll come up with something else.
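One middle road between re-implementing the parser and repeated string searching is a single forward scan. The sketch below is only an illustration under stated assumptions: it presumes the parser returns its tokens in document order and that each token appears verbatim in the original text. The function name token_char_offsets and the sample tokens are hypothetical, not part of Link or UIMA. If the parser rewrites tokens (for example by changing case or expanding contractions), the lookup fails and raises an error rather than guessing.

```python
def token_char_offsets(text, tokens):
    """Map each token (in parser order) to a (start, end) character span.

    Assumption: every token occurs verbatim in `text`, in order.
    """
    spans = []
    cursor = 0  # never move backwards, so the scan stays linear overall
    for token in tokens:
        start = text.find(token, cursor)
        if start < 0:
            raise ValueError(f"token {token!r} not found after offset {cursor}")
        end = start + len(token)
        spans.append((start, end))
        cursor = end
    return spans


# Hypothetical tokenisation: some punctuation stands alone, some is merged.
text = "Hello, world! It's 15.0 degrees."
tokens = ["Hello", ",", "world", "!", "It", "'s", "15.0", "degrees", "."]
spans = token_char_offsets(text, tokens)
```

Because the cursor only moves forward, the final "." is matched at the end of the sentence and not inside "15.0". Each span can then be handed to whatever needs character offsets.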

Wednesday, December 28, 2005

Gödel's Theorem
I was just looking at the fascinating exhibit of equations by Justin Mullins. I'm not sure if I see the exhibit as art, since the visual appearance evokes little in people who do not understand the equations (with the possible exception of the Four Color Theorem), but they are certainly beautiful.

I particularly loved the end of the narrative for Gödel's theorem:

Others have wondered what Gödel’s theorem means for our understanding of the human mind. If our brains are machines that work in a consistent way, then Gödel’s theorem applies. Does that mean that it is possible to think of ideas that are true but be unable to prove them? Nobody knows.

Note the sentence that I highlighted. If it is true, then that sentence is an unprovable idea. I love it.

Saturday, December 17, 2005

Qubytes
Lots of places are commenting on the new quantum memory chips in silicon.

I'm surprised at this. I expected that embedding quantum devices in silicon would be done with quantum dots, rather than ion traps. It is probably better that it was done with ion traps, as there seems to have been more research into quantum processes using this technology. After all, what good is a quantum state if you can't apply transformations to it without collapsing the state?

All the same, a chip like this is just a first step in a long line of problems to be solved. There is no discussion about setting up quantum states, nor reading them back. There is no discussion about the ability to entangle the qubits on the chip, and how far that will scale. Transformations will eventually have to be built in to the chip. But if research has taught me anything, it's that the big problems are usually solved by lots of people chipping away at the little problems. By the time the final solution comes around, it doesn't seem like a big deal any more.

Tracker
The little tracker icon I have on this page is a link to a service that tells me how many hits the blog is getting (but not the RSS feed). I haven't bothered to look at the stats in a long time. After all, I'm rarely writing, so why would anybody (beyond my friends) bother to read?

Apparently I was wrong in that assessment. I'm averaging over 20 hits a day, with peaks over 40, and I'm writing less than once a week. My infrequency is due to lack of time. Given how many people are reading so little here, I'm wondering if anyone else suffers the same problem. :-)

Thursday, December 15, 2005

Blogging
I notice a new post on the Google Blog describing a new tool for Firefox. When installed, a small message will appear on any page you visit, showing a list of blogs which refer to that page. Sounds cute.

I normally browse with Safari (gotta love those native widgets), but I keep Firefox installed (after all, some pages have extra features when viewed with Firefox). So here was a chance to upgrade my version of Firefox (I hadn't picked up 1.5 yet) and install Google's new tool.

So where should I go first to check out the comment? Well obviously my usual home page of Google comes up, and there are ample comments. How about the page talking about the new tool? Lots of comments there too. Oh, I know! How about this blog? :-)

Unfortunately the list of comments was a little disappointing, mostly including my friends. That will teach me to not blog regularly. However, I did find one blog on semantic web development that I found really interesting. I was just disappointed that he didn't have a lot of incoming links (though there was one worth checking out).

All in all, it's a tool that I like. In fact, I wouldn't mind a similar tool that did a link: search on Google, rather than just in the blogosphere.

Tuesday, December 13, 2005

Modeling Talk
Last week I was invited along to a talk given by Bob at SAP. I enjoy seeing what Bob's working on when I'm not discussing OWL with him. He's a clever guy, and understands modeling quite well. I just wish I'd written about it sooner, as I won't be so clear anymore.

Probably the most important thing I got out of his talk was an overview of category theory. Andrae and Simon have both spoken about it, and I've come to understand that it's relevant, but as yet I haven't learnt anything about it. Bob gave the 30 second overview for computer scientists, which I found quite enlightening.

I finally got my copy of Types and Programming Languages (otherwise known as TAPL), and have been looking forward to reading it. But when ordering this book I discovered that Benjamin Pierce has also written a much smaller book called Basic Category Theory for Computer Scientists. I had considered getting this book (at only 117 pages, it looks like a relatively quick read), and Bob's talk has now convinced me. The only problem is that I'll have to put the order off until we move to the States... whenever that happens.

Modeling
Speaking of books, I also picked up a copy of MDA Distilled, Principles of Model-Driven Architecture. Some of the work I did with SAP came out of this book (virtually guaranteed when you work with one of the authors), and it talks about the kind of dynamic modeling that I've been talking about investigating with OWL. I haven't been through all of it before now, so I thought it would be worthwhile reading it in detail.

Coincidentally, I was explaining some of my ideas to someone at work today (Indy), and referred to this book to describe some of the background. I had some idea that Herzum Software worked with MDA (which is why I thought they might be interested in this work), but I had never thought of it as a formal association. Indy quickly made it very clear that Herzum Software specifically put themselves out there as an MDA company. That makes perfect sense as it aligns with what I already knew, but being in my own little corner of the world has kept me isolated from the advertising of it. Anyway, it's nice to know that the direction I'm moving in is paralleled by the work of my new employer.

RDFS Entailment
I've also been in an email discussion about entailment on RDFS. It seems that the following statements:

  <camera:min> <rdfs:range> <xsd:float>
  <_node301> <camera:min> '15.0'^^<xsd:float>

will lead to an entailment of:

  <xsd:float> <rdfs:subClassOf> <rdfs:Resource>
  '15.0'^^<xsd:float> <rdf:type> <xsd:float>
  '15.0'^^<xsd:float> <rdf:type> <rdfs:Resource>

It seems that I didn't cover all the possible rules which could lead to a literal in the subject position. It's quite annoying, as these are completely valid entailments, according to RDF semantics. Making special cases to avoid particular results seems like a hack.
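To make the problem concrete, here is a minimal sketch of a naive rule engine. It is not Kowari's implementation. It applies only two rules, stated informally: the range rule (if a property has a range, every value of that property has that type) and the rule that every object is an rdfs:Resource. The triple representation, the is_literal convention, and the function names are all assumptions made for this illustration.

```python
RDF_TYPE = "rdf:type"
RDFS_RANGE = "rdfs:range"
RDFS_RESOURCE = "rdfs:Resource"


def is_literal(term):
    # Assumed convention for this sketch: literals look like 'value'^^datatype
    return term.startswith("'")


def apply_rules(triples):
    """Apply the two rules once, with no check on what lands in subject position."""
    inferred = set()
    for s, p, o in triples:
        # Every object of a statement is a resource.
        inferred.add((o, RDF_TYPE, RDFS_RESOURCE))
        if p == RDFS_RANGE:
            # s is a property with range o: type every value of that property.
            for _, p2, o2 in triples:
                if p2 == s:
                    inferred.add((o2, RDF_TYPE, o))
    return inferred - set(triples)


def split_by_subject(triples):
    """Separate statements RDF syntax can express from those with literal subjects."""
    legal = {t for t in triples if not is_literal(t[0])}
    return legal, set(triples) - legal


base = {
    ("camera:min", "rdfs:range", "xsd:float"),
    ("_node301", "camera:min", "'15.0'^^xsd:float"),
}
legal, literal_subjects = split_by_subject(apply_rules(base))
```

With this input, two of the three inferred statements end up in literal_subjects. Filtering them afterwards, as split_by_subject does, is exactly the kind of special case that feels like a hack, but it shows where such a check would have to sit.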

In a similar way, it seems wrong to not allow entailments about blank nodes. I should re-visit the decision there. I think I need to re-read the semantics document to see if I can get further enlightenment. At the least, I know that I can't entail a statement with a blank node as the predicate. Like the problem with literals, the semantics document appears to justify this sort of statement, but the RDF syntax doesn't allow for it. I know this is a particular bugbear for Andrae.

Catch Up
I've been wanting to write for nearly a week now. Every time I try to sit down for it I've had a work task, family needs, or packing to take priority over this blog. I ended up having to write little notes to myself to remind me of what I wanted to blog about.

JNI and Linux
Having made the Link library work on Mac OSX using JNI, I figured it would be easy to get it working on Linux as well. Unfortunately it didn't work out that way.

To start with, I got an error from the JVM saying that it could not find a symbol called "main" when loading the library. This sounded a little like dlopen loading an incorrectly linked file. I'm guessing that the dlopen procedure found what it thought was an executable, and therefore expected to see a main method. Googling confirmed this, but didn't really help me work out the appropriate flags for linking to fix this.

I had compiled the modules for the library using -fPIC (position independent code). I then used a -Wl,-shared flag to tell gcc to pass a -shared flag to the linker, in order to link the modules into a shared library. However, it turned out that I really needed to just use -shared directly on gcc. I've still to work out what the exact difference is, but that's not a big priority for me at the moment, since I have it working. According to DavidM there is something in the gcc man page about this, so at least I know where to look.

After linking correctly, the test code promptly gave a Hotspot error, due to a sigsegv. This meant that there was a problem with the C code. This had me a little confused, as it had run perfectly on OSX. Compiling everything in C and putting it all in a single executable demonstrated that the code worked fine on Linux, so I started suspecting that the problem might be across the JNI interface. This ended up being wrong. :-)

There are not many differences between the two systems, with the exception of the endianness of the CPUs. However, after looking at the problem carefully, I could not see this being the problem.

The initial error included the following stack trace:

C [libc.so.6+0xb1960]
C [libc.so.6+0xb4fcb] regexec+0x5b
C [libc.so.6+0xd0a98] advance+0x48
C [liblink.so+0x19f9d] read_dictionary+0x29
C [liblink.so+0x1d705]
C [liblink.so+0x1d914] dictionary_create+0x19
C [liblink.so+0x286c9] Java_com_link_Dictionary_create+0xc1

The only code I had real control of was in Java_com_link_Dictionary_create, dictionary_create and read_dictionary. I started by looking in Java_com_link_Dictionary_create and printing the arguments, but everything looked fine. So then I went to the other end and looked in read_dictionary.

I was a little curious about how read_dictionary was calling advance, as I hadn't heard of this function before. Then I discovered that the function being called was from the Link library, and has a signature of advance(Dictionary). This didn't really make sense, as my reading of the stack trace above said that advance came from libc and not the Link library (liblink). This should have told me exactly what was happening, but instead I tried to justify what I was seeing. I convinced myself that the function name at the end of each line described the function that had called into that stack frame. In hindsight, it was a silly bit of reasoning. I was probably just tired.

So to track the problem down I started putting printf() statements through the code. The first thing that happened was that the hotspot errors changed, making the error appear a little later during execution. So that meant I had a stack smash. Obviously, one of the printf() invocations was leaving a parameter on the stack that helped the above stack trace avoid the sigsegv. OK, so now I'm getting some more info on the problem.

It all came together when I discovered that I was seeing output from just before read_dictionary() called advance(), and from just after it, but not from any of the code inside the advance() function. At that point I realised that the above stack trace didn't need a strange interpretation, and that the advance() that I was calling was coming from libc and not the local library.

Unfortunately, doing a "man advance" on my Linux system showed up nothing. Was I wrong about this method? I decided to go straight to the source, and did a "nm -D /lib/libc.so.6 | grep advance". Sure enough, I found the following:

  000b9220 W advance

So what was this function? Obviously something internal to libc. I could download the source, but that wasn't going to make a difference to the problem or the solution. I just had to avoid calling it.

My first approach was to change the function inside Link to advance_dict(). This worked perfectly, and showed that I'd found the problem. However, when the modules were all linked into a single executable it had all worked correctly, and had picked up the local function, rather than the one found in libc. Why didn't the same thing happen in the shared library?

I decided that if I gave the compiler a hint that the method was local, then maybe that would be picked up by the linker. So rather than renaming the function to advance_dict(), I changed its signature from:

  int advance(Dictionary dict)

to:

  static int advance(Dictionary dict)

I didn't know that this would work, but it seemed reasonable, and certainly cleaner since it's always a bad idea to presume that your name is unique (as demonstrated already). Fortunately, this solution worked just fine.

DavidM explained to me that static makes a symbol local to a compilation unit (which I knew) and was effectively a separate namespace (which I also knew). He also explained that this "namespace" has the highest priority... which I didn't know, but had suspected. So I learned something new. David and I also learnt that libc on Linux has an undocumented symbol in it called advance. This is worth noting, given how common a name that is. As shown here, it is likely to cause problems on any shared library that might want to use that name.
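A quick way to check a candidate function name before it bites is to ask the dynamic loader whether the name already resolves. The sketch below is only an illustration, using Python's ctypes rather than nm. It assumes a POSIX system, where ctypes.CDLL(None) opens the namespace of the running process (libc included). The helper name already_exported is hypothetical. Whether "advance" itself is still visible this way depends on the libc version, so the example makes no claim about the result.

```python
import ctypes


def already_exported(name):
    """Return True if `name` resolves among the libraries loaded into this process.

    Returns None where the process namespace cannot be opened (e.g. non-POSIX).
    """
    try:
        handle = ctypes.CDLL(None)  # POSIX: equivalent to dlopen(NULL)
    except (OSError, TypeError):
        return None
    # ctypes raises AttributeError for unknown symbols, so hasattr is a lookup.
    return hasattr(handle, name)


# Names that a shared library might be tempted to define for itself.
for candidate in ("advance", "printf", "read_dictionary"):
    print(candidate, already_exported(candidate))
```

A True result only says that the name is taken in this particular process. Marking the function static, or giving it a library-specific prefix, remains the safer fix.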

There's more to write, but it's late, so I'll leave it for the morning.

Sunday, December 04, 2005

Blogging
I'm a little annoyed at myself for lack of blogging recently. This is particularly the case as I see mainstream media commenting on people's online presence more and more. I almost feel like I'm missing out on something. Yes, I know that's a ridiculous concern, but I'm allowed to worry about anything I want to. :-)

Other than the restrictions imposed on me by my recently expanded family, my main problem with blogging recently has been lack of material. I don't mean that I have nothing to say. Instead, I'm limited by what is appropriate to put into a public forum. It was much easier when I worked on Open Source software all the time.

For instance, this last week has had me reviewing software produced by a group of academics at another company. My review is for my employer, so I obviously can't publish it (otherwise, why would he be paying me?). Also, any review will naturally say both good and bad things. The good may be OK, but saying something bad in public is obviously inappropriate. After all, these guys are out to impress customers and make money too.

So I'm left having to write about what I do out of hours. That's all well and good, but having a young family reduces the time for that.

I could always write a few opinion pieces. Australian federal politics has had me feeling frustrated for some time now, and I definitely have things to say on the topic. But that's not what this particular blog is about. I could always start a parallel blog, but then, who would really want to know what I think about Brendan Nelson and higher education in Australia? It would be cathartic for me, but not so much that I think it's really worthwhile.

All the same, I might consider a second blog to contain random musings (like this one). Maybe one evening when I'm not feeling like going to bed, and I have something I feel I want to say. It could be a mix of my daily life, frustrations, and comments on the oft explored experience of fatherhood. I'm not sure it will be good reading, but I may have fun coming back to it in a few years to see just how naive I really was back in 2005. :-)

Grammar
Meanwhile, I'm back to grammar parsing, using Link. I was a little chuffed to get the JNI all working, particularly when I was able to rewrite some of the test code in Java and have it all run correctly. I still need to test that it runs fine on Linux, but I don't have any real concerns there. Making it run on Windows will be another story.

Ideally, I'll be able to use MinGW as the compiler, as it should help keep the codebase and build process consistent. I just hope I won't have to jump through too many hoops to generate a DLL file.

I could always ask someone at work if we have an MS commercial compiler, but we may not. I have my own, but I'm not licensed to use it for work. It amazes me that people are concerned about the restrictions of Open Source licensing, when commercial licensing can be far worse.

Weather
I'm a little obsessed with the weather at the moment. I enjoy our sub-tropical climate here, and it's going to be a rude shock to land in Chicago in the middle of Winter. As a result, I'm enjoying every minute here that I can. I'm also comparing the weather between the two cities on a day-by-day basis. The huge difference fascinates me, but the guys at work are probably annoyed with me by now.

According to AccuWeather.com, Chicago is currently well below zero (Celsius), and will be staying that way all week. The town I grew up in (Chinchilla) often goes below zero during Winter, but that only happens overnight. Chinchilla also hasn't had snow since the early 1900s (1915 rings a bell for some reason).

Brisbane has been my home for the last 17 years, and it has never been below freezing (at least, not in recorded history). So I really haven't experienced anything like Chicago before. Can you blame me for paying attention to the differences?

In the meantime, Brisbane is just starting on its first heat wave for the Summer. Fortunately, it's not supposed to get as high as 40C (104F) over the coming week, but it won't be far off. The prediction is 37C (99F). Not too bad, but unpleasant all the same. Overnight minimums are over 20C (68F), so Luc isn't sleeping too well. This is a far cry from Chicago, where the highest maximum for the coming week is -4C (24F).

This will certainly add some excitement to the move!

Thursday, December 01, 2005

Bytecodes
DavidM helped me to find a slew of bytecode libraries (many of which are here). Some of these are better than others, but I was surprised to discover that none of them work from the basis of an AST. That means a lot of work gluing an AST onto an appropriate bytecode library, which reduces the advantages of using third party libraries.

That leads me back to looking at an existing compiler, as these must already go from AST to bytecode. All I'm trying to achieve beyond this is the ability to persist and retrieve the AST in RDF, and a public API for modifying the AST. So maybe I should be going directly to a compiler like the one built into Eclipse?

Types and Programming Languages
One of DavidM's first suggestions to me was to use the expression library from Kawa. While it doesn't express an AST either, it does meet much of the criteria of what I'm looking for. However, it is really centered around providing support for Scheme in a JVM.

I don't really know Scheme, and my first thought was that this would be something else I'd rather avoid learning (there's so much to learn these days - I have to be discriminating). However, Andrae was quick to point out that it's a type of Lisp, and that I already have the basics in the lectures on "Structure and Interpretation of Computer Programs" (which I have the video files for). He also pointed out that, given the Lisp heritage of OWL, I'd do well to learn Scheme. So I decided to pull the lectures back out, dust them off, and watch them right through (I'd only seen the first 4 before). It's not always easy to find the time, but I have to say that I enjoy watching them.

Fortunately, I have a new iPod (yes, that involved some wrangling with Anne), so I've converted all the lectures over to it (courtesy of FFmpeg) and can watch them whenever I have a spare moment. It's just a shame that the battery can only handle a couple of hours of video.

While discussing languages with Andrae, he mentioned Types and Programming Languages by Benjamin Pierce. I've always been impressed with Andrae's knowledge of the theoretical underpinnings of languages, and had put it down to extensive reading on the topic. This book is apparently one of the central sources for the information, so I'm thinking I'd like to read it.

I went looking around the appropriate bookstores in Brisbane yesterday, and half of them did not even have access to a distributor for this book. When I finally found one, they told me it would take 6 weeks to come in from America, and would cost $185 AUD. Looking at Amazon, I'm able to purchase the book and its sequel for only about $165 AUD (and that's probably a similar shipping time). So I'd be buying the book from there, if it weren't for the fact that we're about to move! Would I ship the book here, to my in-laws near Melbourne (where I'll be working for a couple of weeks around Christmas), or to the office in Chicago?

The visa paperwork is very frustrating. It would be nice to know when I'm moving for real. <sigh>

In the meantime I'm reading Java Puzzlers in my Copious Free Time, and enjoying it thoroughly. It's the puzzles that I couldn't answer on my own that I enjoy the most. I've been bugging all my friends with them. :-) The best part is that it's finally encouraged me to read the JVM specification.

Knowledge Mining
Today I dropped into UQ for a short seminar on Knowledge Mining, presented by Osmar Zaiane. I'm glad I went, as it discussed several techniques of semantic extraction. In particular, it described the details of several methods based on Apriori.

It also served to remind me about how much I really remember about neural networks (more than I care to admit), and that they're still considered the best solution to some classification problems. Perhaps I should revisit them.

Some of these techniques, plus a few others that were mentioned, may help me to boost the level of semantic extraction I've been able to get so far at Herzum. I'll have to look into this area a little more.

While at the university I dropped in a form to defer my enrolment for a couple of months. With the new job, the new baby, Christmas, and the impending move to Chicago, I figured I might be able to use a short break. Bob agreed that it sounds like a good idea. So I'm officially off until April, though I expect to be working on Kowari and OWL a little in the meantime.
