Wednesday, July 28, 2004

Nearly There With N3
I was feeling tired and belligerent last night as I wrote my blog. Consequently I wrote that I was going to use a regex to parse N3. This isn't as bad to implement as it sounds, but I still had no real intention of doing it. Going about it this way would mean re-inventing the wheel. Worse yet, it would mean that every little corner case of N3 that I hadn't considered would plague me for months.

I wrote what I did last night in the hope that someone would say, "You moron! Why aren't you using the XXX parser?" I'd had a quick look for parsers, but the ones I'd found were too heavily tied to their applications, and using them was going to be more work than it was worth, so I was hoping to be directed to something more portable. The response I received was from AN (and it was more pleasant than calling me a "moron"), who suggested I use the Jena N3 parser.

Fortunately, the Jena parser is event based, making it easy to hook into. It only took me half an hour to knock up some quick code that would parse and print an example N3 file (less than that to use the built-in N3EventPrinter class, but I needed to try it for myself). So I ended up thinking that I should be able to make up for the last few days, when I feel I haven't really done all that much. But it's now after 10:30pm, and I've only just finished the main structure of it.
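For readers unfamiliar with the term, "event based" means the parser calls back into a handler for each statement it finds, instead of building a whole document in memory first. The sketch below shows only that pattern. It does not use Jena: `StatementHandler`, `CollectingHandler` and `ToyEventParser` are hypothetical names made up for illustration, and the toy driver only understands one whitespace-separated "subject predicate object ." statement per line, which is nowhere near real N3.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical callback interface, standing in for a real parser's event handler.
interface StatementHandler {
    void startDocument();
    void statement(String subject, String predicate, String object);
    void endDocument();
}

// A handler that records a printable line for every event it receives.
class CollectingHandler implements StatementHandler {
    final List<String> lines = new ArrayList<>();

    public void startDocument() { lines.add("start"); }

    public void statement(String subject, String predicate, String object) {
        lines.add(subject + " " + predicate + " " + object + " .");
    }

    public void endDocument() { lines.add("end"); }
}

// Toy driver (an assumption, not a real N3 parser): one statement per line,
// blank lines and lines starting with '#' are skipped.
class ToyEventParser {
    static void parse(String text, StatementHandler handler) {
        handler.startDocument();
        for (String line : text.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            // Drop the terminating '.' if present.
            if (trimmed.endsWith(".")) {
                trimmed = trimmed.substring(0, trimmed.length() - 1).trim();
            }
            // Limit of 3 keeps an object such as "hello world" in one piece.
            String[] parts = trimmed.split("\\s+", 3);
            if (parts.length == 3) {
                handler.statement(parts[0], parts[1], parts[2]);
            }
        }
        handler.endDocument();
    }
}
```

Glue code for a store would live in the handler: swap `CollectingHandler` for one that turns each callback into an insert, and the driver stays untouched.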

There was a LOT of glue needed to put this parser together with Kowari. I was able to base a lot of it on the RDF/XML parser that Kowari already has, but there were still significant differences. All up, it took over 700 lines of code.

The biggest hassle was converting from Antlr AST nodes to Literal nodes or URI references. I didn't have all of the Antlr code available to me (I just had the jar, but I could possibly find the source if need be), and I couldn't find any documentation easily. The N3EventPrinter offers some tantalizing hints, but it leaves a lot open to the imagination. Also there is a plethora of Antlr types, and I'm pretty sure that most of them aren't applicable to N3, but I couldn't work out which ones were. There are a few obvious ones, particularly anonymous nodes, literals, and URIs, but beyond that I don't really know.
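One way to sidestep the undocumented token types while experimenting is to classify a node by the text it carries. The following is a minimal sketch under that assumption: it looks only at the lexical form of a term (`<...>` for a URI reference, `_:name` for an anonymous node, a leading quoted string for a literal) and ignores Antlr's type codes entirely. `NodeKind` and `NodeClassifier` are hypothetical names, and whether the AST nodes actually expose their text in this form would need checking against the real parser.

```java
// Hypothetical node categories, matching the three "obvious" kinds named above.
enum NodeKind { URI_REFERENCE, LITERAL, BLANK_NODE, UNKNOWN }

class NodeClassifier {
    // Classify a term purely from its lexical form (an assumption; the real
    // parser may hand over text with the delimiters already stripped).
    static NodeKind classify(String token) {
        if (token == null) {
            return NodeKind.UNKNOWN;
        }
        String t = token.trim();
        // <http://example.org/thing>
        if (t.length() >= 2 && t.startsWith("<") && t.endsWith(">")) {
            return NodeKind.URI_REFERENCE;
        }
        // _:label
        if (t.length() > 2 && t.startsWith("_:")) {
            return NodeKind.BLANK_NODE;
        }
        // "text", optionally followed by @lang or ^^<datatype>
        if (t.startsWith("\"") && t.lastIndexOf('"') > 0) {
            return NodeKind.LITERAL;
        }
        // Prefixed names, keywords, variables and so on are left unresolved here.
        return NodeKind.UNKNOWN;
    }
}
```

Anything that comes back as `UNKNOWN` is a candidate for logging during a trial run, which gives a concrete list of the token kinds that still need handling.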

At this point I'm thinking that the best approach might be to just "suck it and see".

I'm still some way off running it though, as the compiler is complaining that it doesn't know about classes that I've included and put on the classpath. Once I got to that point I thought it worth blogging where I am and then taking a break.

1 comment:

Anonymous said...

Would you be able to post your code to give an example of how Jena was used to parse an N3 file? There doesn't seem to be any good documentation on the web.