Monday, September 06, 2004

Proof Reading and Blogger Bugs
Well, I came back and proofread this entry, only now I can't publish it. The publishing post keeps timing out with the following message:

    001 Connection timed out
It's very frustrating. Anyway, I'm having another go, with a new header (i.e. this paragraph) in the hope that it will start working.

Holiday Preparation
I'm back from a 2 week break. Not all that relaxing, as it was mostly spent visiting relatives to introduce Luc. It was nice to see everyone, but I'm glad I'm back so I can relax. :-)

My last couple of days of work were the 19th and 20th. I had hoped to make a couple of blog entries for those days, but I wasn't able to keep up with everything, so it never happened. Those two days were spent showing AN what I'd been doing on inferencing with DROOLS, so he could finish off what I'd done, and working on the HAVING clause.

The HAVING clause needed to be done in 2 parts. The first part was the syntax, and the data structures created by it. The second part was the final filtering and grouping based on these data structures. I had enough time to get the first part done, but I had only just started on the second before leaving. A lot of this time was spent talking about the design of the changes, and on handing over what I'd done.

So then I got to have my time off.

Not a lot of time for work or study over the last 2 weeks, but I thought I'd mention a couple of things which I did do.

JRDF and Today's Work
During my first week I saw a message from AN about the in-memory implementation of JRDF. It turned out that there were a couple of little bugs, starting with comparing URIReference objects from 2 separate models. This didn't really surprise me, as I hadn't actually considered multiple models when I designed it. After all, I really just wrote it over the course of a few evenings, simply as an exercise to get me familiar with the interface, so I had some practice before doing a disk based JRDF in C.

It turned out that having an in-memory model like this was useful, and ML was using it. Only he was pushing it further than I'd ever really considered, and hence the bugs showed up. So I started thinking about the design, and I realised that the simple bug fixes being made only hid some more fundamental problems with the design. For instance, there could be a real problem if a model were serialized with all of its node IDs and a new model were to re-use those IDs. Node IDs had to be local to a model (meaning a URIReference which appeared in more than one model could have multiple IDs) or else they would be global, but only for the lifetime of a session. The latter would mean big changes to serialization to avoid the internal node IDs, but that seemed reasonable.

I mentioned these problems to AN, and he wrote back to tell me that he fixed it all. Whew. I didn't want to work on it all again.

When I got back this morning I learned that my little evening hobby is now being shipped as part of the production Kowari system. Eeek! :-) Oh well. I guess that's what open source code is about.

There is still a remaining problem with the in-memory JRDF. The "remove" methods on the iterators are not implemented (I always threw an UnsupportedOperationException as I had no need to implement them before), but without them it is not possible to remove triples while iterating over a set of results. If Graph.remove() is called while iterating then a ConcurrentModificationException will be thrown. This is to be expected, but it's annoying for anyone who might want to do something like this. So this is what I spent part of today implementing. I have it all done on two of the ClosableIterator implementations, but I still have another two to go.
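To illustrate the problem, here is a minimal sketch using a plain java.util.ArrayList of strings rather than the JRDF classes. The class and method names are made up for the example, and the "triples" are just strings. Removing through the collection while iterating trips the fail-fast check, while removing through the iterator is safe.

```java
import java.util.ConcurrentModificationException;
import java.util.Iterator;
import java.util.List;

// Hypothetical demo class, not part of JRDF.
class IteratorRemoveDemo {

    /** Removes matching entries via the list itself; returns true if the iterator failed fast. */
    static boolean removeViaCollection(List<String> triples) {
        try {
            for (String t : triples) {
                if (t.startsWith("a ")) {
                    triples.remove(t); // structural change behind the iterator's back
                }
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    /** Removes matching entries via Iterator.remove(), which keeps the iterator consistent. */
    static void removeViaIterator(List<String> triples) {
        Iterator<String> it = triples.iterator();
        while (it.hasNext()) {
            if (it.next().startsWith("a ")) {
                it.remove();
            }
        }
    }
}
```

An iterator that throws UnsupportedOperationException from remove() leaves callers with only the first option, which is why implementing remove() matters here.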

When remove is called it is a simple enough matter to pass the call down to the wrapped iterator, but that will only remove the final node (and hence the triple) for that index. The other two indexes also need to have their statement removed. Fortunately there was a private method for that on GraphImpl so I've promoted that to protected visibility, and I'm able to call it from the remove method on the ClosableIterators.

Actually, I just remembered that I missed something. If I pass down the remove call to the underlying map, then I need to check if it was the only entry in that map. If it was then I should remove the entire map. This check will then need to propagate up the maps to the head of the index if need be. I'll add this in the morning.
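The pruning step can be sketched roughly as follows. This assumes an index built as a map of maps of sets keyed on node IDs, which is a guess at the general shape rather than the actual GraphImpl code; the class and method names are invented for the example.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical single index: first node -> second node -> set of third nodes.
class TripleIndex {
    private final Map<Long, Map<Long, Set<Long>>> index = new HashMap<>();

    void add(long first, long second, long third) {
        index.computeIfAbsent(first, k -> new HashMap<>())
             .computeIfAbsent(second, k -> new HashSet<>())
             .add(third);
    }

    /** Removes one statement, then prunes any containers left empty, working up to the head. */
    boolean remove(long first, long second, long third) {
        Map<Long, Set<Long>> subIndex = index.get(first);
        if (subIndex == null) {
            return false;
        }
        Set<Long> group = subIndex.get(second);
        if (group == null || !group.remove(third)) {
            return false;
        }
        if (group.isEmpty()) {
            subIndex.remove(second);      // last entry gone, drop the inner set
            if (subIndex.isEmpty()) {
                index.remove(first);      // and the map above it, if that is now empty too
            }
        }
        return true;
    }

    /** Number of entries at the head of the index. */
    int headSize() {
        return index.size();
    }
}
```

With three indexes over the same statements, the same removal would be repeated on each one with the nodes in that index's order.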

While discussing today's work, I should note that the HAVING clause was never finished while I was away. Once I have this "remove" code sorted out I'll get back onto that.

AUUG 2004
In my second week we went to Melbourne, where the AUUG 2004 conference was on. Though I was unable to afford the conference, I decided to take advantage of the free tutorial that Apple put on (thanks Apple!). When I showed up, someone had made a mistake and put me down as a full conference attendee! If I had the time I could have attended anything I wanted, plus I'd have been fed for free.

I didn't do anything that I shouldn't, but I did score a nice bag, with a decent T-shirt and the conference proceedings in it. I also got to reacquaint myself with a few people I haven't seen in some time.

I finally wrote out that paper on inferencing for Bob, just before I went to see him. He wasn't too concerned about the specific content, but had wanted me to take the time to think things through. It gave me a clearer idea of what I mean by inferencing, and how I think it can be done. When I get time I'll put it on a web site somewhere.

While I was in at UQ I was given access to the ITEE network, a key for 24/7 access to the ITEE building, and a key to an office where I share a desk with another part time student. I probably won't use it much, but it was fun to make Anne jealous with it, as she was never given an office when she did her Masters in Landscape Design. ;-)

I didn't do as much reading as I'd hoped, but I did get a little done. I've borrowed a copy of Bob's book to read a few useful chapters, and I'm hoping to get through that, along with some material on Magic Sets. This came up because I finally re-read "A Survey of Research on Deductive Database Systems", and I now understand a lot more of what it is talking about. It's very useful, but it is now 11 years old, so I will have to do quite a lot of my own research to get a more up to date view of the state of research in this area.

The most interesting thing about this paper was learning that we have re-implemented a number of things in Kowari which others have been doing for years. This is hardly surprising, but it was strange to see. It is also useful to be able to put a name to things like semi-naive evaluation, which is exactly the trick I'd used in the DROOLS code. The survey paper pointed out that several people had independently proposed this. The fact that I came up with it myself demonstrates the obviousness of this solution.
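For anyone who hasn't met the term, semi-naive evaluation means that each round of rule application only joins against the statements that were new in the previous round, instead of re-deriving everything from the full set. A minimal sketch for transitive closure follows. It is not the DROOLS code; the class name and the representation of an edge as a two-element list are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical example class: path(x, z) :- path(x, y), edge(y, z).
class SemiNaive {

    /** Computes the transitive closure of edges, each given as a [from, to] list. */
    static Set<List<String>> closure(Set<List<String>> edges) {
        Set<List<String>> all = new HashSet<>(edges);   // everything derived so far
        Set<List<String>> delta = new HashSet<>(edges); // only what was new last round
        while (!delta.isEmpty()) {
            Set<List<String>> next = new HashSet<>();
            for (List<String> d : delta) {              // join the new facts only...
                for (List<String> e : edges) {          // ...against the base edges
                    if (d.get(1).equals(e.get(0))) {
                        List<String> derived = Arrays.asList(d.get(0), e.get(1));
                        if (!all.contains(derived)) {
                            next.add(derived);
                        }
                    }
                }
            }
            all.addAll(next);
            delta = next;                               // stop once a round adds nothing
        }
        return all;
    }
}
```

The naive version would join all of the derived paths against the edges on every round, repeating the work of every earlier round each time.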

The only technology from the paper that I really know nothing about is Magic Sets, hence my desire to look into them a little more.

The single biggest thing I got from this is how big the field is, and how little I know about it. I'll confess that it has me a little intimidated.

I also noticed that all of the surveyed projects, going back to the 60's, are based on predicate logic. They expect their data to be in a predicate form, and yet almost all databases are in the standard tabular form, requiring a mapping from one form to another. I realise now that RDF is one of the few forms of data that predicate logic can be applied to directly. Given the huge body of inferencing research in this area, this makes RDF seem like an ideal tool for inferencing. I haven't seen this written down anywhere, but I suppose this might be one of the reasons RDF was designed in the way that it is.

Inferencing Rules
While doing my inferencing paper for Bob, and again during some of my reading, it occurred to me that almost all inferencing is done as a set of rules, often described with Horn clauses. These all boil down to: "If this set of statements, then this new statement."

OWL inferencing is a specific set of statements from this general structure. So if I consider the general structure then I can cover OWL. Incidentally, I've noticed that I can't find much on OWL inferencing. I've seen validation on OWL, and I've seen RDFS inferencing (or "entailment") but I've yet to see anything that is inferred with OWL.

Many projects refer to Intensional Databases as those which store these rules for inferencing. It occurred to me that we also want to store such rules in the database. The closest I came was putting the rules into an iTQL form and putting them into a script used by DROOLS, but that wasn't particularly flexible.

I'm thinking that we want to define a model (or a number of models) which hold inferencing rules. A rule can have a property which contains iTQL which defines the body and head of the rule (the head makes up the resulting statements). While the whole lot can be defined in a single iTQL statement, testing and insertion could be separated if the head and body were stored separately. When being executed the rule would be given a destination, and a source for the body to be applied to (i.e. a FROM clause).

The result would be models which defined the rules (eg. an RDFS entailment model, an OWL inferencing model, or a domain specific inferencing model), and a new iTQL command to apply a model-of-rules to another model, inserting the results into a destination model.
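As a very rough sketch of that shape, here it is in plain Java rather than iTQL, since the iTQL command doesn't exist yet. Everything here is hypothetical: the class names, the use of three-element lists as statements, and the choice of the RDFS subclass rule as the example. The point is only that the body and head are held separately, and that applying a set of rules takes a source and a destination.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch; statements are [subject, predicate, object] lists.
class RuleModelSketch {

    /** A rule whose body and head are stored separately, so the body can be tested alone. */
    static final class Rule {
        final Function<Set<List<String>>, List<Map<String, String>>> body;
        final Function<Map<String, String>, List<String>> head;

        Rule(Function<Set<List<String>>, List<Map<String, String>>> body,
             Function<Map<String, String>, List<String>> head) {
            this.body = body;
            this.head = head;
        }
    }

    /** Runs each rule body over source and inserts the head statements into destination. */
    static int apply(List<Rule> rules, Set<List<String>> source, Set<List<String>> destination) {
        int inserted = 0;
        for (Rule rule : rules) {
            for (Map<String, String> binding : rule.body.apply(source)) {
                if (destination.add(rule.head.apply(binding))) {
                    inserted++;
                }
            }
        }
        return inserted;
    }

    /** If (x rdf:type c) and (c rdfs:subClassOf d) then (x rdf:type d). */
    static Rule subClassRule() {
        return new Rule(
            source -> {
                List<Map<String, String>> bindings = new ArrayList<>();
                for (List<String> t1 : source) {
                    if (!t1.get(1).equals("rdf:type")) {
                        continue;
                    }
                    for (List<String> t2 : source) {
                        if (t2.get(1).equals("rdfs:subClassOf") && t2.get(0).equals(t1.get(2))) {
                            Map<String, String> binding = new HashMap<>();
                            binding.put("x", t1.get(0));
                            binding.put("d", t2.get(2));
                            bindings.add(binding);
                        }
                    }
                }
                return bindings;
            },
            binding -> Arrays.asList(binding.get("x"), "rdf:type", binding.get("d")));
    }
}
```

A single call to apply makes one pass over the rules. Reaching a fixed point would mean repeating it until nothing new is inserted, with the destination feeding back into the source.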

I'll have to speak to AN about this in more detail, but it appeals to me, particularly due to its flexibility, and because it can be mapped directly from Horn clauses.

File Systems
Several people have already pointed out that the properties proposed for WinFS look surprisingly like RDF, with a few things missing (hi Danny, if you're reading). I've been thinking for a while that this seems like a strange restriction, when one could always tack RDF straight into the file system.

I've let it slide for a little while now, but I've been thinking that if I get back to this C implementation of an RDF store, then I should be able to build a filesystem with RDF built in. (How do I implement the interface for that? ioctl() calls? It should be an interface that makes sense and doesn't break anything.)

Now that MS has announced that they want to push WinFS back a couple of years I'm even more keen to do something with this. Once I've sorted out the RDF store, and the interface, then it should be reasonably easy to build it into a user-space filesystem for Linux. I've been meaning to have a look at how to do a user space system, and this would be a great opportunity to learn.

Besides, it would be terrible for MS to have something that Linux doesn't. :-) While I'm at it, I should see how filesystems for Macs work. Macs are already doing something in this space, but portability is still a desirable trait in any system. There is an ext2fs module for the Mac, so I should be able to use that as a template.

First things first, though. I'll have to get this RDF store working before I try to patch it into something like a filesystem. Still, I'm pleased to have a purpose for the code, so it isn't just a mental exercise.
