Wednesday, February 16, 2005

Cold
When you say that you have a "cold" people tend not to think much of it. After all, they're "common", right? Most people I know say that they have the "flu" instead, because everyone knows that the flu is worse. :-)

Well, Luc brought home a cold from daycare. Yes, it's a cold, but it's still a nasty little virus. Luc got over it pretty fast, but Anne and I are still suffering. Unfortunately for me, I went on an 80km bike ride on Sunday morning (wondering why I was struggling so much), which seemed to depress my immune system just in time for the virus. I ought to know better than to do heavy exercise when Luc is sick.

Now that I'm into the fourth day I thought that I might wake up feeling better, but instead I felt worse. I was just starting to feel sorry for myself when I realised that I feel worse because all of my joints are aching. That's when I remembered how the immune system works... and aching joints means that it's kicked in and doing it's job! Yay! That should means that I'll feel better soon. :-)

Anyway, for those who were wondering, that's why I haven't been blogging. I'd rather get to bed early when I feel this way, rather than blog. Since I'm here today, I'd better preemptively explain that my head feels "muggy", so I apologise in advance for anything here that doesn't make sense.

Work
So what was I doing for the last few days? Well I've kind of started work again.

For anyone reading the comments on my blog, you may have seen a message from someone called GregM suggesting that he may have some consulting work for me. That seems to have panned out, and I'm now spending the next few months doing just that. Greg seems to want exactly what I'm trying to build as my Masters project, and so he is funding me to continue working on that.

I'm more grateful than he knows, because I've been wanting to concentrate on this for some time. At Tucana we had a lot of commercial pressures which kept pulling me away from doing this work. In fact, much of it had to happen in my spare time, even though the company wanted it implemented.

Now that Tucana is no more, I expected to find a completely unrelated job, and work on this stuff in my spare time. I was going to do that, but it would have been frustrating.

So I'm pretty excited that I have the chance to concentrate on OWL full time for a while. Who knows? Maybe one day I'll even get a full time job doing this stuff. (I'm really keen to work overseas for several years, so hopefully one day I'll find someone will want someone who knows about semantic web technologies).

Anyway, based on the promise of payment from Greg, I've gone and turned down two other jobs this week (it's a little scary turning down full time employment for a job that looks like fun when you're getting as broke as Anne and I are). I know that payment is out of Greg's hands now, so Anne and I are hoping that we'll see something soon.

As a final word on the consulting... Greg reads this blog, so I just want to say thank you. :-)

Mastering OWL
So what have I done for the last few days that I haven't blogged? Two things. First, I've been designing an RDF Schema for the Kowari Rules system. The second thing is reading papers. It's still early days on the schema (and the code to read it), but I've made quite a bit of progress on the papers.

I've quoted the work I'm doing for Greg as just the "development" of the OWL inferencing engine. However, it's really a "research and development" project. Greg pointed out that he is aware of that distinction, which I'm pleased about.

The problem with any R&D project, is that it is much harder to estimate time than a simple development project. After all, if you knew everything you'd be doing then it wouldn't be research, would it? :-) Fortunately, I have done the lion's share of the research at this point. I still have to get through a little bit, but at this point it's mostly a justification for what I'm doing, rather than finding out how I'm going to do it.

An important aspect of the "research" component is reading papers. Because I'm concentrating on the "development" part for Greg I'm trying to keep my working hours to mostly development, but I still need to read a little. The research helps drive the implementation, while the implementation helps direct the research.

I've managed to get through several papers in the last few days, and I'm pleased that I have. Andrew's suggestions have paid off again (thanks Andrew). The best one is a paper called Taking I/O Seriously: Resolution Reconsidered for Disk. This paper uses set-at-a-time techniques rather than tuple-at-a-time, and proves that the efficiency is equal. This is important for validating my approach. I also think that there is an equivalency between my planned constraint resolutions in the rules network and the subgoals in their breadth-first approach. Their system is DL based, while my own has broken DL up into rule components, but the important principles appear to be similar.

As for the rules implementation... I've built a heap of RDF as ball-and-stick diagrams so that I can see what I'm doing. I'm missing some important features, but the basic structure is there. I'm still in the process of converting this into parsable RDF-XML (I need more practice at this - I can read it, but my writing isn't so good, particularly when using blank nodes). I'll probably post it here for feedback when I'm done with it.

Scalability
Lunch was early and longer than usual today, as I was kicked out of the ITEE building when the fire alarm went off. So I went out to have lunch with a friend who works in the Institute for Molecular Bioscience. I took advantage of the opportunity to quiz her on the details of DNA and gene research that I've been curious about. It's fascinating how all of these processes break down into simple mechanical operations when viewed at a low enough level. Of course, no one really understand how the complex systems emerge from these simple interactions, but that is the same in any emergent behavioral system.

Once we got into discussing gene sequences this reminded me of the proposed usage of Kowari in the life sciences arena. Kowari scales very well, but it doesn't yet scale that well. That can be fixed, but I'm not sure how we're going about it now. DavidM was going to use a skip list for the next version of the data store (to take advantage of the fact that hard drives are very good at streaming through blocks), but he now has a full time job (for the same company I turned down a job with). He was also going to address some scalability issues while doing this work.

I can certainly do this work instead, but I really need to concentrate on OWL for the moment. Besides, David would probably do a better job. So I'm not sure where the next level of scalability in the datastore will come from.

The other issue for working with life sciences is clustering. This is not a trivial extension, though there were certainly some plans to implement it. Again, I could work on it, but I won't be able to for some time.

The reason I think that scaling like this is important is because life sciences needs a scalable RDF database, and RDF needs to work in a practical area that really needs it like this. Kowari doesn't (yet) have the inferencing features of the other systems out there (like Jena and Sesame), but it certainly scales much better (anyone who is skeptical of this statement has obviously been using the Jena interface).

So to get RDF working at this level, I believe that Kowari is the only system which will do it (for the moment). That means that Kowari needs some funding to make it scale to this level. Can we get it from the life sciences? Maybe, but it will depend on how they view our scaling so far. If we are too short of the mark, then they won't be interested in us. Ideally, we will already scale well enough to be useful, and that will encourage them to fund us to create real scalability (ie. efficient clustering).

I know that DavidW is interested in this. I should talk to him about it.

Indexing
Long time readers will be aware that I've worked quite a bit on the indexing that we use in TKS and Kowari. We originally designed and built this system in 2001. The result was made public when we open-sourced Kowari in 2003. A few months later I discussed several aspects of it on this blog.

I started by talking about some equivalent MySQL indexing, and a few days later I discussed our own indexing in more detail. Of course, this is only an adjunct to what we actually did in Kowari (which is the definitive reference for what we did). Thinking about it again in August, I posted some comments on a more mathematical foundation for indexing in this way, and how it maps into a set of vectors in an N dimensional space (where N=4 for Kowari: with the dimensions of subject, predicate, object and meta).

I also took some of the original index modelling and in August I turned it into the in-memory implementation of JRDF. This indexes the data the same way that Kowari does, only it uses tuples of nodes paired with node sets for speed and efficiency. Kowari expands these tuples from single nodes paired with a set of nodes, to pairs of nodes, with the first node duplicated for each element in the set it was paired to. The effect of this expansion is to make the full data set rectangular, which facilitates efficient storage on disk. However, the indexing theory for the two systems is identical.

My point is that I've thought this problem through quite heavily, along with a little help from my colleagues at Tucana. This indexing is a fundamental feature for the efficiency of the rules engine that I am building, so it will figure prominently in my Masters.

I was searching for papers which would discuss indexing in all directions like we do for Kowari, and Andrew pointed me to Yet Another RDF Store: Perfect Index Structures for Storing Semantic Web Data With Contexts. I hadn't paid attention to him talking about it when he first found it, but I was interested now.

Initially I was not too surprised that someone had duplicated our scheme, as it makes sense (or it did to me when we designed it). I was disappointed that we were not unique in our design, but pleased that I would have someone to cite. However, as I read further I started to become very agitated.

In the introduction, the authors claim that of the other databases in existence, almost all of them, "...are built upon traditional relational database technology." (Kowari was a member of their list). Of course, this does not apply to Kowari at all.

Also, in their "Contributions" section, they claim, "Our paper identifies and combines several techniques in a novel fashion from the database area to arrive at a system with superior efficiency for storing and retrieving RDF."

This would all seem fine, except that they go on to describe an identical indexing scheme to the one used in Kowari. This makes it far from "novel".

There are two differences in their implementation to our own. The first is that they used B-trees rather than the AVL trees that Kowari uses. That is not a significant difference, and I blogged about this several times before. The main advantage of the AVL tree is that they are cheaper to write to the structure than B-trees are, though a little slower to read. They also pointed out that they wanted to use an existing library for the index, and while B-tree libraries are common, we never found a completely working AVL tree library (deletions were always buggy).

The second difference is a set of special statements which count the number of triples in parts of each of the indexes. This is certainly novel, but not an approach I would use. I believe that counting is still very efficient in Kowari (O(log(n)), and the space overhead they incurred would seem prohibitive. More importantly, writing is always the slowest operation, and their system would incur a large writing penalty for using this scheme.

Beyond these two points, the entire indexing scheme is an exact description of Kowari, without the more thorough analysis we undertook. Given that it was submitted to WWW2005 at the end of last year is very poor timing for them.

Perhaps this is a monumental coincidence. However it would indicate particularly poor research on their part if it is, as they explicitly mention Kowari as a part of their paper, and yet fail to acknowledge that it uses the same indexing scheme. They also compared their project to several other listed databases, but claim, "We tried to install Kowari, but failed to get a running version." I'm very surprised at this, as I believe that it is simple to run. I don't believe that they asked the mailing list for advice.

Admittedly, they used the Jena interface on Kowari. I've never used this, but I've worked in the code. Anyone using the Jena interface will definitely see a poorly performing system. They also said that their JVM died on one machine. I can only presume that they were using a Mac, as this is the only JVM I have seen having a problem in recent years.

This hardly matters, because even without a performance comparison, they were certainly aware of the Kowari project.

I've since been told that this paper was rejected due to its similarity with Kowari. I'm pleased to hear it, but it has left me with a realization that I need to publish what I've been doing (all the way back to the indexing work from 2001), or else this could happen again. Besides that, the ITEE department will be very happy with me if I publish something.

Unfortunately, I can't really write papers on Greg's time, so I'll be working lots on weekends and evenings. Oh well, I guess that's what I signed up for when I enrolled at university.

2 comments:

Andrew said...

Is there somthing that you're adding above SWRL in your RDF based rules?

The concrete RDF syntax is:
http://www.daml.org/rules/proposal/rules-all.html#6

Paula said...

I'm definitely building something SWRL like. It would be a good idea to work with SWRL, but unfortunately what I'm proposing takes advantage of several constraint types which are built in to Kowari, with no analogy elsewhere. A good example is the transitive constraint.

Also, SWRL is much more RDF aware than Kowari queries are. This makes it easier to use SWRL, but applies a layer of meaning before it can be applied to Kowari. The result of this, is that it prevents access to some of the underlying Kowari structures which are essential for efficient operation of some of the rules.

However, I should probably try to template my RDF off SWRL as much as possible.