Monday, February 28, 2005

Proof Reading
Recently I decided to skip proof reading these entries. Yes, I know that is annoying (particularly when I change thought mid-sentence and fail to make any sense at all), and leaves me in the realms of the semi-literate. However, I'm usually just using this blog to organise my own thoughts, and I don't normally take the time to re-read the entries.

My main reason for being lazy like this is that the time taken to read these entries can usually be better spent in bed, which is where I really need to be tonight. :-)

At least I run the spell checker! However, I usually use English/Australian spelling, which may have some Americans wondering why I use the letter "s" so much instead of "z", in words such as "realise" and "organise".

Versions
The other day I realised that I had made the same mistake as Simon. I had checked out "kowari" from Sourceforge, instead of kowari-1.1.

I had actually seen both when I moved to using the Sourceforge CVS, after leaving CVS at Tucana. However, I reasoned that the "main" checkout would be "kowari" without modifiers. Apparently I was wrong.

The code I've worked on was not affected by any of this, so it hasn't set me back, but it's created a small hassle. The "kowari" checkout builds and runs fine on my PowerBook, but "kowari-1.1" causes a HotSpot error while unjaring a file in the build process. The exact error is here:

embedded-dist:
[unjar] Expanding: /Users/pag/src/kowari-1.1/lib/xercesImpl.jar into /Users/pag/src/kowari-1.1/obj/xerces
[unjar] Expanding: /Users/pag/src/kowari-1.1/lib/xmlParserAPIs.jar into /Users/pag/src/kowari-1.1/obj/xerces
[unjar] Expanding: /Users/pag/src/kowari-1.1/lib/jsr173_07_api.jar into /Users/pag/src/kowari-1.1/obj/xerces
[unjar] Expanding: /Users/pag/src/kowari-1.1/lib/jsr173_07_ri.jar into /Users/pag/src/kowari-1.1/obj/xerces
#
# HotSpot Virtual Machine Error, Internal Error
# Please report this error at
# http://bugreport.apple.com/
#
# Java VM: Java HotSpot(TM) Client VM (1.4.2-38 mixed mode)
#
# Fatal: java object has a bad address
#
# Error ID: /SourceCache/HotSpot14/HotSpot14-38/src/os_cpu/macosx_ppc/vm/os_macosx_ppc.cpp, 311
#
# Problematic Thread: prio=10 tid=0x00501170 nid=0x1802600 runnable
#
./build.sh: line 125: 2683 Trace/BPT trap ${JAVABIN} ${ARCH} -Xms64m -Xmx192m -Dant.home=${ANT_HOME} -DJAVAC=${JAVAC} -Darch.bits=${ARCH} -Ddir.base=${BASEDIR} -classpath "${CLASSPATH}" org.apache.tools.ant.Main -buildfile "${BUILDFILE}" "$@"

I've reported this to Apple, but obviously it will be a while before this gets fixed. I'd rather see Java 1.5 made available anyway.

This moved me back to running on my Linux server. I need to do this for scalability testing anyway, but it will be a problem if this is the ONLY machine I can run the Kowari tests on. It means that I will need almost permanent internet access with my notebook, which limits my portability. Maybe I should run VirtualPC. :-)

The problem here is that I'm running Java 1.5 on my Linux box. I had Kowari building and testing fine on 1.5 (and still do), but this was a CVS checkout from Tucana. Now that I have the new Sourceforge checkout I don't have all the modifications needed to build and run it with Java 1.5.

Ideally I'd use CVS to find all the changes I'd made from the standard code, and then move those changes into the latest build. However, the Tucana CVS doesn't exist anymore, so I have no easy way of finding those changes. Fortunately, most of the work is in the Xerces classes (which I can copy over), and this blog will provide some hints, but I expect it to take a couple of hours to get it right. It's really annoying.

In the meantime I'm back to using Java 1.4 on the Linux box. Thank goodness the Kowari build script isolates the JAVA_HOME variable from the environment. Most systems can't handle a different version of the binaries existing in the path.

Ontology Validation
I reworked some of the rule ontology, and I'm happy with it all now. I guess I should post it somewhere for other people to look at. :-)

I have some of the code needed to read the data that fits this ontology, but I've decided to create some example rules in RDF to test what I'm doing as I develop the code. That RDF hasn't come a long way yet.

The reason that RDF hasn't come a long way yet is that I spent quite a bit of time validating the OWL before I got to writing it. I started with the validator by Bechhofer and Volz, but it kept telling me that a quote (") had not been closed. I cut down the XML significantly, but still seemed to be getting the problem in the opening rdf:RDF tag. So I thought I'd try another online validator. This took me to the "BBN OWL Validator", but the link to this is broken (the server would not respond). I looked at some other options, but none of them were online, and I had no desire to install a new piece of software just to validate my OWL.

Finally I went back to the plain old RDF validator. This also said that the RDF was in error, but this time it said that the rdf:RDF tag needed to be closed. This gave me enough of a hint to go searching for a tag that wasn't closed properly. Sure enough, I found a couple of opening tags which each ended with a trailing "/". Once these were fixed the RDF validated, so I went back and successfully validated the OWL.

The XML was obviously wrong here, and it would have been a simple matter to find and report the error where it happened. I was disappointed that the XML parsers used by these validators would give such unhelpful messages.

This whole thing makes me think about using an XML editor. I've never used one before, but it might be worthwhile finding out how useful they are.

Meeting
I also met with Bradley Schatz today. Brad is a PhD student at QUT and is using Kowari as infrastructure for his project. Bob knows him from his time at UQ, and recommended him highly.

Brad is very keen on hearing of my progress with OWL, and has offered to help when/if I need any. That reminds me that I need to get to a point where I can offload some of the less involved work to the people who work with Greg. I'd love the boost to the project, but I don't really have anything like that available yet. Still, I'll try and remember it when I find myself in any code that is tedious. :-)

Anyway, I gave Brad a synopsis of my approach. He seemed impressed, and confessed that he thought it was quite novel. While it was inspired by Rete, he really couldn't see a lot of similarity to that algorithm. I do see how the two relate, but often they relate by how they do things differently, rather than how they are the same.

Brad had a couple of ideas, but most of the benefit came from just having to articulate what I'm doing. Putting something into words like that can often crystallise your concept of a design, and describing it several times only reinforces that. Unfortunately, very few people are familiar enough with the background material for me to explain it to them, meaning that my audience is very limited. Maybe that will improve after I've presented my confirmation. I really need to start writing that. It's been easy to put it off with so much to do lately.

Brad also presented some of what he's trying to use Kowari for. He'd like to store some of his own structures in the StringPool, but after looking at it I explained that I think doing so loses many of the benefits of RDF. Still, his user-defined data types are an interesting idea, and certainly achievable (we were hoping to create an interface for exactly that), but they make querying harder. He's going to have a look to see if moving the data structures out into RDF will help at all.

Rules
I also gave a little time to a pre-print by Guido Governatori on using defeasible logic rules on models with an open-world assumption. It cleared up an item of syntax in Description Logic for me, but my real gain was from the definitions it provided of defeasible logic. I'd seen a lot of this before, but on those occasions I had come away with only a general idea, rather than specific knowledge.

The paper raised the question of using defeasible logic with OWL, given the open world assumption of RDF, but I don't think it's applicable. In particular, OWL talks about what you can infer. There are no constructs (that I can think of) that talk about what you might or can't infer.

The real reason I have this paper is to follow his references. I've yet to get to them, but it may be worth going through them (at least superficially) before starting my confirmation report.

There are also a few grammar errors and typos which I should probably feed back to Guido. Can't hurt to help him out, since he was kind enough to show me the paper.

Swimming
The only other thing of note today was a 3km swim. I set the timer on my watch for an hour, and just kept going until the alarm sounded. I went a little over 3km, but I lost count so I don't know exactly how far I went. :-)

I've swum further than this in the past, but this was the first time I've ever swum that time or distance without pause. I'm also a little less fit than usual, after taking time off during that nasty cold. As a result, I'm shattered tonight, and really need to cut this short and get to bed early.

Location
I worked at the house today, the idea being that I'd set up my monitor to use with the notebook, since I'm starting to get neck strain from having to look down at the screen all the time. Somehow I ended up being too busy to do anything about this. This also explains the swim, as I thought it might loosen my neck (it did).

Maybe I should organise a monitor at my desk at UQ. We can always live in hope that I will be able to get some equipment allocated to me.

Saturday, February 26, 2005

Planet Humbug
I was added to Planet Humbug the other day. While it's nice to get some interest in my blog, I'm not sure how appropriate it is for this aggregation site.

As you can see, I wax lyrical about all sorts of things, from pure speculation through to sometimes-detailed descriptions of what I do day-to-day. The original point of this blog was to keep a journal of my work, so I could go back and see what I'd been doing and why. It was also to give a detailed description of my work to the other developers at Tucana. I was doing open source work, so I made the blog public for the hell of it. :-)

(That should explain my surprise when I discovered people outside of Tucana reading it. My first thought was "Why?". I guess I know now, but it still caught me off guard.)

The result is a blog that is generally several hundred words long, most days of the week. That just seems too long for an aggregator that brings in the entire blog!

Maybe I should suggest to Clinton that he restrict the amount of text brought in.

Comments
Not everyone reads comments, and as a general rule there aren't many here, but my last post is a bit of an exception. That will probably be the norm now, as Rob and Andrew don't get to make comments in person any more.

Andrew suggested that I look at Michael Grove's OWL Class Loader for Mindswap. I really like what it does, but I have two comments to make.

The first comment is how it relates to what I'd like to do. It's not quite the same thing, as it relies on the classpath of the current JVM in order to find the class files. That's great, but I'm actually thinking about a system that stores a blob with the classfile image. The classloader then loads the class out of the database for instantiation. Why do this instead of relying on the filesystem? Well, it is far more dynamic. It allows new classes to be brought in at runtime. This means that classes could be transferred from a remote location, or generated on-the-fly.
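To make that concrete, here is a minimal sketch of the sort of classloader I have in mind. It isn't tied to Kowari at all: the blobs map below is just a stand-in for whatever datastore lookup would return the classfile image, and the rest is the standard defineClass() mechanism.

import java.util.HashMap;
import java.util.Map;

// A classloader that resolves classes from in-memory "blobs" rather than the
// filesystem. The blobs map stands in for a lookup against the datastore.
public class BlobClassLoader extends ClassLoader {

  private final Map<String, byte[]> blobs = new HashMap<String, byte[]>();

  public BlobClassLoader(ClassLoader parent) {
    super(parent);
  }

  // Register a classfile image under its fully qualified class name.
  public void addClass(String className, byte[] classFileImage) {
    blobs.put(className, classFileImage);
  }

  // Called by loadClass() when the parent loader cannot find the class.
  protected Class<?> findClass(String name) throws ClassNotFoundException {
    byte[] image = blobs.get(name);
    if (image == null) {
      throw new ClassNotFoundException(name);
    }
    // Turn the raw bytes into a live Class object.
    return defineClass(name, image, 0, image.length);
  }
}

Creating an object is then just loader.loadClass("com.example.Widget").newInstance(), where the bytes came out of the store as a blob (the class name here is obviously made up).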

I like the idea of generating a class on the fly. That way a class could be described in OWL (or UML) and then built internally. It could then be stored as a blob until needed.

Now if only Sun had chosen to make those compiler classes available in Java 1.5. I read an interview a couple of years ago where James Gosling was talking about making available those classes which model the language tree, along with methods for emitting the corresponding binaries. Obviously, that never happened.

Maybe I can find something like it in an open source compiler implementation, like Kaffe or Jikes. On the other hand, JavaCC probably does exactly what I want, but I'll have to check. It compiles from text to binary, but I want to be able to create and edit the parse tree directly, with no language involved. Of course there are several compiler projects which will do this, but most of them don't give public access to the internal tree.

TBox and ABox
The second comment I have about the OWL Class Loader has very little to say on practicalities. That is to say, the project works exactly as advertised, and meets its objectives perfectly, but I disagree with the structure.

As it stands, the RDF does not appear to be "OWL" at all; it is instance data instead. It is not entirely bereft of ontology data, but the two are mixed together. Take the following example:

<cl:Object rdf:ID="myHash">
  <cl:hasClass>java.util.Hashtable</cl:hasClass>
  <cl:hasConstructor>
    <cl:Constructor>
      <cl:parameterList rdf:parseType="daml:collection">
        <cl:Parameter>
          <cl:hasClass>int</cl:hasClass>
          <cl:withValue>20</cl:withValue>
        </cl:Parameter>
        <cl:Parameter>
          <cl:hasClass>float</cl:hasClass>
          <cl:withValue>1.2</cl:withValue>
        </cl:Parameter>
      </cl:parameterList>
    </cl:Constructor>
  </cl:hasConstructor>
</cl:Object>
From an ontology perspective (TBox), this code describes a class, with a fully qualified name of java.util.Hashtable, and a constructor which accepts two parameters (an int and a float). However, there is instance data mixed in here too, such that a single instance of this class can be constructed. In fact, the whole structure doesn't seem to describe the class so much as an instance of it.

From a data perspective (ABox), this code simply references a class that was defined elsewhere, and creates an instance using one of the constructors (the one that takes an int and a float). Rather than using the predicates hasConstructor and hasClass to define the class, it uses these predicates to uniquely identify a class that was defined elsewhere (in the class library).

The giveaway here is when you ask the question, "What is the syntax to define a second instance?" This code all needs to be repeated. This would not happen in an ontology.

So why go on about this? Well I think the name is inappropriate. It really looks more like an RDF Class Loader instead of an OWL Class Loader. The documentation keeps referring to OWL code, when it is really RDF code.

Perhaps in the sort of systems that Michael is working with this data really does dictate behaviour, so describing class construction like this could sit at a slightly higher meta-level than usual instance data, but it isn't all the way up at the OWL level. Maybe he needs the MOF. :-)

Friday, February 25, 2005

Update
So where have I been, and what have I been doing? Why have I not blogged for the past few days? Have I been accomplishing anything? The answer to the final question is, "Yes." The others require explanation...

Over the last few days I've had various evening engagements. Last night was the opening of the French Film Festival, along with a party. I'd provide a link, only the site contains a frameset with navigation via cookies. Very messy, and it can't be consistently deep-linked to. Try Palace Cinemas and click on "Festivals and Events" if interested.

Unfortunately, this clashed with the opening night of Flickerfest, which I'd rather have seen (we had tickets with friends for the French film, so we had to forgo Flickerfest). We have Flickerfest passes, so at least I'm getting to see the remainder. Tonight's screening was really enjoyable. I particularly enjoyed "7:35 De La Manana", which was a romantic song and dance number, performed by a suicide bomber and his captives. You'd have to see it to get it. :-)

Other nights this week were just a matter of falling asleep early, due to exhaustion.

The reason this impacts my blog is that I write at night. I had no ability to post during the day, due to the bug I described where I log into other people's accounts while on campus. Even if I wrote during the day, it would be difficult to post at night, simply because I haven't been turning the computer on after hours.

Rule Structure
After a few days of experimenting I now have a first draft of an OWL description of the rule data. Now if only I had an OWL engine to test the validity of the RDF I write to this specification. :-)

I'm reasonably happy with this draft, but I already want to make some changes.

I started out by building a structure in a ball and stick diagram to represent a simple rule. Unfortunately, I found that this approach has problems. While a ball and stick diagram is effective for illustrating the main elements of a graph, it quickly gets out of control when types and inheritance are included. Consequently, I stuck to the main points, and used RDFS/OWL to flesh it out when I came to write the RDF-XML.

In fact, I realise now that I only really like to use ball and stick diagrams for ABox data. TBox data may look fine in UML, but in RDF it isn't so pretty. The biggest problem is that it gets merged in with the ABox data. After all, RDF is good for merging data in this way. But having all the data in one place makes the diagram unreadable.

Structure
The biggest problem I had with this design was how to structure things. By this I mean decisions like where to use blank nodes, where collections are appropriate, and so on.

I had first thought that I would be navigating my way through the RDF programmatically. For this reason I thought that the JRDF interface might be the way to go. However, after letting myself be fooled like this for a while, I started realising just how many queries would be required for the simplest parts of the graph. I felt like a fool, considering that I am using these rules to perform set-at-a-time operations on the very same database. All I need to do is use iTQL to do all the work in one fell swoop.

To help here I can do a couple of things to make the queries easier. The first is to avoid collections where possible. The next is to be careful about the choice of blank nodes.

The only problem I have is that a variable list from a select clause must be ordered. This means that I have to use a sequence, meaning the use of the rdf:_ predicates. I don't believe that DavidM got around to prefix matching in the stringpool, so there is no way to easily query this data.

There are several approaches I can think of here. The most obvious is to use the programmatic approach. This works, but is a horrible way to go. I'll only use it as a last resort.

The next idea is to implement prefix matching myself, including a syntax in iTQL. I know how to go about this, although the syntax would probably become a debating point, like it always does. The main problem is finding the time for this.

The last idea is to build a resolver that can find these values. It may be a decent halfway measure, though it may include some hacks. For instance, it could return a tuples of rdf:_0, ... rdf:_100 to match against sequences of fewer than 100 elements. A terrible idea in the general case, but fine when you know that you will never have more than half a dozen values.

Before I commit to anything I will look in the string pool and consider what I need to do for prefix matching. I can then work from there.
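To remind myself what prefix matching actually involves, here is a rough sketch. It pretends the string pool is nothing more than a sorted list of strings, and turns a prefix (such as the rdf:_ namespace) into a contiguous range using two binary searches. The real string pool is an on-disk AVL tree, so the mechanics would differ, but the principle of converting a prefix into a range scan is the same.

import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class PrefixMatch {

  // Index of the first entry >= key in a sorted list.
  private static int lowerBound(List<String> sorted, String key) {
    int pos = Collections.binarySearch(sorted, key);
    return (pos >= 0) ? pos : -(pos + 1);
  }

  // All entries that start with the given prefix, as a contiguous range.
  public static List<String> matchPrefix(List<String> sorted, String prefix) {
    int start = lowerBound(sorted, prefix);
    // '\uffff' sorts after any realistic continuation of the prefix.
    int end = lowerBound(sorted, prefix + '\uffff');
    return sorted.subList(start, end);
  }

  public static void main(String[] args) {
    List<String> pool = Arrays.asList(
        "http://example.org/foo",
        "http://www.w3.org/1999/02/22-rdf-syntax-ns#_1",
        "http://www.w3.org/1999/02/22-rdf-syntax-ns#_2",
        "http://www.w3.org/1999/02/22-rdf-syntax-ns#type");
    // Prints the two rdf:_n entries.
    System.out.println(
        matchPrefix(pool, "http://www.w3.org/1999/02/22-rdf-syntax-ns#_"));
  }
}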

There are still some corners of the structure where I'm trying to determine programmatic control versus iTQL. The best example of this is the selection of RDF structure versus an iTQL string. Since I allow both, I need code which can handle either. So do I find all rules together and sort them out as I iterate over them, or do I select them separately? This decision has an impact on the structure, so I have to consider it. At this point I think I'll be building them separately.

All in all, I'm relatively happy with the ontology. While it's not "diagrammatic" it still provides me with a solid template to build the RDF rules with. Since every rule is structured uniquely, it gets tempting to fiddle with the structure to suit the occasion, but an ontology like this prevents me from getting off track like that. It will also make me carefully consider the structure any time I decide that a change really is necessary.

Constraints
One thing I hadn't expected is that I got to rename a few classes. While the RDF will need to map into the Kowari query system, the names I give the RDF nodes need have no bearing on the Kowari class structure (although it tries to reflect it). As a result, I was able to name the constraint classes in a way that I'm happier with.

In particular, I'm using the name Constraint rather than ConstraintOperation. In the same way, I'm using SimpleConstraint instead of ConstraintImpl. ConstraintConjunction and ConstraintDisjunction are both subclassed from ConstraintOperation which in turn is a subclass of Constraint. Very similar to the java code, but just enough different that I feel more comfortable about it.

OCL
At the request of a friend, I'm spending a bit of time out of hours learning and working with OCL. I haven't learnt all that much yet, but it makes sense so far. I'll certainly be picking up more UML as I go!

Working with OCL, and considering the MOF, I'm starting to look for these layers in OWL and RDF. Sometimes OWL seems really restrictive since so many meta-layers have been collapsed into one.

Object Database
I had a weird thought the other day. I imagined a class ontology stored in RDF, including class images stored as literals. A ClassLoader implementation could then create instances of a class directly from the datastore, as opposed to using the file system for the classpath.

The examples I can come up with to use this seem pretty contrived, but it seems useful nonetheless, particularly when bringing in new (trusted) RDF with class definitions at runtime. The nice thing is that the class definitions could come with their own ontologies. (Is there a standard for serialising UML into RDF? Could OWL do the job? Where would OCL fit in?) Does anyone have any ideas on this? Do I just want to do it because I can? (Too much software suffers from this.)

The only problem I have is that I don't think we can store an arbitrary blob in Kowari. The string pool will take it, but I don't know if the interfaces will (I'm pretty sure they don't). Of course, I can always uuencode a binary, but putting the blob in directly would be better. Maybe I need to put a new datatype into the string pool.

Weekend
Well it's late, and I want to ride in the morning, so I'll leave it here. I plan on working a bit this weekend, so I'll see how far I get.

Aside from work, I also need to start writing the confirmation on my weekends. I was putting this off until I'd finished reading some more papers, but I seem to be constantly accumulating literature, so at some point I'm going to have to stop. Maybe I should draw the line where I am now.

Wednesday, February 23, 2005

Too Busy for Blogging
There are a few things I want to talk about tonight, but I have to confess that I'm wrecked. I haven't exercised much in the last month, and not at all when I was sick. Now that I've run 8km two days in a row, it's starting to catch up with me. Yeah, I'm a wimp.

I'll do a blog write up tomorrow. But not from the university, since Blogger has acknowledged the problem I described the other day, and they don't have a workaround yet.

In the meantime, I've spent my evening getting my department home page written. It rambles a bit (and I haven't proof-read it), but it gives some background to my project.

Monday, February 21, 2005

To Collect, or not to Collect?
I've decided to come down in the middle and implement the full query structure in RDF as well as provision to do everything in iTQL. That way it will be possible to pick up subqueries and other more complex structures without having to deal with them up front, but I will get the full functionality I require in 90% of cases with the complete RDF.

Now that I've determined that, I'm left in a quandary about using RDF collections. I really dislike collections. The whole rdf:_n sequence structure is unpleasant, but the rdf:first and rdf:rest cons list structures are a real pain. It's no big deal when described in RDF/XML, but in N3 the whole structure becomes painfully obvious. They make sense when it is possible to duplicate an entry in a list, but other than that, they require the complexity of a trans to find all the elements, and that should not be necessary.

As a result, I don't want to use collections when describing groups of objects. For instance, a conjunction takes several different constraint arguments. While it is possible to describe this with a collection, it is just as easy to keep re-using a predicate like hasArgument. This makes selection of all the arguments of a conjunction as simple as a constraint which says:

  [ <domain:myConjunction> <kowari:hasArg> $c ]
So what is the problem? Why don't I just use collections? Well just take a look at OWL. The range of every predicate that can refer to a group of objects is a list. Since I was talking about conjunctions a moment ago, let's look at a logically analogous predicate, owl:intersectionOf:
  <rdf:Property rdf:ID="intersectionOf">
    <rdfs:label>intersectionOf</rdfs:label>
    <rdfs:domain rdf:resource="#Class"/>
    <rdfs:range rdf:resource="&rdf;List"/>
  </rdf:Property>
Obviously I should be trying for consistency, if not standards conformance (which I'm already doing poorly at, since I'm not using SWRL). So if systems like OWL use collections all the time, then shouldn't I? It would make the rules engine code more convoluted to write, and would take longer to run, but the overhead is a one-off expense, so it shouldn't be too big a deal.

In fact, I'd started using collections already. I'd been away from Kowari for too long over Christmas (and Hawai'i) and since I was back into RDF/XML I'd forgotten that I don't like collections when stored as triples. Fortunately Andrae asked me about it, which made me pause to give it some thought.

At least now I've thought about it, regardless of the direction I choose. It is almost better to choose a poor direction after carefully considering the reasons than just stumbling over the correct one.

In the end I decided to go with the non-collection approach (re-doing some of my previous work). It's cleaner, and as I've already said, when you don't need collections then I don't think they should be used, simply due to the poor use of predicates that they often entail.

Object Creation Order
I've been looking at the algorithm for object creation in the rule set. My initial instinct was to follow through the network and build things as I came to them. However, this would be fatally flawed.

The rule network contains many loops and interdependencies. If I only built objects as I came to them, then it would be impossible to set up all of the "triggers" links. Obviously I need to find all inter-linked objects, build them, and only then can I link all of the triggers together.

After considering this carefully, I started to think about the whole linking and triggering scheme. I remember that RDFS entailment rules link almost every rule to trigger almost every other rule. Since the rules will avoid execution if the constraints have nothing to add, it almost makes sense to trigger everything all of the time. This would make writing rules much easier. For instance, I would not need to work out which OWL rules trigger which other OWL or RDFS rules.

The reason to avoid this is efficiency. The whole point of the network is to prevent unnecessary execution of each rule. If every rule could potentially trigger every other rule, then the only possible efficiency to be gained would be in constraint evaluation indicating that a rule was not needed. While this would work, there is some expense in doing constraint tests. So having rules trigger each other is still valuable.
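To pin down what I mean by triggering and semi-naïve evaluation, this is the shape of the scheduling loop as I currently picture it. The Rule and Constraint interfaces here are placeholders of my own, not Kowari classes: a rule only runs when at least one of its constraints has grown since the rule last ran, and a rule that inserts something queues the rules it triggers.

import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.Map;
import java.util.Set;

public class RuleScheduler {

  // Placeholder for a single constraint: all the scheduler needs is its size.
  public interface Constraint { long resultCount(); }

  // Placeholder for a rule: its constraints, the rules it triggers, and an
  // insert step that writes any newly inferred statements to the database.
  public interface Rule {
    Iterable<Constraint> constraints();
    Iterable<Rule> triggers();
    void insertInferences();
  }

  // Constraint counts seen the last time each rule ran (the semi-naive test).
  private final Map<Rule, Map<Constraint, Long>> lastCounts =
      new HashMap<Rule, Map<Constraint, Long>>();

  // A rule has new work only if one of its constraints has grown.
  private boolean hasNewData(Rule rule) {
    Map<Constraint, Long> seen = lastCounts.get(rule);
    if (seen == null) {
      seen = new HashMap<Constraint, Long>();
      lastCounts.put(rule, seen);
    }
    boolean grown = false;
    for (Constraint c : rule.constraints()) {
      long count = c.resultCount();
      Long previous = seen.put(c, Long.valueOf(count));
      if (previous == null || count > previous.longValue()) {
        grown = true;
      }
    }
    return grown;
  }

  public void run(Iterable<Rule> allRules) {
    LinkedList<Rule> queue = new LinkedList<Rule>();
    Set<Rule> queued = new HashSet<Rule>();
    for (Rule r : allRules) {
      queue.add(r);
      queued.add(r);
    }
    while (!queue.isEmpty()) {
      Rule rule = queue.removeFirst();
      queued.remove(rule);
      if (!hasNewData(rule)) continue;  // nothing new, so skip this rule
      rule.insertInferences();          // set-at-a-time insert
      for (Rule next : rule.triggers()) {
        if (queued.add(next)) queue.add(next);
      }
    }
  }
}

Because the OWL and RDFS rules only ever add statements, the constraint counts are monotonic, so the queue has to drain eventually.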

On the other hand, this consideration has me thinking that it may be useful to have a rule type which triggers all other rules. There are several rules which do this, and it would be useful to describe that in one fell swoop, rather than with dozens of "triggers" statements. It will certainly make the RDF for those nodes simpler and easier to read and write.

Constraint Counting
Triggering and execution of all rules is controlled by semi-naïve evaluation on individual constraint nodes (which are effectively rule tests in the parlance of a rules engine). This means that we need fast counting of a constraint's results.

While counting is done with a significant speed multiplier (~180 times, on average), it is still done with linear complexity. DavidM was going to add subnode counts to the AVL tree to change this to O(log(n)), but he put it off since it was not going to be necessary with the new XA2 datastore.

Now that XA2 is not happening (at least, no time soon), I may need to factor in some time to do this. After all, a system can't scale if it does anything with linear complexity, no matter how fast the multiplier is (fast multipliers just hide the speed problems for small sets).
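For my own notes, the change DavidM had in mind is the standard order-statistic trick: cache the size of each subtree in its node, and counting all entries below a given key becomes a single root-to-leaf walk. The sketch below uses a plain (unbalanced) binary tree purely to show the counting logic; the real work would be adding and maintaining that size field in Kowari's on-disk AVL nodes.

// Each node caches the size of its subtree, so counting the entries less
// than a key is O(height) -- O(log(n)) once the tree is kept balanced.
public class CountingTree {

  private static class Node {
    final long key;
    Node left, right;
    long size = 1;  // number of entries in the subtree rooted here
    Node(long key) { this.key = key; }
  }

  private Node root;

  private static long size(Node n) { return (n == null) ? 0 : n.size; }

  public void insert(long key) {
    root = insert(root, key);
  }

  private Node insert(Node n, long key) {
    if (n == null) return new Node(key);
    if (key < n.key) n.left = insert(n.left, key);
    else if (key > n.key) n.right = insert(n.right, key);
    n.size = 1 + size(n.left) + size(n.right);  // maintain the cached size
    return n;                                   // (rebalancing omitted)
  }

  // Number of entries strictly less than key: one walk down the tree.
  public long countLessThan(long key) {
    long count = 0;
    Node n = root;
    while (n != null) {
      if (key <= n.key) {
        n = n.left;
      } else {
        count += size(n.left) + 1;  // everything left of n, plus n itself
        n = n.right;
      }
    }
    return count;
  }
}

A range count is then just the difference of two such walks, which is exactly what fast constraint counting needs.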

Confidence
While looking for some OWL documentation today I saw the current list of OWL Reasoners (look for "Reasoners" on the main OWL page). I have to say, it does shake my confidence sometimes. There are some great systems in there, so why should I think I can do it any better?

At the very least, I know that Kowari needs a reasoner of its own. Kowari scales better than other RDF stores, so it is important that it has a reasoner that can scale with it. Also, Kowari works with set-at-a-time operations, and I don't believe that any of the listed reasoners do this (though there are one or two that I don't really know... I should look them up). Hence, using a Prolog or DL system is not going to reason very well for Kowari at all. They would be slow for a start, but more importantly, most DL and Prolog systems use main memory, which means that they can't scale. Those which do use disk use their own data structures, meaning that data from Kowari would have to be duplicated.

So while I may not end up building the fastest engine available, I'll build the fastest engine that could be used on Kowari. Given the scalability of Kowari, I feel that this is still worthwhile.

Hey, if all else fails, at least I can stick on an OWL reasoner that is known to work! :-)

Thesis Structure
I finally sorted out a broad outline of what my thesis should look like. I still haven't done my confirmation, so I know I'm getting ahead of myself, but I'm pleased to have a plan anyway.

It should break down into the following:
  1. Introduction (obviously).
  2. Literature review: Along with the introduction, this will be built partly from my confirmation paper.
  3. RDF Indexing: This will describe the structure behind the indexes, starting with the mapping to tuples of nodes/sets and how this maps into the rectangular structures that Kowari uses. The evolution to 4-tuples along with the subsequent discoveries on the number of required indexes will be included here. This describes why we moved to 6 indexes, and how to expand the system if more than 4-tuples are needed. Also describe the implementation, with the writing advantages of using AVL trees over B-trees. As I mentioned the other day, this is all new work and I designed and analysed most of it, with some help from others at Tucana. I had some early concern over the relationship of this work to the subsequent OWL work, but since the subsequent sections rely heavily on the effectiveness of the indexing, it is worth going into.
  4. Transitivity: Describes the algorithms and structures used for the trans keyword. This appears to mirror the DISK_TC algorithm of Hirvisalo, Nuutila, and Soisalon-Soininen. The interesting thing here is that the choice of the index structure along with the on-disk tuples implementation (thanks Andrae!) leads naturally to efficient implementations like this.
  5. Set-at-a-time Rules: This will describe the set-at-a-time rules engine I am building. It borrows some ideas from Rete, along with several other features. Again, the choice of the indexes leads to efficient constraint resolution, allowing set operations and semi-naïve evaluation to really pay off here. It also appears that this operation may be analogous to the set-at-a-time XSB engine. A proof of equivalence would lend legitimacy to the technique, as well as prove that it is as efficient as a Prolog engine. Again, it demonstrates how the original index selection has led us naturally into an efficient implementation.
  6. OWL, and DL to iTQL Mapping: A formal description of how Description Logic maps into iTQL. When taken along with the OWL to DL mappings, this gives legitimacy to the operations which are done to implement OWL DL.
  7. OWL Full in iTQL: Introduce the extra features of iTQL which let it go beyond description logic, such as full cardinality. Also introduce the OWL Resolver, which does the OWL description work in a transparent way.
  8. OWL Implementation: Present the full OWL rule set as RDF.
  9. Conclusions.


The nice thing about this is that I should be able to get a few papers out of it. Nothing too fancy, but having read a lot of papers recently, I think the ideas are good enough to be published. I'm inclined to create separate papers for items 3, 4, and 5, while items 6, 7 and 8 would go together as an OWL/DL/iTQL paper.

While I'm here, I should note that I want to carefully compare our algorithm to that of DISK_TC, to see if there is anything that I've missed that would be useful. However, I believe that the basic structure is the same, meaning that any required modifications would just be a set of tweaks.

Similarly, I need to build a proof that the XSB engine is in fact equivalent to my rules engine design. Again, this proof may help me spot any shortcomings in the design.

So it's all coming together. The following is not all in order, but now I just need to:
  1. Write the code.
  2. Write my confirmation.
  3. Write the papers.
  4. Do the proofs.
  5. Write the thesis.
Easy, huh? :-)

Frying Neurons
Thursday/Friday proceeded at a similar pace to the rest of the week. I read more papers (re-reading one or two that I went through a few months ago), and worked on the rule structure. By the time I got to the weekend my brain felt fried. It's more difficult thinking that hard than I thought. I'm looking forward to writing some real code, just to clear my mind a bit.

I was also fortunate enough to spend a little time with Andrae to talk a few things over.

Rule Structure
At this point I'm considering making the rule RDF a little simpler, as an interim step. The idea is that I insert an iTQL statement rather than the full syntax tree in RDF. There are a few reasons for this.

I had initially thought to build the rule system in a similar way to the earlier implementation in DROOLS, where complete iTQL statements were string literals attached to each rule node. I had opted away from that for three reasons. The first reason is that individual constraint objects will need to be held by the system, and it is easier to get their structure from an RDF graph than from parsing iTQL (especially since it is already inside an RDF graph). The second reason is that it will be possible to re-use individual constraints across several different rules. The third reason is a little less important: the mathematical modelling of this system is based on matching "rule conditions", which means the "constraints". So having the constraints described explicitly like this makes it easier to match the implementation to the theory.

Now that I'm considering the implementation more carefully, I'm reconsidering my position. This started when Andrae wanted to know why I hadn't just stayed with using iTQL strings in the graph. While explaining the constraint re-use, it occurred to me that it won't be a simple task to do this. The rules about when a constraint can be re-used will be very strict and limiting. I'm starting to think that an implementation without constraint re-use may be more expedient, and I can add this efficiency in later. Besides, constraint resolution is fast in Kowari. That's the point.

Without constraint re-use, the need for individual constraint node management is greatly reduced. With that in mind, it may be better to go with the easier path and use iTQL for the moment. The code to build a query object would be modularised effectively enough that there should be no duplication of work.

I don't quite need to make a decision yet. I'll keep building the structure, and see what appears more effective as I get close to completion.

Blogger Bug
I've had trouble posting my Blogger messages recently. I've tried to put a couple of them up while I've been at the university, but there are some significant login problems.

When I go to Blogger, my browser cookies take me straight to my list of blogs (I only have one). However, I've been getting OTHER people's blogs instead! Last week it was a medical student. This week it's a girl in Arts. I can edit the profile and everything, only when I try, it takes me to edit the profile of yet another student! There is something seriously wrong here.

I've tried to log in again, but with no effect. I also tried logging out and back in, but now I'm in a continuous loop of having to enter my name and password (the password is correct... if I give a wrong password it tells me so).

I won't be able to post anything until I'm on another network. In the meantime I'm writing this in a text editor.

Wednesday, February 16, 2005

Cold
When you say that you have a "cold" people tend not to think much of it. After all, they're "common", right? Most people I know say that they have the "flu" instead, because everyone knows that the flu is worse. :-)

Well, Luc brought home a cold from daycare. Yes, it's a cold, but it's still a nasty little virus. Luc got over it pretty fast, but Anne and I are still suffering. Unfortunately for me, I went on an 80km bike ride on Sunday morning (wondering why I was struggling so much), which seemed to depress my immune system just in time for the virus. I ought to know better than to do heavy exercise when Luc is sick.

Now that I'm into the fourth day I thought that I might wake up feeling better, but instead I felt worse. I was just starting to feel sorry for myself when I realised that I feel worse because all of my joints are aching. That's when I remembered how the immune system works... and aching joints mean that it's kicked in and doing its job! Yay! That should mean that I'll feel better soon. :-)

Anyway, for those who were wondering, that's why I haven't been blogging. I'd rather get to bed early when I feel this way, rather than blog. Since I'm here today, I'd better preemptively explain that my head feels "muggy", so I apologise in advance for anything here that doesn't make sense.

Work
So what was I doing for the last few days? Well I've kind of started work again.

For anyone reading the comments on my blog, you may have seen a message from someone called GregM suggesting that he may have some consulting work for me. That seems to have panned out, and I'm now spending the next few months doing just that. Greg seems to want exactly what I'm trying to build as my Masters project, and so he is funding me to continue working on that.

I'm more grateful than he knows, because I've been wanting to concentrate on this for some time. At Tucana we had a lot of commercial pressures which kept pulling me away from doing this work. In fact, much of it had to happen in my spare time, even though the company wanted it implemented.

Now that Tucana is no more, I expected to find a completely unrelated job, and work on this stuff in my spare time. I was going to do that, but it would have been frustrating.

So I'm pretty excited that I have the chance to concentrate on OWL full time for a while. Who knows? Maybe one day I'll even get a full time job doing this stuff. (I'm really keen to work overseas for several years, so hopefully one day I'll find someone who wants someone who knows about semantic web technologies.)

Anyway, based on the promise of payment from Greg, I've gone and turned down two other jobs this week (it's a little scary turning down full time employment for a job that looks like fun when you're getting as broke as Anne and I are). I know that payment is out of Greg's hands now, so Anne and I are hoping that we'll see something soon.

As a final word on the consulting... Greg reads this blog, so I just want to say thank you. :-)

Mastering OWL
So what have I done for the last few days that I haven't blogged? Two things. First, I've been designing an RDF Schema for the Kowari Rules system. The second thing is reading papers. It's still early days on the schema (and the code to read it), but I've made quite a bit of progress on the papers.

I've quoted the work I'm doing for Greg as just the "development" of the OWL inferencing engine. However, it's really a "research and development" project. Greg pointed out that he is aware of that distinction, which I'm pleased about.

The problem with any R&D project, is that it is much harder to estimate time than a simple development project. After all, if you knew everything you'd be doing then it wouldn't be research, would it? :-) Fortunately, I have done the lion's share of the research at this point. I still have to get through a little bit, but at this point it's mostly a justification for what I'm doing, rather than finding out how I'm going to do it.

An important aspect of the "research" component is reading papers. Because I'm concentrating on the "development" part for Greg I'm trying to keep my working hours to mostly development, but I still need to read a little. The research helps drive the implementation, while the implementation helps direct the research.

I've managed to get through several papers in the last few days, and I'm pleased that I have. Andrew's suggestions have paid off again (thanks Andrew). The best one is a paper called Taking I/O Seriously: Resolution Reconsidered for Disk. This paper uses set-at-a-time techniques rather than tuple-at-a-time, and proves that the efficiency is equal. This is important for validating my approach. I also think that there is an equivalency between my planned constraint resolutions in the rules network and the subgoals in their breadth-first approach. Their system is DL based, while my own has broken DL up into rule components, but the important principles appear to be similar.

As for the rules implementation... I've built a heap of RDF as ball-and-stick diagrams so that I can see what I'm doing. I'm missing some important features, but the basic structure is there. I'm still in the process of converting this into parsable RDF-XML (I need more practice at this - I can read it, but my writing isn't so good, particularly when using blank nodes). I'll probably post it here for feedback when I'm done with it.

Scalability
Lunch was early and longer than usual today, as I was kicked out of the ITEE building when the fire alarm went off. So I went out to have lunch with a friend who works in the Institute for Molecular Bioscience. I took advantage of the opportunity to quiz her on the details of DNA and gene research that I've been curious about. It's fascinating how all of these processes break down into simple mechanical operations when viewed at a low enough level. Of course, no one really understands how the complex systems emerge from these simple interactions, but that is the same in any emergent behavioural system.

Once we got into discussing gene sequences this reminded me of the proposed usage of Kowari in the life sciences arena. Kowari scales very well, but it doesn't yet scale that well. That can be fixed, but I'm not sure how we're going about it now. DavidM was going to use a skip list for the next version of the data store (to take advantage of the fact that hard drives are very good at streaming through blocks), but he now has a full time job (for the same company I turned down a job with). He was also going to address some scalability issues while doing this work.

I can certainly do this work instead, but I really need to concentrate on OWL for the moment. Besides, David would probably do a better job. So I'm not sure where the next level of scalability in the datastore will come from.

The other issue for working with life sciences is clustering. This is not a trivial extension, though there were certainly some plans to implement it. Again, I could work on it, but I won't be able to for some time.

The reason I think that scaling like this is important is that the life sciences need a scalable RDF database, and RDF needs to work in a practical area that really needs it. Kowari doesn't (yet) have the inferencing features of the other systems out there (like Jena and Sesame), but it certainly scales much better (anyone who is sceptical of this statement has obviously been using the Jena interface).

So to get RDF working at this level, I believe that Kowari is the only system which will do it (for the moment). That means that Kowari needs some funding to make it scale to this level. Can we get it from the life sciences? Maybe, but it will depend on how they view our scaling so far. If we are too short of the mark, then they won't be interested in us. Ideally, we will already scale well enough to be useful, and that will encourage them to fund us to create real scalability (ie. efficient clustering).

I know that DavidW is interested in this. I should talk to him about it.

Indexing
Long time readers will be aware that I've worked quite a bit on the indexing that we use in TKS and Kowari. We originally designed and built this system in 2001. The result was made public when we open-sourced Kowari in 2003. A few months later I discussed several aspects of it on this blog.

I started by talking about some equivalent MySQL indexing, and a few days later I discussed our own indexing in more detail. Of course, this is only an adjunct to what we actually did in Kowari (which is the definitive reference for what we did). Thinking about it again in August, I posted some comments on a more mathematical foundation for indexing in this way, and how it maps into a set of vectors in an N dimensional space (where N=4 for Kowari: with the dimensions of subject, predicate, object and meta).

I also took some of the original index modelling and in August I turned it into the in-memory implementation of JRDF. This indexes the data the same way that Kowari does, only it uses tuples of nodes paired with node sets for speed and efficiency. Kowari expands these tuples from single nodes paired with a set of nodes, to pairs of nodes, with the first node duplicated for each element in the set it was paired to. The effect of this expansion is to make the full data set rectangular, which facilitates efficient storage on disk. However, the indexing theory for the two systems is identical.
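For anyone who hasn't seen it, the in-memory structure amounts to a nested map per ordering. The fragment below is a simplified sketch (not the actual JRDFmem code, and it ignores the fourth "meta" node): each triple is inserted under the three cyclic orderings, so any bound node or bound pair can be looked up directly.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Three-way in-memory triple index: node -> node -> set of nodes, repeated
// for the orderings SPO, POS and OSP. Nodes are just numbers (gNodes),
// since the string pool maps URIs and literals to numeric identifiers.
public class TripleIndex {

  private final Map<Long, Map<Long, Set<Long>>> spo = newIndex();
  private final Map<Long, Map<Long, Set<Long>>> pos = newIndex();
  private final Map<Long, Map<Long, Set<Long>>> osp = newIndex();

  private static Map<Long, Map<Long, Set<Long>>> newIndex() {
    return new HashMap<Long, Map<Long, Set<Long>>>();
  }

  public void add(long s, long p, long o) {
    put(spo, s, p, o);
    put(pos, p, o, s);
    put(osp, o, s, p);
  }

  private static void put(Map<Long, Map<Long, Set<Long>>> index,
      long a, long b, long c) {
    Map<Long, Set<Long>> second = index.get(Long.valueOf(a));
    if (second == null) {
      second = new HashMap<Long, Set<Long>>();
      index.put(Long.valueOf(a), second);
    }
    Set<Long> third = second.get(Long.valueOf(b));
    if (third == null) {
      third = new HashSet<Long>();
      second.put(Long.valueOf(b), third);
    }
    third.add(Long.valueOf(c));
  }

  // Example lookup: all objects for a given subject and predicate.
  public Set<Long> objects(long s, long p) {
    Map<Long, Set<Long>> byPredicate = spo.get(Long.valueOf(s));
    if (byPredicate == null) return new HashSet<Long>();
    Set<Long> result = byPredicate.get(Long.valueOf(p));
    return (result == null) ? new HashSet<Long>() : result;
  }
}

Flattening one of these maps for disk just means writing every (a, b, c) path out as a sorted row, duplicating the first two values as needed, which gives the rectangular form described above.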

My point is that I've thought this problem through quite heavily, along with a little help from my colleagues at Tucana. This indexing is a fundamental feature for the efficiency of the rules engine that I am building, so it will figure prominently in my Masters.

I was searching for papers which would discuss indexing in all directions like we do for Kowari, and Andrew pointed me to Yet Another RDF Store: Perfect Index Structures for Storing Semantic Web Data With Contexts. I hadn't paid attention to him talking about it when he first found it, but I was interested now.

Initially I was not too surprised that someone had duplicated our scheme, as it makes sense (or it did to me when we designed it). I was disappointed that we were not unique in our design, but pleased that I would have someone to cite. However, as I read further I started to become very agitated.

In the introduction, the authors claim that of the other databases in existence, almost all of them, "...are built upon traditional relational database technology." (Kowari was a member of their list). Of course, this does not apply to Kowari at all.

Also, in their "Contributions" section, they claim, "Our paper identifies and combines several techniques in a novel fashion from the database area to arrive at a system with superior efficiency for storing and retrieving RDF."

This would all seem fine, except that they go on to describe an identical indexing scheme to the one used in Kowari. This makes it far from "novel".

There are two differences between their implementation and our own. The first is that they used B-trees rather than the AVL trees that Kowari uses. That is not a significant difference, and I blogged about this several times before. The main advantage of AVL trees is that they are cheaper to write to than B-trees, though a little slower to read. They also pointed out that they wanted to use an existing library for the index, and while B-tree libraries are common, we never found a completely working AVL tree library (deletions were always buggy).

The second difference is a set of special statements which count the number of triples in parts of each of the indexes. This is certainly novel, but not an approach I would use. I believe that counting is still very efficient in Kowari (O(log(n))), and the space overhead they incurred would seem prohibitive. More importantly, writing is always the slowest operation, and their system would incur a large writing penalty for using this scheme.

Beyond these two points, the entire indexing scheme is an exact description of Kowari, without the more thorough analysis we undertook. Given that it was submitted to WWW2005 at the end of last year, the timing is very poor for them.

Perhaps this is a monumental coincidence. However it would indicate particularly poor research on their part if it is, as they explicitly mention Kowari as a part of their paper, and yet fail to acknowledge that it uses the same indexing scheme. They also compared their project to several other listed databases, but claim, "We tried to install Kowari, but failed to get a running version." I'm very surprised at this, as I believe that it is simple to run. I don't believe that they asked the mailing list for advice.

Admittedly, they used the Jena interface on Kowari. I've never used this, but I've worked in the code. Anyone using the Jena interface will definitely see a poorly performing system. They also said that their JVM died on one machine. I can only presume that they were using a Mac, as this is the only JVM I have seen having a problem in recent years.

This hardly matters, because even without a performance comparison, they were certainly aware of the Kowari project.

I've since been told that this paper was rejected due to its similarity with Kowari. I'm pleased to hear it, but it has left me with the realisation that I need to publish what I've been doing (all the way back to the indexing work from 2001), or else this could happen again. Besides that, the ITEE department will be very happy with me if I publish something.

Unfortunately, I can't really write papers on Greg's time, so I'll be working lots on weekends and evenings. Oh well, I guess that's what I signed up for when I enrolled at university.

Thursday, February 10, 2005

RDF
I just realised that my confirmation will be easier than I thought!

I've been thinking that while my ideas were new, there wasn't all that much to them, so I've been worried about filling in the bulk of the talk and the paper. But I was forgetting that most people don't even know what RDF is! I even had to explain to Bob that non-anonymous nodes are identified by URIs and not URLs (hence, the "location" mentioned in an ID that looks like a URL does not need to refer to a real web page). So I will start the talk with an explanation of just what RDF is, followed by a brief description of taxonomies and ontologies and how RDFS and OWL can represent them.

This all seemed like a silly waste of my time when there was more important stuff to talk about, but then I realised that this is actually a legitimate part of my literature review. It also leads nicely into the indexing scheme used by Kowari, and why I chose it.

I should start by presenting Jena, Sesame, and Kowari, with a brief overview of them all. Are there any other significant RDF databases yet? I'll admit that I've been lazy and haven't tried to keep up. I'll ask Andrew.

When I first started indexing, I really envisaged every node mapping to a set of nodes mapping to nodes, with the same pattern repeated for each direction. This works perfectly in memory (it's how I implemented JRDFmem for JRDF), but it does not map very well to a file representation. DavidM's idea of sorting three ways ends up being exactly the same idea, only it uses a different representation. For instance, the extent of a single subject for a set of predicate-object mappings defines the set of mappings I had modelled before.

Of course, I should also mention that the indexes are just numbers, and that the real data (literals and URIs) is stored in the String Pool.

The fact that Kowari maps 3 ways (OK, four ways, but I'll get to that) is something that Andrew calls "perfect indexing". He thinks I can find some papers on this, and I'll be doing that in the next couple of days. I can then explain why this type of indexing lets us find any group of data that we want, including some examples as I go. Once this is established, I can get onto models, and how this necessitates a fourth node in the index. I probably don't need to mention the whole spatial/combinatorial relationship of the number of indexes to the number of nodes, but I will explain that perfect indexing requires 6 indexes when using 4 nodes.
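The combinatorial claim is easy to check mechanically. The sketch below uses one valid choice of six orderings over subject, predicate, object and meta (I believe it is close to what Kowari uses, but treat it as illustrative) and verifies that every possible set of bound positions forms a prefix of at least one ordering, which is what lets any constraint be answered with a single range scan.

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class IndexCoverage {

  // One valid choice of six orderings over S, P, O and M (meta/model).
  private static final List<String> ORDERINGS =
      Arrays.asList("SPOM", "POSM", "OSPM", "MSPO", "MOSP", "MPOS");

  // Find an ordering whose prefix is exactly the set of bound positions.
  private static String indexFor(Set<Character> bound) {
    for (String ordering : ORDERINGS) {
      Set<Character> prefix = new HashSet<Character>();
      for (int i = 0; i < bound.size(); i++) {
        prefix.add(Character.valueOf(ordering.charAt(i)));
      }
      if (prefix.equals(bound)) return ordering;
    }
    return null;  // never happens with the orderings above
  }

  public static void main(String[] args) {
    char[] positions = {'S', 'P', 'O', 'M'};
    // Check every one of the 2^4 possible binding patterns.
    for (int mask = 0; mask < 16; mask++) {
      Set<Character> bound = new HashSet<Character>();
      for (int i = 0; i < 4; i++) {
        if ((mask & (1 << i)) != 0) bound.add(Character.valueOf(positions[i]));
      }
      System.out.println(bound + " -> " + indexFor(bound));
    }
  }
}

Three orderings are enough for plain triples, but adding the model node pushes the minimum to six.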

Now at this point, should I mention OWL first, or iTQL? I'm concerned that if I go straight to iTQL then the talk will start sounding more like a description of Kowari than my new work. However, it leads into the further work, so it makes sense to describe it. Maybe I should ask Bob.

Anyway, a description of iTQL will explain how constraints take a slice from an index, with conjunctions and disjunctions resulting in various join operations. It should be apparent that this allows queries to occur without having to iterate over the graph to test each statement. This is a major point, as it is where we get an advantage over most other systems (Jena has a big problem with this). It's also why it will be hard to do SPARQL efficiently in Kowari, as SPARQL sort of assumes that you are iterating over data, rather than doing a selection of a range from a database. It's not that we'd do SPARQL any worse than anyone else. It's just that we have the capacity to do something an order of magnitude more efficient.

Once RDF and the structures of Kowari and iTQL are established, I can move onto OWL. OWL inferencing can be done almost entirely in Description Logic (of course, OWL DL can be done entirely in description logic). From this point I can start citing Volz on many of the mappings into Horn clauses. Once I have this established, I should be able to describe iTQL in terms of the mapping from Horn clauses. I believe that this is important, as it gives the iTQL a solid mathematical foundation to work from.

That provides most of the background. So then I can move onto rules systems.

I'll have to give a brief description of a rules system (finding a formal definition should be useful). From there I'll describe the Rete system, along with some improvements and modifications. I can also explain how this system assumes that each statement has to be tested against each rule, and Rete gains its efficiency by avoiding as many tests as possible.

At this point I can propose my implementation of rules for Kowari. I'll explain how it will involve a similar network to Rete, but instead of storing memory for every node (a non-scalable solution, as it means that a lot of the database may need to be duplicated) I'll be doing selections from the index. These selections take a little time, but they are fast (O(log(n))) and do not need to be done often. I'll also explain how the iTQL will get broken up for insertion into the rule graph, with each test in the rule graph containing just a single constraint (ie. a single selection from the database). The non-standard constraints (like trans, or some of the magic predicates) will need an explanation, which in turn may lead to describing resolvers. I should just mention this briefly, and put it off to the end of the talk where I talk about the ongoing work.

Once the basic structure of this rules engine is described, I can explain why it only iterates a few times (because almost all data is found in the first iteration, due to the fact that groups of statements are found instead of individuals), and how extra efficiencies, such as semi-naïve evaluation, can help (including a citation for the technique). I can also mention that I've already built a test system with DROOLS, though only with grouped rule tests (ie. the entire constraint clause operated as a single rule test, gaining nothing from shared nodes in the system). It will also be important to note that because of the open-world assumption all OWL rules will generate new data, with no data being removed in the process. I will need to mention this to help explain why semi-naïve evaluation is a legitimate optimization. While I think of it, removing data will make for an interesting "future directions" comment.
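
Since semi-naïve evaluation is what keeps the iteration count down, it's worth sketching the bare loop (again, only a sketch, with stand-in classes rather than anything from Kowari):

// Bare-bones semi-naïve evaluation: on each pass a rule is only run
// against the statements that are new since the previous pass (the
// delta), so the loop finishes quickly once the first pass has found
// most of the entailed data.
import java.util.*;

interface Rule {
  // Given everything known plus the newly added delta, return the
  // consequences that involve at least one statement from the delta.
  Set<Statement> apply(Set<Statement> all, Set<Statement> delta);
}

class Statement {
  final long subject, predicate, object;
  Statement(long s, long p, long o) { subject = s; predicate = p; object = o; }
  public boolean equals(Object other) {
    if (!(other instanceof Statement)) return false;
    Statement t = (Statement) other;
    return subject == t.subject && predicate == t.predicate && object == t.object;
  }
  public int hashCode() {
    return (int) (subject * 31 * 31 + predicate * 31 + object);
  }
}

class SemiNaive {
  static Set<Statement> close(Set<Statement> base, List<Rule> rules) {
    Set<Statement> all = new HashSet<Statement>(base);
    Set<Statement> delta = new HashSet<Statement>(base);
    while (!delta.isEmpty()) {
      Set<Statement> fresh = new HashSet<Statement>();
      for (Rule rule : rules) {
        fresh.addAll(rule.apply(all, delta));
      }
      fresh.removeAll(all);     // keep only genuinely new statements
      all.addAll(fresh);
      delta = fresh;            // the next pass only looks at the new data
    }
    return all;
  }
}

The important property is that each pass only hands the rules the statements that are new since the previous pass, so once the first pass has found most of the entailments the remaining passes are cheap. Because the data only ever grows, dropping the already-seen statements never loses a consequence, which is why it's a legitimate optimization here.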

With everything described, I can fill in a few of the gaps, including an explanation of some of the extra support needed for RDF and OWL, in particular, the OWL description resolver, and the transitive collection predicates.

At that point, I can plot out my timeline, and finish with ideas for future work. The main ideas here are how to work with changing data (not simply adding new data) and a changing ontology.

There, that ought to take up 10 pages, I'm sure. Now that I look at it, I'm more worried about having too much to present! That's disturbing, as there is a lot of description logic work that I really ought to read and include. I'm starting to think that I will need to work hard to keep it down to size.

Luc
In the meantime, I still need to look up those papers on logic implementations and rules that may work on grouped data. It's not easy this week, as I am still looking after Luc at home.

That said, I'm grateful to be here today, as it is his first birthday! I'm trying to give him a nice day, even though he won't have any idea why. :-)

Happy birthday Luc!

Wednesday, February 09, 2005

Coincidences
Another busy day today. Everything from QT to research papers. I'd like to have accomplished more programming, but until I've guaranteed money coming in (I'm pretty sure it's happening now) I have a lot of other things to keep up with.

In the meantime, I was reminded of a big coincidence, and I thought I'd log it.

About 8 years ago I wanted to withdraw some cash against my Visa credit card. I used my card regularly for purchases, but I had never used it to borrow cash before. As a result, I had not been using the PIN for the card, and I discovered that I could not remember it now that I needed it. This necessitated a visit to a local branch of the Commonwealth Bank to reset the PIN.

In the branch I was informed that I needed a 6 digit number for the PIN. I've always been cautious of obvious numbers, but I wanted to choose something that I could work out again in case I ever forgot it.

As a way of exercising my memory, I occasionally try to memorize long irrational numbers. For instance, I know about 26 decimal places for π. I decided to use this as the basis for my PIN.

I took the first 6 digits of π and wrote them out. I then took the first 6 digits of e, reversed them, and wrote them out underneath the digits for π.

So I had:

  3 1 4 1 5 9
  8 2 8 1 7 2

I then added these digits. To avoid the possibility of the two 6-digit numbers adding up to a 7-digit number, I added each pair of digits individually, taking the result mod 10, eg. (3 + 8) mod 10 = 1.

The result was:
  3 1 4 1 5 9
  8 2 8 1 7 2
  -----------
  1 3 2 2 2 1
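
The whole scheme is just a digit-wise sum mod 10, which a throwaway snippet makes obvious (nothing to do with Kowari, obviously):

// Digit-wise addition, mod 10, of two 6-digit strings.
public class DigitSum {
  public static void main(String[] args) {
    String a = "314159";   // the first six digits of pi
    String b = "828172";   // the first six digits of e, reversed
    StringBuilder pin = new StringBuilder();
    for (int i = 0; i < a.length(); i++) {
      pin.append((a.charAt(i) - '0' + b.charAt(i) - '0') % 10);
    }
    System.out.println(pin);   // prints 132221
  }
}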

I was surprised at the regularity of this result, but one of the features of a random sequence of digits is that it can look non-random in places. If it were unable to appear regular like this, then that would itself be a non-random influence. Mathematics is full of ironies like that. :-)

Once the new PIN was in place I was able to use the card to take cash from an ATM, and I thought no more of it.

A few weeks later, that card expired. The bank issued a new card (using the same PIN), and asked me to come into my local branch to collect it. When I arrived and asked for my new card, they presented it to me attached to a piece of cardboard with some notices.

The first thing printed on the cardboard was a warning never to leave the PIN written down anywhere near the card, even if it is disguised as a phone number or a date.

The second thing on the cardboard was my PIN!!!

I was horrified! How could they do this? I'd never heard of a PIN being printed in the clear like that. It was a clear breach of protocols at the bank.

I was about to make an issue of it when I looked at the card more carefully. In fact, what was written was: "In the event of a lost or stolen PIN, please ring 132221."

If you don't believe me, then go to the "Contact us" link on the bank's home page.

Obviously, I changed the PIN immediately. To quote Terry Pratchett, "One in a million chances occur nine times out of ten." :-)

Tuesday, February 08, 2005

Seminars
I went to two PhD confirmation seminars at UQ today.

The first seminar was on defeasible logic. Interesting, but ultimately unsatisfying. This was because defeasible statements do not have a lot of applicability to RDF. It's not like we say, "Subject usually has Predicate on Object."

The second was about using Description Logics on RDF and OWL. Now this one seemed much more interesting from my perspective.

I was a little surprised about the scope of this second presentation. I can quote exactly which papers went into the seminar. I know exactly where the images were "lifted" from. I'm intimately familiar with almost everything presented.

The main difference between this project and my own was that it wanted to express a mathematical completeness for all of the DL used for OWL. As such, it dwelled for some time on TBox and ABox definitions, and their interactions. In fact, it did this so much that the question time at the end seemed to get bogged down in issues of how these relate to each other. I didn't expect this, as I was waiting to hear it all get tied back into OWL, but that never happened. It's as if OWL was just the excuse for the motivation, rather than a real goal.

Other than the mathematical description of the logic, one purpose of this PhD was to create a working implementation based on the logic. However, I felt that there was not a lot of description on how this should be implemented. He talked about potential improvements over existing algorithms, but very little methodology. I'm keen to learn more about tableaux reasoners and the like so I can compare them to the rules engine I'd like to build, but this information was not there.

I also felt that the work was disturbingly parallel to my own. The issue that I have with this is that I'm only doing a Masters. What gives?

One thing that I'm thinking as a result of this seminar is that I should find a more formal way of describing the whole rules process. I can sort of describe individual steps, as these often come down to FOL, but I'm wondering if there is some way that the whole process can be described cleanly. I'd rather not work out how to do this for myself (if I can avoid it), so I should go looking to see if anyone has already done it. Guido is a rules expert, so perhaps I should ask him. OTOH, I'm likely to embarrass myself if I'm asking something so fundamental. :-(

Anyway, I got in touch with Stone after he presented his seminar. He and I should be catching up over the next few days.

Monday, February 07, 2005

Sultry
It's hot tonight. It's 9:55pm and according to the weather bureau the temperature in the city (only about 1km from here) is 27.1C (80.78F) with a relative humidity of 81%. Sure, it was hotter during the day, but you'd think it would have been cooler by this time of night. I shouldn't complain though, as it looks like getting worse for the next 2 days. Yuck.

Break
I've continued to extend my break, so I haven't really done a lot lately. The main reason is that I've been at my brother's wedding in Hawai'i for the last week, and I didn't really have access to a computer. I did get to read a few papers, but I spent most of my time sight-seeing. And when I did take the time for some significant reading, I just read "Down and Out in the Magic Kingdom". I loved it. :-)

I really did enjoy Hawai'i, particularly the big island. It's winter there now, so the weather was really mild, and even cold on some days.

My absolute favourite part was visiting the active volcano. Lava is a pretty amazing thing to watch.


Now that I'm back in Australia, all of the job applications I put in at the end of December have FINALLY started to pay off. However, while I know I have 2 guaranteed offers (with another 2 very likely ones about to be confirmed), no one has yet made a formal offer involving actual money. It's been over a month, and I've been available for almost the whole time, so it's a little frustrating. The credit card is starting to look very unhealthy. :-(

Work
Working or not, I'm feeling motivated to get something done again, so part of this week will be spent working on Kowari and OWL again. Tomorrow I'll be attending 2 PhD confirmation seminars on OWL and the semantic web, so hopefully they will be interesting and insightful. (I'll also be doing another job interview, which will probably kill the afternoon). In between seminars I'm hoping to catch up on some reading.

I currently have a question that I'm pursuing in the literature, and I'm not exactly sure where to start.

Rules systems (like Rete) appear to take individual statements and determine their consequences one at a time. Efficient systems, such as Rete, minimize the work involved in computing these consequences. The thing I like about rules systems is that it is clear how they should be implemented.

Description Logic systems (and subsets or intersecting sets such as Datalog, Prolog, etc) are not so clear on how they should be implemented. (I need to establish how implementation should be done for an algebraic system.) Also, with the exception of systems like "Persistent Prolog", most systems I've seen need to refer to their statements in memory. This means that they can't scale. It seems to me that implementation of a description logic system can be done with a rules system, though I've been told that there are some statements which are incompatible (I don't really see it, but I'd probably have to try to implement it to see why it doesn't work for every case). I have also seen it written in at least one place that a description logic system could be implemented with a rules system.

One feature that both types of evaluation system seem to share is that they both operate on one statement at a time. This is obviously a scalability bottleneck. It also completely ignores the fact that Kowari is designed to retrieve and work with large groups of related data. This ability of Kowari's is the basis of its speed (and why the Jena interface is so slow for Kowari).

My idea is to work with large groups of data which are returned from Kowari, and use them in a kind-of rules engine. It means working with groups of data, rather than individual statements... and that's where my problem lies.
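
In caricature, the difference I mean is the contrast between the two loops below (hypothetical interfaces only; the real store would hand back its usual tuples):

// The contrast, in caricature. A statement-at-a-time engine pushes each
// statement through each rule test; a set-at-a-time engine asks the
// store for all the bindings of a rule body in one selection and then
// inserts the consequences in bulk. The interfaces are hypothetical.
import java.util.*;

interface GraphStore {
  Iterator<long[]> allStatements();                    // one statement at a time
  List<Map<String, Long>> resolve(String ruleBody);    // a whole result set at once
  void insertAll(List<long[]> statements);
}

class Comparison {
  static void statementAtATime(GraphStore store, List<String> ruleBodies) {
    for (Iterator<long[]> it = store.allStatements(); it.hasNext();) {
      long[] statement = it.next();
      for (String body : ruleBodies) {
        // the per-statement test of this rule against this statement goes here
      }
    }
  }

  static void setAtATime(GraphStore store, List<String> ruleBodies) {
    for (String body : ruleBodies) {
      List<Map<String, Long>> bindings = store.resolve(body);  // one bulk selection
      store.insertAll(buildConsequences(bindings));            // one bulk insertion
    }
  }

  // Building statements from bindings depends on the rule head; elided here.
  static List<long[]> buildConsequences(List<Map<String, Long>> bindings) {
    return new ArrayList<long[]>();
  }
}

The second form is the one Kowari's indexes are built for: one selection per rule body per pass, instead of one test per statement per rule.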

I can't find any papers on rules systems which talk about working with large groups of data at once, rather than individual statements. Maybe nobody ever indexed data quite this way, so groups like this were not possible.

I also can't find any papers on algebraic systems which refer to working with large groups of data at a time, rather than individual statements. These papers possibly exist, but I haven't found them yet. The other issue is how an algebraic system is supposed to be implemented anyway. LOTS of papers talk about these systems, but the evaluation of them is always referred to in terms of algebraic equations. There must be some papers which talk about how this is done. If I can't find any, then I may just try looking at the innards of Vampire, though that will hardly be the definitive answer.

Finding papers on this stuff will really help me to move on. Conversely, if I don't find some papers on this, then I know that working with large blocks of data is a relatively new idea, so my working on it will be a big deal.

Well, I'll be at UQ tomorrow, so I'll do some more reading while I'm there.

Jetlag
It's only 4 hours, but I'm right in the middle of the jetlag period at the moment. It's 11pm and my body thinks it's 3am. Consequently, I'm barely able to type. If you see any glaring mistakes above, then I apologise now. In the meantime I'm going to bed and will finish this another time.