Thursday, February 10, 2005

RDF
I just realised that my confirmation will be easier than I thought!

I've been thinking that while my ideas were new, there wasn't all that much to them, so I've been worried about filling in the bulk of the talk and the paper. But I was forgetting that most people don't even know what RDF is! I even had to explain to Bob that non-anonymous nodes are identified by URIs and not URLs (hence, the "location" mentioned in an ID that looks like a URL does not need to refer to a real web page). So I will start the talk with an explanation of just what RDF is, followed by a brief description of taxonomies and ontologies and how RDFS and OWL can represent them.
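
To make that point concrete, here's a minimal sketch (my own illustration, not the JRDF or Kowari API) of a node identified by a URI that is only ever treated as a name, never as a location:

```java
import java.net.URI;

// A minimal illustration (not JRDF or Kowari code): a resource node is
// identified by a URI, which is just a name. Even when it looks like a
// URL, nothing needs to exist at that "location".
public class UriNode {
    private final URI id;

    public UriNode(String uri) {
        this.id = URI.create(uri);  // parsed as an identifier, never fetched
    }

    public static void main(String[] args) {
        // Looks like a web address, but no page needs to live there.
        UriNode node = new UriNode("http://example.com/terms#confirmation");
        System.out.println(node.id);
    }
}
```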

This all seemed like a silly waste of my time when there was more important stuff to talk about, but then I realised that this is actually a legitimate part of my literature review. It also leads nicely into the indexing scheme used by Kowari, and why I chose it.

I should start by presenting Jena, Sesame, and Kowari, with a brief overview of them all. Are there any other significant RDF databases yet? I'll admit that I've been lazy and haven't tried to keep up. I'll ask Andrew.

When I first started on indexing, I envisaged each node mapping to a set of nodes, each of which mapped to a further set of nodes, with the same pattern repeated for each direction. This works perfectly in memory (it's how I implemented JRDFmem for JRDF), but it does not translate well to a file representation. DavidM's idea of sorting the statements three ways turns out to be exactly the same idea in a different representation. For instance, the extent of a single subject in the sorted index covers exactly the set of predicate-to-object mappings I had modelled before.
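
A minimal sketch of that in-memory structure (hypothetical code in the spirit of JRDFmem, not the actual source) might look like this, with one nested map per direction:

```java
import java.util.*;

// A sketch of the in-memory structure described above: one nested map per
// ordering, so fixing any single node (or pair) narrows the search at once.
public class TripleIndex {
    // subject -> predicate -> objects; the other two maps repeat the
    // pattern starting from predicate and from object.
    private final Map<Long, Map<Long, Set<Long>>> spo = new HashMap<>();
    private final Map<Long, Map<Long, Set<Long>>> pos = new HashMap<>();
    private final Map<Long, Map<Long, Set<Long>>> osp = new HashMap<>();

    public void add(long s, long p, long o) {
        spo.computeIfAbsent(s, k -> new HashMap<>())
           .computeIfAbsent(p, k -> new HashSet<>()).add(o);
        pos.computeIfAbsent(p, k -> new HashMap<>())
           .computeIfAbsent(o, k -> new HashSet<>()).add(s);
        osp.computeIfAbsent(o, k -> new HashMap<>())
           .computeIfAbsent(s, k -> new HashSet<>()).add(p);
    }

    // Everything for one subject: exactly the "extent of a single subject"
    // that a sorted SPO file index provides as a contiguous range.
    public Map<Long, Set<Long>> bySubject(long s) {
        return spo.getOrDefault(s, Map.of());
    }
}
```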

Of course, I should also mention that the indexes hold only numbers, and that the real data (literals and URIs) is stored in the String Pool.
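
A toy version of that idea (my sketch, not Kowari's actual String Pool implementation) is just a bidirectional mapping between node numbers and their values:

```java
import java.util.HashMap;
import java.util.Map;

// A toy sketch of the String Pool idea: the indexes only ever see numeric
// node IDs; the pool maps IDs to the real literals and URIs, and back again.
public class StringPool {
    private final Map<Long, String> idToValue = new HashMap<>();
    private final Map<String, Long> valueToId = new HashMap<>();
    private long nextId = 1;

    public long localize(String value) {           // string -> node number
        return valueToId.computeIfAbsent(value, v -> {
            long id = nextId++;
            idToValue.put(id, v);
            return id;
        });
    }

    public String globalize(long id) {             // node number -> string
        return idToValue.get(id);
    }
}
```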

The fact that Kowari indexes three ways (OK, four ways, but I'll get to that) is something that Andrew calls "perfect indexing". He thinks I can find some papers on this, and I'll be doing that in the next couple of days. I can then explain why this type of indexing lets us find any group of data that we want, including some examples as I go. Once this is established, I can get onto models, and how they necessitate a fourth node in the index. I probably don't need to mention the whole spatial/combinatorial relationship between the number of indexes and the number of nodes, but I will explain that perfect indexing requires 6 indexes when using 4 nodes.
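
To show why six orderings are enough, here's a small check I could run (my own code; I'm assuming the six quad orderings are SPOM, POSM, OSPM, MSPO, MPOS and MOSP): every combination of fixed positions turns out to be a prefix of at least one ordering, so any constraint is a contiguous range in some index:

```java
import java.util.*;

// Verifies the "perfect indexing" property for the six assumed orderings:
// each of the 15 non-empty subsets of {S,P,O,M} must be exactly the set of
// leading columns of at least one ordering.
public class PerfectIndexCheck {
    static final String[] ORDERINGS = {"SPOM", "POSM", "OSPM", "MSPO", "MPOS", "MOSP"};

    public static void main(String[] args) {
        char[] positions = {'S', 'P', 'O', 'M'};
        for (int mask = 1; mask < 16; mask++) {
            Set<Character> bound = new HashSet<>();
            for (int i = 0; i < 4; i++)
                if ((mask & (1 << i)) != 0) bound.add(positions[i]);
            boolean covered = false;
            for (String ord : ORDERINGS) {
                // The bound positions must be exactly the first |bound| columns.
                Set<Character> prefix = new HashSet<>();
                for (int i = 0; i < bound.size(); i++) prefix.add(ord.charAt(i));
                if (prefix.equals(bound)) { covered = true; break; }
            }
            System.out.println(bound + " -> " + (covered ? "covered" : "NOT covered"));
        }
    }
}
```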

Now at this point, should I mention OWL first, or iTQL? I'm concerned that if I go straight to iTQL then the talk will start sounding more like a description of Kowari than my new work. However, it leads into the further work, so it makes sense to describe it. Maybe I should ask Bob.

Anyway, a description of iTQL will explain how constraints take a slice from an index, with conjunctions and disjunctions resulting in various join operations. It should be apparent that this allows queries to run without having to iterate over the graph to test each statement. This is a major point, as it is where we get an advantage over most other systems (Jena has a big problem with this). It's also why it will be hard to do SPARQL efficiently in Kowari, as SPARQL sort of assumes that you are iterating over data, rather than selecting a range from a database. It's not that we'd do SPARQL any worse than anyone else. It's just that we have the capacity to do something an order of magnitude more efficient.
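
As a rough illustration of the conjunction case (a sketch, not Kowari's join code): each constraint returns a slice that is already sorted on the shared variable, so the conjunction reduces to a merge join, with no iteration over the whole graph:

```java
import java.util.*;

// Two constraint slices, each a sorted list of node IDs bound to the same
// variable, intersected by a linear merge join.
public class MergeJoin {
    static List<Long> join(long[] a, long[] b) {
        List<Long> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) i++;
            else if (a[i] > b[j]) j++;
            else { out.add(a[i]); i++; j++; }
        }
        return out;
    }

    public static void main(String[] args) {
        long[] slice1 = {2, 5, 7, 11};   // e.g. subjects matching constraint 1
        long[] slice2 = {3, 5, 11, 20};  // e.g. subjects matching constraint 2
        System.out.println(join(slice1, slice2));  // prints [5, 11]
    }
}
```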

Once RDF and the structures of Kowari and iTQL are established, I can move on to OWL. OWL inferencing can be done almost entirely in description logic (of course, OWL DL can be done entirely in description logic). From this point I can start citing Volz on many of the mappings into Horn clauses. Once I have this established, I should be able to describe iTQL in terms of the mapping from Horn clauses. I believe that this is important, as it gives iTQL a solid mathematical foundation to work from.
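
One example of the kind of mapping involved (my paraphrase of the style of rule Volz works with, not a quote): a transitive property becomes a Horn clause whose body lines up directly with a conjunction of iTQL constraints sharing a variable:

```latex
% owl:TransitiveProperty rendered as a Horn clause; the two body atoms
% sharing y correspond to two joined constraint slices.
\mathrm{TransitiveProperty}(p) \land p(x,y) \land p(y,z) \rightarrow p(x,z)
```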

That provides most of the background. So then I can move on to rules systems.

I'll have to give a brief description of a rules system (finding a formal definition should be useful). From there I'll describe the Rete algorithm, along with some of its improvements and modifications. I can also explain how such a system assumes that each statement has to be tested against each rule, and how Rete gains its efficiency by avoiding as many of those tests as possible.
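
A bare-bones sketch of the Rete idea (illustrative only, nothing like a full implementation) shows where the memory lives: each condition node remembers the statements that passed its test, so nothing is re-tested on later firings:

```java
import java.util.*;
import java.util.function.Predicate;

// An "alpha" node in the Rete sense: a single condition plus a memory of
// the statements that satisfied it. The memory is what buys the speed,
// and also what costs the space.
public class AlphaNode {
    private final Predicate<long[]> test;            // condition on a triple
    private final List<long[]> memory = new ArrayList<>();

    public AlphaNode(Predicate<long[]> test) { this.test = test; }

    public void assertStatement(long[] triple) {
        if (test.test(triple)) memory.add(triple);   // remembered for joins
    }

    public List<long[]> matches() { return memory; }
}
```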

At this point I can propose my implementation of rules for Kowari. I'll explain how it will involve a network similar to Rete's, but instead of storing a memory for every node (a non-scalable solution, as it means that much of the database may need to be duplicated) I'll be doing selections from the index. These selections do take some time, but each one is fast (O(log n)) and does not need to be done often. I'll also explain how the iTQL will be broken up for insertion into the rule graph, with each test in the rule graph containing just a single constraint (i.e. a single selection from the database). The non-standard constraints (like trans, or some of the magic predicates) will need an explanation, which in turn may lead to describing resolvers. I should just mention this briefly, and defer it to the end of the talk where I discuss the ongoing work.
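
Here's a sketch of the alternative (hypothetical code, just to show the shape of it): instead of node memory, a rule test re-selects its matches as a contiguous range from a sorted index, found by an O(log n) binary search:

```java
import java.util.*;

// A rule test that holds no memory of its own: it re-selects matches
// directly from a sorted index on demand.
public class IndexConstraintNode {
    private final long[][] spoIndex;   // triples sorted by (s, p, o)

    public IndexConstraintNode(long[][] spoIndex) { this.spoIndex = spoIndex; }

    // All triples with the given subject: one binary search for the start
    // of the range, then a contiguous scan. Nothing is duplicated out of
    // the database.
    public List<long[]> selectBySubject(long s) {
        int lo = lowerBound(s);
        List<long[]> out = new ArrayList<>();
        for (int i = lo; i < spoIndex.length && spoIndex[i][0] == s; i++)
            out.add(spoIndex[i]);
        return out;
    }

    // First position whose subject is >= s: O(log n).
    private int lowerBound(long s) {
        int lo = 0, hi = spoIndex.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (spoIndex[mid][0] < s) lo = mid + 1; else hi = mid;
        }
        return lo;
    }
}
```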

Once the basic structure of this rules engine is described, I can explain why it only iterates a few times (almost all data is found in the first iteration, because groups of statements are found at once rather than individual ones), and how extra efficiencies, such as semi-naïve evaluation, can help (including a citation for the technique). I can also mention that I've already built a test system with DROOLS, though only with grouped rule tests (i.e. each entire constraint clause operated as a single rule test, gaining nothing from shared nodes in the network). It will also be important to note that because of the open-world assumption all OWL rules generate new data, with no data being removed in the process. I will need to mention this to help explain why semi-naïve evaluation is a legitimate optimization. While I think of it, removing data will make for an interesting "future directions" comment.
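
A schematic of semi-naïve evaluation as I intend to use it (my sketch; the Rule interface is hypothetical): each pass feeds the rules only the statements added by the previous pass, which is sound precisely because these rules only ever add data:

```java
import java.util.*;

// Statements as value objects so set membership deduplicates correctly.
record Triple(long s, long p, long o) {}

// Hypothetical rule contract: derive new statements using the latest
// delta against everything known so far.
interface Rule {
    Set<Triple> apply(Set<Triple> all, Set<Triple> delta);
}

class SemiNaive {
    static Set<Triple> close(Set<Triple> base, List<Rule> rules) {
        Set<Triple> all = new HashSet<>(base);
        Set<Triple> delta = new HashSet<>(base);
        while (!delta.isEmpty()) {
            Set<Triple> fresh = new HashSet<>();
            for (Rule r : rules)
                for (Triple t : r.apply(all, delta))
                    if (all.add(t)) fresh.add(t);  // keep only genuinely new facts
            delta = fresh;   // the next pass only considers new statements
        }
        return all;          // fixed point: a pass added nothing new
    }
}
```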

With everything described, I can fill in a few of the gaps, including an explanation of some of the extra support needed for RDF and OWL, in particular the OWL description resolver and the transitive collections predicates.

At that point, I can plot out my timeline, and finish with ideas for future work. The main ideas here are how to work with changing data (not simply adding new data) and a changing ontology.

There, that ought to take up 10 pages, I'm sure. Now that I look at it, I'm more worried about having too much to present! That's disturbing, as there is a lot of description logic work that I really ought to read and include. I'm starting to think that I will need to work hard to keep it down to size.

Luc
In the meantime, I still need to look up those papers on logic implementations and rules that may work on grouped data. It's not easy this week, as I am still looking after Luc at home.

That said, I'm grateful to be here today, as it is his first birthday! I'm trying to give him a nice day, even though he won't have any idea why. :-)

Happy birthday Luc!
