Monday, February 07, 2005

Sultry
It's hot tonight. It's 9:55pm and according to the weather bureau the temperature in the city (only about 1km from here) is 27.1C (80.78F) with a relative humidity of 81%. Sure, it was hotter during the day, but you'd think it would have cooled down by this time of night. I shouldn't complain though, as it looks like it will get worse for the next 2 days. Yuck.

Break
I've continued to extend my break, so I haven't really done a lot lately. The main reason is that I've been at my brother's wedding in Hawai'i for the last week, and I didn't really have access to a computer. I did get to read a few papers, but I spent most of my time sight-seeing. And when I did take the time for some significant reading, I just read "Down and Out in the Magic Kingdom". I loved it. :-)

I really did enjoy Hawai'i, particularly the big island. It's winter there now, so the weather was really mild, and even cold on some days.

My absolute favourite part was visiting the active volcano. Lava is a pretty amazing thing to watch.


Now that I'm back in Australia, all of the job applications I put in at the end of December have FINALLY started to pay off. However, while I know I have 2 guaranteed offers (with another 2 very likely ones about to be confirmed), no one has yet made a formal offer involving actual money. It's been over a month, and I've been available for almost the whole time, so it's a little frustrating. The credit card is starting to look very unhealthy. :-(

Work
Working or not, I'm feeling motivated to get something done again, so part of this week will be spent working on Kowari and OWL again. Tomorrow I'll be attending 2 PhD confirmation seminars on OWL and the semantic web, so hopefully they will be interesting and insightful. (I'll also be doing another job interview, which will probably kill the afternoon). In between seminars I'm hoping to catch up on some reading.

I currently have a question that I'm pursuing in the literature, and I'm not exactly sure where to start.

Rules systems appear to take individual statements and determine their consequences one at a time. Efficient systems (like those based on the Rete algorithm) minimize the work involved in finding those consequences. The thing I like about rules systems is that it is clear how they should be implemented.
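To make that concrete, here is a minimal sketch of the statement-at-a-time loop (this is only an illustration, not Rete itself, which avoids redoing work by caching partial matches in a network; the transitive subClassOf rule and the facts are just examples I've made up):

    # A minimal sketch of statement-at-a-time forward chaining over RDF-style
    # triples. This is not Rete (which caches partial matches in a network);
    # it just shows the "take one statement, work out its consequences" loop.
    # The rule (transitivity of a "subClassOf" predicate) is purely illustrative.

    def forward_chain(triples):
        """Naively derive consequences one statement at a time."""
        known = set(triples)
        agenda = list(triples)        # statements whose consequences are pending
        while agenda:
            s, p, o = agenda.pop()    # one statement at a time
            if p != "subClassOf":
                continue
            for s2, p2, o2 in list(known):
                if p2 != "subClassOf":
                    continue
                # (s2 sub s) followed by (s sub o) gives (s2 sub o)
                if o2 == s:
                    new = (s2, "subClassOf", o)
                    if new not in known:
                        known.add(new)
                        agenda.append(new)
                # (s sub o) followed by (o sub o2) gives (s sub o2)
                if s2 == o:
                    new = (s, "subClassOf", o2)
                    if new not in known:
                        known.add(new)
                        agenda.append(new)
        return known

    facts = {("Cat", "subClassOf", "Mammal"), ("Mammal", "subClassOf", "Animal")}
    print(forward_chain(facts))       # also contains ("Cat", "subClassOf", "Animal")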

Description Logic systems (and subsets or intersecting sets, such as Datalog, Prolog, etc.) are not so clear on how they should be implemented. (I need to establish how implementation is supposed to be done for an algebraic system.) Also, with the exception of systems like "Persistent Prolog", most systems I've seen need to hold their statements in memory, which means that they can't scale. It seems to me that a description logic system could be implemented with a rules system, though I've been told that there are some statements which are incompatible (I don't really see it, but I'd probably have to try to implement it to see why it doesn't work for every case). Indeed, I have seen it written in at least one place that a description logic system could be implemented with a rules system.

One feature that both types of evaluation system seem to have in common is that they operate on one statement at a time. This is obviously a scalability bottleneck. It also completely ignores the fact that Kowari is designed to retrieve and work with large groups of related data. This ability is the basis of Kowari's speed (and the reason the Jena interface to Kowari is so slow).

My idea is to take the large groups of data which are returned from Kowari, and use them in a kind of rules engine. It means working with groups of data, rather than individual statements... and that's where my problem lies.
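Just to show what I mean, here is a rough sketch of the same kind of rule applied set-at-a-time. The resolve function is only a stand-in for a query against a store like Kowari; in a real system the groups of tuples and the join between them would be handled by the indexes on disk, not by an in-memory dictionary:

    # Sketch of "set at a time" rule evaluation: resolve each rule pattern to a
    # whole group of tuples, then do one join over the groups, instead of firing
    # the rule for individual statements. The store here is just a Python set;
    # it stands in for data that would really come back from Kowari's indexes.

    def resolve(store, predicate):
        """Return the whole group of (subject, object) pairs for one predicate."""
        return {(s, o) for (s, p, o) in store if p == predicate}

    def apply_transitive_rule(store, predicate):
        """subClassOf(A,B) ^ subClassOf(B,C) -> subClassOf(A,C), set at a time."""
        pairs = resolve(store, predicate)
        by_subject = {}
        for s, o in pairs:
            by_subject.setdefault(s, set()).add(o)
        # one join over the two groups, rather than a loop over single statements
        derived = {(a, predicate, c)
                   for (a, b) in pairs
                   for c in by_subject.get(b, ())}
        return derived - store        # only the genuinely new statements

    store = {("Cat", "subClassOf", "Mammal"), ("Mammal", "subClassOf", "Animal")}
    while True:
        new = apply_transitive_rule(store, "subClassOf")
        if not new:
            break                     # fixed point: the last pass added nothing
        store |= new
    print(store)

The rule fires once per pass against whole groups of statements, and the loop stops when a pass adds nothing new. That is the shape of evaluation I'm after, with the group handling pushed down into the store.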

I can't find any papers on rules systems which talk about working with large groups of data at once, rather than individual statements. Maybe nobody ever indexed data quite this way, so groups like this were not possible.

I also can't find any papers on algebraic systems which refer to working with large groups of data at a time, rather than individual statements. These papers may well exist, but I haven't found them yet. The other issue is how an algebraic system is supposed to be implemented anyway. LOTS of papers talk about these systems, but their evaluation is always described in terms of algebraic equations. There must be some papers which talk about how this is actually done. If I can't find any, then I may just try looking at the innards of Vampire, though that will hardly be the definitive answer.

Finding papers on this stuff will really help me to move on. Conversely, if I don't find any, then I'll know that working with large blocks of data is a relatively new idea, so my working on it will be a big deal.

Well, I'll be at UQ tomorrow, so I'll do some more reading while I'm there.

Jetlag
It's only 4 hours, but I'm right in the middle of the jetlag period at the moment. It's 11pm and my body thinks it's 3am. Consequently, I'm barely able to type. If you see any glaring mistakes above, then I apologise now. In the meantime I'm going to bed and will finish this another time.

6 comments:

Andrew said...

I thought both Syllog and Aditi (I think they used b-trees) are the old ones (> 10 years old) that have disk based indexation.

* XSB as an efficient deductive database engine

Set at a time processing:
* Taking I/O seriously: Resolution reconsidered for disk

Specific optimization for transitive closure on disk:
* Transitive closure algorithm DISK TC and its performance analysis

Something like PerKMan is a more modern approach (G-Trees) (compares it with something called ECLiPSe up to 2 million facts):
* Efficient Management of Persistent Knowledge

There was also late last year, that one the best paper DLDB-OWL:
* Scalable OWL Lite
* Download for DLDB and HAWK (both disk based).

Syllog looks very interesting, especially backward chaining iteration:
* Reasoning about ER Models in a Deductive Environment

I think that maybe "parallel processing" or "distributed processing" of rules is a better search term for these types of operations that you're talking about too.

Andrew said...

Should be "that won the best paper DLDB-OWL" (in ISWC 2004) and I'm not sure why I suggest "distributed processing" but "parallel processing" is probably worth a look.

Andrew said...

I wasn't sure whether this was true, but apparently AVL Trees were an in-memory technology:
History of File Structures

Paula said...

That's interesting, though it's not really a big deal. The important thing for on-disk storage is that the records be the same length. The only systems I know of that don't follow this are skip lists, and they are write-once.
AVL trees and B-trees both share the important property that individual nodes are always the same size. This is of no particular advantage while in RAM, but on disk it makes a huge difference.
So maybe it's just a coincidence that they work well on disk. The fact remains that they DO work well, and for well defined reasons. Since then people have come up with different attempts to index specifically on disk, but nothing has really overcome how effective these kinds of trees are. (With the exception of Lov Grover's ideas). ;-)
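Just to illustrate the fixed-size point (a sketch only; the field layout here is made up and has nothing to do with Kowari's actual node format): when every node is the same length, node i sits at offset i * NODE_SIZE, so a single seek reaches it, with no scanning and no indirection table.

    # Illustration only: why fixed-length records matter for on-disk trees.
    # The layout (key, payload, left child id, right child id) is invented for
    # the example and is not Kowari's node format.
    import struct

    NODE_FORMAT = "<qqqq"                 # key, payload, left child id, right child id
    NODE_SIZE = struct.calcsize(NODE_FORMAT)

    def write_node(f, node_id, key, payload, left=-1, right=-1):
        f.seek(node_id * NODE_SIZE)       # direct addressing by node id
        f.write(struct.pack(NODE_FORMAT, key, payload, left, right))

    def read_node(f, node_id):
        f.seek(node_id * NODE_SIZE)
        return struct.unpack(NODE_FORMAT, f.read(NODE_SIZE))

    with open("nodes.dat", "w+b") as f:
        write_node(f, 0, key=42, payload=7, left=1, right=2)
        write_node(f, 1, key=17, payload=3)
        write_node(f, 2, key=99, payload=5)
        print(read_node(f, 2))            # (99, 5, -1, -1)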

Paula said...

Actually, I should clarify...
B-trees do involve fewer trips to disk when reading a file. However, Kowari is already very fast at reading, and this is not perceived to be a problem anywhere.
AVL trees involve fewer writes to disk when adding data than B-trees do. An AVL rebalance is very cheap in this regard, while in the worst case a B-tree rebalance can touch most of the index. This is one of the reasons our bulk loads are good.
So B-trees have definite benefits, but they do not completely overcome the advantages of AVL trees.

Paula said...

I'm finally getting to those papers that are posted up the top here. So far, the real gem is: Taking I/O seriously: Resolution reconsidered for disk. This paper describes techniques of logic evaluation in a tabling environment, specifically avoiding tuple-at-a-time evaluation in favour of a set-at-a-time approach. This is to take advantage of the on-disk nature of many modern databases, which are otherwise still stuck on the tuple-at-a-time paradigm.

This is EXACTLY what I'm trying to accomplish. It's a very different system to my design for Kowari, but it's the first reference to this kind of thing that I've seen. Thank you. :-)