Tuesday, August 03, 2004

Caveat Emptor
I normally try to take the time to proofread what I write here. But this is a little lengthy, and it's now 12:45am. You'll just have to take the spelling and grammatical errors as they come until I get time to come back and fix them up.

The first order of the day was to establish my priorities again. While we know that better externalization will increase RMI speeds for AnswerPageImpl, TJ doesn't want to get bogged down in it for the moment. So I'm back onto the inferencing code. I'll probably try to do the externalization in my own time soon. In the meantime I ran the full test suite for the current changes, and when they passed I checked everything in.

That reminded me that I have a checkout that continues to fail the Lucene tests. I guess I should blow that checkout away.

Because RDFS tests are mostly defined with N3 files, I went back to trying to load N3 again. The first step was to rebuild everything with the latest checked-out code. Well, it turns out that there have been build changes again, and now TKS isn't building with the latest Kowari core. The build is complaining that it can't find a method that appears to be there. I'm not going to get bogged down in a problem like this again when the code belongs to someone else, so I decided to do the work directly on Kowari.

Since this is all for inferencing, I thought I'd get the inferencing code working with the current checkout. This is because I need it working when I get off the N3 task, and because I thought I could try saving the inferred data in N3. Only this ended up being easier said than done. The latest Kowari updates have changed the version numbers on the jar files, which broke all the classpaths. Worse, the contents of the jars appear to be different: when I extracted all the inner jars from itql-1.0.5.jar, the list was shorter than the contents of itql-1.0.4.jar, and I can't get the inferencing code to run anymore. Just like the TKS code, I had no desire to go chasing down jars, so I decided to leave it until I need it running.

N3 Again
I finally ran a Kowari instance, loaded some data into it, saved it to N3, and tried to reload it.

The first thing that happened was an exception complaining of bad tokens in the prolog of the file. A check confirmed that the file was being loaded with the RDF loader, which was definitely the wrong thing. It turned out that the RDFLoader.canParse method was erroneously returning true on N3 files. I should look at what this method uses for its test.

Since my test for parsability of N3 is really only interested in the filename extension for the moment, I figured it couldn't hurt to test for N3 before RDF/XML, and if it matches then attempt to use the N3 loader. I know it's a hack, and just writing about it now makes it seem worse. However, it works for the moment, and the correct loader is being run.
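For what it's worth, the workaround amounts to ordering the dispatch so the N3 check runs first. Here's a minimal, language-neutral sketch of that logic (the function and loader names are illustrative, not Kowari's actual API):

```python
# Sketch of the hack described above: test the filename for the N3
# extension before falling back to the RDF/XML loader, so the over-eager
# RDF parser never gets a chance to claim an N3 file.

def pick_loader(filename, n3_loader, rdf_loader):
    """Choose a loader by filename extension, trying N3 first."""
    if filename.lower().endswith(".n3"):
        return n3_loader   # N3 wins whenever the extension matches
    return rdf_loader      # otherwise assume RDF/XML

print(pick_loader("data.n3", "N3Loader", "RDFLoader"))   # N3Loader
print(pick_loader("data.rdf", "N3Loader", "RDFLoader"))  # RDFLoader
```

As the text says, it's a hack: it keys purely off the extension rather than sniffing the content, but it does get the right loader chosen.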

Unfortunately the Jena N3 parser wouldn't run initially because of missing Antlr classes. This seemed strange, as I'd explicitly added this jar to the path in build.xml. A little checking showed that I'd only done it for TKS, so I ported it back over to Kowari. However, while I found many of the appropriate build targets, I had to ask AN before I could find the last one, which had since been moved out into an external XML file.

After a fresh rebuild I discovered that the Antlr jar was still missing. I'm now looking directly in the main jars, at the list of all the sub jars. The build code is definitely including Antlr, but the resulting jar files do not have it included. I'm stumped. I'm going to have to start running Ant with full verbose output, and maybe even run the jar program by hand.

This sort of thing gets really annoying when you have coding you want to get back to. Yes, I'm overtired today, so I'm allowed to be irritable.

I had another meeting with Bob today. As usual, he had some great insights for me, and I have a couple of things I want to get accomplished in the next 2 weeks.

For a start, I need to sit down and come up with a concrete example of what I want to do, and why. "Scalable OWL Inferencing" doesn't cut it. I need to be able to say, "I have this data, and I want to be able to infer this other data, and OWL is what can provide it for me". Or maybe "OWL can't do it for me, but I need it in order to build the inferencing system that can do it for me".

Does anyone have a suggestion of some useful, real world information that OWL can either derive or help me to derive? If not, then what's OWL good for?

I have some ideas of my own, but I'd love some feedback.

The other thing that Bob pointed out is that I really need to look at a couple of texts, rather than just reading papers. The texts provide the foundation of knowledge, while the papers really only seek to expand on it. The first one I'm looking for is on First Order Predicate Calculus, by Chang and Lee. Apparently the field hasn't changed all that much since this book was first written. I've yet to confirm, but I think the book is "Symbolic Logic and Mechanical Theorem Proving".

Indexes and Hypergraphs
While thinking about the role of directed hypergraphs in modelling RDF (after posting about it last week), it occurred to me that I hadn't considered the "direction" enough. After all, it's a directed 3-hypergraph.

The direction forms a loop around the triangle of subject, predicate and object, but I hadn't really considered a starting or ending point. That's when I realised that Kowari's indexes reflect this already, as they are quite symmetric, with no emphasis on any point defining a beginning or end:

  subject predicate object
  predicate object subject
  object subject predicate
I also noticed that if this loop were in the opposite direction then it would form the other set of 3 indexes which can be used for these statements:
  object predicate subject
  predicate subject object
  subject object predicate
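It's easy to check that either cyclic set really does cover every query on a triple, assuming an ordered index can answer exactly those queries that bind a prefix of its ordering (which is how Kowari's indexes are used). A quick sketch:

```python
from itertools import combinations

# Verify that each cyclic set of 3 orderings above can answer any query
# that fixes 1 or 2 of (subject, predicate, object), under the assumption
# that an ordered index answers queries binding a prefix of its ordering.

def covered_patterns(index):
    """All node-sets answerable as a prefix of this index ordering."""
    return {frozenset(index[:n]) for n in (1, 2)}

def covers_all(indexes):
    needed = {frozenset(c) for n in (1, 2) for c in combinations("spo", n)}
    have = set().union(*(covered_patterns(i) for i in indexes))
    return needed <= have

forward  = ["spo", "pos", "osp"]   # subject-predicate-object loop
backward = ["ops", "pso", "sop"]   # object-predicate-subject loop
print(covers_all(forward), covers_all(backward))  # True True
```

Each set of 3 indexes supplies all 3 single-node prefixes and all 3 distinct pairs, which is why the two loops are each sufficient on their own.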
Now each statement is a plane intersecting 3 nodes in the graph (in a 3 dimensional space). Each plane has 2 normals, pointing in opposite directions. The direction of the loop that the statements form would correspond to the curl about those normals, with one normal defining the subject-predicate-object direction, and the other defining the object-subject-predicate direction.

I thought this was interesting, and so I speculated on whether it can be extended to higher dimensions. This is relevant, as Kowari moved from a 3-hypergraph to a 4-hypergraph some time ago.

I had speculated that each normal of a 3 dimensional plane corresponded to a direction around the nodes, and that this in turn corresponds to a set of indices that can be used to do a search. I've been wondering how many "normals" there are to a 4-dimensional plane, and if they would each correspond to a set of indices that can be used to search the space. To do that I'd need to know how many normals there are to a 4 dimensional plane (a question I haven't checked the answer to yet) and how many sets of indices there are which can search "quad" statements instead of triples. We had found about a dozen 3 years ago with an ad hoc approach, but I needed to search the whole set of possibilities exhaustively so that I knew I had them all correct.

Index Calculations
Thanks to AM for helping me on the following.

There are 24 possible indices on the 4 elements of a quad (that's 4!. Note that I'm saying "4 factorial" and not just adding emphasis). It doesn't take long to show that matching any individual node takes a minimum of 4 indices, and a pair of nodes needs a minimum of 6 indices (these 6 subsume the first 4). Each of the indices must be selected from one of 6 groups of 4 indices. To allow searching on 3 nodes restricts the choice from these groups, as certain combinations are mutually exclusive. Searching on 4 nodes need not be considered, as any index will do this for you.
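One reading of those "6 groups of 4" is to partition the 24 orderings by the unordered pair in their first two positions: a query binding a pair of nodes can only be served by an index from the matching group. A few lines confirm the partition (this is my interpretation of the grouping, sketched under the usual prefix-search assumption):

```python
from itertools import permutations
from collections import defaultdict

# Group the 24 orderings of (subject, predicate, object, graph) by the
# unordered pair occupying their first two positions. Each of the 6
# possible leading pairs admits exactly 2 x 2 = 4 orderings.

groups = defaultdict(list)
for perm in permutations("spog"):
    groups[frozenset(perm[:2])].append("".join(perm))

print(len(groups))                              # 6 groups
print(sorted(len(g) for g in groups.values()))  # [4, 4, 4, 4, 4, 4]
```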

Searching for 2 nodes in quad statements proves that 6 indices are required (no, I'm not going to put a proof here - it's after midnight). So we know we need at least 6 indices. It's actually pretty easy to select a set (and we did), but for this exercise I'm interested in finding out how many sets are valid. If we consider all possible combinations of 6 indices from the 24 possible we get 24!/(18! × 6!) = 134,596 (that's 24 × 23 × 22 × 21 × 20 × 19 = 96,909,120 ordered selections, divided by the 6! = 720 orderings of each set). However, this includes a lot of index combinations which won't let you search for certain node combinations.
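The midnight-omitted argument for "at least 6" is short enough to sketch, again assuming an ordered index answers exactly the queries that bind a prefix of its ordering: each index serves exactly one pair-of-nodes query (its first two elements), there are 6 distinct pairs, so 5 indexes can never cover them all, while a suitable 6 can. The 6-index set below is one valid choice for the sketch, not necessarily the set Kowari actually uses:

```python
from itertools import permutations, combinations

# An ordered quad index is assumed to answer the queries binding a prefix
# of its ordering, i.e. its 1-, 2- and 3-element prefix sets.

def prefixes(index):
    return {frozenset(index[:n]) for n in (1, 2, 3)}

def covers_all(indexes):
    """True if every query on 1, 2 or 3 nodes is served by some index."""
    needed = {frozenset(c) for n in (1, 2, 3)
              for c in combinations("spog", n)}
    return needed <= set().union(*(prefixes(i) for i in indexes))

all_indexes = ["".join(p) for p in permutations("spog")]

# No 5 indexes supply all 6 distinct pairs (each supplies only one):
assert all(len({frozenset(i[:2]) for i in five}) < 6
           for five in combinations(all_indexes, 5))

# ...but 6 can. One valid set, covering all 14 query patterns:
six = ["spog", "posg", "ospg", "gspo", "gpos", "gosp"]
print(covers_all(six))  # True
```

Note the valid set supplies all 6 pairs, all 4 single nodes, and all 4 triples among its prefixes, even though some 3-element prefix sets coincide.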

It took a little while, but we eventually arrived at the final figure. The number of sets of indexes which can be used to do any kind of search on a 4-hypergraph is 3520. This was a LOT larger than I'd expected, but we went over it several times, and I'm now confident that it's correct.

At this point I decided that each set does NOT correspond to a normal to the hyperplane. I still think that some sets will correlate with 4 dimensional loops through the points, and perhaps there is a normal to describe this loop, but I no longer think that every possible set will correspond to a loop.

Many of the index sets are almost identical, with only two of the indices swapping the order of their least significant nodes. They would seem to be parallel in their most significant dimensions, and hence demonstrate a lot of redundancy.

AM wants to see if we can extend this calculation to n nodes. After what we had to consider to find this number for just 4 nodes (a fixed number, for a start), that would be a major undertaking. In the meantime I'm thinking I should write up what we did to find this value. I'll try to make a start on that on Saturday.
