Thursday, May 20, 2004

Anchored Queries
Anchored queries worked this morning as expected. Funnilly enough, I had thought they would work one particular way, but when I ran it I saw that they work a bit differently. Upon reflection I realised that they were working exactly as designed, I just hadn't realised that I designed them that way!

Spoke to TJ about how he wanted them working, and it turned out that I'd built it exactly as required. It seems that my subconscious designed it for me. I must really need sleep. I suppose late night coding and 3 month old babies don't mix too well.

After typing in a heap of examples to test this new query type I realised that parts of the syntax are redundant, so I spent some time cleaning it up. Unfortunately, the changes were structural in the query object, with consequences running quite deep, so it took a while. However it all seemed to be working when I finished tonight. I'll check it more thoroughly in the morning.

Another hassle was caused this morning when I tried to check the code in. Before checking in, I did an update, and ran the tests. Only the transitive queries failed after doing the update. This was due to Tuples objects being changed, such that they may not necessarily have any columns if they have no rows. My code was assuming that the columns would be there (after all, I did ask for them to be there). It used to work!

Once this was fixed I had to update again, discovered new changes, so ran the tests, updated again, discovered new changes so ran the tests... Sometimes the simple task of checking code in can be quite painful and time consuming. One person suggested that I should just check my code in without running the tests. Obviously he's not the first person in the office to think of that! :-)

It's late, so I'll just mention some random thoughts here...

While TJ often suggests it, inferencing can't really be done for each query. The task is just too big to make it fast. That probably means that we need to do inferences after every commit. That's fine for large transactions, but will hurt when autocommit is on. We may need to offer an option to suspend inferencing for a while.

Inferences can be made with existing iTQL queries. Do we implement them in this way? The transitive predicates (subClassOf, subPropertyOf) are definately implemented an order of mangitude faster with the new code. So should we consider this for the other rules as well?

Doing inferenced rules in iTQL is easy, particularly since the XML which is used for RDFS can be translated into iTQL so easily. However, the speed improvements available with doing most, or all, of the internally would seem to outweigh this. Besides, Jena has gone the way of providing Ontology support over the top of the existing interfaces, and while it works, it doesn't scale.

Once we have the inferred data, then where do we storeit? It can't go in memory for scalability reasons. At first glance it might seem to go in with the original data, but what if that data changes? We can re-infer from the data, but how do we know which statements should no longer be inferred if the data they were based on is no longer there?

The first solution would seem to be to reify all inferred statements, and statements used for inferences. While this provides great flexibility, and offers efficiency opportunities when minor changes are made, the overhead in time and space would be huge. I suppose it will depend on the size of the inferred data relative to the original data, but if this system is going to work at all, then the inferred data should be significantly larger than the original data.

Wouldn't it be nice to tell an ontology store a couple of axioms and have it tell you everything you ever needed to know about that ontology? :-) I'd ask it how to program this stuff....

The second solution is to put all inferred statements into a numerically adjacent model. Queries which restrict searches to a particular model could be trivially changed to look in an adjacent model as well and the code would take no longer to execute. This means that we would be able to consider the two models as one for most operations, but the differentiator would still be there when needed. Unfortunately, without reified statements the odds are good that the entire model would need to be dropped and rebuilt after transactions.

It might be possible to obtain a list of all statements inferred in any way from an original statement, and to use this list when that statement is changed or deleted. This would reduce work on an ontology. However, once the size of a transaction gets beyond a particular point there will be a lot of duplication of effort, and using the completed set of original data to create the whole set of inferred statements again from scratch will be quicker. Maybe we should do both, and find an approximate switchover point where we go from one technique to the other.

1 comment:

Quoll said...
This comment has been removed by a blog administrator.