Today started by trying to work out time estimates for the new "Resolver" work.
I still need to speak to TJ to see if we want trans to be able to operate over models from different sources. If this is permitted, it will require tuples to be translated to and from global space, forcing a performance hit. On the other hand, perhaps it would be possible to keep the current code for trans, executing it only when models in the same data store are being queried.
The other area I looked at is how the special Node-Type model will work in the new system. The current model reflects the contents of a single string pool on a single server, a concept which does not translate well into the global system.
After much discussion with AN and SR, I've come to the conclusion that the query code which accesses the resolver interface will need to create the virtual model. Every node which comes in via the resolver interface is either already available in the local string pools (including a temporary string pool which belongs to the query), or is returned in global format and is stored into the string pools during localisation. These string pools then contain all of the information needed to determine the types of the nodes returned from the query. They will be missing much of the data pertaining to nodes which were not returned by the current query, but that is a benefit for the efficiency of the system.
The only shortcoming of this approach is that it will not be possible to query for all of the strings in a system, but this was never a requirement anyway. Until now, this feature has just been a useful debugging tool.
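The localisation step described above can be sketched roughly as follows. This is only an illustration of the idea, with hypothetical class and method names, not Kowari's actual string pool API: each global value seen through a resolver is assigned a local node id on first sight, and the pool can map back again when results are globalised.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a per-query string pool that localises global
// values to node ids, and can globalise them again for the results.
class LocalStringPool {
    private final Map<String, Long> valueToNode = new HashMap<>();
    private final Map<Long, String> nodeToValue = new HashMap<>();
    private long nextNode = 1;

    // Return the local node id for a global value, allocating a new id
    // the first time the value is seen.
    long localize(String globalValue) {
        return valueToNode.computeIfAbsent(globalValue, v -> {
            long node = nextNode++;
            nodeToValue.put(node, v);
            return node;
        });
    }

    // Reverse mapping, used when query results are globalised again.
    String globalize(long node) {
        return nodeToValue.get(node);
    }
}
```

The point of the sketch is that the pool only ever learns about nodes that actually pass through the query, which is exactly why it ends up holding the type information for every returned node and nothing more.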
I spent some hours with TJ trying to find the problems associated with the Node-Types test. We have been able to show that there appear to be no problems when a moderate amount of data has been inserted into the database. The problems only manifest when a large amount of data is present.
Selecting all literals from the system has shown that all of the expected data is present, which means that something is going wrong in the join code. More specifically, the requirement of a large amount of data means that the problem stems from a join against a file-backed HybridTuples object. This is because HybridTuples is specifically designed to switch from a memory-only mode to a file-based mode once it holds more than a certain quantity of data.
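The mode switch that seems to be implicated can be sketched like this. The class below is a hypothetical illustration of the pattern, not Kowari's real HybridTuples implementation, which is considerably more involved:

```java
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a tuples buffer that starts memory-only and
// spills to a backing file once a row limit is exceeded.
class SpillingBuffer {
    private final int limit;            // rows to hold before spilling
    private List<long[]> memory = new ArrayList<>();
    private File backing;               // null while still memory-only
    private DataOutputStream out;

    SpillingBuffer(int limit) { this.limit = limit; }

    void append(long[] row) throws IOException {
        if (backing == null && memory.size() >= limit) {
            spill();                    // switch modes exactly once
        }
        if (backing == null) {
            memory.add(row);
        } else {
            for (long v : row) out.writeLong(v);
        }
    }

    boolean onDisk() { return backing != null; }

    private void spill() throws IOException {
        backing = File.createTempFile("tuples", ".dat");
        backing.deleteOnExit();
        out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(backing)));
        for (long[] row : memory) {
            for (long v : row) out.writeLong(v);
        }
        memory = null;                  // rows now live on disk only
    }
}
```

A bug that only appears with large data is exactly the kind that hides on the file-backed side of a switch like this, since small tests never cross the threshold.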
I took AM through the code that I am using to sort and join the literals, and he said that it all looked OK at first glance. It was at this point that I discovered that my code had been changed. It used to select the doubles, dates and strings from the string pool, then sort and append them all. Now it gets two groups called "UNTYPED_LITERAL" and "TYPED_LITERAL". Typing this just now, I realised that DM changed this as a consequence of the new string pool, which is now capable of storing many more types than were previously available.
With sorting and appending the literals checked, it would seem that it is the join that is at fault. AM does not know where the problem might be, but he concedes that the code is not well tested, as it is difficult to create the correct conditions for it.
In the meantime, I have been trying to help AM by creating a smaller dataset to demonstrate this problem. We tried changing the size at which HybridTuples switches from memory to disk, but the minimum allowable size seems to be an 8kB block. This is larger than the data that many of our test string pools return.
I thought to load a WordNet backup file, and then output it to RDF-XML where I could trim it to a large but manageable size. Unfortunately, the write operation died with a NullPointerException. I'm not really surprised, given the size of the data, but I should report it as a bug in the Kowari system.
A couple of attempts with the N3 writer also received the same error, but I decided that it might be due to a problem with the state of the system. After a restart I was able to dump the whole lot to N3. I'll try and truncate this in the morning.
My first item of the morning was a discussion with AN on the excludes operation. It turns out that he has recently implemented a new semantic for it: when used in a query with an empty select clause it changes its meaning, returning true if the statement does not exist in the database, and false if it does. While I don't like the syntax, the effect is useful.
If this is applied in a subquery, it provides the function of filtering out all statements which do not fit a requirement. This is essentially the difference operator I have needed. It is not a full difference operation, as it will not work on complex constraint operations in the subquery, but to date I have not needed that complete functionality. Besides, there may be some way to have a complex operation like that expressed in the outer query with the results passed into the subquery for removal.
Pleased as I am that the operation is now available to me, I still have a couple of concerns with it. To start with, I dislike changing the semantics of a keyword like this. Amusingly, AN suggested changing the name back to "not", as the new usage is semantically closer to that word. I've realised lately that the word "not" has numerous semantics when applied to databases (every developer has a slightly different idea of what dataset should be returned), so using a word of "elastic meaning" in a place where the semantics can change based on context seems appropriate.
I'm also uncomfortable that this operation is done as a subquery. While it works, it has two drawbacks. The first is the complexity of the resulting query. It is a simpler construct than many I've seen, but it is still ugly and likely to land users in trouble. The second is the fact that the query layer must be traversed with each iterative execution of the subquery. This is made especially bad as the data gets globalised and re-localised in the process. The result is an extremely inefficient piece of code. If it were implemented as a difference operator instead, then all of the operations could be performed on localised tuples. This is about as fast as it gets, so it offers real scope for improvement if we need it.
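A difference over localised tuples amounts to little more than a single merge pass over two inputs sorted on the join columns, which is why it avoids the globalise/re-localise round trip entirely. A rough sketch of the idea, with hypothetical names rather than Kowari code:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a tuples difference: keep the rows of 'left'
// that have no matching row in 'right'. Both inputs are assumed sorted,
// so one merge pass over localised rows suffices.
class TuplesDifference {
    static List<long[]> difference(List<long[]> left, List<long[]> right) {
        List<long[]> result = new ArrayList<>();
        int j = 0;
        for (long[] row : left) {
            // Advance the right side past rows that sort before 'row'.
            while (j < right.size() && compare(right.get(j), row) < 0) j++;
            if (j >= right.size() || compare(right.get(j), row) != 0) {
                result.add(row);        // no match on the right: keep it
            }
        }
        return result;
    }

    private static int compare(long[] a, long[] b) {
        return Arrays.compare(a, b);    // lexicographic, Java 9+
    }
}
```

Everything here operates on local node ids, so the whole removal stays below the query layer; that is the gap between this and iterating a subquery per row.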
In the meantime, it works, so I'll work with what we have. If it proves to be too slow, or the syntax proves too troublesome, then we know we can fix it easily.
Monday, October 11, 2004