Wednesday, June 02, 2004

Resolver
Today was spent wrapping a Session object with the Resolver SPI. After a bit of initial confusion, it's coming along nicely. It turned out that the SPI code I've read over the last few days has been abandoned and new code written. Fortunately the differences are not great, and I was able to clear up the confusion relatively quickly.

I expect that I'll have the wrapper finished some time tomorrow, which will leave me about a day and a half to get all of the testing done.

Walking
AN's document on ontology support includes sections on transitive closure, and also on a separate function he calls walking. I didn't get it originally, thinking that it was a duplicate of the trans function. This was partly because some people have referred to it as relating to transitive predicates (which it does not).

Walking involves following a non-transitive predicate through the graph in the same way that trans follows a transitive one from an anchored node. The result is a set of statements from the original graph, rather than a set of new inferred statements.

The point of a walk is to see where a relationship can take us. For instance, with a non-transitive predicate of "hasFriend" one could imagine a graph of:
  [<Alice> <hasFriend> <Bob>]
  [<Bob> <hasFriend> <David>]
  [<David> <hasFriend> <Eve>]
  [<Eve> <hasFriend> <Fred>]

Just because Alice has Bob as a friend, and Bob has David as a friend, this says nothing about any relationship between Alice and David. Perhaps they have never encountered one another.
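To make this concrete, here is a minimal sketch of a walk in Java (1.4 style, to match JRDFmem). A plain Map from each subject to its list of objects stands in for a real graph index, and the statements are just strings for illustration:
  import java.util.*;

  class Walker {
    // Walks a single predicate out from an anchor node, breadth-first.
    // The result is a set of statements from the original graph, not
    // newly inferred ones.
    static List walk(Map predicateIndex, String anchor) {
      List statements = new ArrayList();
      Set visited = new HashSet();
      LinkedList queue = new LinkedList();
      queue.add(anchor);
      while (!queue.isEmpty()) {
        String subject = (String) queue.removeFirst();
        if (!visited.add(subject)) {
          continue;  // already walked from this node
        }
        List objects = (List) predicateIndex.get(subject);
        if (objects == null) {
          continue;  // no statements anchored here
        }
        for (Iterator i = objects.iterator(); i.hasNext();) {
          String object = (String) i.next();
          statements.add("[<" + subject + "> <hasFriend> <" + object + ">]");
          queue.add(object);
        }
      }
      return statements;
    }
  }
Anchored at Alice this returns all four statements above; anchored at David it returns only the last two.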

Following statements with a particular predicate in this way allows analyses such as the infamous six degrees of separation (or the original Six Degrees of Kevin Bacon). One example that came to mind was scientists trying to trace SARS infections back to the original Chinese carrier. Each transmission was due to contact between individuals, and this path of contact had to be "walked" back to the point of origin. Of course, contact in this case was a non-transitive property. ("Contact" was also a difficult property to measure, given that the contact between the original Chinese sufferer and the people he gave it to was via the lift button in their hotel.)

So being able to follow a chain of predicates in this way is an important tool. Fortunately, much of the work is already done here, as the operations are very similar to following predicates with the trans function. With my luck someone else will end up implementing it, and will decide to do it completely differently. But the principle will be the same.

JRDFmem
I rearranged some of the packages for this last night, and I'm starting to get happy with it. I also moved the reification operation from the NodeFactory and into the Graph. AN has now given me access on SourceForge, so I'll be putting it up soon, as a simple implementation of JRDF. Who knows, maybe someone will find a small triplestore like that to be handy. It's certainly indexed to be fast, and it is serializable.

It occurred to me that I'm creating Long objects quite a lot in JRDFmem. This is because Java 1.4 and below does not support collections of primitive types like long. I've just been indiscriminately storing Longs as node IDs, and didn't think twice about it until today. However, each of those Longs is a whole new object on the heap.

Unlike Strings, I don't believe that Longs get reused, even though they are immutable. Since Longs of the same value get used a lot in this graph, anyone using JRDFmem for a largish graph could run out of memory sooner than they need to. In fact, they will probably run out well before node number 4 billion is reached, so the use of Long as a node ID instead of Integer is probably unduly optimistic.

One idea I have is to cache these objects in an array of Longs. Nodes are rarely discarded, and they are allocated in incrementing order. I could simply store the references to them in the array, and when they are needed I can find the Long I want by indexing into the array. This might look like:
  Long l = longArray[id];
The array would need to be grown when needed, which means copying into a larger array and discarding the old one. This would be costly, both in time for the copy and in space for the array itself, particularly the unused portion. However, if many Long objects are being made for each long value then this cost might be more than recovered by removing the redundancy.

Of course, using this scheme would mean that the node IDs would need to be changed back from longs to ints, as arrays are indexed by int only. It's a shame that templates aren't available here, or at least macros!
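A minimal sketch of that cache, assuming the IDs have already been narrowed to int:
  // A growable cache of Long objects, indexed by node ID. Assumes node
  // IDs are allocated in incrementing order, as described above.
  final class LongCache {
    private Long[] cache = new Long[1024];

    Long get(int id) {
      if (id >= cache.length) {
        // Grow by doubling: one copy now, amortised over many lookups.
        Long[] bigger = new Long[Math.max(cache.length * 2, id + 1)];
        System.arraycopy(cache, 0, bigger, 0, cache.length);
        cache = bigger;
      }
      if (cache[id] == null) {
        cache[id] = new Long(id);  // created once, shared thereafter
      }
      return cache[id];
    }
  }
Every caller asking for the same ID then shares a single object instead of allocating a fresh Long each time.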

Transactions
The other thing I wanted to get done is transactions. I'd like to create a new Transaction interface that extends Graph and has commit and rollback methods. Graphs can then be asked for a new transaction, and the resulting object is a graph that can be used just like the original. Changes get put into the main graph on commit.
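As a sketch, assuming JRDF's Graph and GraphException types:
  import org.jrdf.graph.Graph;
  import org.jrdf.graph.GraphException;

  // The proposed interface: a transaction is itself a graph.
  public interface Transaction extends Graph {
    // Applies this transaction's changes to the parent graph.
    void commit() throws GraphException;

    // Discards this transaction's changes.
    void rollback();
  }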

I'll need to make sure that this works cleanly with the javax.transaction interfaces, but I think I have it now. I can just wrap the Graphs-as-Transaction objects inside an object that implements the javax.transaction interfaces. This wrapping object can keep hold of the original graph and a map of XIDs to transaction graphs. When a transaction is committed it updates the main graph, and notifies all outstanding transactions of the change. Any transactions with modifications incompatible with the new main graph can abort at this point.
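A rough shape for that wrapper might be the following, where the transaction factory and the conflict check are placeholders rather than anything in an existing interface:
  import java.util.HashMap;
  import java.util.Iterator;
  import java.util.Map;
  import javax.transaction.xa.Xid;
  import org.jrdf.graph.Graph;
  import org.jrdf.graph.GraphException;

  // Holds the main graph plus one transaction graph per XID.
  abstract class TransactionalGraph {
    private final Graph mainGraph;
    private final Map transactions = new HashMap();  // Xid -> Transaction

    TransactionalGraph(Graph mainGraph) {
      this.mainGraph = mainGraph;
    }

    // How transaction graphs get created is left open here.
    protected abstract Transaction newTransaction(Graph parent);

    // Placeholder: the real check would compare the two transactions'
    // records for incompatible modifications.
    protected abstract boolean conflicts(Transaction a, Transaction b);

    synchronized Transaction begin(Xid xid) {
      Transaction tx = newTransaction(mainGraph);
      transactions.put(xid, tx);
      return tx;
    }

    synchronized void commit(Xid xid) throws GraphException {
      Transaction tx = (Transaction) transactions.remove(xid);
      tx.commit();  // plays the transaction's record into the main graph
      // Notify the outstanding transactions; any holding changes that
      // are incompatible with the new main graph can abort early.
      for (Iterator i = transactions.values().iterator(); i.hasNext();) {
        Transaction other = (Transaction) i.next();
        if (conflicts(other, tx)) {
          other.rollback();
          i.remove();
        }
      }
    }
  }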

Transaction graphs will be relatively simple. They can hold a reference to the original graph, plus their own data. As statements are added, they are put into the transaction's own index and added to a record. As statements are removed, they are removed from the local index if they exist there; otherwise a whiteout entry is put into the record.

Queries on these transaction graphs are a combination of a query on the main graph (with checks for whiteouts) and a query on the local index. When a transaction graph gets committed it can play back its record against the main graph. At that point every other transaction should check its record against the new main graph, to see if it should abort early.
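As a sketch of that local state and the query combination, with a contains(Triple) check on the parent graph assumed for brevity:
  import java.util.HashSet;
  import java.util.Set;
  import org.jrdf.graph.Graph;
  import org.jrdf.graph.GraphException;
  import org.jrdf.graph.Triple;

  // The data a transaction graph keeps on top of its parent.
  class TransactionData {
    private final Graph parent;
    private final Set added = new HashSet();      // local index of additions
    private final Set whiteouts = new HashSet();  // removals against the parent

    TransactionData(Graph parent) {
      this.parent = parent;
    }

    void add(Triple t) {
      whiteouts.remove(t);  // re-adding cancels an earlier whiteout
      added.add(t);
    }

    void remove(Triple t) {
      if (!added.remove(t)) {
        whiteouts.add(t);   // the statement exists only in the parent
      }
    }

    // A query is the parent's answer filtered for whiteouts, combined
    // with the local additions.
    boolean contains(Triple t) throws GraphException {
      return added.contains(t)
          || (parent.contains(t) && !whiteouts.contains(t));
    }
  }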

I'm considering something like this for the disk-based triplestore. It's based on an idea by DM. In particular, he suggests that each transaction be an entire copy of the graph, as copies of files are relatively quick (particularly with operating systems that support write-behind). This changes the description above only a little. Instead of holding a reference to the original graph and combining queries over the original plus local data, each transaction would just query its own copy, and whiteouts on the original data would not be needed.

In this system, when a transaction gets committed it can just become the main graph. All other transactions would then take a copy of the new main graph and replay their modifications against it. Afterward, the next transaction to commit becomes the new main graph in turn.

This scheme is simpler than what I'm considering for JRDFmem, but at the expense of disk space. However, disk space is cheap and expandable, while RAM is not, and that is the primary consideration for JRDFmem.

Maybe I shouldn't get carried away. It's only meant to be a simple triplestore!
