Working notes: 06/01/2004

Wednesday, June 30, 2004

Mozilla
I finally fixed the problem with mozilla-xremote-client. It ended up being associated with the Gnome preferred browser. Not sure how this setting was ever lost, but it's fixed now, and I can launch new Mozilla windows with impunity.

Now that Mozilla 1.7 provides access to all the editing widgets for Blogspot (including spelling) I'm finding that I'm pretty happy with it again. Though I'm still a little impatient for that Pango version! Any Linux users who don't know what I'm referring to, should give the 1.6 version a try.

Remote Resolver
Not a lot to report for today. The JUnit output logs indicated that a QueryException was being thrown from the Database class. I included a log statement just before this exception was thrown, so I could see what was happening. However, I didn't see anything come up in the logs. The line number of the thrown exception had changed accordingly, but the log was no where to be found, even though other logging statements from the same class were showing up.

It was only after abandoning this tack that I realised that the class was being run on both ends of an RMI connection. The log file I had was being generated at the server side of the connection, but the JUnit code that was failing was not generating a log at all. I finally got to see what I was looking for when I put my debugging log into a string and tacked it onto the end of the thrown exception. I should have done this much earlier, but sometimes a seemingly simple problem can entice you into spending more time on it than it is worth.

The problem ended up being due to the need to load a file into a model in order to have queryable data in it. This was done with the setModel SPI call which accepts URIs for the file to load and the model to insert into. My mistake was that the "file to be loaded" was not really that at all. Several weeks ago I'd asked "how do I load a file into one of these things?" and was told to use setModel, passing the URL of the file. I made the mistaken inference that this URL could only be for a file, where in fact it is for any model type at all. It wasn't all that long ago that this would have been right, but the resolvers were specifically written to make this kind of change, so I shouldn't have been surprised.

The Remote Resolver I'm testing had only registered that it can handle SOAP and RMI protocols (as these are the only remote protocols we support for the moment). There seemed to be no need to register any other protocol. But in order to load a file it also need to register that it can handle the "file" protocol. I suppose this makes sense in the more general case, as it will allow a user to use a resolver to query a file on a remote machine that has not yet been loaded into a model.

Of course, this was only the first step. I'm now trying to work out just why none of the tests are able to commit their queries. The problem is being caused by an exception that says: "Unable to get an appropriate session". I was able to see the code where this happens, but I've yet to determine why the correct type of session is not already available.

I may save myself some time if I ask AM for some more help. He wrote a lot of this framework, particularly the test framework, and his help today in working out my protocol issue was invaluable.

Tuesday, June 29, 2004

Mozilla
I've been so caught up with things tonight that I nearly forgot to post. Then I discovered that a recent crash of my Linux system (no idea why... uptime was about 3 months at that point) had corrupted Mozilla somehow. So more delays while I upgraded to 1.7. I'm missing the Pango port, so I might go searching for that soon and see if it's available in 1.7 yet.

Unfortunately the new installation of Mozilla didn't fix my problem of mozilla-xremote-client failing to do anything. It does work when I ssh to this box with X forwarding on and execute it, but it does nothing when run locally. There is almost no documentation for mozilla-xremote-client anywhere, and I've yet to see anyone suggest what can cause it to not do anything when run. It's really annoying. Do I resort to strace? Do I download the code and debug it? I'd rather spend my time on other problems.

Resolver Paths
I spent a lot of today struggling with the modifications that occurred in the last week. Once I had everything sorted for a clean build I was back to my 11 day old JUnit errors. While not immediately apparent, the problem turned out to be due to classpath problems again, this time with reflection code, so the Exception information was often misleading.

I then spent some time in a cycle of running the test, finding the missing class in the exception traces, using jar/grep to find the library with the missing class, inserting it into the path in the ant script, and re-running the test. I ended up with a LOT of jar files in my path before things finally started to run.

Once running, I found that I had promoted all the errors up to failures, except for one which passed. The failing tests are describing a misconfiguration of a subsystem I'm not familiar with, so I may be on a steep learning curve in the morning. It would feel like progress if it weren't for the fact that there is still so much to be done.

Indexes
I'd love to spend some time talking about this, but I have a thesis proposal to write, and I promised myself I'd finish it tonight. So I'll be short.

For a little while now I've been considering B-trees as an index for a triplestore. After all, every major database uses them, and they certainly scale well for searching. It was almost accidental that Kowari started using AVL trees, and I've been considering what the tradeoffs are between the two approaches. This led to a significant discussion on it this afternoon, so I figure it's about time I log it.

B-trees of any reasonable order have significantly less depth than AVL trees. This is because AVL trees only have left and right children while B-trees have many. While trees are mapped into memory the depth is of little real concern, but now that we are looking at indexes of several GB this starts to become more of a concern. Unfortunately, B-trees start to lose out a little here, as they do not fill up their nodes completely, with each node having "some number" of entries between N/2 and N, where N is the order of the tree. Having slack space like this makes the index about 25% larger than it needs to be. This pushes B-tree indexes out of a range that can be memory-mapped or cached, much sooner.

Inserting into a B-tree can be a very cheap operation if the target node is only partially full, or it may be more complex than an AVL tree insertion if it is completely full. At this point Kowari differs from the standard AVL tree. Instead of storing a node for every triple, Kowari stores indexes into a block file. This file contains blocks of up to 512 sorted triples (8kB). If a triple is being inserted, the correct AVL node is located and used to find the appropriate block. The block is then read, the triple inserted, and then the block is written back to disk. If the block is already full then it is split into two blocks, and a new AVL node is created to point to it. AVL nodes also hold the maximum and minimum triples of each block, so comparisons can be done on the node without having to follow an indirection.

Having each node refer to a list of sorted data, and having full lists broken up when inserted into, is almost exactly the process used by B-trees. The more I look at it, the more I realise that we have merged certain concepts from both B-trees and AVL trees, often gaining in the process. For instance, the size of the index is orders of magnitude smaller than a standard AVL tree due to the indirection into a block of triples for each node, and it uses its space more efficiently than a standard B-tree, keeping the index smaller still.

Conversely, there are associated problems with the hybrid approach. The restriction to just 2 children makes the tree significantly deeper, which makes lookups slower once the index gets too large to keep in memory.

To be really scalable we do need a new index. DM will be working on using skip lists to allow for fast loads of large datasets, but this approach leads to read-only files. Future inserts create new files, and a background task will be merging files when possible. This has the major advantage of using sequential reading of a hard drive to a much greater potential, but adds complexity, and does not easily allow for simple modifications of the graph.

For this reason I'm thinking I'd like to try B-trees, only with the current block files, and the triple range comparators we currently use. This would end up storing lists of triples in the blocks, and lists of blocks in each node. Certainly not a groundbreaking design, but it would be interesting to see the performance impact it provides. Making the order of the B-tree nodes a tunable parameter might provide some interesting results as well.

Monday, June 28, 2004

Back at Work
A lot of today was spent catching up on the last week. There was quite a bit of an email backlog to get through, though I'm holding off on the RDF mailing list until I feel ready to tackle it again... probably straight after I post this. If I'm lucky then half of the messages will just be calls for papers to conferences I could never attend.

The first thing I asked TJ this morning was whether or not the RMI reconnection code had worked. Unfortunately it hadn't. That's what you get for trying to implement something in a rush. It turned out that I had kept a remote reference to the Session Factory and was asking it for a new Session when the old one failed. Obviously if I can no longer talk to a remote Session due to a server restart, then I can't talk to the remote Session Factory either. Fortunately it was an easy fix for TJ last Monday morning, and the rest of the code worked fine.

That left me fixing the remote resolver JUnit error that I was left with just before my break. However things change in a week. Several changes were made which had a significant impact on the new resolver code, and yet not all of these changes were put into the resolver classes. Since the resolver code only gets built when explicitly asked for, the missing code got missed in the build process, and the classes were all checked in. These classes were missing methods, contained typos, and had other errors that you might expect from files that had never been compiled. While the changes really belonged to others, I worked on making the required fixes on my own checkout so I could proceed. Until I can make it compile I can't even get back to doing the tests I was unsuccessfully running over a week ago. I had all but 2 files finished by the time I left, but I may not need to do those last couple of files as it looked like AN will have most of it done independently by tomorrow.

Once I get the remote resolver tested and debugged I've been told that I should be able to get back to the inferencing code. I'm looking forward to that.

Notebook
Today I heard the frustrating news that Apple Australia have no 12" PowerBooks left in stock. They also have no idea when they expect the next shipment.

One reason that this situation has occurred in the past was when Apple were about to release a new PowerBook range. With WWDC2004 being this week then that is definitely a possibility. However, Apple claim to avoid doing that anymore due to decreased sales in the lead up to each WWDC, so a new range of Apples might just be wishful thinking.

Triplestore
I received an email asking where I am with the triplestore I'm trying to build in my own time. Now that I'm back from my break I'd better get some more code written! Still on the file based hashtable, so I'll aim to have it done by end of the week.

Speaking of out-of-hours occupations, this blog held a surprise for me. Over the weekend I looked at the hit tracker for this blog, expecting to see very few hits as I hadn't been posting while I was away. To the contrary, I had more hits than I've had in weeks, and it was still growing. It turned out that it was almost entirely due to Danny Ayers. That was quite a commendation Danny (and very surprising). Thanks!

I was teased today for my newfound whuffie rating (which I'd never heard of before today). :-)

Friday, June 25, 2004

Holidays
For those who didn't know, I've been on holidays this week. No work meant no need to blog. I'd meant to mention this in my entry for Friday evening when I finished last week, but I was caught up with trying to get away and didn't write it.

Anyway, I thought I'd better write what I remember of that Friday before I get back to work on Monday.

Reflections on SessionFactoryFinder
My morning was spent tracking down this week-old problem in the resolver code. It became apparent very quickly that the problem was in the path being provided by Ant, but at first glance the XML configuring this looked fine. What I didn't realise was that the jar files mentioned for the path were all in the obj directory. I hadn't really looked at this directory before, but each of the jars in there all hold only a small portion of the system. They get glued together to create the distribution jars.

After much experimentation and grepping through jar listings, I was able to get the junit tests running. They reported an error, but that's a lot further than they had been getting. :-)

The plan is to track down this error on Monday.

RMI Retries
At the moment, if a server is restarted then any iTQL command line programs will also need to be restarted. This is because iTQLInterpreter objects keep hold of an RMI session object, and use this object for all queries. If the server is restarted then this remote object is destroyed, and all subsequent attempts to use it will result in an RMI failure.

While correct in one sense, this behaviour is really annoying. Since sessions are stateless, it is possible to obtain a new session when an old one dies, and to use that session instead. In fact, all methods on a session are effectively static, but static methods can't be described in an interface. More importantly, RMI does not support the concept of a class method.

The solution here is to catch all exceptions when calling a session object, and find those which were caused by an object no longer existing. At this point a new session can be instantiated, and the original call retried. It may be better to do this with a proxy class, but for the moment I did it with brute force.

To prevent infinite recursion on failed connections I introduced a retry count. This is set to the default value on creation of the remote session class, and is decremented with each reconnection attempt. When it hits zero a reconnection is not attempted and the original exception is thrown. After a method is successfully called, or an exception is sent back to the caller, then this value is reset to the original retry value.

Most of this functionality happens in one private method, but the reset of the retry count has to be done in each individual method invoking the remote session. This was boilerplate for most of the methods, which is why I started thinking that a proxy class may have been better. However, the need to get it running before I left, and the intricacies of several methods which didn't behave in the same way as the rest, led me to continue changing the remote session class without introducing any new ones.

The changes appeared to work initially, but after a couple of restarts of the server I had a strange exception come up. It was due to some of the changes caused by working with the resolver classes, but that wouldn't be a problem for others as my code is not checked in yet. Unfortunately this was at the last minute on Friday, leaving me with no time to try the changes on a clean checkout. I left it all for TJ to integrate, and according to the early tests it should have worked fine for him. I suppose I'll find out in a couple of days.

Notebooks
The reason I was in a hurry to get out on Friday was because someone had arranged to come over to check out the Dell notebook I was selling. Turned out that he liked it, and I sold it on the spot. So in some of my free time this week I ordered a new PowerBook from a friend at Apple. It doesn't have many of the features I'd like (backlit keyboard being first among them), but it's mostly there. In the end I opted for the 12" PowerBook as it is powerful enough for what I want to do, and is portable enough to be useful. Of course I love large screens, but that is counter to the desire for portability. Besides, large screens are what desktop machines are for.

Now I have to see if I can get MythTV going under OSX.

Masters
While I was away during the week I've had several items show up for my application to UQ. CQU have sent a corrected set of results, and our company has provided a letter of support for my study (unlike some universities, UQ does not require this, but I'm told it might help). I'd better get that proposal finished soon!

Reading
I did no where near as much reading as I thought I would this week, and most of it was on subject matter that I hadn't planned on. Holidays can be like that. I'm part way into a book called "Teach Yourself Mathematical Groups" by Tony Barnard and Hugh Neill. Kind of an "Idiots' Guide". I'm about a third of the way through, and I haven't encountered anything new yet, but I think I'll be covering new ground in the next chapter. It's been a while since I've looked at the syntax the authors use, so it has been helpful covering the old ground, plus it helps stroke my ego before it gets shredded with the tough content later.

I also started "C++ GUI Programming With QT3", by Jasmin Blanchette and Mark Summerfield. I've been meaning to start this for a while now. I already know a little GTK/Gnome, but I wanted to learn more QT to program my Zaurus (which I'm sure is feeling neglected by now). I originally opted away from QT because it only really supports C++, but who am I kidding? I was going to program GTK in C++ anyway!

I'm only 4 chapters into the book so far, but I'm already starting to form opinions. Those parts which describe pure code implementations are really well written, while those describing QT Designer leave me unsatisfied. The pure code descriptions provide useful examples, with each element described in detail. Alternatives and unused options are also described, answering all of the questions I've had so far. Conversely, the QT Designer sections tend to use examples with explanations that only cover the basics, with no or deferred explanations of other options which are clearly available. I suspect this is due to the dual authorship. Overall though I'm still learning from it, and it appears to be a useful book.

Thursday, June 17, 2004

Mozilla and Blogger
OK, this is weird. I've been taking it for granted that editing a blog with Mozilla meant that I had no way to "Preview" the page, nor could I use any of the buttons used for formatting. But I'm using my work machine, with the same build of Mozilla, and I get all those options. (Spelling does not work though. Does anyone know if this can be made to work on Mozilla?)

The only difference I can think of is the fact that my home machine can't run mozilla-xremote-client. I have no idea why. It used to work, and then I tried to install Java support, and then it didn't work anymore, even when I removed the Java link. This problem might be related to having trouble creating windows, and that might then explain the missing buttons on Blogger pages.

Interestingly, if I connect into my machine via ssh with X forwarding enabled, then I can run mozilla-xremote-client and it will work. The difference may be due to windows and widgets that are opened by mozilla-xremote-client all coming from remote machine. But I have no idea what is really happening.

Maybe I should take DM's advice and reinstall Linux. I've been running the same drive image since late 1999, so it's getting on a bit. As I've upgraded the hardware in the computer I've kept the drive, and just reconfigured the device drivers. Whenever I've upgraded the hard drive I just do a "tar -a" (with appropriate options for links, etc) from one drive to the next, and then run LILO to boot to the new drive. Every time I do this I find myself impressed once more with Linux. :-)

5 years is probably long enough though. I'll think about it...

Unclosed Tuples
I finally got them all! Then I ran the full gamut of tests and everything was successful. I was feeling pretty good. So I did a CVS update, and discovered that there have been quite a few changes lately, so I re-ran the tests. Sure enough, there were new Unclosed Tuples to fix. Everyone agrees that this needs to be fixed properly, but I'm still stuck on fixing them one at a time, so I went back in to work on it.

The problem was with a close() in the Query class. DM commented it out to get certain types of queries to work. The Answer object being closed was sometimes closed in other places, and sometimes not. Since closing killed the running query, and not closing only caused an error in the log, the option of commenting out the close statement made sense.

The Answer passed as the GIVEN clause for a Query is always owned by the resulting Query object. This answer is then closed by closing the query. At this point I discovered that most parts of the code do not close Query objects. That is a mistake, so I added the query closure statements, and promptly got a NullPointerException.

Looking at the Query.close() method revealed that it iterates over its variable list. Only it is possible to create a Query object with a null variable list, which would guarantee the close method would fail. In other words, whoever wrote the close method only intended to call it under certain circumstances (ie. when the variable list is not null). I don't know exactly what those circumstances are, as there may be other preconditions, and there is no documentation to describe it. So I made the close method safe by doing a test for null on the variable list, and then I made sure that Query objects always get closed. Naturally, this wasn't immediately successful, but I've found all the offending code.

I now have it working again, and fully checked in. I'll give it until next week before someone else introduces a new "Unclosed Tuples" error message.

This really needs to be fixed correctly. Maybe if someone else fixes it next time they will complain enough that TJ will decide to tell someone to do it. :-)

Wednesday, June 16, 2004

Unclosed Tuples
Finding that final unclosed tuples took longer than expected. I should re-read The Mythical Man-Month.

I needed to insert a LOT of logging before I found the missing close statement. I finally found it in the loop that performs subqueries. Subqueries use a tuples as a GIVEN clause for the query, and then take the tuples resulting from that query as the given clause to the next subquery. This continues until there are no more subqueries or the tuples has no elements in it. The final value of the tuples is then the final result to be returned.

The structure of the format for this code misled me into thinking that the tuples would be closed because it was returned. However, I missed the fact that a tuples object was abandoned in every iteration of the loop. This happened when the reference to the tuples was set to the result of the next subquery. Once this was fixed I was able to perform all of my queries with no warnings. There is one more query I should probably do in the morning, but I believe that this won't be testing any new code paths.

Statement Walks
AN had some trouble with the walk statement today. His code is based on my trans code, so he asked me to help for a bit.

I started out by describing some of what I had intended when I wrote the code, as the expression of these ideas may not have been very clear when read in Java. This may not have been as useful for AN as it was for me. It's often difficult to remember exactly what you were doing several weeks ago. I think I got it right on about the fifth attempt. :-)

Most of AN's code was fine, and I think that most of his problem was just due to staring at it for too long (yes AN, I did believe your excuse... this time). He had just started the walk from the wrong place. He also had a pair of column names in the wrong order for some query types. These were both easy fixes, and it wasn't long before his code was working fine and I was abandoned to my itinerant Tuples again.

Witty Worm
Today's Crypto-Gram has an interesting story about the Witty Worm. I've been hearing lots of things said about worms over the years, including the fact that they are always poorly written, that they do not usually spread efficiently, and the ones which are any good are normally not malicious. Witty adheres to none of this. It was just fortunate that it only worked on a BlackICE/RealSecure vulnerability, and not on the myriad vulnerabilities found in most Windows servers.

If this is a taste of things to come, then it's scary. Maybe it will force Microsoft to make sure Windows boxes are more secure, but I'm skeptical. I'm glad I run much more secure systems, but we all suffer from internet slowdown when the majority of Windows computers get infected with something virulent.

Tuesday, June 15, 2004

Friday
Big family committments over the weekend, so I didn't get the chance to blog on Friday night, nor any time over the weekend. We had the "Queen's Birthday" long weekend here, hence no post for Monday either.

TKS Release
Friday was the latest Beta release. AM still needed assistance, so I spent the whole day on that.

After squashing some of Thursday night's bugs I spent much of Friday testing, and generally being available to help with further bugs. AM kept finding problems with various queries, and I ended up staying late fixing them.

DM was also helping out, and I was embarressed to learn that one of his bugs was caused by me. I had a typo in my Thursday night fix for serialization of the Answer object used as the GIVEN clause for GlobalizedAnswer. The code I wrote was supposed to store the old Answer, copy it into a serializable answer and then close the original. Only I closed the new one. This made it serializable, but if there was anything in it to be used, then accessing it resulted in a NullPointerException. My own tests needed to serialize this object, but I had no GIVEN clause, so it was never accessed. DM's tests had a GIVEN, and so he saw the exception. Oops.

At the end of the day the biggest problem was with unclosed Tuples objects showing up everywhere. Closing things isn't all that common in Java coding, so I should explain what this means. To be fully scalable, Tuples may be backed by disk: either the actual index, or a temporary file. We have a couple of mechanisms to allow us to transparently use either Mapped IO or standard IO, and one of these is the BlockFile class. This lets us ask for "blocks" in a file, and then just access these blocks as a java.nio.ByteBuffer or related class (such as an IntBuffer).

Early on, we found that allocating new objects each time we wanted to access a block in a file was becoming a significant overhead. We addressed this with object pooling. As a result, whenever a block is finished with it must be "released" back into the pool. If blocks are held onto then we end up with excessive memory usage, and have even seen OutOfMemoryExceptions on occasion. So closing Tuples and anything that wraps them (such as Answers) is essential. To help in this we have a finalize method on TuplesImpl which logs an error if the object was not already closed. We used to have a call to close in this method, but it was removed once it was no longer needed.

Unfortunately, SR doesn't always think of the expense of using objects, and he uses clone extensively. He also tends to create objects inline for temporary use. This is usually fine for Java, but it is a really bad idea when applied to objects which need closing like this.

Back in February I gained some practice finding missing close statements, so I was able to find a number of these problems very quickly. By late Friday night I had found and fixed all but 3. I eventually realised that I wouldn't get these last ones in time, so I changed the error level log back down to a debug level and checked it in for TJ (who had already gone home to his family by this time).

Due to time zone differences with the States we sometimes get another crack at a release. When I got in to work this morning TJ said that he could put it out again, and asked that I confirm that all was working well. On further inspection I realised that close had been removed from the TuplesImpl, so I added this back in before handing it on to TJ. The rest of the day was then spent looking for the 3 remaining unclosed Tuples. TJ hopes to do another sub-release soon, in order to catch anything small that we missed.

The tuples in question ended up being wrapped by a Given object which were being used as a constraint expression. Constraints do not have a close method, nor do we want to add one, so I had to hold the reference to the original Tuples and close it when the constraint was finished with - but only when the constraint in question was of concrete type Given. Unfortunately, the constraint was being passed off to an ExecutionPlan which then cloned the GIVEN clause, reordered the clause (which did more cloning) and then did some internal copying of objects. I had to make sure that all of these copies were closed, but again, only when they were of type Given. This all got very messy, very quickly.

I was tempted to give up on a few occasions, and rework the problem properly, but that was going to be a big engineering effort, and there wasn't time for it if I wanted to fix things for TJ's next minor release. I finally fixed the two which were involved with the ExecutionPlan, but I still have one left. I've been adding more logic, and I think I will have it within the first hour tomorrow.

Fixing Tuples
On Friday, AM, DM and myself all had a discussion about these unclosed Tuples. They are a major problem, and all it takes it one small change to cause another leakage. The sensible solution is to automatically register all Tuples in some central place (associated with the current session), and then close them all off at the end of a query or transaction. This will take some time to implement correctly, but it is going to make the system much more reliable, and save us a LOT of time in future.

At the moment, everyone is starting to adhere to a standard whereby any use of a Tuples in a new object will clone the Tuples. This still has problems, but at least it puts any new Tuples in a defined place. The problem is that there are still a lot of classes in the system which don't do this, so the system can be very confusing when you try to find where an unclosed object was allocated.

The more time I spent tracking these bugs down today, the more I realised how essential this change is. Hopefully we'll be allocated the time for it.

Security for Remote Queries
TJ asked that I look into the current remote query code to see if it presents credentials to remote servers. In the last incarnation this definitely happened, but SR may not have had time in his last week. I didn't get as much time as I'd have liked for this, but I couldn't see it anywhere, so this is what I told TJ. It's safest to assume that it's not secure anyway.

We'll have to get the whole "resolver" interface put in soon, and we'll definately make sure that credentials and authentication are covered adequately.

Meanwhile, I've just spoken to SR on AIM, and he says that credentials are being handed over. He points out a couple of problems when witching between multiple servers, but the foundation is definitely there. I'll have to take more time to look at it in the morning.

Power5 CPUs
While waiting for builds and tests on Friday I spent a little time reading two articles in the latest IEEE Micro magazine (sorry, summaries only. Subscription or payment is required for the full articles). They were the articles on the design of the Itanium 2 6M from Intel, and the Power5 from IBM.

Both are impressive chips, and would certainly seem to represent the state of the art at the moment. However, the Power5 seems to me to be a much more elegant, flexible, and scalable design. IBM really do seem to be ahead of Intel on this one.

The Itanium 2 6M is a very hot CPU, though to Intel's credit it is no worse than the previous Itanium 2, coming in at a peak of 130W. This CPU also devotes about 2/3 of its area to 6M of L3 cache, as opposed to IBM who have no onboard L3 cache, but more processing logic.

The Power5 has put its L3 cache off chip, but has kept the index onboard, meaning that lookups are not really much slower, and cache misses cost no extra. There has been a redesign of the memory architecture, and the memory controller has now moved on chip (like the Itanium 2 6M has). This reduces load on the fabric controller, allowing them to scale up to 64 processors, from the 32 that the Power4+ can do.

The slightly slower off-chip L3 is mitigated both by the onboard index, and the larger size: The Itanium has 6MB onboard, while the Power5 has 36MB. The Power5 also has much more L2 cache at 1.875-MB, vs the Itanium at 256kB. Similarly IBM have L1 cache of 32kB and 64kB for data and instructions, while Intel have 16kB for both. Finally, registers are similar for both lines, with Intel having 128 fixed and 128 floating point registers, and IBM having 120 of each.

It seems that Intel have put a lot into having L3 cache onboard, but at the expense of having less memory at all 3 cache levels. Intel also miss out on using that space for processing logic.

The Power5 includes 2 processing cores (like the Power4+ did), with each core executing 2 symmetric threads. This means that each processor can run 4 threads natively. The load balancing features for these threads is really impressive, and makes each core able to use each of its components quite effectively. It's like Hyperthreading, only it's done right. ;-) Each core has been designed to run with a single thread with high efficiency, but there is still a performance gain to be had with a second thread.

One of the most impressive features (for me) was the set of power saving features on the Power5. The chip uses clock gating to reduce power consumption at a fine grained level. This means that if the circuitry can guarantee that a module will not do anything in the next cycle, then the local clock for that module will be gated off, and that module will not get a clock pulse. So if a multiplication unit is being written to, but not read from in the next cycle, then the read ports will not get their clock pulse (leveraging their use of CMOS, which doesn't require power to keep its state). This is all automatic, and has no impact on performance. This seemed to be far ahead of Intel, who chose to use gate types which consume less power in those places where they felt they could afford the differing characteristics.

One thing I'd have liked to see in the IBM article would be power consumption figures, but I'm sure they will be forthcoming. The fact that the Intel chip has more transistors that any other commercial chip in the world (410 million) seems a fair indication that they will be drawing more power.

Overall, I found myself drooling a little over the Power5 chip. Here's hoping that Apple choose to start putting these little monsters into their PowerMac lineup!

Thursday, June 10, 2004

Reflections on Resolvers
Reflection code came easily. All that time implementing JMX for Enhydra I guess. (Was that really 4 years ago?)

Of course actually running the tests showed up lots of things not available. SR has been trying to keep these packages insulated from the rest of the system, but it seems that he's been a little too successful.

Anyway, the tests start successfully, but then the SessionFactoryFinder is failing to find the remote session, and I'm not sure why yet.

Distributed Queries
The next version of TKS is coming out tomorrow. It does a lot more, and it does it all better, which I'm pleased to see, but the resolvers didn't make it this time around. For that reason the lack of distributed resolver tests hasn't become an issue yet.

The old method of doing distributed queries had to be reintroduced, since the plan was to take this task over with the distributed resolver. SR only got onto this task a few days ago, which was poor timing given that he has left for an 8 week trip through the States today. Consequently, what code SR was able to finish has been dumped on AM, to have ready for tomorrow afternoon. Needless to say, AM is now swamped!

As a result, this afternoon several of us in the office were pulled in to give AM a hand with his bugs. Mine was to work out why constraints were not resolving when applied across multiple remote models.

I quickly narrowed the problem down to the constraint operation that joined the two models. Once I got there I started adding logging messages to see exactly what was happening, but I had to laugh when I saw the following:


  if (i == null) {

    logger.warn("ce2riMap.keys() = " + ce2riMap.entrySet());

    try {

      throw new QueryException("Failed to find register matching constraint: " + ce +

          "generateJoin(" + term + ", " + destReg + ", , " + first + ")");

    } catch (QueryException e) {

      logger.warn("In ExecutionPlan.generateJoin", e);

      throw e;

    }

  }

So a message was logged, an exception thrown, and then immediately caught just so it could be logged and then rethrown! Surely this code looked different once upon a time. I can only guess that it got this way by incremental changes accumulating over time, with each coder in too much of a hurry to notice what was happening. This view is supported by the first log message, as it says that it is displaying keys to a map, when it is really displaying entries.

It was the ce2riMap map mention in this code which was the problem. This is a mapping of ConstraintExpressions to an index into the list of registers which holds them. This list is used by the query optimizer to resolve queries. The problem was caused by a type of constraint expression that was being used in this context for the first time. This object did not have defined hashCode or equals methods meaning that it could not be used as a key in a map.

This ConstraintExpression object represents the results from a model, but it doesn't seem to hold its provenance. This means that the only way to compare this object with another for equality would be to iterate over the entire set. Now that would be a bad idea.

Fortunately constraints are only supposed to be resolved once, meaning that if a constraint is going to be compared as equal to another, then it must have been a clone of the original. I was able to take advantage of that behavior and give each constraint an ID in its constructor (which is based on the address of the object, obtained from System.identityHashCode()), and then made sure that ID was duplicated in the clone method. This worked fine.

Fixing this just served to show the next bug. Sometimes this job is like playing with Russian dolls.

The first bug showed up in the iTQL shell with a lot of debugging information describing the problem (in fact, the message shown was a list of the contents of ce2riMap). The new response in the iTQL shell simply said that the query could not be resolved. Log files soon showed that there was an RMI problem.

RMI was throwing an exception when it tried to serialize a GlobalizedAnswer object, for transport from one of the models' servers to the other. RMI exception traces are a pain, as they stop short of telling you where they came from with a line like: "and 21 more...".

I went through every place that an Answer interface gets returned over RMI, and confirmed that on each occasion the object was remotable or serializable. I also added in a heap of logging to tell me exactly what kind of objects were being returned over RMI. The resulting logs consistently showed that the returned objects were all fine, and yet I was still getting the exception trace.

At this point I remembered reading the javadoc for ObjectOutputStream last week (while writing the serialization code for org.jrdf.graph.memory.GraphImpl. Note to self: put this on Sourceforge). It explains that you can prevent serialization by implementing writeObject and throwing a NotSerializableException. This seemed perfect, as it would allow me to log a complete stack trace during serialization.

Of course, the modifications I made seemed to do nothing, until I realised that GlobalizedAnswer now appears in two packages, and having VIM find the class for me with ctags was taking me to the new one which is used for resolvers. Hopefully I'll remember to keep an eye out for this until resolvers are fully integrated and the obsolete packages go away.

Once I had GlobalizedAnswer throwing fully logged serialization exceptions, I quickly realised the obvious. The serialization was happening as the object was being sent to another server, rather than being returned. This is because Query objects hold an optional Answer field as their GIVEN clause. This field is almost never used, except in the context of remote queries, which is exactly why the problem only just started showing up again.

The solution is to make sure that any queries that are to be sent to a remote server, convert any Answers they include into a serializable form. To prevent any unnecessary conversions unless they were required, I added a writeObject method to Query and if the Answer representing the GIVEN clause is not an instance of Serializable then it gets replaced with a serializable Answer built from the old GIVEN. After that the default serialization method for Query gets called.

If the Answer is too large then it should be sent in a remotable form. I didn't bother with that part, partly because GIVEN clauses are typically small, and partly because this will all go away soon when resolver framework becomes available.

It all works now, but it has started logging a few Tuples as not being closed when their finalizers run. The finalizers are not there to do any work, they just clean up when it is needed, and report the need. The logged messages are probably just because SR can be a little sloppy with closing Answer and Tuples objects, and I'm sure it was just made harder for him with the deadline he was under. No wonder he hates programming in C or C++: all those free and delete calls.

CQU Results
Late last night I noticed that my academic transcript from CQU says that I failed a subject... only I didn't. In fact, the transcript says that I failed the very last subject I did in that degree, which is an obvious mistake, as I would never have been allowed to graduate if that had been the case.

So this morning I was on the phone finding out what happened. No one knows for sure, but it seems that CQU recently changed over their whole computer system, and one of my subject entries didn't make the changeover. So now I have to wait until the lecturer find the original paperwork so it can be corrected, and THEN I have to order another transcript. So much for a quick application.

Wednesday, June 09, 2004

Resolver Testing
I believe that I've made all necessary changes, but building has been a separate issue. The tests have a number of dependencies that the resolver packages themselves don't have. Since the packages are to be kept as independent as possible, some of the packages needed for building the tests are not available. Hence the trouble building.

Without introducing some real hacks, the answer is reflection. I'll be doing that tomorrow.

Joining on Transitive Constraints
I forgot to mention yesterday that BW had trouble doing a join against a transitive constraint. This turned out to be because the row count had not been set. After reviewing LocalQuery for where this should have been done I made two discoveries. First, the row count on constraints is being set redundantly in several different places. There's no performance penalty to this, but if anyone tries to change it in future they may find that things don't change as they expect.

The second surprise was that the row count on transitive constraints was being set. I put some debugging log statements in and discovered that the transitive constraint's row count was not set, but the constraint that it wraps was being set. This would be because it is this interior constraint that does the work and actually resolves its size. The solution was to just pass the question of the row count down to the wrapped constraint object when the transitive constraint's row size is not known. Unfortunately the parent Constraint class keeps the rows variable private, so I had to promote this to being protected.

After discussing all of this with AN this morning he made a comment about how transitive constraints should only meet an interface, and not extend the Constraint class. I agree wholeheartedly, but many methods expect a Constraint, without any consideration for the ConstraintOperation interface that it implements.

Masters
No coding tonight. I'm doing my application to UQ.

Tuesday, June 08, 2004

Resolver Testing
Today was a combination of reading papers (this one among others) and modifying AN's tests to work on remote resolvers. The original tests are designed to use RDF files as models, while the new ones need to refer to models on a TKS session. This means creating the models, and then loading the files into them, before anything can proceed. It's not as easy as it first seemed, but it's starting to come along.

Indexing
Had a chat with DM this morning about indexes. His skip lists have been proceeding apace, and I'm impressed with some of the ideas. As we talked about it I started to realise that some of the ideas for the systems that we each want to build are portable. I may end up borrowing from him. I'd suggest that he could borrow from me and put some hashtables back into his design, only the hashtables were his idea in the first place. Very cluey guy.

I started coding the hashtable implementation last night. At this point I just want to get it up and running, so it has a very posix feel to it (mmap, lstat, etc), but I plan on making it portable across to Windows. DM reminded me that he has an old library that implements basic mmap functionality for Windows, and voluntered its use. There's no point in rewriting the wheel, so I'm happy to use it.

In the meantime I was both pleased and frustrated at C. It's been some years now since I've coded in pure C (I've edited code in C, but it's different to writing your own code from scratch). Of late everything has been C++ or Java, and the shift was a bit jarring. No exceptions, no method overloading, and building my own structures to be passed as the first parameter of functions (instead of implicit this parameters). It is easy ebough to code around these things, but it feels like I have to work harder to get the same thing done. Conversely, it was an absolute pleasure to cast a pointer returned from mmap() directly into a stucture, and to modify a file that way. Made it all worthwhile really. :-)

If I steer clear of some of the less efficient features (like vtables) then C++ would give me the best of both worlds, but I'm trying to write simple portable code, and I keep coming back to the OMS project back in 1999. OMS was written entirely in C++, which most people just accepted. Then Allan Cox came onto the list and volunteered to help, only he asked if it might be possible to port the code to C because he didn't know C++. This brought forth a number of people who had the same problem. It was not changed (except perhaps in some lower level modules), but ever since then I've been more inclined to do open sourced code in C rather than C++.

The world has moved on since K&R, so I should check out C99 so I will not be restricting myself from using all of the latest and greatest features (eg. I like single line comments). I'll Google for it tomorrow. Hopefully I'll find a free copy. In the meantime both AM and DM have suggested gcc. If it were just Linux and OSX then I would, but I feel weird about using it for Windows. Maybe I shouldn't be so biased. I'll have to set aside some time to reboot to Windows and test it out.

Monday, June 07, 2004

Resolver
Everything builds, but the tests are not done yet. I'm using a set of tests built by AM, but they are taking a bit of adapting.

This resolver is to be used over the front of a RemoteSession, making it the basis of the new distributed queries. For the moment, distributed queries are a part of TKS rather than Kowari, so none of this code will be open sourced in the forseeable future. At this stage remote queries will provide simple functionality, but over time efficiency rules for remote joining and resolution will be added in.

Because the code is to be part of TKS rather than Kowari I had to get a TKS checkout, and make package modifications to satisfy the different directory structure. It is not a part of the TKS build process yet, but it does provide a place to check the code in.

Trans Modifications
I looked over AN's shoulder a little today, trying to offer a little help for his modifications to trans. He had a good grasp of it, but I figured that it doesn't hurt to know what the original author was trying to accomplish when he structured the code. I tried to be relatively straightforward, but it was quite apparent today that there were some subtleties that weren't so obvious.

Nonetheless, AN seemed fine with the modifications, improving some of the structure in the meantime. I was concerned about the overhead of testing for the existence of every triple to be generated, but we were able to determine that AN could use an existing set of tuples to search with greater efficiency than going back to the original graph.

Hashmap Indexes
I'm hoping to spend a little time on doing some honest-to-goodness coding on the hashmaps tonight. In the meantime I've been giving some consideration to the indexes and space usage.

The problem of 32 bit "next" pointers ends up being relatively easy. The end of each linked list will be indicated by a NULL pointer. Almost all other nodes will point to the next node in list. If the offset of one node from its predecessor is too large, then this will be indicated by setting the offset to a reserved value. When this is seen, the code traversing the linked list will go back to the hashmap and ask for the key again, only now with an integer which I'll call the overflow count. When the hashmap finds the required key, it will pretend that it has instead found a collision, and will keep looking in subsequent hashbuckets. For each pretended collision the overflow count will be decremented and the search continued, until the key is found and the overflow count is equal to 0. Insertions duplicate values into the hashmap should be reasonably simple as well.

It is highly unlikely that a next pointer will overflow the 32 bit value, so this scheme won't burden the hashmap unduely.

My remaining concern is what to do about the limited size of a hashmap. I'm starting to think that a hierarchy of indexes might be useful. Unfortunately, I haven't yet worked out how to go about this. In the meantime I've been reading various web pages about indexes, and searching for others. If anyone knows of any good ones then I'd love to hear about them. Unfortunately my background hasn't really covered many of the data structures that most software engineers probably take for granted, so I still have a lot to learn.

I've noticed that most descriptions of indexes describe structures which work quite well in memory, but are not so appropriate for disk. For instance, the skip lists page that AN pointed me to today is great for building data on disk, but deleting becomes quite messy due to the variable size of each node. This is not insurmountable, as a free list could allow for node reuse, but the problem does get harder. Similarly, many other indexing solutions describe structures which use variable sized data, with lots of jumping around in the data. The trick to efficient disk access for these systems is usually to find a fixed sized representation.

According to DM, Lucene gets away with skip lists like this because it relies entirely on write-once lists, with deletions done as white-outs, and additions going into new files. Later optimization is done as a merging of the multiple files. This is the approach that he wants to take, and it looks like it might work well. The biggest advantage is that it will scale well.

Consequently, I'm considering having a hierarchical structure with hashmaps at the top, and skip lists underneath, but I've yet to work out the best place to split data effectively in this way.

Friday, June 04, 2004

Resolvers Still
Well the basic structure is all written, but it's far from working. There are missing jar files, and all sorts of other problems with the build process, so I'm left sorting out what are the coding problems, and what are the configuration problems.

I think I'm close, but I was called away a little early tonight, leaving me unable to finish it until later.

Transitive Queries
AN spoke to me today about the trans statement. Since I got pulled onto resolvers he has started working on inferencing where I left off. He's making progress, and I'm sure it will be fine.

In the meantime it feels a little frustrating, because I know that if I were to make a change here and restructure something there then I could make the necessary changes reasonably easily. The only reason I didn't do so before was because I needed to speak with AN before I could commit to making those changes. Now that we've spoken I'm not getting the chance to work on it.

On the brighter side, I'm learning about a new subsystem that I hadn't had a lot of contact with before.

UQ
The meeting today went well. Yes, UQ is very interested in this kind of work. They are happy for me to apply, and Assoc. Prof. Colomb is prepared to be my supervisor. I now need to put my application into the university administration. Fingers crossed that they are happy to let me in, though ITEE thinks I should be fine.

As a part of the application I need to write a proposal of what I hope to do. I'm thinking something along the lines of "Scalable Ontology Server", but that sounds like hubris. :-) However, it does provide a general direction for my study, and I've been assured that my original proposal is a "write-only" document which is filed away, never to be seen again.

I have a LOT of reading ahead of me.

Indexes
DM is making progress on his new triplestore design. Taking his inspiration from Lucene, his index plans are all based on skip lists. New data gets put into new files, so nothing needs to be moved, and an optimise method will merge files together.

The principle of this design is simple: Use sequential access on disk to its greatest advantage. Hard drives operate at speeds which are orders of magnitude faster when data can be read sequentially.

So while skip lists means that more data has to be read by the process, the overall performance should actually be faster. Looking at it I can see his point. Making all writes an append operation on new files, with file merging on the fly, also sounds like a lot of work, but again, it's completely sequential access.

Scaling
The other advantage of using skip lists is that the data size is effectively unbounded. This has made me reconsider the hashtables I've been planning, and my opinion is now not so favourable.

The most expensive index from my previously discussed design will be based on the hash of the full triple. Let's assume 32 bit identifiers. 64 bit is actually more useful for a large system, but since this might not scale so well 32 bits are worth considering. Each element of the hash table will be 20 bytes long. That's 3 x 4 byte identifiers, and a 64 bit file index. A 2GB file can therefore only hold 214.7M entries. Presuming that I let the hash table get heavily loaded, say at 0.75, then that cuts me back to about 161M triples.

One of the big advantages of using a hashmap is that I could just map it into memory and look it up with a memory indirection. A 2GB file is more than I can really map on a non-64bit system, and this has not even considered the other indexes. In other words, I would be lucky to get to 100M triples if I use mapped hashtables. That isn't to say that I can't use them, but things may not be as rosy as they first appear.

On the other hand, I have had some ideas for the data file. As triples are written, they will simply get appended to the file. It would be feasible to set aside large buffers for these triples to be put into, and to wait until this buffer is full before writing it out. This prevents lots of small writes, and more importantly, it may help to prevent numerous writes to the indexes. For instance, if the two triples:
[23, 45, 34] [23, 45, 87]
were written, then the (subject,predicate) index would normally get updated twice for the (23,45) entry. In fact, so would the (subject) and (predicate) indexes for (25) and (45) respectively. Postponing the index writes means that only the latter entry would need to be written.

Of course, writeback filesystems might allow the first writes to be postponed until they are invalidated, but this is not guaranteed, and still invovles another system call.

Unfortunately the final data file will contain linked lists with all of the indexes pointing backwards to places in the file with lower offsets. This is the complete opposite of what is needed here. One possibility is to take the Lucene idea of running "optimise" to rewrite the file. In that case the whole file could get rewritten backwards, which would make traversing the linked lists significantly faster. I believe that it would be fine leaving the blocks alone, as all operating systems read whole blocks at a time, so back-indexing within a block will just be going to data that has already been read and is guaranteed to be in the same page of buffer cache.

It's a shame that I can't write a block to the front of a file.

Data Size
The final advantage of DM's idea is that the triples (or quads) being stored are just that: Triples only.

The data I'm planning on storing will be triples and linked list pointers. Each element will be 3 identifiers, and 6 "next" pointers. Presuming that linked lists can always do a 32 bit relative offset (which they may not without elements being moved) then this is already 36 bytes per triple. For full 64 bit file offsets this moves up to 60 bytes. If back pointers are included then it goes up again, to 108 bytes.

One of the big killers for Kowari's performance with huge data sets is when it moves out of buffer cache and into greater reliance on the hard drive. Obviously, this occurs when the data being manipulated is too large to fit into RAM. The performance degrades as cache misses become more frequent with the growing dataset.

The easiest way to put off this diminishment in performance is to use smaller data elements. An arrangement with 12 bytes for a triple will perform much better than one with 108 bytes.

I'm thinking I'll code this anyway, so I can see and evaluate its performance. However, I'll be planning on throwing it away once I've seen what it can and can't do well. In the meantime I'll be using my new student card (hopefully) to borrow books on indexing from the UQ library.

Thursday, June 03, 2004

Blogger
I'm starting to find Blogger a little annoying when I use Mozilla on Linux. For a start, I don't get a spellcheck button, but worse is the lack of a preview button. The same page on Safari has both.

Then there is the occasional incomplete page. That was a listed bug for Blogger, but it didn't get fixed with the latest update. I'll admit that I see incomplete pages less frequently now, but when it happens there is no way I can get the page to load.

The latest problem came earlier tonight as I attempted to publish my latest update. There was no response to the "Publish" button at all. I typically get a page telling me that publication is happening, but not tonight. After pressing the button several times, I finally restarted Mozilla... only to see that each publication had actually happened. So if anyone saw 10 separate posts of the same message, then now you know why.

JRDFmem and Trove
I'm not doing anything on JRDFmem tonight, but as usual I've been thinking about it.

After my expressed frustration at the use of Long objects in JRDFmem, AN reminded me of Trove. Doh! I know this, and yet I didn't think of it once. This is why I need more sleep.

The only hassle is the need for another jar file, but I can possibly cut it back to the essentials. After all, why reinvent the wheel when you don't have to?

Resolver
Not a lot to talk about today. The resolver is coming along reasonably well.

I've had to build Query objects from scratch, which in turn has meant that I've had to build a heap of other objects with them. However, the parameters are only defined by interface, and the javadoc isn't as helpful as it should be.

While I agree that parameters should be defined by interface, I had no idea what classes should be instantiated as parameters. Grepping the source for one interface name led to other interfaces which extend the first, and a couple of abstract classes for good measure. Grepping for these found even more interfaces and abstract classes. Eventually the required classes would show up, but occasionally I'd get lost, and have to go back up the inheritance tree again.

Building the Javadoc would help this, only I have a non-compiling build at the moment, so I couldn't. Given the time wasted I should have just got a new checkout specifically to build the javadoc. Funny how you don't see the obvious when you're working too close to it. Alternatively, I could just ask SR, but I'd end up doing that so often that it would be like him writing the code rather than myself.

Ideally we'd properly document the important parts of the class hierarchy, for use by developers who are working in that code for the first time (like me). However, that takes a lot of valuable time, and with all the pressures we're constantly under, there's no way we'd be allocated the time for something like that. Theoretically I could do it in my own time for Kowari. That's one of the advantages of it being open sourced. However, I have more interesting things to spend my own time on. :-)

Masters
Tomorrow I have an interview at UQ about research topics. Wish me luck!

Wednesday, June 02, 2004

Resolver
Today was spent wrapping a Session object with the Resolver SPI. After a bit of intial confusion, it's coming along nicely. It turned out that the SPI code I've read over the last few days has been abandoned and new code written. Fortunately the differences are not great, and I was able to clear the confusion relatively quickly.

I expect that I'll have the wrapper finished some time tomorrow, which will leave me about a day and a half to get all of the testing done.

Walking
AN's document on ontology support includes sections on transitive closure, and also on a separate function he calls walking. I didn't get it originally, thinking that it was a duplicate of the trans function. This was partly because some people have referred to it as relating to transitive predicates (which it does not).

Walking involves following a non-transitive predicate through the graph in the same way that trans follows a transitive one from an anchored node. The result is a set of statements from the original graph, rather than a set of new inferred statements.

The point of a walk is to see where a relationship can take us. For instance, with a non-transitive predicate of "hasFriend" one could imagine a graph of:
[<Alice> <hasFriend> <Bob>] [<Bob> <hasFriend> <David>] [<David> <hasFriend> <Eve>] [<Eve> <hasFriend> <Fred>]
Just because Alice has a friend of Bob, and Bob has a friend of David, this makes no statement of any relationship between Alice and David. Perhaps they have never encountered one another.

Following statements with a particular predicate in this way can allow such analysis as the infamous six degrees of separation (or the original Six Degrees of Kevin Bacon). One example that came to mind was when I heard about scientists trying to track the infection of SARS back to the original Chinese carrier. Each transmission was due to contact between individuals, and this path of contact had to be "walked" back to the Chinese point of origin. Of course, contact in this case was a non-transitive property. ("Contact" was also a difficult property to measure, given that the contact between the original Chinese sufferer and the people he gave it to was via the lift button in their hotel).

So being able to follow a chain of predicates in this way is an important tool. Fortunately, much of the work is done here, as the operations are very similar to following predicates with the trans statement. With my luck someone else will end up implementing it, and will decide to do it completely differently. But the principle will be the same.

JRDFmem
I rearranged some of the packages for this last night, and I'm starting to get happy with it. I also moved the reification operation from the NodeFactory and into the Graph. AN has now given me access on SourceForge, so I'll be putting it up soon, as a simple implementation of JRDF. Who knows, maybe someone will find a small triplestore like that to be handy. It's certainly indexed to be fast, and it is serializable.

It occurred to me that I'm creating Long objects quite a lot in JRDFmem. This is because Java 1.4 and below does not support Collections of base types like long. I've just been indiscriminately storing Longs as node IDs, and didn't think twice of it until today. However, it has occurred to me that each Long is a whole new object in the heap.

Unlike Strings, I don't believe that Longs get reused, even though they are immutable. Since Longs of the same value get used a lot in this graph, it could mean that anyone using JRDFmem for a largish graph could run out of memory sooner than they needed to. In fact, they will probably end up running out well before node number 4 billion is reached, so the use of Long as a node ID instead of Integer is probably unduely optimistic.

One idea I have is to cache these objects in an array of Longs. Nodes are rarely discarded, and they are allocated in incrementing order. I could simply store the references to them in the array, and when they are needed I can find the Long I want by indexing into the array. This might look like:
Long l = longArray[id];
The array would need to be grown when needed, involving copying into the large array, and discarding the old. This would be costly, both in terms of time for the copy, and space for the array itself, particularly the unused portion. However, if many Long objects are being made for each long number then this cost might be saved by removing the redundancy.

Of course, using this scheme would mean that the Longs would need to be changed back to Integers, as arrays are indexed by int only. It's a shame that templates aren't available here, or at least macros!

Transactions
The other thing I wanted to get done is transactions. I'd like to create a new Transaction interface that extends Graph and has commit and rollback methods. Graphs can then be asked for a new transaction, and the resulting object is a graph that can be used just like the original. Changes get put into the main graph on commit.

I'll need to make sure that this works cleanly with the javax.transaction interfaces, but I think I have it now. I can just wrap the Graphs-as-Transaction objects inside an object that meets the javax transaction interface. This wrapping object can keep hold of an original graph and a map of XIDs to transaction graphs. When a transaction is committed it updates the main graph, and notifies all outstanding transactions of the change. Any transactions which have modifications incompatible with the new main graph can abort at this point.

Transaction graphs will be relatively simple. They can hold a reference to the original graph, plus their own data. As statements are added they will be put into their own index, and added to a record. As statements are removed, they will be removed from the local index, if they exist there, or a whiteout entry can be put into the record.

Queries on these transaction graphs are a combination of a query on the main graph (with checks for whiteouts), and on the local index. Then when a transaction graph gets committed it can play back its record against the main graph. At this point every other transaction should check its record against the main graph, to check if it should abort early.

I'm considering something like this for the disk based triplestore. It's based on an idea by DM. In particular, he suggests that each transaction be an entire copy of the graph, as copies of files are relatively quick (particularly with operating systems which support write-behind). This would change the description above only a little. Instead of holding a reference to the original graph, and queries being applied to the original plus data local to the transaction, each transaction would just have to query its local data, and whiteouts on the original data would not be needed.

In this system, when a transaction gets committed it can just become the main graph. All other transactions would then take a new copy of the new main graph, and do a replay of their modifications on this new data. Afterward, the next transaction to be committed gets to become the new graph.

This scheme is simpler than what I'm considering for JRDFmem, but at the expense of disk space. However, disk space is cheap and expandable, while RAM is not, and that is the primary consideration for JRDFmem.

Maybe I shouldn't get carried away. It's only meant to be a simple triplestore!

Tuesday, June 01, 2004

Resolver SPI
Read more of this today, including the Lucene and XML resolvers. I found one amusing piece of in the XML resolver's next() method. It had obviously evolved from a very different method, and for a 30 line bit of code it was quite obfuscated. At first glance it contained at least 3 inappropriate control structures.

Both SR and AM had their names on it, so I asked AM. After laughing he offered a quick fix that didn't work. I tested it, watched it fail, then fixed it properly.

These things are no one's fault, as changing requirements and tight deadlines can lead to code being changed a little at a time, with no one getting the time to go back and refactor. However, it can be funny when you are asked to read it.

Now that I have a decent handle on this code I'll be writing a "Remote Resolver". This means taking the current remote session code, and wrapping it in the Resolver SPI. This means that remote sessions will look just like every other type of data store.

A significant aspect of this code is that it will be for TKS and not Kowari. That feels a little strange, as I haven't worked on purely TKS code for a little while.

I also fixed the JXUnit tests which had been failing on the transitive queries. Yes, it was due to the order of the returned data.

Inferencing
I described what had been going on with transitive statements for AN this afternoon. As I suspected, it was not quite what he'd had in mind for it, though on further reflection he says that what we have is still useful.

When trans() is used to describe a transitive predicate and two variables, then it will infer every possible statement with that predicate. This works exactly as it is expected.

The problem is when trans() is used with either the subject or object set to a fixed value. AN had planned for this to result in every possible inference from the graph that was reachable from the fixed value. To illustrate, consider the following example data:
  [a predicate b]   [b predicate c]   [c predicate d]   [x predicate y]   [y predicate z]

The list of possible inferred statements for a transitive predicate is:
  [a predicate c]   [a predicate d]   [b predicate d]   [x predicate z]

The list of possible predicate reachable from a is:
  [a predicate c]   [a predicate d]   [b predicate d]

Note that x, y, and z are not reachable from a. The reason for having such a thing is purely for the sake of efficiency. If a graph is modified in a part known to be disconnected from other parts, then being able to restrict re-inferencing to only that part will be a major improvement in performance.

At the moment what we have is the set of inferred statements which has a subject (or object) of a. ie.
  [a predicate c]   [a predicate d]

This is not entirely useless. For instance, if the predicate happened to be subClassOf, then this kind of statement would allow us to say, "give us all classes that a is a subclass of". This is needed for backward chaining, and once AN realised it he seemed a little mollified.

The reason I wrote anchored transitive queries to be like this and not what was originally desired was sort of accidental. The inferred statements which include a given node is a subset of all the inferred statements reachable from that node. I wrote code which could find the former, with the intent of extending it to find the latter as well.

Making sure I was on the right track, I ran an intermediate test. Once I saw the results coming out from this test I recalled that TJ had mentioned that he wanted something that returned this particular set of statements. With AN away, I went and asked TJ, and he agreed that this was exactly what he wanted and that I should leave it there.

I now realise that TJ has always been an advocate of generating inferences on the fly with the use of backward chaining. Hence his approval of this kind of function. Had AN been there to discuss it then I'd have realised that he and TJ were not after the same thing, and that the other type of function is also needed. I had a suspicion that this was the case, and so I've been keen to talk to AN about it. I'm glad I've done so now.

Ironically, AN had yet to consider backward chaining, how we would do it, and the syntax to use. By showing the current trans() statement with the modified syntax (I discussed this modification last week), he has discovered that we already have the tools for backward chaining, and that the syntax is already done for him. It might not be very pretty syntax, but it makes sense. Most importantly, it already works, and since it was a needed step to doing what he wanted all along, we haven't really wasted any time be doing it.

Once the distributed resolver is working I'll be getting back to implementing this.

JRDFmem
The graph is now serializable, which was the last major obstacle to usability. I also spent some time last night moving directories around, separating packages into separate jars, and cleaning the whole build file up a little. I've spoken to AN about it, and I think I'm on the right track to have it included with JRDF on Sourceforge soon.

While I want to move on I keep having ideas for improvements, and I keep wanting to add them in. I might do some of them, but I'll have to cut it off soon, or else I'll accomplish nothing.

Masters
Assoc. Prof. Bob Colomb from UQ has agreed to see me this week to discuss the possibility of doing my Masters. I was starting to get frustrated, as the only responses I've had so far have been from QUT, and on each occasion I was told that they didn't specialise in my area, and maybe I could talk to someone else. I received my latest response this evening from Dr Raymond Lau, which was encouraging though not helpful. He is going on leave, but he says that QUT is definately do research in the area, and I should have success if I keep asking. Oh well, Rome wasn't built in a day.

A Masters isn't essential for me, but I think it would be helpful for future work overseas. I've discovered that Bachelor degrees in some countries are really inadequate for professional work, and that candidates are not considered without Honours or Masters degrees, spending on location.

An example of this is Engineering in the UK vs. Australia. In the UK there is a 3 year degree which gets you exactly nowhere. The IEE only accept graduates who have also completed an extra year of Honours. In Australia the standard engineering degree is already 4 years. I'm on the Queensland committee for the IEE, and discussions with professors of 3 Queensland universities have revealed that the increasing scope of engineering and technology today has led to engineering courses packing in more content than most other degrees, and reducing the credit allotted so that students don't exceed the maximum study levels set by the universities' adminstrations. The result is that a standard Australian undergraduate engineering degree is deemed by the IEE to be at least the equivalent of an honours degree in England. Some even suggest putting it on a par with a Masters, though this is not an official position.

If a Masters can be considered the equivalent of the undergraduate degrees that I'm familiar with, then it is no wonder a Masters is considered mandatory in some places. So even if I meet the standard, without an appropriate piece of paper I may encounter difficulty in the future.

For anyone who doesn't know, I have a Bachelor of Computer Systems Engineering, and a Bachelor of Science (with Distinction), majoring in quantum physics. I started postgraduate Honours in physics at the beginning of 2002, but family circumstances led to me withdrawing. I hope to get back to it one day. To that end, having a Masters, even in a different discipline, may be of use to me.

Working notes