Friday, June 22, 2007

Camping

I'll be away camping for the next week. This will be my first time without any kind of access to a computer in a few years. I'm looking forward to it.

I'm also looking forward to the company. Not only will I be able to spend some uninterrupted time with my family, but we'll be with friends whom we haven't spent much time with lately. I'm sure the occasional conversation will drift to physics or philosophy, but in general it will probably be about children.

Outside of Anne and the kids, I've been a bit starved of company lately. We have some good friends here, but aside from them being perpetually busy (what is it with people around here that they have to plan their every waking moment?), I don't actually know anyone local who shares my interests. Consequently, I think I've been pestering some of my contacts online a little too much lately, and for any of you who read this and have experienced that, I apologize. Hopefully I'll be a little less annoying after my break.

Slow Debugging

My notebook computer is now officially too slow for me. Unfortunately, I can't afford a new one - at least, not until our house in Australia sells (fingers crossed). In the meantime, it is the only computer I have access to at work.

In February I realized that my own machine was too slow for me, and that it was about time that the company provided me with something I could get work done on. The guy in charge of these things agreed, and a new purchase was approved. However, it's now late June and I'm still waiting. I was told last week that they decided to lease the computer instead (which doesn't really impact me at all, since I was never going to own it), and just this week they put the application in. Now they're waiting to hear back from the finance company. Hopefully I'll have something in July, but I'm not counting my chickens just yet.

The reason I've gone into this is that I spend my days on my current notebook and have great difficulty getting any work done. I often have to wait for nearly a minute while switching windows, and can regularly type far ahead of the display (it's like being on an old modem). Consequently, using a system like Eclipse to do debugging is just impossible. Because of this, my boss suggested I could work from home more often.

I'm a little hesitant to be isolated from the office, and my notebook is too slow to get away from the kids down at the coffee shop, but getting to use my home PC has caused these concerns to fade into insignificance. I've worked at home every second day this week, and on these days I managed to get an enormous amount of work done. I'd been feeling ineffectual at work recently, and I hadn't realized that my computer was contributing so much to that.

Mulgara 1.1

I'm not supposed to do any Mulgara development when I'm at work, but I managed to get away with it this week. I'm supposed to be building a library on top of Mulgara. However, since the next version of Mulgara is due very soon now (I know, because Andrae and I said it!), it doesn't make sense to code the library to the old released version. But the only way there will be a new release is if I do it, meaning that my work on this new library is blocked on my being able to release the next version of Mulgara. This is my justification, anyway.

I've spoken with the other guys I work with, including the CTO, and no one disputes my approach. I know that a lot of people have wanted this done, so I'm pleased I've had the opportunity. It can be hard to do it all at night.

Unfortunately, we don't have a re-implementation for the graph-renaming code. Brian had expected to do it, but other commitments have prevented him. Either Andrae or I will do it now. It needs to happen soon, so I'm hoping that Andrae might have it done by the time I get back. Otherwise my evenings in early July will be heavily booked.

Other than that, we have a fix for all the critical bugs in the system. We already had several new features (the new transaction architecture, distributed queries, etc.), but we've just been holding off on the next release until we'd fixed the most critical problems. Andrae did a great job with several, and I'm very grateful to both him and Topaz for making that happen.

Blank Nodes

Just in the last week I've worked on a problem we've had with inserting blank nodes. Both the N3 loader and the distributed resolver had issues with this: the code would fail to recognize the reuse of a blank node ID, and we would end up with separate blank nodes in the system. The solution needed a similar approach in each case, though it had to happen in very different places.

The N3 loader just needed to make sure that it remembered which blank nodes it had already seen. It turned out there were two problems. The first was that the loader expected anything after the special "_:" namespace to be a number. This is not true, even with our own internal identifiers (which use a format of _:node###, where each # is a digit). The second was that the loader never created new nodes when they were needed, even though a commented-out reference to a BlankNodeFactory suggests an earlier iteration of the code handled this correctly. Someone evidently changed it without adequate testing.
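The fix is really just a bit of bookkeeping. Here is a minimal sketch of the idea (illustrative only, not the actual loader code; BlankNodeRegistry is a made-up name, and the factory and node interfaces are stand-ins for the real classes): look up each "_:" label, and only ask the factory for a new node the first time a label is seen.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: remembers which blank node labels the loader has seen. */
class BlankNodeRegistry {
  private final Map<String, BlankNode> seen = new HashMap<String, BlankNode>();
  private final BlankNodeFactory factory;

  BlankNodeRegistry(BlankNodeFactory factory) {
    this.factory = factory;
  }

  /** Returns the node for this "_:" label, creating one only on first use. */
  BlankNode lookup(String label) {
    // Labels are arbitrary strings; nothing here assumes they are numeric.
    BlankNode node = seen.get(label);
    if (node == null) {
      node = factory.createBlankNode();
      seen.put(label, node);
    }
    return node;
  }
}

/** Hypothetical factory and node interfaces, standing in for the real ones. */
interface BlankNodeFactory { BlankNode createBlankNode(); }
interface BlankNode { }
```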

It turns out that this is not trivial to test for with our standard testing framework. Our scripted tests compare output against an explicit set of expected values, matched exactly. However, blank nodes could be assigned any identifiers, so a check for them can't look for specific values. Instead it has to be clever and make sure that the nodes in the right places all match each other (and not any other nodes). Of course, this can be done, but it will take about a day (on my home computer - I have no idea how long it would take on my notebook), and I haven't had time to get to it yet.
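To sketch what "clever" means here (purely illustrative, and not part of our test framework): rather than comparing blank node labels literally, the test can build a one-to-one mapping between expected and actual labels as it walks the results, and fail only if that mapping ever becomes inconsistent.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative blank-node-aware comparison: labels must correspond one-to-one. */
class BlankNodeMatcher {
  private final Map<String, String> expectedToActual = new HashMap<String, String>();
  private final Map<String, String> actualToExpected = new HashMap<String, String>();

  /** True if pairing these labels keeps the expected/actual mapping consistent. */
  boolean matches(String expected, String actual) {
    String mappedActual = expectedToActual.get(expected);
    String mappedExpected = actualToExpected.get(actual);
    if (mappedActual == null && mappedExpected == null) {
      expectedToActual.put(expected, actual);   // first time either label is seen
      actualToExpected.put(actual, expected);
      return true;
    }
    return actual.equals(mappedActual) && expected.equals(mappedExpected);
  }
}
```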

The distributed resolver also needed to remember nodes it had seen before, but this time the code to do it is inside the internal Localization routine. Blank nodes coming from another computer always get negative identifiers (indicating they are temporary values), and the maps which convert these values into allocated nodes can only accept positive values. However, I couldn't just invert the sign of these values, as insertions of local blank nodes also go through this code path (something I forgot at one point, but was immediately reminded of).

So then I spent 20 minutes coming up with an elaborate method of making sure that I was only inserting local positive numbers or foreign negative numbers, only to remember that it is allowable to have both come in at once (I need to write a test for that - it will involve a disjunction across graphs). It was only when I went to fix this that I realized that a positive ID will always be the value of an allocated node. Indeed, it will always be the value of the very node you need, meaning there is no need to map it at all. Well, at least I realized that before I discovered it during testing (or worse - someone filing a bug report on it).
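In other words, only the negative (temporary) IDs need a mapping, and positive IDs pass straight through. A rough sketch of that rule, with made-up names (this isn't the actual Localization code, and NodeAllocator is a hypothetical stand-in for whatever hands out new local nodes):

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch: map temporary (negative) blank node IDs to local nodes. */
class BlankNodeLocalizer {
  private final Map<Long, Long> tempToLocal = new HashMap<Long, Long>();
  private final NodeAllocator allocator;

  BlankNodeLocalizer(NodeAllocator allocator) {
    this.allocator = allocator;
  }

  long localize(long id) {
    if (id > 0) {
      return id;                        // already an allocated local node: no mapping needed
    }
    Long local = tempToLocal.get(id);   // negative IDs are temporary, foreign nodes
    if (local == null) {
      local = allocator.newNode();      // allocate a local node the first time this ID appears
      tempToLocal.put(id, local);
    }
    return local;
  }
}

/** Hypothetical allocator of new local node IDs. */
interface NodeAllocator { long newNode(); }
```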

The maps we have are all effective, but they result in file access whenever they are created. This was in the back of my mind as something I should optimize when tests started failing. It turned out that there was too much file activity, and I had created too many files. Doh. Some cleanup in the close() method fixed that, but it reminded me that I was still creating these maps even when no blank nodes needed to be remembered.

The easy optimization here was lazy construction of the maps. Hmmmm, now that I think of it, I should make that lazy construction of only the needed map (since only one is usually needed: either the map by numeric ID or the map by the string URI for the blank node). I noticed that at least one of these maps has a memory cache, but the file still gets created anyway. I'm thinking that we need an external cache which can put off creating the file until some pre-set threshold is reached, say 1000 nodes. This localization routine is used everywhere, so avoiding disk usage is important.
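Something like this is what I have in mind (a rough sketch only; the class name, the threshold, and the createDiskMap() placeholder are all made up, not our actual file-backed map API): keep new entries in memory and only bring the file-backed map into existence once the threshold is crossed.

```java
import java.util.HashMap;
import java.util.Map;

/** Rough sketch: an in-memory cache that defers creating the file-backed map. */
class SpillingNodeMap {
  private static final int SPILL_THRESHOLD = 1000;   // made-up threshold

  private final Map<Long, Long> memory = new HashMap<Long, Long>();
  private Map<Long, Long> disk = null;                // created lazily, only if needed

  Long get(Long key) {
    Long value = memory.get(key);
    if (value == null && disk != null) {
      value = disk.get(key);
    }
    return value;
  }

  void put(Long key, Long value) {
    if (disk != null) {
      disk.put(key, value);
    } else if (memory.size() < SPILL_THRESHOLD) {
      memory.put(key, value);
    } else {
      disk = createDiskMap();           // the file only comes into existence here
      disk.putAll(memory);
      memory.clear();
      disk.put(key, value);
    }
  }

  private Map<Long, Long> createDiskMap() {
    // Placeholder: the real thing would construct one of our file-backed maps.
    return new HashMap<Long, Long>();
  }
}
```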

Anyway, all the big bugs have been hit, and I've put it out in SVN as a release candidate. If it hasn't broken by the time I get back from camping, then it gets released as version 1.1.0. Then we have to do the model renaming (again) and that can be rolled in as a relatively quick release for 1.1.1.

Our time between releases has been terrible, especially when you consider the fixes and features that have gone in, so a quick turnaround on the next point release would be a step in the right direction.

After that, the next big features have to be in scalability and SPARQL. Frankly, I'm not sure which is more important at this stage. Real scalability changes will take some time to implement, but in the meantime we can do a few small things that will have a noticeable effect. But I don't want to supplant SPARQL development either, so finding the balance will be tricky. Fortunately, Indy is building the SPARQL parser at the moment using JFlex/Beaver (an LALR parser generator), so I know it's in good hands. I've promised to take him through our AST as soon as he's done, so we can hook the two together.

Trans

Tonight I got to spend a bit of time online describing the transitive functionality in Mulgara to Ronald at Topaz. It's a technique I came up with a few years ago, where you join a set of statements against itself iteratively. The result is that each join stretches twice as far as its predecessor across the graph. Consequently, you only have to do at most log2(n) joins to find the full transitive closure across the graph. You trade space for speed, but it's not too bad, as the only space you use up is for the statements you were looking for.

I looked up various papers on complexity of transitive closure on a directed graph, and they claim that the most efficient algorithm will be polynomial in log(n). Interestingly, all the papers I found presumed that the graph could contain no loops, but this is not a valid assumption in RDF. To get around this, I also kept a set of nodes that have already been visited, so as to avoid iterating indefinitely when a loop occurs. This increases the memory requirements beyond what I mentioned above, but with no increase in complexity.
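For anyone who'd rather see the shape of the idea than dig through the resolver code, here is a toy version in plain Java. It is illustrative only: the real trans works against tuples from the statement store and uses the visited-node set described above, whereas this sketch just iterates a self-join over a set of pairs until it stops growing (which copes with cycles in the same way).

```java
import java.util.HashSet;
import java.util.Set;

/** A toy edge: "from" can reach "to" via the transitive predicate. */
final class Pair {
  final String from;
  final String to;

  Pair(String from, String to) {
    this.from = from;
    this.to = to;
  }

  public boolean equals(Object o) {
    if (!(o instanceof Pair)) return false;
    Pair p = (Pair) o;
    return from.equals(p.from) && to.equals(p.to);
  }

  public int hashCode() {
    return from.hashCode() * 31 + to.hashCode();
  }
}

class TransitiveClosure {
  /** Join the pair set against itself until nothing new appears. */
  static Set<Pair> close(Set<Pair> edges) {
    Set<Pair> closure = new HashSet<Pair>(edges);
    boolean growing = true;
    while (growing) {                         // roughly log2(n) passes are needed
      Set<Pair> joined = new HashSet<Pair>();
      for (Pair a : closure) {
        for (Pair b : closure) {
          if (a.to.equals(b.from)) {          // the self-join: path lengths double each pass
            joined.add(new Pair(a.from, b.to));
          }
        }
      }
      growing = closure.addAll(joined);       // fixed point; also terminates on cyclic graphs
    }
    return closure;
  }
}
```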

Another nice feature is that "backward-chaining" does not improve the complexity at all. All it does is reduce the amount of memory being consumed, and the length of some of the loops. However, while having a possible impact on performance, neither of these considerations affects complexity.

Andrew told me that he basically copied my code and then modified it to make the walk functionality. I've never looked at walk, but Ronald tells me that the two look almost the same. I did notice, though, that Andrew's name is listed as an author of trans. I know I talked some of it through with Andrew, but I also recall writing at least a lot of it on my own. I should ask him about that.

Ronald is a bright guy, but trying to read the algorithm from code had him stumped. I believe that the code is clear about what it's doing, but the reason for doing it was not clear at all. I tried to explain it to him, but he thought it was operating linearly (and not logarithmically) until suddenly I got a message saying that he'd just been blinded by the light.

There are a lot of very bright people in the semantic web, and it can get a little intimidating sometimes, especially when you're off on your own like I am at the moment. It's reassuring to be told that you can do something clever every once in a while. :-)

I'm also thinking of taking this code, and using it again, only this time with an unbound predicate. This would let you find the shortest path between two nodes with the same complexity as transitive closure. The difference here is that the space requirements would grow significantly, but since we back these things with disk-based storage, maybe that's acceptable. I'm told that several people would like to see this feature.

2 comments:

Anonymous said...

re: @author tags, that's why many apache projects just do "the foo team" rather than listing individual names in the files.. skips the issue.

enjoy the offline-time!

Andrew said...

Trans definitely came before walk. See:
http://sourceforge.net/tracker/index.php?func=detail&aid=948377&group_id=89874&atid=591707
http://sourceforge.net/tracker/index.php?func=detail&aid=964655&group_id=89874&atid=591707

I remember the absolute pain of deciding which syntax to use. And I'm sure Simon, Andrae and yourself all had slightly differing views but no one liked the syntax chosen.

I remember that trans originally returned both new and existing statements, and I was adamant that it only produced new ones - I can't remember the justification, but it probably has to do with one being used for inferencing and the other for querying (not mixing read and write operations).

I'm pretty sure I was out of the country (in America) when trans was being implemented by you so I have no idea why it has my name in the header - unless I started it and you finished it.

I also notice that walk has a test and trans doesn't - which also suggests that I didn't write it.

And the reason they look the same, again from what I can remember, was to try and factor out common code which was never finished (because they were very similar).