Saturday, June 23, 2007


Lately I've been hanging out in #talis on IRC. It's a corporate IRC channel, but unlike other companies (like mine) they've put it on a public server where anyone can find it and join in. Not only that, but some of the staff blogs point out that they can usually be found there.

Well, that seemed like an open invitation, so I've been joining in for the last few days. :-)

It's exactly the sort of thing you'd expect, in that it is just discussion between staff facilitating the various tasks they are involved with. Of course, there is a lot of humor as well, which I've appreciated, since American humor is different to my own. We spell it differently, for a start!

I was particularly interested to see that they hold their daily Scrum meetings in IRC. I suppose this is necessary, given that they have people all over. The other thing is that I've enjoyed saying hello to people whose blogs or emails I've been reading for years, but have never interacted with before. Well, almost never. One of them responded to something I said at the end of a post a few years ago, and I've received very occasional emails from him ever since (and I occasionally pester him with stuff - he'll know who I mean if he's reading this). :-)

Overall, between IRC and their podcasts, they seem to be a bright group of people, and are destined to make waves in the semantic web. I'm curious to see where they go.


PaulM recently made one of his infrequent posts (like I can talk), mentioning Dopplr in the process. Then just last night IanD linked to an interview with Dopplr's lead developer, Matt Biddulph.

Dopplr sounds interesting. It's almost enough to make me regret not traveling enough to justify joining. Almost.

Back when I couldn't afford it, I used to enjoy traveling. I still enjoy it, but leaving the kids and Anne takes the luster off the experience. I'm looking forward to the day when the boys are happy for me to leave, and Anne and I can go somewhere on our own for a weekend... or a week.

I also note that Dopplr want to add in support for calculating your miles for Terrapass. Expedia pointed me to Terrapass on my last flight, and I was very happy to use it. I'm planning on making use of it as much as possible in future.

Most transport has alternatives that produce fewer emissions, but air travel is one that seems to have no solution. I'm not a big fan of "carbon offsets", as they're usually an excuse for pushing the problem off on someone else, and don't provide a solution (besides, carbon is only one of many problems with emissions). However, there are many occasions where there simply is no better alternative, in which case I fully endorse them. Carbon offsets are usually the domain of large corporations, but Terrapass makes them accessible to consumers. I don't think that lets the corporations serving those consumers off the hook, but it does allow people to get involved.

I was also pleased to note in the news recently that the FAA are looking to clean up the airlines. Even if they're successful, I somehow doubt that Terrapass will charge less per mile. ;-)

Friday, June 22, 2007


I'll be away camping for the next week. This will be my first time without any kind of access to a computer in a few years. I'm looking forward to it.

I'm also looking forward to the company. Not only will I be able to spend some uninterrupted time with my family, but we'll be with friends whom we haven't spent much time with lately. I'm sure the occasional conversation may end up on physics or philosophy, but in general it will probably be about children.

Outside of Anne and the kids, I've been a bit starved of company lately. We have some good friends here, but aside from them being perpetually busy (what is it with people around here that they have to plan their every waking moment?), I don't actually know anyone local who shares my interests. Consequently, I think I've been pestering some of my contacts online a little too much lately, and for any of you who read this and have experienced that, I apologize. Hopefully I'll be a little less annoying after my break.

Slow Debugging

My notebook computer is now officially too slow for me. Unfortunately, I can't afford a new one - at least, not until our house in Australia sells (fingers crossed). In the meantime, it is the only computer I have access to at work.

In February I realized that my own machine was too slow for me, and that it was about time that the company provided me with something I could get work done on. The guy in charge of these things agreed, and a new purchase was approved. However, it's now late June and I'm still waiting. I was told last week that they decided to lease the computer instead (which doesn't really impact me at all, since I was never going to own it), and just this week they put the application in. Now they're waiting to hear back from the finance company. Hopefully I'll have something in July, but I'm not counting my chickens just yet.

The reason I've gone into this is that I spend my days on my current notebook, and have great difficulty getting any work done. I often have to wait for nearly a minute while switching windows, and can regularly type far ahead of the display (it's like being on an old modem). Consequently, using a system like Eclipse to do debugging is just impossible. Because of this, my boss suggested I could work from home more often.

I'm a little hesitant to be isolated from the office, and my notebook is too slow to get away from the kids down at the coffee shop, but getting to use my home PC has caused these concerns to fade into insignificance. I've worked at home every second day this week, and on these days I managed to get an enormous amount of work done. I'd been feeling ineffectual at work recently, and I hadn't realized that my computer was contributing so much to that.

Mulgara 1.1

I'm not supposed to do any Mulgara development when I'm at work, but I managed to get away with it this week. I'm supposed to be building a library on top of Mulgara. However, since the next version of Mulgara is due very soon now (I know, because Andrae and I said it!), I figure that it doesn't make sense to code the library to the old released version. But the only way there will be a new release is if I do it, meaning that my work on this new library is blocked on my being able to release the next version of Mulgara. That's my justification, anyway.

I've spoken with the other guys I work with, including the CTO, and no one disputes my approach. I know that a lot of people have wanted this done, so I'm pleased I've had the opportunity. It can be hard to do it all at night.

Unfortunately, we don't have a re-implementation for the graph-renaming code. Brian had expected to do it, but other commitments have prevented him. Either Andrae or I will do it now. It needs to happen soon, so I'm hoping that Andrae might have it done by the time I get back. Otherwise my evenings in early July will be heavily booked.

Other than that, we have a fix for all the critical bugs in the system. We already had several new features (the new transaction architecture, distributed queries, etc) but we've just been holding off on the next release until we'd fixed the most critical problems. Andrae did a great job with several, and I'm very appreciative to both him and Topaz for making that happen.

Blank Nodes

Just in the last week I've worked on a problem we've had with inserting blank nodes. Both the N3 loader and the distributed resolver had issues with this. In both cases the code would fail to recognize the reuse of a blank node ID, and we would get separate blank nodes in the system. In both cases, the solution needed a similar approach, though it had to happen in very different places.

The N3 loader just needed to make sure that it remembered which blank nodes it had already seen. It turned out there were two problems, the first being that the loader expected anything after the special "_:" namespace to be a number. This is not true, even with our own internal identifiers (which use a format of _:node###, where each # is a digit). The second problem was that new nodes were not being produced when they were needed, though a reference to a BlankNodeFactory was commented out. Apparently this is from an earlier iteration of the code, where it appears that the code worked correctly. However, someone changed it without adequate testing.
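The fix amounts to keeping a registry of labels already seen, and only minting a node on first sight. Here's a minimal sketch of the idea (the class and method names are mine for illustration, not Mulgara's), which deliberately makes no assumptions about the label format:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: remember every blank node label the parser has seen, and
// allocate a fresh node only for labels seen for the first time.
public class BlankNodeRegistry {
  private final Map<String, Long> seen = new HashMap<>();
  private long nextId = 1;  // stand-in for a real node allocator

  // Return the node for a label such as "node42" or "abc". No numeric
  // format is assumed - the label is just an opaque key.
  public long lookup(String label) {
    return seen.computeIfAbsent(label, l -> nextId++);
  }

  public static void main(String[] args) {
    BlankNodeRegistry reg = new BlankNodeRegistry();
    long a = reg.lookup("node1");
    long b = reg.lookup("abc");     // non-numeric labels are fine
    long a2 = reg.lookup("node1");  // reuse, not a new node
    System.out.println(a == a2);    // true
    System.out.println(a == b);     // false
  }
}
```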

It turns out that it is not trivial to test for this with our standard testing framework. Our scripted tests look for an explicit set of output values, which we check for exactly. However, blank nodes could be assigned any identifiers, so any check for them can't look for specific values. Instead it has to be clever and make sure that the nodes in the right places all match each other (and not any other nodes). Of course, this can be done, but it will take about a day (on my home computer - I have no idea how long it would take on my notebook), and I haven't had time to get to it yet.
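The kind of check needed can be sketched as a matcher that builds a one-to-one mapping between expected placeholders and actual IDs, instead of comparing strings directly. This is just an illustration of the idea, not our actual test framework:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: two result rows match if a consistent bijection exists from
// expected blank-node placeholders to actual IDs, while all non-blank
// values must match exactly.
public class BlankNodeMatcher {
  public static boolean matches(String[] expected, String[] actual) {
    if (expected.length != actual.length) return false;
    Map<String, String> fwd = new HashMap<>();  // placeholder -> actual
    Map<String, String> rev = new HashMap<>();  // actual -> placeholder
    for (int i = 0; i < expected.length; i++) {
      String e = expected[i], a = actual[i];
      if (e.startsWith("_:")) {
        // placeholder: must map consistently in both directions
        String prev = fwd.putIfAbsent(e, a);
        if (prev != null && !prev.equals(a)) return false;
        String prevRev = rev.putIfAbsent(a, e);
        if (prevRev != null && !prevRev.equals(e)) return false;
      } else if (!e.equals(a)) {
        return false;  // non-blank values must match exactly
      }
    }
    return true;
  }

  public static void main(String[] args) {
    // _:a appearing twice must map to the same actual ID...
    System.out.println(matches(
        new String[] {"_:a", "_:a", "<u>"},
        new String[] {"_:node7", "_:node7", "<u>"}));  // true
    // ...and distinct placeholders may not share an ID.
    System.out.println(matches(
        new String[] {"_:a", "_:b"},
        new String[] {"_:node7", "_:node7"}));  // false
  }
}
```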

The distributed resolver also needed to remember nodes it had seen before, but this time the code to do it is inside the internal Localization routine. Blank nodes coming from another computer always get negative identifiers (indicating they are temporary values), and the maps which convert these values into allocated nodes can only accept positive values. However, I couldn't just invert the sign of these values, as insertions of local blank nodes also go through this code path (something I forgot at one point, but was immediately reminded of).

So then I spent 20 minutes coming up with an elaborate method of making sure that I was only inserting local positive numbers, or foreign negative numbers, only to remember that it is allowable to have both come in at once (I need to write a test for that - it will involve a disjunction across graphs). It was only when I went to fix this that I realized that a positive ID will always be the value of an allocated node. Indeed, it will always be the value of the very node you need, meaning there is no need to map it. Well, at least I realized that before I discovered it during testing (or worse - someone making a bug report on it).
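The rule I ended up with can be sketched like this (the names and the toy allocator are illustrative, not the actual Localization code): a non-negative ID is already a local allocated node and passes straight through, while a negative foreign ID gets looked up in a map keyed by its absolute value.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: positive IDs are already local allocated nodes, so they need
// no mapping. Negative IDs are temporary values from another machine,
// and are mapped (keyed by their inverted sign) to local nodes, with
// the mapping remembered so the same foreign node is reused.
public class BlankNodeLocalizer {
  private final Map<Long, Long> foreignMap = new HashMap<>();
  private long nextLocal = 100;  // stand-in for the real node allocator

  public long localize(long id) {
    if (id > 0) return id;  // local node: use as-is, no mapping needed
    return foreignMap.computeIfAbsent(-id, k -> nextLocal++);
  }

  public static void main(String[] args) {
    BlankNodeLocalizer loc = new BlankNodeLocalizer();
    System.out.println(loc.localize(42));       // 42: passthrough
    long f = loc.localize(-7);
    System.out.println(f == loc.localize(-7));  // true: remembered
  }
}
```

Both local and foreign IDs can arrive through the same call, which is exactly why the positive branch can't be dropped.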

The maps we have are all effective, but they result in file access whenever they are created. This was in the back of my mind as something I should optimize when I suddenly started failing tests. It turned out that there was too much file activity, because I had created too many files. Doh. Some cleanup in the close() method fixed that. But it reminded me that I was still creating these maps, even when no blank nodes needed to be remembered.

The easy optimization here was lazy construction of the maps. Hmmmm, now that I think of it, I should make that lazy construction of only the needed map (since only one is usually needed: either map by numeric ID or by the string URI for the blank node). I noticed that at least one of these maps has a memory cache, but the file is already created. I'm thinking that we need an external cache, which can put off creation of the file until some pre-set mark is reached, like 1000 nodes, or some value like that. This localization routine is used everywhere, so avoiding disk usage is important.
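The lazy-construction idea might look something like this sketch, where the expensive file-backed map is stood in for by a Supplier and only gets created once an in-memory threshold is crossed. All the names are mine, and the threshold just echoes the 1000-node figure above:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Sketch: hold entries in memory and defer creating the file-backed
// map (and its disk access) until a pre-set number of nodes is reached.
public class LazySpillMap {
  private static final int SPILL_THRESHOLD = 1000;
  private final Map<Long, Long> memory = new HashMap<>();
  private final Supplier<Map<Long, Long>> fileMapFactory;
  private Map<Long, Long> fileMap;  // created only if ever needed

  public LazySpillMap(Supplier<Map<Long, Long>> fileMapFactory) {
    this.fileMapFactory = fileMapFactory;
  }

  public void put(long key, long value) {
    if (fileMap != null) { fileMap.put(key, value); return; }
    memory.put(key, value);
    if (memory.size() >= SPILL_THRESHOLD) {
      fileMap = fileMapFactory.get();  // first disk access happens here
      fileMap.putAll(memory);
      memory.clear();
    }
  }

  public Long get(long key) {
    return fileMap != null ? fileMap.get(key) : memory.get(key);
  }

  public boolean spilled() { return fileMap != null; }

  public static void main(String[] args) {
    LazySpillMap m = new LazySpillMap(HashMap::new);
    for (long i = 0; i < 10; i++) m.put(i, i * 2);
    System.out.println(m.get(3L));    // 6
    System.out.println(m.spilled());  // false: no "file" map created
  }
}
```

A localization routine that never sees a blank node would then never touch the disk at all.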

Anyway, all the big bugs have been hit, and I've put it out in SVN as a release candidate. If it hasn't broken by the time I get back from camping, then it gets released as version 1.1.0. Then we have to do the model renaming (again) and that can be rolled in as a relatively quick release for 1.1.1.

Our time between releases has been terrible, especially when you consider the fixes and features that have gone in, so quickly getting the next point release out should be a step in the right direction.

After that, the next big features have to be in scalability and SPARQL. Frankly, I'm not sure which is more important at this stage. Real scalability changes will take some time to implement, but in the meantime we can do a few small things that will have a noticeable effect. But I don't want to supplant SPARQL development either, so finding the balance will be tricky. Fortunately, Indy is building the SPARQL parser at the moment using JFlex/Beaver (an LALR parser generator), so I know it's in good hands. I've promised to take him through our AST as soon as he's done, so we can hook the two together.


Tonight I got to spend a bit of time online describing the transitive functionality in Mulgara to Ronald at Topaz. It's a technique I came up with a few years ago, where you join a set of statements against itself iteratively. The result is that each join stretches twice as far as its predecessor across the graph. Consequently, you only have to do at most log2(n) joins to find the full transitive closure across the graph. You trade space for speed, but it's not too bad, as the only space you use up is for the statements you were looking for.

I looked up various papers on complexity of transitive closure on a directed graph, and they claim that the most efficient algorithm will be polynomial in log(n). Interestingly, all the papers I found presumed that the graph could contain no loops, but this is not a valid assumption in RDF. To get around this, I also kept a set of nodes that have already been visited, so as to avoid iterating indefinitely when a loop occurs. This increases the memory requirements beyond what I mentioned above, but with no increase in complexity.
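To illustrate the doubling idea, here's a toy sketch over in-memory pairs (this is not the trans code). Each pass joins the current pair set against itself, so reachability doubles per pass; checking whether the set grew acts as the termination test, which also handles loops, playing the same role as the visited set I mentioned:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of doubling transitive closure: join the (from, to) pair set
// against itself until it stops growing. Each pass doubles the reach,
// so a chain of n edges saturates in about log2(n) passes, and the
// growth check terminates cleanly even on cyclic graphs.
public class TransClosure {
  record Pair(long from, long to) {}

  public static Set<Pair> closure(Set<Pair> edges) {
    Set<Pair> result = new HashSet<>(edges);
    while (true) {
      Set<Pair> next = new HashSet<>(result);
      for (Pair a : result) {
        for (Pair b : result) {
          // the self-join: a reaches b's destination via a.to == b.from
          if (a.to() == b.from()) next.add(new Pair(a.from(), b.to()));
        }
      }
      if (next.size() == result.size()) return result;  // fixpoint
      result = next;
    }
  }

  public static void main(String[] args) {
    // Chain 1->2->3->4 plus a loop 4->1: closure must still terminate.
    Set<Pair> edges = new HashSet<>(Set.of(
        new Pair(1, 2), new Pair(2, 3), new Pair(3, 4), new Pair(4, 1)));
    Set<Pair> c = closure(edges);
    System.out.println(c.contains(new Pair(1, 4)));  // true
    System.out.println(c.contains(new Pair(1, 1)));  // true (via loop)
  }
}
```

In Mulgara the join is against indexed statements rather than a nested loop, but the doubling behavior is the same.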

Another nice feature is that "backward-chaining" does not improve the complexity at all. All it does is reduce the amount of memory being consumed, and the length of some of the loops. However, while having a possible impact on performance, neither of these considerations affect complexity.

Andrew told me that he basically copied my code and then modified it to make the walk functionality. I've never looked at walk, but Ronald tells me that the two look almost the same. I did notice, though, that Andrew's name is listed as an author of trans. I know I talked some of it through with Andrew, but I also recall writing at least a lot of it on my own. I should ask him about that.

Ronald is a bright guy, but trying to read the algorithm from code had him stumped. I believe that the code is clear about what it's doing, but the reason for doing it was not clear at all. I tried to explain it to him, but he thought it was operating linearly (and not logarithmically) until suddenly I got a message saying that he'd just been blinded by the light.

There are a lot of very bright people in the semantic web, and it can get a little intimidating sometimes, especially when you're off on your own like I am at the moment. It's reassuring to be told that you can do something clever every once in a while. :-)

I'm also thinking of taking this code, and using it again, only this time with an unbound predicate. This would let you find the shortest path between two nodes with the same complexity as transitive closure. The difference here is that the space requirements would grow significantly, but since we back these things with disk-based storage, maybe that's acceptable. I'm told that several people would like to see this feature.

Thursday, June 21, 2007


While waiting for Eclipse on my notebook to catch up the other day, I finally decided to get around to writing a FOAF file. I didn't go to enormous effort, and just used Leigh Dodds's FOAF-a-matic generator. I'm sure there are more people I can add in, but I didn't have a lot of their details. You can find the file over on the right here.

Once I had the file, I then had to wonder what to do with it. Danny's comment has some useful links, but I was a little disappointed to see how out of date a lot of it is, as some of the services that I tracked down aren't there any more. I was also surprised to learn that there's very little aggregation anywhere.

The FoaF Explorer site requires that you enter an individual's FOAF URL, and then you can just get the details from that file. It picks up on URLs that it finds in the file, and if the file was written to include rdfs:seeAlso links for other people, then it will let you navigate to those people's FOAF pages as well. However, it's all very brittle. Some of the pages suggested on the entry page (like Libby Miller's) are no longer accessible. Many people also link directly to other FOAF files that no longer exist. There is no attempt to remember data, nor to merge it together. For instance, Dan Brickley knows Eric Miller, and you can then navigate to Eric's page (due to an rdfs:seeAlso link). Eric knows several people, including Dan Brickley. However, even though the system has access to information that could take you back to Dan (for instance, his email), there has been no attempt to do so. Consequently, navigating my way around on the site kept taking me to dead ends.

FOAfnaut looked more promising, but I can't get it to respond to anything I do. I submitted my own file, and tried my email as a parameter in the URL, but to no avail. However, I've also found that the SVG on the page is very unresponsive anyway, and won't let me enter any text (though I can change tabs).

There are other sites, and other problems. It all looks very promising, and with great potential, but several of these sites were last edited in 2003, and appear abandoned. It's almost like FOAF proved a concept, and now everyone has moved on to new things. I'd suspect that the idea was dead, except that Danny is still talking about it.

In the meantime, I've gone to the trouble of putting it up, so I should add some more detail to my file. For instance, of the few people I have in there, I haven't made use of rdfs:seeAlso for any of them, so I should add that in. Then at least my link will look more interesting in FoaF Explorer.

Sunday, June 17, 2007

Scaling Details

I have really wanted to record a lot of the details about scalability in Mulgara, but other commitments have conspired against me. I still don't have the time (it's nearly 1am and my head just ain't working right), but I wanted to get some of this information out there, so people know that it's actually real.

I got a couple of private requests to see this data, and so I put some of the details into point form (using Google Documents). This only gives the basic thrust of each item without giving useful details, but I thought it worth posting until I have the time to describe everything properly. A PDF of the document can be found on the Mulgara Wiki.

Saturday, June 16, 2007


After a recent spate of LinkedIn requests I decided to try expanding my own network. Apparently having a large number of connections is a good thing for one's long term career (or so a few pundits in IT have proclaimed), and it never hurts to look after that.

There's a note asking that you "Only invite people you know well and who know you", and until now I've stuck with that maxim. However, most of the invitations I've received in the past were from people I only knew through brief meetings or correspondence, stretching this guideline a little. Also, when requesting a link to a person, the LinkedIn site provides an option of "I don't know this person". So it seems that the requirement is observed more in the breach than in the observance.

I'm still trying to limit my list to people I've actually met, though there are one or two instances where I've simply conversed online (yes, I know you're reading this!). On the other hand, it could well be argued that I know some of those I've corresponded with better than those I've met in person. The end result is that within a week I went from about 5 connections to 30. This in turn made me realize that my profile was shockingly unrepresentative. After all, I filled it out in about 2 minutes when I was first invited to join. I'm still unhappy with it, but it's now a little more serviceable than it was.

I also invited Anne to join. Coming from industrial design and landscape architecture, she hasn't had as much luck finding people she knows, but she's still looking around. I can't imagine that it's penetrated her professional sphere yet, but once it does (if it does) then her connections should expand accordingly.


Danny Ayers was the first to make me think about using Twitter, as he left a note implying that he is now posting tweets. This makes me imagine my mother asking, if Danny jumped off a cliff then would I jump too? Well, if it were Danny then I suppose I would. :-)

Twitter and micro-blogging have also been getting a lot of airplay recently by people referring to Web 2.0, not to mention a few of the Talking with Talis interviews, so I thought I'd check out what the fuss is about.

I tried browsing what strangers are doing at the moment, but I didn't find it particularly interesting. However, there was one post from someone at the Moscone Center that caught my attention, given that Apple's WWDC was on at the time (I'd love to go to this, but could never afford it). Guessing this guy must be involved in IT in some way, I added him to my list to see what he'd do. It didn't tell me much, but following the progress of an individual does have its charm. He was also kind enough to post a link to his blog, giving me some background on what he was doing, and why.

The real test of Twitter for me was whether I'd actually use it after signing up. I thought I'd try for a day or two and reassess at that point. The interface only allows for 140 characters, encouraging me to be brief, hence it didn't take much of my time. At the end of the first day I looked back at my log and realized that it was providing a reasonable breakdown of what I was doing during the day. Consequently, it appears useful for seeing how much time I devoted to certain activities, along with a reminder of just what I was doing and accomplished. My timesheet requirements at fourthcodex only involve a broad description for large blocks of time (as I'm doing internal development, and not customer billable work), but Twitter looks quite useful for my own planning purposes, showing the path I took to reach a point and how effective I was in getting there.

I've been keeping at it, and even updated it with my phone a couple of times ("Going home"). Using the IM interface might be nice, but that service has been down since I first logged in. Twitter's technical support say they're working hard on getting that feature back up.

This may turn out like my blog: I started it for my own benefit, and thought that only my colleagues might read it. The idea was so they'd know what part of the code I was modifying, why I was modifying it, and the issues I had in the process. (I was inspired in this by John Carmack's .plan file, but I moved to a blog within a week or so). Since then, my blog has taken me to the other side of the world, and established some extraordinary contacts that I never expected.

This microblog seems to have a use for me as well. Perhaps if I keep it up, it too will evolve into something unexpected. If not, then at least it will help me fill in my timesheets.

Like LinkedIn, I pointed Anne at Twitter, and she now has an account as well. However, I suspect that she only uses it to see what time I'll be home from work. :-)

Linking Data

My LinkedIn flurry took me to a few blogs I hadn't followed before, including Dave Beckett (who is a rather infrequent poster). It's a few months old now, but I really liked his post on Webby Data.

A few weeks ago I mentioned that I think that too few people use RDF to link data together. Here Dave is pointing out that this is all the semantic web really is.

I initially came into this arena as an engineer trying to implement an efficient system for RDF, but not knowing much about the higher layers (RDFS/OWL, etc.). So for me, beyond describing structures, RDF was always about linking resources. Only later, when learning about OWL, and description logics in general, did I understand a lot of the underlying principles behind RDF, leading me to an understanding of what this concept of "semantics" was about.

But while all the academics are off working out how to best describe the world (and I wonder if I could be counted in that number) the real world has data everywhere that describes many of the same things, and yet that data is disconnected. This needs to be linked together in a web of information so that we all benefit, and it doesn't need fancy ontologies to do it. RDF provides a tool to do just this, and yet the central practitioners are all off working on another problem.

I'm not saying the domain of ontologies isn't valuable - I think that ultimately it's the way forward. But I also think it needs to be built on the web of information that we have the standards for, and yet have only just started to implement. This web is the Semantic Web (the semantics are few - but emergent), and "semantic technology" (like OWL) is the next stage that will take it forward. The former we can do today, while the latter is still being worked on (with some very useful results so far). I've noticed that many people lump the two together, and accord the deficiencies of the one still in development to both.

This is another case where I've recently discovered something that interests me, but only after it has been sitting on the web for months. I should be glad I'm catching up now, but it bugs me that it took so long to do it.

Description Logic Handbook

After relying on the PDFs of this book for years, I finally decided to bite the bullet and actually buy a copy. After all, I have to start my thesis soon, and PDFs are hard to read, while laser-printouts are awkward to sort through. So I went to Amazon to get one (along with some other things), only to be told that the book hasn't been published yet. This was confusing, until I realized that the second edition comes out next month.

I'm obviously not paying attention!

Friday, June 15, 2007


It's a dilemma I've suffered for some time. Do I blog about getting stuff done, or do I get it done? This has had an influence on Mulgara for years, and in more recent times has also been a conflict for university.

Blogging about Mulgara is typically useful, as it gives me time to take stock of what I've done, and also which direction I should be going in. However, for university, it's really substituting one form of writing for another. Comparatively, my blogging lacks coherence and has no formal value at all. But it's not entirely useless, and when I factor in the advantages, particularly for Mulgara, then blogging usually seems to win for me. I suppose this is the basis of my idea of micro-papers. At least the ideas would get written down.

All the same, there's no point in writing about work if you never do any. So tonight was a big debugging session, to follow the big debugging session I had today. Right now I'm running tests, so blogging is suddenly acceptable again.


We've had an issue with N3 files not loading correctly. Every time a blank node appeared more than once, a new one was generated for each occurrence, rather than reusing the original. I know that this code worked in the past, so this bug was really bothering me. But like everything else, it was getting pushed down the stack. However, I'm keen to get the Mulgara release out soon, and this was an important fix. Another incentive was that the same non-reuse of blank nodes was happening when doing a select/insert between models on separate machines. Since I only just wrote the code to make that happen, I wanted to get that code working too.

I started out by setting a breakpoint in StatementStoreResolver.modifyModel(). This gave me a starting point. I should mention that parsing occurs in a background thread, with synchronization happening between appending triples to the queue and removing them from the head. Apparently threading can be an issue, as Eclipse gave me some real grief trying to do simple things like "stepping" over a line. Sometimes pressing the "step" button would gray all the other buttons out, and you couldn't re-enable them without doing something weird, like moving up and down in the call stack.

Regardless of the vagaries of Eclipse, it didn't take long to discover that it was expecting all blank nodes in N3 files to be of the form _:### where the # characters are digits. However, all the files I had were actually in the form _:node###. I've put off fixing this expectation, since the code is expected to manage with non-numeric identifiers, and I wanted to check that it was working first.

It quickly became apparent that the author had expected to not see strings at all, especially when this possibility led to an insertion into a map with the following comment:
// don't expect to use this map
Eventually I noticed that after a blank node was put into the map, any attempt to retrieve it would always return a 0, indicating that nothing was there. But what was going in if zeros were coming out? It turned out that all the blank nodes being created had internal IDs of 0, so the number being stored was a 0, which was like storing nothing at all.

I had a hint that the blank nodes being created should have been given identifiers from the node pool, because the code contained some very suspicious comments:
// need a new anonymous node for this ID
blankNode = // BlankNodeFactory.createBlankNode(nodePool);
            new BlankNodeImpl();
Now it looked like this was wrong, but someone had apparently changed it for a reason. Could I change it back with impunity?

It took a long time to work out exactly what was happening with all of the data, but I finally tracked it all down, and was able to satisfy myself that I needed to create a blank node in the way it used to be done. This left me with two questions. First, who made this change, why, and when? Second, where could I find the node pool to allocate new node IDs?

As for the who/why/when, I can't really say. It occurred sometime between Kowari-1.0 and Kowari-1.1, along with a lot of other refactoring. Now that Kowari's CVS logs have departed for the big Sourceforge bit-bucket in the sky (i.e. the project is now "inactive"), there's no way to find out who. And without the who, I can't really work out the why, since I can't see why it was done.

The appropriate place to get a new node is out of the string pool, since this is the object that manages the node pool. It makes sense too, since adding URIs and literals means allocating a node, and mapping it onto a value. Creating a blank node just means allocating a node, and not doing anything else with it. Unfortunately, the ResolverSession class which accesses the string pool for this kind of functionality doesn't provide the facility to allocate an unmapped node. I hate making changes across a large number of classes (ResolverSession is an interface with a lot of implementations), but once I satisfied myself that it really is a general operation I went ahead and did it anyway.

Of course, it's never smooth sailing, and the method name I chose (newBlankNode) conflicted with a private one already in JRDFResolverSession. So there was yet another class to be modified, but in the end it all worked properly. I can say that now, as the tests finished a few minutes ago.

Unfortunately, this is a hard test to automate. We can never guarantee what blank node IDs we will be given, and all our scripted tests rely on exact string matches. To avoid this problem in the future may need something a little trickier.

Hopefully, this has fixed the distributed insert bug as well. I'll just do a quick test of that, and then I'm off to bed.

Wednesday, June 13, 2007


The Talis interview series continues to be interesting. Today I listened to the interview with Richard Cameron about Citeulike.

I've recently come back from an "interruption of candidature" in my MPhil at UQ. I really needed that break in order to adjust to the new job, living in a new country, and a new baby, all at the same time. Believe me, all those things were hard enough without simultaneously trying to write a thesis. But I'm back now, and I'm trying to get my head back into gear so I can catch up before commencing the thesis (yes, I'm at that stage now).

So hearing about Citeulike at this stage is amazingly fortuitous. I haven't done much with it beyond browsing by tag, but it's impressive how quickly you can get to papers on topics you're interested in, especially if you can start with a given paper or author. I figured that Ian Horrocks would be a good person to start with. I was pleased to see that it easily took me to a number of relevant papers that I hadn't read before (just laziness on my part, as Ian does list his publications), but I also noted that there are numerous papers by Ian that weren't there. I suppose this is consistent with a comment in the interview that Citeulike is a useful tool to supplement a student's efforts, but it can't replace that work.

All of the papers I was interested in were in places like the ACM Portal. I'm told that I can get into it via my university, but that's going to take me a few hours to work out. But I find this interesting, as Ian does Self Archiving of all of his work (like many academics in this area), and yet Citeulike doesn't know about them. It also doesn't have links to the home pages of any of the authors, from which a self-archived, or pre-print copy may be available. In the case of someone like Ian, I could find his homepage and publication list with a simple Google search, so I was surprised there wasn't similar functionality on Citeulike. Maybe I missed something and it was there, but it wasn't immediately apparent.

Towards the end of the Talis interview there was a fascinating discussion of a possible future of peer-reviewing publications on Citeulike. Richard mentioned that this may be difficult to accomplish, as many academics play their cards close to their chest. However, he went on to describe a system practically identical to what I wrote last night. Only, in his case he was talking about rating pre-prints of full academic papers (which I didn't think would work, and even he is skeptical about), whereas I was talking about micro-papers.

Maybe there's something to the idea if two people are discussing similar things in public (not that this blog is very public. It's more of a "Dear Diary..." that a few other people read). My concept of a micro-paper might not work for anyone beyond me, but the other elements are definitely common.

Tuesday, June 12, 2007

Micro Papers

I've been catching up with the Talking with Talis series, and the other day I listened to the interview with Peter Murray-Rust. I found it interesting, as he talked about a lot of things that I'm already familiar with, but his perspective is completely different, as he focuses on academic papers, and started in chemistry rather than computing.

I was intrigued and pleased to hear about the greater body of research which is now being published as Open Access (personally, I rely heavily on authors doing Self Archiving). Peter also described how being in a peer reviewed journal is vital to one's academic career. More precisely, being cited in a peer review journal is important. The peer review process is significant because of the need for veracity, while citation means that your work was actually useful.

While trying to think of what makes Open Access expensive (for the publisher, who then passes this on to the author), it occurred to me that part of it is the review process. While not as reliable, it would be kind of cool to let the public review papers, and have the papers modded up and down accordingly. No one would submit themselves to such vagaries after putting in all the work that an academic paper demands, but it did get me thinking about something slightly different.

The other day I saw a comment in a blog (I thought it was Ian Davis, but I can't find it now) that hit the nail on the head for me. The writer commented that they write fewer academic papers now that they blog more. This makes sense. The commitment to a blog is less, and the payoff much more immediate, but blogging cuts heavily into the time needed for formal writing. So there may be a lot of people writing down a lot of stuff that could become the subject of an academic paper, if only they could get the time to sort it all out and structure it properly. Consequently, a lot of very useful information may be sitting dormant in the blogosphere.

Putting it all together, I started speculating about a site for micro-papers. Kind of a cross between a site for academic papers and blogging. It would provide a forum for posting short reports, ideas, equations, observations, or anything that would be good to put into an academic paper - if only there was time for it. Each post should be tagged with the field(s) of study (I like tagging - it neatly sidesteps all those issues people had trying to figure out where something belongs in a taxonomy). Most importantly, people could rate the posts.

Ratings would be positive or negative, with negative being the one most likely applied to my own posts. Negative ratings would require a critical note (pointing out incorrect equations, unrecognized assumptions, suggestions for missed citations, etc), while a note on positive ratings could be optional. Of course, ratings would be rated too. :-)

The idea is that eventually the best micro-papers would get good rankings, and the best raters would gain in standing within the community (using some sort of scoring mechanism). I'm thinking that the role of "rating" should only be enabled for those fields the user has selected as specialties. New specialties would be allowed, but a lower standing as a rater would apply, until the user built up a reputation in that discipline as well.
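As a toy sketch of the kind of scoring mechanism I mean (every constant and name here is invented on the spot, purely to illustrate the shape of the idea):

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch of per-field rater standing: all values are arbitrary.
class RaterStanding {
  private final Map<String, Double> standing = new HashMap<>();

  double weightFor(String field) {
    // New specialties start with low standing, to be built up over time.
    return standing.getOrDefault(field, 0.1);
  }

  void recordRatingOfRating(String field, boolean helpful) {
    // "Ratings are rated too": helpful ratings raise standing, clamped to [0, 1].
    double s = weightFor(field) + (helpful ? 0.05 : -0.05);
    standing.put(field, Math.max(0.0, Math.min(1.0, s)));
  }

  public static void main(String[] args) {
    RaterStanding r = new RaterStanding();
    for (int i = 0; i < 5; i++) r.recordRatingOfRating("physics", true);
    System.out.printf("physics weight: %.2f%n", r.weightFor("physics"));
  }
}
```

The constants don't matter; the point is only that standing is tracked per field, and that a new specialty starts near the bottom regardless of reputation elsewhere.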

The real benefit is to get good ideas out there and accessed by the community. Constructive rating by interested parties could even lead to collaborations on real-life papers.

The thing about an idea like this is that it needs people to use it. It's something that I would use, but that tells me nothing about the other 6.6 billion people out there. OK, a much smaller portion would want to write anything academic, but a small fraction of 6.6 billion is still a big number. Would any of them like to see a system like this?

Thursday, June 07, 2007


I first heard about Radar Networks last year when conversing with Peter Royal. Since they were in "stealth mode" (I dislike this form of corporate speak, but it avoids pleonasms) there wasn't a lot to learn about them, except that they're working with various technologies tied up with the Semantic Web.

Nova Spivack is a name I only started to hear in more recent times, though I'm sure that if I'd been in the Valley (or even the USA) I'd have heard of him by the mid 90s. It was only during SemTech that I got to hear him speak for the first time, but since then I've started going back through his blog, and just now heard his interview with Paul Miller at Talis. If you want to know more about Nova, I recommend the interview, if only for the background list of projects he's been involved in during his career.

The fact that Nova had been at Thinking Machines caught my attention, for no other reason than the presence of Richard Feynman. I have to digress here, because Feynman is one of my greatest heroes. The entire reason I studied physics (and would like to get back to it one day) can be traced back to Einstein and Feynman (with a little nudge from Shor). Danny Hillis's essay on Feynman has stayed with me for years. I had already read a lot about Feynman, but this essay gave him a characterization you don't usually find elsewhere. It shows him as a person who had his flaws, but someone who thought no more of himself, nor less of others. The two things that really stuck with me from Hillis's essay were Feynman's solution of a digital problem using PDEs (no wonder the man was my hero!), and a comment made about him by a recipient of his occasionally sexist behavior, when she said, "On the other hand, he is the only one who ever explained quantum mechanics to me as if I could understand it."

Thinking Machines did some really cool things, and even without Feynman (Sr) the company is worth mentioning purely on its own merit. So Nova's presence there impressed me.

But the real reason that I'm following what Nova says at the moment is because he makes so much sense. Which is to say, I agree with much of what he has to say. Then there are times, such as during the Talis interview when I wanted to yell at my iPod. However, I was on a bus at the time, so I refrained, for fear of appearing to be a nutter.

All the same, anyone eliciting this sort of reaction from me is worthy of a comment.

Mulgara Optimizations

The reason I wanted to yell at my iPod came in the final 7 minutes of Nova's Talis interview.

Nova described Kowari as the "low end" version of TKS (Tucana Knowledge Server), which wasn't actually true. It was TKS, only it missed a couple of elements, such as the JAAS classes (security) and the distributed querying module. Nova explained that Radar Networks requires massive "federation" capabilities, which was not in Kowari, but then it wasn't really in TKS either.

"Federation" in TKS was really just distributed querying. This meant you could perform joins and unions between data from different servers. It's a useful capability, and no one else was supporting that sort of thing at the time, so this was a reasonably significant feature. After all, a single query could pull in data from an arbitrary number of servers, and form connections efficiently across that data. But I wouldn't describe TKS as having "massive" federation capabilities.

The main problem with this arrangement was that all remote data would come back into the first server described in the query and get joined there. There was no attempt to optimize network traffic. I've already explained this issue when I wrote about the new Mulgara distributed resolver a short while ago. Then, as now, I knew how to solve the problem in the main, along with some heuristics for iterative optimizations once the main work has been done. Then, as now, I just didn't have the time to implement these optimizations.
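To make the issue concrete, here's a toy sketch (invented data and names, nothing to do with TKS's actual code) of the naive strategy: the joining server receives the entire remote relation, even though only one tuple is relevant:

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration: the naive plan ships every remote tuple back to the
// first server before joining, however few of them can possibly match.
class NaiveDistributedJoin {
  // Tuples are two-column rows: [key, value].
  static List<String[]> join(List<String[]> local, List<String[]> remote) {
    List<String[]> results = new ArrayList<>();
    for (String[] l : local)
      for (String[] r : remote)          // every remote tuple crossed the
        if (l[1].equals(r[0]))           // network just to be tested here
          results.add(new String[] {l[0], r[1]});
    return results;
  }

  public static void main(String[] args) {
    List<String[]> local = List.of(new String[] {"alice", "x1"});
    List<String[]> remote = List.of(     // only x1 is ever needed
        new String[] {"x1", "paris"},
        new String[] {"x2", "rome"},
        new String[] {"x3", "oslo"});
    for (String[] row : join(local, remote))
      System.out.println(row[0] + " -> " + row[1]); // prints alice -> paris
  }
}
```

A smarter planner would ship the bindings (here, just "x1") to the remote server and receive only the matching tuples back, which is exactly the network optimization I never had time to implement.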

All the same, if TKS was acceptable because of its distributed queries, then Mulgara should be too, since I've implemented it again. Only this time I've done a better job of it, and I've included more features!

Nova also mentioned that he's aware of Mulgara, and the fact that we're looking at significantly greater scalability through the implementation of the new XA2 storage layer. He sounded interested in that... only there's so much more that he doesn't know about, which I wish I could have told him. I've been talking with David, Andrae and Amit about a new set of ideas which will take the scalability several steps further along. Unfortunately, I just haven't had the time to document it for Mulgara's wiki or in this blog. Incidentally, all the current docs are in Google Docs (which I write about below).

Essentially, there are 4 types of improvements:
  • Optimizations on the existing architecture. Small, quick hits with incremental improvements, but we lose the benefits when we go to XA2.
  • Rearrangement of the indexing to take advantage of parallelism in hardware. It was never done in the first place because we had less capable hardware in 2001. These changes are small, quick hits, that apply to XA2 as well.
  • Restructuring of our systems to move into clustering. There are a remarkable number of opportunities for us here to take advantage of clustering. In particular, the phase-tree architecture lends itself to some interesting possibilities. A lot of this can also be applied to XA2.
  • A whole new design, architected on a colored phase-tree graph and built on top of a "Google File System" style of storage layer.
The first and second items I hope to do in the coming weeks. The third is a big deal, and would need help, since there's just no way I can do this after the kids are in bed, and before I need sleep myself (my only time to work at the moment).

The final one was an "Oh sh**" moment which has had me thinking about it constantly since I first worked it out. It's based on the fact that the phase-tree architecture gives us a mapping of 64-bit tree-node identifiers to immutable blocks of RDF statements, and a GFS-style system lets you create mappings like this that scale out as you add commodity hardware. Believe me, we can make this sucker scale. :-)
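Reduced to a toy sketch (everything here is invented to show the shape of the idea, not the actual design), the core is that a store of write-once blocks keyed by 64-bit node IDs partitions trivially across machines:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy sketch: 64-bit tree-node IDs map to immutable blocks, with the ID
// space hash-partitioned across machines. The scheme is invented for
// illustration, not Mulgara's design.
class PartitionedBlockStore {
  private final List<Map<Long, String>> machines;

  PartitionedBlockStore(int machineCount) {
    machines = new ArrayList<>();
    for (int i = 0; i < machineCount; i++) machines.add(new HashMap<>());
  }

  private Map<Long, String> machineFor(long nodeId) {
    // Simple hash partitioning; adding machines grows total capacity.
    return machines.get((int) Long.remainderUnsigned(nodeId, machines.size()));
  }

  void put(long nodeId, String immutableBlock) {
    // Blocks are written once and never modified (copy-on-write phases),
    // so putIfAbsent captures the immutability.
    machineFor(nodeId).putIfAbsent(nodeId, immutableBlock);
  }

  String get(long nodeId) {
    return machineFor(nodeId).get(nodeId);
  }

  public static void main(String[] args) {
    PartitionedBlockStore store = new PartitionedBlockStore(3);
    store.put(42L, "block of RDF statements");
    System.out.println(store.get(42L));
  }
}
```

The immutability is what makes this attractive: cached or replicated blocks never need invalidating, so there's no coordination cost as the cluster grows.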

The challenge isn't technical, but working out the resources to make this happen. 10 years ago you'd do it by founding a .com startup, but you'd have been raked over the coals, and not actually been given the chance to release anything. Today's OSS community is a much better option, but building the kind of community that can accomplish this will be a herculean effort. The only way I can see forward is to lift the scalability of Mulgara by a couple of orders of magnitude, so we can develop the profile of the project. Then we might have a chance to make this happen.

I mean, who wouldn't want to see a Google-scale RDF store? :-)

Sigh. I just realized that I've now committed myself into writing a lot of detail about how this design works. There go my evenings.

Web 3.0

During his Talis interview, Nova suggested that perhaps we shouldn't even be using the name "Semantic Web". He likes the name "Web 3.0" as it implies that it's the same technology we always had, but we've just been building it up iteratively. Indeed, he even suggests that the 1.0, 2.0 and 3.0 monikers just be used to refer to the decades which have roughly distinguished the capabilities inherent in the web (1.0 for the 90s, 2.0 for the 2000s, and 3.0 for the coming decade).

I was impressed. Someone actually gets it. With all the hype that has surrounded Web 2.0 (AJAX, social networking, linking), the Semantic Web, RDF/OWL, RSS, etc, etc, you'd think that we'd been inventing entirely revolutionary concepts and ideas every few weeks. It's not that these things aren't great (they are), or that they don't enable some new functionality that we could never come up with before (they do). Instead, it's all been a gradual development of abstractions, each of which have allowed us to see just over the next ridge on our climb up the mountain. There have been some really great ideas along the way, but in essence, it's all come down to good ol' engineering, or as Andrae likes to say, "It's just a simple matter of coding."

I still have issues with the names Web 2.0 and Web 3.0. I see Nova's point that version numbers just indicate a progression, but my expectation is that a new major revision number (2.0 -> 3.0, as opposed to a minor revision of 2.0 -> 2.1) implies a revolution in approach and functionality. This is the opposite of what Nova said he's trying to imply. I suppose I also have a problem with the marketing-speak approach taken in adopting the ".0" style of naming.

Marketing-speak makes me cringe. The people using the words often don't know how much their phrase implies, or they mean more than the phrase really carries. Consequently, the meaning of these phrases can be poorly understood, and may evolve over time. Meanwhile the people using them are often using these phrases to sound knowledgeable in a field that they have an incomplete understanding of, or sometimes using them to make other people appear unknowledgeable.

Still, jargon can be really useful, and as I said at the start, it does avoid pleonasms.

Web OS

The concept of a web OS has left me cold for some time. The suggestion has been that the OS is obsolete, and that the entire desktop can be the browser.

I've always felt that this had limited application. It would completely leave multimedia and gaming enthusiasts in the cold. People like many of the graphics features that have been creeping into the OS in recent years (like texturing, alpha blending and 3D effects), and there is nothing in the pipeline yet to get the browser up to these standards. (note: is it time for the next version of VRML yet?) Graphic artists, and other high-end users wouldn't have much support either.

On the other hand, most desktop usage doesn't need this kind of power. Back in the early 90's I was working in computer retail part time while studying engineering. Of course, I was always into the latest and greatest hardware (and still am), but I started to realize that many people were spending far too much money on computer hardware that was well beyond their needs. I would speak with people who had never owned a computer before, and wanted to get one for their small business. They typically had very modest needs that could be filled with a simple record system, a word processor, and a spreadsheet. Word Perfect and Lotus 1-2-3 (both of which ran on an XT!) would have been fine, and yet they were getting 486s with huge hard drives and maximum memory, and running Word 2.0 and Excel on Windows 3.1 (a less stable solution with fewer features, but it did look nice).

We now need absurd levels of processing power just to install a basic operating system and office suite, but 90% of users' needs can be met with what was available in DOS over 15 years ago (dare I say 20?). Seen from that perspective, maybe a web browser can fill the needs of most users.

I recently had to write a document (proposing significant scalability improvements for Mulgara - I hope to get the details up here soon) while on my notebook computer. However, I knew I'd shortly be on my desktop computer at home (a much nicer iMac), and I didn't want to go through the inconvenience of moving documents back and forth. The longer the document lived, the greater the chance that I'd forget to move it to where I'd need it next before shutting down the machine on which I'd composed the latest version.

One solution for me was to use Apple's iDisk. This came as part of the package when I bought some .Mac web space for Anne, and it works pretty well. It's just a WebDAV filesystem with automatic local replication, so it's fast and easy to use. Unfortunately for this purpose, the replication is infrequent, so the notebook probably hasn't uploaded to the server by the time I send it to sleep. Saving directly to WebDAV is slow and can lead to unnecessary pauses. The last option was to move the data up to the .Mac space manually, but then I'm back to where I started.

All of this is the perfect argument for the web-based desktop. So I decided to write the document in Google Documents. I'd already used Google's spreadsheets for similar reasons, though not in any serious way. I know there are competitors, and have heard that some of them are pretty good. On the other hand, I knew where to find Google, and I already have an account there, so it made sense to use theirs (I'm sure statements like that send a shudder through every competitor Google ever had).

The interface was OK, but navigating with the keyboard felt a little clunky. My main problem was with missing features, such as the limited set of styles (something the web taught me to use) and the lack of automatic numbering for headings. It's on the right track, but it has a long way to go. This was basically the same experience I had with the spreadsheet application a while ago.

The big advantage was that I didn't have to save the file, and it was there whenever I got onto a machine to look at it. After the reality of saving/copying files and not having access when you wanted it, this kind of 100% availability was cool. We all know that this is how it's advertised to work, but it's so much nicer when you actually use it. It's enough to make me overlook the UI issues in many cases. It also makes me pleased that Google have such a good track record of keeping their services up, but a little trepidatious all the same. Google Gears suddenly looks really nice, so I'm looking forward to when it enters the user experience.

So for the first time I was feeling receptive to the idea of a Web OS when Nova talked about it recently. I've only mentioned the 24/7 availability of my data from any device connected to the internet, but I am also aware of how far this could spread when data gets linked and integrated across the web, in the ways that Nova describes.

Some of the ideas are compelling, but it still felt restricted to applications like an office suite, rather than the more general computing paradigm implied by integration into the OS. Then, for the first time I saw someone address this issue when Nova said, "When native computation is needed it will take place via embedding and running scripts in the local browser to leverage local resources, rather than installing and running software locally on a permanent basis. Most applications will actually be hybrids, combining local and remote services in a seamless interface."

This statement nicely encapsulates all the issues I had with total integration, and proposes a reasonable way of dealing with them. While browsers don't have the capability for a lot of this yet, people are trying to build it. Firefox is continually expanding in functionality, and AJAX and Google Gears are going some way towards building the required infrastructure. I suppose I can now accept that the integration of these components, along with many others, will get us to the WebOS that Nova was talking about. I look forward to seeing it all come together.

Strengths and Weaknesses

Recent runs of "5 things you probably don't know about me" got me thinking about what I know about me. As it often goes with these things, I ran off on a tangent. This led me back to a question I was asked in a job interview a little over 10 years ago:
What is your greatest strength, and what is your greatest weakness?

This wasn't the last time I was asked this question in an interview, but I always remember that first time.

At the time, I thought that my greatest strength was my ability to learn quickly. Unfortunately, at that age I didn't have a clear idea of my greatest weakness. This was rightly pointed out as being a weakness in itself.

In more recent times, I've learned more about what I'm good at, and what I'm not. But it's only recently that I worked out that I have one set of overriding characteristics, and this is the source of pretty much all my strengths, and my weaknesses. It's a shame that I can't see myself answering this kind of question in an interview ever again.
  • I believe I'm very good at solving problems I can concentrate on, particularly when they have technical solutions. When given a chance, I can see the details very well. This is the engineer in me.
  • At the other end of the scale, I'm pretty good at seeing the bigger picture, and how lots of seemingly unrelated things can fit into it. Given the time to look at something properly, I have pretty good vision.
  • Conversely, I am terrible at having to keep both perspectives in mind at the same time.


Rather than listing things out, what I described above comes down to one real attribute. I have an addictive personality, and I use it pretty well. Whenever I set my mind to something, I enjoy sticking at it until I do well at it (that goes for exercise as well - ask David about me and weight training). It's this that works for me, and against me.

This is the reason why I claimed I could "learn quickly" when I was younger. It's also why I can solve tough problems, or see how to put things together in new ways. I just need that initial interest to get me into it, and I'll stick at it until it's done.

However, if I'm not given the chance to really concentrate on something, then what was my strength works against me. Rather than concentrating on any one thing, I spread my efforts and my productivity goes down. So not being allowed to concentrate on just a few things is where I'm weakest.

Day to Day

So why am I thinking about this? Well I'm in a startup, with few resources. That means that I'm working on several products, plus I'm having to help determine their direction and where they fit in. This plays to my weaknesses, and it's driving me nuts. There are so many things that need my complete concentration, and I just need some time to focus on a few of them. In particular I want to apply significant effort to several areas in Mulgara.

At least it's helped me work out what I'm good at, and why.

Google Searches

A lot of the hits on this blog come from Google. I get to see the last 20, though I only know the time and query and nothing else (check it out here). This evening I just thought I'd check it out, and I saw the following search string that made me laugh:

calculate the value of π using the Monte Carlo method using a quarter-circle in programming C

I'm guessing that the person searching must have already read the post once before, and wanted to come back. That's the first time I've seen someone type the letter π into Google though!
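For anyone landing here from that search, the quarter-circle trick works by sampling points uniformly in the unit square; the fraction falling inside x² + y² ≤ 1 estimates the quarter-circle's area, π/4. A quick sketch of the method (in Java here rather than C):

```java
import java.util.Random;

// Monte Carlo estimate of π using the quarter-circle in the unit square.
class MonteCarloPi {
  static double estimate(int samples, long seed) {
    Random rnd = new Random(seed);  // fixed seed for reproducibility
    int inside = 0;
    for (int i = 0; i < samples; i++) {
      double x = rnd.nextDouble();
      double y = rnd.nextDouble();
      // Inside the quarter-circle of radius 1?
      if (x * x + y * y <= 1.0) inside++;
    }
    // inside/samples ≈ π/4, so scale up by 4.
    return 4.0 * inside / samples;
  }

  public static void main(String[] args) {
    System.out.println(estimate(1_000_000, 1)); // roughly 3.14
  }
}
```

With a million samples the standard error is under 0.002, so the first two decimal places are usually right.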

Another SemTech Day

Writing about SemTech is taking so long because I've had to do a lot of offline writing as well (not to mention my day job). Quite a bit of my writing was Mulgara related, so it'll end up here shortly - promise.

I'm still only halfway through recalling the conference. I suppose that's because a lot happened. More importantly, I met a lot of people, all of whom I wish I could have given more time.

For instance, Henry Story and I had a series of short discussions, involving RDF, Sun, and Mulgara. He was interested in learning more about Mulgara, but the lack of SPARQL was a major impediment for him. I know how important that feature is, so it's something I hope we have soon. Indy is working on that in his own time, so we may have something soon. However, Indy is on vacation at the moment, so I won't hear any more about it until he gets back. Henry also introduced me to one of the managers at Sun, pointing out that Mulgara was developed way back at the start of 2001, and was using NIO from the outset. He didn't seem to know much about RDF, but was interested in our early use of NIO. We had a brief discussion about 64 bit support (which we manage with NIO, but not without jumping through hoops). I was pleased to hear that Sun are planning on new APIs to address the 64 bit problem.

I kept trying to find time (over lunch for instance) when I could sit down and have a long and uninterrupted conversation with Henry, but somehow it never eventuated. Maybe next time. In the meantime, I now have a face to put with his emails. Somehow, having spoken to a person about their work gives their emails much more meaning.

Another person I wanted to speak with was Eyal Oren, due to his work on ActiveRDF. One of the places that Mulgara needs to be used is as the backend for web applications, and in this day and age that means Ruby on Rails. SOAP support (which we have) is only a stopgap measure, and doesn't cut it. We need a real API for non-Java languages. In fact, given the importance of RoR, I fully support an API specifically for Ruby, and ActiveRDF looks like the best option, particularly since it's starting to get wider use.

I saw Eyal from a distance on the first day, but I wasn't sure that the guy I was looking at was the same one I'd seen in Eyal's photo. Even when I realized it was definitely him, I always seemed too busy to give him the time and attention I wanted, so I held back. Finally, on the last day I introduced myself, and we had a productive conversation. The guys in ActiveRDF would like to get another store behind their API, so our goals are aligned there. Also, Eyal tells me that they only use simple SPARQL queries, so these might already parse fine as TQL (since Andrae added features to the parser to make it mostly compatible). We have someone new at fourthcodex who needs to learn about programming with and in Mulgara, so this may be one of the earlier tasks we set him (not the very first task - I'd like him to get some experience with simpler things first).

I have to get in touch with Eyal soon to make this happen.

Wednesday Morning

So after being given 2 beers and 1 (large) glass of red wine on Tuesday night (all over a period of 6 hours), I thought I'd be fine by morning. After all, I'd had much more on Monday night, with no ill effects (surprisingly). Yet, I woke up on Wednesday feeling a little "seedy". The beer was the same I'd had on Monday, so it must have been the red wine. I'll have to be more careful when someone offers me a glass like that (even if it was my boss).

The consequence was that I didn't feel like concentrating during the first session. I'd picked something on NLP. I've done a little work here, but it's not that relevant to me at the moment, so I decided to catch up on emails and RSS instead. I guess that turned out well, as this was when I learned about Danny's new job at Talis, before meeting the Talis guys just a little later that morning.

I ran into Henry again just before the next session, and we went into the presentation given by Susie Stephens, now from Eli Lilly (despite the hundred bios on the web which still say she's from Oracle) and Jeff Pollock from Oracle, who were discussing practical (and commercial) "Enterprise" semantic web systems. Features of various systems were discussed, such as Dartgrid mapping RDBMSs to one another (this had a nice accessible look to it), TopQuadrant's TopBraid IDE for integrating RDBMSs, and also a recap on the WebMethods acquisition of Cerebra to gain their RDF product (WebMethods had been looking at TKS before Tucana went the way it did). I think I was the only person in the room who hadn't realized that Jeff used to work for Cerebra.

Jeff also pointed out a number of commercial RDF systems, including the TKS system from Northrop Grumman, which left me bemused. We already have some improved features over TKS, but I'm looking forward to leaving it in our dust. Should be Real Soon Now. ;-)

Next came lunch, and another exhibit session. Fortunately, this was less crazy than the night before, so there was opportunity to take a break in order to eat. Unfortunately it conflicted with a couple of commercial sessions, and dragged into a standard conference session that I wanted to get to. These included a presentation from AllegroGraph that would have been interesting to see, and talks by both Eyal and Elisa Kendall from Sandpiper Software. Fortunately, I got to hear Elisa speak later, even if it was just a SIG.

Reverse NLP

The next session I made was called Applications of Analogical Reasoning and was by John Sowa from VivoMind Intelligence. I'd heard of John before, but wasn't sure what to expect in this session. Regardless, this was the only session of the whole conference that made me take a step back and say, "Wow".

John started out with the quote,
  A technique is a trick you use twice.
The idea is that analogies allow us to use tricks (or techniques) in multiple places that may not appear to be equivalent at first glance.

Analogical reasoning is the process of finding correlations for objects and relationships in separate ontologies. He provided an example of an analogy discovery system called VAE, which was applied to WordNet to compare the concept types of cat and car. The automatically generated result was:
  • cornea → glass plate
  • mouth → fuel cap
  • stomach → fuel tank
  • bowel → combustion chamber
  • anus → exhaust pipe
The correlation of paths of relationships, and any similarities along the way, can help the algorithms that search for analogies. So "paws" matched to "wheels" because they had a similar function, and there were 4 of each of them. The algorithms specifically cater for extra, or missing, elements along a path, with these elements being recorded as a "deviation". So the paths of:
Cat: mouth, esophagus, stomach, bowel, anus
Car: fuel cap, fuel tank, combustion chamber, muffler, exhaust pipe
matched, but received reduced confidence due to the esophagus not matching anything in the car, and the muffler not matching anything in the cat.
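A toy version of the deviation idea (nothing like VAE's real algorithm; this just walks the two paths with a table of already-discovered analogies and penalizes unmatched elements):

```java
import java.util.List;
import java.util.Map;

// Toy sketch: walk two paths in step, using a table of known analogies;
// elements with no counterpart count as deviations and reduce confidence.
class PathAnalogy {
  static double confidence(List<String> p1, List<String> p2,
                           Map<String, String> analogies) {
    int matched = 0, deviations = 0, i = 0, j = 0;
    while (i < p1.size() && j < p2.size()) {
      if (p2.get(j).equals(analogies.get(p1.get(i)))) {
        matched++; i++; j++;                 // analogous pair lines up
      } else if (!analogies.containsKey(p1.get(i))) {
        deviations++; i++;                   // e.g. esophagus: no counterpart
      } else {
        deviations++; j++;                   // e.g. muffler: no counterpart
      }
    }
    deviations += (p1.size() - i) + (p2.size() - j);  // leftover elements
    return (double) matched / (matched + deviations);
  }

  public static void main(String[] args) {
    Map<String, String> analogies = Map.of(
        "mouth", "fuel cap", "stomach", "fuel tank",
        "bowel", "combustion chamber", "anus", "exhaust pipe");
    double c = confidence(
        List.of("mouth", "esophagus", "stomach", "bowel", "anus"),
        List.of("fuel cap", "fuel tank", "combustion chamber",
                "muffler", "exhaust pipe"),
        analogies);
    System.out.printf("confidence: %.2f%n", c);
  }
}
```

On the cat/car example above this gives 4 matches against 2 deviations (the esophagus and the muffler), so a confidence of 4/6, rather than a flat rejection of the paths for not matching exactly.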

This is all given in much more detail in a paper written by John on his site.

Anyway, it all sounded good, but what can it be used for?

I didn't record all the details (so apologies if I get some of the specifics wrong), but the example given was the task of documenting a banking system in preparation for merging with another system. There were some 30 years of development, with source code in several languages, including Cobol and Java. Every stage of development had used extensive documentation, building up over those 30 years to a corpus of 100MB. Consequently there was a need for new documentation in the form of:
  • A glossary.
  • Data flow diagrams.
  • Process architecture.
  • System context diagrams.
  • Data dictionary.
All of this was supposed to be done in about 6 weeks (a seemingly impossible task).

It is well understood that NLP isn't good enough today to understand the documentation to the extent required here. However, one of the developers working on this decided to approach the problem laterally. Instead of trying to parse the unstructured documentation, he adjusted the NLP system to parse the source code. Of course, the source code had to be parseable, or it wouldn't be much good as source code, would it? The results were then put into a concept graph, which was combined with ontologies to provide the "analogy" for the 100MB of documentation, which could then be parsed with relative ease.

I love it when someone shifts my brain sideways with an idea like this.

Initially the developers asked the system to flag if the deviation between the concept graphs of the code and the documentation was greater than a couple of percent. These were then perused by hand. It didn't take long to decide that the system was proceeding just fine, and the deviation cut-off was increased to something significantly higher. However, a couple of items were still flagged, indicating possible contradictions that had to be manually reviewed to figure out why.

One such item started with the following pair of facts:
  • No humans are computers
  • All employees are human
The problem was flagged when the system tried to describe 2 employees as computers. After carefully going back through the system it was discovered that 20 years before an employee in one department received assistance from 2 computers. However, at the time there was no way to bill for computer time. The workaround was to name the computers "Bob" and "Sally" and enter them into the system as employees. The unintended consequence of this was that Bob and Sally had been issued paychecks for 20 years that they had not been cashing!

So not only did this approach successfully document the system, it also discovered contradictions between the documentation and the implemented system. This was my "Wow" moment.


The keynote didn't really capture me that afternoon, so it was tempting when the others from fourthcodex tried to invite me to lounge on the pool deck. But I stuck with it, and made it into the talk on Advanced OWL by Deborah McGuinness from Stanford and Elisa Kendall (whom I mentioned earlier).

When I first considered postgraduate research at UQ, I tried to get some idea of the research background of those who might be interested in supervising me. Bob was at the top of the list, and in looking him up I found that he had done a lot of work with IBM Almaden, DSTC, and Sandpiper Software. The last was specifically with Elisa Kendall. I was fortunate enough to be introduced to her on Tuesday evening, so I was pleased to get a chance to see one of her talks.

In this case, the talk was really a Special Interest Group meeting (SIG) where various details of OWL were discussed, rather than one of the formal "Sessions". I would have gained more if there had been some discussion about OWL 1.1, which I'm only just starting to look at seriously. However, I was already pretty comfortable with the various concepts in OWL that were being discussed.

The one point I picked up on was to be very careful about using owl:InverseFunctionalProperty (IFP), as it can lead to OWL Full. This is worth noting, because many modeling/schema environments want to use IFP to simulate the RDBMS concept of a "primary key". However, IFP cannot be used on datatype properties in OWL-DL, and datatype properties are exactly what you usually want if you're using IFP to indicate a primary key. The restriction is well documented for OWL-DL, but the whole "primary key" idea makes it easy to forget the rule in the heat of the moment, when you're just trying to make it work.
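For anyone unfamiliar with why IFP behaves like a primary key, here is a minimal sketch. The property name and data are made up for illustration; the point is that an inverse functional property's value uniquely identifies its subject, so two subjects sharing a value must be inferred to be the same individual.

```python
# Illustrative sketch of IFP as a "primary key": if a property is inverse
# functional, two subjects sharing the same value for it must denote the
# same individual. The property (ex:ssn) and data here are hypothetical.
from collections import defaultdict

def ifp_merges(triples, ifp):
    """Return groups of subjects that the IFP forces to be identified."""
    by_value = defaultdict(set)
    for s, p, o in triples:
        if p == ifp:
            by_value[o].add(s)
    return [subs for subs in by_value.values() if len(subs) > 1]

triples = [
    ("ex:alice", "ex:ssn", "123-45-6789"),
    ("ex:a_smith", "ex:ssn", "123-45-6789"),  # same SSN => same person
    ("ex:bob", "ex:ssn", "987-65-4321"),
]
merged = ifp_merges(triples, "ex:ssn")
```

Note that ex:ssn here is a datatype property, which is exactly the kind of usage that pushes an ontology out of OWL-DL and into OWL Full.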

I spent the rest of the time working out that I can build a rule for owl:intersectionOf using the TQL minus operator. Unfortunately, the resulting query is nested into several parts (of the form: A and B minus (C and (D minus E))), making me think that I should really write a resolver to do it easily. Algorithmically it's easy: once you have an intersection, you just have to check that an object has all of the member types in order to meet it. The hard part of a new resolver is usually the syntax for providing everything that it needs. In this case, how do I tell the resolver which graphs to query?
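The "check all the types" part really is that simple. Here is a minimal sketch of it over a toy triple list; the graph layout and class names are invented, and a real resolver would of course work against the store's indexes rather than a Python list.

```python
# Minimal sketch of the membership test behind owl:intersectionOf:
# an individual satisfies an intersection class iff it has every member type.
# The toy graph and names (ex:Dog, ex:Pet) are invented for illustration.

def types_of(graph, individual):
    """Collect all rdf:type values asserted for an individual."""
    return {o for s, p, o in graph if s == individual and p == "rdf:type"}

def satisfies_intersection(graph, individual, member_types):
    """True iff the individual has every type in the intersection."""
    return set(member_types) <= types_of(graph, individual)

graph = [
    ("ex:rex", "rdf:type", "ex:Dog"),
    ("ex:rex", "rdf:type", "ex:Pet"),
    ("ex:wolf1", "rdf:type", "ex:Dog"),
]
# Suppose ex:PetDog is defined as intersectionOf(ex:Dog, ex:Pet):
# ex:rex qualifies, ex:wolf1 does not.
```

The subset test is the whole algorithm; all the difficulty is in plumbing it into the resolver framework and telling it which graphs to consult.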


That night was a little soirée put on by Zepheira in a hotel suite backing onto the pool deck. I was invited by Bernadette, but told that it would be focused on potential clients, and I may not want to stay long. Besides, it would only be running for 90 minutes.


The party had included people of all ilks engaged in everything from flirting to deep technical discussions on semantic technologies. The latter by the younger, single set, and the former by at least half the room. A couple of times I noticed both going on in the same conversation. The 90 minute limit came and went, and the party went on until sometime around midnight. I got to meet Dave Beckett (mentioned several blog entries ago), several W3C notables, and a number of people from all over, each with their own contribution to make to the semantic web. It was a lot of fun.

Due to the exhibit that day, I was still wearing my fourthcodex shirt. After the exhibition sessions I knew that fourthcodex had products providing something unique and useful, and there had been some buzz around what we were offering. But I was surprised when I seemed to be accorded minor celebrity status in some areas of the room. I was even cornered by one potential client who wanted to grill me over the features of our products. Luigi would have been thrilled.


I also got to speak with Susie Stephens about the Oracle RDF implementation.

Susie's a lovely girl, and I enjoyed the conversation. She is also on a horrible conference circuit at the moment, having spent the previous 4 weeks going from conference to conference, and about to head to Austria for another one the following week. She'd been getting a couple of days at home to water the plants every few weeks, and then she was off again. It seems a little rough of Eli Lilly to put Susie through this immediately after joining them, but she seemed to be coping. She had my sympathy nonetheless.

I have lamented before that Oracle had the opportunity to do this right, but instead their implementation has become an advertisement for using Mulgara. The reason is that the implementation is done entirely at the application layer. TKS/Kowari/Mulgara was developed specifically because we discovered that this approach didn't work, and we reasoned that a storage layer specifically designed to handle triples, rather than a table of arbitrary arity, could be much more efficient. I even recall discussing this with DavidM in the Toowong food court, back in 2000. <Insert nostalgia here>

Now Oracle have pretty much captured the scalable data storage market, so they should have the know-how and resources to build the appropriate structures to make an RDF system fly. But instead they built it at the application layer. My question was, "Why?"

Susie explained that she was the one who built the RDF layer. She's a biologist, and she built it as a tool for another project she was working on at the time. The RDF layer is actually built on Oracle's Network Database layer (which is also built at the application layer). So while it offers nice abstraction, this is where the scalability fails.

Susie didn't initially anticipate that her code would go into wide circulation. However, it was apparently in the right place, at the right time, and even showcased the Network Database layer. So the powers-that-be picked it up and ran with it. The rest is history.

Wednesday, June 06, 2007


Yet again I've heard about this issue of trust on the Semantic Web. It's starting to become more important, and I guess I should have predicted the timing.

TBL's slowly-evolving image of the Semantic Web stack has always included Trust up near the top. I remember when we first started doing this, we were seeing all these higher layers which were entirely theoretical. But in the last couple of years, some real implementations have been coming out at all these levels. We may have some way to go before it's all stable and finalized, but now there are several usable implementations offering real functionality at almost every level. So it's to be expected that Trust needs to be addressed now.

Like the other layers before it, the issue of trust is currently being addressed by several vendors and researchers in a tentative way (a search will show lots of proposals and a few implementations). I'm sure the implementors will say that their solutions are more than tentative, and I don't want to disparage anyone's hard work, but I haven't seen anything that's been popularly received yet. However, in typical Web 2.0 fashion it seems that some notions of trust are bubbling to the surface all by themselves.

Social networking sites have major issues with trust. Note: I'm going to avoid the issue of physical safety, such as minors talking to pedophiles, or young women talking to predatory men, etc. These are issues of security and trust that really need to be controlled well, and I'm going to leave them to the experts.

Once people start linking up to one another on sites like Facebook, then issues of social trust start to come to the fore. On one hand, it isn't that big a deal, since it's only social information, so the risk to finances, property, etc. is not really there, leaving the sites open to experimentation, or often to doing nothing at all. On the other hand, things of social value, like reputation, social acceptance, and self esteem, are all on the line.

Since social networking sites concentrate so heavily on teenagers and young adults, things of social value are extremely important issues. In fact, in many cases these social risks matter more to teenagers and young adults than security of finances and property (usually because they HAVE no finances and property of note - well, that's how it was for me and all my friends at that age). So what happens when there's a real need for social trust, and no mechanism built to deal with it? Well, like many features of Web 2.0, something tends to emerge on its own.

Most sites have ways in which popularity can be measured. The first place I ever saw this was moderation (and meta-moderation) on Slashdot. Now most sites have a mechanism for adding votes, or comments, or just plain old links, to popular content. The result is that people tend to bubble up to the top if they have something to say that other people care about. Anyone without anything useful to contribute drops to the bottom. Of course, this is fickle, but in the world of social perceptions, popularity is king whether you choose to recognize it or not.

What happens when you first go to sites like YouTube? Unless you're there to see a specific video, you browse through the most popular stuff. (Actually, I prefer Vimeo as it is about raw social interaction, while YouTube has so much commercial content and even the amateur content usually has some "production" effort behind it). But regardless of whether you're browsing Vimeo, Flickr, or Facebook, the pages you look at first are the ones that others already endorsed in some way.

To me this looks a lot like a precursor to Whuffie. This isn't the only element of Doctorow's book which we are starting to see. Sites like Twitter are connecting groups of people all the time; Google and Wikipedia are available on cell phones 24/7; and the phones are capable of managing video, audio, and text with increasing ease. Not to mention Josh Spear's observation that young people (and others) today feel semi-naked without their cell phone. While Doctorow's book has everyone implanted with this technology, he refers back to a time when everyone used hand-held devices, much like what we're starting to see (note the buzz around the consumer-centered iPhone, vs. the ho-hum responses to upcoming business phones). With each year Doctorow sounds more and more visionary. This is happening despite some people on the fringes actively striving toward a "Bitchun Society", rather than because of it.

I don't spend much time in social networking sites, but the more I do, the more I think that the problem of trust is going to emerge from that space, almost because it has to rather than because someone designs a system of trust.

Thanks to David for finding the latest version of the Semantic Web Layer Cake. There are so many versions now that I kept finding old ones instead.


I hadn't heard much about Talis before SemTech. I suppose I spend too much time on technology and not enough time on how and where it's applied. That's what people like Andrew and Danny are good for, which is why I stay in touch with their blogs.

So I only really started paying attention to Talis when David did his interview on their podcast. This podcast has a fascinating series of interviews with people whose opinions I'm interested in hearing. The only problem I have with it is the intro music. It reminds me of the 70's relaxation music they play while warming up the star projector in the Brisbane Planetarium, so I end up imagining that I'm listening to the interviews in a cavernous auditorium with the lights turned off. But don't let that put you off - the content is good.

So now I had an awareness of Talis, but never had any interaction with them. Then on Tuesday morning I decided to catch up on my RSS feeds, and read that Danny had started employment with them. I didn't even know he was looking for that kind of work. Congrats Danny.

About an hour after reading Danny's news, David introduces me to Ian and Sam from Talis. This is the same Ian mentioned in Danny's blog post about his new job. I know we're all in the same industry, but reading about someone and meeting them an hour later is just scary synchronicity.

Next I hear that Talis may be interested in Mulgara features. These guys are everywhere. But that reminds me: I'd better do some coding.

Tuesday, June 05, 2007

New Media

I like TED. I'd never heard of it before David pointed to Jeff Han's talk last year. Now Andrew is pointing to a really impressive presentation by Blaise Aguera y Arcas about the interactive multi-scale system called SeaDragon, and its stunning integration into PhotoSynth. I believe that it's investments like this that will be the key to Microsoft's long term success, and not operating systems or internet search.

I showed this clip to Anne, and the next thing I hear is her listening to Josh Spear's presentation where he discusses New Media. A lot of the things I heard him talk about are vaguely familiar to me (emphasis on the word "vaguely") but Anne had never heard of them. I'd heard of things like micro-blogging and Twitter, but couldn't see myself spending the time on them.

(Actually, the one constant reminder I have of Twitter is someone subscribed to this blog who sets his GTalk status as "Twittering like a tit". This made me laugh.)

Meanwhile, Anne is learning about all these social networking sites and ideas which she had never heard of before. Some of them are just silly, but in others the potential is staggering - particularly with the teenage set. This isn't something to disregard if you have children of your own who will be an active part of this technology in years to come. She seems both fascinated and horrified. I guess it's interesting for me to observe this response, as I've been aware of it all as it's grown, but have taken the reliable stance of, "I'm going to bury my head in the sand because I'm way too busy to keep up with it all."

Her search tonight has had her sending me links to everything from Startup Search (looking at Krugle), through Pileus, and on to Vimeo, which we both agreed was our favorite. I'm humming the tune as I type. (Notice how the favorite site was one of social networking?)

The next thing I know, Anne is asking if I've heard of 37 Signals. So I had to explain what RoR is, and how it's a great framework for setting up interactive websites. This had her on Scriptaculous 5 minutes later, looking at the source code for the examples, and saying, "That all looks perfectly readable. I thought it would be hard."

(My home life may be about to get more complex.)

I'm always trying to "get things done", so I've avoided undirected surfing for a long while now. But Anne keeps reminding me:
 a) How useful it can be.
 b) How much fun it is.
Maybe I should put time aside for it on occasion.


Wow! A few of you actually updated your RSS feeds! Thank you! :-)

So now there are 6 subscribers using feed readers. I hadn't heard of a couple of these. Maybe I should try them?

Monday, June 04, 2007


For those of you who never noticed (and weren't here when I first mentioned it) there's a little icon at the bottom of the right hand column that does some stats on the views of this blog. There's nothing invasive (at least, I don't think so), but it's kind of interesting (and disturbing) to find out if anyone has paid attention to my little ramblings. You can't miss the icon. It looks just like this:

Go ahead, take a look. It's all public info. But I guarantee it's more interesting to me than it is to you. :-)

I find the most interesting part to be the Search Engine queries in the Referrer stats. People end up on this blog with all sorts of queries. Frustratingly, I often see people asking questions (like "java file unmap") that I know the answer to, but never wrote about here. That's a shame, as I'd be happy to help, but I guess they don't find what they're looking for and just move on.

However, anyone who actually cares about this blog isn't going to show up in any way, because I hadn't been doing anything with RSS. So last week Chris suggested trying out Feedburner. I'd heard of this, but never looked into it before. So I put a new Feedburner link on my page, and the very next day Google announced that they've bought Feedburner. What weird timing.

Anyway, no one is under any obligation to help me at all. All the same I'd appreciate it if you could replace any existing RSS feed with the new one from the top of this page.

Now I just have to work out if there's some way to make the Feedburner stats public too...

Sunday, June 03, 2007

Conference Discussions

A couple of ideas came out of SemTech which I wanted to write down.

Mulgara Reasoning

The first is the reasoning engine in Mulgara. I've been thinking for a little while that we wanted more than one reasoner, but I've finally worked out what we want and why:
  1. The Krule engine, for running OWL inferencing rules scalably over large TBox and ABox sets after loading. OK, I have a personal attachment to this, but it works very well, for several reasons - all of which will be mentioned in my upcoming thesis. :-)
  2. A Rete engine, for change management. This is required for data that is coming into an existing system. Running the full set of rule calculations in this case is prohibitively expensive, so a system that handles changes efficiently will be important. This will be based on Krule (since Krule uses Rete principles, but leverages the indices instead of using memory). I'm also hoping to use some ideas from Christian's work, if I can. Adding data isn't really an issue, but deleting it efficiently is trickier.
  3. A tableaux reasoner. While most queries are managed quite well by inferred data, there are some questions that cannot be solved without infinite inferences. This happens when there are two or more alternate paths that could satisfy a query, and the ontology says that one of the paths will be taken, but does not provide the details of which one. The classic case is described in chapter 2 of the DL handbook as the Oedipus example. For this reasoner I want to use Pellet.
I noticed that all the professional systems using tableaux reasoners are using Racer. This isn't an option for Mulgara, as Racer is a commercial product. However, Pellet has some unique features which make it quite worthwhile. It stays up to date with the standards very well, and the OWL debugger has some really valuable features going for it. Finally, I might even be able to find some way to hook it into the indexes to let it do some of the tableaux reasoning on disk. I don't know about that one, but you never know.

Mulgara Security

Mulgara has never had security, but the hooks have always existed. In the commercial product, authentication and authorization were just handled with JAAS. Once these were established, an RDF description of the models accessible to the current user could be intersected with the models requested.

Since JAAS is easy to implement, I've been wondering about putting this in again. There may be a bit of missing code (or code that suffered bit-rot) but the idea of performing the intersections is an easy one, so it wouldn't be hard. However, David told me that he'd heard from a few people that the JAAS approach is the wrong one to take.

It seems that security conscious people don't like security being managed in the database. Fair enough. So where do we put it?

David had the idea that maybe we put permissions into the database, as always, but access and authentication get done outside of the store, in a gateway interface. Any incoming queries to this interface can be modified to intersect on the models accessible to the currently authenticated party. Any RDF database that performs efficient model intersections (i.e. Mulgara) would work well, while everyone else would merely work correctly.
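The gateway idea can be sketched in a few lines. This is only an illustration of the rewriting step, under my own assumptions: the permission table, user names, and the way the query is rebuilt as FROM clauses are all hypothetical, not anything David or I have designed in detail.

```python
# Rough sketch of the gateway idea: permissions live with the store as a
# per-user set of accessible graphs, and the gateway rewrites each incoming
# query to touch only the intersection of requested and permitted graphs.
# The permission table, users, and graph names are all hypothetical.

PERMISSIONS = {
    "alice": {"ex:public", "ex:finance"},
    "bob": {"ex:public"},
}

def rewrite_query(user, requested_graphs):
    """Restrict a query's dataset to the graphs this user may access."""
    allowed = PERMISSIONS.get(user, set())
    usable = set(requested_graphs) & allowed
    if not usable:
        raise PermissionError(f"{user} has no access to the requested graphs")
    # In SPARQL terms, the intersection becomes the query's FROM clauses.
    froms = " ".join(f"FROM <{g}>" for g in sorted(usable))
    return f"SELECT * {froms} WHERE {{ ?s ?p ?o }}"
```

So a request from "bob" for both graphs would silently narrow to ex:public, while the store itself never has to know about users at all.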

David thinks this sounds like a good open source project. I like the idea so much that I just had to write it down. It would be very cool to provide something like this for every SPARQL compliant database.