Friday, August 03, 2007

Nodalities

Last night Luc was determined to keep me up, and he did a pretty good job of it. This happens frequently enough that it shouldn't be worth a mention in this blog, except that today I had agreed to speak with Paul Miller from Talis, for the Talking with Talis podcast.

So that I'd be compos mentis, I resorted to a little more coffee than usual (I typically have one in the morning, and sometimes have one in the afternoon. Today I had two in the morning). While this had the desired effect of alertness, the ensuing pleonastic babble was a little unfortunate. Consequently, I feel like I've embarrassed myself eight ways to Sunday, though Paul has been kind enough to say that I did just fine.

I was caught a little off guard by questions asking me to describe RDFS and OWL. Rather than giving a brief description, as I ought to have, I digressed much too far into inane examples. I also said a few things which I thought at the time were kind of wrong (by which I mean that I was close, but did not hit the mark), but with the conversation being recorded it felt too awkward to go back and correct myself, particularly when I'd need a little time to think in order to get it right.

Perhaps more frustratingly, my needless digressions and inaccurate descriptions stole from the time that could have been used to talk about things I believe to be more interesting. In particular, I'm thinking of the Open Source process, and how it relates to a project like Mulgara. David was able to give a lot of the history behind the project, but as an architect and developer, I have a different perspective that I think also has some value. I also think that open source projects are pivotal in the development of "software as a commodity", which is a notion that deserves serious consideration at the moment. I touched on it briefly, but I also ought to have elaborated on how open source commodity software is really needed as the fundamental infrastructure for enabling the semantic web, and hence the need for projects like Mulgara, Sesame and Jena.

But despite my missed opportunity to discuss these things today, I should not consider Talis's podcast to be a forum for expressing my own agenda. If I have a real desire to say these things, then I should be using my own forum, and that is this blog.

As always, time is against me, but I'll mention a few of these things, and perhaps I can find time to revisit the others in the coming weeks.

People

I should also have mentioned some of the other names involved in Mulgara, from both the past and present. Fortunately, David already mentioned some of them (myself included) but since I'm in my own blog I can go into some more detail. Whether paid or not, these people all gave a great deal of commitment to making this a project with a lot to offer the community. However, since there are so many, I'll just stick to those who have some kind of ongoing connection to the project:
  • David Wood, who decided we could write Mulgara, made enormous sacrifices to pay for it out of his own pocket... and THEN made it open source! His ongoing contributions to Mulgara are still valuable.
  • David Makepeace (a mentor early in my career, who I was fortunate to work with again at Tucana) who was the real genius behind the most complex parts of the system.
  • Tate Jones, who kept everyone focused on what we needed to do.
  • Simon Raboczi who drove us to use the standards, and ensured the underlying mathematical model was correct.
  • Andrew Newman who knew everything there was to know in the semantic web community, and aside from writing important code, he was the one who wouldn't stop asking when we could overcome the commercial concerns and make the system Open Source.
  • Andrae Muys, the last person to join the inner cabal, and the guy who restructured it all for greater modularity and correctness. This contribution alone cannot be overstated, and since Tucana closed shop he has remained the most committed developer on the project.
  • Collectively, the guys at Topaz, who have provided more support than anyone else since Tucana closed.
These were just some of the guys who made the project worthwhile, and Tucana a great place to work.

Sorry to those I didn't mention.

Even if I move past Mulgara and into a new type of RDF store, then the open source nature of Mulgara will allow me to bring a lot of that intelligence and know-how forward with me. For this reason alone, I think that the Open Source process deserves some discussion.

Architecture

Back when Mulgara (or TKS/Kowari) was first developed, it was interesting to see the schemas being proposed. Looking at them, there was a clear influence from the underlying Description Logic that RDF was meant to represent. However, I was not aware of description logics back then, and instead only knew about RDF as a graph. Incidentally, I only considered RDF/XML to be a serialization of these graphs (a perspective that has been useful over the years), so a knowledge of this wasn't relevant to the work I was doing (though I did learn it).

Since I was graph focused, and not logic focused, I didn't perceive predicates as having a distinct difference from subjects or objects (especially since it is possible to make statements where predicates appear as subjects). Also, while "objects" are different from "subjects" by the inclusion of literal values, this seemed to be a minor annotation, rather than a fundamental difference. Consequently, while considering the "triple" of subject, predicate and object, I started wondering at the significance of their ordering. This led me to drawing them in a triangle, much as you can see in the RDF Icon.


This then led naturally to the three way index we used for the first few months of the project, and is still the basis of our thinking today. Of course, in a commercial environment, we were acutely aware of the need for security, and it wasn't long before we introduced a fourth element to the mix. Initially this was supposed to provide individualized security for each statement (a requested feature), but it didn't take long to realize that we wanted to group statements together, and that security should be applied to groups of statements, rather than each individual statement (else security administration would be far too onerous, regardless of who thought this feature would be a good idea). So the fourth element became our "model", though a little after that the name "graph" became more appropriate.
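To make the three-way index concrete, here is a minimal sketch in Python. It is not Mulgara's code: the class name, the ordering names ("spo", "pos", "osp") and the in-memory sets are all assumptions for illustration. The idea is simply that each statement is stored under three rotations of subject, predicate and object, so that whichever positions are bound in a query, some ordering has them at the front.

```python
# Three cyclic rotations of (subject, predicate, object). The names are
# illustrative assumptions, not Mulgara identifiers.
ORDERINGS = ("spo", "pos", "osp")


class TripleIndex:
    """Toy in-memory store keeping every triple under three orderings."""

    def __init__(self):
        # One set of reordered tuples per ordering. A real store would keep
        # these sorted (e.g. in a tree) so a prefix lookup is a range scan;
        # this sketch just filters the whole set.
        self.indexes = {name: set() for name in ORDERINGS}

    def add(self, s, p, o):
        triple = {"s": s, "p": p, "o": o}
        for name in ORDERINGS:
            self.indexes[name].add(tuple(triple[k] for k in name))

    def find(self, s=None, p=None, o=None):
        """Return matching triples as (s, p, o); None means 'unbound'."""
        query = {"s": s, "p": p, "o": o}
        bound = {k: v for k, v in query.items() if v is not None}
        # Pick an ordering whose leading positions are exactly the bound ones.
        # With these three rotations one always exists.
        for name in ORDERINGS:
            if set(name[:len(bound)]) == set(bound):
                break
        prefix = tuple(bound[k] for k in name[:len(bound)])
        results = []
        for row in self.indexes[name]:
            if row[:len(prefix)] == prefix:
                as_dict = dict(zip(name, row))
                results.append((as_dict["s"], as_dict["p"], as_dict["o"]))
        return sorted(results)
```

For example, after adding a few statements, `store.find(p="type")` is answered from the "pos" ordering and `store.find(s="paul", o="luc")` from the "osp" ordering, without any ordering needing to skip over an unbound position.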

Moving to 4 nodes in a statement led to an interesting discussion, where we tried to determine what the minimum number of indices would be, based on our previous 3-way design. This is what led to the 6 indices that Mulgara uses today. I explored this in much more depth some time later in this blog, with a couple of entries back in 2004. In fact, it is this very structure that allows us to do very fast querying regardless of complexity (and if we don't, then it just needs re-work on the query optimizer, and not our data structures). More importantly, for my recent purposes (and my thesis), this allows for an interesting re-interpretation of the RETE algorithm for fast rule evaluation. This then is our basis for performing OWL inferences using rules.
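The counting argument behind the six indices can be checked with a short Python sketch. The six orderings listed here are an assumption on my part for illustration (this post does not spell them out), as are the function names. The check is that every combination of bound positions appears as the leading columns of at least one ordering. The lower bound comes from noticing that each ordering has exactly one prefix of any given length, so you need at least as many orderings as there are subsets of a given size: 3 for triples, and C(4,2) = 6 for quads.

```python
from itertools import combinations
from math import comb

# Assumed orderings (s=subject, p=predicate, o=object, g=graph).
# Illustrative only; not taken from the Mulgara source.
TRIPLE_ORDERINGS = ("spo", "pos", "osp")
QUAD_ORDERINGS = ("spog", "posg", "ospg", "gspo", "gpos", "gosp")


def covers_all_patterns(orderings):
    """True if every set of bound positions is a prefix of some ordering."""
    positions = orderings[0]
    for size in range(len(positions) + 1):
        for bound in combinations(positions, size):
            if not any(set(o[:size]) == set(bound) for o in orderings):
                return False
    return True


def lower_bound(n):
    """Fewest orderings that could possibly cover n positions.

    Each ordering supplies one prefix of each length, so at least as many
    orderings are needed as there are subsets of the most numerous size.
    """
    return comb(n, n // 2)
```

Under these assumptions `covers_all_patterns(QUAD_ORDERINGS)` holds with exactly `lower_bound(4)` orderings, and dropping any one of the six leaves some pair of bound positions without a matching prefix.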

See? It's all tied together, from the lowest conceptual levels to the highest!

I freely acknowledge that OWL can imply much more than can be determined with rules (actually, that's not strictly true, as an approach using magic sets to temporarily generate possible predicates can also get to the harder answers - but this is not practical). To get to these other answers, the appropriate mechanism is a tableaux reasoner (such as Pellet). However, from experience I believe that most of what people need (and want) is covered quite well with a complete set of rule-based inferences. This was reinforced for me when KAON2 came up with exactly the same approach (though I confess to having been influenced by KAON2 before it was released, in that I was already citing papers which formed the basis of that project).
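As a rough illustration of what rule-based inference means here, the following Python sketch applies two of the standard RDFS entailment rules (rdfs9 and rdfs11) repeatedly until nothing new is produced. It is a deliberately naive fixpoint loop, not the RETE-style evaluation described above and not Mulgara's rules engine; the function name and the example data are assumptions.

```python
SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"


def infer(triples):
    """Apply two RDFS entailment rules until no new triples appear."""
    known = set(triples)
    while True:
        new = set()
        for (a, p1, b) in known:
            if p1 != SUBCLASS:
                continue
            for (x, p2, y) in known:
                # rdfs11: subClassOf is transitive.
                if p2 == SUBCLASS and x == b:
                    new.add((a, SUBCLASS, y))
                # rdfs9: an instance of a subclass is an instance of
                # the superclass.
                if p2 == TYPE and y == a:
                    new.add((x, TYPE, b))
        if new <= known:
            # Fixpoint reached: this pass produced nothing unseen.
            return known
        known |= new
```

Given that Cat is a subclass of Mammal, Mammal a subclass of Animal, and tom is a Cat, the loop derives that tom is also a Mammal and an Animal. Each pass rescans every pair of statements, which is exactly the cost that indexed lookups and a RETE-like network are meant to avoid.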

All the same, while I think Rules will work for most situations, having a tableaux reasoner to fall back on will give Mulgara a more complete feature set. Hence, my desire to integrate Pellet (originally from MIND Lab).

I have yet to look at the internals of Pellet, to see how it stores and accesses its data. I'd love to think that I could use an indexing scheme to help it to scale out over large data sets like rules can, but my (limited) knowledge of the tableaux algorithm says that this is not likely.

Open Source

There are several reasons for liking Pellet over the other available reasoners. First, it is under a license that is compatible with Mulgara. Second, I saw the ontology debugger demonstrated at MIND Lab a couple of years ago, and have been smitten ever since. Third, the work that Christian Halaschek-Wiener presented at SemTech on OWL Syndication convinced me that Pellet is really doing the right thing for scalability on TBox reasoning.

Finally, Pellet is open source. Yes, that seems to be repeating my first point about licenses, but this time I have a different emphasis. The first point was about legal compatibility of the projects. The point I want to make here is that reasoning like this is something that everyone should be capable of doing, in the same way that storing large amounts of data should be something that everyone can do. Open source projects not only make this possible, but if the software is lacking in some way, then it can be debugged and/or expanded to create something more functional. Then the license point comes back again, allowing third party integration and collaboration. This lets people build something on top of all these open source commodities that is a gestalt of all the components. Open source projects enable this, allowing the community to rapidly create things that are conceptually far beyond the component parts.

From experience, I've seen the same process in the commercial and open source worlds. In the commercial world, the growth is extraordinarily slow. This is because of limited budgets, and limited communication between those who can make these things happen. Ideas are duplicated between companies, and resources are spent trying to make one superior to all the others, sometimes ignoring customers' needs (and often trying to tell the customer what they need).

In the open source world, everyone is free to borrow from everyone else's ideas (within license compatibility - a possible bugbear), to expand on them, or to use them as a part of a greater whole. Budgets are less of an issue, as projects have a variety of resources available to them, such as contributing sponsors, and hobbyists. Projects focus on the features that clients want, because often the client is contributing to the development team.

Consider MS-SQL and Oracle. Both are very powerful databases, which have competed now for many years. In a market dominated by these players, it is inconceivable that a new database could rival them. Yet MySQL has been steadily gaining ground for many years, first as a niche product for specialized use, and then more and more as a fully functional server. It still has a way to go to scale up to high end needs as the commercial systems do, but this is a conceivable target for MySQL. In the meantime, I would guess that there are more MySQL installations in the world than almost any other RDBMS available today. Importantly, it got here in a fraction of the time that it took the commercial players.

Semantic Web software has a long way to go before reaching the maturity of products like those I just mentioned. But history has shown us that the way forward is to make the infrastructural software as open and collaborative as possible, enabling everyone to develop at a much higher level, without being concerned about the layers below them. Higher level development has happened with many layers of computing in the past (compilers, OO toolkits, spreadsheets, databases, web scripting languages for server-side and client-side scripting), and the cheaper and more open the lower levels were, the more rapid and functional the high level development became.

It is at this top level that we can provide real value for the world at large, and not just the IT community. It is this that should be driving our development. We should not be striving to make computing better. Computers are just tools. We should be striving to make the world better.

Sounds pretty lofty, I know. Blame the caffeine from this morning wearing off and leaving me feeling light headed. But there has to be some point to it all. This all takes too much work if we indulge in navel gazing by only enabling IT. IT has to enable people outside of its own field or else there is no reason for it to exist, and we will all get caught in another .com bubble-burst.

2 comments:

Anonymous said...

sounds like we need an ASLv2-licensed store implementation for your next work :)

we can always propose a tuplestore implementation as an apache project! freedom from the shackles of the original tucana license

Anonymous said...

Paul - don't be so hard on yourself! The call went well, and you'll rarely (if ever!) like hearing yourself. You (or I, or most other people) always listen back, and think that we didn't quite say what we meant to. For the listener, who doesn't know what you set out to say, it's a whole different experience.

And as to not being 'qualified' to comment on OWL, RDFS, etc? Nonsense. You weren't asked as the world's leading expert on OWL, RDFS or anything else. You were asked as someone who cares about and tries to use these things for real, day-to-day. You're therefore eminently qualified to comment in that capacity... which is what you did.

The podcast is going up onto S3 now, and people will be able to listen for themselves via http://blogs.talis.com/nodalities/ later today. Your words, linked to your written thoughts here, certainly provide a useful developer's perspective on the space, and you really shouldn't be so hard on yourself.

As for all the other things you wished you'd talked about? I'd be happy to get you back for a follow-up any time you want... just make sure Luc sleeps through the night before! ;-)