Showing posts with label Talis. Show all posts

Friday, August 03, 2007

Nodalities

Last night Luc was determined to keep me up, and he did a pretty good job of it. This happens frequently enough that it shouldn't be worth a mention in this blog, except that today I had agreed to speak with Paul Miller from Talis, for the Talking with Talis podcast.

So that I'd be compos mentis, I resorted to a little more coffee than usual (I typically have one in the morning, and sometimes have one in the afternoon. Today I had two in the morning). While this had the desired effect of alertness, the ensuing pleonastic babble was a little unfortunate. Consequently, I feel like I've embarrassed myself eight ways to Sunday, though Paul has been kind enough to say that I did just fine.

I was caught a little off guard by questions asking me to describe RDFS and OWL. Rather than giving a brief description, as I ought to have, I digressed much too far into inane examples. I also said a few things which I thought at the time were kind of wrong (by which, I mean that I was close, but did not hit the mark), but with the conversation being recorded it felt too awkward to go back and correct myself, particularly when I'd need a little time to think in order to get it right.

Perhaps more frustratingly, my needless digressions and inaccurate descriptions stole from the time that could have been used to talk about things I believe to be more interesting. In particular, I'm thinking of the Open Source process, and how it relates to a project like Mulgara. David was able to give a lot of the history behind the project, but as an architect and developer, I have a different perspective that I think also has some value. I also think that open source projects are pivotal in the development of "software as a commodity", which is a notion that deserves serious consideration at the moment. I touched on it briefly, but I also ought to have elaborated on how open source commodity software is really needed as the fundamental infrastructure for enabling the semantic web, and hence the need for projects like Mulgara, Sesame and Jena.

But despite my missed opportunity to discuss these things today, I should not consider Talis's podcast to be a forum for expressing my own agenda. If I have a real desire to say these things, then I should be using my own forum, and that is this blog.

As always, time is against me, but I'll mention a few of these things, and perhaps I can have time to revisit the others in the coming weeks.

People

I should also have mentioned some of the other names involved in Mulgara, from both the past and present. Fortunately, David already mentioned some of them (myself included) but since I'm in my own blog I can go into some more detail. Whether paid or not, these people all showed a great deal of commitment to making this a project with a lot to offer the community. However, since there are so many, I'll just stick to those who have some kind of ongoing connection to the project:
  • David Wood, who decided we could write Mulgara, made enormous sacrifices to pay for it out of his own pocket... and THEN made it open source! His ongoing contributions to Mulgara are still valuable.
  • David Makepeace (a mentor early in my career, who I was fortunate to work with again at Tucana) who was the real genius behind the most complex parts of the system.
  • Tate Jones, who kept everyone focused on what we needed to do.
  • Simon Raboczi who drove us to use the standards, and ensured the underlying mathematical model was correct.
  • Andrew Newman who knew everything there was to know in the semantic web community, and aside from writing important code, he was the one who wouldn't stop asking when we could overcome the commercial concerns and make the system Open Source.
  • Andrae Muys, the last person to join the inner cabal, and the guy who restructured it all for greater modularity and correctness. This contribution alone cannot be overstated, and since Tucana closed shop he has remained the most committed developer on the project.
  • Collectively, the guys at Topaz, who have provided more support than anyone else since Tucana closed.
These were just some of the guys who made the project worthwhile, and Tucana a great place to work.

Sorry to those I didn't mention.

Even if I move past Mulgara and into a new type of RDF store, then the open source nature of Mulgara will allow me to bring a lot of that intelligence and know-how forward with me. For this reason alone, I think that the Open Source process deserves some discussion.

Architecture

Back when Mulgara (or TKS/Kowari) was first developed, it was interesting to see the schemas being proposed. Looking at them, there was a clear influence from the underlying Description Logic that RDF was meant to represent. However, I was not aware of description logics back then, and instead only knew about RDF as a graph. Incidentally, I only considered RDF/XML to be a serialization of these graphs (a perspective that has been useful over the years), so a knowledge of this wasn't relevant to the work I was doing (though I did learn it).

Since I was graph focused, and not logic focused, I didn't perceive predicates as having a distinct difference from subjects or objects (especially since it is possible to make statements where predicates appear as subjects). Also, while "objects" are different from "subjects" by the inclusion of literal values, this seemed to be a minor annotation, rather than a fundamental difference. Consequently, while considering the "triple" of subject, predicate and object, I started wondering at the significance of their ordering. This led me to drawing them in a triangle, much as you can see in the RDF Icon.


This then led naturally to the three way index we used for the first few months of the project, and is still the basis of our thinking today. Of course, in a commercial environment, we were acutely aware of the need for security, and it wasn't long before we introduced a fourth element to the mix. Initially this was supposed to provide individualized security for each statement (a requested feature), but it didn't take long to realize that we wanted to group statements together, and that security should be applied to groups of statements, rather than each individual statement (else security administration would be far too onerous, regardless of who thought this feature would be a good idea). So the fourth element became our "model", though a little after that the name "graph" became more appropriate.

Moving to 4 nodes in a statement led to an interesting discussion, where we tried to determine what the minimum number of indices would be, based on our previous 3-way design. This is what led to the 6 indices that Mulgara uses today. I explored this in much more depth some time later in this blog, with a couple of entries back in 2004. In fact, it is this very structure that allows us to do very fast querying regardless of complexity (and if we don't, then it just needs re-work on the query optimizer, and not our data structures). More importantly, for my recent purposes (and my thesis), this allows for an interesting re-interpretation of the RETE algorithm for fast rule evaluation. This then is our basis for performing OWL inferences using rules.

See? It's all tied together, from the lowest conceptual levels to the highest!
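The index counts above (three for triples, six once the graph element is added) can be checked mechanically. What follows is only a sketch under stated assumptions: it treats an index as an ordering of the statement's positions, and says a query pattern is answerable when its bound positions form a leading prefix of some ordering. The specific orderings listed, and the names `prefix_sets`, `uncovered_patterns` and `lower_bound`, are illustrative choices and not taken from Mulgara's source.

```python
from itertools import combinations
from math import comb

# Index orderings over S(ubject), P(redicate), O(bject), G(raph).
# Assumption: these particular orderings are chosen for illustration.
TRIPLE_INDEXES = ["SPO", "POS", "OSP"]
QUAD_INDEXES = ["SPOG", "POSG", "OSPG", "GSPO", "GPOS", "GOSP"]


def prefix_sets(ordering):
    """Sets of positions that form a leading prefix of one index ordering."""
    return {frozenset(ordering[:k]) for k in range(len(ordering) + 1)}


def uncovered_patterns(orderings):
    """List the bound-position patterns that no ordering has as a prefix."""
    positions = sorted(set(orderings[0]))
    covered = set()
    for ordering in orderings:
        covered |= prefix_sets(ordering)
    missing = []
    for k in range(len(positions) + 1):
        for bound in combinations(positions, k):
            if frozenset(bound) not in covered:
                missing.append("".join(bound))
    return missing


def lower_bound(n):
    """Each ordering has exactly one prefix of each length, so covering
    every k-element pattern needs at least C(n, k) orderings. The largest
    such count is at k = n // 2."""
    return comb(n, n // 2)


if __name__ == "__main__":
    print(uncovered_patterns(TRIPLE_INDEXES), lower_bound(3))
    print(uncovered_patterns(QUAD_INDEXES), lower_bound(4))
    # Dropping any one quad ordering leaves some pattern without an index.
    print(uncovered_patterns(QUAD_INDEXES[:-1]))
```

With three positions the bound is three orderings, and with four it is six, which matches the step from the 3-way index to 6 indices described above.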

I freely acknowledge that OWL can imply much more than can be determined with rules (actually, that's not strictly true, as an approach using magic sets to temporarily generate possible predicates can also get to the harder answers - but this is not practical). To get to these other answers, the appropriate mechanism is a tableaux reasoner (such as Pellet). However, from experience I believe that most of what people need (and want) is covered quite well with a complete set of rule-based inferences. This was reinforced for me when KAON2 came up with exactly the same approach (though I confess to having been influenced by KAON2 before it was released, in that I was already citing papers which formed the basis of that project).
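To give a flavor of what a rule-based inference looks like, here is a deliberately naive sketch. It applies two RDFS-style rules (subclass transitivity and type propagation) over a set of triples until nothing new appears. It uses plain nested loops rather than RETE or an index, it is not how Mulgara evaluates rules, and the `ex:` names and the `infer` function are made up for the example.

```python
RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"


def infer(triples):
    """Apply two RDFS-style rules repeatedly until a fixpoint is reached."""
    facts = set(triples)
    while True:
        new = set()
        for (a, p1, b) in facts:
            if p1 != SUBCLASS:
                continue
            for (c, p2, d) in facts:
                # Rule 1: a subClassOf b, b subClassOf d => a subClassOf d
                if p2 == SUBCLASS and c == b:
                    new.add((a, SUBCLASS, d))
                # Rule 2: c type a, a subClassOf b => c type b
                if p2 == RDF_TYPE and d == a:
                    new.add((c, RDF_TYPE, b))
        if new <= facts:
            return facts
        facts |= new


if __name__ == "__main__":
    # Hypothetical data, for illustration only.
    data = {
        ("ex:Dog", SUBCLASS, "ex:Mammal"),
        ("ex:Mammal", SUBCLASS, "ex:Animal"),
        ("ex:rex", RDF_TYPE, "ex:Dog"),
    }
    for statement in sorted(infer(data) - data):
        print(statement)
```

The inner loop rescans every statement for each subclass statement, which is exactly the kind of work that an index lookup on a bound predicate would avoid.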

All the same, while I think Rules will work for most situations, having a tableaux reasoner to fall back on will give Mulgara a more complete feature set. Hence, my desire to integrate Pellet (originally from MIND Lab).

I have yet to look at the internals of Pellet, to see how it stores and accesses its data. I'd love to think that I could use an indexing scheme to help it to scale out over large data sets like rules can, but my (limited) knowledge of the tableaux algorithm says that this is not likely.

Open Source

There are several reasons for liking Pellet over the other available reasoners. First, it is under a license that is compatible with Mulgara. Second, I saw the ontology debugger demonstrated at MIND Lab a couple of years ago, and have been smitten ever since. Third, the work that Christian Halaschek-Wiener presented at SemTech on OWL Syndication convinced me that Pellet is really doing the right thing for scalability on TBox reasoning.

Finally, Pellet is open source. Yes, that seems to be repeating my first point about licenses, but this time I have a different emphasis. The first point was about legal compatibility of the projects. The point I want to make here is that reasoning like this is something that everyone should be capable of doing, in the same way that storing large amounts of data should be something that everyone can do. Open source projects not only make this possible, but if the software is lacking in some way, then it can be debugged and/or expanded to create something more functional. Then the license point comes back again, allowing third party integration and collaboration. This lets people build something on top of all these open source commodities that is a gestalt of all the components. Open source projects enable this, allowing the community to rapidly create things that are conceptually far beyond the component parts.

From experience, I've seen the same process in the commercial and open source worlds. In the commercial world, the growth is extraordinarily slow. This is because of limited budgets, and limited communication between those who can make these things happen. Ideas are duplicated between companies, and resources are spent trying to make one superior to all the others, sometimes ignoring customers' needs (and often trying to tell the customer what they need).

In the open source world, everyone is free to borrow from everyone else's ideas (within license compatibility - a possible bugbear), to expand on them, or to use them as a part of a greater whole. Budgets are less of an issue, as projects have a variety of resources available to them, such as contributing sponsors, and hobbyists. Projects focus on the features that clients want, because often the client is contributing to the development team.

Consider MS-SQL and Oracle. Both are very powerful databases, which have competed now for many years. In a market dominated by these players, it is inconceivable that a new database could rival them. Yet MySQL has been steadily gaining ground for many years, first as a niche product for specialized use, and then more and more as a fully functional server. It still has a way to go to scale up to high end needs as the commercial systems do, but this is a conceivable target for MySQL. In the meantime, I would guess that there are more MySQL installations in the world than almost any other RDBMS available today. Importantly, it got here in a fraction of the time that it took the commercial players.

Semantic Web software has a long way to go before reaching the maturity of products like those I just mentioned. But history has shown us that the way forward is to make the infrastructural software as open and collaborative as possible, enabling everyone to develop at a much higher level, without being concerned about the layers below them. Higher level development has happened with many layers of computing in the past (compilers, OO toolkits, spreadsheets, databases, web scripting languages for server-side and client-side scripting), and the cheaper and more open the lower levels were, the more rapid and functional the high level development became.

It is at this top level that we can provide real value for the world at large, and not just the IT community. It is this that should be driving our development. We should not be striving to make computing better. Computers are just tools. We should be striving to make the world better.

Sounds pretty lofty, I know. Blame the caffeine from this morning wearing off and leaving me feeling light headed. But there has to be some point to it all. This all takes too much work if we indulge in navel gazing by only enabling IT. IT has to enable people outside of its own field or else there is no reason for it to exist, and we will all get caught in another .com bubble-burst.

Saturday, June 23, 2007

Talis

Lately I've been hanging out in #talis on IRC. It's a corporate IRC channel, but unlike other companies (like mine) they've put it on a public server where anyone can find it and join in. Not only that, but some of the staff blogs point out that they can usually be found there.

Well, that seemed like an open invitation, so I've been joining in for the last few days. :-)

It's exactly the sort of thing you'd expect, in that it is just discussion between staff facilitating the various tasks they are involved with. Of course, there is a lot of humor as well, which I've appreciated, since American humor is different to my own. We spell it differently, for a start!

I was particularly interested to see that they hold their daily Scrum meetings in IRC. I suppose this is necessary, given that they have people all over. The other thing is that I've enjoyed saying hello to people whose blogs or emails I've been reading for years, but have never interacted with before. Well, almost never. One of them responded to something I said at the end of a post a few years ago, and I've received very occasional emails from him ever since (and I occasionally pester him with stuff - he'll know who I mean if he's reading this). :-)

Overall, between IRC and their podcasts, they seem to be a bright group of people, and are destined to make waves in the semantic web. I'm curious to see where they go.

Wednesday, June 06, 2007

Talis

I hadn't heard much about Talis before SemTech. I suppose I spend too much time on technology and not enough time on how and where it's applied. That's what people like Andrew and Danny are good for, which is why I stay in touch with their blogs.

So I only really started paying attention to Talis when David did his interview on their podcast. This podcast has a fascinating series of interviews with people whose opinions I'm interested in hearing. The only problem I have with it is the intro music. It reminds me of the 70's relaxation music they play while warming up the star projector in the Brisbane Planetarium, so I end up imagining that I'm listening to the interviews in a cavernous auditorium with the lights turned off. But don't let that put you off - the content is good.

So now I had an awareness of Talis, but never had any interaction with them. Then on Tuesday morning I decided to catch up on my RSS feeds, and read that Danny had started employment with them. I didn't even know he was looking for that kind of work. Congrats Danny.

About an hour after reading Danny's news, David introduces me to Ian and Sam from Talis. This is the same Ian mentioned in Danny's blog post about his new job. I know we're all in the same industry, but reading about someone and meeting them an hour later is just scary synchronicity.

Next I hear that Talis may be interested in Mulgara features. These guys are everywhere. But that reminds me: I'd better do some coding.