Friday, August 31, 2007

Lunar Eclipse

A friend of mine took a series of photos of the recent lunar eclipse, as seen from Brisbane. They're not serious astrophotography, but for someone like me who doesn't have much of a view of the sky any more, they were nice to see. (I miss my telescope).

I particularly like the bright exposure of the final crescent, and again when it re-emerges. It's also a nice example of how even the simplest of telescopic lenses is able to see the Galilean moons.

I recommend using the "Slideshow" view, to watch the progression of the full-size photos.

Thursday, August 30, 2007

FOAF

As with David's post yesterday, there have been a number of discussions in recent months about best practice for URIs that identify people. I've typically stayed out of the public debates, but have been involved in a number of offline conversations.

A popular approach to building these URIs is to configure an HTTP server such that when it receives a request for this URI it responds with an HTTP 303 (which means "See Other"). This lets the server respond with a document pertaining to that URI, but at the same time informs the client that this document is NOT the resolution of that URI. After all, the resolution of the URI is a person, and one can hardly respond with that (for a start, you'd need all that quantum state, and I haven't yet seen the internet protocols for quantum teleportation).
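
Incidentally, this behaviour is easy to observe for yourself. Here's a minimal Java sketch that asks for a URI without automatically chasing redirects, so the 303 and its "Location" header stay visible (the URI is just a placeholder; substitute any address served this way):

    import java.net.HttpURLConnection;
    import java.net.URL;

    public class SeeOtherCheck {
        public static void main(String[] args) throws Exception {
            // Placeholder person URI; substitute one that is served with a 303.
            URL personUri = new URL("http://example.com/people/alice");

            HttpURLConnection conn = (HttpURLConnection) personUri.openConnection();
            conn.setInstanceFollowRedirects(false);   // don't silently chase the 303
            int status = conn.getResponseCode();

            if (status == HttpURLConnection.HTTP_SEE_OTHER) {   // 303
                // The Location header names a document *about* the resource,
                // not the resource itself.
                System.out.println("See Other: " + conn.getHeaderField("Location"));
            } else {
                System.out.println("Status: " + status);
            }
            conn.disconnect();
        }
    }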

Another approach is to simply use the URI of a document describing the person, and tack an anchor onto the end, typically #me. Like the 303 approach, this gives you a retrievable document that can be found from the person's URI, and again that document has a different URI to the URI of the person. The main problem cited with this approach is that a #me anchor may actually exist in the document, meaning that the URI resolves to something other than the person. (I recently learned that URI ambiguity is not strictly illegal, but it is a really bad idea: we usually rely on these things to identify exactly one thing.) Other people suggest avoiding possible anchor ambiguity with a query (?key=value on the end of the URL). This is much less popular, and I'll let the public arguments against it stand for themselves.

While looking at the "303" approach the other day, I realized that both Safari and Firefox respond to a 303 as if it were a redirection. This makes sense in several ways. If a user has asked for "something" by address, then they'd like to see whatever data is associated with that address (as opposed to a response of "not here"). The HTTP RFC also says that the link in a 303 should be followed. Even so, since the resulting document is NOT what was asked for, the user should at least be told that they are looking at the "Next Best Thing", rather than being silently redirected.

I came to all of this while updating my FOAF file the other day. While it is possible to describe all of your friends in minute detail, the normal practice is to include just enough information to uniquely identify them (plus a couple of things that are useful to keep locally, like the friend's name). Then when your FOAF file and your friends' FOAF files are brought into the same store, all that information gets linked up. This sounds great, until you realize that there is no defined way to find your friends' files. The various FOAF browsers, surfers, etc. that I've tried are all terrible at tracking down people's FOAF files, so whatever they're doing isn't working very well either.

Whether using anchor suffixes or 303s, the URI that people often use for themselves just happens to lead you to their own FOAF file. This would be the solution to the problem of finding your friends' files... if your friends all happened to use this approach. While useful, it can't be relied upon for automatic FOAF file gathering. Because of this, I decided I should try to include explicit links to all of the FOAF URLs I know for my friends. This led me to track down the files of each of the people in my FOAF file (fortunately not many, as most of the people I know don't have one), which had me following various 303 links, like the one for Tom's URI. I was using wget, which doesn't follow a "See Other" link automatically, and this was how I discovered that Tom was using a 303. I'm sure that if I'd followed his URI with Firefox I wouldn't have noticed the new address.

After following the links for all these people, I then wanted some way to describe the location of each person's FOAF file in my own FOAF description of them. After some investigation of the FOAF namespace, I discovered that there is no specified way to do this. I suppose this is what led to the de facto standard people have adopted, where their person URI leads you (however indirectly) to their FOAF file. This actually makes perfect sense, as you don't want to invalidate people's links to you just because you chose to move the location of your file, but it's still annoying if you want to be able to link to other people's files. Perhaps everyone should get a PURL address?

The closest thing I could find to a property describing a FOAF file is the more general <foaf:homepage>. This property lets you link a resource (like a person) to a document about that resource. It meets the criteria of what I was looking for, but it is also more general than I was after, since it can also point to non-FOAF pages, like a person's home page (the original intent of the property). All the same, I went with it, since it was a valid thing to do. At least it will help any applications that I write to look at my own file. It's a shame that it's so manual.
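
If you were maintaining the file programmatically rather than by hand, adding such a link is only a few lines with a toolkit like Jena. This is just a sketch: the names and addresses are placeholders, and the package names are those of the Jena releases of the time:

    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.rdf.model.ModelFactory;
    import com.hp.hpl.jena.rdf.model.Property;
    import com.hp.hpl.jena.rdf.model.Resource;

    public class FoafHomepageLink {
        static final String FOAF = "http://xmlns.com/foaf/0.1/";

        public static void main(String[] args) {
            Model model = ModelFactory.createDefaultModel();
            model.setNsPrefix("foaf", FOAF);

            Property name = model.createProperty(FOAF, "name");
            Property homepage = model.createProperty(FOAF, "homepage");

            // A placeholder description of a friend: just enough to identify him,
            // plus a pointer to a page describing him (FOAF or otherwise).
            Resource friend = model.createResource("http://example.com/people/tom#me");
            friend.addProperty(name, "Tom");
            friend.addProperty(homepage, model.createResource("http://example.com/tom/"));

            model.write(System.out, "RDF/XML-ABBREV");
        }
    }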

While thinking about how to automate this process, it occurred to me that I could try the following:
  • If a person's URI ends in an anchor, then strip it off and follow the URI. If the returned document is RDF, then treat it as FOAF data (identifying RDF as being FOAF or not is another problem).
  • Follow the person's URI, and if the result is a 303, then follow that URI. If the resulting document is RDF, then treat it as FOAF.
  • Iterate through each URI associated with the person (such as <foaf:homepage>), and if any of these returns an RDF file, then treat it as FOAF.
  • On each of the HTML pages returned from the previous steps, check for <a href=...> tags pointing to resources that don't end with .html, .jpg, .png, etc. If querying any of these links returns an RDF file, then treat it as FOAF.
Incidentally, Tom's FOAF file would only be picked up by the last step. You have to follow his URI to get a 303, which leads you to his home page, and on that page you'll find links to his FOAF file. Frankly, it was just easier to manually add a <foaf:homepage> link for him to my own file. :-)
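
For what it's worth, the first couple of steps of that heuristic are simple enough to sketch in Java. Everything here is a rough outline rather than working FOAF-surfing code: the content-type test is a crude stand-in for "is this RDF?", and the URI is a placeholder:

    import java.net.HttpURLConnection;
    import java.net.URI;
    import java.net.URL;

    public class FoafFinder {

        /** Try to find a FOAF document for a person URI. Returns null if nothing likely was found. */
        public static URL findFoaf(String personUri) throws Exception {
            // Step 1: strip any #me style fragment, leaving the document URI.
            URI uri = new URI(personUri);
            URI docUri = new URI(uri.getScheme(), uri.getSchemeSpecificPart(), null);

            // Step 2: fetch it, and follow a single 303 "See Other" by hand.
            URL url = docUri.toURL();
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setInstanceFollowRedirects(false);
            if (conn.getResponseCode() == HttpURLConnection.HTTP_SEE_OTHER) {
                url = new URL(url, conn.getHeaderField("Location"));
                conn = (HttpURLConnection) url.openConnection();
            }

            // Step 3: crude test for RDF. Deciding whether the RDF is actually
            // FOAF is a separate problem, as noted above.
            String type = conn.getContentType();
            if (type != null && (type.startsWith("application/rdf+xml") || type.contains("rdf"))) {
                return url;
            }
            return null;
        }

        public static void main(String[] args) throws Exception {
            // Placeholder URI; substitute a real person URI.
            System.out.println(findFoaf("http://example.com/people/tom#me"));
        }
    }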

Anachronism

During the various conversations I've had (mostly with Tom), it occurred to me that there is an underlying assumption that all URIs will be HTTP. This is particularly true for 303 responses, as this is an HTTP response code. However, nothing in RDF says that the protocol (or scheme, in URI terminology) has to be HTTP. For instance, it isn't unheard of to find resources at the end of an ftp://... URL. This got me wondering how badly it would break existing systems if the URIs used for and in a FOAF file were not HTTP, but something different. Any system that handles more than HTTP is almost certain to handle FTP (and probably HTTPS), so those weren't going to really test things. No, the protocol I chose was Gopher.

The GoFish server managed the details for me here, though it took a bit of debugging to realize that it wasn't starting because it couldn't find a "gopher" user/group on my system (Apple didn't retain that account on OS X. Go figure). Once I'd found that problem, it took me a few more minutes to discover that addresses for text files in the root are prefixed with 00/. But once that was done I was off and running.
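
If you've never spoken gopher, the protocol is charmingly simple: open a connection to port 70, send a selector followed by CRLF, and read whatever comes back. A little Java sketch (the host and selector here are placeholders for wherever the file ends up living):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.OutputStreamWriter;
    import java.io.Writer;
    import java.net.Socket;

    public class GopherGet {
        public static void main(String[] args) throws Exception {
            String host = "example.com";      // placeholder host
            String selector = "0/foaf.rdf";   // placeholder selector for a text item

            Socket socket = new Socket(host, 70);
            Writer out = new OutputStreamWriter(socket.getOutputStream(), "US-ASCII");
            out.write(selector + "\r\n");     // the whole request: selector + CRLF
            out.flush();

            BufferedReader in = new BufferedReader(
                    new InputStreamReader(socket.getInputStream(), "US-ASCII"));
            for (String line; (line = in.readLine()) != null; ) {
                System.out.println(line);
            }
            socket.close();
        }
    }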

I'm not a huge fan of running services from my home PC, so I can't say that I'll keep it up for a long time. But at the same time, it gives me some perverse pleasure to hand out my FOAF file as a gopher address. :-)

Sunday, August 05, 2007

Mulgara on Java 6

I'm nearly at the point where I can announce that Mulgara runs on Java 6 (or JDK 1.6). Many of the problems were due to tests expecting exact output, while the new hash table implementations iterate over their data in a different order.
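
The fix for that class of failure is simple enough: stop comparing exact output, and compare the data in an order-independent way. A generic illustration (not Mulgara's actual test code):

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;

    public class OrderIndependentCheck {
        public static void main(String[] args) {
            List<String> expected = Arrays.asList("alice", "bob", "carol");
            List<String> actual   = Arrays.asList("carol", "alice", "bob"); // new iteration order

            // Comparing ordered output fails when the iteration order changes...
            System.out.println("as lists: " + expected.equals(actual));      // false

            // ...so compare as sets (or sort both sides first) when order is irrelevant.
            System.out.println("as sets:  " + new HashSet<String>(expected)
                                                  .equals(new HashSet<String>(actual))); // true
        }
    }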

The remaining problems fall into two areas.

The first seems to be internal to the implementation of the Java 6 libraries. In one case the HTTPS connection code is unable to find an internal class that lives in the same place in both Java 5 and Java 6. In another case, a method whose javadoc explicitly says it performs no caching claims that it has "already closed the file" on every attempt after the first to open a JAR file via a URL, even though the URL object is newly created for each attempt.

These problems may involve some browsing of Java source code to properly track down. Fortunately, they only show up in rarely-used resolver modules (I've never used them myself).

The other problem is that as of Java 6, the java.sql.ResultSet interface has a series of new methods on it. I'd rather we didn't, but we implement this interface with one of our classes. This is a holdover from a time when we tried to implement JDBC. While it mostly worked, there was a fundamental disconnect with the metadata requirements of JDBC, so we eventually abandoned the interface. However, the internal implementation of it remains.

Since we don't use any of the new methods, it is a trivial matter to implement them with empty stubs. Eclipse does this with a couple of clicks, so it was very easy to do. Once that was done, the project compiled fine, and I could get on with tracking down the failures and errors in the tests, the causes of which I've described above.
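
For the curious, the stubs are about as dull as you'd expect. This isn't the actual Mulgara code, but the flavour is something like the following, showing a few of the methods Java 6 added (the class is left abstract here so the other couple of hundred methods can be ignored):

    import java.sql.ResultSet;
    import java.sql.SQLException;

    // Fragment of a ResultSet implementation, stubbing some of the
    // methods that were added to the interface in Java 6.
    public abstract class StubbedResultSet implements ResultSet {

        public boolean isClosed() throws SQLException {
            throw new SQLException("Not supported");
        }

        public int getHoldability() throws SQLException {
            throw new SQLException("Not supported");
        }

        public <T> T unwrap(Class<T> iface) throws SQLException {
            throw new SQLException("Not supported");
        }

        public boolean isWrapperFor(Class<?> iface) throws SQLException {
            return false;
        }
    }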

All this was going well, until someone pointed out that there were some issues under Windows. After spending some time getting the OS up and running again, I quickly found that the class implementing ResultSet was missing some methods. How could this be? It had all the methods on OS X.

The simple answer is to run javap on the java.sql.ResultSet interface, and compare the results. Sure enough, on Windows (and Linux) the output contains 14 entries not found in OS X!

WTF?!?

This is easy enough to fix. Implementing the methods with stubs will make it work on Linux and Windows, and will have no effect on OS X. But why the difference? This meets my definition of broken.

Saturday, August 04, 2007

OWL Inexpertise

One of my concerns about Talking to Talis yesterday (interesting pun between a verb and a noun there) was in making criticisms of some of the people working on OWL, when I'm really not enough of an expert to make such a call.

I expressed concern over the "logicians" who have designed OWL as being out of touch with the practical concerns of the developers who have to use it. While I still believe there is a basis for such an accusation, it glosses over the very real need for a solid mathematical foundation for OWL, and it is also disrespectful to several people in the field whom I respect.

Knowing and understanding exactly what a language is capable of is vital to its development. Otherwise, it is very easy to introduce features that conflict, or that don't make sense in certain applications. Conflicting or vague definitions may work in human language, but they are not appropriate when developing systems with the precision that computers require. I have to work hard to reach the necessary understanding of description logic systems, which is why I respect people like Ian Horrocks (or Pat Hayes, or Bijan Parsia; the list goes on...) for whom it all seems to come naturally. Without their work, we wouldn't know exactly what all the consequences of OWL are, meaning that OWL would be useless for reasoning, or for describing much of anything at all.

However, coming from a perspective of "correctness" and "tractability", there is a strong desire in this community to keep everything within the domain of OWL-DL (the decidable variant of OWL). Any constructs which fall outside of OWL-DL (and into OWL Full) are often dismissed. Everyone building systems to perform reasoning on OWL seems to limit their domain to OWL-DL or less. There appears to be an implicit argument that since calculations over OWL Full cannot be guaranteed to complete, there is no point in doing them. Use of many constructs is therefore discouraged, on the basis that they are OWL Full syntax.

While this makes sense from a model-theoretic point of view, pragmatically it doesn't work. Turing-complete languages offer no guarantee of termination either (it's easy to write an infinite loop), and yet no one suggests that they are unimportant! Besides, Gödel taught us that completeness and decidability are not all they're cracked up to be.

A practical example of an OWL Full construct is in trying to map a set of RDBMS tables into OWL. It is very common for such tables to be "keyed" on a single field, often a numeric identifier, but sometimes text (like a student number, or SSN). Even if these fields are not the primary key of the table, a good mapping into a language like OWL will need to capture this property of the field.

The appropriate mapping for a key field on a record is to mark that field as a property of type owl:InverseFunctionalProperty. However, it is not legal to use this on a property whose values are numbers or strings (RDF literals) in anything less than OWL Full.
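
In Jena terms the mapping is a one-liner, and it is exactly that line which pushes the ontology out of OWL-DL and into OWL Full. The namespace here is a placeholder, and the package names are those of the Jena releases of the time:

    import com.hp.hpl.jena.ontology.DatatypeProperty;
    import com.hp.hpl.jena.ontology.OntModel;
    import com.hp.hpl.jena.rdf.model.ModelFactory;
    import com.hp.hpl.jena.vocabulary.OWL;
    import com.hp.hpl.jena.vocabulary.RDF;

    public class KeyFieldMapping {
        public static void main(String[] args) {
            String ns = "http://example.com/university#";   // placeholder ontology namespace
            OntModel model = ModelFactory.createOntologyModel();

            // A student number is a datatype property (its values are literals)...
            DatatypeProperty studentNumber = model.createDatatypeProperty(ns + "studentNumber");

            // ...and declaring it inverse functional (one value identifies one student)
            // is the step that takes the ontology from OWL-DL into OWL Full.
            studentNumber.addProperty(RDF.type, OWL.InverseFunctionalProperty);

            model.write(System.out, "RDF/XML-ABBREV");
        }
    }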

There are workarounds to stay within OWL-DL, but this is just one of many common use cases where workarounds are required to stay within its confines. While it is theoretically possible that using owl:InverseFunctionalProperty on a literal-valued property could cause problems for a reasoner, most use cases will not. It would seem safe in many systems to permit this, with an understanding of the dangers involved. Instead, the unwillingness of the experts to let people work with OWL Full has placed onerous restrictions on many developers. This in turn leads to them simply not bothering with OWL, or going looking for alternatives.

I can appreciate the need to prevent people from shooting themselves in the foot. On the other hand, preventing someone from taking aim and firing at their feet often leads to other difficulties, encouraging them to just remove the safety altogether.

It's an argument with two sides. There may well be many logicians out there who agree that a practical approach is required for developers, in order to make OWL more accessible to them. However, my own observations have not seen any concessions made on this point.
There. It reads much better here than the bald assertion I made for Talis. :-)

Friday, August 03, 2007

Nodalities

Last night Luc was determined to keep me up, and he did a pretty good job of it. This happens frequently enough that it shouldn't be worth a mention in this blog, except that today I had agreed to speak with Paul Miller from Talis, for the Talking with Talis podcast.

So that I'd be compos mentis, I resorted to a little more coffee than usual (I typically have one in the morning, and sometimes one in the afternoon; today I had two in the morning). While this had the desired effect of alertness, the ensuing pleonastic babble was a little unfortunate. Consequently, I feel like I've embarrassed myself eight ways to Sunday, though Paul has been kind enough to say that I did just fine.

I was caught a little off guard by questions asking me to describe RDFS and OWL. Rather than giving a brief description, as I ought to have, I digressed much too far into inane examples. I also said a few things which I thought at the time were kind of wrong (by which I mean that I was close, but did not hit the mark), but with the conversation being recorded it felt too awkward to go back and correct myself, particularly when I'd have needed a little time to think in order to get it right.

Perhaps more frustratingly, my needless digressions and inaccurate descriptions stole time that could have been used to talk about things I believe to be more interesting. In particular, I'm thinking of the Open Source process, and how it relates to a project like Mulgara. David was able to give a lot of the history behind the project, but as an architect and developer I have a different perspective that I think also has some value. I also think that open source projects are pivotal in the development of "software as a commodity", a notion that deserves serious consideration at the moment. I touched on it briefly, but I also ought to have elaborated on how open source commodity software is really needed as the fundamental infrastructure for enabling the semantic web, and hence the need for projects like Mulgara, Sesame and Jena.

But despite my missed opportunity to discuss these things today, I should not consider Talis's podcast to be a forum for expressing my own agenda. If I have a real desire to say these things, then I should be using my own forum, and that is this blog.

As always, time is against me, but I'll mention a few of these things, and perhaps I can have time to revisit the others in the coming weeks.

People

I should also have mentioned some of the other names involved in Mulgara, from both the past and the present. Fortunately, David already mentioned some of them (myself included), but since this is my own blog I can go into a little more detail. Whether paid or not, these people all put a great deal of commitment into making this a project with a lot to offer the community. However, since there are so many, I'll just stick to those who have some kind of ongoing connection to the project:
  • David Wood, who decided we could write Mulgara, made enormous sacrifices to pay for it out of his own pocket... and THEN made it open source! His ongoing contributions to Mulgara are still valuable.
  • David Makepeace (a mentor early in my career, who I was fortunate to work with again at Tucana) who was the real genius behind the most complex parts of the system.
  • Tate Jones, who kept everyone focused on what we needed to do.
  • Simon Raboczi who drove us to use the standards, and ensured the underlying mathematical model was correct.
  • Andrew Newman who knew everything there was to know in the semantic web community, and aside from writing important code, he was the one who wouldn't stop asking when we could overcome the commercial concerns and make the system Open Source.
  • Andrae Muys, the last person to join the inner cabal, and the guy who restructured it all for greater modularity, and correctness. This contribution alone cannot be overstated, but since Tucana closed shop he has remained the most committed developer on the project.
  • Collectively, the guys at Topaz, who have provided more support than anyone else since Tucana closed.
These were just some of the guys who made the project worthwhile, and Tucana a great place to work.

Sorry to those I didn't mention.

Even if I move past Mulgara and onto a new type of RDF store, the open source nature of Mulgara will allow me to bring a lot of that intelligence and know-how forward with me. For this reason alone, I think the Open Source process deserves some discussion.

Architecture

Back when Mulgara (or TKS/Kowari) was first developed, it was interesting to see the schemas being proposed. Looking at them, there was a clear influence from the underlying Description Logic that RDF was meant to represent. However, I was not aware of description logics back then, and instead knew RDF only as a graph. Incidentally, I only considered RDF/XML to be a serialization of these graphs (a perspective that has been useful over the years), so knowledge of it wasn't really relevant to the work I was doing (though I did learn it).

Since I was graph focused, and not logic focused, I didn't perceive predicates as being distinctly different from subjects or objects (especially since it is possible to make statements where predicates appear as subjects). Also, while "objects" differ from "subjects" in that they can include literal values, this seemed to be a minor annotation rather than a fundamental difference. Consequently, while considering the "triple" of subject, predicate and object, I started wondering about the significance of their ordering. This led me to drawing them in a triangle, much as you can see in the RDF Icon.


This then led naturally to the three-way index we used for the first few months of the project, which is still the basis of our thinking today. Of course, in a commercial environment we were acutely aware of the need for security, and it wasn't long before we introduced a fourth element to the mix. Initially this was supposed to provide individualized security for each statement (a requested feature), but it didn't take long to realize that we wanted to group statements together, and that security should be applied to groups of statements rather than to each individual statement (else security administration would be far too onerous, regardless of who thought the feature would be a good idea). So the fourth element became our "model", though a little after that the name "graph" became more appropriate.

Moving to four nodes in a statement led to an interesting discussion, where we tried to determine the minimum number of indices required, based on our previous three-way design. This is what led to the six indices that Mulgara uses today. I explored this in much more depth some time later in this blog, with a couple of entries back in 2004. In fact, it is this very structure that allows us to do very fast querying regardless of complexity (and when we don't, it just means the query optimizer needs re-work, not our data structures). More importantly, for my recent purposes (and my thesis), this allows for an interesting re-interpretation of the RETE algorithm for fast rule evaluation. This is then our basis for performing OWL inferences using rules.
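
To give a flavour of why six orderings are enough: for any combination of bound and unbound positions in a quad, at least one of the orderings puts all of the bound positions first, so a single range scan over that index answers the pattern. The sketch below uses one such set of six orderings to illustrate the idea; it isn't necessarily the exact set Mulgara uses.

    import java.util.Arrays;
    import java.util.EnumSet;
    import java.util.Set;

    public class QuadIndexChooser {

        enum Pos { SUBJECT, PREDICATE, OBJECT, GRAPH }

        // One set of six orderings that covers every bound/unbound combination.
        static final Pos[][] ORDERINGS = {
            { Pos.SUBJECT, Pos.PREDICATE, Pos.OBJECT, Pos.GRAPH },
            { Pos.PREDICATE, Pos.OBJECT, Pos.SUBJECT, Pos.GRAPH },
            { Pos.OBJECT, Pos.SUBJECT, Pos.PREDICATE, Pos.GRAPH },
            { Pos.GRAPH, Pos.SUBJECT, Pos.PREDICATE, Pos.OBJECT },
            { Pos.GRAPH, Pos.PREDICATE, Pos.OBJECT, Pos.SUBJECT },
            { Pos.GRAPH, Pos.OBJECT, Pos.SUBJECT, Pos.PREDICATE },
        };

        /** Find an ordering whose leading positions are exactly the bound ones. */
        static Pos[] chooseIndex(Set<Pos> bound) {
            for (Pos[] ordering : ORDERINGS) {
                Set<Pos> prefix = EnumSet.noneOf(Pos.class);
                for (int i = 0; i < bound.size(); i++) prefix.add(ordering[i]);
                if (prefix.equals(bound)) return ordering;
            }
            throw new AssertionError("unreachable: the six orderings cover all 16 patterns");
        }

        public static void main(String[] args) {
            // e.g. a pattern with predicate and graph bound uses the GRAPH,PREDICATE,... index.
            System.out.println(Arrays.toString(
                    chooseIndex(EnumSet.of(Pos.PREDICATE, Pos.GRAPH))));
        }
    }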

See? It's all tied together, from the lowest conceptual levels to the highest!

I freely acknowledge that OWL can imply much more than can be determined with rules (actually, that's not strictly true, as an approach using magic sets to temporarily generate possible predicates can also get to the harder answers - but this is not practical). To get to these other answers, the appropriate mechanism is a tableaux reasoner (such as Pellet). However, from experience I believe that most of what people need (and want) is covered quite well by a complete set of rule-based inferences. This was reinforced for me when KAON2 came up with exactly the same approach (though I confess to having been influenced by KAON2 before it was released, in that I was already citing papers which formed the basis of that project).

All the same, while I think Rules will work for most situations, having a tableaux reasoner to fall back on will give Mulgara a more complete feature set. Hence, my desire to integrate Pellet (originally from MIND Lab).

I have yet to look at the internals of Pellet, to see how it stores and accesses its data. I'd love to think that I could use an indexing scheme to help it to scale out over large data sets like rules can, but my (limited) knowledge of the tableaux algorithm says that this is not likely.

Open Source

There are several reasons for liking Pellet over the other available reasoners. The first is that it is under a license that is compatible with Mulgara. The second is that I saw the ontology debugger demonstrated at MIND Lab a couple of years ago, and have been smitten ever since. Third, the work that Christian Halaschek-Wiener presented at SemTech on OWL Syndication convinced me that Pellet is really doing the right thing for scalability on TBox reasoning.

Finally, Pellet is open source. Yes, that seems to be repeating my first point about licenses, but this time I have a different emphasis. The first point was about legal compatibility of the projects. The point I want to make here is that reasoning like this is something that everyone should be capable of doing, in the same way that storing large amounts of data should be something that everyone can do. Open source projects not only make this possible, but if the software is lacking in some way, then it can be debugged and/or expanded to create something more functional. Then the license point comes back again, allowing third party integration and collaboration. This lets people build something on top of all these open source commodities that is a gestalt of all the components. Open source projects enable this, allowing the community to rapidly create things that are conceptually far beyond the component parts.

From experience, I've seen the same process in the commercial and open source worlds. In the commercial world, the growth is extraordinarily slow. This is because of limited budgets, and limited communication between those who can make these things happen. Ideas are duplicated between companies, and resources are spent trying to make one superior to all the others, sometimes ignoring customers' needs (and often trying to tell the customer what they need).

In the open source world, everyone is free to borrow from everyone else's ideas (within license compatibility - a possible bugbear), to expand on them, or to use them as a part of a greater whole. Budgets are less of an issue, as projects have a variety of resources available to them, such as contributing sponsors, and hobbyists. Projects focus on the features that clients want, because often the client is contributing to the development team.

Consider MS-SQL and Oracle. Both are very powerful databases, which have competed now for many years. In a market dominated by these players, it is inconceivable that a new database could rival them. Yet MySQL has been steadily gaining ground for many years, first as a niche product for specialized use, and then more and more as a fully functional server. It still has a way to go to scale up to high end needs as the commercial systems do, but this is a conceivable target for MySQL. In the meantime, I would guess that there are more MySQL installations in the world than almost any other RDBMS available today. Importantly, it got here in a fraction of the time that it took the commercial players.

Semantic Web software has a long way to go before reaching the maturity of products like those I just mentioned. But history has shown us that the way forward is to make the infrastructural software as open and collaborative as possible, enabling everyone to develop at a much higher level, without being concerned about the layers below them. Higher level development has happened with many layers of computing in the past (compilers, OO toolkits, spreadsheets, databases, languages for server-side and client-side web scripting), and the cheaper and more open the lower levels were, the more rapid and functional the high-level development became.

It is at this top level that we can provide real value for the world at large, and not just the IT community. It is this that should be driving our development. We should not be striving to make computing better. Computers are just tools. We should be striving to make the world better.

Sounds pretty lofty, I know. Blame the caffeine from this morning wearing off and leaving me feeling light-headed. But there has to be some point to it all: this is all too much work for it to be mere navel gazing that only enables IT. IT has to enable people outside its own field, or else there is no reason for it to exist, and we will all get caught in another .com bubble-burst.