Sunday, December 02, 2007


This evening I heard some tragic news, involving Chimezie Ogbuji.

I don't really know Chime, though I've met his brother Uche a few times (to quote a friend, "A very stylish man"). Anyone who follows the semantic web mailing lists will have seen numerous messages from both men.

I just want to express my deepest condolences to the Ogbuji family. I also sincerely hope that the youngest will recover.

Tuesday, November 13, 2007


After writing about complex behavior emerging from networks of simple non-linear elements this morning, I read Slashdot this evening to see a story on just that topic. Strange.

Other than that I worked to get the new interpreter system working against the existing test suite. It's mostly there, but there are still a few bugs left.

Ironically, the transaction bug of the day was occurring in a section of code where I was doing a lot of testing to see exactly what command had been issued, and responding accordingly. However, I have an AST that works for me, and after staring at it for 10 minutes I suddenly realized that all the problems would go away if I used the same code for each type of command. Consequently, 12 lines turned into 2, and all the bugs went away. Thank goodness for being able to call into a clean interface design.

This isn't the first time that I've fixed a problem by removing code. Sometimes I wonder if real software engineering is about removing lines of code rather than inserting them. Pretty much destroys the "Lines of Code" metric that some companies like to employ (word to the wise - don't work for these companies).


I'm unable to speak above a hoarse whisper today (a fact that my two year old son is delighting in) due to some kind of virus. I'm over the worst of it, but I'm a little lightheaded, so if this post rambles more than usual you'll know why. :-)

OWL... Again

I'm much further into the 2nd Edition of the Description Logic Handbook (and to my deep satisfaction, wading through stuff I already know), and can see some interesting stuff coming up in the next few chapters. I'm also learning some interesting points in the discussions that go on in the OWL developers list. And it reoccurs to me that something is wrong here.

If it takes professional developers, and even professional academics, this long to get a real handle on how OWL works, then how on earth can we expect the rest of the world to get it right? The idea of the Semantic Web is to link data from lots of different sources, but that implies we need lots of people out there who can structure that data in a way that will allow the linking to be consistent (and I'm referring to the logical meaning of "consistent").

Conversely, in order to create a semantic web, we need precise descriptions of things, and that implies Description Logic. The inventors of OWL were not trying to be obtuse - indeed, I think they desired the opposite effect. However, years of Description Logic research has led to an understanding that seemingly insignificant details in a language can have dramatic effects. So OWL had to be carefully built and constrained so as to prevent the future semantic web from shooting itself in the foot. But this leads directly to a language of horrible complexity, with subtle rules that occasionally catch even the experts off guard.

So what's the solution? Well for the moment, the industry is doing what it always does. It muddles through using what expertise the developer community has, and incrementally drags itself up to greater consistency (hopefully) and complexity (certainly). It's hardly ideal, but then, it's no different to what usually happens with software. This is why Windows used to blue-screen all the time, and why I'm unable to run Windows XP in Parallels without Leopard losing the ability to start new programs or kill off old ones (I'm really hoping Apple fixes that one!). It leaves me concerned about the wisdom of this approach.

On the other hand, there seems to be little alternative to this kind of design if we want to design for semantics. OWL is simply a representation of an underlying mathematics that is fundamental to what we are trying to represent. But if it turns out to be too complex to design this stuff as a community (I believe individuals are capable of it, but not enough of them to make a "web" out of the semantics), then that means we can't really design this at all. But we know that semantics are possible, since our brains deal with them, and our brains are little more (ha ha) than enormous networks of simple, non-linear elements. There are general guidelines to the structure (giving functional areas like the prefrontal cortex for higher thought, and the amygdala for initiating emotional responses), but the details can vary dramatically, even between identical twins, and as we grow and learn the network adapts and modifies itself. In other words, build a large enough network of simple constructs, following some general design guidelines, and the details need not be designed at all.

Despite the randomness (neural network theory even demonstrated that randomness is essential), and despite all the lack of detailed "design", the brain is the only instrument we currently have that can process semantics. Almost all of its processing capabilities come about as an emergent property simply from building up a large enough network of interacting elements. So maybe the idea of the semantic web isn't that far fetched after all. We just need to get things mostly right at a local level, and when we link it all together something special will emerge. I don't think this is what the proponents of the semantic web had in mind when they first set out, but it might be what we end up with.

We are already seeing emergent properties coming out of networks that hit some critical mass. This is the effect behind Web 2.0 - whatever that means. And that is the point here. The label "Web 2.0" is a recognition of something that has "emerged" from these networks when connected with the right technologies. Because it wasn't explicitly designed, it's hard to pinpoint exactly what it is, but most people in the industry agree that it's there - even if they don't agree on where its boundaries lie.

Having semantics emerge rather than being designed in would seem to be a natural extension of what we're seeing now, especially when we are getting partial semantics in small systems already (courtesy of such technologies as OWL). But is there enough structure, and is it of the correct type for true semantics to finally emerge from the network?

OK, now I'm just going off on a wild tangent. At least I didn't look at the whole OWL problem and give up on it today. Perhaps our partial and not-quite-correct systems will have a part to play in a larger network.

Thursday, October 25, 2007

The Road to SPARQL

A long time ago, an interface was built for talking to Mulgara. At the time a query language was needed. We did implement RDQL from Jena in some earlier versions (that code is still in there), but quickly realized that we wanted more, and slightly different, functionality than this language offered, and so TQL was born. Initially it was envisioned that there would be a text version for direct interaction with a user (interactive TQL, or iTQL), and an equivalent programmatic structure more appropriate for computers, written in XML (XML TQL, or xTQL). The latter was never developed, but this was the start of iTQL. (The lack of xTQL is the reason I've been advocating that we just call it TQL.)

But a language cannot exist in a vacuum. Queries have to go somewhere. This led to the development of the ItqlInterpreter class, and the associated developer interface, ItqlInterpreterBean. These classes accept queries in the form of strings, and give back the appropriate response.

So far so good, but here is where it comes unstuck. For some reason, someone decided that because the URIs (really, URLs) of Mulgara's graphs describe the server that the query should be sent to, then the interpreter should perform the dispatch as soon as it sees the server in the query string. This led to ItqlInterpreter becoming a horrible mess, combining grammar parsing and automatic remote session management.

I've seen this mess many times, and much as I'd like to have fixed it, the effort was beyond my limited resources.

But now Mulgara needs to support SPARQL. While SPARQL is limited in its functionality, and imposes inefficiencies when interpreted literally (consider using OPTIONAL on a variable and then FILTERing on whether it is bound), the fact that it is a standard makes it extremely valuable both to the community and to projects like Mulgara.

SPARQL has its own communications protocol for the end user, but internally it makes sense for us to continue using our own systems, especially when we need to continue maintaining TQL compatibility. What we'd like to do then, is to create an interface like ItqlInterpreter, which can parse SPARQL queries, and use the same session and connection code as the existing system. This means we can re-use the entire system between the two languages, with only the interpreter class differing, depending on the query language you want to use.

Inderbir has built a query parser for me, so all we'd need to do would be to build a query AST with this parser, and we have the new SPARQLInterpreter class. There are a couple of missing features (like filter and optional support), but these are entirely compatible with the code we have now, and easily feasible extensions.

However, in order to make this happen, ItqlInterpreter finally had to be dragged (kicking and screaming) into the 21st century, by splitting up its functionality between parsing and connection management. This has been my personal project over the last few months, and I'm nearly there.


One of the things that I never liked about Mulgara is that I couldn't create connections to it like you do with databases like MySQL. There is the possibility of doing this with Session objects, but these were never intended as a user interface, and there are some operations that are not handled easily with sessions.

My solution was to create a Connection class. The idea is to create a connection to a server, and to issue queries on this connection. This allows some important functionality. To start with, it permits client control over their connections, such as connection pooling, and multiple parallel connections. It also enables the user to send queries to any server, regardless of the URIs described in the query. This is important for SPARQL, as it may not include a model name in the query, and so the server has to be established when forming the connection. This is also how the SPARQL network protocol works.
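As a sketch of the idea (in Ruby, since that's the language I've been playing with lately; the class and method names here are purely hypothetical, not Mulgara's actual Java API), a connection-oriented client might look like this:

```ruby
# Hypothetical sketch only: names and behavior are illustrative,
# not Mulgara's real (Java) Connection implementation.
class Connection
  def initialize(server_uri)
    # The target server is fixed when the connection is made,
    # rather than being inferred from graph URIs inside each query.
    @server_uri = server_uri
  end

  def query(tql)
    # A real implementation would dispatch over the wire and parse the
    # response; here we just describe what would be sent.
    { server: @server_uri, query: tql }
  end

  def close
    # release any pooled resources
  end
end

conn = Connection.new("rmi://example.org/server1")
result = conn.query("select $s $p $o from <rmi://example.org/server1#g> where $s $p $o;")
conn.close
```

The important point is that the same connection can carry any query, regardless of the graph URIs it mentions, which is exactly the shape the SPARQL protocol expects.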

Sending queries to any server was not possible until recently, as a server always presumed that it held the graphs being queried locally. However, serendipity lent a hand here, as I created the DistributedQuery resolver just a few months ago, enabling this functionality.

As an interesting aside, when I was creating the Connection code I discovered an unused class called Connection. This was written by Tom, and the accompanying documents explained that this was going to be used to clean up ItqlInterpreter, deprecating the old code. It was never completed, but it looks like I wasn't the only one who decided to fix things this way. It's a shame Tom didn't make further progress, or else I could have merged my effort in with his (saving myself some time).


So I don't break any existing Mulgara clients, my goal has been to provide 100% compatibility with ItqlInterpreterBean. To accomplish this, I've created a new AutoInterpreter class, which internally delegates query parsing to the real interpreter. The resulting AST is then queried for the desired server, and a connection is made, using caching wherever possible. Once this is set up, the query can be sent across the connection.
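The shape of that delegation can be sketched like this (in Ruby, with made-up names; the real code is Java inside Mulgara, and the parser, AST, and connection objects here are all stand-ins):

```ruby
# Illustrative sketch of the delegate-and-cache pattern described above.
FakeAst = Struct.new(:server_uri, :query)

class AutoInterpreter
  def initialize(parser)
    @parser = parser        # the real interpreter: query string -> AST
    @connections = {}       # server URI -> cached connection
  end

  def execute(query_string)
    ast = @parser.call(query_string)                   # 1. delegate parsing
    server = ast.server_uri                            # 2. ask the AST for its server
    conn = (@connections[server] ||= connect(server))  # 3. reuse a cached connection
    conn.call(ast)                                     # 4. send over the connection
  end

  private

  def connect(server)
    # stand-in for establishing a session with the remote server
    ->(ast) { "#{server} <- #{ast.query}" }
  end
end

parser = ->(s) { FakeAst.new("rmi://example.org/server1", s) }
auto = AutoInterpreter.new(parser)
auto.execute("select $s $p $o from <rmi://example.org/server1#g> where $s $p $o;")
```

The same structure works for either query language: only the parser handed to the interpreter changes.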

This took some work, but it now appears to mostly work. I initially had a few bugs where I forgot certain cases in the TQL syntax, such as backing up to the local client rather than the default of the remote server. But I have overcome many of these now.


The main problem has been where ItqlInterpreterBean allowed callers to set specific sessions, or to invoke operations directly on those sessions. I think I'm emulating most of this behavior correctly, but it's taken me a while to track all the cases down.

I still have a number of failures and errors in my tests, but I'm working through them all quickly, so I'm pretty happy about the progress. Each time I fix a single bug, the number of problems drops by a dozen or more. The latest one I found was where the session factory is being directly invoked for graphs with a "file:" scheme in the URI. There is supposed to be a fallback to another session factory that knows how to handle protocols that aren't rmi, beep, or local, but it seems that I've missed it. It shouldn't be too hard to find in the morning.

One bug that has me concerned looks like some kind of resource leak. The test involved creates an ItqlInterpreterBean 1000 times. On each iteration a query is invoked on the object, and then the object is closed before the loop is iterated again. For some reason this is consistently failing at about the 631st iteration. I've added in some more logging, so I'm hoping to see the specific exception the next time I go through the tests.

One thing that caught me out a few times was the set of references to the original ItqlInterpreter, ItqlSession and ItqlSessionUI classes. So yesterday I removed all references to these classes. This necessitated a cleanup of the various Ant scripts which defined them as entry points, but everything appears to work correctly now, which gives me hope that I got it all. The new code is now called TqlInterpreter, TqlSession and TqlSessionUI. While the names are similar, and the functionality is the same, most of it was re-written from the ground up. This gave me a more intimate view of the way these classes were built, leading to a few surprises.

One of the things the old UI code used to do was to block on reading a pipe, while simultaneously handling UI events. Only this pipe was never set to anything! It was totally dead code, but would never have been caught by a code usage analyzer, as it got run all the time (it could just never do anything). I decided to address this by having it read from, and process, the standard input of the process (I suspect this was the initial intent, but I'm not sure). I don't know how useful it is, but it's sort of cute, as I can now send queries to the standard input while the UI is running, and not just paste them into the UI.
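The pattern, as I understand it, looks roughly like this in Ruby (the original is Java UI code; this is only a sketch of the idea, using a pipe to stand in for standard input): a background thread blocks reading an input stream and queues each line, while the foreground loop stays free to handle other events.

```ruby
# A background thread blocks on an input stream and queues each line
# for the foreground ("UI") loop to process at its leisure.
def start_reader(io, queue)
  Thread.new do
    io.each_line { |line| queue << line.chomp }
    queue << :eof
  end
end

queue = Queue.new
reader, writer = IO.pipe      # the pipe stands in for standard input here
start_reader(reader, queue)

writer.puts "select $s $p $o from <g> where $s $p $o;"
writer.close

queries = []
while (msg = queue.pop) != :eof
  queries << msg              # the "UI" side sees queries as they arrive
end
```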

I've added a few little features like this as I've progressed, though truth be told I can't remember them all! :-)

The other major thing I've been doing has been to fix up the bad code formatting that was imposed at some point in 2005, and to add generics. Sometimes this has proven to be difficult, but in the end it's worth it. It's not such a big issue with already working code, but generics make updating significantly easier, both by documenting what goes in and out of collections, and by doing some checking on the types being used. Unfortunately, there are some strange structures that make generics difficult (trees built from maps whose values are also maps), so some of this work was time consuming. On the plus side, it's much easier to see what that code is now doing.

I hope to be through this set of fixes by the end of the week, so I can get a preliminary version of SPARQL going by next week. That will then let me start on those features of the SPARQL AST that we don't yet support.

Web Service Descriptions

Whenever I design ontologies and the tools to use them, I find myself thinking that I should somehow be describing more, and having the computer do more of the work for me. Currently OWL is about describing general structure and relationships, but I keep feeling like ontologies should describe behavior as well.

I'm not advocating the inclusion of an imperative style of operation description here. If that were the case, then we'd just be building programming languages in RDF (something I've been thinking of for a while - after all, it seems natural to represent ASTs or List structure in RDF). I'm thinking more along the lines of OCL for UML. This is a far more declarative approach, and describes what things are, rather than how they do it. Like we were taught in first-year programming, OCL is all about describing the pre-conditions and post-conditions of functions.
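To make the pre/post-condition idea concrete, here is a toy sketch in Ruby (the `contract` helper and the withdraw example are my own invention for illustration, not OCL syntax or any existing library): you declare what must hold before and after a call, rather than how the call achieves it.

```ruby
# A minimal design-by-contract sketch: wrap a function with declarative
# pre- and post-conditions, in the spirit of OCL's constraints.
def contract(fn, pre, post)
  lambda do |*args|
    raise "precondition failed" unless pre.call(*args)
    result = fn.call(*args)
    raise "postcondition failed" unless post.call(result, *args)
    result
  end
end

# A hypothetical withdraw operation on an account balance.
withdraw = lambda { |balance, amount| balance - amount }

safe_withdraw = contract(withdraw,
  lambda { |balance, amount| amount > 0 && amount <= balance },
  lambda { |result, balance, amount| result == balance - amount && result >= 0 })

safe_withdraw.call(100, 30)    # succeeds: conditions hold
```

The conditions describe what the operation means; the lambda in the middle could be swapped for any implementation that satisfies them.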

The main problem I have with OCL is that it looks far too much like a programming language, which is the very thing I'd like to avoid. All the same, it's on the right track. It would be interesting to see something like it in OWL, though OWL still has a long way to go before it is ready for this. The other problem is that OCL is about constraining a description, which seems to be at odds with the open-world model in OWL.

Still, the world appears to be ready for OWL to have something like this, even if OWL isn't ready to include it. Maybe I should be building an RDF/Lisp interpreter after all? :-)

The reason for me thinking along these lines is that I'd like to be able to describe web services in a truly interoperable way. Today we have WSDL and OWL-S, which are very good at describing the names of services and how to interface to them, but do little to describe what those services do. If we could really describe a service correctly, then any client could connect to an unknown server and discover exactly what that server was capable of and how to talk to it, all without user interaction, and without any prior understanding of what that server could do.

Ultimately, I think this is what we are striving for with the Semantic Web. It is a long way beyond us today, but the evolution of hardware and software in computers over the last 60 years has taught us that amazing things are achievable, so long as we take it one step at a time.

To a limited extent, there are already systems like BioMOBY, which use ontologies to query about services, and can work out for themselves how to connect to remote services and connect them together to create entirely new functions. There are still assumptions made about what kind of data is there and how to talk to it, but it includes a level of automation that is astounding for anyone familiar with WSDL standards.

When I last saw BioMOBY nearly 3 years ago, they were using their own RDF-like data structures, and were considering moving to RDF to gain the benefit of using public standards. I should check them out again, and see where they went with that. They certainly had some great ideas that I'd like to see implemented in a more general context.

Monday, October 22, 2007

Theory and Practice

The development of OWL comes from an extensive history of description logic research. This gives it a solid theoretical foundation for developing systems on, but that still doesn't make it practical.

There are numerous practical systems out there which do not have a solid theoretical foundation (and this can sometimes bite you when you really try to push a system), but can still be very useful for real world applications. After all, Gödel showed us that any sufficiently powerful formal system must be either incomplete or inconsistent (maybe that explains why Quantum Physics and General Relativity cannot both be right. Since Gödel's theorem requires that there be an inconsistency somewhere in our description of the universe, then maybe that's it). :-)

If theoretically solid systems are not an absolute requirement for practical applications, then are they really needed? I'd like to think so, but I don't have any proof of this. In fact, the opposite seems to be true. Systems with obvious technical flaws become successful, while those with a good theoretical underpinning languish. There are many reasons for failure, with social and marketing being among the more common. Ironically, the problems can also be technical.

John Sowa's essay on Fads and Fallacies about Logic describes how logic was originally derived from trying to formalize statements made in natural language. He also mentions that a common complaint made about modern logic is its unreadability. But this innocuous statement doesn't do the evolution justice. Consider this example in classical logic:
  • All men are mortal.
  • Socrates is a man.
  • Therefore: Socrates is mortal.
Now look at the following definition of modal logic:

Given a set of propositional letters p1,p2,..., the set of formulae of the modal logic K is the smallest set that:
  • contains p1,p2,...,
  • is closed under Boolean connectives, ∧, ∨ and ¬, and
  • if it contains Φ, then it also contains □Φ and ◇Φ.
The semantics of modal formulae is given by Kripke structures M = ⟨S, π, K⟩, where S is a set of states, π is a projection of propositional letters to sets of states, and K is the accessibility relation, a binary relation on the states S. Then, for a modal formula Φ and a state s ∈ S, the expression M, s ⊨ Φ is read as "Φ holds in M in state s". So:
  • M, s ⊨ pi       iff s ∈ π(pi)
  • M, s ⊨ Φ1 ∧ Φ2   iff M, s ⊨ Φ1 and M, s ⊨ Φ2
  • M, s ⊨ Φ1 ∨ Φ2   iff M, s ⊨ Φ1 or M, s ⊨ Φ2
  • M, s ⊨ ¬Φ       iff M, s ⊭ Φ
  • M, s ⊨ ◇Φ       iff there exists s' ∈ S with (s, s') ∈ K and M, s' ⊨ Φ
  • M, s ⊨ □Φ       iff for all s' ∈ S, if (s, s') ∈ K then M, s' ⊨ Φ

(Courtesy of The Description Logic Handbook. Sorry if it doesn't render properly for you... I tried! The Unicode character ⊨ is rendered in Safari, but not in Firefox on my desktop - though it shows up on my notebook.)

While several generations removed, it can still be hard to see how this formalism descended from classical logic. Modal logic is on a firm theoretical foundation, and it is apparent that the above is a precise description, yet it is not the sort of tool that professional programmers are ever likely to use. This is because the complexity of understanding the formalism is a significant barrier to entry.

We see this time and again. Functional programming is superior to imperative programming in many instances, and yet the barrier to entry is too high for many programmers to use it. Many of the elements that were considered fundamental to Object Oriented Programming are avoided or ignored by many programmers, and are not even available in some supposedly Object Oriented languages (for instance, Java does not invoke methods by passing messages to objects). And many logic formalisms are overlooked, or simply not sufficiently understood, for many of the applications for which they were intended.

After working with RDF, RDFS and OWL for some years now, I've started to come to the conclusion that these systems suffer from the same barrier-to-entry problems. It took me long enough to understand the complexities introduced by an open world model without a unique name assumption. Contrary to common assumption, RDFS domain and range are descriptive rather than prescriptive. And cardinality restrictions rarely create inconsistencies.

Part of the problem stems from the fact that non-unique names and an open world are a completely different set of assumptions from the paradigms that programmers have been trained to deal with. It takes a real shift in thinking to understand this. Also, computers are good at working with the data they have stored. Working with data that is not stored is more the domain of mathematics: a field that has been receiving less attention in the industry in recent years, particularly as professionals have moved away from "Computer Science" and into "Information Technology". Even those of us who know better still resort to the expediency of making many closed world assumptions when storing RDF data.
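The contrast can be made concrete with a toy Ruby illustration (my own, and deliberately simplistic): a closed-world lookup treats absence as falsehood, while an open-world one must distinguish "false" from "unknown".

```ruby
# Closed-world assumption: a fact not in the store is simply false.
def closed_world?(facts, name)
  facts.fetch(name, false)
end

# Open-world assumption: a fact not in the store is unknown, not false.
def open_world?(facts, name)
  facts.key?(name) ? facts[name] : :unknown
end

facts = { "Socrates" => true }   # "Socrates is mortal" has been asserted

closed_world?(facts, "Zeus")     # a closed-world store answers false
open_world?(facts, "Zeus")       # an open-world store can only say :unknown
```

Programmers are trained to write the first function; RDF and OWL demand the second.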

Giving RDF, RDFS, and OWL to the general world of programmers today seems like a recipe for implementations of varying correctness with little hope of interoperability - the very thing that these technologies were designed to enable.

However, RDF, RDFS and OWL were designed the way they are for very sound reasons. The internet is an open world. New information is being asserted all the time (and some information is being retracted, meaning that facts on the web are both temporal and non-monotonic, neither of which is dealt with by semantic web technologies, but let's deal with one problem at a time). There are often many different ways of referring to the same things (IP addresses and hostnames are two examples). URIs are the mechanism for identifying things on the internet, and while URIs may not be unique for a single resource, each URI does describe a single resource, and no other. All of these features were considered when RDF and OWL were developed, and the decisions made were good ones. Trying to build a system that caters to programmers' presumptions by ignoring these characteristics of the internet would be ignoring the world as it is.

So I'm left thinking that the foundations of RDF and OWL are correct, but somehow we have to present them in such a way that programmers don't shoot themselves in the foot with them. Sometimes I think I have some ideas, but it's easy to become disheartened.

Certainly, I believe that education is required. To some extent this has been successful, as I've seen significant improvement in developers' understanding (my own included) in the last few years. We also need to provide tools which help guide developers along the right path, even if that means restricting some of their functionality in some instances. These have started to come online, but we have a long way to go.

Overall, I believe in the vision of the semantic web, but the people who will make it happen are the people who will write software to use it. OWL seems to be an impediment to the understanding they require, and the tools for the task are still rudimentary. It leaves me wondering what can be done to help.

Monday, September 17, 2007


I've recently been making my way through the Pickaxe book, otherwise known as "Programming Ruby". I've been avoiding Ruby for a few years now, but I finally decided there was enough critical mass to make it worth my while.

Initially I thought little of Ruby, as it was just another scripting language. Despite the power and fast turnaround that these languages provide, in the late 90's and early 2000's they had always failed to deliver compelling performance. However, hardware performance has largely overcome that limitation. Even when it became apparent some years ago that performance was no longer an issue, Ruby was not a language with a lot of mindshare. The popular languages were always Perl for sysadmins and the Web, and Python for scientific applications, programs with GUIs, and some web sites. (I won't mention Tcl. Anyone who used that got what they deserved). These are generalizations, but they serve as a picture of the general landscape.

However, the importance of Ruby seemed to change with the advent of Ruby on Rails (RoR). While often criticized as not providing the scalability of the enterprise frameworks, it boasts a pragmatism that gets a lot of very compelling sites up and running in record time. This is the perfect example of the advantages in avoiding premature optimization. The fact that people can just make stuff work in Rails has been enough to see it expand into almost every Web 2.0 site I can think of. Just in case I wasn't paying attention, Sun recently decided this was worth looking at when they hired Thomas Enebo and Charles Nutter, two of the key developers of JRuby. Another indication (if I needed any) is all the interest in ActiveRDF, which is a framework for accessing RDF in a way that is compatible with the existing RoR APIs.

Now, everyone I know who uses RoR tells me that I don't need to know Ruby to use it, and I'm probably going to spend more time providing services (from Mulgara) than using it directly. That's not to say that I don't want to use RoR... it's just that I usually find myself elsewhere in the programming stack. Besides, I love working at lower levels. So I've been thinking that I should make the time to properly learn this language.

Actually, the thing that finally made me start learning Ruby was when I discovered that there is in-built support for lambdas. Well, why didn't anyone just say that before?


I'm still only partway through the book, but I'm enjoying it a lot. In my "spare time" I'm finding myself torn between reading more, and writing code in Ruby. This is completely ignoring the fact that I'm doing a big refactor in Mulgara (one reason is to improve SOAP support - allowing Ruby to work better), and that I've dabbled with the Talis Platform as well. Because of this I've been restricting myself to playing with language constructs, but I intend to start using it in earnest soon.

One of the things I've been wondering about is currying. I found a reference to doing this in Ruby, but the technique was limited to explicit currying of particular lambdas. It wasn't general, and didn't extend to methods.

So I had a look at doing it more generally, and learnt a little on the way.

It would be great if methods in Ruby could be completely compatible with lambdas, but they are separate object types. Initially I despaired of getting a type for a method, as every way I could think of referring to them would call the method instead. This should have clued me into the fact that methods are found by name (as in a string) or by "label". Fortunately, both methods and lambdas respond to the "call" message, meaning they can be used the same way. Instance based Method objects also need to be bound to an object in order to be called, which I also learnt through trial and error. But in the end I found the simple invocation for currying a method:
def curry(fn, p)
  lambda { |*args| fn.call(p, *args) }
end
This doesn't do any checking that the parameters will work with it, but it works if called correctly. The thing I like about it is that it lets you curry down arbitrary lambdas iteratively. For instance, I can take a lambda that adds 3 parameters, and curry it down to a lambda that has no parameters:
add_xyz = lambda { |x,y,z| x+y+z }
add_3yz = curry(add_xyz, 3)
add_35z = curry(add_3yz, 5)
add_357 = curry(add_35z, 7)

puts "add_xyz(3,5,7) = #{add_xyz[3,5,7]}"
puts "add_3yz(5,7) = #{add_3yz[5,7]}"
puts "add_35z(7) = #{add_35z[7]}"
puts "add_357() = #{add_357[]}"
All 4 invocations here return the same number (15).

It also works on methods:
class Multiplier
  def mult(x, y)
    x * y
  end
end

foo = Multiplier.new
double = curry(foo.method(:mult), 2)

puts "double(5) = #{double[5]}"
It's not as elegant as Haskell, but I'm pleased to see that it can be generalized. It gives me faith in the power of the language.
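For what it's worth, the same trick can be pushed further: applying the idea recursively turns an n-argument lambda into the fully curried, one-argument-at-a-time form that Haskell gives you for free. The curry_all helper here is my own extension of the curry function above (repeated so this snippet stands alone), not anything built into Ruby:

```ruby
# The curry from above, repeated so this snippet is self-contained.
def curry(fn, p)
  lambda { |*args| fn.call(p, *args) }
end

# Fully curry an n-argument lambda into a chain of 1-argument lambdas.
def curry_all(fn, arity)
  return fn if arity <= 1
  lambda { |x| curry_all(curry(fn, x), arity - 1) }
end

add_xyz = lambda { |x, y, z| x + y + z }
add_c = curry_all(add_xyz, 3)

add_c[3][5][7]   # same result as add_xyz[3,5,7]
```

The arity has to be passed in explicitly, since a curried chain can't tell from the lambda itself how many applications remain.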

Language Life

In recent weeks I've been having discussions with friends about where I see languages like Ruby and Java, and what we think the future may hold.

Ruby really impresses me with the eclectic approach it has taken to advanced techniques, without going overboard on multiple approaches like Perl did (Perl's infamous TMTOWTDI). The Open World model that it brings via dynamic class extension is also refreshing, and a welcome relief to those programming in Java. Ruby is also heavily OO (much more so than Java - and very much like Smalltalk) and with lambdas it permits a very functional style of programming, which has also been gaining in popularity recently. (For some reason people often think that functional and OO are at odds with each other. This is not so. Most functional languages make significant use of objects. Functional is the opposite of imperative.)

As the engine for Ruby improves (maybe even the JRuby engine), and computers become more capable, then a language like this may become the standard in just a few years. Ruby has been getting a PR boost from people heavily involved in the XP development community, and the buzz around RoR continues to grow.

Java, on the other hand, reminds me of C++ in the late 90's.

Around this time there was a huge community around C++. The ISO standard had been approved, and several large software houses were getting the final elements of the spec into their compilers. Developers were using it for everything from operating systems and embedded devices through to financial applications and GUI front ends. Many of the major GUI libraries were in C++ (MFC and Qt, among others). Textbooks on obscure template constructs were selling well. There was a strong market for good C++ developers.

By contrast, Java was a niche system. It had been publicly released in 1995, and had received quite a bit of criticism. Security flaws were found in the early "sandboxes". The GUIs that could be built were rudimentary. Performance was poor. The most compelling aspect of the system was the "demo" that Sun put together, called an "Applet", which allowed you to insert dynamic content into web pages, viewed in the browser they built to handle it. Ultimately, other systems were to overtake this one feature that was generating interest.

But I'm sure that if you're reading this blog, then you know all this stuff.

My point is that C++ seemed to be in a secure position, while Java occupied a niche. For a while there it looked like Java wouldn't make it, when Microsoft set out to "Embrace and Extend" this system, like they'd done to so many before it.

And yet, here we are, a decade later. C++ has almost fallen off the map. Sure, it's still important in some areas, but it's largely been supplanted by more modern systems. Java holds pride of place in many systems, from financial, through GUIs, to embedded controllers. In fact, today Java seems to hold the place that C++ held a decade before. This alone should be an indication that Java has crested.

Using history as my guide, I would say that in 10 years time Java will be a very minor player, while something else that is a niche today will dominate the market.

If Java is to have any significance in the future, I would guess it to be in its Virtual Machine (VM). While there are many VMs out there, the Java VM has had a lot of work put into it by clever people, and now that it's being open sourced, a lot more clever people will be able to help. It has already been successful enough to have spawned several new systems that run on it, including Jython, JRuby and Groovy.

If we look at Ruby as an example of a possible successor, then some may criticize Ruby for not doing OO as well as Smalltalk, or functional programming as well as Erlang and Haskell. Similar criticisms can be made of any of the other languages that are popular today. However, Java was hardly original in anything it did, and that didn't stop it from achieving the success it ultimately enjoyed. (Personally, I see Ruby's threading support as an Achilles' heel, especially in today's hardware environment, and the direction the chip manufacturers are taking us.)

Friday, August 31, 2007

Lunar Eclipse

A friend of mine took a series of photos of the recent lunar eclipse, as seen from Brisbane. They're not serious astrophotography, but for someone like me who doesn't have much of a view of the sky any more, they were nice to see. (I miss my telescope).

I particularly like the bright exposure of the final crescent, and again when it re-emerges. It's also a nice example of how even the simplest of telescopic lenses is able to see the Galilean moons.

I recommend using the "Slideshow" view, to watch the progression of the full-size photos.

Thursday, August 30, 2007


Like David's post yesterday, there have been a number of discussions in recent months about the best practice for URIs that identify people. I've typically stayed out of the public debates, but have been involved in a number of offline conversations.

A popular approach to building these URIs is to configure an HTTP server such that when it receives a request for this URI it responds with an HTTP 303 (which means "See Other"). This lets the server respond with a document pertaining to that URI, but at the same time informs the client that this document is NOT the resolution of that URI. After all, the resolution of the URI is a person, and one can hardly respond with that (for a start, you'd need all that quantum state, and I haven't yet seen the internet protocols for quantum teleportation).

Another approach is to simply use the URI of a document describing the person, and tack an anchor onto the end. Typically this anchor is #me. Like the 303 approach, you get a retrievable document that can be found from the person's URI, and again that document has a different URI to the URI of the person. The main problem cited with this second approach is that a #me anchor may already exist in the document, meaning that the URI resolves to something other than the person. (While I recently learned that URI ambiguity is not strictly illegal, it is a really bad idea; after all, we usually rely on these things to identify a unique thing.) Other people suggest avoiding possible anchor ambiguity with a query (?key=value on the end of the URL). This is much less popular, and I'll let the public arguments against this stand for themselves.

While looking at the "303" approach the other day, I realized that both Safari and Firefox respond to a 303 as if it were a redirection. This makes sense in several ways. If a user has asked for "something" by address, then they'd like to see whatever data is associated with that address (as opposed to a response of "not here"). Also, the HTTP RFC says that this link should be followed. Even so, since the resulting document is NOT what was asked for, the user should at least be told that they are looking at the "Next Best Thing", rather than silently being redirected.
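This behavior is easy to observe outside a browser. The following is a self-contained sketch (the throwaway local server and the Location value are stand-ins for a real person URI): Ruby's Net::HTTP hands back the 303 instead of silently following it, so a client can present the Location as the "Next Best Thing" rather than pretending it was the resource.

```ruby
require 'net/http'
require 'socket'

# A throwaway one-shot server that answers every request with a 303,
# standing in for a person URI (the Location value is hypothetical).
server = TCPServer.new('127.0.0.1', 0)
port = server.addr[1]
Thread.new do
  client = server.accept
  while (line = client.gets) && line != "\r\n"; end  # drain request headers
  client.write "HTTP/1.1 303 See Other\r\n" \
               "Location: http://example.org/alice.rdf\r\n" \
               "Content-Length: 0\r\n\r\n"
  client.close
end

# Net::HTTP reports the 303 rather than redirecting behind your back,
# so the client can say "this describes the resource; it isn't it".
res = Net::HTTP.get_response(URI("http://127.0.0.1:#{port}/alice"))
puts res.code          # → 303
puts res['Location']   # → http://example.org/alice.rdf
server.close
```

This is the same distinction wget makes (it shows you the 303), and that Safari and Firefox hide.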

I came to all of this while updating my FOAF file the other day. While it is possible to describe all of your friends in minute detail, the normal practice is to include just enough information to uniquely identify them (plus a couple of things that are useful to keep locally, like the friend's name). Then when your FOAF file and your friends' files are brought into the same store, all that information gets linked up. This sounds great, until you realize that there is no defined way to find your friends' files. The various FOAF browsers, surfers, etc., that I've tried are all terrible at tracking down people's FOAFs, so whatever they're trying isn't working very well either.

Whether using anchor suffixes or 303s, the URI that people often use for themselves just happens to lead you to their own FOAF files. This would be the solution to the problem of finding your friends' files... if your friends happened to use this approach. While useful, it can't be relied upon for automatic FOAF file gathering. Because of this, I decided that I should try to put explicit links to all of my friends' FOAF URLs that I know about. This led me to tracking down the files of each of the people in my FOAF file (fortunately not many, as most of the people I know don't have a FOAF file), which had me following various 303 links, like the one to Tom's URI. I was using wget, which doesn't follow a "See Other" link automatically, and this was how I discovered that Tom was using a 303. I'm sure if I'd followed his URI with Firefox then I wouldn't have noticed the new address.

After following the links for all these people, I then wanted some way to describe the location of their FOAF in my own FOAF description of them. After some investigation of the FOAF namespace, I discovered that there is no specified way to do this. I suppose this is what led to the de facto standard that people have adopted where their person URI leads you (however indirectly) to their FOAF file. This actually makes perfect sense, as you don't want to invalidate people's links to you just because you chose to move the location of your file, but it's still annoying if you want to be able to link to other people's files. Perhaps everyone should get a PURL address?

The closest thing I could find to a property describing a FOAF file is the more general <foaf:homepage>. This property lets you link a resource (like a person) to some kind of document describing that resource. This meets the criteria of what I was looking for, but it is also more general than I was after, as it can also be used to point to non-FOAF pages, like a person's home page (the original intent of this property). All the same, I went with it, since it was a valid thing to do. At least it will help any applications that I write to look at my own file. It's a shame that it's so manual.

While thinking about how to automate this process, it occurred to me that I could try the following:
  • If a person's URI ends in an anchor, then strip it off, and follow the URI. If the returned document is RDF then treat it as FOAF data (identifying RDF as being FOAF or not FOAF is another problem).
  • Follow the person's URI, and if the result is a 303, then follow that URI. If the resulting document is RDF, then treat it as FOAF.
  • Iterate through each URI associated with the person (such as <foaf:homepage>) and if any of these return an RDF file then treat it as FOAF.
  • On each of the HTML pages returned from the previous iterations, check for <a href=...> tags to resources that don't end with .html, .jpg, .png, etc. If querying for any of these links returns an RDF file, then treat as FOAF.
Incidentally, Tom's FOAF file would only be picked up via the last method. You have to follow his URI to get a 303, which then leads you to his home page. Then on that page you'll find links to his FOAF file. Frankly, it was just easier to manually add a <foaf:homepage> tag to his file. :-)
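The first two steps can be sketched like this (a toy: the URIs are hypothetical, and the fetcher and RDF test are injected as lambdas so the heuristic can be exercised without a network):

```ruby
# A sketch of the first two discovery steps above. `fetch` returns
# [status, location, body] for a URI string, and `is_rdf` decides
# whether a body looks like RDF; both are injected for testability.
def find_foaf(person_uri, fetch, is_rdf)
  # Step 1: strip any anchor (e.g. #me) and try the document URI
  base = person_uri.sub(/#.*\z/, '')
  status, location, body = fetch.call(base)
  # Step 2: a 303 See Other points at a describing document; follow it once
  status, location, body = fetch.call(location) if status == 303
  status == 200 && is_rdf.call(body) ? body : nil
end

# Toy data: a person URI that 303s through to an RDF document
pages = {
  'http://example.org/alice'     => [303, 'http://example.org/alice.rdf', nil],
  'http://example.org/alice.rdf' => [200, nil, '<rdf:RDF/>']
}
fetch  = lambda { |u| pages.fetch(u, [404, nil, nil]) }
is_rdf = lambda { |body| body.to_s.include?('<rdf:RDF') }

puts find_foaf('http://example.org/alice#me', fetch, is_rdf)  # → <rdf:RDF/>
```

The remaining steps (trying <foaf:homepage> and scraping <a href=...> links) would slot in as further fallbacks after the nil return.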


During the various conversations I've had (mostly with Tom), it occurred to me that there is an underlying assumption that all URIs will be HTTP. This is particularly true for 303 responses, as this is an HTTP response code. However, nothing in RDF suggests that the protocol (or scheme, according to URI terminology) has to be HTTP. For instance, it isn't unheard of to find resources at the end of an ftp://... URL. It got me wondering how much it would break existing systems if the URIs used for and in a FOAF file were not in HTTP, but something different. If they handle anything else, then it's almost certain to be FTP (and possibly even HTTPS), so these weren't going to really test things. No, the protocol I chose was Gopher.

The GoFish server managed the details for me here, though it took me a bit of debugging to realize that it wasn't starting when it couldn't find a user/group of "gopher" on my system (Apple didn't retain that account on OS X. Go figure). Once I'd found that problem, it then took me a few minutes to discover that addresses for text files in the root are prefixed with 00/. But once that was done I was off and running.

I'm not a huge fan of running services from my home PC, so I can't say that I'll keep it up for a long time. But at the same time, it gives me some perverse pleasure to hand out my FOAF file as a gopher address. :-)

Sunday, August 05, 2007

Mulgara on Java 6

I'm nearly at the point where I can announce that Mulgara runs on Java 6 (or JDK 1.6). Many of the problems were due to tests looking for an exact output, while the new hashtables iterate over their data in a different order.

The remaining problems fall into two areas.

The first seems to be internal to the implementation of the Java 6 libraries. In one case the HTTPS connection code is unable to find an internal class that occurs in the same place in both Java 5 and Java 6. In another case, a method whose javadoc explicitly says it does no caching claims that it has "already closed the file" on all but the first attempt to open a JAR file via a URL, even though the URL object is newly created for each attempt.

These problems may involve some browsing of Java source code to properly track down. Fortunately, they only show up in rarely-used resolver modules (I've never used them myself).

The other problem is that as of Java 6, the java.sql.ResultSet interface has a new series of methods on it. I'd rather that we didn't, but we implement this interface with one of our classes. This is a holdover from a time when we tried to implement JDBC. While it mostly worked, there was a fundamental disconnect with the metadata requirements of JDBC, and so we eventually abandoned the interface. However, the internal implementation of this interface remains.

Since we don't use any of the new methods, it is a trivial matter to implement them with empty stubs. Eclipse does this with a couple of clicks, so it was very easy to do. Once this was done, the project compiled fine, and I could get on to tracking down the failures and errors in the tests, the causes of which I've described above.

All this was going well, until someone pointed out that there were some issues under Windows. After spending some time getting the OS up and running again, I quickly found that the class implementing ResultSet was missing some methods. How could this be? It had all the methods on OS X.

The simple answer is to run javap on the java.sql.ResultSet interface, and compare the results. Sure enough, on Windows (and Linux) the output contains 14 entries not found in OS X!


This is easy enough to fix. Implementing the methods with stubs will make it work on Linux and Windows, and will have no effect on OS X. But why the difference? This meets my definition of broken.

Saturday, August 04, 2007

OWL Inexpertise

One of my concerns about Talking to Talis yesterday (interesting pun between a verb and a noun there) was in making criticisms of some of the people working on OWL, when I'm really not enough of an expert to make such a call.

I expressed concern over the "logicians" who have designed OWL as being out of touch with the practical concerns of the developers who have to use it. While I still believe there is basis for such an accusation, it is glossing over the very real need for a solid mathematical foundation for OWL, and is also disrespectful to several people in the field whom I respect.

Knowing and understanding exactly what a language is capable of is vital to its development. Otherwise, it is very easy to introduce features that conflict, or don't make sense in certain applications. Conflicting or vague definitions may work in human language, but they are not appropriate when developing systems with the precision that computers require. I have to work hard to get to the necessary understanding of description logic systems, which is why I respect people like Ian Horrocks (or Pat Hayes, or Bijan Parsia, the list goes on...) for whom it all seems to come naturally. Without their work, we wouldn't know exactly what all the consequences of OWL are, meaning that OWL would be useless for reasoning, or describing much of anything at all.

However, coming from a perspective of "correctness" and "tractability", there is a strong desire in this community to keep everything within the domain of OWL-DL (the computationally tractable variant of OWL). Any constructs which fall outside of OWL-DL (and into OWL Full) are often dismissed. Anyone building systems to perform reasoning on OWL seems to be limiting their domain to OWL-DL or less. There appears to be an implicit argument that since calculations for OWL Full cannot be guaranteed to complete, then there is no point in doing them. Use of many constructs is therefore discouraged, on the basis that it is OWL Full syntax.

While this makes sense from a model-theoretic point of view, pragmatically it doesn't work. Turing machines are not even decidable (for instance, one can write an infinite loop, and no general procedure can detect it), and yet no one has suggested that Turing-complete languages are unimportant! Besides, Gödel taught us that completeness is not all that it's cracked up to be.

A practical example of an OWL Full construct is in trying to map a set of RDBMS tables into OWL. It is very common for such tables to be "keyed" on a single field, often a numeric identifier, but sometimes text (like a student number, or SSN). Even if these fields are not the primary key of the table, a good mapping into a language like OWL will need to capture this property of the field.

The appropriate mapping of a key field on a record is to mark that field as a property of type owl:InverseFunctionalProperty. However, it is not legal to use this property on a number or a string (an RDF literal) in anything less than OWL Full.
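As a sketch in Turtle (the prefixes are standard, but the ex:studentNumber property is hypothetical), the natural mapping looks like this, and the combination of the two types is exactly what pushes the ontology into OWL Full:

```turtle
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix ex:  <http://example.org/schema#> .

# A literal-valued key field declared inverse functional: a property
# that is both an owl:DatatypeProperty and an
# owl:InverseFunctionalProperty is legal in OWL Full, but not OWL-DL.
ex:studentNumber a owl:DatatypeProperty , owl:InverseFunctionalProperty .
```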

There are workarounds to stay within OWL-DL. However, this is one of many common use cases where workarounds are required to stay within its confines. While it is theoretically possible that using owl:InverseFunctionalProperty on a literal could cause intractability, most use cases will not lead to this. It would seem safe in many systems to permit it, with an understanding of the dangers involved. Instead, the unwillingness of the experts to let people work with OWL Full has imposed onerous restrictions on many developers. This in turn leads them to simply not bother with OWL, or to go looking for alternatives.

I can appreciate the need to prevent people from shooting themselves in the foot. On the other hand, preventing someone from taking aim and firing at their feet often leads to other difficulties, encouraging them to just remove the safety altogether.

It's an argument with two sides. There may well be many logicians out there who agree that a practical approach is required for developers, in order to make OWL more accessible to them. However, my own observations have not seen any concessions made on this point.
There. It reads much better here than the bald assertion I made for Talis. :-)

Friday, August 03, 2007


Last night Luc was determined to keep me up, and he did a pretty good job of it. This happens frequently enough that it shouldn't be worth a mention in this blog, except that today I had agreed to speak with Paul Miller from Talis, for the Talking with Talis podcast.

So that I'd be compos mentis, I resorted to a little more coffee than usual (I typically have one in the morning, and sometimes have one in the afternoon. Today I had two in the morning). While this had the desired effect of alertness, the ensuing pleonastic babble was a little unfortunate. Consequently, I feel like I've embarrassed myself eight ways to Sunday, though Paul has been kind enough to say that I did just fine.

I was caught a little off guard by questions asking me to describe RDFS and OWL. Rather than giving a brief description, as I ought to have, I digressed much too far into inane examples. I also said a few things which I thought at the time were kind of wrong (by which, I mean that I was close, but did not hit the mark), but with the conversation being recorded it felt too awkward to go back and correct myself, particularly when I'd need a little time to think in order to get it right.

Perhaps more frustratingly, my needless digressions and inaccurate descriptions stole from the time that could have been used to talk about things I believe to be more interesting. In particular, I'm thinking of the Open Source process, and how it relates to a project like Mulgara. David was able to give a lot of the history behind the project, but as an architect and developer, I have a different perspective that I think also has some value. I also think that open source projects are pivotal in the development of "software as a commodity", which is a notion that deserves serious consideration at the moment. I touched on it briefly, but I also ought to have elaborated on how open source commodity software is really needed as the fundamental infrastructure for enabling the semantic web, and hence the need for projects like Mulgara, Sesame and Jena.

But despite my missed opportunity to discuss these things today, I should not consider Talis's podcast to be a forum for expressing my own agenda. If I have a real desire to say these things, then I should be using my own forum, and that is this blog.

As always, time is against me, but I'll mention a few of these things, and perhaps I can have time to revisit the others in the coming weeks.


I should also have mentioned some of the other names involved in Mulgara, from both the past and present. Fortunately, David already mentioned some of them (myself included), but since I'm in my own blog I can go into some more detail. Whether paid or not, these people all put a great deal of commitment into making this a project with a lot to offer the community. However, since there are so many, I'll just stick to those who have some kind of ongoing connection to the project:
  • David Wood, who decided we could write Mulgara, made enormous sacrifices to pay for it out of his own pocket... and THEN made it open source! His ongoing contributions to Mulgara are still valuable.
  • David Makepeace (a mentor early in my career, who I was fortunate to work with again at Tucana) who was the real genius behind the most complex parts of the system.
  • Tate Jones, who kept everyone focused on what we needed to do.
  • Simon Raboczi who drove us to use the standards, and ensured the underlying mathematical model was correct.
  • Andrew Newman who knew everything there was to know in the semantic web community, and aside from writing important code, he was the one who wouldn't stop asking when we could overcome the commercial concerns and make the system Open Source.
  • Andrae Muys, the last person to join the inner cabal, and the guy who restructured it all for greater modularity, and correctness. This contribution alone cannot be overstated, but since Tucana closed shop he has remained the most committed developer on the project.
  • Collectively, the guys at Topaz, who have provided more support than anyone else since Tucana closed.
These were just some of the guys who made the project worthwhile, and Tucana a great place to work.

Sorry to those I didn't mention.

Even if I move past Mulgara and into a new type of RDF store, then the open source nature of Mulgara will allow me to bring a lot of that intelligence and know-how forward with me. For this reason alone, I think that the Open Source process deserves some discussion.


Back when Mulgara (or TKS/Kowari) was first developed, it was interesting to see the schemas being proposed. Looking at them, there was a clear influence from the underlying Description Logic that RDF was meant to represent. However, I was not aware of description logics back then, and instead only knew about RDF as a graph. Incidentally, I only considered RDF/XML to be a serialization of these graphs (a perspective that has been useful over the years), so a knowledge of this wasn't relevant to the work I was doing (though I did learn it).

Since I was graph focused, and not logic focused, I didn't perceive predicates as having a distinct difference from subjects or objects (especially since it is possible to make statements where predicates appear as subjects). Also, while "objects" are different from "subjects" by the inclusion of literal values, this seemed to be a minor annotation, rather than a fundamental difference. Consequently, while considering the "triple" of subject, predicate and object, I started wondering at the significance of their ordering. This led me to drawing them in a triangle, much as you can see in the RDF Icon.

This then led naturally to the three way index we used for the first few months of the project, and is still the basis of our thinking today. Of course, in a commercial environment, we were acutely aware of the need for security, and it wasn't long before we introduced a fourth element to the mix. Initially this was supposed to provide individualized security for each statement (a requested feature), but it didn't take long to realize that we wanted to group statements together, and that security should be applied to groups of statements, rather than each individual statement (else security administration would be far too onerous, regardless of who thought this feature would be a good idea). So the fourth element became our "model", though a little after that the name "graph" became more appropriate.

Moving to 4 nodes in a statement led to an interesting discussion, where we tried to determine what the minimum number of indices would be, based on our previous 3-way design. This is what led to the 6 indices that Mulgara uses today. I explored this in much more depth some time later in this blog, with a couple of entries back in 2004. In fact, it is this very structure that allows us to do very fast querying regardless of complexity (and if we don't, then it just needs re-work on the query optimizer, and not our data structures). More importantly, for my recent purposes (and my thesis), this allows for an interesting re-interpretation of the RETE algorithm for fast rule evaluation. This then is our basis for performing OWL inferences using rules.
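The claim about the six indices is easy to check mechanically: a query pattern with some set of bound positions can be answered efficiently if that set forms a prefix (in some order) of at least one index ordering. The six orderings below are an illustrative set (Mulgara's actual choice may differ), and this sketch verifies that they cover all 16 binding patterns over a quad:

```ruby
# Six orderings over subject, predicate, object, graph. A pattern is
# "covered" if its bound positions are a prefix of some ordering.
ORDERINGS = %w[spog posg ospg gspo gpos gosp]

def covered?(bound, orderings)
  orderings.any? { |ord| ord.chars.first(bound.size).sort == bound.sort }
end

# Every subset of {s, p, o, g}, from fully unbound to fully bound
all = %w[s p o g]
patterns = (0..all.size).flat_map { |k| all.combination(k).to_a }
puts patterns.all? { |pat| covered?(pat, ORDERINGS) }  # → true
```

Six is also the minimum: there are C(4,2) = 6 two-element subsets, and no single ordering can have two different two-element prefixes.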

See? It's all tied together, from the lowest conceptual levels to the highest!

I freely acknowledge that OWL can imply much more than can be determined with rules (actually, that's not strictly true, as an approach using magic sets to temporarily generate possible predicates can also get to the harder answers - but this is not practical). To get to these other answers, the appropriate mechanism is a Tableaux reasoner (such as Pellet). However, from experience I believe that most of what people need (and want) is covered quite well with a complete set of rule-based inferences. This was reinforced for me when KAON2 came up with exactly the same approach (though I confess to having been influenced by KAON2 before it was released, in that I was already citing papers which formed the basis of that project).

All the same, while I think Rules will work for most situations, having a tableaux reasoner to fall back on will give Mulgara a more complete feature set. Hence, my desire to integrate Pellet (originally from MIND Lab).

I have yet to look at the internals of Pellet, to see how it stores and accesses its data. I'd love to think that I could use an indexing scheme to help it to scale out over large data sets like rules can, but my (limited) knowledge of the tableaux algorithm says that this is not likely.

Open Source

There are several reasons for liking Pellet over the other available reasoners. First, is that it is under a license that is compatible with Mulgara. Second, is that I saw the ontology debugger demonstrated at MIND Lab a couple of years ago, and have been smitten ever since. Third, the work that Christian Halaschek-Wiener presented at SemTech on OWL Syndication, convinced me that Pellet is really doing the right thing for scalability on TBox reasoning.

Finally, Pellet is open source. Yes, that seems to be repeating my first point about licenses, but this time I have a different emphasis. The first point was about legal compatibility of the projects. The point I want to make here is that reasoning like this is something that everyone should be capable of doing, in the same way that storing large amounts of data should be something that everyone can do. Open source projects not only make this possible, but if the software is lacking in some way, then it can be debugged and/or expanded to create something more functional. Then the license point comes back again, allowing third party integration and collaboration. This lets people build something on top of all these open source commodities that is a gestalt of all the components. Open source projects enable this, allowing the community to rapidly create things that are conceptually far beyond the component parts.

From experience, I've seen the same process in the commercial and open source worlds. In the commercial world, the growth is extraordinarily slow. This is because of limited budgets, and limited communication between those who can make these things happen. Ideas are duplicated between companies, and resources are spent trying to make one superior to all the others, sometimes ignoring customers' needs (and often trying to tell the customer what they need).

In the open source world, everyone is free to borrow from everyone else's ideas (within license compatibility - a possible bugbear), to expand on them, or to use them as a part of a greater whole. Budgets are less of an issue, as projects have a variety of resources available to them, such as contributing sponsors, and hobbyists. Projects focus on the features that clients want, because often the client is contributing to the development team.

Consider MS-SQL and Oracle. Both are very powerful databases, which have competed now for many years. In a market dominated by these players, it is inconceivable that a new database could rival them. Yet MySQL has been steadily gaining ground for many years, first as a niche product for specialized use, and then more and more as a fully functional server. It still has a way to go to scale up to high end needs as the commercial systems do, but this is a conceivable target for MySQL. In the meantime, I would guess that there are more MySQL installations in the world than almost any other RDBMS available today. Importantly, it got here in a fraction of the time that it took the commercial players.

Semantic Web software has a long way to go before reaching the maturity of products like those I just mentioned. But history has shown us that the way forward is to make the infrastructural software as open and collaborative as possible, enabling everyone to develop at a much higher level, without being concerned about the layers below them. Higher-level development has happened with many layers of computing in the past (compilers, OO toolkits, spreadsheets, databases, scripting languages for server-side and client-side work), and the cheaper and more open the lower levels were, the more rapid and functional the high-level development became.

It is at this top level that we can provide real value for the world at large, and not just the IT community. It is this that should be driving our development. We should not be striving to make computing better. Computers are just tools. We should be striving to make the world better.

Sounds pretty lofty, I know. Blame the caffeine from this morning wearing off and leaving me feeling light headed. But there has to be some point to it all. This all takes too much work if we indulge in navel gazing by only enabling IT. IT has to enable people outside of its own field or else there is no reason for it to exist, and we will all get caught in another .com bubble-burst.

Monday, July 30, 2007

WiFi in the Balkans

For some time I've known that there is WiFi all over the place here (such an easy-to-remember name when compared to 802.11). However, using an iPhone really brings it home. Whenever I look something up while out of range of my usual networks (and let's face it, I wouldn't have bought an iPhone if I weren't going to be using it all the time) I get a list of anything from 3 to a dozen networks within range. And this doesn't count the networks that aren't being broadcast (though I don't think many people use this option). With the exception of the occasional commercial access system (give us your credit card details and we'll let you in), all of these access points are locked.

Many people have unlimited, or virtually unlimited, high speed internet access, and they're all attaching these wireless gateways to them. These access points then overlap tremendously in range, causing interference with each other, and slowing each other down. This seems like massive duplication to me. Add to this the fact that most of these networks spend the majority of their time idle, and the pointlessness of the situation is even more frustrating.

I'm not advocating grid networking (I'm skeptical that the technology has the algorithms to efficiently route the massive amount of data it would need to deal with). However, it would seem that if the network were configured such that access point owners could open up their access points and let everyone on, then everyone would benefit. Some points would get more traffic than others, but overall it should even out. Coming from this perspective I can understand why so many cities have looked at providing this service, and why Google decided to roll it out in Mountain View.

Many of the advantages are obvious. More efficient usage of the airwaves (fewer mid-air packet collisions), ubiquitous urban access, and less infrastructure cost to the community as a whole.

Unfortunately, I can see the downside too. It would have to be paid for by the community, rather than the individual. The total cost would be much less than is being paid now (by each individual, with their own access point and their own internet connection), but the money still has to come from somewhere. It may not be much in a city budget, but there are always those who don't feel they need to pay extra taxes for services they don't use. I disagree with this view, but my opinion carries no weight with voters, and by extension, with politicians.

I can also see the authorities having a hissy fit over it. It's trivial to use the internet anonymously (3 coffee shops within a block of here have free WiFi - not to mention more technical solutions), but the ignorance or laziness of most people still allows the police, and others, to track down people of interest. The fact that this kind of tracking can be circumvented, or even redirected to someone innocent, is of little consequence here. Those who want to find people engaging in certain activities on the internet would not want to allow universal anonymous access, especially in this age of post-9/11 paranoia. Authorized (and identified) access is not really feasible in this situation, as it would be nearly impossible to roll out and enforce, and easily circumvented. So the easy solution is just to prevent people from having ubiquitous community-sponsored WiFi.

The legal framework for some of these restrictions is already being set up in some jurisdictions. Many concerns are currently around accessing private (sometimes download limited) networks, but as these concerns are removed with the promise of ubiquitous "free" access, then other reasons will be cited.

Even more influential than law enforcement (in this country) are the network providers. These companies have already tried to prevent cities from rolling out ubiquitous WiFi. They are obviously scared it will threaten their business model. I don't really care too much, as they are already being paid a lot of money for underutilized service (all those redundant lines not being used to their capacity), and they abuse their market in many other ways as well. Like many other large companies, they are unwilling to try to keep up with their market, preferring to shape the market to their own desires. This works in the short term, but history shows it is doomed to failure in the long run.

In the meantime, I'll continue to use EDGE on my iPhone, and wish that my previous phone hadn't died before Apple brought out a model that included 3G.

I'm struggling to stay awake while I type. Does it show?

Sunday, July 29, 2007


I've just spent a week working in Novato, CA. While I didn't get much programming done, I did manage a few very productive conversations. I spent the whole week working with Alan, who is interested in Mulgara (for various reasons), and on Wednesday night I finally got to meet with Peter from Radar Networks.

While I was describing the structure of Mulgara, and particularly the string pool, Alan made a number of astute observations. First of all, our 64-bit gNodes don't have a full 64-bit address space to work in, since each node ID is multiplied by the size of an index entry to get a file offset. This isn't an issue in terms of address space (we'd have to be allocating thousands of nodes a second for decades for this to be a problem), but it means there are several bits at the top of the ID that can never be used for addressing. This provides an opportunity for storing type information in the ID.
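A sketch of that arithmetic (the 32-byte entry size here is my assumption for illustration, not Mulgara's actual figure): since the file offset is the node ID times the entry size, the top bits of the ID can never contribute to a valid offset, and are free to carry a type tag.

```java
public class NodeIdBits {
    // Hypothetical entry size; Mulgara's real index entry size may differ.
    static final long ENTRY_SIZE = 32;  // bytes per index entry, a power of two
    // Each doubling of the entry size frees one more bit at the top of the ID.
    static final int FREE_BITS = Long.numberOfTrailingZeros(ENTRY_SIZE);

    // offset = nodeId * ENTRY_SIZE must stay within a positive 63-bit range,
    // so only 63 - FREE_BITS bits of the ID are usable for addressing.
    static long maxAddressableId() {
        return (1L << (63 - FREE_BITS)) - 1;
    }

    // The freed bits can hold a small type tag alongside the node ID.
    static long withTag(long nodeId, int typeTag) {
        return ((long) typeTag << (63 - FREE_BITS)) | nodeId;
    }

    public static void main(String[] args) {
        System.out.println(FREE_BITS + " free bits");         // 5 for 32-byte entries
        System.out.println("max id: " + maxAddressableId());  // 2^58 - 1
    }
}
```

With 32-byte entries that's 5 spare bits, enough to distinguish a few dozen node types without touching the string pool.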

This observation on the address space took on new relevance when Peter mentioned that another RDF datastore tries to store as much data as possible directly in the indexes, rather than redirecting everything (except blank nodes) through their local equivalent to the string pool. This actually makes perfect sense to me, as the Mulgara string pool (really, it's a "URI and literal" pool) is already able to fit a lot of data into less than 64 bits. We'll only fit in short strings (7 ASCII characters or fewer), but most numeric and date/time data types should fit in here easily. Even if they can't, we could still map a reduced set of values into this space (how many DateTime values really need more than, say, 58 bits?).
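To illustrate the "7 ASCII characters or fewer" point, here's a minimal sketch of packing a short string into the low 56 bits of a long, leaving the top byte free for a type tag. The layout is mine, for illustration only, not the string pool's actual encoding.

```java
public class InlineString {
    // Pack up to 7 ASCII characters into the low 56 bits of a long.
    static long pack(String s) {
        if (s.length() > 7) throw new IllegalArgumentException("too long to inline");
        long v = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c > 127) throw new IllegalArgumentException("not ASCII");
            v |= ((long) c) << (8 * i);  // one byte per character
        }
        return v;
    }

    static String unpack(long v) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 7; i++) {
            int c = (int) ((v >>> (8 * i)) & 0xFF);
            if (c == 0) break;  // a NUL byte ends the string in this toy layout
            sb.append((char) c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        long packed = pack("mulgara");       // exactly 7 characters
        System.out.println(unpack(packed));  // prints "mulgara"
    }
}
```

Numeric types are even easier, since a long or a double is already 64 bits with room to spare once the value range is restricted.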

Indeed, I'm only considering the XA store here. When the XA2 store starts to come online it will have run-length encoded sets of triples in its blocks. This means we can really stretch the length of what gets encoded in the indexes without diverting to the string pool.

The only thing that this approach might break would be some marginal uses of the Node Type and DataType resolvers. These resolvers are usually used to test or filter for node type information, and this function would not be affected. However, both resolvers are capable of being queried for all the contents of the string pool that meet the type criteria, and this function would be compromised. I'm not too worried though, as these functions are really only useful for administrative processes (and marginally at that). The only reason I allowed for this functionality in the first place was because I could, and because it was the natural semantic extension of the required operations. Besides, some of the other changes we might make to the string pool could invalidate this style of selection of "all uses of a given type".

Permanent Strings

The biggest impediment to load speed at the moment appears to be the string pool. It's not usually a big deal, but if you start to load a lot of string data (like from enwiki) then it really shows. Sure, we can cache pretty well (for future lookups), but when you are just writing a lot of string data this doesn't help.

The use cases I've seen for this sort of thing usually involve loading a lot of data permanently, or loading it, dropping it, and then re-loading the data in a similar form. Either way, optimizing for writing/deleting strings seems pretty pointless. I'm thinking that we really need an index that lets us write strings quickly, at the expense of not being able to delete them (at least, not while the database is live).
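The idea can be sketched as a trivial in-memory pool (the real thing would be file backed, with the offset into the file serving as the ID; the class and method names here are mine, not Mulgara's):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A toy write-once string pool: strings can be added and looked up,
// but never removed while the pool is live. Appending is cheap because
// there is no free-list or deletion bookkeeping to maintain.
public class WriteOncePool {
    private final List<String> byId = new ArrayList<String>();
    private final Map<String, Integer> byValue = new HashMap<String, Integer>();

    public int add(String s) {
        Integer existing = byValue.get(s);
        if (existing != null) return existing;  // duplicates share one entry
        int id = byId.size();                   // position doubles as the ID
        byId.add(s);
        byValue.put(s, id);
        return id;
    }

    public String get(int id) { return byId.get(id); }

    public static void main(String[] args) {
        WriteOncePool pool = new WriteOncePool();
        int a = pool.add("urn:example:foo");
        int b = pool.add("urn:example:foo");
        System.out.println(a == b);  // true: the same string maps to one ID
    }
}
```

Dropping data then becomes a matter of discarding the whole pool offline, rather than deleting individual entries.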

I'm not too concerned about over optimizing for this usage pattern, as it can just be written as an alternative string pool, with selection made in the mulgara-config.xml file. It may also make more sense to make a write-once pool the default, as it seems that most people would prefer this.

I've been discussing this write-once pool with a few people now, but it was only while talking with Alan that I realized that almost everything I've proposed is already how Lucene works. We already support Lucene as the backend for a resolver, so it wouldn't be a big step to move it up to taking on many of the string pool functions. Factor in that many of the built-in data types (short, int, character, etc.) can be put into the indexes inline, and the majority of things we need to index in the string pool end up being strings after all, which of course is what Lucene is all about. Lucene is a great system, and integration of projects like this is one of the big advantages of building open source projects.

It's been a while since I wrote to the Lucene API. I ought to pull out the docs and read them again.

Saturday, July 21, 2007

Java 1.6

Mulgara currently doesn't work with Java 6 (also called JDK 1.6). I knew I needed to enable this, but have been putting it off in favor of more important features. But this release made it very plain that Mulgara is in an awkward position between two Java releases: namely JDK 1.4 and JDK 1.6.

The main problem in going from Java 1.4 to Java 5 was the change in libraries included in the JRE. Someone had taken advantage of the Apache XML libraries that were in there, but these had since all changed packages, or were no longer available. The other issue was a few incompatibilities in the Unicode implementation - some of which were the reason for introducing the CodePoint class last year (which I published just 8 days ago).

Going to Java 6 is relatively easy in comparison. Sun learnt their lesson about dropping in third party libraries that users may want to override with more recent versions, so this was not an issue. The only real change has been to the classes in java.sql, in which new interfaces and extensions to old interfaces have prevented a few classes from compiling. This is easily fixed with some stub methods to fulfill the interfaces, since we know these methods are not being called internally in Mulgara.
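As an example of the pattern, Java 6 added the java.sql.Wrapper interface (with unwrap() and isWrapperFor()) to the JDBC types, so a class implementing one of those types just needs compile-time stubs like these (the class name here is illustrative, not actual Mulgara code):

```java
import java.sql.SQLException;

// Stubs to satisfy the Java 6 compiler. These methods are never called
// internally, so throwing or returning a trivial value is safe.
public class StubExample implements java.sql.Wrapper {
    public <T> T unwrap(Class<T> iface) throws SQLException {
        throw new SQLException("unwrap() not supported");
    }

    public boolean isWrapperFor(Class<?> iface) throws SQLException {
        return false;  // this class wraps nothing
    }
}
```

Since the stubs compile cleanly on Java 6 and the interface didn't exist before, the same trick keeps the codebase buildable across releases once the old interfaces are accounted for.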

I haven't gone through everything yet (like the failing HTTP tests), but the main problem for Mulgara seems to be in passing the tests, not in the code itself. The first of these was a query that returned the correct data, but out of order. Any query whose results are to be tested should have an ORDER BY directive, so this failure should never have been allowed to happen. It's easily resolved, but it made me wonder about the change in ordering, until I got to the next test failure.

Initially, I was confused with this failure. The "bad output" contained an exception, which is usually a bad sign. But when I looked at the query which caused the exception I realized that an exception was the correct response. So how could it have passed this test for previous versions of Java? Was it a Schrödinbug?

The first step was to see what the initial committer had expected the result to be. That then led to a "Doh!" moment. The idea of this test was to specifically test that the result would generate an exception, and this was the expected output. Why then, the failure?

Upon careful inspection of the expected and actual outputs, I found the difference in the following line from the Java 6 run:
Caused by: (QueryException) org.mulgara.query.TuplesException: No such variable $k0 in tuples [$v, $p, $s] (class org.mulgara.resolver.AppendAggregateTuples)
Whereas the expected line reads:
Caused by: (QueryException) org.mulgara.query.TuplesException: No such variable $k0 in tuples [$p, $v, $s] (class org.mulgara.resolver.AppendAggregateTuples)
I immediately thought that the variables had been re-ordered due to the use of a hash table (where no ordering can be guaranteed). So I checked the classes which create this message (org.mulgara.resolver.SubqueryAnswer and org.mulgara.resolver.AppendAggregateTuples). In both cases they use a List, but I was still convinced that the list must have originally been populated from a HashSet. In fact, this also ties in with the first so-called "failure" I saw, where data in a query was returned in a different order. Some queries will use internal structures to maintain their temporary data, and this one must have been using a Set as well.

To test this, I tried the following code in Java 5 and 6:
import java.util.HashSet;

public class Order {
  public static void main(String[] args) {
    // Populate the set with the three variable names from the failing test.
    HashSet<String> s = new HashSet<String>();
    s.add("v");
    s.add("p");
    s.add("s");
    for (String x : s) System.out.print(x + " ");
  }
}
In Java 5 the output is: p s v
In Java 6 the output is: v s p

I checked on this, and the hash codes have not changed. So it looks like HashMap (which backs HashSet) has changed its internal storage technique.


I have two ways I can address my problem. The first is to find the map where the data gets reorganized, and either use an ordered collection type, or else use a LinkedHashSet. The latter is still a set, but also guarantees ordering. However, this is a patch, and a bad one at that.
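For instance, swapping in a LinkedHashSet keeps Set semantics while pinning the iteration order to insertion order, regardless of how the JDK hashes between releases (a sketch of the workaround, not the actual Mulgara code):

```java
import java.util.LinkedHashSet;
import java.util.Set;

// LinkedHashSet iterates in insertion order, so the result no longer
// depends on the JDK's internal hashing.
public class OrderDemo {
    static String joined(String... items) {
        Set<String> s = new LinkedHashSet<String>();
        for (String item : items) s.add(item);  // duplicates are dropped
        StringBuilder sb = new StringBuilder();
        for (String x : s) sb.append(x).append(' ');
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Always "p v s", on Java 5 and Java 6 alike.
        System.out.println(joined("p", "v", "s"));
    }
}
```

The cost is a linked list threaded through the entries, which is negligible for sets of variable names.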

The real solution is to write some more modules for use in JXUnit, to make it more flexible than the current equal/not-equal comparisons done on strings. This seems like a distraction from writing actual functionality, but I think it's needed, despite it taking longer than the "hack" solution.

Speaking of which... DavidW just asked if I could document the existing resolvers in Mulgara 1.1 (especially the Distributed Resolver). He didn't disagree with my reasons for releasing without documentation, but he pointed out that not having it written up soon could result in a backlash. Much as I hate to admit it (since I have other things to do), he's right.

Wednesday, July 18, 2007


The last week or so has not been conducive to sleep, nor blogging. After getting back I've been trying very hard to get Mulgara version 1.1 out the door. This involved cleaning code a little, but mostly getting documentation right, administering Subversion, updating web sites, looking for source code for obsolete libraries we haven't updated yet (there's something I could use some spare time for), and a hundred things I can't remember right now. And to think, I only took on the administration role to expedite my desire to continue developing for the project.

Shortly after release, there was an awkwardnessful* moment when someone found that the Ant build was broken if a command-line Subversion client was unavailable. Fortunately, Ronald fixed this rapidly, and all was well.

Speaking of Ronald, I should mention that a number of other people were involved in this as well. Ronald did some fixes and added a few features. Andrae did some amazing architectural work and implementations. Amit arranged for some of Andrae and Ronald's time. David followed through where few would dare to tread by checking the legal side of our licensing and the libraries we use. Even Brian (who has been MIA until recently) did some proof reading for me. Many thanks to all of these people.

So this leaves me with the rest of my professional life to get back to.


So much to talk about, so little time.

I've been talking with a number of people lately on IRC, and IM, about some of the ideas I've had lately. I want to blog this stuff, but with the amount of time I've had lately I'm starting to wonder if I should just post the conversation logs (this was first suggested to me in Twitter by Peter). In the meantime, I hope some of the technical stuff is evolving to a more well thought out plan before I get it all written.

On the other hand, this may also be working out for the best. Some of the blogs I want to write are just rants. This may or may not be justified, but it wouldn't hurt me to get a full night's sleep before indulging in such a diatribe!

In the meantime, tonight I'm just writing something to remind people I'm still here.


With an impending tax refund, plus some frantic saving, I've finally convinced Anne to let me get a piano. One of my lifelong dreams is to own my own baby grand (but that needs more money than I'm likely to have any time soon, plus a bigger house than I'm likely to live in!), with a backup plan of a nice upright. Unfortunately, a decent upright piano is still too expensive, and much too heavy to get up our narrow stairway.

A good compromise at this stage is a digital piano. Now, I've never been a big fan of digital pianos (ironic, coming from an electrical engineer), but I have to say that you can get some pretty nice ones now. The one we finally selected has a very realistic action (essential for proper playing) and very nice sound, achieved through dynamic sampling. This is a fancy way of saying that they recorded every key being hit with varying amounts of force, so the sound played back is a completely different one, depending on how hard you hit the key. It's a great idea, and sounds really good.

But digital piano designers have never really seemed to understand harmonics. Sigh.

All the same, it's almost like the real thing, and it's been an amazing sense of relief to be able to play it. I haven't been able to play now for a few years, and it's always been one of my most enjoyable ways of relaxing the mind. Last night was the first time I got on the new keyboard, and I didn't fire it up until I got the Mulgara fix done (and then I had to spend time assembling it). So I was tired by the time I got to sit and play. But that didn't matter. Two hours later, my right hand was cramping, the fourth and fifth fingers on my left were too exhausted to respond properly, and my back was aching. But I felt so happy that none of that seemed to matter. :-)

Tonight after just a short warmup, I struggled my way through the first movement of the Pathétique sonata. This is way beyond me at the moment, but still fun to struggle through (no, I didn't do the repeat. I could already feel my hand tightening up by the time I got to it). By the end, I was feeling pretty much like I did last night, only now my brain felt fried from trying to read such dense music. Well, it was dense for someone as out of practice as I am.

The fried-brain feeling was unusual. I got to the final page of the first movement (it's 10 pages long) and I started looking at chords and triads that I knew I was capable of understanding, and yet my brain refused to decipher them. I had to stop and look away for 5 seconds before I could look back and understand it. I've never had anything like that happen to me before. I guess I was just using paths in my brain that haven't been used for a very long time. It should get better soon.


After 2 years (1.5 in Chicago) at work, I've finally received a computer! Admittedly, I avoided asking to start with, as the standard desktop was Windows (and I've done enough of that, thank you very much), but 7 months ago I realized I was wasting too much time on my slow old PowerBook, so I put in the request. Fortunately, by this time, it was agreed that a new Mac would be appropriate, and so I've been waiting ever since.

The order just went in 2 weeks ago, and the computer was supposed to arrive yesterday. However, after 7 months of waiting I wasn't really surprised to get an email forwarded to me from Apple which said that the shipment had been delayed. I was worried it wouldn't be here until after I've left for San Francisco next week, but then it showed up at lunch time today.


I used a firewire cable to bring over as much as I could from my previous machine, but there are a few things that aren't working. The first thing I had to do was to get the latest in Java SDKs and documentation. After that, I've been trying to get up to date on the various PPC binaries that I've compiled through Fink. It hasn't been all smooth sailing (the firewire transfer took several hours, during which time I had NO computers available to me), but it's nearly there. I have to say though... it's fast!

Looking forward to actually getting some real work done in the office now!

* 10 points to anyone who can say where this word comes from. You can prove it by providing the other word in the same category. :-)

Friday, July 13, 2007


I was just asked about the Code Point code that I discussed last year. This is essentially a fix for java.lang.Character, since Unicode can require more than 16 bits.

I've put the Java file up here. There's also an example class which takes a string on the command line, and returns the characters sorted, with duplicates removed (the original task that led me to write this class). For instance, if the command line is:
$ java SortUnicode bacdaffaabzr
Then the output is:
abcdfrz
Sure, this can be done relatively easily in Java. The advantage of the CodePoint class is that it provides a useful utility interface, and allows a more functional approach. In this case, it's possible to use a single (verbose) line:
new TreeSet(
There's nothing fancy here, but I hope it's useful.
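For comparison, here is a sketch of the same task using only the standard code point methods that the core libraries gained in Java 5 (the CodePoint class itself isn't reproduced here):

```java
import java.util.Set;
import java.util.TreeSet;

// Sort a string's characters and drop duplicates, handling characters
// outside the 16-bit range correctly by working with full code points.
public class SortUnicode {
    static String sortUnique(String s) {
        Set<Integer> codePoints = new TreeSet<Integer>();  // sorted, no duplicates
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);       // full code point, not a 16-bit char
            codePoints.add(cp);
            i += Character.charCount(cp);    // surrogate pairs advance by 2
        }
        StringBuilder sb = new StringBuilder();
        for (int cp : codePoints) sb.appendCodePoint(cp);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(sortUnique("bacdaffaabzr"));  // prints "abcdfrz"
    }
}
```

The win of a utility class like CodePoint is wrapping this index-juggling behind a cleaner, more functional interface.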

Thursday, July 12, 2007

OWL with Rules

Yesterday Henry asked me about using EulerSharp (by Jos De Roo) with Mulgara. I've already done the rules engine, and I'm very happy with it, so I told him I won't be replacing it. Admittedly, there's room for optimization - but that's just a matter of finding the time. It runs well as it is.

Of interest though is getting together the rules for making OWL happen. EulerSharp has these in abundance. It may be worthwhile writing an interpreter that can translate them to the Mulgara engine.

There are the obvious rules like:
{?P @has owl:inverseOf ?Q. ?S ?P ?O} => {?O ?Q ?S}.
But the real value would be in rules like:
{?J @has owl:intersectionOf ?L.
?L :item ?I1.
?I1 owl:onProperty ?P;
owl:someValuesFrom ?S.
?S owl:onProperty ?Q;
owl:minCardinality ?N.
?N math:equalTo 2.
?L :item ?I2.
?I2 owl:onProperty ?P;
owl:allValuesFrom ?A.
?A @has owl:unionOf ?K.
?K :item ?C, ?V.
?V owl:onProperty ?Q;
owl:maxCardinality ?M.
?M math:equalTo 1.
?L :item ?I3.
?I3 owl:onProperty ?P;
owl:allValuesFrom ?D.
?C owl:disjointWith ?D}
=> {?J owl:equivalentClass owl:Nothing}.
Which is a convoluted way of testing if a class is an intersection between restrictions with max cardinality of one, and minimum cardinality of 2, where the other possibilities of class membership (via unions, etc) are all eliminated.

Some of the type inferencing on intersections and unions may eliminate the need for complex rules like this (I've been meaning to check out just how far this takes you), but it's cool (scary?) to see it all done in one step like this.

I really need time to write more reasoning code in Mulgara. :-(