Monday, July 30, 2007

WiFi in the Balkans

For some time I've known that there is WiFi all over the place here (such an easy-to-remember name when compared to 802.11). However, using an iPhone really brings it home. Whenever I look something up while out of range of my usual networks (and let's face it, I wouldn't have bought an iPhone if I weren't going to be using it all the time), I get a list of anything from three to a dozen networks within range. And that doesn't count the networks that aren't broadcasting their names (though I don't think many people use this option). With the exception of the occasional commercial access system (give us your credit card details and we'll let you in), all of these access points are locked.

Many people have unlimited, or virtually unlimited, high-speed internet access, and they're all attaching wireless gateways to it. These access points overlap tremendously in range, interfering with each other and slowing each other down. This seems like massive duplication to me. Add to this the fact that most of these networks spend the majority of their time idle, and the pointlessness of the situation is even more frustrating.

I'm not advocating grid networking (I'm skeptical that the technology has the algorithms to efficiently route the massive amount of data it would need to deal with). However, it would seem that if the network were configured such that access point owners could open up their access points and let everyone on, then everyone would benefit. Some points would get more traffic than others, but overall it should even out. Coming from this perspective I can understand why so many cities have looked at providing this service, and why Google decided to roll it out in Mountain View.

Many of the advantages are obvious. More efficient usage of the airwaves (fewer mid-air packet collisions), ubiquitous urban access, and less infrastructure cost to the community as a whole.

Unfortunately, I can see the downside too. It would have to be paid for by the community, rather than the individual. The total cost would be much less than is being paid now (by each individual, with their own access point and their own internet connection), but the money still has to come from somewhere. It may not be much in a city budget, but there are always those who don't feel they should pay extra taxes for services they don't use. I disagree with this view, but my opinion carries no weight with voters, and by extension, none with politicians.

I can also see the authorities having a hissy fit over it. It's trivial to use the internet anonymously (3 coffee shops within a block of here have free WiFi - not to mention more technical solutions), but the ignorance or laziness of most people still allows the police, and others, to track down people of interest. The fact that this kind of tracking can be circumvented, or even redirected to someone innocent, is of little consequence here. Those who want to find people engaging in certain activities on the internet would not want to allow universal anonymous access, especially in this age of post-9/11 paranoia. Authorized (and identified) access is not really feasible in this situation, as it would be nearly impossible to roll out and enforce, and easily circumvented. So the easy solution is just to prevent people from having ubiquitous community-sponsored WiFi.

The legal framework for some of these restrictions is already being set up in some jurisdictions. Most of the concerns at the moment are about accessing private (sometimes download-limited) networks, but as those concerns are removed by the promise of ubiquitous "free" access, other reasons will be cited.

Even more influential than law enforcement (in this country) are the network providers. These companies have already tried to prevent cities from rolling out ubiquitous WiFi. They are obviously scared it will threaten their business model. I don't really care too much, as they are already being paid a lot of money for underutilized service (all those redundant lines not being used to capacity), and they abuse their market position in many other ways as well. Like many other large companies, they are unwilling to try to keep up with their market, preferring to shape the market to their own desires. This works in the short term, but history shows it is doomed to failure in the long run.

In the meantime, I'll continue to use EDGE on my iPhone, and wish that my previous phone hadn't died before Apple brought out a model that included 3G.

I'm struggling to stay awake while I type. Does it show?

Sunday, July 29, 2007

Conversations

I've just spent a week working in Novato, CA. While I didn't get much programming done, I did manage a few very productive conversations. I spent the whole week working with Alan, who is interested in Mulgara (for various reasons), and on Wednesday night I finally got to meet with Peter from Radar Networks.

While describing the structure of Mulgara, and particularly the string pool, Alan had a number of astute observations. First of all, our 64 bit gNodes don't have a full 64 bit address space to work in, since each node ID is multiplied by the size of an index entry to get a file offset. This isn't an issue in terms of address space (we'd have to be allocating thousands of nodes a second for decades for this to be a problem), but it does mean the top few bits of every node ID can never be used for addressing. This provides an opportunity to store more type information in the ID.
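
As a rough illustration (a sketch only: the 32-byte entry size is an assumption for the example, not Mulgara's actual constant), multiplying IDs by a power-of-two entry size means the high bits of the ID would overflow any file offset, so they are available for tagging:

// Sketch only: ENTRY_SIZE is a hypothetical index entry size, not Mulgara's real value.
public class NodeIdSpace {
    static final long ENTRY_SIZE = 32;  // bytes per index entry (assumed)
    // Multiplying by 32 shifts the ID left 5 bits, so the top 5 bits of a
    // 64 bit ID can never produce a valid file offset; they're free for type tags.
    static final int FREE_BITS = Long.numberOfTrailingZeros(ENTRY_SIZE);

    static long fileOffset(long nodeId) {
        return nodeId * ENTRY_SIZE;
    }

    public static void main(String[] args) {
        System.out.println("Largest addressable node ID: " + (Long.MAX_VALUE / ENTRY_SIZE));
        System.out.println("Bits free for type information: " + FREE_BITS);
    }
}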

This observation on the address space took on new relevance when Peter mentioned that another RDF datastore tries to store as much data as possible directly in the indexes, rather than redirecting everything (except blank nodes) through their local equivalent to the string pool. This actually makes perfect sense to me, as the Mulgara string pool (really, it's a "URI and literal" pool) is already able to fit a lot of data into less than 64 bits. We'll only fit in short strings (7 ASCII characters or fewer), but most numeric and date/time data types should fit in here easily. Even if they can't, we could still map a reduced set of values into this space (how many DateTime values really need more than, say, 58 bits?).
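
A minimal sketch of the idea (the tag bits and the 7-character limit are my own illustrative layout, not Mulgara's actual encoding): reserve a few high bits as a type tag, and pack a short ASCII string, or a small number, into the remaining bits.

// Sketch of inlining small literals into a 64-bit value.
// The tag values and bit layout are hypothetical, not Mulgara's real encoding.
public class InlineLiteral {
    static final long TAG_STRING = 1L << 60;   // assumed tag for "short ASCII string"

    // Pack up to 7 ASCII characters into the low 56 bits.
    static long packString(String s) {
        if (s.length() > 7) throw new IllegalArgumentException("too long to inline");
        long v = TAG_STRING;
        for (int i = 0; i < s.length(); i++) {
            v |= ((long) (s.charAt(i) & 0x7F)) << (i * 8);  // assumes ASCII characters
        }
        return v;
    }

    static String unpackString(long v) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 7; i++) {
            char c = (char) ((v >> (i * 8)) & 0x7F);
            if (c == 0) break;
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(unpackString(packString("mulgara")));  // prints "mulgara"
    }
}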

Indeed, I'm only considering the XA store here. When the XA2 store comes online it will have run-length encoded sets of triples in its blocks. This means that we can really stretch the length of what gets encoded in the indexes without having to divert to the string pool.

The only thing that this approach might break would be some marginal uses of the Node Type and DataType resolvers. These resolvers are usually used to test or filter for node type information, and this function would not be affected. However, both resolvers are capable of being queried for all the contents of the string pool that meet the type criteria, and this function would be compromised. I'm not too worried though, as these functions are really only useful for administrative processes (and marginally at that). The only reason I allowed for this functionality in the first place was because I could, and because it was the natural semantic extension of the required operations. Besides, some of the other changes we might make to the string pool could invalidate this style of selection of "all uses of a given type".

Permanent Strings

The biggest impediment to load speed at the moment appears to be the string pool. It's not usually a big deal, but if you start to load a lot of string data (like from enwiki) then it really shows. Sure, we can cache pretty well (for future lookups), but when you are just writing a lot of string data then this isn't helping.

The use cases I've seen for this sort of thing usually involve loading a lot of data permanently, or loading it, dropping it, and then re-loading the data in a similar form. Either way, optimizing the string pool for deletions seems pretty pointless. I'm thinking that we really need an index that lets us write strings quickly, at the expense of not being able to delete them (at least, not while the database is live).
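
A minimal sketch of what I mean by a write-once pool (the class and method names are mine, for illustration, not an actual Mulgara API): strings are appended to the end of a file and identified by the offset at which they were written, and there is simply no remove operation.

import java.io.IOException;
import java.io.RandomAccessFile;

// Sketch of an append-only string store: fast writes, no deletes while live.
// Names are hypothetical, not Mulgara's actual string pool API.
public class AppendOnlyStringPool {
    private final RandomAccessFile file;

    public AppendOnlyStringPool(String path) throws IOException {
        file = new RandomAccessFile(path, "rw");
        file.seek(file.length());            // always write at the end
    }

    /** Appends a string and returns its offset, which acts as its ID. */
    public synchronized long put(String s) throws IOException {
        long offset = file.getFilePointer();
        file.writeUTF(s);
        return offset;
    }

    /** Reads a string back by the offset returned from put(). */
    public synchronized String get(long offset) throws IOException {
        file.seek(offset);
        String s = file.readUTF();
        file.seek(file.length());            // restore the append position
        return s;
    }

    // Deliberately no delete(): reclaiming space would mean rebuilding the file offline.
}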

I'm not too concerned about over optimizing for this usage pattern, as it can just be written as an alternative string pool, with selection made in the mulgara-config.xml file. It may also make more sense to make a write-once pool the default, as it seems that most people would prefer this.

I've been discussing this write-once pool with a few people now, but it was only while talking with Alan that I realized that almost everything I've proposed is already how Lucene works. We already support Lucene as the backend for a resolver, so it wouldn't be a big step to move it up to taking on many of the string pool functions. Factor in that many of the built-in data types (short, int, character, etc.) can be put into the indexes inline, and the majority of things we need to index in the string pool end up being strings after all, which of course is what Lucene is all about. Lucene is a great system, and this kind of integration between projects is one of the big advantages of building open source.
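
For what it's worth, here is roughly how literal storage could sit on top of Lucene (a sketch against the Lucene API of that era; the field names, the index directory handling, and the idea of keying literals by an untokenized field are all my assumptions, not a worked-out design):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

// Sketch of using Lucene as a write-mostly literal store.
// Field names ("node", "literal") and the index directory are hypothetical.
public class LuceneLiteralPool {

    public static void add(String indexDir, String nodeId, String literal) throws Exception {
        // 'false' assumes the index already exists; pass true to create it the first time.
        IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), false);
        Document doc = new Document();
        doc.add(new Field("node", nodeId, Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("literal", literal, Field.Store.YES, Field.Index.UN_TOKENIZED));
        writer.addDocument(doc);
        writer.close();
    }

    public static String findNode(String indexDir, String literal) throws Exception {
        IndexSearcher searcher = new IndexSearcher(indexDir);
        Hits hits = searcher.search(new TermQuery(new Term("literal", literal)));
        String node = (hits.length() == 0) ? null : hits.doc(0).get("node");
        searcher.close();
        return node;
    }
}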

It's been a while since I wrote to the Lucene API. I ought to pull out the docs and read them again.

Saturday, July 21, 2007

Java 1.6

Mulgara currently doesn't work with Java 6 (also called JDK 1.6). I knew I needed to enable this, but have been putting it off in favour of more important features. But this release made it very plain that Mulgara is in an awkward position between two Java releases: namely JDK 1.4 and JDK 1.6.

The main problem in going from Java 1.4 to Java 5 was the change in libraries included in the JRE. Someone had taken advantage of the Apache XML libraries that used to be in there, but these had all changed packages or were no longer available. The other issue was a few incompatibilities in the Unicode implementation - some of which were the reason for introducing the CodePoint class last year (published just 8 days ago).

Going to Java 6 is relatively easy in comparison. Sun learnt their lesson about dropping in third party libraries that users may want to override with more recent versions, so this was not an issue. The only real change has been to the classes in java.sql, in which new interfaces and extensions to old interfaces have prevented a few classes from compiling. This is easily fixed with some stub methods to fulfill the interfaces, since we know these methods are not being called internally in Mulgara.
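
For example (a sketch only, not the actual Mulgara code): the Wrapper interface is one of the Java 6 additions to java.sql, and its methods can be satisfied with stubs like these, since nothing inside Mulgara ever calls them.

import java.sql.SQLException;
import java.sql.Wrapper;

// Sketch of the kind of stubs needed to compile against the Java 6 java.sql interfaces.
// The class name is hypothetical; the point is that unused methods can simply throw.
public class StubbedWrapper implements Wrapper {

    public <T> T unwrap(Class<T> iface) throws SQLException {
        throw new UnsupportedOperationException("Not implemented, and not used internally");
    }

    public boolean isWrapperFor(Class<?> iface) throws SQLException {
        throw new UnsupportedOperationException("Not implemented, and not used internally");
    }
}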

I haven't gone through everything yet (like the failing HTTP tests), but the main problem for Mulgara seems to be in passing the tests, rather than in the code itself. The first of these was a query that returned the correct data, but out of order. Now, any query whose results are to be tested should have an ORDER BY directive, so this failure should not have been allowed to happen. It's easily resolved, but it made me wonder about the change in ordering, until I got to the next test failure.

Initially, I was confused with this failure. The "bad output" contained an exception, which is usually a bad sign. But when I looked at the query which caused the exception I realized that an exception was the correct response. So how could it have passed this test for previous versions of Java? Was it a Schrödinbug?

The first step was to see what the initial committer had expected the result to be. That then led to a "Doh!" moment. The idea of this test was to specifically test that the result would generate an exception, and this was the expected output. Why then, the failure?

Upon careful inspection of the expected and actual outputs, I found the difference in the following line from the Java 6 run:
Caused by: (QueryException) org.mulgara.query.TuplesException: No such variable $k0 in tuples [$v, $p, $s] (class org.mulgara.resolver.AppendAggregateTuples)
Whereas the expected line reads:
Caused by: (QueryException) org.mulgara.query.TuplesException: No such variable $k0 in tuples [$p, $v, $s] (class org.mulgara.resolver.AppendAggregateTuples)
I immediately thought that the variables had been re-ordered due to the use of a hash table (where no ordering can be guaranteed). So I checked the classes which create this message (org.mulgara.resolver.SubqueryAnswer and org.mulgara.store.tuples.AbstractTuples). In both cases, they use a List, but I was still convinced that the list must have been originally populated by a HashSet. In fact, this also ties in with the first so-called "failure" that I saw, where data in a query was returned in a different order. Some queries will use internal structures to maintain their temporary data, and this one must have been using a Set as well.

To test this, I tried the following code in Java 5 and 6:
import java.util.HashSet;

public class Order {
    public static void main(String[] args) {
        // Insertion order is p, v, s; iteration order depends on the JRE's hashing.
        HashSet<String> s = new HashSet<String>();
        s.add("p");
        s.add("v");
        s.add("s");
        for (String x : s) System.out.print(x + " ");
        System.out.println();
    }
}
In Java 5 the output is: p s v
In Java 6 the output is: v s p

I checked on this, and the hash codes have not changed. So it looks like HashMap has changed in its storage technique.

Fix

I have two ways I can address my problem. The first is to find the map where the data gets reorganized, and either use an ordered collection type, or else use a LinkedHashSet. The latter is still a set, but it also guarantees iteration in insertion order. However, this is a patch, and a bad one at that.
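
To show the difference (a trivial demo, not the actual Mulgara code):

import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

// LinkedHashSet keeps insertion order regardless of the JVM's hashing,
// so the "p v s" ordering would survive a change of Java version.
public class OrderedSetDemo {
    public static void main(String[] args) {
        Set<String> hashed = new HashSet<String>();
        Set<String> linked = new LinkedHashSet<String>();
        for (String x : new String[] {"p", "v", "s"}) {
            hashed.add(x);
            linked.add(x);
        }
        System.out.println("HashSet:       " + hashed);  // order depends on the Java version
        System.out.println("LinkedHashSet: " + linked);  // always [p, v, s]
    }
}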

The real solution is to write some more modules for JXUnit, to make it more flexible than the equal/not-equal string comparisons it does now. This seems like a distraction from writing actual functionality, but I think it's needed, despite taking longer than the "hack" solution.

Speaking of which... DavidW just asked if I could document the existing resolvers in Mulgara 1.1 (especially the Distributed Resolver). He didn't disagree with my reasons for releasing without documentation, but he pointed out that not having it written up soon could result in a backlash. Much as I hate to admit it (since I have other things to do), he's right.

Wednesday, July 18, 2007

Release

The last week or so has not been conducive to sleep, nor blogging. After getting back I've been trying very hard to get Mulgara version 1.1 out the door. This involved cleaning code a little, but mostly getting documentation right, administering Subversion, updating web sites, looking for source code for obsolete libraries we haven't updated yet (there's something I could use some spare time for), and a hundred things I can't remember right now. And to think, I only took on the administration role to expedite my desire to continue developing for the project.

Shortly after release, there was an awkwardnessful* moment when someone found that the Ant build was broken when a command-line Subversion was unavailable. Fortunately, Ronald fixed this rapidly, and all was well.

Speaking of Ronald, I should mention that a number of other people were involved in this as well. Ronald did some fixes and added a few features. Andrae did some amazing architectural work and implementations. Amit arranged for some of Andrae and Ronald's time. David followed through where few would dare to tread by checking the legal side of our licensing and the libraries we use. Even Brian (who has been MIA until recently) did some proof reading for me. Many thanks to all of these people.

So this leaves me with the rest of my professional life to get back to.

Blogging

So much to talk about, so little time.

I've been talking with a number of people on IRC and IM about some of the ideas I've had lately. I want to blog this stuff, but with the amount of time I've had available I'm starting to wonder if I should just post the conversation logs (this was first suggested to me on Twitter by Peter). In the meantime, I hope some of the technical stuff is evolving into a better thought out plan before I get it all written.

On the other hand, this may also be working out for the best. Some of the blogs I want to write are just rants. This may or may not be justified, but it wouldn't hurt me to get a full night's sleep before indulging in such a diatribe!

In the meantime, tonight I'm just writing something to remind people I'm still here.

Relaxing

With an impending tax refund, plus some frantic saving, I've finally convinced Anne to let me get a piano. One of my lifelong dreams is to own my own baby grand (but that needs more money than I'm likely to have any time soon, plus a bigger house than I'm likely to live in!), with a backup plan of a nice upright. Unfortunately, a decent upright piano is still too expensive, and much too heavy to get up our narrow stairway.

A good compromise at this stage is a digital piano. Now, I've never been a big fan of digital pianos (ironic, coming from an electrical engineer), but I have to say that you can get some pretty nice ones now. The one we finally selected has a very realistic action (essential for proper playing) and very nice sound, achieved through dynamic sampling. This is a fancy way of saying that they recorded every key being hit with varying amounts of force, so the sound played back is a completely different one, depending on how hard you hit the key. It's a great idea, and sounds really good.

But digital piano designers have never really seemed to understand harmonics. Sigh.

All the same, it's almost like the real thing, and it's been an amazing sense of relief to be able to play it. I haven't been able to play now for a few years, and it's always been one of my most enjoyable ways of relaxing the mind. Last night was the first time I got on the new keyboard, and I didn't fire it up until I got the Mulgara fix done (and then I had to spend time assembling it). So I was tired by the time I got to sit and play. But that didn't matter. Two hours later, my right hand was cramping, the fourth and fifth fingers on my left were too exhausted to respond properly, and my back was aching. But I felt so happy that none of that seemed to matter. :-)

Tonight after just a short warmup, I struggled my way through the first movement of the Pathétique sonata. This is way beyond me at the moment, but still fun to struggle through (no, I didn't do the repeat. I could already feel my hand tightening up by the time I got to it). By the end, I was feeling pretty much like I did last night, only now my brain felt fried from trying to read such dense music. Well, it was dense for someone as out of practice as I am.

The fried-brain feeling was unusual. I got to the final page of the first movement (it's 10 pages long) and I started looking at chords and triads that I knew I was capable of understanding, and yet my brain refused to decipher them. I had to stop and look away for 5 seconds before I could look back and understand it. I've never had anything like that happen to me before. I guess I was just using paths in my brain that haven't been used for a very long time. It should get better soon.

Notebook

After 2 years (1.5 in Chicago) at work, I've finally received a computer! Admittedly, I avoided asking to start with, as the standard desktop was Windows (and I've done enough of that, thank you very much), but 7 months ago I realized I was wasting too much time on my slow old PowerBook, so I put in the request. Fortunately, by this time it was agreed that a new Mac would be appropriate, and so I've been waiting ever since.

The order just went in 2 weeks ago, and the computer was supposed to arrive yesterday. However, after 7 months of waiting I wasn't really surprised to get an email forwarded to me from Apple which said that the shipment had been delayed. I was worried it wouldn't be here until after I've left for San Francisco next week, but then it showed up at lunch time today.

Whew!

I used a firewire cable to bring over as much as I could from my previous machine, but there are a few things that aren't working. The first thing I had to do was to get the latest in Java SDKs and documentation. After that, I've been trying to get up to date on the various PPC binaries that I've compiled through Fink. It hasn't been all smooth sailing (the firewire transfer took several hours, during which time I had NO computers available to me), but it's nearly there. I have to say though... it's fast!

Looking forward to actually getting some real work done in the office now!

* 10 points to anyone who can say where this word comes from. You can prove it by providing the other word in the same category. :-)

Friday, July 13, 2007

CodePoints

I was just asked about the Code Point code that I discussed last year. This is essentially a fix for java.lang.Character, since a Unicode code point can require more than the 16 bits of a Java char.

I've put the Java file up here. There's also an example class which takes a string on the command line, and returns the characters sorted, with duplicates removed (the original task that led me to write this class). For instance, if the command line is:
$ java SortUnicode bacdaffaabzr
Then the output is:
abcdfrz
Sure, this can be done relatively easily in Java. The advantage of the CodePoint class is that it provides a useful utility interface, and allows a more functional approach. In this case, it's possible to use a single (verbose) line:
CodePoint.toString(
    CodePoint.toArray(
        new TreeSet(
            CodePoint.toList(inputStr)
        )
    )
)
There's nothing fancy here, but I hope it's useful.
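
For comparison, here is roughly what the plain-Java version looks like using the code point methods added in Java 5 (a sketch; it does the same sort-and-dedup job without the CodePoint class):

import java.util.TreeSet;

// Sort the code points of a string and drop duplicates, using only java.lang/java.util.
public class SortUnicodePlain {
    public static void main(String[] args) {
        String input = args[0];
        TreeSet<Integer> codePoints = new TreeSet<Integer>();
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            codePoints.add(cp);
            i += Character.charCount(cp);    // step over surrogate pairs correctly
        }
        StringBuilder out = new StringBuilder();
        for (int cp : codePoints) out.appendCodePoint(cp);
        System.out.println(out);
    }
}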

Thursday, July 12, 2007

OWL with Rules

Yesterday Henry asked me about using EulerSharp (by Jos De Roo) with Mulgara. I've already done the rules engine, and I'm very happy with it, so I told him I wouldn't be replacing it. Admittedly, there's room for optimization - but that's just a matter of finding the time. It runs well as it is.

Of interest though is getting together the rules for making OWL happen. EulerSharp has these in abundance. It may be worthwhile writing an interpreter that can translate them for the Mulgara rules engine.

There are the obvious rules like:
{?P @has owl:inverseOf ?Q. ?S ?P ?O} => {?O ?Q ?S}.
But the real value would be in rules like:
{?J @has owl:intersectionOf ?L.
?L :item ?I1.
?I1 owl:onProperty ?P;
owl:someValuesFrom ?S.
?S owl:onProperty ?Q;
owl:minCardinality ?N.
?N math:equalTo 2.
?L :item ?I2.
?I2 owl:onProperty ?P;
owl:allValuesFrom ?A.
?A @has owl:unionOf ?K.
?K :item ?C,
?V.
?V owl:onProperty ?Q;
owl:maxCardinality ?M.
?M math:equalTo 1.
?L :item ?I3.
?I3 owl:onProperty ?P;
owl:allValuesFrom ?D.
?C owl:disjointWith ?D}
=> {?J owl:equivalentClass owl:Nothing}.
This is a convoluted way of testing whether a class is an intersection of restrictions with a maximum cardinality of one and a minimum cardinality of two, where the other possibilities for class membership (via unions, etc.) are all eliminated.

Some of the type inferencing on intersections and unions may eliminate the need for complex rules like this (I've been meaning to check out just how far this takes you), but it's cool (scary?) to see it all done in one step like this.

I really need time to write more reasoning code in Mulgara. :-(

Wednesday, July 11, 2007

Mulgara Nearly There

After my week off, I got back to find that the release candidate had gone fine, and no one had bugs to report. The only suggestions were some documentation shortcomings.

I fixed those parts of the documentation which were really wrong (like the fact that we only run on Java 1.5 at the moment), but decided to leave alone the documentation for the new resolvers. This is because of the time required to write this, and also because none of the existing resolvers are documented anyway! This is a problem for all resolvers, and needs fixing, but since we put 1.0.0 out without this documentation, I figure it won't kill 1.1.0 to be missing it either.

Consequently, the release is ready, and all the files are in place. I was prepared to announce it, but then David got back from vacation and offered to do a legal review. I can't afford to turn him down on this, so it's sitting and waiting until I hear back from him. Fortunately, he said he can have it done by the end of the weekend.

I'm looking forward to getting some new functionality in Mulgara, so I'll be relieved when this is out of the way.

Certificates

When I first set up https on Mulgara (for Subversion access) I used a self-signed certificate so I could get it up and running. Since then, I've looked at getting a real certificate from an appropriate authority, but I'm not sure where to go for it.

There are a couple of free options (my favorite being CAcert), but I don't know of any that are included by default in any browsers. This makes them little better than self-signed certificates - which aren't worth the bits they're printed on.

So, do I fork over some money for a Thawte, VeriSign or similar certificate? Or is there a better option for a project that is really just my hobby? I already pay for the domain and I keep a server solely for Mulgara tests, so it would be nice to keep the certificate cheap.