Tuesday, March 18, 2008

Functions

Whew! I've finally finished filter functions.

I was just about done when I had two issues show up for me. First off, I realized that each parameter of regex takes an expression that resolves to a simple literal. In other words, it is possible to calculate a different pattern and/or flag for every line <shudder/>. OK, so I wouldn't do it, but the spec says it, so I did it. Not that it was hard. It just seems obtuse.

While I'm on it, the flags for regex don't quite match the flags in Java. Granted, they're ALMOST the same, but if I want to be a stickler about this things, then it's not quite there. The most apparent difference is that the "x" character is not the same as enabling the COMMENTS flag in Java - though it's similar. In fact, in Java 5, the COMMENTS flag does not even appear as an option in the Javadoc, though a quick scan of the library source shows that it is.

Once I found small differences (which frankly I expected to find) I decided not to look for any more. The point is that I am not going to implement my own regex engine. Sure, it would be a great learning experience (I know that suffix trees get me part of the way - but I'd have to learn some more to get all of it), but it would take me months, and for no useful purpose. I'm surprised they didn't just choose a standard engine and say "use a standards-compliant regex engine, like XXX". As it is, it looks like everyone will be nearly there, but never quite make it.

The next problem was that I hadn't looked carefully enough at the definition of equal. I was mostly right, but it turns out that if you compare two literals that are different, then you don't return false: you throw a type exception. That just feels broken. Yes, I understand the semantics, but it's a perfectly common thing to do to check that two literals are the same. Having unexpected data throw an exception from a perfectly formed query might make the type theoreticians happy, but from the perspective of a software developer it looks like bad judgement.

Ironically, you CAN choose to return true for two different literals if you have a specific extension that handles direct comparisons between their types. For instance, you can check if "5"^^xsd:integer is equal to "5"^^xsd:long. Or perhaps you want to compare "5"^^temp:celsius and "41"^^temp:fahrenheit. If you want to get the same lexical form, then you use the sameterm() function, so that case is covered. But what if you want to compare two literals to have the same semantic value, and simply return false if they don't? Maybe I need to re-read this spec, because it doesn't work for me. Still, I've implemented it as asked, if it was more annoying to do so.

So now I have a lot of unit tests to write. Yes, I know the TDD purists will be out to get me, but the exact implementation and interfaces were still floating a little when I started, and besides, it is faster to write code with the tests written after. This is mostly because you don't have to change the tests if you realize you need to change the interfaces. And time is something I'm working hard against at the moment.

Filter

Andrae had a go at me for looking to make filters annotations on the constraints in the AST for the query. I didn't see a problem with this (and there is no operational difference) until Andrae pointed out that it would have a big impact on the optimizer and query re-writer, since each node can have more that one type: a filtered version and an unfiltered version.

He was suggesting that I use the conjunction code to apply filters (and the concrete syntax of SPARQL almost seems to imply that FILTER is added in as a conjunction - though this might just be to allow alternative syntaxes) but I pointed out that this will get awkward as the BOUND() function requires that variables not be guaranteed to be pre-bound. This led to a discussion of the use of BOUND(), and I was able to show that it is often used in conjunction with NOT and OPTIONAL to emulate subtraction functionality. When he saw what I meant, he was quite congratulatory of SPARQL for taking a log(n) operation and making it linear in n.
(For any non-Australians reading this.... yes, that was sarcasm)

At least this conversation made me realize that filtering the output of each Tuple would be a mistake (good thing I haven't written this yet). Instead I'll be implementing FILTER in the AST as a new constraint element that wraps another constraint (this makes it easy for the optimizer and transformer to ignore) and to create a new operation akin to MINUS that will do the work. Currently MINUS removes elements on the left that match (via variable bindings) elements on the right. The new code will remove them based on failing the FILTER test. Simple. :-)

Saturday, March 15, 2008

Writing

I've been trying to sit down and write for over a week, but each time I try I end up writing code instead. I've even fallen behind reading Slashdot. I've been getting a lot of messages from people wanting to know what happened last week, what our plans are for Mulgara, etc, but I just haven't been able to respond. That's what happens when a developer tries to work in the real world. I handle the real world, and I can handle code, but not at the same time. :-(

For the moment, I have priorities with work that I have to see to, so I'll be concentrating on technical things for a while. However, there are a few things happening with Mulgara, so I'll try to mention them as I go. In the meantime, I'm working on SPARQL queries.

SPARQL

The two main features that we're missing now are OPTIONAL and FILTER. Looking at OPTIONAL some time ago I realized that it's a hybrid between ConstraintConjunction (the inner join aspect), and ConstraintDisjunction (matches on the left side leaving unbound columns). I worked on something similar when I did ConstraintDifference a few years ago, so I know that this is easy. Hence, I put this part off until last.

In the last week or so (in between the meeting in San Francisco, and getting a nasty virus) I've been on filters. Right now I'm down to some classes to represent the operator definitions for all the functions like bound(), isIRI() and regex(). I already have the functionality implemented, but you still need to represent it in an abstract syntax if you're going to construct expressions at query time. So it's all just some boiler plate code to represent the parameters and pass the context on down to any variables that need resolving. After that, I'm on to the unit tests. In an ideal world, I'd test everything but in reality I have less time than that. Many of the functions are so similar that I'll just be testing a good sample of each of them.

Looking at the list in the SPARQL definition, you might think that there aren't too many functions at all, but you would be wrong. For a first approximation, many of the functions have to be reimplemented for each type of parameter. I've even gone to the effort of making sure that working on an <xsd:int> returns an <xsd:int> (when appropriate), and that an <xsd:short> returns an <xsd:short>. Since I was already trying to keep floating point numbers and integers apart, then this seemed to be a natural extension. Then I have to consider the types of numbers typed into the SPARQL query, literal numbers typed in to the query, and variables that get bound to numbers during processing. This raises the complexity considerably.

My first attempt had me doing largish methods that have copious "if (value instanceof ...)" statements in them. This is clunky and brittle. The moment I went to do it a second time, I decided to throw it out, and do it all with maps to functors (where are closures?!?). This actually worked well, and has the advantage of giving short and simple functions, and consistent patterns to follow in implementations. I'd have liked to use generics a little more, but they are really suited for interpreting code you are writing, rather than code that is being structured from a parser. Consequently, in one class I ended up writing a little Ruby script to write the series of functor classes I needed for arithmetic operations! Scary, I know, but it works quite well. It was either that or a series of if/then/else blocks taking me down dark passages I never want to enter.

The frustrating thing is that via autoboxing, you can write the same arithmetic over and over again, and have it do different things. For instance, the expression:
x * y
can result it totally different return types depending on whether x and y are Doubles, Floats, Integers, etc. This is common when programming in Java using the native types (like double and int) but this must be established at compile time, not when processing query. That means you want to have access to every combination of parameters at run time. This can be done with autoboxing, and defining classes with interfaces that return java.lang.Numbers. Then the code x*y can be written over and over, and it means something different each time. Java generics are nice, but they are a long way short of C++ templates, a fact especially obvious when you want to use them on native types (along with a hundred other reasons). But Generics + Autoboxing can sometimes get you some of the way.

OK, so that gave me access to each combination of parameters, but surely there's a better way to do it dynamically? Well, not in Java. The only approaches I've seen in the past either use heuristics to work out which version of arithmetic to run, or else it promotes everything into a standard type (like Double). The latter has arithmetic problems, and gives an inappropriate type for the result. The former can just be complex to read, write, and verify.

The problem comes back the CPU having different instructions for the different forms of arithmetic. A compiler has no problems selecting which one to use, but that is because it has access to the entire library of instructions. Conversely, a parser is not expected to have access to all instructions, leading to the problems I'm talking about. So you either choose a subset of instructions to work with (ie. upcast everything), or else you provide all instructions in a library, and then map the parameters into the correct instruction - either with the heuristic tree or something like a hash map.

Dynamic languages have a much easier time of it. For a start, they usually have all instructions at their disposal in the interpreter. Many (though not all) of them also simplify their numeric types to only a couple of types. Whatever they use, the poor programming schmuck writing his own interpreter (that would be me) need only write x*y and let the dynamic language developer work out what he wanted. At the very least, we can emit it in a string and do an eval().

Oh well, I shouldn't complain. I have all the functions written out (via Ruby) and a hash map that lets me get what I need trivially. With the exception that there is a lot of machine generated code that looks like the same thing over and over, the whole system comes down to just a few lines of easily verifiable code - which is what I like to see. Following the code path you'll see that any kind of operation just goes through a few steps and it's done.

Sunday, March 09, 2008

Review

You know you've been lax keeping up with your blog when your mother comments that you haven't updated it in a while.

Part of the reason for my silence has been due to a lot of changes going on for me lately, some of which I was obliged to keep quiet about at the time. More recently, I've been working hard on Mulgara, and when it's come to a choice between coding or blogging, then coding had a higher imperative. But today I find myself in SFO feeling to wrung out to code, so it seems like a good opportunity to play some catch up on my blog.

Talis

Way back in the middle of 2007 I was contacted by Talis who were wondering if I would be interested in working with them on semantic web systems, and possibly on Mulgara. My job at the time (with Herzum Software and the spin-off fourthcodex) was supposed to be based on Semantic Web technology, with a sizable proportion devoted to Mulgara. However, this had not happened for the 2 years I had been there, and so I was willing to consider this proposal. Also, I was getting great enjoyment and occasional inspiration from Paul Miller's Talking with Talis interviews (and even gaining an interest in libraries, courtesy of Richard Walis's productions). I'd also met Ian Davis at SemTech earlier in the year, and had noted with interest that Danny Ayers had recently made the move as well.

So in August I took a few days from work and flew to England for an interview. I was really impressed with the guys in Birmingham, both technically and personally, and had a great time. While my understanding of the details has changed at various times, it seems that Talis have an approach of investing in Semantic Web technology without an requirement of immediate return. They are also providing support to a growing Semantic Web community with the expectation that this will lead to a data infrastructure on which they can layer semantic applications at a higher level than is possible today. To me this seems to be both very forward thinking, as well as operating for the mutual benefit of themselves and the community at large. As an Australian I also found that the similarities in culture with the British gave me a level of comfort beyond what I usually have here in America.

Whether I would be working in semantics, or in the storage layer to enable semantic work by others, this really seemed like a place I'd enjoy working. However, the position would be telecommuting, and I need a visa sponsor while I live here in the USA. Talis were aware of this, and though they said they were in the process of setting up a legal entity over here, the delays this brought about have led to events overtaking this opportunity.

That said, I'm still trying to keep channels open with everyone there, and I'm hoping that I'll be able to work with them in the future, in whatever capacity that may be.

Google

Shortly before the trip to England, I found myself thinking of distributing immutable tree nodes (from Mulgara's internal storage) over a cluster, with the idea of improving scalability of speed and size for RDF storage. These thoughts led to ideas of leveraging a system like the GFS or BigTable. Hadoop is also interesting in this regard, but not as advanced or scalable as the systems at Google. With this in mind, and being particularly frustrated at work, I checked out the Google jobs page, and discovered that they had engineering positions available in Chicago. So I filled in their online forms and sent it off. Disappointingly, the next day I received a form-reply email explaining that I wasn't what they were after.

A few weeks later I met Eric Olson at Tech Cocktail. Eric was still working at Google at the time, and said that he'd mention my name. I have no idea if he did or not, but a couple of weeks later, a Google recruiter in California rang me and asked if I would be available for a phone interview. This was delayed while I went to England, and then delayed further as that recruiter left and another took on my case, but it finally happened in September. It was very strange to do an interview again, when I've conducted so many in the last couple of years. I've also managed to avoid the "normal" interview process for most of the last decade, since I have usually been interviewed or offered positions by people who already knew me, either personally or by reputation.

All the same, this interview went well, as did the next phone interview. So Google organized tickets for me to fly out to Mountain View and interview on site. I hadn't seriously considered a job with them to this point, but I thought it would be interesting to follow the process through.

Visiting the Mountain View campus was quite an experience. It is vast, and has been gradually subsuming the surrounding business district in recent years. Getting around is often done by shuttle bus, or bicycle. People bring their own bikes, but there are a number of Google bikes parked around the place, with helmets available in large bins in the lobby of each building. Not having been given a building number to go to, I started at the central building, where I was quickly spotted and assisted by a security guard. Indeed, I was very impressed at the rapid and efficient response of on-campus security, especially as they were also very helpful and courteous.

The receptionist I was directed to was also helpful, showing where I needed to go, arranging a shuttle bus, providing a visitor's badge and directions, and a fruit juice (Google have large fridges full of Naked juice in every lobby I saw. They also have more exotic flavors available than I have seen anywhere before or since).

Passing by the truck that had come to provide cheap haircuts to staff, I proceeded by a central courtyard which had a full sized Tyrannosaurus Rex skeleton (with pink flamingo in it's mouth - several of it's cousins scattered the lawn) and a large sign proclaiming that there would be a Farmers' Market there at 11am that day.

One bus trip later, I was where I needed to be, and being given a tour of the building. The variety of free coffee and other beverages was really impressive, as was the local version of Google's famous cafeterias. But the thing that really got me was seeing a projected list of Google's text searches scrolling up the wall. These are not done in real time (they would go by too fast) and have been filtered for inappropriate content (no searches for pornography, for instance), but they still served to drive home exactly where you were. This was ground zero. Those searches were resolved here.

The queries were also interesting to watch go by. There were questions on movies, Britney Spears, medical conditions, landmarks, and many questions in foreign languages, some of which were in foreign character sets, like Simplified Chinese. Watching these going by, it is immediately apparent where ideas like Google Zeitgeist came from.

I then went on to have my interviews. There were about 4 of them, with a break for lunch which I had with one of the people I'd had a phone interview with. While a few of the questions were more general, most of them were about how I'd solve programming problems, with an emphasis on doing things to a "Google level of scaling". Funnily enough, my last few years of Mulgara work were perfect for this. On a couple of occasions I even found myself describing code I had written, rather than describing an abstract answer. I also got the chance to ask more about how Google works, and what it's like to be there. I was impressed by everyone's enthusiasm for their work, and for the company culture in general. A couple of people I spoke with also had children, and while they admitted that in the past Google had not been very good at supporting people with young children, in recent years this had improved significantly. But the thing that everyone talked about the most was the "perks". These extend into areas you couldn't imagine, and they are constantly evolving. Unlike most companies who occasionally institute a perk for their staff, possibly guided by a suggestion box, Google has a department whose sole mission it is to identify and implement perks.

Finally the day came to an end, and I was able to head up to San Francisco. I had a very enjoyable evening with Peter and Trish, and the next day spent several hours having Mulgara discussions with Amit and Ronald at Topaz. I was very pleased to get in this last meeting, and had shuffled things around with Google to make sure it could happen.

As most of my friends know, a few weeks later Google made me an offer. While the base salary was simple enough, I was bemused at the complexity of the arrangements for paying bonuses, stock options, and common stock. It is the first job offer I've ever had that came with a set of equations attached. While not going into details, I will say that it was very lucrative - if you came close to meeting your goals. I hadn't really considered accepting an offer until this point, but an offer like that would make anyone seriously reconsider. Consequently I agonized over this for a couple of weeks, right up to the deadline that Google set. In the meantime, I visited the Chicago site (where I insisted I would want to work, despite being asked several times if I'd move to Mountain View), and again was impressed with their setup. In fact, I've had a few people suggest that the setup at Mountain View is getting a little out of control in some ways, but this was not an issue for Chicago at all.

I finally decided to turn Google down, and let them know as soon as I got back from Thanksgiving. I'd had advice from a few people, including some from inside of Google, who all pointed out that my work in the Semantic Web would be totally subsumed by working at Google. I had thought to do something with the "20% projects" that Google is known for, but it was pointed out that because bonuses are based on meeting (and exceeding) goals, then the option to use 20% of your time on something not related to your immediate work was often forgone. You also have to wonder how much of your bonuses, options, and common stock you'd get to see if you tried to keep a balanced lifestyle and didn't achieve your annual goals (apparently these are supposed to be set at a level that is challenging to achieve).

Another serious consideration was one I hadn't expected. Despite having signed an NDA, I learned nothing about Google that isn't already known to the public. Consequently, to an outsider it looked like the company was not doing anything really "interesting". I'm sure they are, but there was nothing inspiring about what they had to tell me. For most of the things I considered to be "cool" technology, I was told that those things were pretty much done, and the work they now do is in different areas altogether. In fact, the majority of the people I spoke to worked in AdWords and Billing. They were very enthusiastic about their work, and given the novelty of their service and the scale they have to work at, then I'm sure it's challenging and interesting work, but it didn't inspire me at all.

Most of all, I've spent my career working with people who know a lot more than I do, to my enjoyment and benefit, and yet, no one I spent time with really impressed me with their knowledge of skills. Don't get me wrong - they were all quite competent and intelligent people. But I really expect something special out of the people I work with, if they are to bring out the best in me. Now I know that Google has employed some of the brightest people in the industry, but the sheer size of the company convinced me that I'm unlikely to find myself working with those people.

For those not paying attention, these last few paragraphs are all a means of justifying to myself that I made the right choice. It wasn't an easy choice to make, since Google does seem like a cool company, the perks were huge, and the remuneration was potentially substantial. But I'm pretty sure I did the right thing, and as one friend said, he thinks it is much cooler to say that you've turned down a Google offer than to have accepted one. :-)

Fedora Commons

Coming up to Christmas, I was finally getting a chance to do some Mulgara work during office hours. This was a huge thing for me, as I had been getting more and more frustrated about it for the previous two years when I was supposed to be doing this. Then in the final days before Christmas my boss, and several others I worked with in fourthcodex, decided that they wanted to do something different in semantic technologies, and resigned. Without a team to work with, there wasn't a lot of scope for me to do semantic work any more, and I was told to stop working on Mulgara again. Sigh.

While some semantic options were being pursued, the fact remained that Herzum Software desperately needed some more senior coders, and it looked very much like I would end up on projects that were of little interest to me. A notable one here was a .Net project that would have me working on site in Pittsburgh. This was something that nobody wanted, including my family, and everyone I was working with on Mulgara.

Talis tried to help at this point (and I'm very grateful that they did), but their interim solution would have made it illegal for Anne to keep her new business running, and I couldn't do that to her. But then, Topaz and Fedora Commons came back to me with an offer to work for them (which distinct organizations, there is an administrative relationship between them, and both are contributing to the Public Library of Science). I've already written about my decision to accept this, which brings me up to today.

I've officially been working for Fedora Commons for about a month now. I've been dividing my time between the SPARQL implementation and responding to support and debugging requests. However, this week has been different. We got all the developers from Topaz and Fedora Commons together, to discuss our plans for the year, and how to manage the process. Mulgara has also been generating some more external interest again, and since we form the core of the active developers, we wanted to discuss ways in which we can work with the community, particularly developers.

Features

The most important features we are implementing in the coming year are SPARQL, multiple concurrent writers, and significantly greater scalability. We have been talking about the last one for a long time, but no one has had the time (or money) to do anything about it. This has now changed, and the work is commencing very soon now. It's been a long time in coming, so I'm quite inspired to get it done now.

Andrae was present for the meeting, and presented some very impressive results to his research on transactionality for multiple writers on an RDF graph. Not only has he demonstrated a mathematically sound foundation for this work, but he has also included an impressive level of engineering for scalability in his designs.

In the meantime, I have come up with a new scheme for indexing RDF, which appears to have significantly better complexity results than what we currently do. Fortunately, the majority of this work is orthogonal to Andrae's designs, with the consequence that the improvement to scalability will be cumulative between both redesigns. I'm pretty chuffed at this. :-) I will be writing more on the indexing shortly, but I have been under some pressure to write this up as an academic paper as well, so that may take priority over my blog.

Significantly, we had James Leigh from Aduna at the meeting as well. Aduna are the company behind the Sesame RDF store, which has been one of the big open source alternatives to Mulgara. They are interested in merging our systems to a certain extent, to the benefit of both. After hearing James out, it sounds like a really good idea (though I may end up throwing away the SPARQL parsing that I've finished - sigh again). I'm not sure when it will happen as everyone as a lot of immediate priorities to get through, but everyone has expressed support for implementing the SAIL API on Mulgara. This is very significant for us, as it will provide a host of new reasoning features, the ability for existing Sesame users to easily try Mulgara, and a SPARQL protocol interface (I'd just been working on the query language for the moment). In turn, I'm hoping that we can demonstrate these new levels of scalability and concurrency for Sesame.

A lot more came out of the meeting, but that was the crux of it. Rather than pre-empt some of the things that are still in motion, I'll let others explain their end of things.

I'm very happy to see this level of interest in Mulgara, and I'm excited to see all these new features starting to be realized at last.