Friday, January 26, 2007

Time Bugs

A short while ago I received a bug report for Mulgara, where certain Date/Time stamps were being randomly changed. I dreaded to think what this could be, and avoided it in the hope that the reporter (DavidM - but not the DavidM who wrote the storage layer!) or someone actively working in the storage layer (like Andrae) would find the problem.

Another report yesterday made me realize that DavidM didn't know where in the system the problem was to be found. Rather than simply tell him to look at SPObject and SPDateTimeImpl, I decided to have a look myself. After all, I know this area reasonably well.

The latest bug report claims that whenever the following timestamp is entered:
  April 26, 1981, 02:56:00
Then the system always returns:
  April 26, 1981, 03:56:00
(one hour later)

First, I ran the test queries and confirmed that the problem occurs for me too, and that it is definitely the string pool causing the problem. To help me narrow things down more quickly, I copied out some of the relevant code from the string pool so I could play with SPDateTimeImpl in an isolated environment.

I quickly realized that the problem was occurring when a third-party library was parsing XSD dates. So I reported this to the developers list, and stopped looking. There were some alternative library suggestions, and Brian said he would fix it.

Third Party F/OSS

It was still bothering me that this library was failing for us. Given that it was open source software, I decided that I could at least report the problem to the developers. I don't know about taking the time to debug the code, but at least I could give them a heads-up.

(I won't mention the name of the library, because I don't want anyone to think there is anything wrong with their library. Read on.)

So I wrote an example piece of code that parses one of these bad date/times, and sent it to the developers of this library.

A couple of hours later a developer wrote back to tell me that the library works correctly for him, and that the problem may be related to timezone issues. I'm in CST and he is in BST. He gave me a test case, and asked for some details of my system.

So I ran the test, and discovered 2 things. First, the time was still being printed incorrectly, though it worked for him. Second, the tests that he wrote all passed.

Initially I thought that the tests must have been written poorly, but on inspection I discovered that they were perfectly correct. In this case, he was comparing the results of the library's parsed date/time with a Calendar object from the Sun JDK. Of course, since the printed data was incorrect, I expected it to compare incorrectly to that Calendar object. But this wasn't happening.

The next step was to work out why this was happening, so I asked the third-party library to print its internal long value, and compare it to the internal long value from the Calendar. But the Calendar was also giving the wrong value. That's when I realized something was screwy with the JDK, and not the third-party library.


A quick test class shows the problem:
import java.util.Calendar;

public class TimeTest {
  public static void main(String[] args) {
    Calendar cal = Calendar.getInstance();
    cal.clear();  // zero the seconds and milliseconds picked up from getInstance()
    cal.set(1981, Calendar.APRIL, 26, 2, 56);
    System.out.println("Time @2:56 in millis: " + cal.getTimeInMillis());
    cal.set(1981, Calendar.APRIL, 26, 3, 56);
    System.out.println("Time @3:56 in millis: " + cal.getTimeInMillis());
  }
}
If you live in Australia or Great Britain, then this code will work perfectly. For instance, running this in Australia will give you:
Time @2:56 in millis: 357065760000
Time @3:56 in millis: 357069360000
Note that there is a time difference between 2:56 and 3:56 of 3600000 milliseconds.

However, if you happen to live in the US Central timezone, then you will get this:
Time @2:56 in millis: 357123360000
Time @3:56 in millis: 357123360000
Note that these numbers are identical! They both represent 3:56am on that day.

I tried this with Java 1.4, 1.5 and 1.6 on my Mac. I also tried with Java 1.5 on x86/Linux and Windows. All of these configurations return identical results.

Could Sun have created a problem here? Why does it only affect timezones in the USA?

At this point it was getting late, and I was startled at what I was seeing. Andrae was online, so I showed him. I was too tired to look in the source for java.util.Calendar and java.util.GregorianCalendar, so he offered to look for me. But rather than letting me go to bed, he insisted I should blog about it so that it was written up somewhere.

Fortunately, it takes a little while for me to write a blog entry. This allowed Andrae time to discover what was happening. It comes down to the following:

The Uniform Time Act of 1966 (15 U.S. Code Section 260a) [see law], signed into Public Law 89-387 on April 12, 1966, by President Lyndon Johnson, created Daylight Saving Time to begin on the last Sunday of April and to end on the last Sunday of October. Any State that wanted to be exempt from Daylight Saving Time could do so by passing a state law.

According to comments in java.util.GregorianCalendar, the hour sequence on that night is:
  12 Std -> 1 Std -> 3 Dst -> 4 Dst
This means that there was no such time as 02:56 on that morning. So the Mulgara date/time bug is not a bug at all, but required behavior.
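The skipped hour can even be detected directly: set the wall-clock time on a lenient Calendar, and check whether the hour survives the round trip. This is just a sketch (the class and method names are my own invention, and it assumes `America/Chicago` for US Central time):

```java
import java.util.Calendar;
import java.util.TimeZone;

public class SkippedHourCheck {
    /** Returns true if the given local wall-clock time does not exist
        in the given zone, i.e. it falls in a DST spring-forward gap. */
    static boolean isSkipped(TimeZone tz, int year, int month, int day,
                             int hour, int minute) {
        Calendar cal = Calendar.getInstance(tz);
        cal.clear();
        cal.set(year, month, day, hour, minute, 0);
        // A lenient Calendar silently normalizes a nonexistent time forward,
        // so the hour changes if (and only if) the time was in the gap.
        return cal.get(Calendar.HOUR_OF_DAY) != hour;
    }

    public static void main(String[] args) {
        TimeZone central = TimeZone.getTimeZone("America/Chicago");
        System.out.println(isSkipped(central, 1981, Calendar.APRIL, 26, 2, 56)); // true
        System.out.println(isSkipped(central, 1981, Calendar.APRIL, 26, 3, 56)); // false
    }
}
```

On the 1981 transition date, 02:56 fails the round trip (the calendar normalizes it to 03:56), while 03:56 passes.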


The Mulgara behavior may be correct from one perspective, but it can still be troublesome. A time like this can be supplied from a non-US source, but it cannot be entered into a US system. This demonstrates that Mulgara date/times do not support timezones, even though they are described in the XSD specification that they implement.

I'm thinking that the default behavior in Mulgara should be to assume the local timezone (as it does now), but to allow for explicit timezones as well. That way any ambiguity can be removed.

Date/times on the input should allow the timezone to be included, as per the spec. I was disappointed to learn that we don't support that at the moment, but it should be easy enough to add the option to the parser.
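As a sketch of what explicit-timezone input buys us, the standard javax.xml.datatype API (in Java 5 and later) already parses XSD lexical forms with a zone offset. This isn't Mulgara's parser, just an illustration that the lexical form pins down an unambiguous instant:

```java
import javax.xml.datatype.DatatypeFactory;
import javax.xml.datatype.XMLGregorianCalendar;

public class XsdParse {
    public static void main(String[] args) throws Exception {
        DatatypeFactory df = DatatypeFactory.newInstance();
        // The -06:00 offset pins this to 08:56 UTC, no matter what
        // timezone the parsing machine happens to be in.
        XMLGregorianCalendar c =
            df.newXMLGregorianCalendar("1981-04-26T02:56:00-06:00");
        System.out.println(c.toGregorianCalendar().getTimeInMillis()); // 357123360000
    }
}
```

Note that this is the same value the US Central run printed above, since a lenient Calendar interprets the nonexistent 02:56 as standard time (-06:00).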

That leaves the issue of the output. At the moment, a date/time is converted to the client's local timezone, but printed without any timezone information. I'd rather not change the default behavior, so I think we should introduce an option to print the timezone. Strictly speaking, the zone chosen won't matter, as the times are converted according to whichever timezone is in use, but it would still be nice to be able to specify the exact zone. The time is stored relative to UTC, and conversion to a local time only matters when the value is presented to a person, so we only need to worry about it at the client end. The infrastructure doesn't exist to store a timezone at the server anyway, so the client is the only feasible place to make the conversion.
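For the output side, zone-explicit printing can be sketched with SimpleDateFormat (the class name and format pattern here are hypothetical, just one possible XSD-ish rendering):

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class ZonedOutput {
    public static void main(String[] args) {
        Date d = new Date(357123360000L); // 1981-04-26T08:56:00Z
        // XSD-style lexical form with the zone offset appended.
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ssZ");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        System.out.println(fmt.format(d)); // 1981-04-26T08:56:00+0000
        fmt.setTimeZone(TimeZone.getTimeZone("America/Chicago"));
        System.out.println(fmt.format(d)); // 1981-04-26T03:56:00-0500
    }
}
```

The same instant prints differently per zone, but always unambiguously once the offset is included.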

I expect to put in a system property to control the output format in a few days. I have a lot of travel ahead of me tomorrow, so perhaps I'll get the time then.

Wednesday, January 17, 2007

Object Interfaces

Last year I wrote some code for work for managing objects in Mulgara. It started out as a quick hack, but quickly extended into something properly structured and significant. Unfortunately, most of my hours at work seem to be very inefficient, but I spent many long evenings on it, and got something quite useful going in just 2 weeks.

The idea was to define objects using standard RDFS, and to store instances of those objects all in the same model. This is similar to an object database, or Castor, or even Hibernate, but the object definitions are more malleable, and the instances need not be complete, and can even have fields filled with inferencing.

I thought the result was reasonably compelling, and it enabled someone else to build a nice little Rails application that we used a few times in a demo. Unfortunately, since then we've gone in another direction, essentially orphaning the code.

I'd love to take this code and run with it, but I did write it for work. Even though the majority of the development time was after hours, and the idea was my own, I still spent some work time on it, and it was definitely for a work project. But that's OK. Now I get to implement a Second System. I'll just have to be careful not to overdo it.

Interface Options

One of the biggest problems with objects and object definitions in the previous system was the manual detail needed to build simple things. Every field had to be defined up front, including the name and datatype (including lists). There are also optional attributes such as the field being a key, required, and even transitive. These were encoded using owl:InverseFunctionalProperty, owl:minCardinality and owl:TransitiveProperty, which left the door open to start doing some more interesting things with OWL.

Defining all these fields made for a powerful system, but hardly a friendly one. I'd like to keep an interface like this, but also allow for more natural creation of objects.

The original style of interface was inspired by Perl. In this interface, object definitions use a simple map, where each field name is mapped to its type data (this includes the optional modifiers). I've also added in a couple of methods for finding fields via the properties. The object instances are similar, in that the field names map to the data. It's easy, and it works well, but it involves a lot of messy calls to Map.get() and Map.put().
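A minimal sketch of the map style, with invented field names and a toy string encoding for the type-plus-modifier data:

```java
import java.util.HashMap;
import java.util.Map;

public class MapStyleDemo {
    /** Hypothetical object definition: field name -> datatype plus modifiers. */
    static Map<String, String> personDefinition() {
        Map<String, String> def = new HashMap<String, String>();
        def.put("name", "xsd:string");
        def.put("email", "xsd:string; key");   // encoded as owl:InverseFunctionalProperty
        def.put("age", "xsd:int; required");   // encoded as owl:minCardinality 1
        return def;
    }

    public static void main(String[] args) {
        // An instance maps the same field names to values.
        Map<String, Object> alice = new HashMap<String, Object>();
        alice.put("name", "Alice");
        alice.put("age", 30);
        System.out.println("age type: " + personDefinition().get("age"));
    }
}
```

It works, but every field access is a stringly-typed get() or put(), which is exactly the messiness described above.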

The first alternative that comes to mind is to build objects by example. By this, I mean to create a Java class definition (sans methods, and with annotations for some more interesting features, like transitivity), and to construct the RDF via reflection. With a system like this, it would be easy to pass any java.lang.Class to create a new definition, or just pass an object and have the object and definition stored at once.

Reading objects back out of the store would take only a little more work. First of all, the name of the object definition can be tested in the class loader. If it exists, then create the object via reflection, and populate it. Otherwise, use ASM (or BCEL, or whatever appeals most to me at the time) to create a new class definition, and hand it off to the class loader to build in the same way as above. I'm still figuring out the bytecode for annotations, but ASM seems to handle them just fine.
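The "by example" direction can be sketched with plain reflection. This is not the real code, just an illustration of turning an object's fields into string triples (the `ex:` prefix and all names are made up):

```java
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;

public class ReflectToRdf {
    // Example class acting as an object definition "by example".
    static class Person {
        String name = "Alice";
        int age = 30;
    }

    /** Emits one (subject, predicate, object) string triple per field. */
    static List<String[]> toTriples(Object o, String subjectUri) throws Exception {
        List<String[]> triples = new ArrayList<String[]>();
        for (Field f : o.getClass().getDeclaredFields()) {
            f.setAccessible(true);
            // The field name becomes the predicate; the value becomes the literal.
            triples.add(new String[] { subjectUri, "ex:" + f.getName(),
                                       String.valueOf(f.get(o)) });
        }
        return triples;
    }

    public static void main(String[] args) throws Exception {
        for (String[] t : toTriples(new Person(), "ex:alice")) {
            System.out.println(t[0] + " " + t[1] + " " + t[2]);
        }
    }
}
```

The same reflective walk over `Class` (rather than an instance) would yield the definition, with annotations supplying the extras like transitivity.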

(If I wanted to get really fancy, I could even store the methods for a class by storing the AST in RDF. This has been something I've wanted to do for a while, but it really belongs in another project, and including it at this stage really sounds like an example of the second system effect.)

The main problem with this approach is that it requires the classes of stored objects to be statically defined up front (you don't want your users resorting to a bytecode manipulator to build their classes dynamically). That makes the system similar to Hibernate, when it should be much more dynamic.

A simpler alternative might be to allow object definitions with a simple syntax, which I parse as a string. That kind of appeals, since it makes the API easy for the developer, while passing the hard work into the engine, where it belongs.

For the time being, it may be better to stick to the map-style implementation, and add features only once I have it all going again in an open source way.


Implementing this the first time around showed up some of the difficulties in actually building something like this.

One important problem was a difficulty in querying all the data needed to construct objects in an atomic way. I had hoped that subqueries could solve this, but I didn't find a way through it. If it isn't atomic, then a structure could be inconsistent if someone were modifying the data at the same time.

So far my code has all been "client side", but the requirement of atomicity has me wondering if I need to move to a server side API. I should talk with Andrae about transactions.

There is also the question of updating the data. This is an important question, while at the same time it is a non-issue for RDF.

RDF asserts simple statements of subject-predicate-object. A statement either exists, or it doesn't. If I have a statement like:
  [ S P1 O ]
and I want to change the predicate so that the statement becomes:
  [ S P2 O ]
then I have not changed this statement at all! Instead I have removed the first statement and inserted a new one. So RDF only handles assertions and denials, nothing else. This keeps everything simple.
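As a toy illustration of this, "changing" a predicate against a set of triples is literally a removal followed by an addition:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TripleUpdate {
    /** "Changing" a statement is really a denial plus a new assertion. */
    static void changePredicate(Set<List<String>> store,
                                String s, String oldP, String newP, String o) {
        store.remove(Arrays.asList(s, oldP, o)); // deny the old statement
        store.add(Arrays.asList(s, newP, o));    // assert the new one
    }

    public static void main(String[] args) {
        Set<List<String>> store = new HashSet<List<String>>();
        store.add(Arrays.asList("ex:s", "ex:p1", "ex:o"));
        changePredicate(store, "ex:s", "ex:p1", "ex:p2", "ex:o");
        System.out.println(store.contains(Arrays.asList("ex:s", "ex:p2", "ex:o"))); // true
    }
}
```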

However, any structures built on RDF (such as RDFS) do require the ability to modify them. I suppose that it is possible to naïvely remove the whole structure, and insert a new one, but this is both inefficient, and possibly disastrous for any links to that object. However, keeping track of object deltas is not very easy. I also have concerns (that I haven't addressed yet) of how to ensure any changes are compatible with related structures.

Open World
Another problem is that the real world wants object structures to be in a "closed world" definition. This isn't the time to justify this assertion, but there are a lot of practical reasons for it. To date I've taken a couple of liberties with RDFS, where I've required that fields in an instance must have type definitions in the object definition, but this is not RDF. The closed world is certainly more practical for computer applications, but I'm thinking I should explore an open world interface, while still keeping the API practical.

Datatype Properties
I also ran into an unusual issue with strings.

While I could have used the org.mulgara.query.Query interface (like I know Andrae does) I have chosen to use iTQL instead. There were several reasons, but the most compelling at the time was the need to create the code quickly and to debug easily.

However, I have to say that I hate working with strings. They're messy, prone to typos, and reek like magic numbers. That's a problem with iTQL, where all the queries are constructed from strings. To counteract this effect, I created a series of classes and enumerations which did all the iTQL building for me. The irony was that I ended up with a similar looking interface to using org.mulgara.query.Query directly. I like to think that I gained a few important features though. :-)

One part of the code passes all subjects, predicates and objects to a method to get the appropriate iTQL representation. For URIs this is simply the URI wrapped in angle brackets. For blank nodes during insertion, this becomes a variable. And for literals, the data is converted to a string, placed between single quotes, and followed by the datatype. It even spots the difference between a URI and a URI Literal.

Reconstructing data from a literal is also straightforward, using an enumeration which maps the XSD datatypes back to the required Java constructor.
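A sketch of the kind of conversion described (the method name and the datatype handling here are invented for illustration):

```java
import java.net.URI;

public class ITqlTerm {
    /** Converts an RDF node into its iTQL lexical representation. */
    static String toITql(Object node) {
        if (node instanceof URI) {
            return "<" + node + ">";  // URIs go in angle brackets
        }
        // Literals are quoted and followed by their XSD datatype.
        if (node instanceof Integer) {
            return "'" + node + "'^^<http://www.w3.org/2001/XMLSchema#int>";
        }
        return "'" + node + "'^^<http://www.w3.org/2001/XMLSchema#string>";
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toITql(new URI("http://example.org/foo")));
        System.out.println(toITql(42)); // '42'^^<http://www.w3.org/2001/XMLSchema#int>
    }
}
```

The real code would dispatch on many more types (dates, doubles, blank nodes becoming variables), but the shape is the same.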

The problem is that all strings end up looking like this:
  'A string'^^<http://www.w3.org/2001/XMLSchema#string>

Now strictly speaking, this is right. Unfortunately no one ever writes RDF data that uses typed literals for strings. It appears that untyped literals are always used for strings, and typed literals for anything else. Just to be clear, Mulgara considers a typed string to be distinct from the same value in an untyped string, making strings incompatible between imported RDF and the data I construct.

For the moment I've left the string datatype where I found it, as I haven't tried to integrate data from other sources yet. However, I will probably need to make an exception for strings, where they have the datatype stripped off during a write, and missing datatypes inferred as strings during a read.
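Stripping the datatype on write could be as simple as this sketch (the constant and method names are hypothetical):

```java
public class StringDatatype {
    static final String XSD_STRING = "^^<http://www.w3.org/2001/XMLSchema#string>";

    /** On write: drop an explicit xsd:string datatype so the literal
        matches the untyped form that RDF data in the wild actually uses. */
    static String normalizeForWrite(String literal) {
        return literal.endsWith(XSD_STRING)
             ? literal.substring(0, literal.length() - XSD_STRING.length())
             : literal;
    }

    public static void main(String[] args) {
        System.out.println(normalizeForWrite("'A string'" + XSD_STRING)); // 'A string'
    }
}
```

The read side would do the reverse: a literal with no datatype gets treated as an xsd:string.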

This problem did point out that these two types of strings are incompatible in Mulgara. Subsequently I've been wondering for a while whether we need to address this. Maybe I should go and look at some more RDF theory. I suspect that they are intended to be distinct, but I'm not sure I see a practical reason for it.


I've been thinking for a while that Mulgara needs an API at a higher level than RDF. I'm not talking about RDFS or OWL inferencing (which is what I'm usually discussing) but a solid way to manipulate structures at this level of abstraction. I'm even prepared to use a different abstraction to RDFS and OWL, though it makes sense to pursue these ones.

Everything I wrote here is really about object definitions and instances (MOF levels 0 and 1), rather than addressing RDFS or OWL directly. In fact, only a couple of OWL constructs are used, so I can't say that it supports OWL in any meaningful way, only that it uses a subset of OWL to represent some structural information.

I want to pursue this style of API for the time being, since many real applications want to refer directly to objects and their definitions. I think this is a very practical approach, and can even be extended to encompass most (or maybe all) of OWL. But I'm also thinking that I should consider a real honest-to-goodness OWL API as well. Perhaps something that resembles the OWL abstract syntax. After all, there are a lot of people using OWL out there, who may want direct access to OWL structures, but don't want to manipulate RDF (which can be verbose for simple OWL constructs). But before going too far down this road, I should take a closer look at others attempts at an OWL API (such as Sofa, which was integrated into Kowari at one point).


As usual, I've stayed up much too late to write this, leaving no time to proof read. Caveat emptor.

Wednesday, January 10, 2007


Here I was, hoping to get more blogging done on my "working vacation", when I suddenly discover that I'm leaving Australia tomorrow! Oh well, it was nice to get some family time in. We have more planned for today.

I got an email this morning telling me that a link to some code I wrote has gone stale in an old blog entry. Oops.

The code isn't a big deal, but if anyone wants to generate a lot of RDF based on cardinal numbers, then I've put the file up on Mulgara.

See you in the USA.

Sunday, January 07, 2007

New Look

I had to change some of the elements on this page (I was still linking to Kowari!), so while I was at it I decided to change the look a little bit.

The color change is mostly because someone anonymous complained about it. I couldn't tell if the complaint was spam, but I figured I was due for a change. The change in width is because the narrow columns have been bugging me for years. Unfortunately, I know next to no CSS, so I fiddled until I liked it a little better. Please let me know if I killed the layout in some horrible manner.

Friday, January 05, 2007

Mulgara Progress

Mulgara is going ahead nicely at the moment, thanks mostly to Andrae. He has implemented some major changes and bug fixes which were sorely needed. My only problem with any of this has been that my main support has been administrative, rather than technical.

Once upon a time I was writing TKS/Kowari/Mulgara code every day. It was very challenging and very satisfying. I also found that the whole process of blogging was useful to help keep my thoughts focussed, and keep other people in touch with what I was up to.

These days I'm involved in a lot of other people's code, and I can't blog about it at all. It also keeps me away from coding, which is very frustrating, as I don't get that rush of implementing something cool, and seeing it all the way through to completion. That leaves me to do the interesting work at night, but after a long day at work, and time with family, I don't have the motivation to put in long hours of coding. Of course, two young boys mean that my weekends are packed as well.

Any remaining time does go to Mulgara, but the less frequently I work on it, the more I have to re-learn the context of where I was the "last time I was here". Just yesterday someone sent me an email, where they included a message they received 2 months ago. The email was coherent, and apparently written by someone who really knew the issues. I found myself agreeing with a lot of it and thinking, "this guy knows more than I do". Unfortunately (or fortunately?) I soon realized that the original author was myself.

It can be difficult to make a lot of progress when you have to bring yourself back up to speed every time you look at something.

I'm not sure what to do about this lack of time for non-work activities, but I'll have to resolve it soon. I'm supposed to start back at university shortly, and without any free time available I won't be writing a thesis! That's particularly frustrating, since I've been making all sorts of progress in processing OWL recently.


So what have I been doing in Mulgara/OWL?

Some time ago I realized that I needed to get an OWL processor out there, even if it were only partly complete. Hey, I can call it OWL-Lite if I want to! The specification explicitly says that there is no list of requirements to fulfill, so I'm covered. Fortunately, there are a lot of simple OWL inferences that I can perform already, courtesy of Raphael Volz. A number of others also come to mind, given some of the operations in iTQL, so this looked like a good first step.

The problem is that most of this work is easy to code using description logic (as Raphael has done in the paper I mentioned), but converting it into the RDF format I've created for the Krule engine is tedious and prone to bugs.

I found myself wishing that someone would write a program that would parse the description logic, and convert it into Krule syntax. Even better, the dependencies could be automatically generated, based on the format of the output of one rule, and the input to another. This would avoid many of the bugs I could potentially introduce through manual transposition, and make all future rules MUCH easier to encode.

So I'm motivated by laziness (who wants to manually encode all those OWL/RDFS rules and axioms?), impatience (why should I have to find all those dependencies, when computers are better at that kind of thing?), and hubris (hey, I could write a program to do it for me, and it would be cooler than every other logic interpreter out there because it works on my database).

So it looks like I want to re-write Prolog (yes, I've even looked at an interruption function in the algorithm to permit "cut" operations, à la Prolog). That takes hubris to a whole new level for me. But the funny thing is that it would be so easy. The hard work is in the rule engine, and I've already written that!

The language isn't really Prolog, or even the simpler logic languages, like Datalog. The main difference is in the allowable syntax. For a start, I want to allow domains on the predicates (so I can write owl:sameAs(x,y), for instance). I also want to use lower case letters for variables (instead of the upper case letters required by other logic processors). But in essence it's all the same.

So it's a type of logic, and it's based on RDF. Without anything better, I decided to call it RLog (which I pronounce Arr - Log). Who knows, if I finish it, it might even take off. (OK, OK, I don't have that much hubris).

It's funny that I'm looking to implement this. A couple of years ago I realized that Kowari could form the basis of a Prolog engine, though I knew that there was a lot of coding needed before we could do it. Now I'm on the verge of having written it, and I never really intended that.

Where Am I? How Did I Get Here? And What Am I Doing In This Handbasket?

I started all of this some time ago, but for the reasons I've mentioned above, I still have some way to go.

I started out by encoding my first set of OWL rules in RLog. The language didn't need much definition, just standard predicate logic, along with variables, domained predicates, comments, and so on. So then I converted all the RDFS rules, and added in the OWL rules that I want for the first cut. This showed up some interesting cases that I had to consider.

RLog tricks

The first thing I found was in rule 4b on RDFS. Strictly speaking, it should be:
  rdfs:Resource(u) :- a(x,u).
Meaning that for any triple <x a u> the element labeled "u" is a resource (or rdfs:Resource). That's what the RDF documents say, anyway. However, this violates RDF, since u might be a literal. Literals are unable to have types like this, since it would require them to be the subject of a statement, which is illegal in RDF.

The way I've been getting around this in Krule so far has been to use a "type" model in Mulgara. The result removed all resources which were literals (via the "minus" operator). I decided that the best way to achieve this in RLog is with the following:
  rdfs:Resource(u) :- a(x,u), ~rdfs:Literal(u).
So now RLog has negation. Wonder how I'll make that work in the general case?

The next issue is a little trickier. Rule XI (Why Roman numerals? I never worked that out, except that this rule is weird) states that if a predicate URI starts with the RDF domain and an underscore, then it is a ContainerMembershipProperty.
This isn't quite right, as anything after the underscore is supposed to be a number, but it's not too bad. Sesame seems to get away with this rule.

Since Mulgara is using "magic" predicates for certain operations, it made sense to extend this into RLog:
  rdfs:ContainerMembershipProperty(i) :- i(x,y), mulgaraprefix(i,"&rdf;_").

Fortunately, none of the OWL rules needed anything fancy, though I suspect I may need to when I get deeper into subsumption.


I think that I've already discussed the various compiler-compiler options I looked at. To recap, I chose Beaver as an LALR parser, and JFlex as the lexer.

JFlex is under the GPL, which would normally be bad (since Mulgara is OSL), but fortunately they state that any code generated by JFlex can be under any license you like. I don't plan on changing or extending JFlex, so that's fine (I'd be more than happy to contribute back any changes I made).

The incompatibility of the OSL and the GPL means that I can't distribute JFlex to generate the lexing code, but I have two other options. The first is that I just provide the lexing code, and the original .lex file. That would let anyone use the code just fine, since the lexer can just be taken "as is". A developer would only need to download JFlex for themselves if they needed to change the lexing rules.

The other option is to add an Ant task to download JFlex if needed. This seems to be the better option, but I haven't done it yet.

As far as using Beaver/Flex is concerned, I've finished the parser, and can easily print out the Abstract Syntax Tree (AST). Now I need to write the code that walks over the tree, finds the dependencies, and converts the AST into Krule.

None of it seems too hard, but it's a matter of finding time. Meanwhile, the boys woke up a short time ago, so I'd better go and play "Daddy" for a while...