Wednesday, January 17, 2007

Object Interfaces

Last year I wrote some code at work for managing objects in Mulgara. It started out as a quick hack, but soon grew into something properly structured and significant. Unfortunately, most of my hours at work seem to be very inefficient, but I spent many long evenings on it, and got something quite useful going in just two weeks.

The idea was to define objects using standard RDFS, and to store instances of those objects all in the same model. This is similar to an object database, or Castor, or even Hibernate, but the object definitions are more malleable, and the instances need not be complete, and can even have fields filled with inferencing.

I thought the result was reasonably compelling, and it enabled someone else to build a nice little Rails application that we used a few times in a demo. Unfortunately, since then we've gone in another direction, essentially orphaning the code.

I'd love to take this code and run with it, but I did write it for work. Even though the majority of the development time was after hours, and the idea was my own, I still spent some work time on it, and it was definitely for a work project. But that's OK. Now I get to implement a Second System. I'll just have to be careful not to overdo it.

Interface Options

One of the biggest problems with objects and object definitions in the previous system was the manual detail needed to build simple things. Every field had to be defined up front, including the name and datatype (including lists). There are also optional attributes such as the field being a key, required, and even transitive. These were encoded using owl:InverseFunctionalProperty, owl:minCardinality and owl:TransitiveProperty, which left the door open to start doing some more interesting things with OWL.

Defining all these fields made for a powerful system, but hardly a friendly one. I'd like to keep an interface like this, but also allow for more natural creation of objects.

The original style of interface was inspired by Perl. In this interface, object definitions use a simple map, where each field name is mapped to its type data (including the optional modifiers). I've also added a couple of methods for finding fields via their properties. The object instances are similar, in that the field names map to the data. It's easy, and it works well, but it involves a lot of messy calls to Map.get() and Map.put().
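To give a feel for the map style, here is a minimal sketch. The field names, type URIs and modifier keys are all invented for illustration; the real definitions live in RDFS, not in Java constants like these.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the map-based interface: a definition maps field names to type
// data (plus optional modifiers), and an instance maps the same names to
// values. All names here are illustrative, not the real API.
public class MapStyle {
    public static Map<String, Map<String, Object>> personDefinition() {
        Map<String, Map<String, Object>> def = new HashMap<>();

        Map<String, Object> name = new HashMap<>();
        name.put("type", "http://www.w3.org/2001/XMLSchema#string");
        name.put("required", true);   // would be encoded as owl:minCardinality
        def.put("name", name);

        Map<String, Object> email = new HashMap<>();
        email.put("type", "http://www.w3.org/2001/XMLSchema#string");
        email.put("key", true);       // would be encoded as owl:InverseFunctionalProperty
        def.put("email", email);
        return def;
    }

    public static Map<String, Object> personInstance() {
        // Instances are just as map-heavy: lots of get()/put() noise.
        Map<String, Object> person = new HashMap<>();
        person.put("name", "Paula");
        person.put("email", "paula@example.com");
        return person;
    }
}
```

The noise is obvious even in this tiny example, which is exactly the complaint above.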

The first alternative that comes to mind is to build objects by example. By this, I mean to create a Java class definition (sans methods, and with annotations for some more interesting features, like transitivity), and to construct the RDF via reflection. With a system like this, it would be easy to pass any java.lang.Class to create a new definition, or just pass an object and have the object and definition stored at once.
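A sketch of what "definition by example" might look like, assuming a runtime annotation for the interesting modifiers. The annotation, the example class, and the emitted statement strings are all my own invention here; only the reflection mechanism is real.

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;

// Sketch: walk a class's fields with reflection and emit RDFS-style
// statements. The @Transitive annotation and the statement format are
// invented for illustration.
public class ByExample {
    @Retention(RetentionPolicy.RUNTIME)
    public @interface Transitive {}

    public static class Person {
        public String name;
        @Transitive public Person ancestor;
    }

    public static List<String> define(Class<?> cls) {
        List<String> statements = new ArrayList<>();
        for (Field f : cls.getDeclaredFields()) {
            statements.add(f.getName() + " rdfs:domain " + cls.getSimpleName());
            statements.add(f.getName() + " rdfs:range " + f.getType().getSimpleName());
            if (f.isAnnotationPresent(Transitive.class)) {
                // annotations carry the optional modifiers
                statements.add(f.getName() + " rdf:type owl:TransitiveProperty");
            }
        }
        return statements;
    }
}
```

Passing any java.lang.Class to a method like this is all it would take to register a new definition.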

Reading objects back out of the store would take only a little more work. First of all, the name of the object definition can be tested in the class loader. If it exists, then create the object via reflection, and populate it. Otherwise, use ASM (or BCEL, or whatever appeals most to me at the time) to create a new class definition, and hand it off to the class loader to build in the same way as above. I'm still figuring out the bytecode for annotations, but ASM seems to handle them just fine.
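The read path described above might look like the following sketch, with the bytecode-generation branch stubbed out (the real version would hand a generated class to the class loader). The method name and field-map shape are assumptions of mine.

```java
import java.lang.reflect.Field;
import java.util.Map;

// Sketch of reconstruction: if the class loader already knows the named
// class, instantiate it reflectively and populate its public fields.
// Otherwise the real code would generate the class with ASM; that branch
// is only stubbed here.
public class Reader {
    public static Object instantiate(String className, Map<String, Object> fields)
            throws Exception {
        Class<?> cls;
        try {
            cls = Class.forName(className);  // test the class loader first
        } catch (ClassNotFoundException e) {
            throw new UnsupportedOperationException(
                "would build a new class definition with ASM here");
        }
        Object obj = cls.getDeclaredConstructor().newInstance();
        for (Map.Entry<String, Object> entry : fields.entrySet()) {
            Field f = cls.getField(entry.getKey());
            f.set(obj, entry.getValue());  // populate from the stored data
        }
        return obj;
    }
}
```

Using java.awt.Point as a stand-in for a generated class shows the populate step working against public fields.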

(If I wanted to get really fancy, I could even store the methods for a class by storing the AST in RDF. This has been something I've wanted to do for a while, but it really belongs in another project, and including it at this stage really sounds like an example of the second system effect.)

The main problem with this system is that it requires the objects to be stored to be statically defined already (you don't want your users using a bytecode manipulator to build their classes dynamically). That makes the system similar to Hibernate, when it should be much more dynamic.

A simpler alternative might be to allow object definitions with a simple syntax, which I parse as a string. That kind of appeals, since it makes the API easy for the developer, while passing the hard work into the engine, where it belongs.

For the time being, it may be better to stick to the map-style implementation, and add features only once I have it all going again in an open source way.

Pitfalls

Implementing this the first time around showed up some of the difficulties in actually building something like this.

Atomicity
One important problem was a difficulty in querying all the data needed to construct objects in an atomic way. I had hoped that subqueries could solve this, but I didn't find a way through it. If it isn't atomic, then a structure could be inconsistent if someone were modifying the data at the same time.

So far my code has all been "client side", but the requirement of atomicity has me wondering if I need to move to a server side API. I should talk with Andrae about transactions.

Updates
There is also the question of updating the data. This is an important question, while at the same time it is a non-issue for RDF.

RDF asserts simple statements of subject-predicate-object. A statement either exists, or it doesn't. If I have a statement like:
  [ S P1 O ]
and I want to change the predicate so that the statement becomes:
  [ S P2 O ]
then I have not changed this statement at all! Instead I have removed the first statement, and inserted a new one. So RDF only handles assertions and denials, nothing else. This keeps everything simple.
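The remove-then-insert semantics can be sketched over a plain set of triples. The string-list representation of a triple is just for illustration:

```java
import java.util.List;
import java.util.Set;

// Sketch: an RDF "update" is really a denial plus an assertion.
// A triple is modelled here as a three-element string list.
public class Triples {
    public static void changePredicate(Set<List<String>> store,
            String s, String oldP, String newP, String o) {
        store.remove(List.of(s, oldP, o));  // deny the old statement
        store.add(List.of(s, newP, o));     // assert the new one
    }
}
```

There is no in-place mutation anywhere: the old statement simply ceases to exist, and an unrelated new one appears.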

However, any structures built on RDF (such as RDFS) do require the ability to modify them. I suppose that it is possible to naïvely remove the whole structure and insert a new one, but this is both inefficient and possibly disastrous for any links to that object. However, keeping track of object deltas is not very easy. I also have concerns (that I haven't addressed yet) about how to ensure any changes are compatible with related structures.
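One way to avoid the naïve drop-and-reinsert is to compute a delta as two set differences: statements to deny, and statements to assert. This is a sketch of the idea, not anything I've built; statements are plain strings here.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of object deltas: given the old and new sets of statements for a
// structure, only the differences need to be denied or asserted.
public class Delta {
    public static Set<String> toRemove(Set<String> oldStmts, Set<String> newStmts) {
        Set<String> r = new HashSet<>(oldStmts);
        r.removeAll(newStmts);  // present before, absent now: deny these
        return r;
    }

    public static Set<String> toInsert(Set<String> oldStmts, Set<String> newStmts) {
        Set<String> i = new HashSet<>(newStmts);
        i.removeAll(oldStmts);  // absent before, present now: assert these
        return i;
    }
}
```

The hard part isn't this arithmetic, of course, but knowing the old state atomically, which circles back to the transaction question above.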

Open World
Another problem is that the real world wants object structures to have a "closed world" definition. This isn't the time to justify this assertion, but there are a lot of practical reasons for it. To date I've taken a couple of liberties with RDFS, where I've required that fields in an instance must have type definitions in the object definition, but this is not how RDF works. The closed world is certainly more practical for computer applications, but I'm thinking I should explore an open world interface, while still keeping the API practical.

Datatype Properties
I also ran into an unusual issue with strings.

While I could have used the org.mulgara.query.Query interface (like I know Andrae does) I have chosen to use iTQL instead. There were several reasons, but the most compelling at the time was the need to create the code quickly and to debug easily.

However, I have to say that I hate working with strings. They're messy, prone to typos, and reek like magic numbers. That's a problem with iTQL, where all the queries are constructed from strings. To counteract this effect, I created a series of classes and enumerations which did all the iTQL building for me. The irony was that I ended up with a similar looking interface to using org.mulgara.query.Query directly. I like to think that I gained a few important features though. :-)

One part of the code passes all subjects, predicates and objects to a method to get the appropriate iTQL representation. For URIs this is simply the URI wrapped in angle brackets. For blank nodes during insertion, this becomes a variable. And for literals, the data is converted to a string, placed between single quotes, and followed by the datatype. It even spots the difference between a URI and a URI Literal.
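The dispatch might look something like this sketch. The method name and the choice of Java types handled are mine; the real code covers blank nodes, URI Literals, and the full range of datatypes.

```java
import java.net.URI;

// Sketch of node-to-iTQL conversion: URIs get angle brackets, literals get
// single quotes plus an XSD datatype. Illustrative only; blank nodes and
// URI Literals are omitted.
public class Itql {
    public static String toItql(Object node) {
        if (node instanceof URI) {
            return "<" + node + ">";  // URIs wrapped in angle brackets
        }
        if (node instanceof Integer) {
            return "'" + node + "'^^<http://www.w3.org/2001/XMLSchema#int>";
        }
        // default: treat the value as a string literal
        return "'" + node + "'^^<http://www.w3.org/2001/XMLSchema#string>";
    }
}
```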

Reconstructing data from a literal is also straightforward, using an enumeration which maps the XSD datatypes back to the required Java constructor.
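The reverse mapping can be sketched as an enum from datatype URI to a parse function. Only two datatypes are shown, and the names are invented; the real enumeration covers the XSD types the store actually uses.

```java
import java.util.function.Function;

// Sketch: an enum mapping XSD datatype URIs back to the Java conversion
// needed to reconstruct the value from its lexical form.
public enum XsdType {
    INT("http://www.w3.org/2001/XMLSchema#int", Integer::valueOf),
    DOUBLE("http://www.w3.org/2001/XMLSchema#double", Double::valueOf);

    public final String uri;
    public final Function<String, Object> parser;

    XsdType(String uri, Function<String, Object> parser) {
        this.uri = uri;
        this.parser = parser;
    }

    public static Object parse(String datatypeUri, String lexical) {
        for (XsdType t : values()) {
            if (t.uri.equals(datatypeUri)) return t.parser.apply(lexical);
        }
        return lexical;  // unknown datatype: fall back to the raw string
    }
}
```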

The problem is that all strings end up looking like this:
  'A string'^^<http://www.w3.org/2001/XMLSchema#string>

Now strictly speaking, this is right. Unfortunately no one ever writes RDF data that uses typed literals for strings. It appears that untyped literals are always used for strings, and typed literals for anything else. Just to be clear, Mulgara considers a typed string to be distinct from the same value in an untyped string, making strings incompatible between imported RDF and the data I construct.

For the moment I've left the string datatype where I found it, as I haven't tried to integrate data from other sources yet. However, I will probably need to make an exception for strings, where they have the datatype stripped off during a write, and missing datatypes inferred as strings during a read.
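That exception would amount to something like this sketch: strip the datatype on the way out, and infer it on the way in. Class and method names are mine.

```java
// Sketch of the special case for strings: drop ^^xsd:string when writing,
// and treat an untyped literal as xsd:string when reading.
public class StringLiterals {
    static final String XSD_STRING = "^^<http://www.w3.org/2001/XMLSchema#string>";

    public static String onWrite(String literal) {
        return literal.endsWith(XSD_STRING)
                ? literal.substring(0, literal.length() - XSD_STRING.length())
                : literal;
    }

    public static String onRead(String literal) {
        // no ^^ datatype marker on the literal: infer xsd:string
        return literal.contains("^^") ? literal : literal + XSD_STRING;
    }
}
```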

This problem did point out that these two kinds of strings are incompatible in Mulgara, and I've been wondering for a while whether we need to address that. Maybe I should go and look at some more RDF theory. I suspect that they are intended to be distinct, but I'm not sure I see a practical reason for it.

New API

I've been thinking for a while that Mulgara needs an API at a higher level than RDF. I'm not talking about RDFS or OWL inferencing (which is what I'm usually discussing) but a solid way to manipulate structures at this level of abstraction. I'm even prepared to use a different abstraction to RDFS and OWL, though it makes sense to pursue these ones.

Everything I wrote here is really about object definitions and instances (MOF levels 0 and 1), rather than addressing RDFS or OWL directly. In fact, only a couple of OWL constructs are used, so I can't say that it supports OWL in any meaningful way, only that it uses a subset of OWL to represent some structural information.

I want to pursue this style of API for the time being, since many real applications want to refer directly to objects and their definitions. I think this is a very practical approach, and can even be extended to encompass most (or maybe all) of OWL. But I'm also thinking that I should consider a real honest-to-goodness OWL API as well. Perhaps something that resembles the OWL abstract syntax. After all, there are a lot of people using OWL out there, who may want direct access to OWL structures, but don't want to manipulate RDF (which can be verbose for simple OWL constructs). But before going too far down this road, I should take a closer look at others' attempts at an OWL API (such as Sofa, which was integrated into Kowari at one point).

Late

As usual, I've stayed up much too late to write this, leaving no time to proof read. Caveat emptor.

1 comment:

Anonymous said...

Have you had a look at ActiveRDF? If you want to expose RDF data as objects, and also do Rails integration: that's exactly what we do.

I hadn't heard of Mulgara before (I had heard of Kowari) so we have no adapter for it yet, but that should be just a couple of hours if you're interested.