Saturday, April 24, 2010

Program Modeling


The rise of Model Driven Architecture (MDA) in development has created a slew of modeling frameworks that can be used as tools for this design process. While some MDA tools involve a static design and development process, others seek to provide an interactive model that allows developers to work with the model at runtime. Since an RDF store allows models developed in RDFS and OWL to be read and updated at runtime, it seems natural to use the modeling capabilities of RDFS and OWL to provide yet another framework for design and development. RDF also has the unusual characteristic of seamlessly mixing instance data with model data (the so-called "ABox" and "TBox"), promising a system that allows both dynamic modeling and persistent data in a single integrated system. However, there appear to be some common pitfalls that developers fall prey to, which make this approach less useful than it might otherwise be.

For good or ill, two of the most common languages used on large scale projects today are Java and C#. Java in particular also has good penetration on the web, though for smaller projects more modern languages, such as Python or Ruby, are more often deployed. There are lots of reasons for Java and C# to be so popular on large projects: they are widely known, so it can be easy to assemble a team around them; both the JVM and .Net engines have demonstrated substantial benefits in portability, memory management and optimization through JIT compiling; and they have established a reputation for stability over their histories. Being a fan of functional programming and modern languages, I often find Java to be frustrating, but these strengths often bring me back to Java again, despite its shortcomings. Consequently, it is usually with Java or C# in mind that MDA projects start out trying to use OWL modeling.

On a related note, enterprise frameworks regularly make use of Hibernate to store and retrieve instance data using a relational database (RDBMS). Hibernate maps object definitions to an equivalent representation in a database table using an Object-Relational Mapping (ORM). While not a formal MDA modeling paradigm, a relational database schema forms a model, in the same way that UML or MOF does (only less expressive). So while an ORM is not a tool for MDA, it nevertheless represents the code of a project in the form of a model, with instance data that is interactive at runtime.
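As a rough illustration of that ORM style (the entity and its fields below are my own invention, not taken from any particular project), Hibernate typically ties a class to a table with JPA annotations, and the class definition fixes the schema at compile time:

    import javax.persistence.Column;
    import javax.persistence.Entity;
    import javax.persistence.Id;
    import javax.persistence.Table;

    // Each instance of this class maps to a row in the "document" table,
    // and each annotated field maps to a column.
    @Entity
    @Table(name = "document")
    public class Document {
      @Id
      private Long id;              // primary key column

      @Column(name = "title")
      private String title;         // the "title" column

      // Hibernate expects a no-argument constructor and accessors.
      public Document() { }
      public Long getId() { return id; }
      public String getTitle() { return title; }
      public void setTitle(String title) { this.title = title; }
    }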

Unfortunately, the ORM approach offers a very weak form of modeling, and it has no capability to update dynamically at runtime. Several developers have looked at this and reasoned that perhaps these problems could be solved by modeling in RDF instead. After all, an RDF store allows a model to be updated as easily as the data, and the expressivity of OWL is far greater than that of the typical RDBMS schema. To this end, we have seen a few systems which have created an Object-Triples Mapping (OTM), with some approaches demonstrating more utility than others.

Static Languages


When OTM approaches are applied to Java and C#, we typically see an RDFS schema that describes the data to be stored and retrieved. This can be just as effective as an ORM on a relational database, and has the added benefit that expansions of the class definitions can be implemented simply as extra data to be stored, rather than the whole change management demanded by an update to an RDBMS schema. An RDF store also has the benefit of being able to easily annotate the class instances being stored, though this is not reflected in the Java/C# classes and requires a new interface to interact with it. Unfortunately, the flow of modeling information through these systems is typically one-way, and does not make use of the real power being offered by RDF.

ORM in Hibernate embeds the database schema into the programming language by mapping table schemas to class descriptions in Java/C#, and table entries to instances of those classes in runtime memory. Perhaps following the example set by ORM, we often see OTM systems mapping OWL classes to Java/C# classes, and RDF instances to Java/C# instances. This mapping seems intuitive, and it has its uses, but it is also fundamentally flawed.
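To make the shape of this mapping concrete, here is a sketch using hypothetical annotations; the @RdfClass/@RdfProperty names and the example URIs are mine, not any particular library's API:

    import java.lang.annotation.ElementType;
    import java.lang.annotation.Retention;
    import java.lang.annotation.RetentionPolicy;
    import java.lang.annotation.Target;
    import java.net.URI;
    import java.util.Set;

    // Hypothetical OTM annotations, for illustration only.
    @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.TYPE)
    @interface RdfClass { String value(); }

    @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.FIELD)
    @interface RdfProperty { String value(); }

    // An OWL class mapped onto a Java class, with properties mapped onto fields.
    @RdfClass("http://example.org/schema#Person")
    class Person {
      URI id;                          // the subject URI of the instance

      @RdfProperty("http://example.org/schema#name")
      String name;                     // a known datatype property

      @RdfProperty("http://example.org/schema#knows")
      Set<Person> knows;               // a known object property

      // Any predicate not declared here has nowhere to go: the class is fixed
      // at compile time, so new statements in the model cannot be reflected
      // back into the program.
    }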

The principal issue with OTM systems that attempt to embed themselves in the language is that static languages (like the popular Java and C#) are unable to deal with arbitrary RDF. RDF and OWL work on an Open World Assumption, meaning that there may well be more of the model that the system is not yet aware of, and should be capable of taking into consideration. However, static languages are only able to update class definitions outside of runtime, meaning that they cannot accept new modeling information during runtime. It is possible to define a new class at runtime using a bytecode editing library, but then the class may only be accessed through meta-programming interfaces like reflection, defeating the purpose of the embedding in the first place. This is what is meant by the flow of modeling information being one-way: updates to the program code can be dynamically handled by a model in an RDF store, but updates to the model cannot be reflected by corresponding updates in the program.

But these programming languages are Turing complete. We ought to be able to work with dynamic modeling in triples with static languages, so how do we approach it? The solution is to abandon the notion of embedding the model into the language. These classes are not dynamically reconfigurable, and therefore they cannot change in response to model updates. Instead, the model can be represented with object structures that can be updated at runtime. Unfortunately, this no longer means that the model is modeling the program code (as desired in MDA), but it does mean that the models are now accurate, and can represent the full functionality being expressed in the RDFS/OWL model.

As an example, it is relatively easy to express an instance of a class as a Java Map, with properties as the keys, and the "object" values as the values in the map. This is exactly the same way that structures are expressed in Perl, so it should be a familiar approach to many developers. These instances should be constructed with a factory that takes a structure containing the details of an OWL class (or, more likely, that subset of OWL that is relevant to the application). In this way it is possible to accurately represent any information found in an RDF store, regardless of foreknowledge. I can personally attest to the ease and utility of this approach, having written a version of it over two nights and then provided it to a colleague, who used it along with Rails to develop an impressive prototype ontology and data editor, complete with a rules engine, all in a single day. I expect others can cite similar experiences.
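Here is a minimal sketch of what that can look like. The ClassDescription structure and the method names are my own simplification; a real system would load the class details from the store rather than hard-coding them:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;

    // A simplified description of an OWL class: just its URI and the
    // properties the model declares for it.
    class ClassDescription {
      final String classUri;
      final Set<String> declaredProperties;
      ClassDescription(String classUri, Set<String> declaredProperties) {
        this.classUri = classUri;
        this.declaredProperties = declaredProperties;
      }
    }

    // A factory that builds instances as plain maps from property URIs to
    // values, much like a Perl hash. Nothing is fixed at compile time.
    class InstanceFactory {
      private final ClassDescription description;
      InstanceFactory(ClassDescription description) { this.description = description; }

      Map<String,Object> newInstance(String subjectUri) {
        Map<String,Object> instance = new HashMap<String,Object>();
        instance.put("rdf:about", subjectUri);
        instance.put("rdf:type", description.classUri);
        return instance;
      }

      // Properties declared in the model and properties discovered at runtime
      // are stored the same way, so unexpected data never fails to fit.
      void set(Map<String,Object> instance, String property, Object value) {
        instance.put(property, value);
      }

      // The application can still ask whether a property was part of the model.
      boolean isDeclared(String property) {
        return description.declaredProperties.contains(property);
      }
    }

The important design point is that an updated class description read from the store simply replaces the old one in the factory; no class definitions need to be regenerated, and no code needs to be redeployed.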

Really Embedding


So we've seen that static languages like C# and Java can't dynamically embed "Open" models like RDF/OWL, but are there languages that can?

In the last decade we've seen a lot of "dynamic" languages gaining popularity, and to varying extents, several of these offer that functionality. The most obvious example is Ruby, which has explicit support for opening up already defined classes in order to add new methods, or redefine existing ones. Smalltalk has meta-programming. Meta-programming isn't an explicit feature of many other languages, but so long as the language is dynamic there is often a way, as there is in both Python and Perl.

Despite the opportunity to embed models into these languages, I'm unaware of any systems which do so. It seems that the only forms of OTM I can find are in the languages which already have an ORM, and which are probably better used with that paradigm. There are numerous libraries for accessing RDF in each of the above languages, such as the Redland RDF language bindings for Ruby and Python, Rena and ActiveRDF in Ruby, RDFLib and pyrple in Python, Perl RDF... the list goes on and continues to grow. But none of the libraries I know of perform any kind of language embedding in a dynamic language. My knowledge in this space is not exhaustive, but the lack of obvious candidates tells a story on its own.

Is this indicative of dynamic languages not needing the same kind of modeling storage that static languages seem to require? Java and C# are often used with Hibernate and similar systems in large-scale commercial applications with large development teams, while dynamic languages are often used by individuals or small groups to quickly put together useful systems aimed at a very different target market. But as commercial acceptance of dynamic languages develops further, perhaps this kind of modeling will prove useful in the future. In fact, a good modeling library like this could well show a Java team struggling with their own version of an OTM just what they've been missing in their closed world.

Topaz


I wrote this essay partly out of frustration with a system I've worked with called Topaz. Topaz is a very clever piece of OTM code, written for Java, and built over my own project, Mulgara. However, Topaz suffers from all of the closed world problems I outlined above, without any apparent attempt to mitigate them by taking advantage of the RDF, such as by reading extra annotations. It has been used by the Public Library of Science for their data persistence, but they have been unhappy with it, and it will soon be replaced.

While performance in Mulgara (something I'm working on), in Topaz's use of Mulgara, and in Topaz itself has been an issue, I believe that a deeper problem lies in the use of a dynamic system to represent static data. My knowledge of Topaz has me wondering why the system didn't simply choose to use Hibernate. I'm sure the use of RDF and OWL provided some functionality that isn't easily accomplished with Hibernate, but I don't see the benefits being strong enough to make it worth the switch to a dynamic model.

For my money, I'd either adopt the standard static approach that so many systems already employ to great effect, or go the whole hog and design an OTM system that is truly dynamic and open.

1 comment:

Anonymous said...

Your conclusion with Topaz is the same as what occurred with twine.com. We had actually built an ORM that embraced the open world, but the model employed by the application was static, rendering those capabilities useless. It also meant that many performance tweaks that could have been employed in a known-to-be-static model, versus a de facto static model, weren't readily available.