Friday, February 27, 2009

More Programmable Logic

In the last post I gave a basic description of how the Krule rules engine works. I left out a number of details, but it covers the overall approach.

The most important detail I skipped was the interaction with resolvers in order to identify resources with particular properties. This includes, among other things, finding URIs in a domain, inequality comparisons for numeric literals, regular expression matching on strings, the "type" of a resource (URI, blank node, or literal), and optimized transitivity of predicates. We also manage "subtractions" in our data, in the same way that Evren Sirin described NOT two weeks ago. I should point out that Mulgara does subtractions directly, and not as a combined operation built from the OPTIONAL/FILTER/!BOUND pattern. This was introduced in 2004 (or was it 2005?). I have to say that I'm a little surprised that SPARQL never included anything to express it directly, particularly since so many people ask for it (hence the popularization of OPTIONAL/FILTER/!BOUND as a pattern) and because the SPARQL algebra already defines a function called "Diff" that is used internally.
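
As a rough sketch (the graph and property URIs here are hypothetical, and this only shows the general shape of the syntax rather than a command you can paste in), a direct subtraction in TQL looks something like:
  select $person
  from <rmi://localhost/server1#people>
  where $person <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person>
    minus $person <http://xmlns.com/foaf/0.1/mbox> $mbox
This picks out the people with no declared mailbox in a single operation, rather than matching the mailbox optionally and then filtering out the bound results.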

Anyway, these extensions are not necessary to understand Krule or RLog, but I think it's useful to know that they're there.

So now that I've described Krule, I've set the scene for describing RLog.

Krule Configuration

When I first wrote Krule, I was ultimately aiming at OWL, but I had a short-term goal of RDFS. I find that I have to take these things one step at a time, or else I never make progress. Since I knew that my rules were going to expand, I figured I should not hard-code anything into Mulgara, but should instead interpret a data structure that described the rules. That also meant I would be able to run rules for lots of systems: RDFS, SKOS, OWL, or anything else. Of course, some things would need more features than RDFS needed (e.g. both OWL and SKOS need "lists"), but my plan was to work on that iteratively.

At the time, I designed an RDF schema to describe my rules, and built the Krule engine to initialize itself from this. This works well, since the whole system is built around RDF already. I also created a new TQL command for applying rules to data:
  apply <rule_graph_uri> to <data_graph_uri> [<output_graph_uri>]
By default all of the entailed data goes into the graph the rules are being applied to, but by including the optional output graph you can send the entailed data there instead.
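
For example, with hypothetical graph names, the RDFS entailments could be kept separate from the source data like this:
  apply <rmi://localhost/server1#krule_rdfs> to <rmi://localhost/server1#data> <rmi://localhost/server1#inferred>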

This worked as planned, and I was able to build a Krule configuration graph for RDFS. Then life and work interfered and the rules engine was put on the back burner before I got to add some of the required features (like consistency checking).

Then about 18 months ago I thought I'd have a go at writing OWL entailment, at least for the part that the rules engine would support. So I set out to write a new Krule file. The complexity of the file was such that I started writing out the rules that I wanted using a kind of Prolog notation with second-order programming, in a very similar way to how Raphael Volz represented the same constructs in some of his papers. This grammar uses binary predicates to represent general triples, and unary predicates to indicate "type" statements, i.e. statements with a predicate of rdf:type. As an example, the owl:sameAs predicate indicates that if the subject of a statement is owl:sameAs another resource, then that statement can be duplicated with the other resource as the subject. This was easily expressed as:
  A(Y,Z) :- A(X,Z), owl:sameAs(X,Y).
I wrote out about 3 rules before I realized that converting these to Krule was going to be tedious and prone to error. In fact, I had unthinkingly demonstrated that I already had a language I wanted to use, and the process of translation was an easily automated task. Since the language was allowing me to describe RDF with logic, I decided to call it RLog (for RDF Logic).
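
To show the unary "type" form as well, the familiar RDFS subclass entailment (rule rdfs9) comes out looking something like this in the same notation:
  B(X) :- A(X), rdfs:subClassOf(A,B).
Here A(X) and B(X) stand in for rdf:type statements, while rdfs:subClassOf(A,B) is just an ordinary triple with variables in the subject and object positions.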

Iterations

Andrae and I had been discussing how much we disliked SableCC for generating the TQL parser for Mulgara, and so I started looking around at other parsers. The power of LALR parsers appealed to me, and so I went with Beaver. Along with the JFlex lexer, this software is a pleasure to use. I learned how to use them both and created the RLog grammar in about an hour. I then converted the Krule configuration for RDFS into this new grammar, and convinced myself that I had it right. Then life got in the way again, and I put it away.

Last year while waiting for some tests to complete, I remembered this grammar, and spent some of my enforced downtime making it output some useful RDF in the Krule schema. Anyone who has looked at Krule may have noticed that triggers for rules (which rules cause which other rules to be run) are explicitly encoded into the configuration. I did this partly because I already had the list of trigger dependencies for the RDFS rules, and partly because I thought it would offer more flexibility. However, I had realized some time before that these dependencies were easy to work out, and had been wanting to automate this. I decided that RLog was the perfect place to do it, partly because it meant not having to change much, but also because it still allowed me the flexibility of tweaking the configuration.

Once I'd finished writing a system that could output Krule, I tested it against my RDFS RLog file, and compared the generated Krule to the original configuration. Initially I was disappointed to see too many dependencies, but on closer inspection I realized that they were all valid. The original dependencies were a reduced set because they took into account some of the semantics of the predicates and classes they were describing, which was not something that a grammar at the level of RLog could deal with. Semi-naïve evaluation was going to stop unnecessary rules from running anyway, so I decided that these extra triggers were fine. I ran it against the various test graphs that I had, and was pleased to see that it all worked perfectly.

But once again, work and life got in the way, and I put it aside again.

SKOS

A couple of months ago Brian asked me about running rules for generating SKOS entailments, as he was writing a paper on this topic. I pointed him towards RLog and knocked up a couple of useful rules for him. However, as I got into it, I realized that I could actually do most of SKOS quite easily, and before I knew it, I'd written an entire RLog program for it. The only thing I could not do was "S35", as this requires a predicate for list membership (also a requirement for OWL, and on my TODO list).

The really interesting thing about this document is that almost everything is an axiom and not a rule. It only requires 2 RDFS rules and 5 OWL rules to make the whole thing work. This is quite important, as the complexity of running the rules is generally exponential in the number of rules.
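
To give a flavor of what that means (this is just a sketch of a few entries, not the whole program), most of the SKOS-specific content is plain declarations like the following, which the handful of generic rules then act on:
  owl:TransitiveProperty(skos:broaderTransitive).
  owl:SymmetricProperty(skos:related).
  owl:inverseOf(skos:broader, skos:narrower).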

This is (IMNSHO) the power of ontologies. By describing the characteristics of classes and properties, they reduce the need for many rules. To demonstrate what I mean, I've seen a few systems (such as DLV) which define a predicate to be transitive in the following way:
  pred(A,C) :- pred(A,B), pred(B,C).
This works, but it creates a new rule to do it. Every new transitive predicate also gets its own rule. As I have already said, this has a significant detrimental effect on complexity.

Conversely, models such as OWL are able to declare properties as "transitive". Each such declaration then becomes a statement rather than a rule. Indeed, statements using every transitive predicate are covered by a single second-order rule:
  P(A,C) :- P(A,B), P(B,C), owl:TransitiveProperty(P).
"Second-order" refers to the fact that variables can be used for the predicates (such as the variable P in the expression P(A,B)), and that predicates can appear as parameters for other predicates, such as owl:TransitivePredicate(...). The symmetry of Mulgara indexes for RDF statements allows such second order constructs to be evaluated trivially.

Using the OWL construct for transitivity, any number of predicates can be declared as transitive with no increase in the number of rules. The complexity of running the rules does have a component derived from the number of statements, but this is closer to linear or polynomial (depending on the specific structure of the rules), and is therefore far less significant for large systems. It is also worth noting that several OWL constructs do not need an exhaustive set of their own rules, as their properties can be described using other OWL constructs. For instance, owl:sameAs is declared to be an owl:SymmetricProperty. This means that the entailment rule for owl:sameAs (shown above) need only be written once for owl:sameAs(A,B), and is not needed for the symmetric case of owl:sameAs(B,A).
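
Spelled out in RLog terms (again, just a sketch), the declaration is an axiom, and the symmetric behaviour comes from a single generic rule:
  owl:SymmetricProperty(owl:sameAs).
  P(B,A) :- P(A,B), owl:SymmetricProperty(P).
Between them, these entail owl:sameAs(B,A) whenever owl:sameAs(A,B) is present, which is why the duplication rule shown earlier only has to be written to match in one direction.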

General Acceptance

Brian wasn't the only one to like RLog. I've had reports and feature requests from a few other people who like using it as well. The most commonly requested feature has been the generation of blank nodes. The main reason for this is to handle existential formulas, which makes me wary, as this can lead to infinite loops if not carefully controlled. On the other hand, I can see the usefulness of it, so I expect to implement it eventually.

A related feature is to create multiple statements based on a single matched rule. This can usually be handled by introducing a new rule with the same body and a different head, but if a blank node has been generated by the rule, then there needs to be some way to re-use it in the same context.

A problem with general usage is that the domains understood by RLog are preset to the ones I've wanted, namely: RDF, RDFS, OWL, XSD, MULGARA, KRULE, FOAF, SKOS, and DC. The fix for this can be isolated in the parser, so I anticipate it being fixed by Monday. :-)

Despite being limited, RLog was proving to be useful, allowing me to encode systems like SKOS very easily. However, being a separate program that translated an RLog file into a Krule configuration file, which then had to be loaded and applied to data, was a serious impediment to its use.

Integration

The Mulgara "Content Handler" interface is a mechanism for loading any kind of data as triples, and optionally writing it back out. The two main ones are the RDF/XML handler and the N3 handler, but there are others in the default distribution as well. There is a MBox handler for representing Unix Mailbox files as RDF, and an MP3 handler which maps ID3 metadata from MP3 files. These handlers compliment the "Resolver" interface which represents external data sources as a dynamic graph.

Since RLog has a well-defined mapping into RDF (something the RLog program was already doing when it emitted RDF/XML), reimplementing this system as a content handler would integrate it into Mulgara with minimal effort. I had been planning on this for some time, but there always seemed to be more pressing priorities. Those other priorities are still there (and still pressing!) but a few people (e.g. David) have been pushing me for it recently, so I decided to bite the bullet and get it done.

The first problem was that the parser was written for Beaver. That means yet another JAR file to include at a time when I'm trying to cut down on our plethora of libraries. It also seemed excessive, since we already have both JavaCC and SableCC in our system (the former for SPARQL, the latter for TQL), and I hope to redo TQL in JavaCC eventually anyway. So I decided to re-implement the grammar parser in JavaCC.

Unfortunately, it had been over a year since I'd looked at JavaCC, and I was very rusty. So my first few hours were spent relearning token lookahead and the various aspects of JavaCC grammar files. I actually think I know it better now than I did when I first did the SPARQL parser (that's a concern). There are a few parts of the grammar which are not LL(1) either, which forced me to think through the structure more carefully, and I think I benefited from the effort.

I was concerned that I would need to reimplement a lot of the AST for RLog, but fortunately this was not the case. Once I got a handle on the translation it all went pretty smoothly, and the JavaCC parser was working identically to the original Beaver parser by the end of the first day.

After the parser was under control I moved on to emitting triples. This was when I was reminded that writing RDF/XML can actually be a lot easier than writing raw triples. I ended up making slow progress, but I finally got it done last night.

Testing

Before running a program through the content handler for the first time, I wanted to see that the data looked as I expected it to. JUnit tests were going to take some time to write, and with so much framework around the rules configuration, it was going to be clear very quickly if things weren't just right. I considered running it all through the debugger, but that was going to drown me in a sea of RDF nodes. That was when I decided that I could make the resolver/content-handler interfaces work for me.

Mulgara can't usually tell the difference between data in its own storage and data sourced from elsewhere. It always opts for internal storage first, but if a graph URI is not found, then it will ask the various handlers if they know what to do with the graph. By using file: URLs for the graphs, I could make Mulgara do all of its reading and writing to files, using the content handlers to do the I/O. In this case, I decided to "export" my RLog graph to an N3 file, and compare the result with an original Krule RDF/XML file that I exported to another N3 file.

The TQL command for this was:
  export <file:/path/rdfs.rlog> to <file:/path/rlog-output.n3>
Similarly, the RDF/XML file was transformed to N3 with:
  export <file:/path/rdfs-krule.rdf> to <file:/path/krule-output.n3>
I love it when I can glue arbitrary things together and it all "just works". (This may explain why I like the Semantic Web).

My first test run demonstrated that I was allowing an extra # into my URIs, and then I discovered that I'd fiddled with the literal token parsing and was now including the quotes in my strings (oops). These were trivial fixes. The third time through was the charm. I spent some time sorting my N3 files before deciding they looked practically identical, and so off I went to run an RLog program directly.

As I mentioned in my last post, applying a set of rules to data is done with the apply command. While I could have loaded the rules into an internal graph (pre-compiling them, so to speak), I was keen to "run" my program straight from the source:
  apply <file:/path/rdfs.rlog> to <test:data:uri>
...and whaddaya know? It worked. :-)

Now I have a long list of features to add, optimizations to make, bugs to fix, and all while trying to stay on top of the other unrelated parts of the system. Possibly even more importantly, I need to document how to write an RLog file! But for the moment I'm pretty happy about it, and I'm going to take it easy for the weekend. See you all on Monday!
