Thursday, May 21, 2009

Federated Queries

A long time ago, TKS supported federated queries, though the approach was a little naive (bring all the matches of triple patterns into a single place, and join them there). Then a few years ago I added this to Mulgara as well. I've always wanted to make it more intelligent in order to reduce network bandwidth, but at the same time, I was always happy that it worked. Unfortunately, it was all accomplished through RMI, and was Mulgara specific. That used to be OK, since RDF servers didn't have standardized communications mechanisms, but that changed with SPARQL.

More recently, I've started running across distributed queries through another avenue. While working through the SPARQL protocol, I realized that the Mulgara approach of treating unknown HTTP URIs as data that can be retrieved can be mixed with SPARQL CONSTRUCT queries encoded into a URI. The result of an HTTP request on a SPARQL CONSTRUCT query is an RDF document, which is exactly what Mulgara is expecting when it does an HTTP GET on a graph URI. The resulting syntax is messy, but it works quite well. Also, while retrieving graph URIs is not standard in SPARQL, most systems implement this, making it a relatively portable idiom. I was quite amused at the exclamations of surprise and horror (especially the horror) when I passed this along on a mailing list a few weeks ago.

The ease with which this was achieved using SPARQL made me consider how federated querying might be done using a SPARQL-like protocol. Coincidentally, the SPARQL Working Group has Basic Federated Queries as a proposed feature, and now I'm starting to see a lot of people asking about it on mailing lists (were people always asking about this, or am I just noticing it now?). I'm starting to think this feature may be more important to SPARQL than I first thought, and that perhaps I should have made it a higher priority when I voted on it. As it is, it's in the list of things we'll get to if we have time.

Then, while I was thinking about this, one of the other Mulgara developers tells me that he absolutely has to have distributed queries (actually, he needs to run rules over distributed datasets) to meet requirements in his organization. Well, the existing mechanisms will sort of work for him, but to do it right it should be in SPARQL.

Requirements

So what would I want to see in federated SPARQL queries? Well, as an implementer I need to see a few things:
  1. A syntactic mechanism for defining the URI of a SPARQL endpoint containing the graph(s) to be accessed.
  2. A syntactic mechanism for defining the query to be made on that endpoint (a subquery syntax would be fine here).
  3. A means of asking the size of a query result.
  4. A mechanism for passing existing bindings along with a query.


The first item seemed trivial until I realized that SPARQL has no standard way of describing an endpoint. Systems like Mulgara simply use http://hostname/sparql/, which provides access to the entire store (everything can be referred to using HTTP parameters, such as default-graph-uri and query). On the other hand, Joseki can do the /sparql/ thing, but also provides an option to access datasets through the path, and Sesame can have several repositories, each of which is accessible by varying the path in the URL.
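
To make the first style concrete, here is a minimal sketch of a client hitting a Mulgara-style endpoint purely through HTTP parameters. The host and graph URI are placeholders; the default-graph-uri and query parameter names are the ones from the SPARQL protocol.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class EndpointGet {
  public static void main(String[] args) throws Exception {
    String endpoint = "http://hostname/sparql/";   // whole-store endpoint (placeholder host)
    String graph = "http://example.org/data";      // placeholder graph URI
    String query = "SELECT ?s WHERE { ?s a <http://xmlns.com/foaf/0.1/Person> }";

    // The endpoint is addressed purely through HTTP parameters.
    String url = endpoint
        + "?default-graph-uri=" + URLEncoder.encode(graph, "UTF-8")
        + "&query=" + URLEncoder.encode(query, "UTF-8");

    HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
    conn.setRequestProperty("Accept", "application/sparql-results+xml");

    BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream(), "UTF-8"));
    for (String line; (line = in.readLine()) != null; ) System.out.println(line);
    in.close();
  }
}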

A base URL for issuing SPARQL queries would be easy enough to specify, but it introduces a new concept into the query language, and that has deeper ramifications than should be broached in this context.

The query that can be issued against an endpoint should look like a standard query, and not just a CONSTRUCT, as this provides greater flexibility and also binds the columns to variable names that can appear in other parts of the query. This is basically identical to a subquery, which is exactly what we want.

Bandwidth Efficiency

The last 2 items are a matter of efficiency and not correctness. However, they can mean the difference between transferring a few bytes and a few megabytes over a network.

(BTW, when did "bandwidth" get subverted to describe data rates? When I was a boy this referred to the range of frequencies that a signal used, and this had a mathematical formula relating it to the number of symbols-per-second that could be transmitted over that signal - which does indeed translate to a data rate. However, it now gets used in completely different contexts which have nothing to do with frequency range. Oh well.. back to the story).

If I want to ask for the identifiers of people named "Fred" (as opposed to something else I want to name with foaf:givenname), then I could use the query:
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?person WHERE {
?person foaf:givenname "Fred" .
?person a foaf:Person
}
Now what if the "type" data and the "name" data appear on different servers? In that case we would use a distributed query.

Using the HTTP/GET idiom I mentioned at the top of this post, I could send the query to the server containing the foaf:givenname statements, changing it to say:
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?person WHERE {
?person foaf:givenname "Fred" .
GRAPH <http://hostname/sparql/?query=
PREFIX+foaf%3A+%3Chttp%3A%2F%2Fxmlns.com%2Ffoaf%2F0.1%2F%3E%0A
CONSTRUCT+%7B%3Fperson+a+foaf%3APerson%7D+
WHERE+%7B%3Fperson+a+foaf%3APerson%7D> {
?person a foaf:Person
}
}
So now the server will resolve all the entities with the name "Fred", then it will retrieve the remote graph and ask it for all the entities that are a foaf:Person. Then it will join these results to create the final result.
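
For what it's worth, the messy graph URI above doesn't need to be written by hand. Here is a sketch of how a client might build it, with the hostname as a placeholder just as in the example:

import java.net.URLEncoder;

public class RemoteGraphUri {
  public static void main(String[] args) throws Exception {
    String endpoint = "http://hostname/sparql/";   // remote endpoint holding the rdf:type data
    String construct =
        "PREFIX foaf: <http://xmlns.com/foaf/0.1/>\n" +
        "CONSTRUCT {?person a foaf:Person} WHERE {?person a foaf:Person}";

    // The CONSTRUCT query is URL-encoded and carried in the "query" parameter,
    // so the whole request becomes a graph URI that an HTTP GET resolves to RDF.
    String graphUri = endpoint + "?query=" + URLEncoder.encode(construct, "UTF-8");
    System.out.println("GRAPH <" + graphUri + "> { ?person a foaf:Person }");
  }
}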

But what happens if there are only 3 things named "Fred", but 10,000 people in the data set? In that case the server will resolve the first pattern, getting 3 bindings for ?person, and then make a request across the network, getting back 10,000 statements, which are then queried for those that describe a foaf:Person (they all will), and only then does the join happen. Ideally, we'd have gone the other way, and asked the server with 10,000 people to request data from the server that had 3 entities named Fred, but we may not have known ahead of time that this would be better, and a more complex query could require a more complex access pattern than simply "reversing" the resolution order.

First of all, we need a way to ask each server how large a set of results is likely to be. The COUNT function that is being discussed in the SPARQL WG at the moment could certainly be used to help here, though for the sake of efficiency it would also be nice to have a mechanism for asking for the upper limit of the COUNT. That isn't appropriate for a query language (since it refers to database internals) but would be nice to have in the protocol, such as with an HTTP/OPTIONS request (though I really don't see something like that being ratified by the SPARQL WG). But even without an "upper limit" option, a normal COUNT would give us what we need to work out how to move the query around.
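
As a rough sketch of how that decision could be made (this isn't anything in Mulgara today), the code below asks each endpoint for a count of its pattern and ships the smaller side to the larger. The endpoint URLs are placeholders, the aggregate syntax is just one possible form of the COUNT being discussed (not yet standard), and the result extraction is deliberately crude:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CountProbe {

  /** Ask an endpoint how many solutions a pattern has. The aggregate syntax here is
   *  only one possible form of the COUNT under discussion, so treat it as an assumption. */
  static long countAt(String endpoint, String pattern) throws Exception {
    String query = "SELECT (COUNT(*) AS ?n) WHERE { " + pattern + " }";
    URL url = new URL(endpoint + "?query=" + URLEncoder.encode(query, "UTF-8"));
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestProperty("Accept", "application/sparql-results+xml");
    StringBuilder body = new StringBuilder();
    BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream(), "UTF-8"));
    for (String line; (line = in.readLine()) != null; ) body.append(line);
    in.close();
    // Crude extraction of the single numeric binding from the result document.
    Matcher m = Pattern.compile("<literal[^>]*>(\\d+)</literal>").matcher(body);
    return m.find() ? Long.parseLong(m.group(1)) : -1;
  }

  public static void main(String[] args) throws Exception {
    String serverA = "http://server-a/sparql/";   // holds the foaf:givenname data (placeholder)
    String serverB = "http://server-b/sparql/";   // holds the rdf:type data (placeholder)

    long fredCount = countAt(serverA, "?p <http://xmlns.com/foaf/0.1/givenname> \"Fred\"");
    long personCount = countAt(serverB, "?p a <http://xmlns.com/foaf/0.1/Person>");

    // Ship the smaller binding set to the server holding the larger one.
    if (fredCount <= personCount) {
      System.out.println("POST the ?p bindings from A to B");
    } else {
      System.out.println("POST the ?p bindings from B to A");
    }
  }
}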

So once we realize that the server running the query has only a little data (call it "Server A"), and it needs to join it to a large amount of data on a different server (call this one "Server B"), then of course we want Server A to send the small amount of data to Server B instead of retrieving the large amount from it. One way to do this might be to invert the query at this point, and send the whole thing to Server B. That server then asks Server A for the data, and sends its response. Unfortunately, that is both complex, and requires a lot more hops than we want. The final chain here would be:
  1. Client sends query as a request to Server A
  2. Server A reverses the query and sends the new query as a request to Server B
  3. Server B resolves its local data, and sends the remainder of the query as a request to Server A
  4. Server A responds to Server B with the result of entities with the name "Fred"
  5. Server B joins the data it got with the local data and responds to Server A with the results of the entire query
  6. Server A responds to the client with the unmodified results it just received

Yuck.

Instead, when Server A detects a data size disparity like this, it needs a mechanism to package up its bindings for the ?person variable, and send these to Server B along with the request. Fortunately, we already have a format for this in the SPARQL result set format.

Normally, a query would be performed using an HTTP/GET, but including a content-body in a GET request has never been formally recognized (though it has not been made illegal), so I don't want to go that way. Instead, a POST would work just as well here. The HTTP request with content could look like this (I've added line breaks to the request):
POST /sparql/?query=
PREFIX+foaf%3A+%3Chttp%3A%2F%2Fxmlns.com%2Ffoaf%2F0.1%2F%3E%0A
SELECT+%3Fperson+WHERE+%7B%3Fperson+a+foaf%3APerson%7D HTTP/1.1
Host: www.example
User-agent: my-sparql-client/0.1
Content-Type: application/sparql-results+xml
Content-Length: xxx

<?xml version="1.0"?>
<sparql xmlns="http://www.w3.org/2005/sparql-results#">
<head>
<variable name="person"/>
</head>
<results distinct="false" ordered="false">
<result>
<binding name="person"><uri>http://www.example/FredFlintstone</uri></binding>
</result>
<result>
<binding name="person"><uri>http://www.example/FredKruger</uri></binding>
</result>
<result>
<binding name="person"><uri>http://www.example/FredTheDog</uri></binding>
</result>
</results>
</sparql>
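
For illustration, here is roughly what sending that request could look like from Java. This is just a sketch of the proposed (and, as noted, non-standard) idea, reusing the placeholder host and URIs from the example above:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class BindingsPost {
  public static void main(String[] args) throws Exception {
    String query = "PREFIX foaf: <http://xmlns.com/foaf/0.1/>\n"
                 + "SELECT ?person WHERE {?person a foaf:Person}";

    // The bindings already resolved on Server A, serialized in the SPARQL result set format.
    String bindings =
        "<?xml version=\"1.0\"?>\n"
      + "<sparql xmlns=\"http://www.w3.org/2005/sparql-results#\">\n"
      + "  <head><variable name=\"person\"/></head>\n"
      + "  <results distinct=\"false\" ordered=\"false\">\n"
      + "    <result><binding name=\"person\"><uri>http://www.example/FredFlintstone</uri></binding></result>\n"
      + "    <result><binding name=\"person\"><uri>http://www.example/FredKruger</uri></binding></result>\n"
      + "    <result><binding name=\"person\"><uri>http://www.example/FredTheDog</uri></binding></result>\n"
      + "  </results>\n"
      + "</sparql>\n";

    URL url = new URL("http://www.example/sparql/?query=" + URLEncoder.encode(query, "UTF-8"));
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("POST");
    conn.setDoOutput(true);
    conn.setRequestProperty("Content-Type", "application/sparql-results+xml");

    byte[] body = bindings.getBytes("UTF-8");
    conn.setFixedLengthStreamingMode(body.length);   // sets the Content-Length for us
    OutputStream out = conn.getOutputStream();
    out.write(body);
    out.close();
    System.out.println("Server B answered: " + conn.getResponseCode());
  }
}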

I can't imagine that I could be successful in suggesting this as part of the underlying protocol for federated querying, but I'm thinking that I'll be incorporating it into Mulgara all the same.

Friday, February 27, 2009

More Programmable Logic

In the last post I gave a basic description of how the Krule rules engine works. I left out a number of details, but it gives a good sense of the overall approach.

The most important detail I skipped was the interaction with resolvers in order to identify resources with particular properties. This includes, among other things, finding URIs in a domain, inequality comparisons for numeric literals, regular expression matching on strings, the "type" of a resource (URI, blank node, or literal), and optimized transitivity of predicates. We also manage "subtractions" in our data, in the same way that Evren Sirin described NOT two weeks ago. I should point out that Mulgara does subtractions directly, and not with a combined operation with the OPTIONAL/FILTER/!BOUND pattern. This was introduced in 2004 (or was it 2005?). I have to say that I'm a little surprised that SPARQL never included anything to express it directly, particularly since so many people ask for it (and hence, the popularization of OPTIONAL/FILTER/!BOUND as a pattern) and because the SPARQL algebra provides a definition of a function called "Diff" that is used internally.

Anyway, these extensions are not necessary to understand Krule or RLog, but I think it's useful to know that they're there.

So now that I've described Krule, I've set the scene for describing RLog.

Krule Configuration

When I first wrote Krule, I was ultimately aiming at OWL, but I had a short term goal of RDFS. I find that I have to take these things one step at a time, or else I never make progress. Since I knew that my rules were going to expand, I figured I should not hard code anything into Mulgara, but that I should instead interpret a data structure which described the rules. That also meant I would be able to run rules for lots of systems: RDFS, SKOS, OWL, or anything else. Of course, some things would need more features than RDFS needed (e.g. both OWL and SKOS need "lists"), but my plan was to work on that iteratively.

At the time, I designed an RDF schema to describe my rules, and built the Krule engine to initialize itself from this. This works well, since the whole system is built around RDF already. I also created a new TQL command for applying rules to data:
  apply <rule_graph_uri> to <data_graph_uri> [<output_graph_uri>]
By default all of the entailed data goes into the graph the rules are being applied to, but by including the optional output graph you can send the entailed data there instead.

This worked as planned, and I was able to build a Krule configuration graph for RDFS. Then life and work interfered and the rules engine was put on the back burner before I got to add some of the required features (like consistency checking).

Then about 18 months ago I thought I'd have a go at writing OWL entailment, at least for the part that the rules engine would support. So I set out to write a new Krule file. The complexity of the file was such that I started writing out the rules that I wanted using a kind of Prolog notation with second-order programming, in a very similar way to how Raphael Volz represented the same constructs in some of his papers. This grammar uses binary predicates to represent general triples, and unary predicates to indicate "type" statements, i.e. statements with a predicate of rdf:type. As an example, the owl:sameAs predicate indicates that if the subject of a statement is owl:sameAs another resource, then that statement can be duplicated with the other resource as the subject. This was easily expressed as:
  A(Y,Z) :- A(X,Z), owl:sameAs(X,Y).
I wrote out about 3 rules before I realized that converting these to Krule was going to be tedious and prone to error. In fact, I had unthinkingly demonstrated that I already had a language I wanted to use, and the process of translation was an easily automated task. Since the language was allowing me to describe RDF with logic, I decided to call it RLog (for RDF Logic).

Iterations

Andrae and I had been discussing how much we disliked SableCC for generating the TQL parser for Mulgara, and so I started looking around at other parsers. The power of LALR parsers appealed to me, and so I went with Beaver. Along with the JFlex lexer, this software is a pleasure to use. I had learned how to use them both, and created the RLog grammar in about an hour. I then converted the Krule configuration for RDFS into this new grammar, and convinced myself that I had it right. Then life got in the way again, and I put it away.

Last year while waiting for some tests to complete, I remembered this grammar, and spent some of my enforced downtime making it output some useful RDF in the Krule schema. Anyone who's looked at Krule may have noticed that triggers for rules (which rules cause which other rules to be run) are explicitly encoded into the configuration. I did this partly because I already had the list of trigger dependencies for RDFS rules, and partly because I thought it would offer more flexibility. However, I had realized some time before that these dependencies were easy to work out, and had been wanting to automate this. I decided that RLog was the perfect place to do it, partly because it meant not having to change much, but also because it still allowed me the flexibility of tweaking the configuration.

Once I'd finished writing a system that could output Krule, I tested it against my RDFS RLog file, and compared the generated Krule to the original configuration. Initially I was disappointed to see too many dependencies, but on closer inspection I realized that they were all valid. The original dependencies were a reduced set because they applied some of the semantics of the predicates and classes they were describing, which was not something that a grammar at the level of RLog could deal with. Semi-naïve evaluation was going to stop unnecessary rules from running anyway, so I decided that these extra triggers were fine. I ran it against the various test graphs that I had, and was pleased to see that it all worked perfectly.

But once again, work and life got in the way, and I put it aside again.

SKOS

A couple of months ago Brian asked me about running rules for generating SKOS entailments, as he was writing a paper on this topic. I pointed him towards RLog and knocked up a couple of useful rules for him. However, as I got into it, I realized that I could actually do most of SKOS quite easily, and before I knew it, I'd written an entire RLog program for it. The only thing I could not do was "S35", as this requires a predicate for list membership (also a requirement for OWL, and on my TODO list).

The really interesting thing about this document is that almost everything is an axiom and not a rule. It only requires 2 RDFS rules and 5 OWL rules to make the whole thing work. This is quite important, as the complexity in running the rules is generally exponential in the number of rules.

This is (IMNSHO) the power of ontologies. By providing properties of classes and properties, they reduce the need for many rules. To demonstrate what I mean, I've seen a few systems (such as DLV) which define a predicate to be transitive in the following way:
  pred(A,C) :- pred(A,B), pred(B,C).
This works, but it creates a new rule to do it. Every new transitive predicate also gets its own rule. As I have already said, this has a significant detrimental effect on complexity.

Conversely, models such as OWL are able to declare properties as "transitive". Each such declaration then becomes a statement rather than a rule. Indeed, all the transitive statements get covered with a single second-order rule:
  P(A,C) :- P(A,B), P(B,C), owl:TransitiveProperty(P).
"Second-order" refers to the fact that variables can be used for the predicates (such as the variable P in the expression P(A,B)), and that predicates can appear as parameters for other predicates, such as owl:TransitivePredicate(...). The symmetry of Mulgara indexes for RDF statements allows such second order constructs to be evaluated trivially.

Using the OWL construct for transitivity, any number of predicates can be declared as transitive with no increase to the number of rules. The complexity of rules does have a component derived from the number of statements, but this is closer to linear or polynomial (depending on the specific structure of the rules), and is therefore far less significant for large systems. It is also worth noting that several OWL constructs do not need an exhaustive set of their own rules, as their properties can be described using other OWL constructs. For instance, owl:sameAs is declared to be an owl:SymmetricProperty. This means that the entailment rule for owl:sameAs (shown above) need only be written once for owl:sameAs(A,B) and is not needed for the symmetric case of owl:sameAs(B,A).

General Acceptance

Brian wasn't the only one to like RLog. I've had reports and feature requests from a few other people who like using it as well. The most commonly requested feature has been the generation of blank nodes. The main reason for this is to handle existential formulas, which makes me wary, as this can lead to infinite loops if not carefully controlled. On the other hand, I can see the usefulness of it, so I expect to implement it eventually.

A related feature is to create multiple statements based on a single matched rule. This can usually be handled by introducing a new rule with the same body and a different head, but if a blank node has been generated by the rule, then there needs to be some way to re-use it in the same context.

A problem with general usage is that the domains understood by RLog have been preset with the domains that I've wanted, namely: RDF, RDFS, OWL, XSD, MULGARA, KRULE, FOAF, SKOS, and DC. The fix to this can be isolated in the parser, so I anticipate this being fixed by Monday. :-)

Despite its limitations, RLog was proving to be useful, allowing me to encode systems like SKOS very easily. However, being a separate program that translated an RLog file into a Krule configuration that then had to be loaded and applied to data was a serious impediment to its use.

Integration

The Mulgara "Content Handler" interface is a mechanism for loading any kind of data as triples, and optionally writing it back out. The two main ones are the RDF/XML handler and the N3 handler, but there are others in the default distribution as well. There is a MBox handler for representing Unix Mailbox files as RDF, and an MP3 handler which maps ID3 metadata from MP3 files. These handlers compliment the "Resolver" interface which represents external data sources as a dynamic graph.

Since RLog has a well-defined mapping into RDF (something the RLog program was already doing when it emitted RDF/XML), reimplementing this system as a content handler would integrate it into Mulgara with minimal effort. I had been planning on this for some time, but there always seemed to be more pressing priorities. These other priorities are still there (and still pressing!) but a few people (e.g. David) have been pushing me for it recently, so I decided to bite the bullet and get it done.

The first problem was that the parser was in Beaver. This is yet another JAR file to include at a time when I'm trying to cut down on our plethora of libraries. It also seemed excessive, since we already have both JavaCC and SableCC in our system - the former for SPARQL, the latter for TQL, and I hope to redo TQL in JavaCC eventually anyway. So I decided to re-implement the grammar parser in JavaCC.

Unfortunately, it's been over a year since I looked at JavaCC, and I was very rusty. So my first few hours were spent relearning token lookahead, and various aspects of JavaCC grammar files. I actually think I know it better now than I did when I first did the SPARQL parser (that's a concern). There are a few parts of the grammar which are not LL(1) either, which forced me to think through the structure more carefully, and I think I benefited from the effort.

I was concerned that I would need to reimplement a lot of the AST for RLog, but fortunately this was not the case. Once I got a handle on the translation it all went pretty smoothly, and the JavaCC parser was working identically to the original Beaver parser by the end of the first day.

After the parser was under control I moved on to emitting triples. This was when I was reminded that writing RDF/XML can actually be a lot easier than writing raw triples. I ended up making slow progress, but I finally got it done last night.

Testing

Before running a program through the content handler for the first time, I wanted to see that the data looked as I expected it to. JUnit tests are going to take some time to write, and with so much framework around the rules configuration, it was going to be clear very quickly if things weren't just right. I considered running it all through the debugger, but that was going to drown me in a sea of RDF nodes. That was when I decided that I could make the resolver/content-handler interfaces work for me.

Mulgara can't usually tell the difference between data in its own storage and data sourced from elsewhere. It always opts for internal storage first, but if a graph URI is not found, then it will ask the various handlers if they know what to do with the graph. By using file: URLs for the graphs, I could make Mulgara do all of its reading and writing to files, using the content handlers to do the I/O. In this case, I decided to "export" my RLog graph to an N3 file, and compare the result to an original Krule RDF/XML file that I exported to another N3 file.

The TQL command for this was:
  export <file:/path/rdfs.rlog> to <file:/path/rlog-output.n3>
Similarly, for the RDF/XML file was transformed to N3 with:
  export <file:/path/rdfs-krule.rdf> to <file:/path/krule-output.n3>
I love it when I can glue arbitrary things together and it all "just works". (This may explain why I like the Semantic Web).

My first test run demonstrated that I was allowing an extra # into my URIs, and then I discovered that I'd fiddled with the literal token parsing, and was now including quotes in my strings (oops). These were trivial fixes. The third time through was the charm. I spent some time sorting my N3 files before deciding they looked practically identical, and so off I went to run an RLog program directly.

As I mentioned in my last post, applying a set of rules to data is done with the apply command. While I could have loaded the rules into an internal graph (pre-compiling them, so to speak) I was keen to "run" my program straight from the source:
  apply <file:/path/rdfs.rlog> to <test:data:uri>
...and whaddaya know? It worked. :-)

Now I have a long list of features to add, optimizations to make, bugs to fix, and all while trying to stay on top of the other unrelated parts of the system. Possibly even more importantly, I need to document how to write an RLog file! But for the moment I'm pretty happy about it, and I'm going to take it easy for the weekend. See you all on Monday!

Programmable Logic

I had hoped to blog a lot this week, but I kept putting it off in order to get some actual work done. I still have a lot more to do, but I'm at a natural break, so I thought I'd write about it.

I have finally integrated RLog into Mulgara! In some senses this was not a big deal, and yet it took a surprising amount of work.

To explain what RLog is, I should describe Krule first.

Krule

The Mulgara rules engine, Krule, implements my design for executing rules on large data sets. For those familiar with rules engines, it is similar to RETE, but it runs over data as a batch, rather than iteratively. It was designed this way because our requirements were to perform inferencing on gigabytes of RDF. This was taking a long time to load, so we wanted to load it all up, and then do the inferencing.

RETE operates by setting up a network describing the rules, and then as statements are fed through it, each node in that network builds up a memory of the data it has seen and passed. As data comes in the first time, the nodes that care about it will remember this data and pass it on, but subsequent times through, the node will recognize the data and prevent it from going any further. In this way, every statement fed into the system will be processed as much as it needs to be, and no more. There is even a proof out there that says that RETE is the optimal rules algorithm. Note that the implementation of an algorithm can, and does, have numerous permutations which can allow for greater efficiency in some circumstances, so RETE is often treated as the basis for efficient engines.

One variation on algorithms of this sort is to trade time for space. This is a fancy way of saying that if we use more memory then we can use less processing time, or we can use more processing time to save on memory. Several variations on RETE do just this, and so does Krule.

When looking at the kinds of rules that can be run on RDF data, I noticed that the simple structure of RDF meant that each "node" in a RETE network corresponds to a constraint on a triple (or a BGP in SPARQL). Because Mulgara is indexed in "every direction", this means that every constraint can be found as a "slice" out of one of the indexes (while our current system usually takes O(log(n)) to find, upcoming systems can do some or all of these searches in O(1)). Consequently, instead of my rule network keeping a table in memory associated with every node, there is a section out of an index which exactly corresponds to this table.

There are several advantages to this. First, the existence of the data in the database is defined by it being in the indexes. This means that all the data gets indexed, rules engine or not. Second, when the rules engine is run, there is no need to use the data to iteratively populate the tables for each node, as the index slices (or constraint resolutions) are already fully populated, by definition. Finally, our query engine caches constraint resolutions, and they do not have to be re-resolved if no data has gone in that can affect them (well, some of the caching heuristics can be improved for better coverage, but the potential is there). This means that the "tables" associated with each node will be automatically updated for us as the index is updated, and the work needed to handle updates is minimal.

During the first run of Krule, none of the potential entailments have been made yet, so everything is potentially relevant. However, during subsequent iterations of the rules, Krule has no knowledge of which statements are new in the table on any given node. This means it will produce entailed statements that already exist. Inserting these duplicates is unnecessary (and hence, suboptimal). We handle this in two ways.

The first and simpler mechanism is that Mulgara uses set semantics, meaning that any duplicates are silently (and efficiently) ignored. Set semantics are important when dealing with RDF, and this is why I'm so frustrated at the non-distinct nature of SPARQL queries... but that's a discussion for another time. :-)

The more important mechanism for duplicate inserts is based on RDF having the property of being monotonically increasing. This is because RDF lets you assert data, but not "unassert" it. OWL 2 has introduced explicit denial of statements, but this is useful for preventing entailments and consistency checking... it does not remove previously existing statements. In non-monotonic systems a constraint resolution may keep the same size if some statements are deleted while an equal number of statements are inserted, but in a monotonic system like RDF, keeping the same size means that there has been no change. So a node knows to pass its data on if the size of its table increases, but otherwise it will do nothing. I stumbled across this technique as an obvious optimization, but I've since learned that it has a formal name: semi-naïve evaluation.
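
A minimal sketch of that trigger loop, using hypothetical bodySize/fireAndInsert/triggers operations rather than the real Krule interfaces:

import java.util.*;

/** A sketch (not the Krule implementation) of the semi-naive loop described above:
 *  a rule only runs again if the data its body matches has grown, which, in a
 *  monotonic store, is the only way it can have changed. */
public class SemiNaiveLoop {

  interface Rule {
    long bodySize();            // size of the constraint resolution feeding this rule
    long fireAndInsert();       // run the rule, insert entailments, return rows inserted
    List<Rule> triggers();      // rules whose bodies may be affected by this rule's head
  }

  static void run(Collection<Rule> allRules) {
    Map<Rule, Long> lastSeen = new HashMap<Rule, Long>();
    Deque<Rule> queue = new ArrayDeque<Rule>(allRules);   // first pass: everything is new

    while (!queue.isEmpty()) {
      Rule rule = queue.removeFirst();
      long size = rule.bodySize();
      Long previous = lastSeen.get(rule);

      // Monotonic data: an unchanged size means nothing new, so skip the rule.
      if (previous != null && previous == size) continue;
      lastSeen.put(rule, size);

      if (rule.fireAndInsert() > 0) {
        // Only rules downstream of this head can possibly have new matches.
        for (Rule dependent : rule.triggers()) {
          if (!queue.contains(dependent)) queue.addLast(dependent);
        }
      }
    }
  }
}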

Krule Extensions

While this covers the "batch" use-case, what about the "iterative" use-case, where the user wants to perform inferences on streams of data as it is inserted into an existing database? In this case, the batch approach is too heavyweight, as it will infer statements that are almost entirely pre-existing. We might handle the non-insertion of these statements pretty well, but if you do the work again and again for every statement you try to insert, then it will add up. In this case, the iterative approach of standard RETE is more appropriate.

Unfortunately, RETE needs to build its tables up by iterating over the entire data set, but I've already indicated that this is expensive for the size of set that may be encountered. However, the Krule approach of using constraint resolutions as the tables is perfect for pre-populating these tables in a standard RETE engine. I mentioned this to Alex a few months ago, and he pointed out that he did exactly the same thing once before when implementing RETE in TKS.

I haven't actually done this extension, but I thought I'd let people know that we haven't forgotten it, and it's in the works. It will be based on Krule configurations, so a lot of existing work will be reused.

I don't want to overdo it in one post, so I'll write about RLog in the next one.

Thursday, February 19, 2009

Is it Me?

After struggling against nonsensical errors all day, I finally realized that either Eclipse was going mad, or I was. In frustration I updated a method to look like this:
  public void setContextOwner(ContextOwner owner) {
    contextOwner = owner;
    if (owner != contextOwner) throw new AssertionError("VM causing problems");
  }
This code throws an AssertionError, and yes, there is only 1 thread. If it were multithreaded at least I'd have a starting point. The problem only appears while debugging in Eclipse.

I'm not sure whether I'm happy to have isolated my problem down to a single point, or to be unhappy at something that is breaking everything I am trying to work on. I guess I'm unhappy, because it's kept me up much later than I'd hoped.

Tuesday, February 17, 2009

Resting

I've had a couple of drinks this evening. Let's see if that makes me more or less intelligible.

Now that I've added a REST-like interface to Mulgara, I've found that I've been using it more and more. This is fine, but to modify data I've had to either upload a document (a very crude tool) or issue write commands on the TQL endpoint. Neither of these were very RESTful, and so I started wondering if it would make sense to do something more direct.

From my perspective (and I'm sure there will be some who disagree with me), the basic resources in an RDF database are graphs and statements. Sure, the URIs themselves are resources, but the perspective of RDF is that the set of resources is infinite. Graphs are a description of how a subset of all those resources is related. Of course, these relationships are described via statements.

Graphs and Statements

So I had to work out how to represent a graph in a RESTful way. Unfortunately, graphs already have their own URIs, and these probably have nothing to do with the server they are on. However, REST requires a URL which identifies the host and service, and then the resource within it. So the graph URI has to be embedded in the URL, after the host. While REST URLs typically try to reflect structure in a path, encoding a whole URI into a path makes this almost impossible. Instead I opted to encode the graph URI as a "graph" parameter.

Statements posed a similar though more complex challenge. I still needed the graph, so this had to stay. Similarly, the other resources also needed to be encoded as parameters, so I added these as well. This left me with 2 issues: blank nodes and literals.

Literals

Literals were reasonably easy... sort of. I simply decided that anything that didn't look like a URI would be a literal. Furthermore, if it was structured like a SPARQL literal, then it would be parsed, allowing a datatype or language to be included. However, nothing is ever really easy (of course), and I found myself wondering about relative URIs. These had never been allowed in Mulgara before, but I've brought them in recently after several requests. Most people will ignore them, but for those people who have a use for them, they can be handy. That all seems OK, until you realize that the single quote character ' is an unreserved character in URIs, and so the apparent literal 'foo' is actually a valid relative URI. (Thank goodness for unit tests, or I would never have realized that). In the end, I decided to treat any valid URI as a URI and not a literal, unless it starts with a quote. If you really want a relative URI of 'foo' then you'll have to choose another interface.
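
The classification rule I ended up with looks roughly like this (a sketch, not the actual Mulgara code):

import java.net.URI;
import java.net.URISyntaxException;

public class NodeGuess {

  enum Kind { URI_REF, LITERAL }

  static Kind classify(String value) {
    // Anything starting with a quote is a literal, no matter how URI-like it looks.
    if (value.startsWith("\"") || value.startsWith("'")) return Kind.LITERAL;
    try {
      new URI(value);            // relative URIs are now allowed, so parsing is enough
      return Kind.URI_REF;
    } catch (URISyntaxException e) {
      return Kind.LITERAL;       // doesn't look like a URI, so treat it as a literal
    }
  }

  public static void main(String[] args) {
    System.out.println(classify("http://example.org/thing"));  // URI_REF
    System.out.println(classify("foo"));                       // URI_REF (a relative URI!)
    System.out.println(classify("'foo'"));                     // LITERAL
    System.out.println(classify("\"Fred\"@en"));               // LITERAL
  }
}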

Blank Nodes

Blank nodes presented another problem. Initially, I decided that any missing parameter would be a blank node. That worked well, but then I started wondering about using the same blank node in more than one statement. I'm treating statements as resources, and you can't put more than one "resource" into a REST URL, so that would mean referring to the same "nameless" thing in two different method calls, which isn't possible. Also, adding statements with a blank node necessarily creates a new blank node every time, which breaks idempotency.

Then what about deletion? Does nothing match, or does the blank node match everything? But doing matches like that means I'm no longer matching a single statement, which was what I was trying to do to make this REST and not RPC for a query-like command.

Another option is to refer to blanks with a syntax like _:123. However, this has all of the same problems we've had with exactly this idea in the query language. For instance, these identifiers are not guaranteed to match between different copies of the same data. Also, introducing new data that includes the same ID will accidentally merge these nodes incorrectly. There are other reasons as well. Essentially, you are using a name for something that was supposed to be nameless, and because you're not using URIs (like named things are supposed to use) then you're going to encounter problems. URIs were created for a reason. If you need to refer to something in a persistent way, then use a name for it. (Alternatively, use a query that links a blank node through a functional/inverse-functional predicate to uniquely identify it, but that's another discussion).

So in the end I realized that I can't refer to blank nodes at all in this way. But I think that's OK. There are other interfaces available if you need to work with blank nodes, and some applications prohibit them anyway.

Reification

Something I wanted to come back to is this notion of representing a statement as 3 parameters in a URL (actually 4, since the graph is needed). The notion of representing a statement as a URI has already been addressed in reification; however, I dismissed this as a solution here, since reifying a statement does not imply that the statement exists (indeed, the purpose of the reification may be to say that the statement is false). All the same, it's left me thinking that I should consider a way to use this interface to reify statements.

Methods

So the methods as they stand now are:
  Method   Graph           Statement           Other
  GET      N/A             N/A                 Used for queries.
  POST     Upload graphs   N/A                 Write commands (not SPARQL)
  PUT      Creates graph   Creates statement   N/A
  DELETE   Deletes graph   Deletes statement   N/A

I haven't done HEAD yet (I intend it to indicate whether a graph or statement exists), and I'm ignoring OPTIONS.
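
To give a feel for the interface, here is a sketch of a client creating a single statement with PUT. The service path and the parameter names for the three statement positions are placeholders, not the real parameter names (only the "graph" parameter is described above):

import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class PutStatement {
  public static void main(String[] args) throws Exception {
    String service = "http://localhost:8080/rest/statement";   // hypothetical service path
    String graph = "http://example.org/data";

    // The graph goes in a "graph" parameter as described above; subject/predicate/object
    // are placeholder parameter names used only for this sketch.
    String url = service
        + "?graph="     + URLEncoder.encode(graph, "UTF-8")
        + "&subject="   + URLEncoder.encode("http://example.org/fred", "UTF-8")
        + "&predicate=" + URLEncoder.encode("http://xmlns.com/foaf/0.1/givenname", "UTF-8")
        + "&object="    + URLEncoder.encode("'Fred'", "UTF-8");   // quoted, so it's a literal

    HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
    conn.setRequestMethod("PUT");                                 // PUT creates the statement
    System.out.println("Response: " + conn.getResponseCode());
  }
}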

I've also considered what it might mean to GET a statement or a graph. When applied to a graph, I could treat this as a synonym for the query:
  construct {?s ?p ?o} where {?s ?p ?o}
Initially I didn't think it made much sense to GET a statement, but while writing this it occurs to me that I could return a reification URI, if one exists (this is also an option for HEAD, but I think existence is a better function there).

Is There a Point?

Everything I've discussed here may seem pointless, especially since there are alternatives, none of it is standard, and I'm sure there will be numerous criticisms of my choices. On the other hand, I wrote this because I found that uploading a document at a time was too crude for real coding. I also find that constructing a TQL command to modify data is a little too convoluted in many circumstances, and that a simple PUT is much more appropriate.

So, I'm pretty happy with it, for the simple fact that I find it useful. If anyone has suggestions for modifications or features, then I'll be more than happy to take them on board.

Friday, February 13, 2009

Prodding

Going back through a recent flurry of activity by Webster Mudge on Google Groups, I noticed a couple of things directly related to me.

First was a link to David Wood's post from last month in which he talks about how I did SKOS using RLog (some nice compliments BTW, thanks David). Both in this post and personally, David has been hassling me to integrate RLog into Mulgara. I'd love to get this done, but SPARQL and scalability have been priorities for me, and no one ever asked for RLog before. But it's been shuffling to the top of my list recently, so I'm going to see what I can get done in the next week, before I get loaded with new priorities.

The other link was to SquirrelRDF and included the comment, “Great idea, bummer it's tied to Jena.” This intrigued me, and I wondered if it was something Mulgara could do, so I checked it out. Only, once I got there I discovered that Mulgara already does it, and has done for years!

That's one of the biggest problems with Mulgara: lack of documentation. People just aren't aware of what the system can do, and there's no easy way to find out. I'd love to fix this on the Wiki, but when I'm accountable for getting things done, and not for telling people how to do it, then I tend to opt for the "getting things done" work instead.

Resolvers

For anyone interested, Mulgara has a system we call the "Resolver" interface. The purpose of this is to present any data as live RDF data. Included with Mulgara are resolvers for accessing: Lucene indexes; RDF files over HTTP; filesystem data; GIS data; RDBMSs (via JDBC, and using D2RQ); JAR files; plus a few resolvers specifically for representing aspects of literals and URIs stored in the database. Most are read-only interpretations of external data, but some are writable.

We also have a related system called "Content Handlers". These are for handling raw file formats and returning RDF triples. We support the obvious RDF/XML and N3 file formats, but also interpret Unix MBox files and MP3 files (the latter was done as a tutorial). This mixes well with things like the HTTP and file resolvers, as it lets us refer to a graph such as http://www.w3.org/2000/01/rdf-schema in a query. In this example the graph will not be in the local database (it could be, but only if you'd created it), so the HTTP resolver will be asked to retrieve the contents from the URL. Once the data arrives, it will be sent to the RDF/XML content handler (having recognized the "application/rdf+xml" MIME type), which will then turn it into a queryable local graph in memory. The query can then continue as if everything were local. If the data is on the local filesystem, or the MIME type isn't recognized, then the system will fall back to relying on filename extensions.

It's the way these things hook together that allows us to connect SPARQL sources easily. It may be messy, but it is perfectly possible to select from a graph with a URI like:
http://host/sparql?default-graph-uri=my%3Agraph&
query=%40prefix+rdfs%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23%3E+.
+create+%7B+%3Fs+%3Fp+%3Fo+%7D+where+%7B+%3Fs+%3Fp+%3Fo+.
+%3Fp+rdfs%3Adomain+%3Cmy%3AClass%3E+%7D
I've split the URI over a few lines to make it fit better, and I also used the graph name of my:graph just to keep it shorter. It's legal, though unusual.

Mulgara originally aimed at being highly scalable, and we're in the process of regaining that title (honest... the modest improvements we've had recently are orders of magnitude short of XA2). However, the sheer number of features and the flexibility of the system is probably its most compelling attribute at the moment. If only I could document it all, and spread the word.

Oh well, back to the grind. At the moment I'm alternating between RESTful features (I want to PUT and DELETE individual statements) and a class that will transparently memory map a file larger than 2GB. For the latter, I'd love to offer an extension to java.nio.Buffer, but this package has been completely locked down by Sun. I hate not being able to extend built-in functionality. :-(
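
For the curious, the idea behind the large file mapping is roughly this (a sketch, not the actual class): map the file as a series of regions and translate a long offset into a region plus an int offset, since a single MappedByteBuffer tops out at 2GB because of its int-based positions.

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class LargeMappedFile {
  private static final long REGION_SIZE = 1L << 30;   // 1GB regions
  private final MappedByteBuffer[] regions;

  public LargeMappedFile(String path) throws IOException {
    RandomAccessFile file = new RandomAccessFile(path, "r");
    FileChannel channel = file.getChannel();
    long size = channel.size();
    int regionCount = (int) ((size + REGION_SIZE - 1) / REGION_SIZE);
    regions = new MappedByteBuffer[regionCount];
    for (int i = 0; i < regionCount; i++) {
      long start = i * REGION_SIZE;
      long length = Math.min(REGION_SIZE, size - start);
      regions[i] = channel.map(FileChannel.MapMode.READ_ONLY, start, length);
    }
  }

  /** Read one byte at an arbitrary long offset by picking the right region. */
  public byte get(long offset) {
    MappedByteBuffer region = regions[(int) (offset / REGION_SIZE)];
    return region.get((int) (offset % REGION_SIZE));
  }
}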