Thursday, May 21, 2009

Federated Queries

A long time ago, TKS supported federated queries, though the approach was a little naive (bring all the matches of triple patterns into a single place, and join them there). Then a few years ago I added this to Mulgara as well. I've always wanted to make it more intelligent in order to reduce network bandwidth, but at the same time, I was always happy that it worked. Unfortunately, it was all accomplished through RMI, and was Mulgara specific. That used to be OK, since RDF servers didn't have standardized communications mechanisms, but that changed with SPARQL.

More recently, I've started running across distributed queries through another avenue. While working through the SPARQL protocol, I realized that the Mulgara approach of treating unknown HTTP URIs as data that can be retrieved can be mixed with SPARQL CONSTRUCT queries encoded into a URI. The result of an HTTP request on a SPARQL CONSTRUCT query is an RDF document, which is exactly what Mulgara is expecting when it does an HTTP GET on a graph URI. The resulting syntax is messy, but it works quite well. Also, while retrieving graph URIs is not standard in SPARQL, most systems implement this, making it a relatively portable idiom. I was quite amused at the exclamations of surprise and horror (especially the horror) when I passed this along on a mailing list a few weeks ago.
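The idiom above can be sketched in a few lines. This is a rough illustration rather than Mulgara's actual code, and the endpoint URL is a placeholder: the trick is just URL-encoding a CONSTRUCT query into a "graph" URI, so that a server resolving the graph with an HTTP GET receives an RDF document built by the remote endpoint.

```python
# Sketch of the CONSTRUCT-in-a-graph-URI idiom. The endpoint URL is a
# placeholder; any SPARQL endpoint that answers CONSTRUCT via GET works.
from urllib.parse import quote_plus

def construct_graph_uri(endpoint, query):
    """Embed a CONSTRUCT query in a URI usable as a graph reference."""
    return endpoint + "?query=" + quote_plus(query)

query = ("PREFIX foaf: <http://xmlns.com/foaf/0.1/>\n"
         "CONSTRUCT {?person a foaf:Person} "
         "WHERE {?person a foaf:Person}")
uri = construct_graph_uri("http://hostname/sparql/", query)
```

The resulting URI is exactly the sort of thing that appears in the GRAPH clause later in this post.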

The ease with which this was achieved using SPARQL made me consider how federated querying might be done using a SPARQL-like protocol. Coincidentally, the SPARQL Working Group has Basic Federated Queries as a proposed feature, and now I'm starting to see a lot of people asking about it on mailing lists (were people always asking about this, or am I just noticing it now?). I'm starting to think this feature may be more important to SPARQL than I'd realized, and that perhaps I should have made it a higher priority when I voted on it. As it is, it's in the list of things we'll get to if we have time.

Then, while I was thinking about this, one of the other Mulgara developers told me that he absolutely has to have distributed queries (actually, he needs to run rules over distributed datasets) to meet requirements in his organization. Well, the existing mechanisms will sort of work for him, but to do it right it should be in SPARQL.

Requirements

So what would I want to see in federated SPARQL queries? Well, as an implementer I need to see a few things:
  1. A syntactic mechanism for defining the URI of a SPARQL endpoint containing the graph(s) to be accessed.
  2. A syntactic mechanism for defining the query to be made on that endpoint (a subquery syntax would be fine here).
  3. A means of asking for the size of a query result.
  4. A mechanism for passing existing bindings along with a query.


The first item seemed trivial until I realized that SPARQL has no standard way of describing an endpoint. Systems like Mulgara simply use http://hostname/sparql/, which provides access to the entire store (everything can be referred to using HTTP parameters, such as default-graph-uri and query). On the other hand, Joseki can do the /sparql/ thing, but also provides an option to access datasets through the path, and Sesame can have several repositories, each of which is accessible by varying the path in the URL.

Specifying a base URL to issue SPARQL queries against would be easy enough, but it introduces a new concept into the query language, and that has deeper ramifications than should be broached in this context.

The query that can be issued against an endpoint should look like a standard query, and not just a CONSTRUCT, as this provides greater flexibility and also binds the columns to variable names that can appear in other parts of the query. This is basically identical to a subquery, which is exactly what we want.

Bandwidth Efficiency

The last two items are a matter of efficiency rather than correctness. However, they can mean the difference between transferring a few bytes versus a few megabytes over a network.

(BTW, when did "bandwidth" get subverted to describe data rates? When I was a boy this referred to the range of frequencies that a signal used, and there was a mathematical formula relating it to the number of symbols-per-second that could be transmitted over that signal - which does indeed translate to a data rate. However, it now gets used in completely different contexts which have nothing to do with frequency range. Oh well... back to the story.)

If I want to ask for the identifiers of people named "Fred" (as opposed to something else I want to name with foaf:givenname), then I could use the query:
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?person WHERE {
?person foaf:givenname "Fred" .
?person a foaf:Person
}
Now what if the "type" data and the "name" data appear on different servers? In that case we would use a distributed query.

Using the HTTP/GET idiom I mentioned at the top of this post, I could send the query to the server containing the foaf:givenname statements, changing it to say:
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?person WHERE {
?person foaf:givenname "Fred" .
GRAPH <http://hostname/sparql/?query=
PREFIX+foaf%3A+%3Chttp%3A%2F%2Fxmlns.com%2Ffoaf%2F0.1%2F%3E%0A
CONSTRUCT+%7B%3Fperson+a+foaf%3APerson%7D+
WHERE+%7B%3Fperson+a+foaf%3APerson%7D> {
?person a foaf:Person
}
}
So now the server will resolve all the entities with the name "Fred", then it will retrieve a graph and ask it for all the entities that are a foaf:Person. Then it will join these results to create the final result.

But what happens if there are only 3 things named "Fred", but 10,000 people in the data set? In that case the server will resolve the first pattern, getting 3 bindings for ?person, and then make a request across the network, getting back 10,000 statements, which are then filtered for those that describe a foaf:Person (they all will), and only then does the join happen. Ideally, we'd have gone the other way, and asked the server with 10,000 people to request data from the server that had 3 entities named Fred, but we may not have known ahead of time that this would be better, and a more complex query could require a more complex access pattern than simply "reversing" the resolution order.
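The decision in this scenario can be reduced to a simple rule. The function below is a hypothetical planner sketch, not Mulgara's actual API: given cardinality estimates for each side of the join, it decides which server should ship its bindings to the other.

```python
# Hypothetical planner decision (illustrative names, not a real API):
# ship the smaller binding set to minimize bytes on the wire.
def choose_shipping_direction(local_count, remote_count):
    """Decide which side of a distributed join should move its data."""
    if local_count <= remote_count:
        return "ship-local-bindings"    # send our few rows with the query
    return "fetch-remote-results"       # the remote side is smaller

# 3 local "Fred" bindings vs 10,000 remote people: ship the 3 bindings.
direction = choose_shipping_direction(3, 10000)
```

Of course a real planner would weigh more than two numbers, but even this crude comparison avoids the worst case described above.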

First of all, we need a way to ask each server how large a set of results is likely to be. The COUNT function that is being discussed in the SPARQL WG at the moment could certainly be used to help here, though for the sake of efficiency it would also be nice to have a mechanism for asking for the upper limit of the COUNT. That isn't appropriate for a query language (since it refers to database internals) but would be nice to have in the protocol, such as with an HTTP/OPTIONS request (though I really don't see something like that being ratified by the SPARQL WG). But even without an "upper limit" option, a normal COUNT would give us what we need to find out how to move the query around.
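Probing a pattern's cardinality could look something like this. Note that the aggregate syntax shown is just one candidate form, since the WG hasn't settled on COUNT yet; treat it as an assumption rather than something every endpoint will accept.

```python
# Build a probe query that asks only for the cardinality of a pattern.
# The COUNT aggregate syntax here is an assumption (one candidate form
# under discussion in the WG), not guaranteed to work everywhere.
def count_probe(prefixes, pattern):
    """Wrap `pattern` in a COUNT query to estimate its result size."""
    return prefixes + "SELECT (COUNT(*) AS ?n) WHERE { " + pattern + " }"

q = count_probe("PREFIX foaf: <http://xmlns.com/foaf/0.1/>\n",
                '?person foaf:givenname "Fred"')
```

Sending this probe to each server before planning the join is what tells us which side holds 3 rows and which holds 10,000.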

So once we realize that the server running the query has only a little data (call it "Server A"), and it needs to join it to a large amount of data on a different server (call this one "Server B"), then of course we want Server A to send the small amount of data to Server B instead of retrieving the large amount from it. One way to do this might be to invert the query at this point, and send the whole thing to Server B. That server then asks Server A for the data, and sends its response. Unfortunately, that is both complex, and requires a lot more hops than we want. The final chain here would be:
  1. Client sends query as a request to Server A
  2. Server A reverses the query and sends the new query as a request to Server B
  3. Server B resolves its local data, and sends the remainder of the query as a request to Server A
  4. Server A responds to Server B with the result of entities with the name "Fred"
  5. Server B joins the data it got with the local data and responds to Server A with the results of the entire query
  6. Server A responds to the client with the unmodified results it just received

Yuck.

Instead, when Server A detects a data size disparity like this, it needs a mechanism to package up its bindings for the ?person variable, and send these to Server B along with the request. Fortunately, we already have a format for this in the SPARQL result set format.
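Producing that packaged-up binding set is straightforward. Here is a minimal serializer sketch; it only handles URI bindings of a single variable, where a real implementation would also cover literals, blank nodes, and multiple variables.

```python
# Minimal sketch: serialize bindings of one variable into the SPARQL
# result set XML format. URI bindings only; literals and bnodes omitted.
from xml.sax.saxutils import escape

def bindings_to_xml(var, uris):
    """Serialize URI bindings of `var` to a SPARQL Results XML document."""
    rows = "\n".join(
        '<result><binding name="%s"><uri>%s</uri></binding></result>'
        % (var, escape(uri)) for uri in uris)
    return ('<?xml version="1.0"?>\n'
            '<sparql xmlns="http://www.w3.org/2005/sparql-results#">\n'
            '<head><variable name="%s"/></head>\n'
            '<results distinct="false" ordered="false">\n'
            '%s\n'
            '</results>\n'
            '</sparql>' % (var, rows))

doc = bindings_to_xml("person", ["http://www.example/FredFlintstone",
                                 "http://www.example/FredKruger"])
```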

Normally, a query would be performed using an HTTP/GET, but including a content-body in a GET request has never been formally recognized (though it has not been made illegal), so I don't want to go that way. Instead, a POST would work just as well here. The HTTP request with content could look like this (I've added line breaks to the request):
POST /sparql/?query=
PREFIX+foaf%3A+%3Chttp%3A%2F%2Fxmlns.com%2Ffoaf%2F0.1%2F%3E%0A
SELECT+%3Fperson+WHERE+%7B%3Fperson+a+foaf%3APerson%7D HTTP/1.1
Host: www.example
User-agent: my-sparql-client/0.1
Content-Type: application/sparql-results+xml
Content-Length: xxx

<?xml version="1.0"?>
<sparql xmlns="http://www.w3.org/2005/sparql-results#">
<head>
<variable name="person"/>
</head>
<results distinct="false" ordered="false">
<result>
<binding name="person"><uri>http://www.example/FredFlintstone</uri></binding>
</result>
<result>
<binding name="person"><uri>http://www.example/FredKruger</uri></binding>
</result>
<result>
<binding name="person"><uri>http://www.example/FredTheDog</uri></binding>
</result>
</results>
</sparql>
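With Python's standard library, constructing a request like the one above is only a few lines. This is a sketch: the request is built but not sent, since www.example is a placeholder host, and the bindings document is elided.

```python
# Sketch of building the POST above. The query travels in the URL; the
# serialized bindings travel as the request body. Not sent, since
# www.example is a placeholder host.
from urllib.parse import quote_plus
from urllib.request import Request

query = ("PREFIX foaf: <http://xmlns.com/foaf/0.1/>\n"
         "SELECT ?person WHERE {?person a foaf:Person}")
body = b'<?xml version="1.0"?>...'  # the SPARQL result set XML (elided)

req = Request("http://www.example/sparql/?query=" + quote_plus(query),
              data=body,
              headers={"Content-Type": "application/sparql-results+xml"})
# urllib switches the method to POST automatically once a body is supplied.
```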

I can't imagine that I could be successful in suggesting this as part of the underlying protocol for federated querying, but I'm thinking that I'll be incorporating it into Mulgara all the same.