Thursday, May 21, 2009

Federated Queries

A long time ago, TKS supported federated queries, though the approach was a little naive (bring all the matches of triple patterns in to a single place, and join them there). Then a few years ago I added this to Mulgaraas well. I've always wanted to make it more intelligent in order to reduce network bandwidth, but at the same time, I was always happy that it worked. Unfortunately, it was all accomplished through RMI, and was Mulgara specific. That used to be OK, since RDF servers didn't have standardized communications mechanisms, but that changed with SPARQL.

More recently, I've started running across distributed queries through another avenue. While working through the SPARQL protocol, I realized that the Mulgara approach of treating unknown HTTP URIs as data that can be retrieved can be mixed with SPARQL CONSTRUCT queries encoded into a URI. The result of an HTTP request on a SPARQL CONSTRUCT query is an RDF document, which is exactly what Mulgara is expecting when it does an HTTP GET on a graph URI. The resulting syntax is messy, but it works quite well. Also, while retrieving graph URIs is not standard in SPARQL, most systems implement this, making it a relatively portable idiom. I was quite amused at the exclamations of surprise and horror (especially the horror) when I passed this along on a mailing list a few weeks ago.

The ease at which this was achieved using SPARQL made me consider how federated querying might be done using a SPARQL-like protocol. Coincidentally, the SPARQL Working Group has Basic Federated Queries as a proposed feature, and now I'm starting to see a lot of people asking about it on mailing lists (was people always asking about this, or am I just noticing it now?). I'm starting to think this feature may be more important in SPARQL, and think that perhaps I should have made it a higher priority when I voted on it. As it is, it's in the list of things we'll get to if we have time.

Then, while I was thinking about this, one of the other Mulgara developers tells me that he absolutely has to have distributed queries (actually, he needs to run rules over distributed datasets) to meet requirements in his organization. Well, the existing mechanisms will sort of work for him, but to do it right it should be in SPARQL.

Requirements

So what would I want to see in federated SPARQL queries? Well, as an implementer I need to see a few things:
  1. A syntactic mechanism for defining the URI of a SPARQL endpoint containing the graph(s) to be accessed.
  2. A syntactic mechanism for defining the query to be made on that endpoint (a subquery syntax would be fine here).
  3. A means of asking the size of a query result.
  4. A mechanism for passing existing bindings along with a query.


The first item seemed trivial until I realized that SPARQL has no standard way of describing an endpoint. Systems like Mulgara simply use http://hostname/sparql/, which provides access to the entire store (everything can be referred to using HTTP parameters, such as default-graph-uri and query). On the other hand, Joseki can do the /sparql/ thing, but also provides an option to access datasets through the path, and Sesame can have several repositories, each of which is accessible by varying the path in the URL.

The base URL for issuing SPARQL queries against would be easy enough to specify, but it introduces a new concept into the query language, and that has deeper ramifications than should be broached in this context.

The query that can be issued against an endpoint should look like a standard query, and not just a CONSTRUCT, as this provides greater flexibility and also binds the columns to variable names that can appear in other parts of the query. This is basically identical to a subquery, which is exactly what we want.

Bandwidth Efficiency

The last 2 items are a matter of efficiency and not correctness. However, they can mean the difference between transferring a few bytes vs a few megabytes over a network.

(BTW, when did "bandwidth" get subverted to describe data rates? When I was a boy this referred to the range of frequencies that a signal used, and this had a mathematical formula relating it to the number of symbols-per-second that could be transmitted over that signal - which does indeed translate to a data rate. However, it now gets used in completely different contexts which have nothing to do with frequency range. Oh well.. back to the story).

If I want to ask for the identifiers of people named "Fred" (as opposed to something else I want to name with foaf:givenname), then I could use the query:
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?person WHERE {
?person foaf:givenname "Fred" .
?person a foaf:Person
}
Now what if the "type" data and the "name" data appear on different servers? In that case we would use a distributed query.

Using the HTTP/GET idiom I mentioned at the top of this post, then I could send the query to the server containing the foaf:givenname statements, and change it now to say:
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?person WHERE {
?person foaf:givenname "Fred" .
GRAPH <http://hostname/sparql/?query=
PREFIX+foaf%3A+%3Chttp%3A%2F%2Fxmlns.com%2Ffoaf%2F0.1%2F%3E%0A
CONSTRUCT+%7B%3Fperson+a+foaf%3APerson%7D+
WHERE+%7B%3Fperson+a+foaf%3APerson%7D> {
?person a foaf:Person
}
}
So now the server will resolver all the entities with the name "Fred", then it will retrieve a graph and ask it for all the entities that are a foaf:Person. Then it will join these results to create the final result.

But what happens if there are only 3 things named "Fred", but 10,000 people in the data set? In that case the server will resolve the first pattern, getting 3 bindings for ?person, and then make a request across the network, getting back 10,000 statement which are then queried for those statements the describe a foaf:Person (they all will), and only then does the join happen. Ideally, we'd have gone the other way, and asked the server with 10,000 people to request data from the server that had 3 entities named Fred, but we may not have known ahead of time that this would be better, and a more complex query could require a more complex access pattern than simply "reversing" the resolution order.

First of all, we need a way to ask each server how large a set of results is likely to be. The COUNT function that is being discussed in the SPARQL WG at the moment could certainly be used to help here, though for the sake of efficiency it would also be nice to have a mechanism for asking for the upper-limit of the COUNT. That isn't appropriate for a query language (since it refers to database internals) but would be nice to have in the protocol, such as with an HTTP/OPTION request (though I really don't see something like that being ratified by the SPARQL WG). But even without an "upper limit" option, a normal COUNT would give us what we need to find out how to move the query around.

So once we realize that the server running the query has only a little data (call it "Server A"), and it needs to join it to a large amount of data on a different server (call this one "Server B", then of course we want Server A to send the small amount of data to Server B instead of retrieving the large amount from it. One way to do this might be to invert the query at this point, and send the whole thing to Server B. That server then asks Server A for the data, and sends its response. Unfortunately, that is both complex, and requires a lot more hops than we want. The final chain here would be:
  1. Client sends query as a request to Server A
  2. Server A reverses the query and sends the new query as a request to Server B
  3. Server B resolves its local data, and sends the remainder of the query as a request to Server A
  4. Server A responds to Server B with the result of entities with the name "Fred"
  5. Server B joins the data it got with the local data and responds to Server A with the results of the entire query
  6. Server A responds to the client with the unmodified results it just received

Yuck.

Instead, when Server A detects a data size disparity like this, it needs a mechanism to package up its bindings for the ?person variable, and send these to Server B along with the request. Fortunately, we already have a format for this in the SPARQL result set format.

Normally, a query would be performed using an HTTP/GET, but including a content-body in a GET request has never been formally recognized (though it has not been made illegal), so I don't want to go that way. Instead, a POST would work just as well here. The HTTP request with content could look like this (I've added line breaks to the request):
POST /sparql/?query=
PREFIX+foaf%3A+%3Chttp%3A%2F%2Fxmlns.com%2Ffoaf%2F0.1%2F%3E%0A
SELECT+%3Fperson+WHERE+%7B%3Fperson+a+foaf%3APerson%7D HTTP/1.1
Host: www.example
User-agent: my-sparql-client/0.1
Content-Type: application/sparql-results+xml
Content-Length: xxx

<?xml version="1.0"?>
<sparql xmlns="http://www.w3.org/2005/sparql-results#">
<head>
<variable name="person"/>
</head>
<results distinct="false" ordered="false">
<result>
<binding name="person"><uri>http://www.example/FredFlintstone</uri></binding>
</result>
<result>
<binding name="person"><uri>http://www.example/FredKruger</uri></binding>
</result>
<result>
<binding name="person"><uri>http://www.example/FredTheDog</uri></binding>
</result>
</results>
</sparql>

I can't imagine that I could be successful in suggesting this as part of the underlying protocol for federated querying, but I'm thinking that I'll be incorporating it into Mulgara all the same.

4 comments:

Anonymous said...

Hi Paul,

Having written a federation for Aduna/Sesame, I know the above approach generally will not scale that well. You have to consider the case when both the left and right side might be really large, but the resulting inner join maybe small.

You have to use prepared queries with bindings. This will allow the federation to pipline the binding results from the left side to the the right side, while the right side streams the matched results back. Otherwise you end up repeating the query - dwarfing the actual results.

Sesame v2 protocol does not support prepared queries and this hampers the ability to create an effective federation using it. Building the federation for Sesame drove most of the changes to the Sesame protocol for v3, including support for prepared queries.

If the SPARQL WG wants to support federation, it should start by supporting clear parameters bindings (different from variables) and prepared queries. That is what is lacking in the language and protocol in my opinion.

Paula said...

"You have to consider the case when both the left and right side might be really large, but the resulting inner join maybe small."OK, but how do you work around this issue? It is my understanding that the only way to get to the results of the inner join is to bring those two bring the data from both sides together. If one side is much smaller than the other side, then you obviously want to move the smaller data set over to the location of the larger set. But if they're both big, then you're stuck.... aren't you?

The only shortcut I can think of here is to only transfer over the bindings of the variables that participate in the join. So if the Left Hand Side (LHS) binds ?x and ?y and the Right Hand Side (RHS) binds ?y and ?z, then instead of sending all of the bindings for ?x and ?y from the location of the LHS to the location of the RHS, only the bindings for ?y need be sent. This could be significantly less data in some instances (and even if it's not, then at least you've cut down the data from 2 variables to 1, which is approximately half).

Or is there something here that I'm just not seeing at all?

"You have to use prepared queries with bindings."Sorry, but I don't really follow what you're saying here. I thought that this was exactly what I was proposing? What are you suggesting that's different?

Anonymous said...

Hi Paul,

Sorry about the late replay.

Reading over your post again - I see that we are indeed discussing a similar solution. Although, we are coming at it with technically different approaches, from a performance/effeciency point of view they are similar/equivalent.

We never got a chance to efficiently implement Sesame 3 native store's COUNT method and therefore were not able to achieve out of box performance with Sesame 3 Federation (needs warming up). However, we did get experience at evaluating distributed queries.

> But if they're both big, then you're stuck.... aren't you?

Yes, you have to pipeline one side to the other side.

> Or is there something here that I'm just not seeing at all?

You have the idea, by sending bindings instead of the graph you can potentially reduce communication. However, before you can send bindings, you have to have a server side prepared query that accepts bindings.

> I thought that this was exactly what I was proposing? What are you suggesting that's different?

Very similar (apologize for reading your post too quickly before), I am suggesting sending a stream of query bindings to a prepared query (and not a CONSTRUCT query).

Paula said...

I don't mind that you skimmed it the first time. I'm asking for that kind of treatment when I write too verbosely. :-)

Your comment about accepting bindings is well taken. We've have a mechanism in Mulgara for such things for a long time, but I have to glue this to an XML parser before I can implement it here. I was more-or-less thinking about it out loud before I committed to writing the code.

Thanks for your responses. They've been helpful.