Tuesday, February 17, 2009

Resting

I've had a couple of drinks this evening. Let's see if that makes me more or less intelligible.

Now that I've added a REST-like interface to Mulgara, I've found that I've been using it more and more. This is fine, but to modify data I've had to either upload a document (a very crude tool) or issue write commands on the TQL endpoint. Neither of these was very RESTful, and so I started wondering if it would make sense to do something more direct.

From my perspective (and I'm sure there will be some who disagree with me), the basic resources in an RDF database are graphs and statements. Sure, the URIs themselves are resources, but the perspective of RDF is that these resources are infinite. Graphs are a description of how a subset of the set of all resources are related. Of course, these relationships are described via statements.

Graphs and Statements

So I had to work out how to represent a graph in a RESTful way. Unfortunately, graphs are already identified by their own URI, and this probably has nothing to do with the server the graph is on. However, REST requires a URL which identifies the host and service, and then the resource within it. So the graph URI has to be embedded in the URL, after the host. While REST URLs typically try to reflect structure in a path, encoding a URI inside a URL makes this almost impossible. Instead I opted to encode the graph URI as a "graph" parameter.
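
As a rough illustration of why a parameter is easier than a path, here is a minimal Python sketch that embeds a graph URI in a request URL. The endpoint address and the example graph URI are made up for the example; only the "graph" parameter name comes from the description above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def graph_url(endpoint, graph_uri):
    """Build a request URL that carries the graph URI as a 'graph' parameter."""
    # urlencode percent-escapes ':', '/' and '#', so the embedded URI cannot
    # be confused with the structure of the outer URL.
    return endpoint + "?" + urlencode({"graph": graph_uri})

# Hypothetical endpoint and graph URI, for illustration only.
ENDPOINT = "http://localhost:8080/data"
GRAPH = "http://example.org/graphs/people#main"

url = graph_url(ENDPOINT, GRAPH)
print(url)

# The server side can recover the original graph URI from the query string.
recovered = parse_qs(urlsplit(url).query)["graph"][0]
```

Every slash in the graph URI ends up escaped, which is why trying to reflect it as path structure gets ugly so quickly.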

Statements posed a similar though more complex challenge. I still needed the graph, so this had to stay. Similarly, the other resources also needed to be encoded as parameters, so I added these as well. This left me with two issues: blank nodes and literals.

Literals

Literals were reasonably easy... sort of. I simply decided that anything that didn't look like a URI would be a literal. Furthermore, if it was structured like a SPARQL literal, then it would be parsed as one, allowing a datatype or language to be included. However, nothing is ever really easy (of course), and I found myself wondering about relative URIs. These had never been allowed in Mulgara before, but I've brought them in recently after several requests. Most people will ignore them, but for those people who have a use for them, they can be handy. That all seems OK, until you realize that the single quote character ' is an unreserved character in URIs, and so the apparent literal 'foo' is actually a valid relative URI. (Thank goodness for unit tests, or I would never have realized that.) In the end, I decided to treat any valid URI as a URI and not a literal, unless it starts with a quote. If you really want a relative URI of 'foo' then you'll have to choose another interface.
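
The decision rule above can be sketched in a few lines of Python. This is not Mulgara's code: the function name, the return shape, and both regular expressions are assumptions made for the example, and the URI check is only an approximation of the RFC 2396 character set (escaped quotes inside literals are not handled either).

```python
import re

# Approximation of the characters allowed in a URI reference under RFC 2396,
# including the single quote; real validation is stricter than this.
_URI_REF = re.compile(r"(?:[A-Za-z0-9\-_.!~*'();/?:@&=+$,#]|%[0-9A-Fa-f]{2})*")

# A SPARQL-style literal: quoted text, then an optional @lang or ^^<datatype>.
_LITERAL = re.compile(
    r"""(['"])(.*)\1(?:@([A-Za-z]+(?:-[A-Za-z0-9]+)*)|\^\^<([^>]*)>)?""", re.S
)

def interpret(value):
    """Classify a parameter value as ('uri', ...) or ('literal', ...)."""
    if value[:1] in ("'", '"'):
        # Anything starting with a quote is a literal, even if it is also
        # a syntactically valid relative URI.
        m = _LITERAL.fullmatch(value)
        if m:
            return ("literal", m.group(2), m.group(3), m.group(4))
        return ("literal", value, None, None)
    if _URI_REF.fullmatch(value):
        return ("uri", value)
    # Whatever does not look like a URI falls back to a plain literal.
    return ("literal", value, None, None)
```

With this rule, interpret("foo") is a relative URI, while interpret("'foo'@en") is the literal foo tagged as English.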

Blank Nodes

Blank nodes presented another problem. Initially, I decided that any missing parameter would be a blank node. That worked well, but then I started wondering about using the same blank node in more than one statement. I'm treating statements as resources, and you can't put more than one "resource" into a REST URL, so that would mean referring to the same "nameless" thing in two different method calls, which isn't possible. Also, adding statements with a blank node necessarily creates a new blank node every time, which breaks idempotency.
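
The idempotency problem is easy to see in a toy model. The sketch below is a hypothetical in-memory graph, not anything from Mulgara: a missing subject or object (None) is filled with a freshly minted blank node, as in my first attempt.

```python
import itertools

class ToyGraph:
    """A set of triples where a missing position becomes a new blank node."""

    def __init__(self):
        self.triples = set()
        self._ids = itertools.count(1)

    def put(self, s, p, o):
        # Each missing (None) position gets its own brand-new blank node label.
        s = s if s is not None else f"_:b{next(self._ids)}"
        o = o if o is not None else f"_:b{next(self._ids)}"
        self.triples.add((s, p, o))

g = ToyGraph()

# PUT of a fully named statement is idempotent: repeating it changes nothing.
g.put("http://example.org/a", "http://example.org/p", "'x'")
g.put("http://example.org/a", "http://example.org/p", "'x'")
named_only = len(g.triples)

# PUT with a missing subject is not: every call adds another statement.
g.put(None, "http://example.org/p", "'x'")
g.put(None, "http://example.org/p", "'x'")
with_blanks = len(g.triples)
```

After the four calls the graph holds three statements: one for the named subject and two for two distinct blank nodes.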

Then what about deletion? Does nothing match, or does the blank node match everything? But doing matches like that means I'm no longer matching a single statement, which was what I was trying to do to make this REST and not RPC for a query-like command.

Another option is to refer to blanks with a syntax like _:123. However, this has all of the same problems we've had with exactly this idea in the query language. For instance, these identifiers are not guaranteed to match between different copies of the same data. Also, introducing new data that includes the same ID will accidentally merge these nodes incorrectly. There are other reasons as well. Essentially, you are using a name for something that was supposed to be nameless, and because you're not using URIs (like named things are supposed to use) then you're going to encounter problems. URIs were created for a reason. If you need to refer to something in a persistent way, then use a name for it. (Alternatively, use a query that links a blank node through a functional/inverse-functional predicate to uniquely identify it, but that's another discussion).

So in the end I realized that I can't refer to blank nodes at all in this way. But I think that's OK. There are other interfaces available if you need to work with blank nodes, and some applications prohibit them anyway.

Reification

Something I wanted to come back to is this notion of representing a statement as 3 parameters in a URL (actually 4, since the graph is needed). The notion of representing a statement as a URI has already been addressed in reification. However, I dismissed this as a solution here, since reifying a statement does not imply that the statement exists (indeed, the purpose of the reification may be to say that the statement is false). All the same, it's left me thinking that I should consider a way to use this interface to reify statements.

Methods

So the methods as they stand now are:
method / resource   Graph           Statement           Other
GET                 N/A             N/A                 Used for queries
POST                Upload graphs   N/A                 Write commands (not SPARQL)
PUT                 Creates graph   Creates statement   N/A
DELETE              Deletes graph   Deletes statement   N/A
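
To give a feel for the PUT and DELETE rows, here is a small Python sketch that builds (but does not send) the corresponding requests for a statement. The endpoint address and the "subject", "predicate" and "object" parameter names are assumptions for the example; only "graph" is named in this post.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint, for illustration only.
ENDPOINT = "http://localhost:8080/data"

def statement_request(method, graph, subject, predicate, obj):
    """Build a request that addresses a single statement in a graph."""
    # Parameter names other than 'graph' are assumed, not taken from Mulgara.
    query = urlencode({
        "graph": graph,
        "subject": subject,
        "predicate": predicate,
        "object": obj,
    })
    return Request(f"{ENDPOINT}?{query}", method=method)

args = (
    "http://example.org/graphs/people",
    "http://example.org/people/fred",
    "http://xmlns.com/foaf/0.1/name",
    "'Fred'",
)
create = statement_request("PUT", *args)
remove = statement_request("DELETE", *args)

# Passing either object to urllib.request.urlopen() would send it to a
# server listening at ENDPOINT.
```

Both requests address exactly the same URL, and only the method differs, which is the whole point of treating the statement as a resource.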

I haven't done HEAD yet (I intend to indicate if a graph or statement exists), and I'm ignoring OPTIONS.

I've also considered what it might mean to GET a statement or a graph. When applied to a graph, I could treat this as a synonym for the query:
  construct {?s ?p ?o} where {?s ?p ?o}
Initially I didn't think it made much sense to GET a statement, but while writing this it occurs to me that I could return a reification URI, if one exists (this is also an option for HEAD, but I think existence is a better function there).

Is There a Point?

Everything I've discussed here may seem pointless, especially since there are alternatives, none of it is standard, and I'm sure there will be numerous criticisms of my choices. On the other hand, I wrote this because I found uploading whole documents at a time to be too crude for real coding. I also find that constructing TQL commands to modify data is a little too convoluted in many circumstances, and that a simple PUT is much more appropriate.

So, I'm pretty happy with it, for the simple fact that I find it useful. If anyone has suggested modifications or features, then I'll be more than happy to take them on board.
