Wednesday, April 21, 2010

Work

I've had a number of administrative things to get done this week, since work will be taking a dramatic new turn soon. I've been missing working in a team, so that part will be good, but there are too many unknowns right now, including a visa nightmare that has been unceremoniously dumped in my lap. So, I'm stressed and have a lot to do. But that doesn't mean I'm not working.

Multi-Results


I'd recently been asked to allow the HTTP interfaces to return multiple results. One look at the SPARQL Query Results XML Format makes it clear that SPARQL isn't capable of it, but the TQL XML format has always allowed it (or at least, I think it has). The SPARQL structure is sort of flat, with a declaration of variables at the top, and bindings under it. The TQL structure is similar, but embeds it all in another element called a "query". That name seems odd (since it's a result, not a query), so I wonder if someone had intended to include the original query as an attribute of that tag. Anyway, the structure is available, so I figured I should use it.

This was a little trickier than I expected, since I'd tried to abstract out the streaming of answers. This means that I can select the output type simply by using a different streaming class. For now, the available classes are SPARQL-XML, SPARQL-JSON and TQL-XML, but there could easily be others. However, I now had to modify all of those classes to handle multiple answers. Of course, the SPARQL streaming classes had to ignore the extra answers, while the TQL class had to emit them all, but that wasn't too hard. Still, I came away feeling that it was somehow messier than it ought to have been. Even so, I thought it worked OK.

One bit of complexity was in handling the GET requests of TQL vs. SPARQL. In SPARQL we can only expect a single query in a GET, but TQL can have multiple queries, separated by semicolons. While I like to keep as much code as possible common to all the classes, in the end I decided that the complexity of doing this was more than it was worth, and I put special multi-query-handling code in the TQL servlet.
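To show why the multi-query case is not quite a one-line split, here is a minimal sketch of breaking a TQL request into commands. It assumes single-quoted literals with backslash escapes and URIs in angle brackets, either of which may contain a semicolon. The class and method names are hypothetical, and this is not necessarily how the TQL servlet does it.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits a TQL request into its semicolon-separated commands. */
final class TqlSplitter {

    static List<String> split(String text) {
        List<String> commands = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inLiteral = false; // inside '...'
        boolean inUri = false;     // inside <...>
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (inLiteral) {
                current.append(c);
                if (c == '\\' && i + 1 < text.length()) {
                    current.append(text.charAt(++i)); // keep the escaped character as-is
                } else if (c == '\'') {
                    inLiteral = false;
                }
            } else if (inUri) {
                current.append(c);
                if (c == '>') inUri = false;
            } else if (c == ';') {
                addIfNotBlank(commands, current); // a semicolon outside quotes ends a command
                current.setLength(0);
            } else {
                if (c == '\'') inLiteral = true;
                else if (c == '<') inUri = true;
                current.append(c);
            }
        }
        addIfNotBlank(commands, current); // a trailing command with no final semicolon
        return commands;
    }

    private static void addIfNotBlank(List<String> commands, StringBuilder sb) {
        String command = sb.toString().trim();
        if (!command.isEmpty()) commands.add(command);
    }
}
```

A SPARQL GET would skip this step entirely and treat the whole parameter as one query, which is part of why the code ended up living in the TQL servlet rather than in the shared classes.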

All of this was done a little while ago, but because I was waiting on responses on the mulgara.org move, I decided not to release just yet. This was probably fortunate, since I got an email the other day explaining that subqueries were not being embedded properly. They were starting with a new query element tag, but never closing it. However, these tags should not have appeared at this level at all. The suggested patch would have worked, but it relied on checking the indentation used for pretty-printing to find out whether the query element should be opened, which covers up the problem rather than solving it. A bit of checking, and I realized that I had code to send a header for each answer and code to send the data for the answer, but no code for the "footer". The footer would have been the closing tag for the query element, and this was being handled in other code, meaning that it only came up at the top level, and not in the embedded sub-answers. This in turn meant that it wasn't always matching up to the header. So I introduced a footer method for answers (a no-op in SPARQL-XML and SPARQL-JSON), which cleaned up the process well enough that avoiding the header (and footer) on sub-answers was now easy to see and get right.
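The header/data/footer split can be sketched roughly as follows. This is a minimal illustration and not the actual Mulgara classes: every class and method name is made up for the example, the XML element names are simplified, and there is no escaping.

```java
import java.util.List;

/** Hypothetical stand-in for a query answer; a cell may itself be an Answer (a subquery result). */
final class Answer {
    final List<String> variables;
    final List<List<Object>> rows;

    Answer(List<String> variables, List<List<Object>> rows) {
        this.variables = variables;
        this.rows = rows;
    }
}

/** One streamer per output format. */
abstract class AnswerStreamer {
    protected final StringBuilder out = new StringBuilder();

    /** Emits a top-level answer: the header, the data, then the matching footer. */
    final void emit(Answer a) {
        header(a);
        body(a);
        footer(a);
    }

    abstract void header(Answer a);

    abstract void body(Answer a);

    /** No-op by default, which suits formats with no per-answer wrapper element. */
    void footer(Answer a) { }

    String result() { return out.toString(); }
}

/** TQL-like output: each top-level answer is wrapped in a query element. */
final class TqlLikeStreamer extends AnswerStreamer {
    @Override void header(Answer a) { out.append("<query>"); }

    @Override void footer(Answer a) { out.append("</query>"); }

    @Override void body(Answer a) {
        out.append("<variables>");
        for (String v : a.variables) out.append('<').append(v).append("/>");
        out.append("</variables>");
        for (List<Object> row : a.rows) {
            out.append("<solution>");
            for (int i = 0; i < row.size(); i++) {
                String v = a.variables.get(i);
                out.append('<').append(v).append('>');
                Object cell = row.get(i);
                if (cell instanceof Answer) {
                    body((Answer) cell); // sub-answer: data only, no header or footer
                } else {
                    out.append(cell);
                }
                out.append("</").append(v).append('>');
            }
            out.append("</solution>");
        }
    }
}
```

Because the header and footer are only ever called as a pair from emit, they cannot get out of step, and a sub-answer goes straight to body so it never sees either of them.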

So was I done? No. The email also mentioned warnings about transactions not being closed. So I went looking at this, and decided that all answers were being closed properly. In confusion, I looked at the email again, and this time realized that the bug report said that they were using POST methods. Since I was only dealing with queries (and not update commands) I had only looked at the GET method. So I looked at POST, and sure enough it was a dog's breakfast.

Part of the problem with a POST is that it can include updates as well as queries. Not having a standard response for an update, I had struggled a little with this in the past. In the end, I'd chosen to only output the final result of all operations, but this was causing all sorts of problems. For a start, if there was more than one query, then only the last would be shown (OK in SPARQL, not in TQL). Also, since I was ignoring so many results, I wasn't closing any of the ones that needed it. This was particularly galling to have wrong, since I'd finally added SPARQL support for POST queries.

I'd really have liked to use the same multi-result code that I had for GET requests, but that didn't look like it was going to mix well with the need to support commands in the middle. In the end I copied/pasted some of the GET code (shudder) and fixed it up to deal with the result lists that I'd already built through the course of processing the POST request. It doesn't look too bad, and I've commented on the redundancy and why I've allowed it, so I think it's all OK. Anyway, it's all looking good now. Given that I also have a major bugfix from a few weeks back, I should get a release out the door even though the mulgara.org shuffle isn't done.
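The shape of the POST fix is roughly this: walk the list of results, emit the answers the output format can carry, and close every answer whether or not it was emitted. The following is only a sketch under my own assumptions; the class names are invented, and an "answer" here is just a named object that remembers whether it was closed.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical answer that holds a resource (such as a transaction) until it is closed. */
final class OpenAnswer implements AutoCloseable {
    final String name;
    boolean closed = false;

    OpenAnswer(String name) { this.name = name; }

    @Override public void close() { closed = true; }
}

final class PostResponder {
    /**
     * results holds one entry per command in the POST: an OpenAnswer for a
     * query, or any other object (here a String message) for an update.
     * Returns the names of the answers that would be streamed to the client.
     */
    static List<String> respond(List<Object> results, boolean multipleAllowed) {
        List<String> emitted = new ArrayList<>();
        try {
            List<OpenAnswer> answers = new ArrayList<>();
            for (Object r : results) {
                if (r instanceof OpenAnswer) answers.add((OpenAnswer) r);
            }
            // A single-result format (SPARQL) only gets the last answer; TQL gets them all.
            int first = multipleAllowed ? 0 : Math.max(0, answers.size() - 1);
            for (int i = first; i < answers.size(); i++) {
                emitted.add(answers.get(i).name);
            }
        } finally {
            // Close every answer, including the ones that were never emitted.
            for (Object r : results) {
                if (r instanceof OpenAnswer) ((OpenAnswer) r).close();
            }
        }
        return emitted;
    }
}
```

The finally block is the part that matters for the unclosed-transaction warnings: skipping an answer in the output must not mean skipping its cleanup.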

I didn't mention that major bug, did I? For anyone interested, some time early last year a race bug was avoided by putting a lock into the transaction code. Unfortunately, that lock was too broad, and it prevented any other thread from reading while a transaction was being committed. This locked the database up during large commit operations. It's not the sort of thing that you're likely to see with unit tests, but I was still deeply embarrassed. At least I found it (a big thanks to the guys at PLoS for reporting this bug, and helping me find where it was).
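For a feel of what "too broad" means here, consider this toy example. It is not the Mulgara transaction code or the actual fix, and all of the names are invented. It assumes the slow part of a commit only touches data that readers cannot see yet, so the lock is needed only for the moment the new state is published.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Toy store with a single committed value; purely illustrative. */
final class VersionedStore {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private String committed = "v0";

    String read() {
        lock.readLock().lock();
        try {
            return committed;
        } finally {
            lock.readLock().unlock();
        }
    }

    /** True while a writer holds the lock, i.e. while readers would block. */
    boolean readersBlocked() {
        return lock.isWriteLocked();
    }

    /** Too broad: readers are locked out for the whole of the slow flush. */
    void commitBroad(String next, Runnable slowFlush) {
        lock.writeLock().lock();
        try {
            slowFlush.run();
            committed = next;
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Narrower: the slow work runs unlocked, and the lock covers only the publish step. */
    void commitNarrow(String next, Runnable slowFlush) {
        slowFlush.run(); // assumed to touch only data readers cannot see yet
        lock.writeLock().lock();
        try {
            committed = next;
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

The narrow version is only safe if the original race is still covered by the smaller critical section, which is exactly the kind of thing that unit tests tend not to show.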

So before I get dragged into any admin stuff tomorrow morning (office admin or sysadmin), I should try to cut a release to clean up some of these problems.

Meanwhile, I'm going to relax with a bit of Hadoop reading. I once talked about putting a triplestore on top of this, and it's an idea that's way overdue. I know others have tried exactly this, but each approach has been different, and I want to see what I can make of it. But I think I need a stronger background in the subject matter before I try to design something in earnest.
