Tuesday, May 11, 2004

iTQL Grammar
After a slow start today the grammar is coming along nicely. I've finally worked out what SableCC is doing with the grammar file. I now have code that successfully sees transitive predicates and an optional anchor for chaining.

Each expression in the grammar file gets turned into a class. The ITQLInterpreter class has methods which take a full expression and hand it off to the parser to convert it into a tree of objects, the types of which were defined by SableCC from the grammar. The code then passes this tree into methods which break it up and recursively call into methods which break it up further. These methods check the type of the object they were given (there will normally only be two or three possibilities) with instanceof tests and then extract sub-parts out of these objects with accessor methods that SableCC provided in the classes. Each sub-part is then sent off to another function which is designed specifically for its type, to be broken up in the same way. The methods at the bottom of this recursive stack build primitive ConstraintExpression objects and return them.
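A rough sketch of that instanceof-and-accessor walk might look like the following. This is not the Kowari code: the node classes (PConstraintExpression, AConstraint, AAndExpression, ATransitiveExpression) are hypothetical stand-ins for what SableCC would generate, and the walker returns plain strings where the real methods would build ConstraintExpression objects.

```java
// Hypothetical stand-ins for SableCC-generated node classes (names are assumptions).
abstract class PConstraintExpression {}

final class AConstraint extends PConstraintExpression {
    private final String subject, predicate, object;
    AConstraint(String subject, String predicate, String object) {
        this.subject = subject; this.predicate = predicate; this.object = object;
    }
    String getSubject() { return subject; }
    String getPredicate() { return predicate; }
    String getObject() { return object; }
}

final class AAndExpression extends PConstraintExpression {
    private final PConstraintExpression left, right;
    AAndExpression(PConstraintExpression left, PConstraintExpression right) {
        this.left = left; this.right = right;
    }
    PConstraintExpression getLeft() { return left; }
    PConstraintExpression getRight() { return right; }
}

final class ATransitiveExpression extends PConstraintExpression {
    private final AConstraint predicate;
    private final AConstraint anchor; // optional: null when no anchor was given
    ATransitiveExpression(AConstraint predicate, AConstraint anchor) {
        this.predicate = predicate; this.anchor = anchor;
    }
    AConstraint getPredicate() { return predicate; }
    AConstraint getAnchor() { return anchor; }
}

// Hypothetical walker: checks the node type, pulls out the sub-parts with
// accessors, and hands each sub-part to a method written for its type.
final class TreeWalker {
    static String build(PConstraintExpression node) {
        if (node instanceof AAndExpression) {
            AAndExpression and = (AAndExpression) node;
            return "and(" + build(and.getLeft()) + ", " + build(and.getRight()) + ")";
        }
        if (node instanceof ATransitiveExpression) {
            return buildTransitive((ATransitiveExpression) node);
        }
        if (node instanceof AConstraint) {
            return buildConstraint((AConstraint) node);
        }
        throw new IllegalArgumentException("Unexpected node type: " + node.getClass().getName());
    }

    static String buildTransitive(ATransitiveExpression t) {
        String chain = buildConstraint(t.getPredicate());
        AConstraint anchor = t.getAnchor();
        return anchor == null
            ? "trans(" + chain + ")"
            : "trans(" + chain + ", anchor " + buildConstraint(anchor) + ")";
    }

    // Bottom of the recursion: the real code would build a primitive constraint object here.
    static String buildConstraint(AConstraint c) {
        return "[" + c.getSubject() + " " + c.getPredicate() + " " + c.getObject() + "]";
    }
}
```

The optional anchor is modelled here as a nullable field, which is only one way of representing an optional production.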

I'm in the process of building a new ConstraintExpression class to take the transitive data and return it up the call stack. So long as I meet that interface it should go down to the lower query layers with no problems. Once it's there I can start the code that builds the chain.

If I'd stuck to it with no other distractions I could have had it done by today, but I'm pretty sure I can get through it for tomorrow.

Distributed Queries
TA forwarded this message from the RDF interest group mailing list. We've been looking at putting scalable distributed queries into the system, and I think he's interested in making sure we are aware of what others have been doing.

TKS/Kowari had a distributed query system that worked, but sub-optimally. It went through a phase where it wasn't supported but I'm pretty sure it's working fine at the moment. However, it has never been scalable, and we need to do that.

The type of distributed query described in the email above is what we have normally referred to as "load balancing". We don't support this and it's not yet on our roadmap, but DM and I have been keen to do it for a few years, and it may be time to consider it again. In the meantime we might give it a go in the open source code which we're designing. It wouldn't be in the first release of the code, but we could design it in so we can do it afterwards.

The Kowari/TKS idea of distributed queries allows data from differing data sources to be correlated. While this works now, it's done by asking models on each server for the relevant data, shipping it all back to the server that the client is connected to, and doing the joins there. This is sub-optimal, as it means moving far more data over the network than is necessary.

I looked at this a couple of years ago, but other priorities came up at the time. It looks like I have to pick it up again.

A contrived example of the kind of query we want to perform might be to ask for the licence plates of all people who have travelled to the Middle East recently, and who have had conversations with a known individual. The conversation data is stored by the NSA, the travel info is stored by the FBI, and the licence plates are with the DMV. We can assume that the NSA and FBI are allowed to see each other's data (see? It's really contrived!), and the DMV's data. But the DMV can't see anything from the two agencies. Note that security like this is a TKS feature, and is not available in Kowari. (We have to sell something, right?) If anyone wanted to then I guess it could be added to Kowari, since it is an Open Source project, but I doubt that will happen.

My 2-year-old plan for this went something like:

1. When we get a query, break it down into its constituent join operations on constraints, and work out which operations apply to which models. This creates subqueries for us which can be expressed as valid iTQL.

2. Any queries to be performed locally can be executed locally. If this forms the full query then return the data to whoever asked for it. Otherwise this result is kept to provide the numbers for step 3, and cached for a future call back to this machine during step 5.

3. Send out constraints to all servers holding the models in the query, and request the count. Constraints have an optional model, which can restrict which servers they get sent to. e.g. we only need to send a constraint like <$car isOwnedBy $person> to a model on the DMV server. This can be done as a variation on an iTQL sub-query where the select is changed to a "count".

4. Get back the count of all sub-queries from each server. Use this to make a decision of which server should do the joins. This is based on the following:
a. The server with the highest count is the nominally preferred server for joins.
b. If the winning server (with the highest count) has low bandwidth AND the
combined results for the join from all the other servers have a larger
count than that server's single result, then choose another server to do the
joins. This rule needs to be weighted depending on the bandwidth, and maybe
the processing power, of the server.
c. If the winning server does not have permission to see another server's data,
then go to the next most eligible server. e.g. the DMV server may not see
data from the NSA servers, but the NSA and FBI are both allowed to see the
DMV's data.
Part (b) is based on the idea of keeping the traffic moving on or off a slow machine to a minimum.

5. If the current machine is the winning server then send out all subqueries that aren't for models on this server to the other servers, and join the results when they are returned. If another machine is the winning server then send it the full query. That server will start at step 1.

This plan means that being a "remote server" and simply answering local queries is essentially the same operation. Step 4 keeps bandwidth usage to a minimum.
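The decision in step 4 could be sketched along these lines. Everything here is an assumption for illustration: the ServerInfo and JoinServerChooser names, the boolean lowBandwidth flag (the plan calls for a weighting, which this simplifies to yes/no), and the representation of permissions as a set of server names. It is not how Kowari or TKS implements anything.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Set;

// Hypothetical summary of one server's reply to the "count" sub-queries from step 3.
final class ServerInfo {
    final String name;
    final long count;            // size of this server's sub-query result
    final boolean lowBandwidth;  // assumed to be known, e.g. from configuration
    final Set<String> canRead;   // names of the other servers whose data it may see

    ServerInfo(String name, long count, boolean lowBandwidth, Set<String> canRead) {
        this.name = name;
        this.count = count;
        this.lowBandwidth = lowBandwidth;
        this.canRead = canRead;
    }
}

final class JoinServerChooser {
    // Returns the server that should perform the joins, or null if no server
    // is permitted to see all of the data involved.
    static ServerInfo choose(List<ServerInfo> servers) {
        long total = 0;
        for (ServerInfo s : servers) {
            total += s.count;
        }

        // Rule (a): prefer the server with the highest count.
        List<ServerInfo> ranked = new ArrayList<>(servers);
        ranked.sort(Comparator.comparingLong((ServerInfo s) -> s.count).reversed());

        ServerInfo fallback = null;
        for (ServerInfo candidate : ranked) {
            // Rule (c): skip servers that may not see another server's data.
            if (!maySeeAll(candidate, servers)) {
                continue;
            }
            // Rule (b): avoid a slow server when everyone else's combined
            // results are larger than its own.
            boolean outweighed = total - candidate.count > candidate.count;
            if (candidate.lowBandwidth && outweighed) {
                if (fallback == null) {
                    fallback = candidate; // still usable if nothing better turns up
                }
                continue;
            }
            return candidate;
        }
        return fallback;
    }

    private static boolean maySeeAll(ServerInfo candidate, List<ServerInfo> servers) {
        for (ServerInfo other : servers) {
            if (other != candidate && !candidate.canRead.contains(other.name)) {
                return false;
            }
        }
        return true;
    }
}
```

With the contrived agencies above, a DMV server holding the largest result would still be passed over because it cannot see NSA or FBI data, and the choice would fall to whichever of the other two ranks next.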

Kowari has the potential to do all of this, sans the security work.

In case it isn't clear that distributed queries in Kowari/TKS don't do what the email was talking about, note its comment about selecting the servers to perform the query. It should be clear from the plan above that knowing which stores we are using is essential, and the query doesn't make a lot of sense if we don't. So having a process choose which server to ask is not an option here.

If we are load balancing, then choosing the server to query makes sense. Hopefully this will eventually make it onto our roadmap.

Note that distributed queries (as I defined them above) and load balanced queries are not mutually exclusive. Indeed, we always envisaged that the two would work together.

Blogging
I started this blog as a way of keeping notes. I record what I'm doing at work, and I also record some other things I find of interest. It started as a .plan file, but blogging is de rigueur these days, so I transferred it exactly 2 weeks ago. I thought it might be handy for others at work (including AN who I'm working closely with at the moment).

It also crossed my mind that maybe there would be others interested in what was happening inside an open source project like Kowari. I didn't think that would be much of a concern, but I decided to refer to my colleagues by their initials, to protect their modesty if nothing else. So now you know why I refer to AN, DM, AM, and so on. :-) (AN wants to know who RMI is)

As for the non-open source parts of my work (for instance, on TKS), I figured it would just generate interest, and not give away "trade secrets". After all, even companies like Microsoft have internal developers blogging about their work on the OS, and MS claim that security on the OS source code base is highly important.

Now AN has had a hit tracker on his blog for some time, and in the past he has had interesting things to say about it, so I decided to give it a go too. Anyone visiting since yesterday should have seen the icon near the bottom of this page. It mostly accumulates statistics, but it does show the last 20 hits. Out of curiosity I looked at it after 24 hours, and I have to admit, for a 2-week-old blog I was surprised that anyone had looked at it at all! So, ummm, hi everyone. :-)
