Monday, May 10, 2004

iTQL Syntax
Unfortunately today was procrastination city. Mondayitis I guess. I'll have to improve on that. I used some of that time to include a couple of extra links on this page.

In the meantime, I have been working on the iTQL syntax. SableCC is causing frustrations, mostly because I haven't taken the time to learn how to use it properly. I think it's coming together though. I have most of it right, but it's complaining about a stack conflict. I think that's because I have it trying to differentiate between triples that look like:
  <resource> <resource> <variable>
  <variable> <resource> <resource>

Maybe I should just go with what I have and check for errors in the query code. Just tried it and it compiles. I might just stay with that and move on.
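
For what it's worth, the "check for errors in the query code" approach could look something like the sketch below. It is Python rather than the real Java/SableCC code, and everything in it is an assumption for illustration: the function names, the regular expressions for resources and variables, and especially ALLOWED_SHAPES, since I haven't decided which shapes the query code should actually accept. The idea is only that the grammar takes any element in any position, and the shape gets checked afterwards.

```python
import re

# Assumed lexical forms: <uri> for a resource, $name for a variable.
RESOURCE = re.compile(r"^<[^<>\s]+>$")
VARIABLE = re.compile(r"^\$[A-Za-z_][A-Za-z0-9_]*$")

# Hypothetical list of shapes the query code will accept. The two entries
# are the shapes from the post; the real set may well be larger.
ALLOWED_SHAPES = {
    ("resource", "resource", "variable"),
    ("variable", "resource", "resource"),
}


def classify(element):
    """Return 'resource' or 'variable' for one element of a triple."""
    if RESOURCE.match(element):
        return "resource"
    if VARIABLE.match(element):
        return "variable"
    raise ValueError(f"not a resource or variable: {element!r}")


def check_triple(elements, allowed=ALLOWED_SHAPES):
    """Validate a parsed triple after the (permissive) grammar accepted it."""
    if len(elements) != 3:
        raise ValueError(f"expected 3 elements, got {len(elements)}")
    shape = tuple(classify(e) for e in elements)
    if shape not in allowed:
        raise ValueError(f"unsupported triple shape: {shape}")
    return shape
```

The grammar stays simple and conflict-free this way, at the cost of reporting shape errors a little later than the parser would.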

Coding Standards
There was some dispute here today over coding style and standards. Some of the standards are causing headaches for some, and their attempts to keep things in order for themselves are causing inconsistency hassles for others. It's agreed that some changes should be made, so I made some email contributions on my own preferences, and hopefully we'll come up with a reasonable compromise that will help everyone code faster and more effectively.

Redland and SQL
More architecture discussions with DM. He's interested in looking at Redland, and working to that interface, giving it something faster than MySQL behind it. Seems interesting, though I haven't had time to look at it before now. I'll go through the architecture document in my copious free time. JRDF might still be easier for a first cut.

Looking at Redland took me to this page by Sergey Melnik. Since this page was compiled ages ago it shows me just how much I should have been reading the mailing lists in the past. I'm better now, but I don't tend to go through the archives very much. Anyway, I found this list interesting, as we looked at database implementations over 3 years ago before rejecting them and building our own.

Building Kowari/TKS and writing some MySQL tests before that has given me a reasonably good feel for the data and how it would fit into a database, so I was curious as to how other people felt this should be done (I say "felt" because it was last updated a few years ago. Maybe the current state of thinking has moved beyond this). I've yet to see what schema Redland is using.

I wasn't all that happy with Brian McBride's "Explicit models" schema, though he admitted that he is inexperienced with SQL. At face value it looks like it would need way too many joins for most queries. Simplicity makes these things scale, improves speed, and reduces space as well. More importantly, no indexes are used, and that is the key to managing this type of data.

The Jonas Liljegren "Specs loyal" schema has more promise in terms of simplicity, though still no indexes. I don't get the need to have an "id" field in each of the tables... unless he wants to reify the statements. This seemed to be the case with the "fact" field. We decided against reification for Kowari a long time ago, and I was under the impression that most implementations didn't worry about it anymore. I approve of the "resources" table which collapses all the data types into the one table. That reduces the join complexity against the "statements" table dramatically. Interesting inclusion of a "prefix" table as well, though I'm not sure I like the idea of it being a cache, as these can get out of sync. I picked up on it, as efficient usage of namespace prefixes has started to come to my attention recently with the RDFS rules.

Sergey Melnik's "Hashed with origin" schema finally includes indexes! In fact, this is the schema that most closely resembles Kowari, so I'll look at it a little more closely.

I'm not keen on the split up of resources, namespaces, literals and models. This means that you need to join numerous tables for pretty much any query. There's a reasonable argument to pull literals out into another table, but the others should all be in the one table, along with a type field (though types aren't REALLY necessary, as resources, namespaces and models are all represented with URIs, and RDF statements can differentiate if it's really needed). Since literals can be huge then they might go off to another table, but they would be in the general resource table with a type describing what they are and a foreign key to link it to the literals table.
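
To make that layout concrete, here is a minimal sketch of it. It uses Python's sqlite3 module with an in-memory database purely as a stand-in for MySQL, and every table, column and function name is my own invention for illustration, not something taken from Melnik's schema or from Kowari. Resources, namespaces and models all live in one "resources" table with a type field, and a literal is a row in that same table carrying a foreign key into a separate "literals" table.

```python
import sqlite3

# Hypothetical schema: one general resource table, literals split out.
SCHEMA = """
CREATE TABLE literals (
    id    INTEGER PRIMARY KEY,
    value TEXT NOT NULL
);
CREATE TABLE resources (
    id         INTEGER PRIMARY KEY,
    type       TEXT NOT NULL,   -- e.g. 'uri', 'model', 'literal'
    uri        TEXT,            -- NULL for literals
    literal_id INTEGER REFERENCES literals(id)
);
CREATE TABLE statements (
    model     INTEGER NOT NULL REFERENCES resources(id),
    subject   INTEGER NOT NULL REFERENCES resources(id),
    predicate INTEGER NOT NULL REFERENCES resources(id),
    object    INTEGER NOT NULL REFERENCES resources(id)
);
"""


def build():
    """Create an empty in-memory database with the schema above."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    return db


def add_resource(db, rtype, uri=None, literal=None):
    """Insert a resource row (and its literal, if any); return its id."""
    literal_id = None
    if literal is not None:
        cur = db.execute("INSERT INTO literals (value) VALUES (?)", (literal,))
        literal_id = cur.lastrowid
    cur = db.execute(
        "INSERT INTO resources (type, uri, literal_id) VALUES (?, ?, ?)",
        (rtype, uri, literal_id),
    )
    return cur.lastrowid


def objects_of(db, model, subject, predicate):
    """Return object values (URI or literal text) for a bound m/s/p."""
    rows = db.execute(
        """
        SELECT COALESCE(l.value, o.uri)
        FROM statements st
        JOIN resources o ON o.id = st.object
        LEFT JOIN literals l ON l.id = o.literal_id
        WHERE st.model = ? AND st.subject = ? AND st.predicate = ?
        """,
        (model, subject, predicate),
    )
    return [row[0] for row in rows]
```

Only the lookup of the final value touches the literals table; everything else resolves through the single resources table.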

The indexes are also a little inadequate, though they look like they indicate usage patterns (meaning that some types of queries don't have to be as efficient as we know that the data is structured in a particular way). In Kowari we are able to perform the function of several SQL-type indexes with a single type, but after expansion they essentially translate into MySQL as:

 - idx_model (model)
 - idx_model_subject (model,subject)
 - idx_model_subject_predicate (model,subject,predicate)
 - idx_model_subject_object (model,subject,object)
 - idx_model_subject_predicate_object (model,subject,predicate,object)
 - idx_model_predicate (model,predicate)
 - idx_model_predicate_object (model,predicate,object)
 - idx_model_object (model,object)
 - idx_subject (subject)
 - idx_subject_predicate (subject,predicate)
 - idx_subject_object (subject,object)
 - idx_subject_predicate_object (subject,predicate,object)
 - idx_predicate (predicate)
 - idx_predicate_object (predicate,object)
 - idx_object (object)


Due to the structure of our indexes, this actually collapses into 6 indexes (eg. idx_subject, idx_subject_predicate and idx_subject_predicate_object are all performed by one index). In fact, the statements themselves are the index, so the storage is reasonably efficient. As you can see above, the MySQL equivalent would take up a lot more space.
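
The collapse from 15 to 6 can be checked mechanically. The sketch below is Python and is only an illustration: the six column orderings in ORDERINGS are an assumption (three rotations of subject/predicate/object with the model last, and three with the model first), and the function names are made up. An index sorted on a full ordering can answer any lookup whose bound columns form a prefix of that ordering, so the check is whether every one of the 15 column combinations is the prefix of some ordering.

```python
from itertools import combinations

COLUMNS = ("model", "subject", "predicate", "object")

# Assumed orderings, for illustration only.
ORDERINGS = [
    ("subject", "predicate", "object", "model"),
    ("predicate", "object", "subject", "model"),
    ("object", "subject", "predicate", "model"),
    ("model", "subject", "predicate", "object"),
    ("model", "predicate", "object", "subject"),
    ("model", "object", "subject", "predicate"),
]


def covering_ordering(columns, orderings=ORDERINGS):
    """Return the first ordering whose prefix is exactly these columns."""
    wanted = frozenset(columns)
    for ordering in orderings:
        if frozenset(ordering[:len(wanted)]) == wanted:
            return ordering
    return None


def all_column_sets():
    """Every non-empty combination of the four columns (15 in total)."""
    return [c for n in range(1, len(COLUMNS) + 1)
            for c in combinations(COLUMNS, n)]


if __name__ == "__main__":
    for cols in all_column_sets():
        print("idx_" + "_".join(cols), "->", covering_ordering(cols))
```

Six is also the minimum for this scheme: there are six distinct pairs of columns, and each ordering has only one two-column prefix, so fewer than six orderings must leave some pair uncovered.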

It might be fun to take the above and compare its performance to Kowari. In fact, DM and I agreed that there are a few metrics that we really need to test on Kowari. One of them is the ability to handle thousands of concurrent connections. It's supposed to do it, but until we test it we can't say that it does. I'm thinking of a simple test on the numbers.rdf test data, which takes random numbers and queries for their factors. It can then add all their factors, and see if any results are prime, getting the string representation of a result if it is. Simple code, and would test the data reasonably well.
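
The shape of that test is roughly the sketch below, in Python for brevity. Nothing here talks to Kowari: factors() and str() are stand-ins for what would be queries against numbers.rdf, the thread pool stands in for real concurrent connections, and the names and the default sizes (50 clients, 20 queries each, numbers up to 1000) are arbitrary assumptions rather than a test plan.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def factors(n):
    """Stand-in for querying the store for the factors of n."""
    return [d for d in range(1, n + 1) if n % d == 0]


def is_prime(n):
    """Simple trial-division primality check."""
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))


def one_client(seed, queries=20, limit=1000):
    """Simulate one connection: random numbers, factor sums, prime check."""
    rng = random.Random(seed)  # seeded so a run can be repeated exactly
    found = []
    for _ in range(queries):
        n = rng.randint(2, limit)
        total = sum(factors(n))
        if is_prime(total):
            # Stand-in for fetching the string representation from the store.
            found.append(str(total))
    return found


def run(clients=50):
    """Run all simulated clients concurrently; one result list per client."""
    with ThreadPoolExecutor(max_workers=clients) as pool:
        return list(pool.map(one_client, range(clients)))
```

A real run would swap the stand-ins for actual queries and push the client count into the thousands, which is the number we actually care about.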

The other SQL schema approaches aren't worth going over.

Models
AN sent an email around about someone wanting to use lots of models with Kowari. This was always the intent, as the field that is now our model was kind of our answer to reification and security (among other things). However, it's now just used as a model field, and most data sets have few models. Some recent optimisations have made this assumption of having few models, which left me feeling uncomfortable about unforeseen consequences. I guess this might get us reconsidering the idea of numerous models before we stray too far away from a system that handles them quickly and efficiently.
