Friday, March 04, 2005

Databases 101
I had a very interesting day today, though unfortunately I'm not able to blog all of it. I'll probably say something about this in a few weeks' time.

In the meantime, I saw Bob in the lift this morning, which reminded me to return his book, plus a past thesis I'd borrowed from him. On my way up I met someone else who'd obviously been riding a bike. I asked where his bike is kept, and he explained that there is a lockup at the bottom of the building. Given the traffic and bus situation getting into the university, this might be a good option for me. I'd been worried about bringing a bike, as people will often cut security chains, and I really wouldn't want to lose my bike.

On returning Bob's book, I commented on how the structure of his chapter followed the history of the design (though not always the implementation) of Kowari. When first designing how a query should be evaluated, we essentially had a top-down evaluation pattern, but since Kowari constraints resolve sets, what we actually built was a bottom-up solution. Several of the efficiency techniques the chapter then describes have been applied in the query engine over the years, with semi-naïve evaluation coming about partly in the query layer, and more recently in my own rule engine.
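For anyone who hasn't met semi-naïve evaluation, here is a minimal sketch in Python. It is not Kowari code; the function name and the use of plain (source, destination) pairs are my own assumptions for illustration. It computes a transitive closure bottom-up, and on each round it joins only the facts that were new in the previous round against the base relation, rather than re-joining everything.

```python
def semi_naive_closure(edges):
    """Transitive closure of (src, dst) pairs using semi-naive evaluation.

    Hypothetical helper for illustration only.
    """
    # Index the base facts by source node so the join is a dictionary lookup.
    by_src = {}
    for src, dst in edges:
        by_src.setdefault(src, set()).add(dst)

    closure = set(edges)
    delta = set(edges)  # facts that were new in the previous round
    while delta:
        # Join only the new facts against the base relation.
        derived = {(s, d2) for (s, d) in delta for d2 in by_src.get(d, ())}
        # Keep just the facts we have not seen before; they drive the next round.
        delta = derived - closure
        closure |= delta
    return closure


chain = {("a", "b"), ("b", "c"), ("c", "d")}
closure = semi_naive_closure(chain)
```

A naive evaluator would re-derive every known fact on every pass. Here the loop stops as soon as a round produces nothing new.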

Bob suggested that I consider another chapter which describes magic sets. I've read about these, but only in generalities, with no one explaining how to implement them in practice. Bob explained that the implementation is rather easy, with the system creating temporary assertions which are used for further evaluations.
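My reading of the idea, as a rough Python sketch: for a query like reach(start, ?), first assert a temporary "magic" set of the nodes the query can possibly touch, then run the ordinary bottom-up rules guarded by that set. The function name, the edge representation and the choice of a reachability query are all my own assumptions, and the second step uses plain naive iteration to keep it short.

```python
def magic_reach(start, edges):
    """Answer reach(start, ?) over (src, dst) edges with a magic-set guard.

    Rewritten rules being evaluated (illustrative only):
      magic(start).
      magic(Z)    :- magic(X), edge(X, Z).
      reach(X, Y) :- magic(X), edge(X, Y).
      reach(X, Y) :- magic(X), edge(X, Z), reach(Z, Y).
    """
    # Step 1: build the temporary magic assertions, seeded by the query constant.
    magic = {start}
    frontier = {start}
    while frontier:
        frontier = {z for (x, z) in edges if x in frontier} - magic
        magic |= frontier

    # Step 2: bottom-up evaluation, restricted to facts guarded by magic(X).
    relevant = [(x, z) for (x, z) in edges if x in magic]
    reach = set(relevant)
    while True:
        new = {(x, y) for (x, z) in relevant for (z2, y) in reach if z == z2} - reach
        if not new:
            break
        reach |= new

    return {y for (x, y) in reach if x == start}


edges = {("a", "b"), ("b", "c"), ("x", "y"), ("y", "z")}
answer = magic_reach("a", edges)
```

The facts about x, y and z never enter the evaluation, which is the whole point: the temporary assertions give a bottom-up engine the focus of a top-down one.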

This is exactly what Andrew and I were discussing in order to make OWL reasoning more efficient. It would seem that I am doomed to reinvent the wheel, though in the context of a new type of database. At least some of my work has been original, though less of it than I initially thought. I guess this is no surprise, as I'd been pushing for more people at Tucana to read the literature on this and knew that I had to read more myself. It's just a shame that some of our "commercial realities" prevented the time for reading to be assigned, as it would have saved us in the long run.

One good thing to come from all of this is that all of the most advanced methods described in the literature have been independently implemented by us. This tells me that we were making the right decisions as we went. While we may have saved a lot of time by reading the literature and trying to implement what we found there, at least this way we really know what we are doing. Also, there were no guarantees that we would have converted the established techniques correctly, given the unusual nature of the Kowari structure.

OWL Discussion
Bob also wanted to ask me about blank nodes in RDF. I don't know if my explanation was adequate, but I gave it a go. My understanding of their need basically comes down to two things.

First, blank nodes are essentially an artifact of RDF/XML (and I think I've mentioned in the past how I feel about that particular encoding scheme). When writing XML it would be annoying and wasteful of space to have to find a unique name for EVERY unique tag in the document. Having the parser build some automatic identifiers (for internal use) for nodes which will not be re-used makes quite a bit of sense in this context.

The second reason comes down to my own experience with blank nodes. While every node in Kowari requires its own ID, only URI References and Literals take up space in the string pool. If every node were named, then the size of the string pool would blow out unmanageably. In fact, every possible implementation of an RDF database would also suffer from this. If you use unique names for every node, then these names will have to be stored somewhere.

There may be other reasons mentioned in the literature, but I haven't seen them yet.

This discussion led to the process of merging blank nodes, which everyone seems to agree is a bad idea (except when the same document gets re-loaded, but even then there is debate). The accepted method here seems to be to use the owl:sameAs predicate between the blank nodes to do the work.

That then led to a question of how to treat nodes in RDF which were declared to be the same as each other. Apparently, this question has come up on some UML mailing lists recently, and it demonstrates some of the problems of bringing pre-conceptions into a new domain.

The question started out with an assertion of two classes, A and B, and instances of each of these, a and b respectively. Class A declares n properties: Pa1...Pan, and class B has m properties: Pb1...Pbm. Instances a and b have assertions for each of these properties (in other words, a has n specific properties, which are all described by its class A. Similarly for b).

So what happens when we assert the statement: <A> <owl:sameAs> <B> ?

Some of the UML people expected that the properties of b should be copied onto a, and vice versa. I have no idea why they would have thought this, but anyone familiar with RDF will know that this is wrong. All m property declarations for B should be copied to A (and vice versa), but this does not affect the properties of any instances.

The main difference for the instances is that they each end up with new types:
<a> <rdf:type> <B>
<b> <rdf:type> <A>

After this is done, both a and b will be able to have properties of Pa1...Pan,Pb1...Pbm. This was already legal under the open world assumption of RDF, but now the properties are completely described.
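To make the distinction concrete, here is a small Python sketch of the inference described above. It treats triples as plain tuples of strings, and the function name and sample property values are hypothetical. It assumes owl:sameAs is symmetric, and it only ever adds rdf:type statements; no property values are copied between instances.

```python
RDF_TYPE = "rdf:type"
OWL_SAME_AS = "owl:sameAs"


def infer_types(triples):
    """Add the rdf:type statements implied by owl:sameAs between classes."""
    facts = set(triples)
    # Collect the sameAs pairs and make the relation symmetric.
    same = {(s, o) for (s, p, o) in facts if p == OWL_SAME_AS}
    same |= {(o, s) for (s, o) in same}
    # Iterate to a fixpoint so that chains of sameAs statements are followed.
    while True:
        new = {
            (inst, RDF_TYPE, other)
            for (inst, p, cls) in facts if p == RDF_TYPE
            for (c, other) in same if c == cls
        } - facts
        if not new:
            return facts
        facts |= new


data = {
    ("A", OWL_SAME_AS, "B"),
    ("a", RDF_TYPE, "A"),
    ("b", RDF_TYPE, "B"),
    ("a", "Pa1", "value-1"),  # hypothetical property values
    ("b", "Pb1", "value-2"),
}
result = infer_types(data)
```

After this runs, a is also typed as B and b as A, but a still has no value for Pb1, and b has none for Pa1.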

UML has numerous diagrams as a part of its standard, including the class description and the component diagrams. However, about the only things I ever see discussed or presented are class descriptions, particularly when UML is being compared to OWL. Perhaps it was this narrow view which led to this confusion between ABox and TBox data.

Kowari Background
I took up Janet's offer to come back for coffee this morning, but I discovered that no one was there today. So I returned to my own building for a coffee.

While lining up I met one of the staff members, named Richard Cole. He asked what I am researching, and when I told him, he asked if I knew about "Kowari". Isn't it a small world?

Richard has a project which measures how "connected" classes in a program are. He would like to store a lot of this data using RDF. Until this week he had been using his own implementation of an RDF store, but he discovered that the complexity was getting too much, and went looking for existing implementations. I don't know his exact selection criteria, but he ended up settling on Kowari. A fine choice. :-)

Richard had several questions about how Kowari works, and how to implement his data structures. I was only too pleased to go through this with him. As is usually the case, describing anything helps me to clarify the concepts in my own mind, so I find this process useful. That's one of the reasons I try to describe so much in my blog.

Speaking of blogs, Richard told me about a number of things he had been learning about RDF and Kowari from someone's blog that he had found. It was only later in the day that I discovered that the blog he was talking about was mine. :-)

I like to encourage people who are using Kowari or RDF. This has two effects. The first is that Kowari gets used more, which means RDF is used more. The second effect is that RDF is being used more, even when a non-Kowari system is in use. This serves to expand the Semantic Web, giving it more functionality, and getting it closer to the mainstream. I'm interested in this, because the resulting technologies are very useful (both because they are more fun, and they make things easier) and because the more semantic web applications there are, the more Kowari will be used.

So the more Kowari is used, the more RDF is used. The more RDF is used, the more Kowari is used. See? It all comes full circle. :-)

Today's lunch was spent attending the BioMOBY presentation by Mark Wilkinson that I described yesterday. I found the whole presentation to be quite interesting.

Bioinformatics suffers from a large number of databases which contain incompatible data. This data is all available on the net, but to take information from one database and compare it to data in another database is a difficult manual process, due to the disparity in the data formats. It can also be very difficult to find the type of data needed from all of the databases which are out there.

BioMOBY does a few things. First, the data from each database gets wrapped in some XML which describes what that data is. Note that the data is not pulled apart, so it would still need to be parsed by any system looking at it. However, this "wrapping" technique makes it relatively easy for a biologist to encode. Ease is important, as the participation of database owners is an important factor in this project.

The XML which wraps the raw data also provides a lot of metadata about the payload. This makes it possible for a system to recognise the type of data, where the data came from, what it might be transformable into, along with several other features, depending on the data.

Once a configuration is in place, and a server is capable of handing out this data, it then gets registered with the main BioMOBY servers. These servers act as a Yellow Pages index for all registered databases.

A BioMOBY client can query the index, asking for a particular type of data. The index will return a list of servers, and the types of data available from them. The client will connect to one of these, retrieve the information, and then ask the BioMOBY directory what it can do with this retrieved data. The index will respond with a number of operations along with servers which perform them, and again the client can select one of these from the list.
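The lookup loop described above can be sketched with a toy in-memory index. To be clear, this is NOT the BioMOBY API: every class, method, server name and data type below is invented for illustration, and a real client would talk to remote services rather than a dictionary.

```python
from collections import defaultdict


class ToyRegistry:
    """Hypothetical stand-in for a central index of providers and operations."""

    def __init__(self):
        self._providers = defaultdict(list)   # data type -> servers offering it
        self._operations = defaultdict(list)  # input type -> (operation, server, output type)

    def register_provider(self, server, data_type):
        self._providers[data_type].append(server)

    def register_operation(self, server, operation, input_type, output_type):
        self._operations[input_type].append((operation, server, output_type))

    def find_providers(self, data_type):
        return list(self._providers.get(data_type, []))

    def find_operations(self, data_type):
        return list(self._operations.get(data_type, []))


def walk(registry, data_type):
    """Follow the first offered operation at each step, as a user might."""
    path = [data_type]
    seen = {data_type}
    while True:
        operations = registry.find_operations(data_type)
        if not operations:
            return path
        _operation, _server, data_type = operations[0]
        if data_type in seen:  # avoid looping forever on circular conversions
            return path
        seen.add(data_type)
        path.append(data_type)


registry = ToyRegistry()
registry.register_provider("server-one", "TypeA")
registry.register_operation("server-two", "convert", "TypeA", "TypeB")
registry.register_operation("server-three", "annotate", "TypeB", "TypeC")

servers = registry.find_providers("TypeA")
path = walk(registry, "TypeA")
```

The client never needs to know in advance which servers exist. It only asks the index what is available for the data type it currently holds.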

The consequence of this is an interface where the user can start with one type of data, and seamlessly move through various transformations, picking up relevant data on the way. The significance is that this is done over a series of unrelated servers across the net. In other words, it is a semantic web application.

The key to making the data interoperate is describing it all with ontologies. The ontologies are all quite complex and broad, and have been built completely by the biology community. It makes sense that this was done by the community rather than the project itself, as databases can only be added when the owner writes the ontology to describe it, and then registers the database with the BioMOBY index.

The most surprising aspect of this project was that it was not using RDF. The model they use is a graph of nodes with directed, labeled arcs. Even the XML syntax duplicates a significant amount of RDF functionality. Similarly, the ontologies are in their own vocabulary. So I was surprised that they did not try to leverage off existing RDF tools.

After the presentation, I asked Mark where the project stands in relation to RDF. He explained that the BioMOBY syntax was developed independently because RDF was not widely known back when BioMOBY started. The BioMOBY syntax is also simpler than RDF/XML, making it easier for biologists to encode their systems.

However, now that RDF has become the prevailing standard, the BioMOBY system has internally migrated to RDF. To maintain the ease of use for the biology community the BioMOBY syntax is still in use, and a set of scripts is used to convert this into RDF and back.

Given that the system is now using RDF, I asked about scalability issues. They are nonexistent at the moment, as the entire dataset on the index server can be measured in megabytes. As a result, the whole system is stored in MySQL.

A future plan is to move the served data out of large "blobs" in strings, and into a parsed structure in RDF. Once this occurs, the system will be able to provide significantly more functionality to process the data, and at that point scalability will become a concern.

Another important consideration at this point is with the ethics associated with "human" data. For instance, a person may give permission for their cells to be used for cancer research, but under no circumstances may they be used for cloning research. This will need a complex security system which can provide authority information for each individual's record. The plan is to describe this with an ontology. After having worked on the security on Kowari, I know that this is a tough problem, and I thought that the use of an ontology was a very clever idea.

Once the system gets to this level of RDF complexity, scalability will become an issue. Mark then suggested that they would move on to Jena or something like it. I expressed concern at the scalability of Jena (sorry Brian, but it's true), and suggested Sesame or Kowari. Jena has the best RDF reasoning engine, I don't deny that, but the scalability is just not there.

With the security requirements that Mark will be facing, he is concerned about scalable reasoning on ontologies. Of course, this was my opportunity to explain that this is exactly what I'm researching now.

I'd like the opportunity to contribute to a project like this, whether Kowari were involved or not. Mark left me his card, so I'll get in touch with him again and see where it goes.

In amongst all of this I've been building iTQL to extract the rules as far as I can. I could take the easy way out and do simple queries, and join the results in Java, but I would end up making a large number of queries and writing a lot of Java code. It takes more effort to do all the work with iTQL, but with only a little thought a lot of work can be saved further down the track, and the program will be much faster.

Once I have that I'll be wrapping it all in Java code which will execute the queries through an ItqlInterpreterBean. That is coming next week.

It always makes me feel self-conscious to think about people actually reading my blog. After all, it's mostly for me. I know that others read it (often to my benefit), but it is convenient that I can usually forget that as I write.

Janet made a comment to me about reading my blog, and the captivating effect that a blog has when it concerns people you know. That had me thinking about my use of full names here. The long-time reader will note that I used to use initials for the people I worked with. I figured it would make sense for anyone who knew those people directly, and it would serve as a useful label for anyone I didn't know while preserving some anonymity.

Since leaving Tucana there has been no context for these labels, and I am regularly interacting with people from all over. At this point initials became useless.

I debated whether I should write a person's whole name here, or if I should leave people with their anonymity. I thought it might be worthwhile explaining why I've opted to use individuals' full names here.

There are a few reasons. First, providing context may well be useful for people who read this. For instance, when I was talking about ontologies and abductive reasoning, it was probably useful for people to know that I was speaking with Peter Bruza. Second, almost all of these people have staff pages, personal web pages, etc., and their presence on the web already lacks anonymity (shades of Scott McNealy here?). Finally, I realised that few others seem to care about these issues and full names are regularly used in other blogs.

If it bothers anyone, then just let me know and I'll stop using your name. I can even track down old entries and modify them accordingly.


Rob said...

Where: A owl:sameAs B
I assume: B owl:sameAs A

Is this always the case? Are there any instances where the owl:sameAs predicate is not commutative?

If there was a case where owl:sameAs wasn't commutative, I guess the relationship would be a (??mono??)morphism rather than owl:sameAs? - Andrae would have a lot of fun with this question :D.

Quoll said...

Don't worry. It's ALWAYS commutative. :-)