Thursday, June 07, 2007

Another SemTech Day

Writing about SemTech is taking so long because I've had to do a lot of offline writing as well (not to mention my day job). Quite a bit of my writing was Mulgara related, so it'll end up here shortly - promise.

I'm still only halfway through recalling the conference. I suppose that's because a lot happened. More importantly, I met a lot of people, all of whom I wish I could have given more time.

For instance, Henry Story and I had a series of short discussions, involving RDF, Sun, and Mulgara. He was interested in learning more about Mulgara, but the lack of SPARQL was a major impediment for him. I know how important that feature is, so it's something I hope we have soon. Indy is working on that in his own time, so we may have something soon. However, Indy is on vacation at the moment, so I won't hear any more about it until he gets back. Henry also introduced me to one of the managers at Sun, pointing out that Mulgara was developed way back at the start of 2001, and was using NIO from the outset. He didn't seem to know much about RDF, but was interested in our early use of NIO. We had a brief discussion about 64 bit support (which we manage with NIO, but not without jumping through hoops). I was pleased to hear that Sun are planning on new APIs to address the 64 bit problem.

I kept trying to find time (over lunch for instance) when I could sit down and have a long and uninterrupted conversation with Henry, but somehow it never eventuated. Maybe next time. In the meantime, I now have a face to put with his emails. Somehow, having spoken to a person about their work gives their emails much more meaning.

Another person I wanted to speak with was Eyal Oren, due to his work on ActiveRDF. One of the places that Mulgara needs to be used is as the backend for web applications, and in this day and age that means Ruby on Rails. SOAP support (which we have) is only a stopgap measure, and doesn't cut it. We need a real API for non-Java languages. In fact, given the importance of RoR, I fully support an API specifically for Ruby, and ActiveRDF looks like the best option, particularly since it's starting to get wider use.

I saw Eyal from a distance on the first day, but I wasn't sure that the guy I was looking at was the same one I'd seen in Eyal's photo. Even when I realized it was definitely him, I always seemed too busy to give him the time and attention I wanted, so I held back. Finally, on the last day I introduced myself, and we had a productive conversation. The guys in ActiveRDF would like to get another store behind their API, so our goals are aligned there. Also, Eyal tells me that they only use simple SPARQL queries, so these might already parse fine as TQL (since Andrae added features to the parser to make it mostly compatible). We have someone new at fourthcodex who needs to learn about programming with and in Mulgara, so this may be one of the earlier tasks we set him (not the very first task - I'd like him to get some experience with simpler things first).

I have to get in touch with Eyal soon to make this happen.

Wednesday Morning

So after being given 2 beers and 1 (large) glass of red wine on Tuesday night (all over a period of 6 hours), I thought I'd be fine by morning. After all, I'd had much more on Monday night, with no ill effects (surprisingly). Yet, I woke up on Wednesday feeling a little "seedy". The beer was the same I'd had on Monday, so it must have been the red wine. I'll have to be more careful when someone offers me a glass like that (even if it was my boss).

The consequence was that I didn't feel like concentrating during the first session. I'd picked something on NLP. I've done a little work here, but it's not that relevant to me at the moment, so I decided to catch up on emails and RSS instead. I guess that turned out well, as this was when I learned about Danny's new job at Talis, before meeting the Talis guys just a little later that morning.

I ran into Henry again just before the next session, and we went into the presentation given by Susie Stephens, now from Eli Lilly (despite the hundred bios on the web which still say she's from Oracle) and Jeff Pollock from Oracle, who were discussing practical (and commercial) "Enterprise" semantic web systems. Features of various systems were discussed, such as Dartgrid mapping RDBMSs to one another (this had a nice accessible look to it), TopQuadrant's TopBraid IDE for integrating RDBMSs, and also a recap on the WebMethods acquisition of Cerebra to gain their RDF product (WebMethods had been looking at TKS before Tucana went the way it did). I think I was the only person in the room who hadn't realized that Jeff used to work for Cerebra.

Jeff also pointed out a number of commercial RDF systems, including the TKS system from Northrop Grumman, which left me bemused. We already have some improved features over TKS, but I'm looking forward to leaving it in our dust. Should be Real Soon Now. ;-)

Next came lunch, and another exhibit session. Fortunately, this was less crazy than the night before, so there was opportunity to take a break in order to eat. Unfortunately it conflicted with a couple of commercial sessions, and dragged into a standard conference session that I wanted to get to. These included a presentation from AllegroGraph that would have been interesting to see, and talks by both Eyal and Elisa Kendall from Sandpiper Software. Fortunately, I got to hear Elisa speak later, even if it was just a SIG.

Reverse NLP

The next session I made was called Applications of Analogical Reasoning and was by John Sowa from VivoMind Intelligence. I'd heard of John before, but wasn't sure what to expect in this session. Regardless, this was the only session of the whole conference that made me take a step back and say, "Wow".

John started out with the quote,
  A technique is a trick you use twice.
The idea is that analogies allow us to use tricks (or techniques) in multiple places that may not appear to be equivalent at first glance.

Analogical reasoning is the process of finding correlations for objects and relationships in separate ontologies. He provided an example of an analogy discovery system called VAE, which was applied to WordNet to compare the concept types of cat and car. The automatically generated result was:
  Cat        Car
  head       hood
  eye        headlight
  cornea     glass plate
  mouth      fuel cap
  stomach    fuel tank
  bowel      combustion chamber
  anus       exhaust pipe
  skeleton   chassis
  heart      engine
  paw        wheel
  fur        paint
The algorithms that search for analogies are helped by correlating paths of relationships, along with any similarities found along the way. So "paws" matched to "wheels" because they had a similar function, and there were 4 of each of them. The algorithms specifically cater for extra, or missing, elements along a path, with these elements being recorded as a "deviation". So the paths of:
Cat: mouth, esophagus, stomach, bowel, anus
Car: fuel cap, fuel tank, combustion chamber, muffler, exhaust pipe
matched, but received reduced confidence due to the esophagus not matching anything in the car, and the muffler not matching anything in the cat.

This is all given in much more detail in a paper written by John on his site.

Anyway, it all sounded good, but what can it be used for?

I didn't record all the details (so apologies if I get some of the specifics wrong), but the example given was the task of documenting a banking system in preparation for merging it with another system. There were some 30 years of development behind it, with source code in several languages, including Cobol and Java. Every stage of development had been extensively documented, building up over those 30 years to a corpus of 100MB of documentation. Consequently there was a need for new documentation in the form of:
  • A glossary.
  • Data flow diagrams.
  • Process architecture.
  • System context diagrams.
  • Data dictionary.
All of this was supposed to be done in about 6 weeks (a seemingly impossible task).

It is well understood that NLP isn't good enough today to understand the documentation to the extent required here. However, one of the developers working on this decided to approach the problem laterally. Instead of trying to parse the unstructured documentation, he adjusted the NLP system to parse the source code. Of course, the source code had to be parseable, or it wouldn't have been much good as source code, would it? The results were then put into a concept graph, which was combined with ontologies to provide the "analogy" for the 100MB of documentation, which could then be parsed with relative ease.

I love it when someone shifts my brain sideways with an idea like this.

Initially the developers asked the system to flag if the deviation between the concept graphs of the code and the documentation was greater than a couple of percent. These were then perused by hand. It didn't take long to decide that the system was proceeding just fine, and the deviation cut-off was increased to something significantly higher. However, a couple of items were still flagged, indicating possible contradictions that had to be manually reviewed to figure out why.

One such item started with the following pair of facts:
  • No humans are computers
  • All employees are human
The problem was flagged when the system tried to describe 2 employees as computers. After carefully going back through the system, it was discovered that 20 years earlier an employee in one department had received assistance from 2 computers. However, at the time there was no way to bill for computer time. The workaround was to name the computers "Bob" and "Sally" and enter them into the system as employees. The unintended consequence was that Bob and Sally had been issued paychecks for 20 years, none of which had ever been cashed!
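
Rendered in OWL (the talk used conceptual graphs rather than OWL, so this is just my own rough sketch, with made-up class names), the two facts plus the workaround data amount to a straightforward inconsistency:

  @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
  @prefix owl:  <http://www.w3.org/2002/07/owl#> .
  @prefix ex:   <http://example.org/bank#> .

  # "No humans are computers"
  ex:Human    owl:disjointWith ex:Computer .
  # "All employees are human"
  ex:Employee rdfs:subClassOf  ex:Human .

  # The 20-year-old workaround: machines entered as employees.
  # Each is now inferred to be a Human *and* asserted to be a Computer,
  # contradicting the disjointness axiom above.
  ex:Bob   a ex:Employee , ex:Computer .
  ex:Sally a ex:Employee , ex:Computer .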

So not only did this approach successfully document the system, it also discovered contradictions between the documentation and the implemented system. This was my "Wow" moment.

Evening

The keynote didn't really capture me that afternoon, so it was tempting when the others from fourthcodex invited me to lounge on the pool deck. But I stuck with it, and made it into the talk on Advanced OWL by Deborah McGuinness from Stanford and Elisa Kendall (whom I mentioned earlier).

When I first considered postgraduate research at UQ, I tried to get some idea of the research background of those who might be interested in supervising me. Bob was at the top of the list, and in looking him up I found that he had done a lot of work with IBM Almaden, DSTC, and Sandpiper Software. The last was specifically with Elisa Kendall. I was fortunate enough to be introduced to her on Tuesday evening, so I was pleased to get a chance to see one of her talks.

In this case, the talk was really a Special Interest Group (SIG) meeting where various details of OWL were discussed, rather than one of the formal "Sessions". I would have gained more if there had been some discussion of OWL 1.1, which I'm only just starting to look at seriously. However, I was already pretty comfortable with the various concepts in OWL that were being discussed.

The one point I picked up on was to be very careful about using owl:InverseFunctionalProperty (IFP), as it can lead to OWL Full. This is worth noting, because many modeling/schema environments want to use IFP to simulate the RDBMS concept of a "primary key" for an object. However, OWL-DL does not allow IFP on datatype properties, which is exactly where you want it if you're using it to indicate a primary key. The restriction is well documented for OWL-DL, but the whole "primary key" idea is likely to make you forget the rule in the heat of the moment, when you're just trying to make it work.
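
As an illustration (a minimal sketch with made-up property names), the object property below stays within OWL-DL, while the datatype property, which is exactly what a primary key usually wants, is the declaration that tips the ontology into OWL Full:

  @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
  @prefix owl:  <http://www.w3.org/2002/07/owl#> .
  @prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
  @prefix ex:   <http://example.org/schema#> .

  # Fine in OWL-DL: an inverse functional *object* property.
  ex:hasAccount a owl:ObjectProperty , owl:InverseFunctionalProperty .

  # The "primary key" pattern: an inverse functional *datatype* property.
  # Legal RDF, but it takes the ontology into OWL Full.
  ex:employeeNumber a owl:DatatypeProperty , owl:InverseFunctionalProperty ;
      rdfs:range xsd:string .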

I spent the rest of the time working out that I can build a rule for owl:intersectionOf using the TQL minus operator. Unfortunately, the resulting query is nested into several parts (of the form: A and B minus (C and (D minus E))), making me think that I should really write a resolver to do it easily. Algorithmically it's easy: once you have an intersection, you just have to check that an object has all of the member types in order to meet it. The hard part of a new resolver is usually the syntax for providing everything that it needs. In this case, how do I tell the resolver which graphs to query?
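
As a rough illustration of what such a rule has to establish (the class and instance names here are made up), an individual only qualifies for an intersection class once it carries every one of the member types:

  @prefix owl: <http://www.w3.org/2002/07/owl#> .
  @prefix ex:  <http://example.org/schema#> .

  # A class defined as the intersection of two others.
  ex:WorkingParent a owl:Class ;
      owl:intersectionOf ( ex:Employee ex:Parent ) .

  # A rule may only infer  ex:x a ex:WorkingParent  after confirming
  # that ex:x has *all* the member types; that check is where the
  # nested minus terms in the TQL query come from.
  ex:x a ex:Employee , ex:Parent .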

Zepheira

That night was a little soirée put on by Zepheira in a hotel suite backing onto the pool deck. I was invited by Bernadette, but told that it would be focused on potential clients, and that I might not want to stay long. Besides, it would only be running for 90 minutes.

Ha!

The party included people of all ilks engaged in everything from flirting to deep technical discussions on semantic technologies. The latter by the younger, single set, and the former by at least half the room. A couple of times I noticed both going on in the same conversation. The 90 minute limit came and went, and the party went on until sometime around midnight. I got to meet Dave Beckett (mentioned several blog entries ago), several W3C notables, and a number of people from all over, each with their own contribution to make to the semantic web. It was a lot of fun.

Due to the exhibit that day, I was still wearing my fourthcodex shirt. After the exhibition sessions I knew that fourthcodex had products providing something unique and useful, and there had been some buzz around what we were offering. But I was surprised when I seemed to be accorded minor celebrity status in some areas of the room. I was even cornered by one potential client who wanted to grill me over the features of our products. Luigi would have been thrilled.

Oracle

I also got to speak with Susie Stephens about the Oracle RDF implementation.

Susie's a lovely girl, and I enjoyed the conversation. She is also on a horrible conference circuit at the moment, having spent the previous 4 weeks going from conference to conference, and was about to head to Austria for another one the following week. She'd been getting a couple of days at home to water the plants every few weeks, and then she was off again. It seems a little rough of Eli Lilly to put Susie through this immediately after joining them, but she seemed to be coping. She had my sympathy nonetheless.

I have lamented before that Oracle had the opportunity to do this right, but instead their implementation has become an advertisement for using Mulgara. The reason is that the implementation is done entirely at the application layer. TKS/Kowari/Mulgara was developed specifically because we discovered that this approach didn't work, and we reasoned that a storage layer designed specifically to handle triples, rather than a table of arbitrary arity, could be much more efficient. I even recall discussing this with DavidM in the Toowong food court, back in 2000. <Insert nostalgia here>

Now Oracle have pretty much captured the scalable data storage market, so they should have the know-how and resources to build the appropriate structures that would make an RDF system fly. But instead they built it at the application layer. My question was, "Why?"

Susie explained that she was the one who built the RDF layer. She's a biologist, and she built it as a tool for another project she was working on at the time. The RDF layer is actually built on Oracle's Network Database layer (which is also built at the application layer). So while it offers nice abstraction, this is where the scalability fails.

Susie didn't initially anticipate that her code would go into wide circulation. However, it was apparently in the right place, at the right time, and even showcased the Network Database layer. So the powers-that-be picked it up and ran with it. The rest is history.
