Thursday, May 31, 2007

Random Software

We take a break from our regular conference schedule to think about software.

Random Numbers and π

During a talk given at the conference by a friend, calculating the digits of π was used as an example of computational complexity. This got me musing about something I did several years ago, when I demonstrated the Monte Carlo technique for someone by showing how it could be used to calculate the value of π.

The idea is pretty simple. Take a square with a side of length 2, and draw a circle of radius 1 inside, so it touches all 4 sides. The area of the square is therefore 4, and the area of the circle is π. This means that the ratio of the total area (the square) to the circle's area is 4:π, or 1:π/4.

The Monte Carlo technique is to select random points, and determine the ratio of points inside the circle to the total number of points. This gives an approximation of the ratio of the areas. The more points you have, the better the approximation. Eventually, with an infinite number of points, the value of the ratio will approach π/4.

Picking a random point is trivial, as you just need values for x and y which are between -1 and 1. This gives a square which is centered on the origin. The point is then within the circle if:
  x² + y² < 1

Actually, we can simplify a little further if we just pick the top right quadrant. This means that x and y are now between 0 and 1. The area of the quadrant of the square is 1, and the area of the quarter circle is now π/4, which is the same ratio.

So I knocked up a short program to compare these ratios and see how well it converges. I did it in just a couple of minutes, so rather than test for convergence, I printed out every 10th iteration so I could eyeball how it was going. This manual step was interesting, as it helped me to see what the data was doing.

An important aspect of doing this is the "randomness" of the Random Number Generator (RNG). For an approximate answer you don't need a particularly good one, but if you want the result to get better over time then you want some important properties. For instance, you want a flat distribution over the space, meaning that every possible output has the same probability. You also want to ensure that for any two observed sequences of n outputs from the RNG, the probability of the (n+1)th output being the same in both sequences is negligible. This is another way of saying that you don't want cycles. There are a lot of other properties, but these are the first two that come to mind.

Unfortunately, the random number generators in POSIX aren't that great. They are in fact pseudo-random number generators (PRNGs). The rand(3) function is actually terrible, so it was supplemented by the rand48(3) family, and eventually random(3) (and the associated srandom(3)). However, these are all based on pseudo-random functions. This means that they are cyclical functions, with only some of the internal state returned as output (otherwise the output would encode the entire state of the function, meaning you could calculate the next number in the series - making it even less random). While they look sort of random, the distribution often leaves a lot to be desired.
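
A crude way to eyeball that flatness property is to bucket a large number of outputs and see whether the counts come out roughly even. The sketch below does just that for random(); the bucket count and sample size are arbitrary choices of mine, and a serious test would compute something like a chi-squared statistic rather than relying on eyeballing.

#include <stdio.h>
#include <stdlib.h>

#define BUCKETS 16
#define SAMPLES 10000000L
#define MAX_RND 0x7FFFFFFFL

/* Count how many outputs of random() land in each of BUCKETS equal ranges.
   A flat distribution should give roughly SAMPLES/BUCKETS per bucket. */
int main(void) {
    long counts[BUCKETS] = {0};
    long width = (MAX_RND / BUCKETS) + 1;
    long i;

    for (i = 0; i < SAMPLES; i++)
        counts[random() / width]++;

    for (i = 0; i < BUCKETS; i++)
        printf("bucket %2ld: %ld\n", i, counts[i]);
    return 0;
}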

The Linux /dev/random driver does its best to accumulate entropy in the system, from timing and content of keyboard/mouse actions, hard drive delays, and network packet timing, but that is usually an easily exhaustible pool of data, and so this is typically just used to "seed" one of the PRNGs. (I once used an email encryption program that just used the raw data from this device. Every so often it would lock up while waiting on "random" data. I could restart it again by wiggling the mouse, hence providing /dev/random with some more data.)
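
The usual seeding pattern looks something like the sketch below: pull a few bytes from the kernel's pool and hand them to srandom(3). I've used /dev/urandom here so the read never blocks, and kept the error handling minimal.

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

/* Seed the POSIX PRNG from the kernel's entropy pool. */
int main(void) {
    unsigned int seed = 0;
    int fd = open("/dev/urandom", O_RDONLY);

    if (fd < 0 || read(fd, &seed, sizeof(seed)) != sizeof(seed)) {
        perror("reading /dev/urandom");
        return 1;
    }
    close(fd);

    srandom(seed);
    printf("seeded with %u, first value: %ld\n", seed, random());
    return 0;
}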

A better alternative is an external device that generates "true" random data by measuring thermal noise, background radiation, or something similarly obscure. However, to my knowledge these are generally only used in high end security applications.

The immediate thing that I noticed was that the ratio (×4) approached 3.14 very quickly, but then it started diverging again. No matter how many points were added, it would fluctuate as far out as 3.09 and 3.2.

My first thought was that the accuracy of my intermediate values may have truncated some bits. So I moved the integer calculations of the random number up to 64 bits (making sure I didn't cut off any high order bits by accident), and moved the floating point calculations out to 128 bits. This is far more than is actually required, but at least I could now guarantee that I wasn't truncating anything.

The resulting code isn't very good, but in my defense I was trying to pay attention to the talk that I mentioned at the beginning:
#include <stdio.h>
#include <stdlib.h>  /* for random() */

#define FREQ 10
#define MAX_RND 0x7FFFFFFFULL  /* largest value returned by random() */

int main(void) {
    long all;                   /* total points generated so far */
    long in = 0;                /* points inside the quarter circle */
    long double x, y;
    long double pi;
    unsigned long long maxRndSq = MAX_RND * MAX_RND;

    for (all = 1; ; all++) {
        x = random();
        y = random();
        /* scaled by MAX_RND: inside the circle when (x/M)^2 + (y/M)^2 < 1 */
        if (((x * x + y * y) / maxRndSq) < 1) in++;
        if (!(all % FREQ)) {
            pi = (long double)(4 * in) / all;
            printf("%Lf\n", pi);  /* %Lf, since pi is a long double */
        }
    }
    return 0;
}
Despite all this, the numbers didn't converge any better than they did when I was using fewer bits of accuracy. There may be other reasons, but I believe that it's just because the PRNG I was using is not really random enough.

Thinking along these lines, it occurred to me that this might be a simple and easy test to check if a PRNG (or even an RNG) is any good. Of course, you'd automate the test to check for:
a) Convergence
b) Convergent value

Convergence is a little awkward, since the value fluctuates around a bit before settling down. But if you check the deltas over time it wouldn't be hard.

Testing for the convergent value is easy, since we already know the value for π. Speaking of which, I counted this the other day and realized that I only know 27 decimal places. It occurred to me that if I can get up to 35, then I will know the same number of places as my age. Then I only need to remember one digit every year to keep up. By the time I'm an old man it might even be kind of impressive (not 67,000 digits impressive, but impressive nonetheless). David calls me a geek.
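
Getting back to the test idea: a rough sketch of an automated version might look like the following. It reuses POSIX random(), and the checkpoint interval and tolerance are arbitrary values of my own choosing, so treat it as an illustration of the two checks rather than anything statistically rigorous.

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define CHECK 1000000L    /* how often to look at the estimate */
#define EPSILON 1e-4L     /* arbitrary tolerance for both tests */
#define MAX_RND 0x7FFFFFFFULL

int main(void) {
    unsigned long long maxRndSq = MAX_RND * MAX_RND;
    long double x, y, estimate, previous = 0.0L;
    long all, in = 0;

    for (all = 1; ; all++) {
        x = random();
        y = random();
        if (((x * x + y * y) / maxRndSq) < 1) in++;

        if (!(all % CHECK)) {
            estimate = (long double)(4 * in) / all;
            /* (a) convergence: successive estimates stop moving */
            if (fabsl(estimate - previous) < EPSILON) {
                /* (b) convergent value: compare against the known constant */
                int ok = fabsl(estimate - M_PI) < EPSILON;
                printf("converged to %Lf after %ld points: %s\n",
                       estimate, all, ok ? "looks good" : "biased RNG?");
                return ok ? 0 : 1;
            }
            previous = estimate;
        }
    }
}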

(Thanks go to Bruce Schneier for his books describing PRNGs for me all those years ago).

Interviews

A little while ago I had an idea for some audio software that might clean up some of the digital distortion that radio broadcasters have to put up with when interviewing people remotely. I'd forgotten about it until I listened to the recent Talis interview with David.

This interview was done over Skype, and recorded with Call Recorder for Skype, from Ecamm Network. Leaving aside some of the unfortunate ambient sound at the Talis end, the recording is still of poor quality, specifically because of some of the audio degradation which occurs occasionally on the sound coming from David. This degradation is an essential compromise for a real-time conversation, but because anyone downloading the podcast is not getting the data in real time, this is not a compromise they should be forced to make.

The solution that came to me is to have a Skype plugin at both ends (this should be easy to arrange for the interview). The plugin then does local recordings at both ends, and incorporates out-of-band timestamps into the audio data stream. Once the interview is over, the plugin then negotiates a file transfer from the interviewee to the interviewer. The process at the interviewer's end can now merge the two local recordings, and voilà, you have a recording with neither participant sounding like they are speaking across a digital chasm.

Anyone want to build this? I can think of a dozen radio stations that need it. :-)

Tuesday, May 29, 2007

Microformats

After the keynote, my first session for the day was on Semantic Microformats, given by Uche Ogbuji of Zepheira. After having had my first exposure to microformats in the form of GRDDL only the day before, I figured that this talk would provide some more info, which it did.

Everything you need to know about microformats can be found on the web, but attending a talk like this was exactly the kind of thing I needed. A talk like this can save hours of going through websites learning specifications, when a simple example can provide 99% of what I need.

RDF from Atom

Next I followed David down to a talk by Bradley Allen from Siderean Software, entitled: A Semantic Web Without RDF/XML: Building RDF Applications in Atom. This was held in the only conference room on the ground floor, so I was a little late after taking a few minutes to find it.

Bradley had a number of things to say, but I'd summarize it as saying that it is too hard to go out there and create RDF for everything, but we have a lot of perfectly good Atom out there to work with. He gave a brief rundown on how Atom can be easily converted to RDF, and how this then forms the basis of a semantic web (i.e. everything already marked up with semantic [RDF] data) with no need to rewrite any data.

The main problem with the current definition of the semantic web (as opposed to the general field of "semantics") is that it requires a lot of metadata to be provided by content authors. There are a number of attempts to automate this by trying to extract the metadata with natural language processing, statistical analysis, even fancy greps, but this is still in its infancy. Then there are ways to provide hints (blatant or otherwise) to metadata extractors, which is what microformats can provide. The approach of using Atom is even more pragmatic, in that it is using metadata that already exists, and is easily convertible to the semantic web's standard structure.

I'm sure Siderean see their approach as being much more advanced than I characterize here, but that was the general gist of the talk. (Apologies to Bradley, but it was over a week ago now, and my notes are sparse).

Syndicated OWL

My favorite talk of the day was from Christian Halaschek-Wiener from MINDSWAP. The thing I like about MINDSWAP is that their theoretical work gets expressed really neatly and efficiently in Pellet. This is one of the many reasons I'm interested in getting Pellet integrated into Mulgara (something I have yet to discuss on this blog - don't worry, we still need a rules engine as well!)

Christian had a lot to get through, and I'm kicking myself for missing the first 5 minutes. (Not only was that annoying for me, but it's rude to the presenter). All the same, the talk was fascinating, and has much wider application than the domain of syndication.

Essentially, syndication with OWL means that you're bringing in data from a number of different places, usually at different times. More importantly, it means that you will be getting updates to your information. This has ramifications for any inferences you've made, or non-inferable queries you wish to make.

It turns out that most of the state that Pellet works out in performing its calculations is applicable for small deltas on the TBox (and ABox) data. This means that only a few things need be recalculated with each update. Christian showed reasoning timing graphs starting with 10,000 statements (I think that was the number), and then over a range where new statements were added. The previous version of Pellet took about the same amount of time (with small increases) as the new statements came in. In contrast, the newer version of Pellet took about the same time for the original data, but had a massive drop in time for the calculations when new data was added. In other words, it wasn't having to recalculate the state for the majority of the system, and was only having to deal with the differences.

I've thought about and discussed change management in OWL for a few years now, but I never put any effort into it. Christian's work [PDF] looks to provide a solid foundation for this. Well done Christian.

I'd really like to learn more about what he's done, and see if it only applies to tableaux reasoning, or if there is some applicability to rules inferencing (since Mulgara wants both). I suppose that will happen in my Copious Free Time.

SemTech Keynote

Day two had "Communities of Practice" sessions during breakfast. The Pragmatic Rule Standards may have been interesting, but I was more interested in breakfast, especially after the previous night. (Fortunately I didn't have a headache. Will wonders never cease?)

Straight after breakfast there was a keynote session with 5 presenters. I can't quite recall their order, but it started with Robert Shrimp from Oracle.

Oracle

Robert made a presentation which basically said, "You should use semantic technologies". Maybe that's harsh, but I didn't find it very compelling. Someone next to me (I'd love to say who, as he's well known) asked if I recognized the slides, and indeed they looked just like the corporate presentations made by Oracle in the 80's and early 90's about how and why you should use relational databases. The only apparent difference was the labeling. I don't believe that Oracle have made great headway in the field of semantics (or even semantic stores), so it was always going to be hard for Robert to make this presentation.

OMG

Richard Mark Soley from the OMG made a point of saying that he wanted to "upset everyone" with his controversial statements. However, he wasn't that controversial. Instead he seemed to use this to cover his lack of confidence in his claims, which I think was a shame. After all, if his statements were to conflict with anyone's views then he had the fallback position of claiming that he'd intended just that, and had in fact given prior notice. I do the same thing (even in this blog) but it didn't seem an appropriate position for the Chairman and CEO of a group like the OMG.

NASA HQ

Andrew Schain of NASA Headquarters spoke for a short while. I remember listening to what he said, but unfortunately none of it stuck. I fear that was more my fault than his.

One thing that I did note is that they are using a system called XACML-DL [PDF] for security descriptions. I've heard about using ontologies to describe security for a while, so I was quite interested in this. Of course, the paper I've linked to has Bijan as an author. That guy seems to be everywhere I stick my nose into (and sometimes chopping it off by pointing out my ignorance). Mindswap have done work for NASA in the past, so I'm not surprised to see that XACML-DL was a proposal from them.

Yahoo!

The two that caught my attention in this session were Dave Beckett from Yahoo! and Tom Ilube of Garlik.

Dave Beckett is a name that has been around in the Semantic Web for a long time now. I'm sure anyone following this blog would have already heard of him. Personally, I associate him mostly with his Redland RDF library for storing and accessing RDF, the Raptor RDF syntax parser, and the Rasqal query library. Whenever I've gone long enough that I might have forgotten about Dave, a new version of one of these libraries shows up on the Semantic Web mailing list. The last one was an update of Redland and the Redland language bindings just 3 weeks ago. I've often considered using these libraries, but the fact that they are written in C is a problem when everything else you distribute is in Java. Still, it's supposed to be good stuff.

Dave's appearance was of interest to me, because of the position of Yahoo! vs. Google (I don't need to create a link for that second company, do I?). Google overwhelmingly have market share for web search and several other services, which will naturally force their competition to look for that competitive edge.

Interestingly, Google have always favored syntactic search, and have often denigrated the field of semantic search (especially the operation of semantic extraction). While they have a point that the field is in its infancy, I don't think it will stay that way forever. Google claims that they meet everything the market requires with their syntactic technologies, and from one perspective they do. However, the purely syntactic approach that Google employs is not suitable for every purpose. Links which are not on the first page of a Google search are almost never seen. I think the statistics claim that almost no-one ever goes beyond a few pages. Yet, I have seen many searches which return a lot of uninteresting results and the pages I know exist somewhere are impossible to find. It is only when I can remember (or guess) a word specific to the page in question that I can find these specific links, and even then they can be hidden in amongst many other unrelated pages.

People are happy to accept that picking the right search terms is the way to interact with Google, but I think that this is a result of being trained to think this way over the last 10 years. Certainly, 10 years ago Google was offering far and away the best search engine, with more relevant hits than anyone else. Over time, Google have done well to grow with the internet and deal with increasingly sophisticated and blatant attempts at manipulating their search results. However, computing capability has increased over time, and the internet has expanded exponentially. It seems reasonable that we should be able to consider new approaches.

While Google is offering an incredibly powerful service (one that I use all the time), it is still a service that would benefit from the inclusion of semantics. As the web continues to grow, the difficulty in finding relevant information will only get worse. Google have publicly been reluctant to work with the semantic web, which (at face value) appears naïve. On the other hand, Yahoo! have been willing to look at this area. Obviously the fledgling stage of the state of the art makes it impractical for general deployment, but there are still a lot of useful things you can do using RDF and OWL. The title of Dave's presentation said it all: A Little Semantics Can Go a Long Way.

Dave's job as "Chief Semantic Yahoo" is to make a lot of these things happen for Yahoo! They have several things happening already, and these will expand over time while new systems are also developed and deployed. It sounds like a lot of challenging (read: fun) work. Dave finished his short talk with a droll invitation to apply for a job. For about half a second I thought it was just supposed to be amusing, before I realized that the size and scalability of Yahoo! must require a lot of good people in this area, and they're probably quite hard to find. Unfortunately for Dave, this conference had a low ratio of developers present.

Garlik

All on its own, the title of this talk was enough to make me pay attention: Billions of Triples and Counting. Geez, is everyone scaling this high now?

Tom's presentation didn't have a lot of technical detail to it, except to say that his business was reliant on having this much information, and after getting significant investment he'd managed to pull it off. The inspiring part for me was that Garlik was able to find that much venture capital in a field that until recently had not attracted much interest from the VC community. This and subsequent talks demonstrated that semantics are fast becoming a field that VCs find acceptable, and are even starting to actively pursue.

A point of interest was that Tom said that they'd raised 11 million pounds in funding, which just about knocked everyone off their chairs. I later heard that he misspoke, and that the figure was actually in dollars. Even so, that is an impressive accomplishment in a market that until recently knew nothing about semantic technology.

I was also interested to learn later that one of the reasons for the scalability of Garlik's triple store is that it is not indexed extensively. They only have a need for very simple queries, as opposed to the features of complex analysis that stores like Mulgara provide. With this in mind, the ability to scale that high makes a lot more sense.

Just because Garlik doesn't use extensive indexing, it doesn't mean Mulgara can't scale the same way too. It appears that we could be doing just that in the foreseeable future, if we make appropriate use of clustering. I've been haranguing Brian with proposed architectures for clustering, and their scalability justifications. I hope to get all this together and blog about it in the coming days.

Monday, May 28, 2007

Day 2

Once again, I'm writing a lot of my own thoughts here, rather than just trying to keep a track of everything that happened at the conference. If you're interested in that stuff, then I'm sure lots of people wrote about it. For me, I found that it got me thinking about a lot of things, which is the main reason for me to write.

Random Thoughts

This doesn't mention any sessions at all. I just had a series of thoughts on various topics as they got mentioned at the conference, and thought I'd note them down.

AllegroGraph

My last couple of posts mentioned AllegroGraph and immediately started diving into the Mulgara implementation. One justification for this is that such a well advanced system inspired me to examine where Mulgara is, and where we want to take it. However, on some level I confess to feeling a little jealous. After all, we had the chance to be even more advanced than this, but lost it. However, we still seem to have a few unique features, and have more coming (which I'm sure you'll hear about soon).

From a philosophical point of view, having high quality open source infrastructure will help enable the Semantic Web to progress in the same way that it helped the Web and Web 2.0 to develop. Even when a project demands access to some of the features that only commercial vendors provide, an open source alternative can help bootstrap a project in its early phases, and provides competition for all to benefit from.

So while I may feel a little jealous, I'm hoping this will help inspire me to get Mulgara to do more. It can't just be Andrae and I though. Fortunately, this conference appears to have helped us there. We may be getting some more help from places like Topaz, but I'm also hoping to lift the profile of the software. After all, the more people who use it, the more likely it is that someone will need to scratch an itch and submit some code!

Codex/DLV

The core of the fourth codex suite is the Codex system. This is the new commercial name for OntoDLV [PDF]. This blog isn't an advertisement for my current employer, but it's worth mentioning the system all the same, as I find it interesting.

One of the features of Codex is to translate its ontology language into a program that can be run on DLV, a disjunctive logic reasoner. DLV offers an interesting set of features, including closed world reasoning, true negation along with negation as failure, non-monotonicity, and disjunctions in the head of a rule. Some of these are the opposite of OWL reasoning (closed world, negation as failure, non-monotonicity) while the disjunctive feature is completely orthogonal. Building an ontology language on top of this reasoner creates a very different kind of modeling environment to OWL. Interestingly, the guys in Italy have recently been looking at the consequences of importing OWL queries into the "rules" part of the OntoDLV language. This would make for an interesting complement of features.

OWL is built the way it is because of the interactions with the World Wide Web that it was designed to describe. An open world assumption is vital on the web, as things are being added all the time. Being monotonic may not work perfectly (as different sources may disagree) but is still important, as there is no way to assert priority of one source over another. The lack of a unique name assumption is also important, as it is very common that things are referred to by more than one name. In fact, it occurs to me that having a truly open world assumption means that you must not have a unique name assumption, particularly if there is an equality or equivalent operation in a language. Now that I think about it, this makes it seem strange that many description logics have an open world assumption, but also presume the unique name assumption.

Contrary to the properties of the web, the corporate environment is often quite different. It may be because "the enterprise" environment is a consequence of decades of training to conform to existing systems, such as relational databases, or the type of modeling needed in this environment is inherently different. Whichever way it is, corporations model, collect, and think about data in a particular way, most of which doesn't work well with OWL. Records are often taken, over and over again, with the same structure each time (yes, corporations may have several nearly identical databases with different record structures, and OWL can help here. But bear with me). Entities have been allocated unique identifiers, and do not need to be "declared" to be distinct. Most importantly, if a company does not know about data, this almost invariably means that such data does not exist. (e.g. an employee record cannot exist until it has been entered by someone authorized to do so. Accountants don't want to know about possible future employees).

While listening to people at the conference I started to recognize that people want each of these different properties, depending on their environment. It seems that Codex has something to offer many people, especially with its ability to import OWL data.

Reasoning

Another thought that I had while listening at the conference was that OWL reasoning is often done in a semi-closed way. Inferences are usually made in a way that accepts the possibility of unknown facts (open-world), but it can necessarily only reason on the data that it knows. This has a closed world flavor for me, though it is definitely kept consistent with the open world model.

There are lots of things that might be true that we don't consider when reasoning. No one ever declares that all individuals are in fact distinct (when it is known that they are). Classes are never declared to be disjoint, unless there is a specific reason for it (such as sharing a common superclass).

I'm not saying that people make inferences that presume unique names, nor that they do anything which would be invalid were two things revealed to be identical. The math behind valid inferences is too solid for that. But people see reasoning happen on the limited data they provide, and tend to think of that data as being their entire Universe of Discourse (I've wanted to use that phrase again for years). Consequently, people are repeatedly making closed world presumptions, and wonder why inferences and calculations don't come up with the answers they expected.

We also never see the "possible worlds" that a current model allows for. Given the infinite nature of the open world presumption, these possibilities might surprise people. In some cases, the model may prove to be completely inconsistent with reality, indicating that more modeling information is required for the system to be really useful. I know that people like Ian and Bijan know how all this works intimately, but most people are surprised when they see owl:cardinality apparently violated, so this sort of consideration is way beyond where most people are thinking.

I suppose I had this thought partly because almost no-one declares every owl:Thing as owl:differentFrom every other distinct owl:Thing. Similarly, classes are rarely declared as owl:disjointWith unrelated classes. Sure Man and Woman may be declared disjoint, but I've yet to see Vehicle and BillingCenter get declared as disjoint from one another. This becomes more noticeable when you consider the differences between a closed world reasoner (like Codex) and an open world one. It's not that an open world ontology is any less expressive, it's just that you need more information to encode the same scenario. The funny thing is that people often have this information, but they don't put it in. Consequently, open world reasoners don't come up with as many inferences - not because they are less capable, but because they don't know as much about their system as a closed world reasoner does.

More AllegroGraph and Mulgara

This blog is all about helping me to remember stuff. The fact that anyone else is reading it is just a little too strange for me to remember most of the time. Please remember this when I go on and on about something I'm trying to write about, or when I'm a little too candid for my own good. It's more the former that concerns me in this post.

One important architectural item of AllegroGraph that I forgot to note was the description of their index writer. Richard mentioned that writing to the graph is fast, and that it plays "catch up" in the background while keeping the data live the whole time. If you perform a query during this period, then part of the query will be fast, and the "unindexed" data will come in more slowly. This is so close to the Mulgara XA2 design that I just had to mention it.

Aside from making XA2 faster to read the indexes, we really want to make them faster to write. Reading is actually fine, and I've yet to hear of any index-related problems in reading speed (there have been problems at various times, but they usually come from data type issues).

The fastest way to write triples is to simply write them down in the order they were given to you. You can't do a lot with this, but you can do this faster than any process can practically give them to you. This is the limit we can aim for.

Our original plan (based on skip lists) was going to accept writes in exactly this way. Indeed, this is exactly what you have to do with skip lists, as they are a write-once structure. This means that any new data has to be accumulated before it can be added to the index. Any subsequent query first uses the index (with efficient log(n) complexity) and then adds in the results of a linear search through the "writable" file. A background thread then merges this file in with the main index (into a separate file) and then brings the new file online. This is actually how Lucene does it.
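
To make the shape of that concrete, here is a toy illustration of the pattern (my own simplified types, nothing from XA2 or Lucene): queries do a binary search on the sorted index and then a linear scan over the unsorted write buffer, while a background merge would eventually fold the buffer into a new index.

#include <stdlib.h>

typedef struct { long subject, predicate, object; } Triple;

typedef struct {
    Triple *indexed;      /* sorted, read-only index: O(log n) lookups */
    size_t  indexedCount;
    Triple *pending;      /* unsorted buffer of recent writes, in arrival order */
    size_t  pendingCount;
} Store;

static int compareTriples(const void *a, const void *b) {
    const Triple *x = a, *y = b;
    if (x->subject != y->subject) return x->subject < y->subject ? -1 : 1;
    if (x->predicate != y->predicate) return x->predicate < y->predicate ? -1 : 1;
    if (x->object != y->object) return x->object < y->object ? -1 : 1;
    return 0;
}

/* A lookup consults the index first, then falls back to scanning the
   not-yet-indexed writes; a background thread would merge them later. */
int contains(const Store *s, const Triple *t) {
    if (bsearch(t, s->indexed, s->indexedCount, sizeof(Triple), compareTriples))
        return 1;
    for (size_t i = 0; i < s->pendingCount; i++)
        if (compareTriples(&s->pending[i], t) == 0)
            return 1;
    return 0;
}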

Andrae has since left behind the idea of skip lists (they had a certain advantage in reading, due to forward-file access, but writing can be expensive) but the idea of building up an unordered list and flushing it in a set of background threads is still what we want to do. The result will be a fast write, like AllegroGraph has, but querying will get slower and slower as writes get further ahead of the indexing threads, again - just like AllegroGraph. The big difference is that we only have it planned, while they have it working. If only Mulgara had a dozen programmers still working on it. Oh well - if wishes were fishes...

Thinking of using a flat file with raw writes going into it reminds me of a related architectural piece.

One of the very first features we decided on for XA2 was an ability to have several concurrent writers. The phase trees and two-phase commit process we employ at the moment are perfect for maintaining ACID compliance, but they limit us to only a single writer at a time. If we wanted to go to multiple writers then we'd need some form of merging for multiple concurrent transactions. While it might be possible to merge branches of a phase tree, this approach is guaranteed to make ACID a total nightmare. Besides, it wouldn't really be a phase tree anymore if you try to merge branches (and nodes within those branches). The way to manage this is the way it's always done, and that is to keep transaction logs.

A transaction log is a list of two types of entry: insertions and deletions (or whiteouts). This information is basically what I was describing above. Just as I already stated, the current transaction needs to search this list along with the index whenever a query is resolved.

Fortunately, RDF is very kind to us when it comes to keeping a log. Statements either exist, or they don't. This means we only need to consider merging logs where opposite operations are recorded, e.g. if one transaction removes a statement while a concurrent one removes and re-inserts that statement. In fact, an argument can even be made for not caring about merging these things, since almost any sequence of actions is valid.
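
To sketch what I mean (again with made-up types, rather than anything actually in Mulgara): a log entry is just an operation plus a statement, and two concurrent logs only need attention where they apply opposite operations to the same statement.

#include <stdbool.h>

typedef struct { long subject, predicate, object; } Statement;

typedef enum { OP_INSERT, OP_DELETE } Operation;

typedef struct {
    Operation op;
    Statement stmt;
} LogEntry;

/* RDF statements either exist or they don't, so the only interesting case
   when merging logs is opposite operations applied to the same statement. */
bool conflicts(const LogEntry *a, const LogEntry *b) {
    bool same = a->stmt.subject   == b->stmt.subject
             && a->stmt.predicate == b->stmt.predicate
             && a->stmt.object    == b->stmt.object;
    return same && a->op != b->op;
}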

Unfortunately RDFS/OWL isn't anywhere near as forgiving. These systems rely on multiple RDF statements for their structures. While the open world may make this perfectly legal (for most purposes), any practical application trying to talk to inconsistent data is going to choke. Merging transaction logs with these constraints will be more challenging, especially if a simple RDF interface is being used for insertion/deletion of statements. An RDFS or OWL (or even SKOS) interface could keep a track of the consistency of the structures, but trying to do this at the RDF level will require a lot of inferring of what the transaction is actually trying to accomplish. Indeed to allow the commit of OWL data, you'll need a full reasoner to check for consistency before allowing for a commit. The problem with doing something at the RDFS/OWL level is that we don't have an API at those levels. Hmmm, that reminds me that I need to get Sofa properly integrated again. That might even help us here.

But these are concerns for another time.

Back to the Conference

After the last talk on Monday we all went to the "Metaland" party. A presentation was going on, but it had been a long day for everyone, and there was free beer on offer. You can imagine how many people were paying attention. Maybe in future the beer can be handed out after the presentation, rather than before and during.

Amit and Rich from Topaz then went out to dinner and drinks with all of us from fourth codex. It was more fun than I thought any evening at a conference was going to be. If I embarrassed myself then no one has mentioned it... yet. Every phase of the evening seemed to involve a new type of drink.

While out, we had two people show up for fourth codex. The first was a new university graduate working for us in sales. This was her first day at work, and the commercial exhibits at the conference were on the next day. Talk about a trial by fire. The other guy was from the Exeura side of fourth codex, and had just flown in from Calabria, after 29 hours of traveling. He must have done something horrible in a previous life to have to go straight into a conference after a trip like that!

Reasoners

As promised, I checked the license for Pellet. It uses the MIT license, making it compatible with practically anything. Yay!

SemTech Day 1 (Still)

After David and Brian's talk I went to lunch, only to happen on Amit and Rich from Topaz. Topaz is an Open Source development group for the Public Library of Science (PLoS).

For anyone who doesn't know, it was Topaz along with the Software Freedom Law Center, who came to our defense when we encountered difficulties dealing with NGC last year. An explanation of this, along with the reason for the name change from Kowari to Mulgara can be heard in a recent podcast interview that David had with Paul Miller of Talis. Topaz has been developing with Mulgara, and used it in the deployment of the PLoS ONE system, which is why our logo made it onto their front page. More recently, they have started contributing to the project, and have even provided some direct funding. I like these guys. :-)

Lunch was spent discussing what Mulgara is doing, and where we hope to take it. Amit also asked about some of the internal operation of Mulgara, and the resulting description took me well into the next session. I hope I didn't miss anything important, but right now I can't even recall what I was hoping to attend, so with that justification I don't feel so bad. The discussions with Amit were worthwhile anyway, and it is my opinion that these impromptu meetings provide the real value of any conference.

AllegroGraph

I wrapped the discussion up in time to attend a talk I was very interested in, by Richard Newman of Franz Inc. It went by the unwieldy title of Working with SPARQL - Large Datasets, Fast Queries: What You Need to Know. This sounded fascinating, and I wasn't disappointed.

I'm embarrassed to say that I didn't know much about Franz Inc, nor the AllegroGraph RDF database. Such an obvious lapse is a clear indication that I need to get out on the web more. Still, Richard's talk and subsequent Q&A went a long way to describing the system for me. In some cases he directly described internal details, while in others the explanation of how features were implemented made it clear exactly what they had done.

It appears that AllegroGraph is architecturally similar to Mulgara in several ways, so I'm going to provide a bit of a comparison here.

Their approach to querying has had the advantage of being directly applied to SPARQL, which Mulgara is still working towards (hopefully soon), but since the basic structure of TQL and SPARQL is the same, then the parallels are easy to draw. There are some distinctions between the two projects, but most of the extras in AllegroGraph are already in the pipeline for Mulgara, such as storing 5-tuples, and a tableaux reasoner (they have Racer, while I'd like to bring in Pellet - which reminds me that I need to check Pellet's license).

Richard's explanation of simple unordered joins returning an immediate solution with no consumption of server memory revealed that AllegroGraph contains the full 6 indexes we use, and also does the same lazy join evaluation as Mulgara. However, when we move to 5-tuples we will only be putting a tuple ID into the first index (and not the remaining 5 indexes). We will also be introducing a new index based on this ID, so that a tuple can be found based on its ID. Surprisingly, according to their introduction, AllegroGraph doesn't appear to have this index. Then again, maybe that's just a documentation issue. Incidentally, AllegroGraph seems to have a tuple ID for every entry, while our plan is to only include it for those statements which have been reified. This won't affect the storage requirements on the first index (the one being widened to 5), but will save space and time on the new index, which will only get entries when that statement gets reified.

AllegroGraph is using an ID-to-object mapping structure just like our "String Pool". I've commented in the past how this is an historical name, based on it originally storing only URIs and string literals, and that it should be renamed given that it now stores all the XML datatypes (and even user types). Franz must have gone through a similar process with AllegroGraph, given that their equivalent structure is called a "String Dictionary".

Getting back to the talk, Richard spent quite a bit of time talking about the ramifications of UNIQUE and ORDER BY. The UNIQUE keyword results in memory usage at the server end, though the result set is returned immediately. This is pretty obvious, given that the server need not do anything beyond store (and index!) those lines already encountered so they are not duplicated.

More difficult is implementing ORDER BY. Richard referred to server memory, but I suspect he was including secondary memory in that description. In our own case, we use a class called HybridTuples. This is a class that I started, but Andrae was the one to really implement. Now that I think about it, he probably didn't even look at my code. I think this class may have been one of the first things he did for Mulgara. The class uses memory up to a point, but then uses a file. It would be nice to memory map it when it can fit, but on a 32 bit system that is just too difficult to test for in Java, and it could cause other connections to fail with an Out Of Memory error (OOM). So on 32 bit systems HybridTuples uses standard explicit IO for managing the file, while on 64 bit systems it uses memory mapping (yes, even if the file is over 2GB in size).
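
The general shape of that class (a sketch of the idea only, with invented names and limits, and ignoring the memory-mapping side entirely) is an appender that keeps rows in memory up to a threshold and then spills them to a temporary file:

#include <stdio.h>
#include <string.h>

#define MEM_LIMIT 1024   /* arbitrary row limit for this sketch */
#define ROW_SIZE  32     /* invented fixed row size, in bytes */

typedef struct {
    char  *rows;         /* caller-allocated buffer of MEM_LIMIT * ROW_SIZE bytes */
    size_t count;        /* total rows appended so far */
    FILE  *spill;        /* backing file, created once the buffer fills up */
} HybridBuffer;

int append(HybridBuffer *b, const char row[ROW_SIZE]) {
    if (b->spill == NULL && b->count == MEM_LIMIT) {
        /* limit reached: push everything held in memory out to a file */
        b->spill = tmpfile();
        if (b->spill == NULL) return -1;
        if (fwrite(b->rows, ROW_SIZE, b->count, b->spill) != b->count) return -1;
    }
    if (b->spill != NULL) {
        if (fwrite(row, ROW_SIZE, 1, b->spill) != 1) return -1;
    } else {
        memcpy(b->rows + b->count * ROW_SIZE, row, ROW_SIZE);
    }
    b->count++;
    return 0;
}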

All up, the talk did not provide me with many useful tips for SPARQL, though that may have been due to my experience with Mulgara. However, I found the description of AllegroGraph, its features and internal details to be invaluable.

After the talk Amit offered to introduce me to Jans Aasman, who is the director of engineering at Franz. When he heard that I was involved with Mulgara he made a comment I was to hear several times during the week, "It was a real shame. You guys were way ahead of the curve 2 years ago!" This comment was a great endorsement of our work, and yet very frustrating that a great opportunity was lost. All the same, Mulgara is starting to move along nicely again, so we should still have something worthwhile to offer.

I also commented to Jans about the internal structure of AllegroGraph, and he admitted that there isn't much you can do to hide the implementation of these things. I suppose much of what we do is straightforward. It's just a matter of competent people integrating the right components and techniques. On those few occasions where you DO come up with something new (like when we started indexing 3 ways on triples, and 6 ways on quads) then it doesn't take long before everyone does it. While I may form an emotional attachment to my own ideas (see the section marked "Indexing" for the reference here), I still love to see everyone helping everyone else improve the current state of the art (and hence, I hate patents - especially on software).

I'm not sure what others got out of it, but overall, I liked Richard's talk, though probably not for the insights into SPARQL, but rather the exposé of the engineering behind AllegroGraph.

Saturday, May 26, 2007

Argh

After writing all morning, someone from Italy wanted me to check my email in the office using Zimbra. Of course, this made Firefox crash (again). Fortunately, Blogger autosaves your posts, so I lost nothing. However, after starting Firefox I discovered that editing was very slow. The "top" program revealed that Firefox was using 30% of the CPU, so I decided to restart it. Again, I made sure the blog was saved (I manually pressed "SAVE NOW").

After firing up the program I first confirmed that the CPU wasn't being used heavily, then went back to the blog. Only this time the draft is empty! It's there, but there's nothing in it. Given how much time I put into it (and the links I chased down) I'm kind of miffed. Software has come some way in recent years, and I haven't experienced this kind of data loss for some time. It feels like the 1990's all over again.

Thinking that Blogger would want to know about the potential for data loss, I went looking for a place to report the problem. However, like several other sites I've found recently, making a bug report (or a warranty claim) is not a trivial task. Fortunately, I found it at last, only to discover that clicking the "Report a Bug or Problem" button and then choosing the sub-item of "I found a bug with Blogger" results in a page saying, "Invalid problem". Selecting the more general option of just "Report a Bug or Problem" takes you to a Help page.

So how do I report a bug in the bug reporting system?

Thursday, May 24, 2007

SemTech Day 1

After registering, I looked around for where I needed to be, and for any familiar faces. Right off the bat I saw Bernadette talking with Eric Miller. I've been looking forward to seeing Bern again, and DavidW has been telling me for some time that he wanted to introduce me to Eric, so this was a pleasant coincidence. Eric is as sharp as I was led to believe, but far more personable and amusing than I expected. He's a really nice guy, and I can see why he's been so influential in the semantic web community.

The first talk I attended was by Brian and David, titled, Putting the Web Back into the Semantic Web. I already had some notion of what this was about, and I knew that they leverage NetKernel quite heavily.

Brian had shown me last year how NetKernel provides a RESTful interface and allows data and processing operations to be described in URLs. This basically allows efficient remote access to arbitrary functions, with nestable function calls. It seemed cute when I first saw it, but I didn't see that I needed it, despite the papers on efficiency that Brian had sent me. I certainly didn't understand his desire to use it as a framework inside Mulgara. But he knows what he's doing, so I've been prepared to give him the benefit of any doubt.

I won't go into the details of the talk, but I was very impressed at the ease with which they were able to access and process data using NetKernel. More impressive were the reports of what a NetKernel infrastructure could provide. Efficient transport over HTTP is a given, but near-linear scalability with threads and clustering really caught my attention. I guess it makes sense, since Google get analogous scalability with their stateless Map-Reduce libraries. I could hear similar murmurs of appreciation from the audience.

David went on to describe GRDDL, showing GRDDL markup on his own home page. I've been curious about microformats for some time, but too busy to look into them, so this was a nice, practical introduction. Introduce a NetKernel module for doing GRDDL parsing into RDF, and now you can specify URLs which refer to the RDF from a web page. They've also written another module which takes RDF data and inserts it into a Mulgara database. Using these together, a single call to NetKernel will perform an extraction of RDF from a web page, and insert it into a local RDF store. A little presentation layer work backed by the Mulgara store means that you can now browse through all the extracted RDF. This gets particularly interesting if the RDF contains linkages across the store.

I was impressed by this talk, and have since heard several others say the same thing. But it was a later conversation with Eric that made me really rethink the idea.

RDF can be used in several ways. The most trivial is simply for storing any kind of information. It works particularly well for information with a network topology, but it can be applied to anything really, even binary file formats like MP3 or PNG. I'm not saying that this is necessarily a good idea (for a binary format it's a terrible idea), but it is certainly possible.

Another approach to using RDF is for modeling. This uses RDFS, and possibly OWL or SKOS. Almost every application which uses RDF is doing exactly this. Unfortunately, while RDF is a great structure for this type of data, it is a terrible API for developers to work with. This explains the need for projects like ActiveRDF, and Alexandria's upcoming modeling API (which works well, but needs more features).

Yet another approach is simply for linking resources. When I first heard about RDF in 2001, one of my first thoughts for using it was to describe links between arbitrary resources. Since those resources are described with URIs, my natural thought was to link URLs together, thereby creating a "Web" of physical resources. This is already achieved in the World Wide Web by the "anchor" linkages described inside HTML pages, but these links can only be created and modified by a page's author. As an author of a page, I will often link to other pages, but I am unable to link those pages back to mine. There are also occasions when there is value in linking arbitrary third party pages, such as when one page provides specifics for concepts described in another page, or when describing a workflow.

Of course, all 3 of these approaches can be used at once, to varying degrees. RDF is great that way. However, developers usually use RDF to meet a specific need, meaning that the variety of ways that RDF can be used is often overlooked. I think the last approach of linking arbitrary resources is underutilized, though I've seen it used here and there. Some people even complain that you shouldn't use RDF to link pages together, as any application using this data is expecting the resources in question to be URLs which refer to resources with a physical presence on the net. The RDF specification specifically uses URIs to indicate that the resources need not have a physical presence, and many people get upset when you create an application that presumes that a URI is a dereferencable URL. But insisting that those resources cannot have a physical presence is stating that RDF isn't allowed to describe certain types of resources, hence, I disagree. But I digress...

Since NetKernel allows you to reference any operation on any kind of data via a single URL, then using RDF it is possible to link any kind of resource to any data you can dream of. Not just static data as we do now, but real-time, arbitrarily complex, calculated data. Not only that, but it's trivial to do.

This got my head spinning. It opens up a lot of possibilities.

Continued

I put this post down while traveling back to Chicago, so it makes sense to break it here and continue later. There's lots to talk about, and it got me thinking about a lot of things that I want to write down while it's still fresh, so I can clarify my perspective.

San Jose

Sunday saw me brave the torrential rain in order to hail a cab for Midway. These cab trips are expensive, so fingers crossed that the reimbursement comes sooner rather than later.

I met Chris at the airport, and then we were off for San Jose, and my first visit to California outside of an airport. We arrived late, checked in, and went to bed.

Being in a later timezone, and having no children with me, I had no problems waking up early the next day. Looking out from the 14th floor was really impressive. While in the middle of a very urban setting, you don't have to go far to see that we're surrounded by dry hills. It's an attractive contrast.

The immediate locale is quintessentially California/Bay area. The bright skies of lower latitudes (I miss that), palm trees (I think I miss them too), and a dense network of trams (cool). The only detractors were that I would be too busy to get out to experience it up close, and the brown tint to the sky down towards the horizon. I've heard that Chicago has very clear air because of the lake. I suppose that is right, as this is a region with fewer people yet the air seemed dirtier than I can ever recall seeing in Chicago (and certainly Brisbane). On the other hand, I don't get up to the height of the 14th floor very often when I'm at home, so maybe that's just a perception thing. If this is the Bay, then I shudder to think what Los Angeles would be like.

The room is nice, but I was a little surprised at the water situation. Every hotel I've visited in Australia or the USA for the last couple of years has had a notice saying that they would let you reuse your towels if you hang them up, thereby saving water. California is just on the cusp of a major water problem as the winter snowpack diminishes, and yet there was no such notice about water savings.

I thought that the hotel may be trying to pick up its savings with the reduced (nearly non-existent?) flow of hot water in the bathroom sink. However the cold water and the shower changed my mind. I was pleased to discover that the shower was easily set to a comfortable temperature, but dismayed to learn that the flow control's only settings were "off", and "Help! I can't swim!"

I wanted an early start, so I decided to forgo exercise and go straight down to the conference. I spent money on an exorbitantly priced meal, before discovering that the conference was providing this for free in a separate area. Sigh. At least I knew about it every subsequent day. I'm currently typing this in an antisocial fashion while sitting at breakfast on the final day.

So now I can talk about what I actually did while here...

Disjunctive Logic

Wow. Has it been that long since I last blogged? No wonder people think I've dropped off the face of the earth...

Life has been crazy in the last week. On Friday I spent the whole day with Nicola from UniCal discussing architecture and products, especially OntoDLV. He's a very knowledgeable guy, and it didn't take long for him to set me straight. I also learned a lot about the internal architectural decisions (I knew how it was built, but didn't necessarily know why it was built that way). I also discovered that the apparently non-monotonic behavior which I'd discovered was due to the fact that DLV is non-monotonic by design. Doh!

We didn't leave the office until about 9:30pm, as we wanted to avoid going to work on Saturday morning. That left the guys from Italy with a day to look around Chicago, while I got a day and a half with my family before going to San Jose for the SemTech conference.

Wednesday, May 02, 2007

Multiple Blanks

It turned out that re-mapping the blanks was quite easy. It required the use of a new getColumnValue method in the AnswerResolution class. This just tests to see if the node being returned from the remote server is blank, and if it is, then to create a different type of blank node instead. This is a new blank node called a ForeignBlankNode. While pretty much identical to existing blank nodes, it takes the server URI that it was returned from as a construction parameter, and it re-implements equals(Object), hashCode() and toString(). These differences allow it to be seen as distinct from another blank node with the same internal ID. Also, the localization process recognizes the node as new (and no longer a conventional blank node) and allocates something new in the temporary string pool.

This worked perfectly, but that interaction with the temporary string pool made me remember bug MGR-43, as it was making use of unknown elements going into a string pool. So I thought I'd have a look at what is involved in a SELECT/INSERT command.

To save myself time, I just attempted an insert from a selection coming from a different server:
insert
select $s $p $o
from <rmi://localhost/server1#foo>
where $s $p $o
into <rmi://localhost/server2#bar>;
Immediately I got a response from the distributed resolver factory telling me that it doesn't support write operations. So that told me that the query went to the server with the model to be read, and not the model to be written to.

To support write operations on a distributed resolver I have to implement the modifyModel(...) method. This takes a Statements object, which is a cursor where each location has a subject, predicate, and object. The remote session is happy to accept a Statements object, but this is inappropriate here. If the object were simply sent as it is, then every Cursor.next() call would be done in RMI, along with every call to getSubject(), getPredicate() and getObject(). This would be horrible, even for a tiny set of statements.

There are two ways I could manage this situation.

The first is to not make the distributed resolver writable, and instead send the original INSERT/SELECT to the server containing the INTO model (the model being written to). This server would then use the read-only distributed resolver to do all the querying, and the resulting Statements would be local. This would work, but it would mean doing resolutions for every constraint across the network, even if the entire SELECT part of the query were from the same server. The entire query could be offloaded to the other server for the optimized query resolver, but that is not written yet, and we'll be keeping it commercial for a while anyway. I want/need something that will work for everyone.

The second approach is to make the distributed resolver writable, while making it possible to send Statements across the network easily. For small statement sets we'd want a simple array of statements, and for large sets we'd want something that paged the statements, thereby minimizing the calls over the network, but still allowing large amounts of data to be moved.

This is exactly the solution I implemented for Answers a few years ago. In fact, I recently revisited the Answer code due to an unexpected interaction with transactions, so I feel familiar with it. While I'm reluctant to extend our usage of RMI even further, it's not going away any time soon, and this seems to be a good place for it.

I had to put the implementation aside a couple of weeks ago, as it was getting late. It's been difficult to get back to it, as I've needed to do some documentation at work. When I tried to set up my notebook computer (which I use in the office) to work with the latest checkout from Subversion, the poor little G4 PowerBook took so long to get anything done in Eclipse that I basically wasted the day. Since then I've had to look at other areas of the project (hmmm, this is starting to sound like my RLog project) and it's only today that I'm getting back to it.