Saturday, May 08, 2004

I was dead by the end of yesterday, so I put blogging off until today. I'll see what I can remember...

I started the day tired, because I chose to stay up coding on Thursday night. It all started with an offhand comment to AN about JRDF where I suggested that an in-memory store which met the interface would be useful. AN agreed, and that got me thinking about it.

An in-memory store could be used for testing the interface, and testing code that wants to use the interface. On top of that, I've noticed that there seem to be a lot of people out there who just want a fast and efficient RDF store, and most of the implementations are neither. Sure, a store that only meets the JRDF interface and nothing else isn't going to have a lot of features (like transactions, distributed queries, inferencing, security, and so on), and an in-memory version won't scale, but then again, many people who are working in this space at the moment don't need all of those features (with the possible exception of scaling). There are also people who do need these features, but we're already trying to cater for them with Kowari and TKS.

So I started thinking of the RDF stores that we were testing a few years ago. I still remember the architecture on a memory based one that I built, and I realised that it would be perfect. So on Thursday night I sat down and started. It came together amazingly quickly, which was probably a bad thing. It meant that I kept putting off going to bed because I thought that I could just implement the next thing, and then the next thing... you know the drill. Well I got well over halfway (minus the JUnit tests), and went to bed feeling pretty excited about it, because I think it will be pretty quick. The next day (Friday morning) I took AN aside and told him what I nearly had for him. He seemed happy enough, but not all that enthusiastic. I guess it deflated me a little, because I'd been really enjoying the coding, and was looking forward to getting back to it so I could finish it. This is the sort of work that I really like, and it's been a while since I've been able to do it.
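
Since I'm talking about it, this is roughly the shape of the thing, sketched from memory. The names are placeholders rather than the actual JRDF interface, and only the subject index is shown, but it gives the idea:
  import java.util.*;

  // Not the real JRDF interface, just an illustration of the approach.
  class Triple {
    final String subject, predicate, object;
    Triple(String s, String p, String o) { subject = s; predicate = p; object = o; }
  }

  class MemoryStore {
    private final List<Triple> triples = new ArrayList<Triple>();
    // one index per node position; only the subject index shown here
    private final Map<String, List<Triple>> bySubject = new HashMap<String, List<Triple>>();

    void add(Triple t) {
      triples.add(t);
      List<Triple> list = bySubject.get(t.subject);
      if (list == null) bySubject.put(t.subject, list = new ArrayList<Triple>());
      list.add(t);
    }

    // null means "match anything", e.g. find("A", "subClassOf", null)
    List<Triple> find(String s, String p, String o) {
      List<Triple> candidates = (s != null) ? bySubject.get(s) : triples;
      List<Triple> result = new ArrayList<Triple>();
      if (candidates == null) return result;
      for (Triple t : candidates) {
        if ((p == null || p.equals(t.predicate)) && (o == null || o.equals(t.object))) {
          result.add(t);
        }
      }
      return result;
    }
  }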

One thing that came up in my conversations that morning was that a scalable version of an RDF store would be really useful. Sure, Kowari and TKS do that, but now we were talking about something that wouldn't do security or transactions or any of those other "Enterprise" features. It was at that point that I remembered just how fast we were able to store and search for triples way back before we introduced any of those features. I started getting the idea of building a whole new RDF store, based on some of our earlier designs. Only this time I wouldn't worry about any of the Enterprise features. An RDF store like that would truly fly.

Fortunately for me, DM came into the conversation with AN at this point. I felt a little embarrassed about him listening in, as I was probably getting a little carried away, and he really does see the details in a big picture much better than I do. However, he was also enthusiastic, as he's been considering coding the same thing himself. Now this is fortuitous, as DM and I were the ones who pair-programmed the original Kowari/TKS triplestore. Having DM to bounce ideas off would make any idea I had work 10 times better. So we got into a discussion, and started making some real architectural decisions.

We agreed that the code should be open source, meaning that it would be usable at work, but not belong to it. Once we made that decision we decided to cut the conversation short (after all, we were on work time). So it will be an after-hours and weekend thing.

One reason for DM's excitement was that he's had a new idea for how to store triples. After hearing him out, I have to say that I'm impressed, and we're going with it. Compared to Kowari, it sacrifices speed of iterating through a set of results for the sake of increased loading and querying speed. Given that iterating through results rarely needs to be at enormously high speeds, and loading and querying do, I think that this is a perfect tradeoff.

The basic structure involves a hashmap as an index. 7 hashmaps are used, one for each possible type of query. The hashmap then gives the offset into a datafile which holds the first triple corresponding to that query. The triple then has 6 indexes attached to it, with each one pointing to the next triple that matches the query. The 7th index is not needed, as that is for the query that specifies (subject,predicate,object), and that query only has a single triple as its result.
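
To make that a bit more concrete, here's a rough in-memory mock-up in Java of how a lookup would walk one of those chains. Array indexes stand in for file offsets, string keys stand in for the real hashing, and all of the names are mine, so treat it as a sketch of the idea rather than the design:
  import java.util.*;

  class DiskStoreSketch {
    // the six patterns that need a chain (the seventh, SPO, has at most one result)
    static final int S = 0, P = 1, O = 2, SP = 3, SO = 4, PO = 5;

    static class Record {
      long s, p, o;
      int[] next = { -1, -1, -1, -1, -1, -1 };  // next matching record, per pattern
    }

    List<Record> file = new ArrayList<Record>();  // stands in for the data file
    List<Map<String, Integer>> index = new ArrayList<Map<String, Integer>>();

    DiskStoreSketch() {
      for (int i = 0; i < 6; i++) index.add(new HashMap<String, Integer>());
    }

    // e.g. the SP query: one hash lookup, then one read per matching triple
    List<Record> findBySubjectPredicate(long s, long p) {
      List<Record> result = new ArrayList<Record>();
      Integer head = index.get(SP).get(s + "," + p);
      int at = (head == null) ? -1 : head.intValue();
      while (at != -1) {
        Record r = file.get(at);
        result.add(r);
        at = r.next[SP];
      }
      return result;
    }
  }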

So finding a result involves a single IO into the hashmap (maybe more if there is a collision, but we would try to avoid those), and a single IO into the data file. An insertion involves a read/write on each of the 7 hashmaps (maybe more than one read/write where there are collisions, but we're still trying to avoid those), and a single write to the block file (no reading). That's a fraction of what Kowari does!

I was never keen on doing transactions, but DM also thinks that we could do transactions by simply copying the whole thing, and letting old sessions keep reading from the old files. After all, copying files, even large ones, is pretty quick these days. It's simple, and effective. It also doesn't scale, but transactions were never a priority for this idea, so I have no problems with it.
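
In miniature, and assuming a single data file (the real thing would version a set of files, and the names here are invented), the idea looks something like this:
  import java.io.IOException;
  import java.nio.file.*;

  class CopyOnWriteStore {
    private Path current;   // readers that started earlier keep their old Path
    private int version = 0;

    CopyOnWriteStore(Path initial) { current = initial; }

    synchronized Path beginRead() { return current; }  // old sessions keep reading this

    synchronized Path beginWrite() throws IOException {
      Path copy = current.resolveSibling("store." + (++version));
      Files.copy(current, copy, StandardCopyOption.REPLACE_EXISTING);
      return copy;           // the writer modifies the copy, not the original
    }

    synchronized void commit(Path copy) { current = copy; }  // new readers see the copy
  }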

We have also agreed to implement the store in C, and wrap it in a JNI interface. We've both wanted to do this for a while, for a few very good reasons.

Firstly, Java's NIO provides no way of unmapping a file. Unmapping only seems to happen when the garbage collector picks up a MappedByteBuffer. Java has good reason for this, as otherwise every single get or put operation would need to be tested to see if the mapping was still open, and that would defeat the speed improvements that mapping was supposed to provide in the first place. Unfortunately for us this is a problem if we ever decide that a different region would be better to map, or if we need to change the size of a file. It causes particular headaches under Windows.
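
The problem in miniature (just a demo, not our code):
  import java.io.RandomAccessFile;
  import java.nio.MappedByteBuffer;
  import java.nio.channels.FileChannel;

  public class MapDemo {
    public static void main(String[] args) throws Exception {
      RandomAccessFile raf = new RandomAccessFile("store.dat", "rw");
      MappedByteBuffer buf =
          raf.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, raf.length());
      // ... read and write through buf ...
      buf = null;    // all we can do: drop the reference and hope the GC runs
      raf.close();   // on Windows the file often can't be deleted or resized
                     // until the mapping has actually been released
    }
  }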

Secondly, the AVL trees and block files we currently use are simple fixed structures, written over and over again in the files. To access these structures we have Java classes with accessor methods that know about the exact offset of each field in the structure. However, this means that even trivial operations on the data require numerous function calls, which each do their own arithmetic on integer offsets. In C we can simply use structures to access the data directly. The compiler does the arithmetic, and from there on the data access is fast and trivial.
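
For anyone who hasn't seen that sort of code, the Java side looks roughly like this; the field names and offsets are invented for the example:
  import java.nio.ByteBuffer;

  class NodeAccessor {
    private static final int LEFT_OFFSET = 0;     // int
    private static final int RIGHT_OFFSET = 4;    // int
    private static final int PAYLOAD_OFFSET = 8;  // long
    static final int RECORD_SIZE = 16;

    private final ByteBuffer buffer;
    private int base;                             // start of the current record

    NodeAccessor(ByteBuffer buffer) { this.buffer = buffer; }

    void seek(int recordNumber) { base = recordNumber * RECORD_SIZE; }

    int getLeft()       { return buffer.getInt(base + LEFT_OFFSET); }
    int getRight()      { return buffer.getInt(base + RIGHT_OFFSET); }
    long getPayload()   { return buffer.getLong(base + PAYLOAD_OFFSET); }
    void setLeft(int v) { buffer.putInt(base + LEFT_OFFSET, v); }
  }

In C the equivalent is a struct pointer cast onto the mapped region, so every one of those accessor calls collapses into a plain field reference, with the offset arithmetic done by the compiler.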

Certainly, at higher levels we want to use Java, but for the file access we can definitely pick up some speed with C. The code is going to be pretty easy, so it might be worthwhile rewriting it in Java once we get it going, just to see the difference.

Unfortunately, doing it in C will mean that we need to use #defines around our file mapping functions to cater for the differences between Windows and every other operating system, but we should be able to do almost everything else completely portably. I think the inconvenience of compiling the C code will be more than made up for by the advantages we'll be able to offer the user. GCC is available on almost every system now, though we might need to supply a pre-compiled binary for Windows.

So, now I know what I'm going to do with DM, but I've promised myself that I'll finish the in-memory version first. As soon as I finish this blog I'll be onto it. I've had to put it off until now, as I couldn't do anything after work last night. Trying to get fitness back, I swam a km, and after my lack of sleep from the previous night I was just wrecked. Hence my lack of blog until now. Today was also a write off, as I had to look after a sick Anne, and at the same time go out and find a Mother's Day present for Luc to give her.

So much for the "fun" stuff. As for work yesterday, I started looking at the implementation of transitive rules. The simple rule for transitivity works without modification, but it doesn't chain at all. So if A is a subclass of B, B is a subclass of C, and C is a subclass of D, then a single application of the rule generates the statements that A is a subclass of C and that B is a subclass of D. However, we would need to apply the rule again, to those newly generated statements, to discover that A is a subclass of D.

To do this inference as a select is quite simple:
  select $xxx <&rdfs;subClassOf> $zzz
  from <namespace#model>
  where $xxx <&rdfs;subClassOf> $yyy and $yyy <&rdfs;subClassOf> $zzz;

What we need is some way to tell the query layer that a rule is transitive, so it should keep applying it along a chain (while avoiding loops). Applying the rule to the whole dataset recursively would also work (and would seem more mathematically complete), but would be inefficient. We also want the option to tell the query layer where to start a chain, so we can make the rule apply efficiently, and only to those statements the user is interested in. On Thursday I described some of this. Well after meeting with AN and TJ in the morning we decided to go with a simple syntax. It's not as elegant as the one SR proposed (meaning that it doesn't fit in with the current syntax structure all that well, and it doesn't look like a mathematical expression), but it's also concise, easy to read, and describes exactly what we want the query layer to know.

To apply the "rdfs:subClassOf" predicate transitively will have the following syntax:
  select $xxx <&rdfs;subClassOf> $zzz
  from <namespace#model>
  where trans( $xxx <&rdfs;subClassOf> $zzz );

Wrapping the expression in the "trans" statement tells the query layer that this predicate is to be applied transitively. By explicitly telling the query layer that the predicate is transitive, we don't need the second part of the statement that the original query used to imply transitivity. Hopefully making this explicit will make it easier for the query layer to do its work.

Fixing one end of the chain changes the where clause to:
  trans(
    $xxx <&rdfs;subClassOf> <http://namespace#FixedSuperClass>
    and
    $xxx <&rdfs;subClassOf> $zzz
  );
Or to fix the other end:
  trans(
    <http://namespace#FixedSubClass> <&rdfs;subClassOf> $zzz
    and
    $xxx <&rdfs;subClassOf> $zzz
  );
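
To spell out what "keep applying it along a chain (while avoiding loops)" means for the query layer, this is the sort of expansion I have in mind, sketched with plain Java collections. The edges map is a stand-in for resolving the direct statements of the predicate, not any real API:
  import java.util.*;

  class ChainSketch {
    // edges: direct statements, e.g. subClassOf as subject -> set of objects
    static Set<String> chainFrom(String start, Map<String, Set<String>> edges) {
      Set<String> visited = new HashSet<String>();   // avoids loops in the data
      Deque<String> toVisit = new ArrayDeque<String>();
      toVisit.add(start);
      while (!toVisit.isEmpty()) {
        String node = toVisit.remove();
        Set<String> targets = edges.get(node);
        if (targets == null) continue;
        for (String t : targets) {
          if (visited.add(t)) toVisit.add(t);  // newly inferred superclass of start
        }
      }
      return visited;   // everything reachable from start along the predicate
    }
  }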

So now I need to get in and change it to do this.

AN pointed me to the sablecc/itql.grammar file, and after fiddling for a while I've managed to get the parser to accept "trans" as shown above. However, it still isn't complete, as it will allow any kind of statement inside the brackets, and I need to restrict it explicitly to the form I've shown. I think I have enough of a handle on it to get this done, but I don't think I'll be quick if I want to get it right.

I've also been inside the query layer code, trying to find the right place to put the changes. It is possible to do this as a sequence of queries, at a relatively high level, just below the iTQL layer. However, this doesn't seem all that efficient, so I've gone down into AbstractDatabaseSession. Queries here are passed in as objects, and get sent to resolveLocalQuery(). This in turn constructs a LocalQuery based on the query object that was passed in.

So I need to modify the LocalQuery object to do the transitive work. I also need to modify the constraints in the Query class to convey the new information down to this layer. This information is the chain anchor (if any), and the fact that the predicate is transitive.

So once I have the grammar constructed correctly, I'll be using it to modify the constraint field in Query (this means modifying the Constraint class). Then I'll be modifying LocalQuery to handle this new type of constraint, and implement this constraint as "chained transitivity".
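
Something along these lines is what I'm picturing for the constraint, though every name here is invented, and the real change will be to the existing Constraint class rather than a free-standing one:
  // Hypothetical sketch only: carries the transitive pattern plus an optional anchor.
  public class TransitiveConstraintSketch {
    private final Object subject, predicate, object;  // the transitive pattern
    private final Object anchorValue;                 // fixed end of the chain, or null
    private final boolean anchoredAtSubject;          // which end is fixed

    public TransitiveConstraintSketch(Object subject, Object predicate, Object object,
                                      Object anchorValue, boolean anchoredAtSubject) {
      this.subject = subject;
      this.predicate = predicate;
      this.object = object;
      this.anchorValue = anchorValue;
      this.anchoredAtSubject = anchoredAtSubject;
    }

    public boolean isAnchored() { return anchorValue != null; }
    public Object getAnchorValue() { return anchorValue; }
    // LocalQuery would test for this constraint type and expand the chain,
    // starting from the anchor if one was given.
  }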

It sounds reasonably straightforward, and it probably is. However, I've never worked in these layers before, and it's all rather confusing. I'm concerned that I'll take longer than I should. Unfortunately, because I have the current mandate of "implementing inferencing", TJ and AN are both assuming that it is complex, will involve some dead ends, and will take me significant time. This is true, but since I am expected to take ages, the fact that I'm slowed up on something that shouldn't take long isn't going to register with them as being a problem (when I think it is).

Conversely, AN wants to have someone else familiar with this part of the code (a goal which I agree is important). So having me do the work regardless of the time is important. I just felt a bit lost doing it on Friday. Maybe it was the lack of sleep. :-)

Hmmm, AN professes to read these, so at least he's aware of where I am and what I'm up to now!

I think that's about all. I'd better get onto this in-memory store if I'm to have any hope of getting onto the scalable version soon.
