Thursday, January 24, 2008

More Correspondence

Going back through my emails with Andy, I realize there's still a lot that might be of interest to include here. Hopefully the parts I choose to copy/paste don't appear too disjointed.

JRDF

Andy was confused about the role of JRDF in Mulgara, as well he might be. He was trying to work with a minimal set of classes, yet he kept needing JRDF even when he wasn't using the JRDF interfaces.

You can think of JRDF as having two faces. First, it provides definitions for classes that represent RDF nodes - URIResource, Literal, BlankNode. Second, there's an interface for inserting and querying for RDF statements. Initially, someone decided to use JRDF as the definition for nodes (I think it was Andrew, and I think he chose to use it since he'd already written this code while he was at home, and it made sense for him to reuse it). Some time later, Andrew decided that the interfaces for manipulating and querying for statements should also be implemented, since we were already using the JRDF code. (Now that this is in my blog, I'm sure Andrew will clarify this point!)

So internally, yes, we use JRDF. It's mostly for the interfaces and abstract classes associated with URIResource, Literal, and BlankNode. There are also interfaces for SubjectNode, PredicateNode, and ObjectNode, which are used when moving triples into and out of Mulgara. At the lower levels, Mulgara is 100% symmetric around all 4 nodes (it used to be 3 nodes, but as most people should know by now, we moved it up to 4). However, pushing the data through these interfaces imposes certain type restrictions. This is why Mulgara won't let you use a literal as a subject, or a blank node as a predicate.
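To make the type restriction concrete, here is a minimal sketch of how position interfaces can rule out a literal subject or a blank node predicate at compile time. The interface names SubjectNode, PredicateNode, and ObjectNode come from the description above; every class name below (UriNode, BlankNodeSketch, LiteralSketch, TripleSketch) is a hypothetical stand-in, and none of this is the actual JRDF or Mulgara code.

```java
// Hypothetical stand-ins for the position interfaces described above.
interface SubjectNode {}
interface PredicateNode {}
interface ObjectNode {}

// A URI node may appear in any of the three positions.
final class UriNode implements SubjectNode, PredicateNode, ObjectNode {
    final String uri;
    UriNode(String uri) { this.uri = uri; }
}

// A blank node may be a subject or an object, but never a predicate.
final class BlankNodeSketch implements SubjectNode, ObjectNode {
    final long id;
    BlankNodeSketch(long id) { this.id = id; }
    // Mirrors the "_:" plus number form that blank nodes are displayed in.
    @Override public String toString() { return "_:" + id; }
}

// A literal may only be an object.
final class LiteralSketch implements ObjectNode {
    final String value;
    LiteralSketch(String value) { this.value = value; }
}

// The constructor signature is where the restriction is enforced.
final class TripleSketch {
    final SubjectNode subject;
    final PredicateNode predicate;
    final ObjectNode object;
    TripleSketch(SubjectNode subject, PredicateNode predicate, ObjectNode object) {
        this.subject = subject;
        this.predicate = predicate;
        this.object = object;
    }
}
```

With these types, new TripleSketch(new BlankNodeSketch(42), new UriNode("http://example.org/name"), new LiteralSketch("Andy")) compiles, while passing a LiteralSketch as the first argument or a BlankNodeSketch as the second is rejected by the compiler.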

For this reason, you'll need the JRDF classes, even if you never use the JRDF interfaces. Yes, I know it's annoying. It's one of the many reasons I want to reimplement a lot of Mulgara (another big reason is that I want to use a less restrictive licence - specifically Apache).

Blank Nodes

Blank nodes can have any label an implementor chooses to use, so long as it meets certain criteria. In Mulgara, they are shown as an underscore, colon, and then a number. Andy was trying to figure out the significance of the numbers shown here, and how they get allocated.

These numbers are actually the raw graph node identifiers (a 64 bit long), or gNodes. All gNodes are allocated from the Node Pool, which is just a Free List.

... describing free lists..... Oh boy.....

To start with, any new requests for gNodes just come from an incrementing long. However, if you ever delete all the statements that use a gNode, then that gNode will be "released", meaning that it's added to the FreeList. So now, whenever you ask for a new gNode, the FreeList will try to give you any released gNodes first before it returns the incremented internal long value.

However, that's a vast simplification. If you released any gNodes in the current transaction, then those will be given back to you first (until exhausted). Next, it will try to give you any nodes released in old transactions that are not part of a currently "open" result set. Once all open resources that refer to a set of gNodes have been closed, then the FreeList is able to hand them out. Finally, it uses the incrementing long.
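The allocation order can be sketched roughly as follows. This is an illustration under loud assumptions, not Mulgara's FreeList: the class and method names are made up, IDs are assumed to start at 1, and a single count of open result sets stands in for tracking which released gNodes each open resource actually refers to.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of the three-tier allocation order described above.
final class GNodePoolSketch {
    private long next = 1;  // assumption: the first gNode ID is 1
    private final Deque<Long> releasedThisTxn = new ArrayDeque<>();
    private final Deque<Long> releasedEarlier = new ArrayDeque<>();
    private int openResultSets = 0;  // simplification: one global count

    long allocate() {
        // 1. gNodes released in the current transaction come back first.
        if (!releasedThisTxn.isEmpty()) {
            return releasedThisTxn.pop();
        }
        // 2. Then gNodes released in old transactions, but only once no
        //    open result set could still refer to them.
        if (openResultSets == 0 && !releasedEarlier.isEmpty()) {
            return releasedEarlier.pop();
        }
        // 3. Finally, fall back to the incrementing long.
        return next++;
    }

    // Called when the last statement using a gNode is deleted.
    void release(long gNode) {
        releasedThisTxn.push(gNode);
    }

    // Ending a transaction moves its released gNodes to the older pool.
    void commit() {
        releasedEarlier.addAll(releasedThisTxn);
        releasedThisTxn.clear();
    }

    void openResultSet() { openResultSets++; }

    void closeResultSet() { openResultSets--; }
}
```

In this sketch, a gNode released and committed while a result set is open stays parked until closeResultSet() is called, and allocate() returns fresh counter values in the meantime.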

All of this reflects the 32 bit thinking that the system started with. There is little need to re-use gNode values when you have a 64 bit system (if you allocate a gNode every millisecond, then it will take you half a billion years to use them all up, so I think we're safe). We need to update it, but unfortunately, there are arrays which are indexed by the gNode ID, meaning we can't just increment the long all the time. With the 32 bit approach this was OK, since the ID values were packed from the bottom. But if we move to an incrementing number for gNodes (simplifying things greatly - and speeding them up) then we will need a new on-disk structure for this array.
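The "half a billion years" figure is easy to check. The sketch below assumes an unsigned 64 bit ID space (2^64 values) and 365-day years; the class and method names are only for illustration.

```java
import java.math.BigInteger;

// Hypothetical helper: how long until an ID space of the given width is
// used up when one ID is allocated every millisecond.
final class GNodeExhaustion {
    static long yearsToExhaust(int bits) {
        BigInteger ids = BigInteger.ONE.shiftLeft(bits);  // 2^bits IDs
        BigInteger msPerYear = BigInteger.valueOf(1000L * 60 * 60 * 24 * 365);
        return ids.divide(msPerYear).longValueExact();    // whole years
    }
}
```

yearsToExhaust(64) comes out at a little under 585 million years, which matches the estimate above. For comparison, yearsToExhaust(32) is 0: at the same rate a 32 bit space lasts less than two months, which is why reusing IDs mattered in the original design.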

OK, this isn't describing Mulgara now. It's really my recent musings on making it all faster.

The Server

Andy was waiting for the server to start up, and his TQL client appeared to be getting confused with the intermediate startup state. There wasn't a lot said here, but I want to reiterate it anyway.

IMHO, the server is WAY too heavy. I'm all for the services provided... but I think they need to be provided in an external framework, with the database as a module that gets loaded by that framework. The fact that it starts so many services really bothers me. I'd fix this, if I had time.

Mind you, I'm being a bit harsh when I say "fix". It works. It's just that I believe it needs to be made of smaller parts, which are either independent, or build on one another. The current server is monolithic.
