Friday, February 27, 2009

Programmable Logic

I had hoped to blog a lot this week, but I kept putting it off in order to get some actual work done. I still have a lot more to do, but I'm at a natural break, so I thought I'd write about it.

I have finally integrated RLog into Mulgara! In some senses this was not a big deal, yet it still took a surprising amount of work.

To explain what RLog is, I should describe Krule first.

Krule

The Mulgara rules engine, Krule, implements my design for executing rules on large data sets. For those familiar with rules engines, it is similar to RETE, but it runs over data as a batch, rather than iteratively. It was designed this way because our requirements were to perform inferencing on gigabytes of RDF. Data of that size was taking a long time to load, so we wanted to load it all up first, and then do the inferencing.

RETE operates by setting up a network describing the rules, and then, as statements are fed through it, each node in that network builds up a memory of the data it has seen and passed. The first time some data comes in, the nodes that care about it will remember it and pass it on, but on subsequent passes those nodes will recognize the data and prevent it from going any further. In this way, every statement fed into the system will be processed as much as it needs to be, and no more. There is even a proof out there that RETE is the optimal rules algorithm. Note that the implementation of an algorithm can, and does, have numerous permutations which allow for greater efficiency in some circumstances, so RETE is often treated as the basis for efficient engines.
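To illustrate that memory behavior, here is a toy sketch (made-up class and method names, not code from any real RETE engine): a node remembers every statement it has already passed on, so feeding the same statement again does no further work.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// A toy RETE-style node: it remembers every statement it has already
// passed on, so re-feeding the same statement produces no further work.
class MemoryNode {
    private final Set<String> seen = new HashSet<>();

    // Returns true (and propagates) only the first time a statement arrives.
    boolean feed(String statement) {
        return seen.add(statement);
    }
}

public class ReteSketch {
    public static void main(String[] args) {
        MemoryNode node = new MemoryNode();
        List<String> incoming = List.of(
            "<a> rdf:type <Person>",
            "<b> rdf:type <Person>",
            "<a> rdf:type <Person>");   // duplicate: stopped at this node

        for (String s : incoming) {
            if (node.feed(s)) {
                System.out.println("pass on: " + s);
            } else {
                System.out.println("already seen, dropped: " + s);
            }
        }
    }
}
```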

One variation on algorithms of this sort is to trade time for space. This is a fancy way of saying that if we use more memory then we can use less processing time, or we can use more processing time to save on memory. Several variations on RETE do just this, and so does Krule.

When looking at the kinds of rules that can be run on RDF data, I noticed that the simple structure of RDF means that each "node" in a RETE network corresponds to a constraint on a triple (or a BGP in SPARQL). Because Mulgara is indexed in "every direction", every constraint can be found as a "slice" out of one of the indexes (our current system usually takes O(log(n)) to find such a slice, while upcoming systems will do some or all of these searches in O(1)). Consequently, instead of my rule network keeping an in-memory table associated with every node, there is a section of an index which exactly corresponds to that table.
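To make the idea concrete, here is a rough sketch of a triple store indexed "in every direction", so a single-position constraint is answered directly from an index rather than from a table a node has to build up. The class and method names are hypothetical, for illustration only; this is not the Mulgara API.

```java
import java.util.*;

// A toy fully-indexed triple store: one map per position, so a constraint
// with one bound position is a direct lookup ("slice") rather than a scan.
class TinyTripleStore {
    record Triple(String s, String p, String o) {}

    private final Map<String, Set<Triple>> bySubject = new HashMap<>();
    private final Map<String, Set<Triple>> byPredicate = new HashMap<>();
    private final Map<String, Set<Triple>> byObject = new HashMap<>();

    void add(String s, String p, String o) {
        Triple t = new Triple(s, p, o);
        bySubject.computeIfAbsent(s, k -> new HashSet<>()).add(t);
        byPredicate.computeIfAbsent(p, k -> new HashSet<>()).add(t);
        byObject.computeIfAbsent(o, k -> new HashSet<>()).add(t);
    }

    // Resolve a constraint with exactly one bound position (null = variable).
    // The result is just a view of an index: no per-node table to maintain.
    Set<Triple> resolve(String s, String p, String o) {
        if (s != null) return bySubject.getOrDefault(s, Set.of());
        if (p != null) return byPredicate.getOrDefault(p, Set.of());
        return byObject.getOrDefault(o, Set.of());
    }
}

public class IndexSliceSketch {
    public static void main(String[] args) {
        TinyTripleStore store = new TinyTripleStore();
        store.add("a", "rdf:type", "Person");
        store.add("b", "rdf:type", "Person");
        store.add("a", "knows", "b");

        // The constraint (?x rdf:type ?y) is simply the predicate index slice.
        System.out.println(store.resolve(null, "rdf:type", null));
    }
}
```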

There are several advantages to this. First, the existence of the data in the database is defined by it being in the indexes. This means that all the data gets indexed, rules engine or not. Second, when the rules engine is run, there is no need to use the data to iteratively populate the tables for each node, as the index slices (or constraint resolutions) are already fully populated, by definition. Finally, our query engine caches constraint resolutions, and they do not have to be re-resolved if no data has gone in that can affect them (well, some of the caching heuristics can be improved for better coverage, but the potential is there). This means that the "tables" associated with each node will be automatically updated for us as the index is updated, and the work needed to handle updates is minimal.

During the first run of Krule, none of the potential entailments have been made yet, so everything is potentially relevant. However, during subsequent iterations of the rules, Krule has no knowledge of which statements are new in the table on any given node. This means it will produce entailed statements that already exist. Inserting these is unnecessary (and hence suboptimal) and would create unwanted duplicates. We handle this in two ways.

The first and simpler mechanism is that Mulgara uses set semantics, meaning that any duplicates are silently (and efficiently) ignored. Set semantics are important when dealing with RDF, and this is why I'm so frustrated at the non-distinct nature of SPARQL queries... but that's a discussion for another time. :-)

The more important mechanism for handling duplicate inserts is based on RDF being monotonic: RDF lets you assert data, but not "unassert" it. OWL 2 has introduced explicit denial of statements, but this is useful for preventing entailments and for consistency checking... it does not remove previously existing statements. In a non-monotonic system a constraint resolution may keep the same size if some statements are deleted while an equal number are inserted, but in a monotonic system like RDF, keeping the same size means that there has been no change. So a node knows to pass its data on if the size of its table increases, and otherwise it does nothing. I stumbled across this technique as an obvious optimization, but I've since learned that it has a formal name: semi-naïve evaluation.
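Here is a rough sketch of how that size test can drive the rule loop (only an illustration of the idea, not Krule's actual code; the rule and names are made up): because the data only ever grows, the loop can stop the moment an iteration leaves the total size unchanged, and set semantics quietly absorb any duplicates along the way.

```java
import java.util.*;
import java.util.function.Function;

// Illustration of semi-naïve, size-driven rule evaluation over a growing
// (monotonic) set of statements. Not Krule itself, just the control-loop idea.
public class SemiNaiveSketch {
    record Triple(String s, String p, String o) {}

    public static void main(String[] args) {
        Set<Triple> data = new HashSet<>(List.of(
            new Triple("A", "subClassOf", "B"),
            new Triple("B", "subClassOf", "C"),
            new Triple("C", "subClassOf", "D")));

        // One example rule: subClassOf is transitive.
        Function<Set<Triple>, Set<Triple>> transitivity = facts -> {
            Set<Triple> entailed = new HashSet<>();
            for (Triple x : facts)
                for (Triple y : facts)
                    if (x.p().equals("subClassOf") && y.p().equals("subClassOf")
                            && x.o().equals(y.s()))
                        entailed.add(new Triple(x.s(), "subClassOf", y.o()));
            return entailed;
        };

        long lastSize = -1;
        // Because statements are never removed, "size unchanged" means
        // "nothing new": a safe point to stop iterating.
        while (data.size() != lastSize) {
            lastSize = data.size();
            data.addAll(transitivity.apply(data)); // set semantics drop duplicates
        }
        data.forEach(System.out::println);
    }
}
```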

Krule Extensions

While this covers the "batch" use-case, what about the "iterative" use-case, where the user wants to perform inferences on a stream of data as it is inserted into an existing database? In this case, the batch approach is too heavyweight, as it will infer statements that are almost entirely pre-existing. We may handle the non-insertion of these statements pretty well, but doing that work again and again for every statement you try to insert adds up. Here the iterative approach of standard RETE is more appropriate.

Unfortunately, RETE needs to build its tables up by iterating over the entire data set, and I've already indicated that this is expensive for data sets of the size we expect to encounter. However, the Krule approach of using constraint resolutions as the tables is perfect for pre-populating these tables in a standard RETE engine. I mentioned this to Alex a few months ago, and he pointed out that he had done exactly the same thing once before when implementing RETE in TKS.
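Very roughly, the idea looks like the sketch below (hypothetical classes again, not the planned Mulgara code): the node's memory is seeded from the already-resolved constraint instead of replaying the whole data set through the network, and only genuinely new statements flow through after that.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of seeding a RETE-style node memory from an existing index slice
// (a "constraint resolution"), so only new statements need to flow through
// the network. Hypothetical classes, not the planned Mulgara code.
public class SeededReteSketch {
    static class MemoryNode {
        private final Set<String> memory = new HashSet<>();

        // Pre-populate from the already-resolved constraint instead of
        // replaying the whole data set through the network.
        void seed(Set<String> constraintResolution) {
            memory.addAll(constraintResolution);
        }

        // Iterative path: fires only for statements the memory has not seen.
        boolean feed(String statement) {
            return memory.add(statement);
        }
    }

    public static void main(String[] args) {
        // Pretend this came from the index: everything matching (?x rdf:type Person).
        Set<String> resolution = Set.of("<a> rdf:type Person", "<b> rdf:type Person");

        MemoryNode typeNode = new MemoryNode();
        typeNode.seed(resolution);

        for (String s : List.of("<a> rdf:type Person", "<c> rdf:type Person")) {
            System.out.println(s + (typeNode.feed(s) ? " -> new, propagate" : " -> already known"));
        }
    }
}
```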

I haven't actually done this extension, but I thought I'd let people know that we haven't forgotten it, and it's in the works. It will be based on Krule configurations, so a lot of existing work will be reused.

I don't want to overdo it in one post, so I'll write about RLog in the next one.
