Tuesday, October 05, 2004

The majority of today was spent building examples for the owl:sameAs predicate, and documenting the results.

This ended up being a bigger task than most other predicates. Entailment has to reflect symmetry, transitivity, and reflexivity. It also has to copy every statement which refers to or is referred by a node that is the owl:sameAs another.

Then there is consistency checking. Instances of classes can only be the same as other instances, classes can only be the same as classes, and predicates can only be the same as predicates. Objects cannot be the same as objects which they are also different from. These all get expressed as separate queries, and all required example data to be created.

Adding to the complexity of these queries is the fact that they do not fall within a rules engine. This then means that there is no set of pre-inferred statements to connect Properties with each of its subtypes. As a result, the statements which constrain on having an <rdf:type> of <rdf:Property> also constrain on <owl:ObjectProperty>, <owl:DatatypeProperty>, <owl:FunctionalProperty>, <owl:InverseFunctionalProperty>, <owl:TransitiveProperty> and <owl:SymmetricProperty>.

I created the sample code while putting together the documentation. This had its advantages and problems. The advantage was that I was able to proceed with each item as I arrived at it. The disadvantage was that new example data caused modifications to the example data for previously completed items. This meant that I spent a lot of time re-issuing queries, which obviously took some time.

Difference Operator
On Monday night I spoke to SR about the operations fundamental to a difference operator. AN had been insisting that it could be done using a complement operator. However, since a simple inner join against a complement on a constraint does not accomplish this I decided to work out why.

It didn't take long to realise that the complement against the triplestore was not providing what we needed. I discussed some of this with SR, but his questioning demanded specifics. This was probably a good thing, as it forced me to make sure that I had everything right.

The inner join operator appears as: ⋈

This is the "bowtie" operator. I've included it here with the HTML of &x8904;

If the operator does not show up correctly... then sorry. You'll just have to bear with me. :-)

Take A as a set of rows n columns wide. B is a set of rows m columns wide. The columns of A are:
a1 a2 a3 ... an
b1 b2 b3 ... bm

An inner join is a combination of 3 operations.

If I do an operation of A ⋈ B, then this can be broken down into:
A x B (cartesian product)
σ(A x B) (selection)
π(σ(A x B)) (projection)


A ⋈ B = π(σ(A x B))

The cartesian product ends up with a new set (which for convenience I'll call C). So:
C = A x B

The column names in C are:
c1 c2 c3 ... cn cn+1 cn+2 ... cn+m

Which corresponds to:
ci ≡ ai (i < n)
cj+n ≡ bj (j < m)

The σ(...) operation selects those rows where two or more columns are equal. ie. For one column:

The projection operation removes duplicate columns (for a selection of σCi=Cj(...) then this removes either ci or cj), and also removes columns where the values are fixed. So the total number of columns at the end t, is given by:
t < n + m

Since the selection operation is guaranteed to provide at least one duplicate column, this can be further restricted:
t < n + m - 1

The final projection operation is a mathematical convenience, as it does not change or add any information to the result. All it does is remove unnecessary information. As such, it is valid to consider the inner join without the projection operation for algebraic purposes.

This leaves us with a modified description of an inner join ⋈ which is defined as:
A ⋈ B = σ(A x B)

From here on, I'll use this definition of an inner join instead of the original. This is safe as since the missing π(...) operator simply removes redundant data that is not usually required.

Consider the universal sets of n and m columns respectively: Un Um.

By definition:
A ⊆ Un
B ⊆ Um

The cartesian product of A and B will then be in the space defined by the cartesian product of Un and Um.
A x B ⊆ Un x Um

If we consider a restricted universe of U'n and U'm, where:
A ⊆ U'n
B ⊆ U'm

Then the product of A and B will also fall into this space:
A x B ⊆ U'n x U'm

The selection operator does not remove any columns (ie. does not modify the space). Also, it provides a restriction on a set, meaning:
σ(A x B) ⊆ A x B ⊆ U'n x U'm

Now since we have defined ⋈ as:
A ⋈ B = σ(A x B)

This means:
A ⋈ B ⊆ U'n x U'm

The definition of a complement S with respect to the universal set is:
S ∪ ¬S = U

Let us define the restricted complement ¬' as being with respect to the restricted universal set:
S ∪ ¬'S = U'

With this definition we can say:
(A ⋈ B) ∪ ¬'(A ⋈ B) ⊆ U'n x U'm

It is important to note that ¬'(A ⋈ B) appears in the same space as U'n x U'm.

Finally, we can define a projection operator which selects out the A columns from the (A x B) cartesian product. Remember that (A x B) has columns:
c1 c2 c3 ... cn cn+1 cn+2 ... cn+m

So we define πn(S) as reducing the space of S from n + m columns down to n columns. This means that if S has n+m columns of:
c1 c2 c3 ... cn cn+1 cn+2 ... cn+m
And S' has n columns of:
c1 c2 c3 ... cn

S' = πn(S)

This operator loses information from the columns cn+1 cn+2 ... cn+m. The result of this operator is within the same space as the original set A.

When applying this to a set in the space of (U'n x U'm) we get the columns which come from U'n. As defined earlier, A ⊆ U'n. Applying this operator to the cartesian product of A and B:
πn(A x B) = A

σ(A x B) ⊆ A x B
σ(A x B) = A ⋈ B

A ⋈ B ⊆ A x B
πn(A ⋈ B) ⊆ πn(A x B)
πn(A ⋈ B) ⊆ A

This means there are elements which are in A, but are not in πn(A ⋈ B). So if we take ALL of the elements which are not in πn(A ⋈ B) we will find some which are in A, and some which are not. In other words:
πn(¬'(A ⋈ B)) ∩ A ≠ ∅

Let us refer to πn(A ⋈ B) with the set D. To remove from A those elements which appear in both D and A we can write:
A ∩ πn(¬'(A ⋈ B))

If we look at the construction of this, we will see that the result is a subset of A (since A ∩ S ⊆ A), and that all rows which the selection operator which would have matched A to B have been removed.

To confirm this, we can looking at the opposite construction:
A ∩ πn(A ⋈ B) = πn(A ⋈ B)
πn(A ⋈ B) ⊆ A

Remember that by definition:
¬'(A ⋈ B) ∪ (A ⋈ B) = U'n x U'm
¬'(A ⋈ B) ∩ (A ⋈ B) = ∅

Which leads to:
πn(¬'(A ⋈ B) ∪ (A ⋈ B)) = πn(U'n x U'm)
πn(¬'(A ⋈ B)) ∪ πn((A ⋈ B)) = U'n

(This step relies on the distributivity of πn(...). This is easily provable, but I haven't done it here).

πn(¬'(A ⋈ B)) ∩ πn((A ⋈ B)) = ∅

Since we know that (A ⋈ B) is the inner join, then this defines all the rows where A and B match according to the selection operator. So πn((A ⋈ B)) defines all those rows from A which the selection operator would match against B. Since:
πn(¬'(A ⋈ B)) ∩ πn((A ⋈ B)) = ∅

Then we know that πn(¬'(A ⋈ B)) contains all the rows from A which did not match the selection operator, plus all the rows in U'n which are not in A.

This then defines our difference operator:
A - B = A ∩ πn(¬'(A ⋈ B))

Complement Operator
While the above definition is all quite sound, it assumes that the result of ¬'(...) is readily available. However, this complement is based on one or more cartesian products of the entire statement store against itself. When applied to database of decent size, this result can easily get too large to work with. Instead, a difference can be easily worked out by removing rows from the first set which match rows on the second set. This will be at least as fast and memory efficient as the current inner join.

So why have this pseudo-proof description if the operations described are not going to be implemented? Well I have three reasons. The first is that it provides a sound mathematical basis for the operation, and this helps us to recognise problems from the outset. The second reason is that it provides a full justification of the operation for anyone skeptical about its operation, such as SR. The final reason was to determine just what it is about the complement achieved with the excludes() operator that was not working for us.

No comments: