Proof Reading
Once again it's way too late to spend any further time tidying this up. I still haven't proofread yesterday's blog! (Addendum: I have now.)
In the meantime, I hope that it's not too difficult to read.
N3 Once Again
The classpath problem was resolved quickly this morning, allowing me to finish the N3 importing.
First of all, blank nodes were being parsed correctly, but an exception occurred when a URIReference was created from them. This was because subject nodes were not being checked to see if they were blank before being treated as URIs. Once that check was in place the load worked fine.
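The check itself is tiny. Roughly this shape (the types here are invented for the sketch, not the actual Kowari or Jena classes):

    import java.net.URI;

    // Illustrative only: an N3 subject token must be tested for blankness before
    // a URIReference is built from it. The types below are invented for the sketch.
    class SubjectCheckSketch {

        interface Node {}

        static class BlankNode implements Node {
            final String label;
            BlankNode(String label) { this.label = label; }
        }

        static class URIReference implements Node {
            final URI uri;
            URIReference(URI uri) { this.uri = uri; }
        }

        // N3 blank nodes look like "_:b0"; everything else is treated as a URI here.
        static Node subjectFor(String token) {
            if (token.startsWith("_:")) {
                return new BlankNode(token.substring(2));   // the old code fell through to the URI case
            }
            return new URIReference(URI.create(token));
        }

        public static void main(String[] args) {
            System.out.println(subjectFor("_:b0").getClass().getSimpleName());                       // BlankNode
            System.out.println(subjectFor("http://example.org/camera").getClass().getSimpleName());  // URIReference
        }
    }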
So then I started a series of tests. I took the Jena camera ontology and loaded it into a model called "camera". I wanted to look at blank nodes, and this RDF file contains several. Then I saved it out as N3 to a file named "camera.n3", and was able to manually confirm that everything looked fine.
Next, I created a new model and named it "c2". I used the new N3 parser to load camera.n3 into the c2 model. Then I exported it to disk as "c2.n3" and compared it to camera.n3. Unfortunately, they were identical.
At first glance it would be a good thing for these files to be the same, as they represent the same data. However, when a document containing blank nodes is loaded, a new node must be allocated for each blank node, since blank node labels only have meaning within the document they came from. The fact that the original nodes were being reused was a bug.
I mentioned the problem to AN, and this helped me to think about what the problem was. I realised that the caches which are used to map blank nodes must not be getting cleared out. A quick check confirmed this, and so I added calls to the clear methods on these caches within the startDocument method of the parser's event handler. This fixed it.
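For anyone curious, the fix amounts to something like this (the class and field names are made up for illustration; the real event handler does a lot more):

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative sketch of the fix: the map that allocates local nodes for blank
    // node labels must be cleared at the start of each document, otherwise a second
    // parse re-uses the nodes allocated for the first one. Names are invented here.
    class N3EventHandlerSketch {

        private final Map<String, Long> blankNodeCache = new HashMap<String, Long>();
        private long nextNode = 1;

        // called by the parser at the start of every document
        public void startDocument() {
            blankNodeCache.clear();   // this call was the missing piece
        }

        // every distinct label within one document maps to one freshly allocated node
        public long nodeFor(String blankLabel) {
            Long node = blankNodeCache.get(blankLabel);
            if (node == null) {
                node = nextNode++;
                blankNodeCache.put(blankLabel, node);
            }
            return node;
        }

        public static void main(String[] args) {
            N3EventHandlerSketch handler = new N3EventHandlerSketch();
            handler.startDocument();
            System.out.println(handler.nodeFor("b0"));   // 1
            handler.startDocument();
            System.out.println(handler.nodeFor("b0"));   // 2 - a new node, not a re-used one
        }
    }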
After N3 importing and exporting was working I started downloading the entailment tests from the RDF test page. It was a little annoying downloading each in turn, and this prompted me to check further down the page to find all the tests in a zip file.
Inferencing Goals
While swimming tonight I gave some thought to Bob's suggestion that I write a couple of pages on a real world example of what I'd like to achieve. The trite answer is to base it on the sorts of things we would ultimately like to do at work. The general idea here is to be able to take disparate databases, and to infer new and interesting information.
Maybe we'd like to say that a person "A" who is considered a threat, is a known associate of person "B", who traveled on the same flight as person "C" who works for a weapons manufacturer. Frankly, for this kind of inference I'm really reaching, particularly as I don't even know exactly what security experts would really want to see. Of course, we'd like to be able to provide the ability to do inferences like this, but if I try to come up with inferences of this type, then I run a much greater risk of choosing something which isn't really useful.
To start with, I'd need to show that a particular ontology can be mapped into RDF. If not, then I need to pick a different system. I don't have to actually map such an ontology into RDF, but I need to show that in principle it can be. Once I am confident that the information I know is in RDF, then I need to determine if it would be possible to infer information that I didn't originally know.
So I started thinking of the kind of inferences that I've personally made in the past. My goal then would be to create an inferencing system which can infer the same things I was able to. The first thing to come to mind was the fact that the light of a supernova is preceded by a burst of neutrinos.
I was once describing the process of a supernova to a friend (in fact, it was AM, some years before he came to work at Tucana). I explained how a collapsing star has so much gravity that its inward rush is not stopped by electron degeneracy, but just continues on through it. Now if it continues to collapse through electron degeneracy, then the electrons simply can't continue to exist in their current form, due to the Pauli exclusion principle (electron degeneracy is the boundary of the Pauli exclusion principle where every possible electron state for the volume is being occupied). The way the star can continue collapsing is to merge its electrons with their corresponding protons (remember, stars are mostly hydrogen, which is just an electron-proton pair). Merging like this creates a neutron, and a neutrino. With all these neutrons being created, eventually neutron degeneracy is reached, and this is normally strong enough to make the whole collapsing star "bounce" back out. This is the explosion that we call a supernova. (If it kept going through neutron degeneracy then the star would become a black hole).
This process had been described to me in an astrophysics subject without any mention of neutrino generation. But as I mentioned the electron-proton reaction part of the process to AM, I inferred the part about the emitted neutrino, as my knowledge of that reaction told me that neutrinos had to be generated. At this point I realised that during the collapse preceding a supernova, an enormous number of neutrinos must be created. The collapse takes place over several days (remember that stars are huge, so the speeds involved are still staggering), so neutrinos will be produced for the whole of this period. This reminded me of a media story I'd once heard about how the world's neutrino detectors would always show higher activity a few days before a supernova. Now I know why.
So the knowledge of the process of a supernova, the types of particles which make up a star, the rules governing the reactions of these particles, and the types of reactions which would occur during a supernova, all let me infer that a neutrino burst would precede a supernova by a few days. To a better physicist this may have been obvious, but for me it was a significant inference. If I can make a computer able to infer something like this, then I believe I'll have accomplished something significant. My immediate task is to convince myself that this is possible with RDF, with an inferencing system that is either OWL or an extension thereof.
Can I build the knowledge of the supernova process into RDF? I believe so.
I'd be describing a class called "Star" which is a subclass of "Stellar Body" (which is in turn a subclass of "Body"). It would have several properties, including size, which is the determining property for working out if it will go supernova (too small and it settles into a white dwarf, too large and it becomes a black hole). So the property of having a "supernova event" at some point in a star's existence can be inferred reasonably easily. The event itself has a sequence of sub-events. These involve the collapse to electron degeneracy, the collapse to neutron degeneracy, and the bounce back. The collapse to neutron degeneracy has an associated electron-proton particle reaction. So I can trace this route of relationships down to stars of a certain size having a particle reaction as an event in their existence.
This then leads to an ontology for particle reactions. The electron-proton reaction has a neutrino produced. Linking the two ontologies shows that a star of a certain size will emit a neutrino at some point in its existence. In fact, going back to the size of the star, the number of electron-proton pairs can be inferred, and this can then lead to the number of neutrinos which can be emitted. If an ontology says that any emission over about 10^100 (I pulled that number out of the air here) can be called a "burst", then it will be possible to infer that a star with the right properties will emit a burst of neutrinos.
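To convince myself that at least the structure can be written down, here's a rough sketch using Jena (the Jena 2 packages are assumed, and every URI, class and property name here is invented on the spot; I haven't tried to run it):

    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.rdf.model.ModelFactory;
    import com.hp.hpl.jena.rdf.model.Property;
    import com.hp.hpl.jena.rdf.model.Resource;
    import com.hp.hpl.jena.vocabulary.RDFS;

    // Rough sketch only: the namespace, class names and property names are all
    // made up here, just to show the shape of the ontology.
    public class SupernovaSketch {

        static final String NS = "http://example.org/astro#";

        public static void main(String[] args) {
            Model m = ModelFactory.createDefaultModel();

            // the class hierarchy: Star -> Stellar Body -> Body
            Resource body = m.createResource(NS + "Body");
            Resource stellarBody = m.createResource(NS + "StellarBody");
            stellarBody.addProperty(RDFS.subClassOf, body);
            Resource star = m.createResource(NS + "Star");
            star.addProperty(RDFS.subClassOf, stellarBody);

            // properties linking stars to events, events to sub-events and reactions
            Property hasEvent = m.createProperty(NS, "hasEvent");
            Property hasSubEvent = m.createProperty(NS, "hasSubEvent");
            Property hasReaction = m.createProperty(NS, "hasReaction");
            Property produces = m.createProperty(NS, "produces");

            // the chain: star -> supernova event -> collapse to neutron degeneracy
            //            -> electron-proton reaction -> neutrino
            Resource neutrino = m.createResource(NS + "Neutrino");
            Resource epReaction = m.createResource(NS + "ElectronProtonReaction");
            epReaction.addProperty(produces, neutrino);

            Resource neutronCollapse = m.createResource(NS + "CollapseToNeutronDegeneracy");
            neutronCollapse.addProperty(hasReaction, epReaction);

            Resource supernova = m.createResource(NS + "SupernovaEvent");
            supernova.addProperty(hasSubEvent, neutronCollapse);

            star.addProperty(hasEvent, supernova);

            m.write(System.out, "N3");
        }
    }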
I need to break it down more than that, and then I need to work out what kind of operations would need to be done to allow this sort of inference, but intuitively I feel confident that it can be done.
Indexing Numbers
While building and testing Kowari today I started giving serious thought to the concept of our indexes as vectors in a space.
When Kowari started we had 3 indices. Our indices were actually orderings of statements, based on the first column, followed by the second, and finally the third. By having 3 indices we were able to define a constraint as 1, 2 or 3 nodes, and there would always be an index which would provide the resulting statements as a contiguous group. The start and end of this group can be found with a binary search, meaning that the data can always be found in O(log n) time. I've discussed this here in the past.
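Just to illustrate the principle (this is nothing like the real index code, only the effect of the ordering), a sorted array and a couple of binary searches behave like this:

    import java.util.Arrays;
    import java.util.Comparator;

    // Toy example only: statements kept sorted subject-predicate-object, so that
    // every statement with a given subject sits in one contiguous block, and the
    // ends of the block are found with binary searches in O(log n).
    class OrderedIndexSketch {

        static final Comparator<String[]> SPO = new Comparator<String[]>() {
            public int compare(String[] a, String[] b) {
                int c = a[0].compareTo(b[0]);
                if (c != 0) return c;
                c = a[1].compareTo(b[1]);
                if (c != 0) return c;
                return a[2].compareTo(b[2]);
            }
        };

        public static void main(String[] args) {
            String[][] index = {
                {"a", "knows", "b"}, {"b", "knows", "c"},
                {"a", "name", "Alice"}, {"c", "name", "Carol"},
            };
            Arrays.sort(index, SPO);

            // bracket the block of statements whose subject is "a" by probing with the
            // smallest and largest possible keys for that subject; the probes are never
            // real statements, so binarySearch always returns an encoded insertion point
            int lo = -Arrays.binarySearch(index, new String[] {"a", "", ""}, SPO) - 1;
            int hi = -Arrays.binarySearch(index, new String[] {"a", "\uffff", ""}, SPO) - 1;

            for (int i = lo; i < hi; i++) {
                System.out.println(String.join(" ", index[i]));
            }
        }
    }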
Moving on to 4 nodes (subject, predicate, object and meta) meant that we needed to use a minimum of 6 indexes to be able to quickly find a contiguous group of statements. But where did that number 6 come from? How many indexes do we need for a large number of columns? Does this represent some larger structure that we might be able to take advantage of? I've come up with some of the answers to these questions.
For a given system of N nodes in a single statement (N=4 for Kowari), there will be a need to search when given a constraint of 1 node, 2 nodes, 3 nodes, and all the way up to N nodes (searching on a constraint of N nodes is the equivalent of doing an "exists" operation on a statement). It turns out that a search done with S nodes in the constraint defines a minimum number of indexes, which I'll call minS. Finding the maximum value of minS as S varies from 1 to N gives the minimum number of indexes required to completely search the statement space. The indexes chosen to meet this figure will need to cover all the cases of the other values of S.
Note that I have not proven that a set of indexes which meets the maximum minS will always be able to meet the index requirements for all other values of S. However, the selection of indexes to search on S constraints must always take into account the other values for S. It appears that this is always possible, and after gaining the experience of choosing a few sets of indexes it seems that this would indeed be provable, particularly when S is less than the S for the maximum minS.
The value for minS for statements of N nodes can be determined with some simple combinatorial algebra, and it comes down to:
minS = N! / ((N-S)! S!)   (the binomial coefficient C(N,S))
(where is MathML when you need it?)
These maximum values are shown here for the first few values of N. The corresponding value for S is in brackets:
N    max minS   (S)
2        2      (1)
3        3      (2)
4        6      (2)
5       10      (3)
6       20      (3)
7       35      (4)
8       70      (4)
9      126      (5)
10     252      (5)
11     462      (6)
12     924      (6)
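The table practically writes itself, since each entry is just the largest binomial coefficient for its row. Something along these lines will reproduce it (a sketch of the calculation only, not the actual program I used):

    // Sketch: for each N, the minimum number of indexes is the largest value of
    // C(N,S) = N!/((N-S)!S!) as S runs from 1 to N.
    class MinIndexTable {

        static long choose(int n, int s) {
            long result = 1;
            for (int i = 1; i <= s; i++) {
                result = result * (n - s + i) / i;   // exact division at every step
            }
            return result;
        }

        public static void main(String[] args) {
            for (int n = 2; n <= 12; n++) {
                long best = 0;
                int bestS = 0;
                for (int s = 1; s <= n; s++) {
                    long c = choose(n, s);
                    if (c >= best) {        // on a tie the table lists the larger S
                        best = c;
                        bestS = s;
                    }
                }
                System.out.println(n + "\t" + best + "\t(" + bestS + ")");
            }
        }
    }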
Hyperspace
Yesterday I discussed how a set of statements on N nodes forms an N-hypergraph. Working with this supposition, one could assume that an index (of the type described already) forms a vector within an N-hyperspace. Perhaps category theory will show that this mapping into topology will let us determine something about our indices that we have not already seen.
Suppose that each index forms a unit vector, which is normal to a hyperplane. In 3-D (which is where normal RDF stands) these planes then form a cube. The three indices of:
subject-predicate-object
predicate-object-subject
object-subject-predicate
form the normals of 3 sides of this cube. The opposite set of useful indices:
object-predicate-subject
predicate-subject-object
subject-object-predicate
form the normals of the other 3 sides to the cube.
The dimensions here are the 3 labels (subject, predicate, object), and the index's vector is along the dimension of the first label. The curl of that vector corresponds to the remaining dimensions, and so subject-predicate-object is along the same line as subject-object-predicate, but in the negative direction. So the curl of a vector describes a path through the nodes of a statement. We can also see that the first three vectors are in the opposite direction to the second three vectors, and are functionally equivalent. This is why either set of indices can be used.
This is all fine for 3 dimensions, but does it hold for 4 dimensions and above? The answer seems to be yes.
In 3 dimensions we have 3 faces of a cube which define all our required indices. In 4 dimensions we have a hypercube, which has 24 faces. Since each face has 3 "opposite" faces, there are 4 functionally equivalent vectors normal to those faces. 24/4 = 6, which is the number of indices needed for statements of 4 nodes.
For 5 dimensions we no longer want to consider the number of faces on the hypercube. Going back to the maximum minS table, note that the S value for N=3 and N=4 was 2, but for N=5, S=3. Remember that S corresponds to the number of nodes to constrain on, which gets mapped to the number of dimensions we are searching on. By moving up from 2 dimensions to 3 dimensions we are instead looking for the number of opposing 3D volumes defined by the hypercube, rather than the opposing 2D planes. In 5 dimensions the hypercube has 40 volumes. Dividing this by the number of opposing volumes (4) gives the 10 indexes required.
This pattern continues. For any hypercube of dimension N, take the S value for N from the table above, use that to select the number of dimensions of the "sides" of the hypercube you are interested in, and find out how many of those "sides" the hypercube has. The required number of indexes shown above will divide evenly into that number, the factor being a power of 2.
The number of sides of various dimensions can be easily determined by following the instructions here. The first few are shown below. The first column is the vertices, the second is the edges, the third is the faces, the fourth is the volumes, the fifth is the 4-volumes, and so on:
N:  vertices  edges  faces  volumes  4-volumes  ...
0: 1
1: 2 1
2: 4 4 1
3: 8 12 6 1
4: 16 32 24 8 1
5: 32 80 80 40 10 1
6: 64 192 240 160 60 12 1
7: 128 448 672 560 280 84 14 1
8: 256 1024 1792 1792 1120 448 112 16 1
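These counts follow a simple rule: an N-dimensional hypercube has C(N,m) x 2^(N-m) elements of dimension m. A few lines like the following will reproduce the table (again, only a sketch of the calculation):

    // Sketch: an N-cube has C(N,m) * 2^(N-m) elements of dimension m
    // (vertices for m=0, edges for m=1, faces for m=2, volumes for m=3, ...).
    class HypercubeElements {

        static long choose(int n, int m) {
            long result = 1;
            for (int i = 1; i <= m; i++) {
                result = result * (n - m + i) / i;   // exact division at every step
            }
            return result;
        }

        public static void main(String[] args) {
            for (int n = 0; n <= 8; n++) {
                StringBuilder row = new StringBuilder(n + ":");
                for (int m = 0; m <= n; m++) {
                    row.append(' ').append(choose(n, m) << (n - m));   // times 2^(n-m)
                }
                System.out.println(row);
            }
        }
    }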
Programs
In case you've made it this far: yes, the two tables above were generated with code that I wrote. Both are pretty trivial, but I might post them if I get some time.
In the meantime I have to get to bed.