Numbers RDF
Just in case anyone is interested, the little C++ program I have for generating RDF for cardinal numbers is now up.
Don't expect too much of it, as I never intended it for public consumption. Hence, it's important to read the README. Otherwise you might end up wondering why it just seems to sit there doing nothing...
It writes to stdout, but takes a while to initialise its set of prime numbers (a long while on a slow computer, or about 5 minutes on a fast one). A USR1 signal makes it close all XML tags and exit early. That feature may be important for you, as it will take days to run to completion, and it's unlikely you need a file quite that large.
I recommend piping the output into gzip.
Update (3 years later)
This link has long gone stale. I mention it again here, and put up a link to a more recent version. Some of the OWL was changed in this file, due to updates in the OWL spec (owl:sameIndividualAs is no longer a relationship), and the fact that I re-worked the structure for myself. The first version was based on someone else's description as I hadn't learnt OWL at that point, but I updated it once I learnt it for myself.
Monday, May 31, 2004
Resolver SPI
After getting through lots of correspondence today I got back to reading the Resolver SPI. Since I was starting with the interfaces I thought that the Javadoc would be the place to start, but it turns out that it wasn't generating properly. Some fixes by SR, a few iterations on clean rebuilds, and they're going again.
Kowari
AN got back from WWW 2004 today. Some encouraging feedback about Kowari has come out of the conference, and part of today was spent discussing this. While AN was away I've had some other ideas on what we should be doing, and I know that he has had all sorts of inspiration from the conference, so we still have a lot to discuss. In anticipation of this forthcoming conversation I spent a few hours going over his Ontology plan for Kowari.
One of the comments presented at the conference was that Kowari handles billions of triples; which it does. This is one of the reasons I'm proud of the system, given that it was DM and I who designed and built it with that kind of scalability in mind. Just forget that both of us want to reimplement it with a different structure. :-)
All the same, people have wanted to see this claim in action, but few people have that many triples lying about. I wrote a program a little while ago that builds RDF describing the cardinal numbers up to ~4 billion (i.e. 2^32). The information is reasonably simple: the textual description (in English); the Roman numeral notation (up to 40000); the prime factors; the square of the number; primality; triangularity; perfection (is that the name of the property which describes that a number is "perfect"?).
An example is shown in this excerpt:
<rdf:Description>
  <owl:sameIndividualAs rdf:datatype="&xsd;integer">6</owl:sameIndividualAs>
  <rdfs:label xml:lang="en">six</rdfs:label>
  <math:roman>VI</math:roman>
  <math:square rdf:datatype="&xsd;integer">36</math:square>
  <math:primeFactorization>
    <rdf:Bag>
      <rdf:li rdf:datatype="&xsd;integer">2</rdf:li>
      <rdf:li rdf:datatype="&xsd;integer">3</rdf:li>
    </rdf:Bag>
  </math:primeFactorization>
  <rdf:type rdf:resource="&math;PerfectNumber"/>
  <rdf:type rdf:resource="&math;TriangularNumber"/>
</rdf:Description>
Today I gave AN the generation code for this to hand out to those who ask. It is certainly much smaller than the generated RDF, though it takes a long time to run. Since I was expecting to be the only person to ever view or run this code, it probably isn't very pretty, and it had no documentation. Nonetheless I wrote up a README, and packaged it all up appropriately.
I should put it up somewhere, but I don't have anywhere convenient at the moment. I'll try and organise that in the morning.
JXUnit Revisited
Annoyingly, the Cruise Control server (aka "Trillian") failed on the JXUnit test for transitive queries. The error report never says why, only that it got back a response that didn't match what it expected. Since the test passed on my machine I'm guessing that it is due to the result being ordered differently. I'll get to that in the morning.
JRDFmem
Everything is working, tested, and building Javadoc just fine. There was one little hiccough though. I was going to use XML output from JUnit, but my classpath was missing some XML functionality. It had Xerces, but I couldn't find what else it needed. Rather than hunt it down, I decided to save time and just went with "plain" text output. It's more legible that way anyway.
My current plan is to make the Graph serializable, and to change the JRDF interface a little, before going onto the new triplestore code. I'm going to put transactions off for the moment, until I work out how I want to merge the ideas of javax.transaction and transactions-as-graphs.
The changes I'd like to make to JRDF involve moving the reifyStatement methods from NodeFactory and into Graph. AN has already expressed that he likes it, but he was jetlagged, so I'm holding off until he gives approval whilst compos mentis.
Posted by Paula at Monday, May 31, 2004 0 comments
Friday, May 28, 2004
JXUnit Tests
This morning I discovered just why the JXUnit tests are taking so long.
The first time I ran the tests I discovered that they were running slowly, so I chose not to do a "clean" before running them each time. This was in the belief that doing a clean build would result in a lot of data being unnecessarily loaded again. Unfortunately the opposite is true. Each time the tests are run, they drop any models they need, and then load whatever test data they use. One of the tests drops and then reloads a Wordnet RDF file.
The current Kowari drops models by individually removing each statement, meaning that dropping large models is very time consuming. In fact, since so much effort has been put into loading large amounts of data quickly, dropping a large model is significantly more time consuming than loading it. This meant that dropping and reloading Wordnet took over 20 minutes. Since model dropping has come up as an issue, TJ is considering improving it, and DM has started considering what is needed to make it happen.
The way to get these tests happening quickly was to blow the database away with a clean build, and then run the tests again. This eliminated the need to drop the Wordnet data, saving about 20 minutes on my computer.
Testing all went well, and is now checked in. I've had to restrict the tests to what the trans() queries currently return. I need to confirm with AN whether I should add in non-inferred statements (allowing tautologies to be removed by the current duplicate removal mechanisms), or if I should only return inferred statements instead and remove all tautological inferred statements.
The easiest way to remove tautologies would be to implement a difference operation. I believe that SR wants to implement this with an "AND NOT" combination, so it could be worth holding off until that is available. I'll talk to AN about this next week.
TKS Beta 2
I sat in on a meeting on the progress of the next TKS Beta release. This is so I can keep up with the other areas of the system that I'm not currently working on. I walked out of that meeting having been assigned a role helping get distributed queries into the next release. That will teach me to attend meetings indiscriminately.
My role is going to be relatively minor, and for the moment it may just be restricted to testing. I am nonetheless pleased to get the chance to work on distributed queries, as some of the original plans for it were my own.
I spent some time today learning the new Resolver SPI, as this is the basis of the forthcoming distributed query engine. I still have a way to go.
UQ and QUT
I got in touch with academics from both institutions today. Looking at their web sites, there seems to be no emphasis on the Semantic Web, but I'm hopeful that someone has enough of an interest to be willing to supervise a prospective candidate.
Posted by Paula at Friday, May 28, 2004 0 comments
Thursday, May 27, 2004
Testing
Reasonably simple day, which is just as well since Luc made last night a little difficult. Anne decided to try "controlled crying". If only she'd decided to do that on a Friday night! I could have survived a Saturday without sleeping much easier than a Thursday.
Today was just a matter of getting the JXUnit tests right. I had a short day today, so the tests took up most of it. While the tests were running I spent the downtime re-reading papers on TRIEs. DM also spent some time going over the indexing and file structure found in Lucene. He commented on the advantages of using skip lists, since these take advantage of the efficient nature of sequential reads from hard drives. He now wants to try a triplestore which uses skiplist indexes. I'm going to stick to hashtables for my own plans. It will be good to have differing implementations to compare.
TRIEs
As I've already mentioned, I'm planning on using a TRIE as the basis of my stringpool. Almost every description of TRIEs refers to them being an index, rather than a data store. I have a couple of ideas for using a suffix TRIE for the storage of the strings, but I'm not sure how efficient they will be. I could always use a simple linear list of strings, with a TRIE index providing indexes into the file, but that will lead to problems with deletions. On the other hand, strings don't have to be removed from the stringpool, and cleaning out the TRIE during a deletion would be quite slow. I might get a system working that does insertions only, and then consider my options from there.
One hassle with using something like a TRIE is that each node has n children. This can be done with a fixed number of c children references, and a "next node" reference to form a linked list when the number of children exceeds c. It's not as efficient or as fast as a fixed-size tree node, but it is more flexible.
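As a rough sketch of that node layout (in Java rather than the C I'd actually use, and with made-up names and sizes):

// Sketch of a TRIE node with a fixed block of child slots plus an overflow
// link. Overflow nodes are used purely as extra child blocks for the same
// parent. A real string pool node would live in a file, not on the heap.
final class TrieNode {
    static final int C = 4;                    // fixed number of child slots

    final byte label;                          // the byte this node matches
    final byte[] childLabels = new byte[C];
    final TrieNode[] children = new TrieNode[C];
    int childCount = 0;
    TrieNode overflow;                         // "next node" when children exceed C
    long stringOffset = -1;                    // offset into the string file, if a string ends here

    TrieNode(byte label) { this.label = label; }

    // Find the child matching b, walking the overflow chain if the fixed slots are full.
    TrieNode findChild(byte b) {
        for (TrieNode block = this; block != null; block = block.overflow) {
            for (int i = 0; i < block.childCount; i++) {
                if (block.childLabels[i] == b) return block.children[i];
            }
        }
        return null;
    }
}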
Study
I've been considering getting back to study, after leaving for other things two years ago. My head isn't in physics at the moment, so I'll probably steer clear of that for the moment. Instead I've been thinking of the Masters courses at QUT. I did a couple of subjects there some years ago, and it's close to where I live.
Now the easy thing would be their coursework Masters degree. I've seen the subjects, and it wouldn't be difficult - just time consuming, and probably a distraction from doing real learning. On the other hand, the piece of paper would be handy, and could help me skip some odious requirements if and when I continue on with physics research.
A research Masters would be harder though much more worthwhile. The great thing about that is that I could do it on the Semantic Web, hopefully connecting it to my work. It would provide the same benefits as the coursework degree, and wouldn't be distracting me from what I really want to do. But then again, it would be harder. :-)
I think that typing this has convinced me that I want to do the research work. After all, what's the point if it's easy and I don't push myself? Why do I do these things to myself?
Posted by Paula at Thursday, May 27, 2004 0 comments
Wednesday, May 26, 2004
State of Origin
Tonight's post will be short. State of Origin was on tonight, and my brother invited me along to watch it with a few drinks. Queensland lost in extra time - the first occasion extra time has been used.
So now I'm back home, and in no fit state to talk about much, so I won't... much.
JXUnit tests
While testing the transitive statement code last night, one of the JXUnit tests failed. After confirming with others that I was the only one with the problem, I spent all morning trying to work out why.
The problem finally showed up as a NullPointerException thrown while logging. Because I've been working in LocalQuery, I had logging turned on for this class. Some of the recent changes for Tuples objects have occasionally resulted in "Empty" sets being returned when a Tuples object has no data in it. This means that a Tuples with specified columns can claim that it has no columns if there are no rows. One of the methods which reports the contents of a Tuples was not compatible with this change, and caused the exception seen. However, it was only called when logging was enabled, which is why I was the only one to see it.
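A guard along these lines is all that's needed; the Tuples and Variable types here are simplified stand-ins for illustration, not the real Kowari interfaces:

// Stand-in interfaces, for illustration only.
interface Variable { String getName(); }
interface Tuples {
    Variable[] getVariables();   // may legitimately be empty when there are no rows
    long getRowCount();
}

final class TuplesLogger {
    // Describe a tuples for a log message without assuming any columns exist.
    static String describe(Tuples t) {
        Variable[] vars = t.getVariables();
        if (vars.length == 0) return "(no rows, no columns)";
        StringBuilder sb = new StringBuilder("columns:");
        for (Variable v : vars) sb.append(' ').append(v.getName());
        return sb.append("; rows: ").append(t.getRowCount()).toString();
    }
}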
After resolving that issue I moved on to writing my own JXUnit tests. These are going fine, though I'd like to write a few more.
I spent the last couple of hours asking questions and seeking clarifications about AN's ontology plan for Kowari. I'm finally starting to get an idea of what to do, though I don't know if everyone agrees on the approaches to take. In the meantime, I might just end up trying my own implementation over the top of JRDF, but that will just be to test some ideas. Following that path in my own time will let me do my own thing without being constrained by business issues. Hopefully that will let me get some things done that wouldn't happen otherwise.
JRDFmem
JRDFmem is coming along nicely. I have a BIG list of unit tests to perform, and only a few more to write. I'm hoping to finish them shortly. Once this is done I can get onto my new datastore.
Spent some time considering the structure of the new store today. In the first instance I won't worry about ordering at all. I realised that this won't be a problem unless I try to perform join operations on data. Without a query layer this is quite unnecessary, so I may never need it.
Since the datastore is to have a Java interface, I realised that I should work out what data I need to store in a class-owned struct on the binary side of the JNI. For a start, I'll need to store file handles. Then I realised that separate sessions will need to have their own access to the files, and I didn't want to be overusing file handles! For the index files I'll be OK, as these files will be mapped. For indexes I'll need to store the map pointer, and every instance of the session can use the same pointer.
Normal I/O is a different matter though. I can't just store a new file offset for each session, or else I will get a race for lseek and read/write file accesses between different sessions. This means I either need to lock access with a mutex, or else use multiple file handles. Locking for file access is not scalable, so I'll go with the multiple handle option. I don't like the idea of restricting the number of objects based on the number of file handles allowable, but since this is just supposed to be small, scalable and fast I won't let it bother me to start with. My other option is to provide a pool of connections, with a mutex protecting access to the pool, though this seems like overkill for the simple system I'm aiming for.
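For what it's worth, here is a rough pure-Java analogue of that layout (file names and class names are made up; the real thing would be C behind JNI): the index file is mapped once and shared, while each session gets a private handle so concurrent reads can't race on a shared offset.

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Shared mapping for the index file; per-session handles for ordinary I/O.
final class StoreFiles {
    final MappedByteBuffer index;          // shared by every session
    private final String dataPath;

    StoreFiles(String indexPath, String dataPath) throws IOException {
        try (RandomAccessFile idx = new RandomAccessFile(indexPath, "r")) {
            this.index = idx.getChannel().map(FileChannel.MapMode.READ_ONLY, 0, idx.length());
        }
        this.dataPath = dataPath;
    }

    Session newSession() throws IOException {
        // each session gets its own handle, so there is no race on the file offset
        return new Session(index, new RandomAccessFile(dataPath, "r"));
    }
}

final class Session implements AutoCloseable {
    private final MappedByteBuffer index;  // the shared mapping
    private final RandomAccessFile data;   // private handle with its own offset

    Session(MappedByteBuffer index, RandomAccessFile data) {
        this.index = index;
        this.data = data;
    }

    long indexEntry(int slot) {
        return index.getLong(slot * 8);    // absolute read: safe to share the mapping
    }

    byte[] read(long offset, int length) throws IOException {
        byte[] buf = new byte[length];
        data.seek(offset);                 // safe: no other session shares this handle
        data.readFully(buf);
        return buf;
    }

    @Override
    public void close() throws IOException {
        data.close();
    }
}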
Forgetting, for a moment, all the advantages I was expecting to get with JNI, DM was performing some tests on reading and writing strings to files, using C and Java. It turns out that we have a difference of much more than an order of magnitude, with C the clear winner, though this does vary depending on the platform. I'm now more convinced than ever that I need to create a store written in C.
I'm also starting to think about the new string pool. I need to read up on TRIEs, as it's been a couple of years. I'm curious to see if I can make one work, and how well it will go.
JRDF Transactions
Had an idea about JRDF transactions today. Create a new method on Graph called createTransaction which returns a Transaction object and can throw UnsupportedOperationException. Transaction will then define an interface that extends Graph, and introduces the methods of commit and rollback.
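As a sketch of the idea (the real Graph interface obviously carries more methods than shown here):

// Sketch of the proposed transaction idea; Graph is trimmed to a stub.
interface Graph {
    // ... the usual JRDF find/add/remove methods would be here ...

    // Returns a transactional view of this graph, or throws
    // UnsupportedOperationException if the implementation has no transactions.
    Transaction createTransaction();
}

interface Transaction extends Graph {
    void commit();
    void rollback();
}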
I'll have to run it by AN to see what he thinks. If anyone else has a comment then please email me or add it to the comments below...
Posted by Paula at Wednesday, May 26, 2004 0 comments
Tuesday, May 25, 2004
Google Queries
One of the things the tracker reports is the set of Google queries used to find this page. Most of them are obvious, but some are unusual. They seem even more unusual when you consider that the searcher actually bothered to follow the link to this seemingly unrelated page.
One that caught my eye was the following query: "write a class to represent Roman numerals. the class should have two constructors". I found it interesting for 3 reasons.
Firstly, it was written in natural English. I don't believe that Google handles this, but it is nonetheless interesting to see that it returns useful results. I've noticed that Google can return documents which don't always include all the given words, particularly when a large number of words are provided. In fact, I've found that to be a frustrating feature when I've attempted to use extra words as filters on large result sets. But I see it being quite useful for natural language queries like this.
Secondly, my page has nothing to do with this request. I wonder what inspired them to follow the link?
Finally, I find it ironic, because a couple of months ago I wrote a class that converts integers to Roman numerals! It even used Unicode for ↂ (the character for 10000) and ↁ (5000). It was in C++ rather than Java, but it would be trivial to port as it was only 81 lines, including comments and whitespace.
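The original was C++, but a rough Java equivalent of the same greedy conversion (with the Unicode characters mentioned above) would look something like this:

// Rough sketch of integer-to-Roman-numeral conversion using the Unicode
// characters for 5000 (ↁ) and 10000 (ↂ), good for values up to 39999.
final class RomanNumerals {
    private static final int[] VALUES = {
        10000, 9000, 5000, 4000, 1000, 900, 500, 400, 100, 90, 50, 40, 10, 9, 5, 4, 1
    };
    private static final String[] SYMBOLS = {
        "ↂ", "Mↂ", "ↁ", "Mↁ", "M", "CM", "D", "CD", "C", "XC", "L", "XL", "X", "IX", "V", "IV", "I"
    };

    static String toRoman(int n) {
        if (n <= 0 || n >= 40000) throw new IllegalArgumentException("out of range: " + n);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < VALUES.length; i++) {
            while (n >= VALUES[i]) {       // greedily take the largest symbol that fits
                sb.append(SYMBOLS[i]);
                n -= VALUES[i];
            }
        }
        return sb.toString();              // e.g. toRoman(6) returns "VI"
    }
}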
Posted by Paula at Tuesday, May 25, 2004 0 comments
Testing Trans
The trans statement had some bugs this morning, but they were all found and fixed. I was on the verge of checking in when a completely unrelated JXUnit test failed. Will have to chase this down before anything gets committed.
More correspondence. Yes AN, some of that was you. :-)
Inferencing
Finally reading AN's plan for implementing inferencing.
I'll confess that inferencing has me intimidated. I know we can do it, but we need it fast and scalable, and that is the tough part. While some ways of implementing it will be fast in some situations, they will suffer on other occasions. For example, inferring from a large model that all gets loaded in one hit would be most efficient if done during the commit phase of the load. A regularly changing model will suffer badly from that strategy, and would do better to have inferences made when it is queried.
AN is intent on getting it working, and then making it scalable, even if that means reimplementing. That eases my mind somewhat, as it lets me concentrate on what is involved in inferencing without being distracted by scalability issues.
The other thing that I've been considering has been how an ontology model can be used as the inferencing rules. Up until now I've been just considering the RDFS rules, and these have been best expressed with the entailment-rdf-mt-20030123.xml file. However, that won't work for ontologies. Sure we can hard code in the OWL rules (subClassOf is transitive, what minCardinality means, etc) but then the application of the ontology will be the tough part. I think I need to sit down with AN for some quality time on this. Good thing he's back next week. Hope he's not too jet-lagged!
Posted by Paula at Tuesday, May 25, 2004 0 comments
Monday, May 24, 2004
Refactoring
Today was spent refactoring the transitive query code for the new syntax. It was looking pretty messy by Friday night, with two methods doing all of the work: resolveTransitiveConstraint() and inferTransitiveStatements(). After the syntax was changed, the former ended up full of if statements trying to handle the separate cases of anchored and unanchored queries, with the result that the code was becoming illegible. This was principally due to the method needing to return a tuples with either 1 or 2 columns depending on whether or not the transitive constraint was anchored.
Today's effort pulled the first method apart into two separate methods: resolveUnanchoredTransitive() and resolveAnchoredTransitive(). These are now significantly simpler. Making all the changes in these methods also allowed inferTransitiveStatements() to remain almost completely unchanged.
Correspondence
The above doesn't look like that much work, and it wasn't. I hadn't checked my email at all over the weekend, and it seemed I had quite a bit to attend to today. Other than the obligatory spam, I also had quite a few questions on RDF stores. These came from both internal and external sources, and there were even a few external emails which had been forwarded to me internally. I'm on top of it all now, though it means I have quite a bit of unit testing to attend to tomorrow.
Printer Drivers
Other than familial commitments, one thing keeping me from coding at home over the last few days has been our new printer. It's an Epson Stylus Photo R310. I've been very impressed with it so far when printing via USB from Anne's iBook running OS X. The print quality is superb, and the price of the ink is better than most.
However, configuration of this thing has been driving me crazy.
OS X uses CUPS to print, and the GUI setup was pretty easy to follow. That was good, as it allowed me to check out the resulting text configuration files knowing that they worked. The reason for this was so I could change them to use Bluetooth as the output rather than the USB port, since the printer supports Bluetooth.
CUPS expects to use several filter programs (read from stdin, write to stdout) to process the data appropriately for the printer, and then output it to the correct device. The device to be used is defined in the file /etc/cups/printers.conf with the line DeviceURI. For USB on Linux it might be:
usb:///dev/usb/lp0.
For Bluetooth it could be:
bluetooth://00:09:5B:03:0B:D4
(no, that's not the real address of the printer - I just used a handy Mac address as an example).
The printers.conf file also describes a printer definition file (PPD file) to use for the printer. These are in the /etc/cups/ppd directory. This file describes all the capabilities of the printer, and importantly for CUPS, it also describes the name of the final filter to use to generate instructions for the printer.
CUPS provides an Epson PhotoStylus driver that fits into this system neatly, but it doesn't handle the full resolution of the printer, and it is very slow. I've been able to get this to work on both OS X and on Linux, but the image quality has been disappointing.
Fortunately, Epson themselves provide drivers for this model of printer, in the form of a PPD file and a filter. However, these drivers do not conform to normal CUPS standards, as the filter does not write to stdout. Instead, on OS X it goes straight to the USB port (the DeviceURI line is set to file:///dev/null). This means that on the iBook there is no way to use this driver with the Bluetooth adapter!
Linux is not so bad. Here the filter writes to a pipe via the normal stdout mechanism, and this pipe is then read by a daemon that writes to the USB port. The reason for this is so the daemon can also listen on a network socket for utility programs that want to check ink levels, or other non-standard tasks. This all seems to work fine, except that the filter provided does not work!
Fortunately the filter comes with source code. After spending some time debugging I've been able to work out that it is not reading the PPD file correctly, and it exits accordingly. I don't know anything about the Epson printer control language, so I was grateful to see that the problem is in an area I might be able to fix.
Anyway, time has been a precious commodity around here lately, so I haven't finished looking at this yet. I'll spend a little time on it tonight, before getting back to JRDF.
Posted by Paula at Monday, May 24, 2004 0 comments
Friday, May 21, 2004
Transitive Syntax
I didn't get to finish the changes today, as I got sidetracked onto other issues. The coding I was able to do is still in progress.
The previous syntax took two constraints. The first was the predicate to use surrounded by two variables, and the second defined the anchor point, using the same predicate, a single variable, and a value for the subject or object. Since the second constraint matches the first, it makes sense to remove the first. Unfortunately, that leaves only one defined variable, and this changes the whole shape of the tuples being used.
Yesterday's code had a couple of if statements and variables to handle the difference between anchored and non-anchored queries. The difference is now getting so great that I think it will be much clearer to separate the methods.
Since anchored queries still need to get the list of all statements containing a predicate, they still need to do queries based on constraints with variable subjects and objects. This query was being provided for free with the previous syntax, but now it has to be built. We only have a single variable, so we need to create the other. The name I'll use is the name of the existing variable with "_" appended. While it might make the code seem a little obtuse, it is guaranteed to have a different name.
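In other words, something like this (the Variable class here is just a stand-in for illustration, not necessarily the real query-object API):

// Stand-in sketch: derive a second variable whose name is guaranteed to
// differ from the existing one by appending "_".
final class Variable {
    private final String name;
    Variable(String name) { this.name = name; }
    String getName() { return name; }

    // The derived name can never equal this variable's own name.
    Variable walkVariable() { return new Variable(name + "_"); }
}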
RMI
TJ had some problems with a memory leak today. Using OptimizeIt told him that Answer objects were being kept by the RMI server. I then remembered that SR had created a server-side Set which he added all Answers into, to prevent them from being garbage collected while they were still in use. The ORA book on RMI says that UnicastRemoteObject.exportObject() will keep a reference to all exported objects, meaning that we don't need to go to special pains to prevent them from being garbage collected. I saw no equivalent statement in the Javadoc for this method, but I assumed it was true. Based on this, the set holding the objects should not be needed. Once I told TJ about this, he removed the set and found that he was using much less memory.
After further examination TJ found that sessions were suffering the same leakage problem. Sure enough, there is a Set for them as well. However, after removing this Set, a number of RMI failure errors appeared. These errors were being caused by the session objects being garbage collected, before they were finished with. This didn't make sense, as the previous set was never read, and the session objects had been exported, which should have kept them in memory.
Further investigation showed that RMI references to Answer objects were being returned as the result of a method, but this was not how RMI references to Session objects were being returned. These objects were being wrapped in an instance of a serializable class and then that object was returned. Somehow this meant that the object was not considered to be in use, and was a candidate for garbage collection.
The fix was to inform sessions of the SessionFactory object from whence they came, and to give the factory a Set of all sessions it has served out. Then during the close() method call on the Session object, UnicastRemoteObject.unexportObject() and Set.remove() were called on the object. The Set kept the reference, and dropped it when it was no longer needed.
The reason that the RMI server seems to miss these objects (and hence, allow them to be garbage collected) might be because it never saw the RMI remote references being used. The server need only keep references to registered objects that are accessible. If a remote reference is made for an object but never used, then the server should be able to clean that object up, as it is inaccessible remotely. By wrapping the remote reference in a serializable object and returning that object, the RMI server never sees the reference going over the wire, so it assumes that the reference must have been unused.
The solution would be to return the remote reference directly, or keep an internal reference to each object, in order to prevent garbage collection on them. We chose the latter, as the former would have involved significant changes to a few interfaces. The latter option just added a Set to the SessionFactory and a couple of extra methods. Having the set of active sessions also lets TJ run some metrics on the system which was another reason to go that way.
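A stripped-down sketch of the pattern we ended up with (the interface and class names here are illustrative, not the actual TKS classes):

import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.server.UnicastRemoteObject;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

interface RemoteSession extends Remote {
    void close() throws RemoteException;
}

final class SessionFactory {
    // Strong references stop the sessions being garbage collected while a
    // client still holds a wrapped remote reference to them.
    private final Set<RemoteSession> activeSessions =
        Collections.synchronizedSet(new HashSet<>());

    RemoteSession newSession() throws RemoteException {
        SessionImpl session = new SessionImpl(this);
        UnicastRemoteObject.exportObject(session, 0);   // export on an anonymous port
        activeSessions.add(session);
        return session;
    }

    void release(RemoteSession session) {
        activeSessions.remove(session);
    }

    int activeSessionCount() {                          // handy for metrics
        return activeSessions.size();
    }
}

final class SessionImpl implements RemoteSession {
    private final SessionFactory factory;

    SessionImpl(SessionFactory factory) { this.factory = factory; }

    @Override
    public void close() throws RemoteException {
        UnicastRemoteObject.unexportObject(this, true); // force unexport
        factory.release(this);                          // now it can be collected
    }
}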
Posted by Paula at Friday, May 21, 2004 0 comments
Thursday, May 20, 2004
Anchored Queries
Anchored queries worked this morning as expected. Funnily enough, I had thought they would work one particular way, but when I ran it I saw that they work a bit differently. Upon reflection I realised that they were working exactly as designed, I just hadn't realised that I designed them that way!
Spoke to TJ about how he wanted them working, and it turned out that I'd built it exactly as required. It seems that my subconscious designed it for me. I must really need sleep. I suppose late night coding and 3 month old babies don't mix too well.
After typing in a heap of examples to test this new query type I realised that parts of the syntax are redundant, so I spent some time cleaning it up. Unfortunately, the changes were structural in the query object, with consequences running quite deep, so it took a while. However it all seemed to be working when I finished tonight. I'll check it more thoroughly in the morning.
Another hassle was caused this morning when I tried to check the code in. Before checking in, I did an update, and ran the tests. Only the transitive queries failed after doing the update. This was due to Tuples objects being changed, such that they may not necessarily have any columns if they have no rows. My code was assuming that the columns would be there (after all, I did ask for them to be there). It used to work!
Once this was fixed I had to update again, discovered new changes, so ran the tests, updated again, discovered new changes so ran the tests... Sometimes the simple task of checking code in can be quite painful and time consuming. One person suggested that I should just check my code in without running the tests. Obviously he's not the first person in the office to think of that! :-)
Inferencing
It's late, so I'll just mention some random thoughts here...
While TJ often suggests it, inferencing can't really be done for each query. The task is just too big to make it fast. That probably means that we need to do inferences after every commit. That's fine for large transactions, but will hurt when autocommit is on. We may need to offer an option to suspend inferencing for a while.
Inferences can be made with existing iTQL queries. Do we implement them in this way? The transitive predicates (subClassOf, subPropertyOf) are definitely implemented an order of magnitude faster with the new code. So should we consider this for the other rules as well?
Doing inference rules in iTQL is easy, particularly since the XML which is used for RDFS can be translated into iTQL so easily. However, the speed improvements available from doing most, or all, of it internally would seem to outweigh this. Besides, Jena has gone the way of providing Ontology support over the top of the existing interfaces, and while it works, it doesn't scale.
Once we have the inferred data, then where do we store it? It can't go in memory for scalability reasons. At first glance it might seem to go in with the original data, but what if that data changes? We can re-infer from the data, but how do we know which statements should no longer be inferred if the data they were based on is no longer there?
The first solution would seem to be to reify all inferred statements, and statements used for inferences. While this provides great flexibility, and offers efficiency opportunities when minor changes are made, the overhead in time and space would be huge. I suppose it will depend on the size of the inferred data relative to the original data, but if this system is going to work at all, then the inferred data should be significantly larger than the original data.
Wouldn't it be nice to tell an ontology store a couple of axioms and have it tell you everything you ever needed to know about that ontology? :-) I'd ask it how to program this stuff....
The second solution is to put all inferred statements into a numerically adjacent model. Queries which restrict searches to a particular model could be trivially changed to look in an adjacent model as well and the code would take no longer to execute. This means that we would be able to consider the two models as one for most operations, but the differentiator would still be there when needed. Unfortunately, without reified statements the odds are good that the entire model would need to be dropped and rebuilt after transactions.
It might be possible to obtain a list of all statements inferred in any way from an original statement, and to use this list when that statement is changed or deleted. This would reduce work on an ontology. However, once the size of a transaction gets beyond a particular point there will be a lot of duplication of effort, and using the completed set of original data to create the whole set of inferred statements again from scratch will be quicker. Maybe we should do both, and find an approximate switchover point where we go from one technique to the other.
Posted by Paula at Thursday, May 20, 2004 1 comments
Wednesday, May 19, 2004
Length
According to Google Blog, blog posts should be short. Well sorry everyone, that ain't gunna happen.
To reiterate for anyone who didn't pick up on it, this is a place for me to journal what I've been doing, and to organise my own thoughts about what I plan on doing, both at home and at work. If I happen to say something interesting for others then that's great (hey, I can use an ego boost as much as the next guy), but I'm really just here to help organise my own life.
Transitive Queries
I did my first transitive queries today! Unfortunately anchored queries have a bug in them that I'm in the process of tracking down (it's building as I type). Other than anchoring (which is simply an optimisation anyway), it all works as expected.
There was no one around this morning when I made the decision of whether or not I should return all statements matching a predicate, or only the inferred ones. I opted to only return the inferred statements (a little less work), and of course that was the wrong thing to do. I realised it this afternoon when a query I executed correctly inferred a statement that already existed. There were people around by this point, and everyone agreed that existing statements should be returned at the same time. However, I'm pleased I did it this way, as it let me see exactly what is being inferred.
Hmmm, now that it's just about going I'd better write some unit tests for this query. If AN asks, then just tell him that they were written before I started the coding. ;-)
Posted by Paula at Wednesday, May 19, 2004 0 comments
Tuesday, May 18, 2004
Errors
In the words of Homer Simpson, "Doh!"
Some late night errors in the algorithm I described yesterday (for a start, that one wasn't tail recursive). Out of embarrassment I went back and changed it rather than providing an addendum. It's probably still wrong, but it's closer. :-)
For the benefit of AN: I have a lot more unit tests for JRDF. I've tried to stick to interfaces mostly, to make it adaptable to other JRDF implementations. I've had to use some JRDFmem-specific code, but perhaps I can break that out into a subclass.
Posted by Paula at Tuesday, May 18, 2004 0 comments
Inferring Transitive Predicates
Now have everything happening down to the resolution point. Spent some time learning the various interfaces available in LocalQuery for building "Tuples" collections.
Reducing (or constraining) a result is typically done by creating a Given constraint object, and combining it with the data to be reduced by way of a ConstraintConjunction object. These objects can then be resolved with a LocalQuery. This would be inefficient in this case, as the code where this has to happen is already inside LocalQuery, and there is a lot of cached state available for building tuples which can't be passed into a new LocalQuery. There are no methods available inside LocalQuery to shortcut this process, and reasonably significant changes would need to be made to do a local ConstraintExpression resolution like this.
Instead I've gone on to using the static TuplesOperations.join() method. This is at a much lower level, and seems to do what is needed here. At least, SR assures me that it will work.
A very rough outline of the algorithm I'm using is:
- If anchored, find all statements matching the anchor, otherwise find all statements matching the predicate.
- Save the resulting tuples as the initial result.
- Initialise an empty set of tuples for the inferred results.
- For each tuple in the initial result:
- Get the current subject, and go forward to find all objects for this subject adding them to a given clause.
- Call into inference function, passing the current subject, the initial result, the given, and the inferred results.
Inference function
- Join the given against initial result.
- For each object in this join result:
- add current subject parameter and object from join as tuples in the inferred results
- If object == current-subject parameter then continue to next object.
- If object exists in set of visited nodes for the current-subject parameter then continue to next object.
- Put object into a new given tuples, and in the set of visited nodes.
- If the new given is empty then return.
- Call back into this inference function, passing the original current subject, the initial result, the new given, and the inferred results.
Since the inference function uses tail recursion I should probably go back and change it into an iteration (i.e. a loop) rather than recursing. While I think of it, I should check if Java 1.5 is introducing tail recursion, but I don't think so.
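As a rough illustration of the iterative version, here is the same walk written over a simple in-memory map of subject to objects, rather than the real Tuples machinery (all names here are made up):

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch: compute the transitive closure of a single predicate iteratively,
// using an explicit work queue and a per-subject visited set instead of
// tail recursion. "edges" maps each subject to its directly related objects.
final class TransitiveClosure {
    static Map<Long, Set<Long>> close(Map<Long, Set<Long>> edges) {
        Map<Long, Set<Long>> inferred = new HashMap<>();
        for (Long start : edges.keySet()) {
            Set<Long> visited = new HashSet<>();
            Deque<Long> queue = new ArrayDeque<>(edges.get(start));
            while (!queue.isEmpty()) {
                Long node = queue.pop();
                if (!node.equals(start)) {
                    // record the inferred (start, predicate, node) statement
                    inferred.computeIfAbsent(start, k -> new HashSet<>()).add(node);
                }
                if (!visited.add(node)) continue;   // already expanded: skip
                for (Long next : edges.getOrDefault(node, Set.of())) {
                    queue.push(next);
                }
            }
        }
        return inferred;
    }
}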
Back Chaining
Back chaining from anchors which specify the object rather than the subject is also being supported in the code described above. This is because the "subjects" and "objects" I refer to above are actually just indexes into a tuples (typically indexes 0 and 1: predicates are fixed and are therefore not in the results). Instead of using the numbers 0 and 1, I used indexes into a two element array, with the values {0,1}. Back constraints need to swap access to subjects and objects, so this is a simple matter of changing the array to {1,0}. After that all "subjects" are actually objects, and vice versa. I left the names as "subject" and "object" as it is often easier to think with fewer levels of abstraction.
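Purely for illustration, the trick looks like this:

// Walking "forward" or "backward" through two-column rows by indirecting
// through a two-element column map.
final class ColumnOrder {
    static final int[] FORWARD  = {0, 1};   // column 0 is the subject, column 1 the object
    static final int[] BACKWARD = {1, 0};   // swapped: "subjects" are really objects

    static long subjectOf(long[] row, int[] order) { return row[order[0]]; }
    static long objectOf(long[] row, int[] order)  { return row[order[1]]; }
}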
JRDF
Late last night I got back to a bit of debugging of this. This was after leaving it for a week. I'm thinking that I should also provide serialisation on the Graph, so it looks like I still have some non-test coding to do.
Posted by Paula at Tuesday, May 18, 2004 0 comments
Monday, May 17, 2004
More Transitive Queries
Anyone reading this must be sick of me being in the grammar and query code by now. Put it down to distractions from other parts of the system I get involved with - not that I do any coding there, but I end up in discussions about designs.
Did most of my coding in the LocalQuery class today. That's where all ConstraintExpression objects get resolved, so I'm finally into the nitty gritty of it. After looking at the way the code is structured I decided that I was needlessly missing out on sharing some code by having TransitiveConstraint extend ConstraintExpression rather than Constraint. So I changed it. It wasn't all that bad, as all of the Constraint behaviour that I don't want is being overridden.
I've realised that anchored transitive statements need a separate set of code to non-anchored statements. That's a little annoying, as it means I end up doing it twice. Unfortunately the complexities in trying to sift out the common code will make it unreadable. OTOH, I'll keep my mind open, and if I get to the end of it and it looks like I was wrong I'll try and pull them back in together again. What I've done so far doesn't look like it though.
I was under the impression that I was going to be using a lot of find() method calls, but it seems that using Constraints will be as low in the APIs as I need to go. That lets me leverage the merge code already used for constraints. I'm still in the process of building up "where clause" constraint objects based on answers from previous constraint resolutions. I expect to be able to use this method to iteratively resolve the inferred statements.
I was left wondering if the query should only return the inferred statements or the original statements as well. To be "strictly" correct I should probably be including the original statements, but I think that only providing the inferred statements is more useful. I'll be asking for opinions in the morning.
Simultaneous Writes
An interesting question came up about simultaneous writes, and I spent some time discussing it with DM.
Now it's very rare that any system really does simultaneous writes. If a computer has only one CPU or one hard drive then at some point it has to be serialised. It is feasible to avoid it if you have a multithreaded database with RAID storage, but even that doesn't guarantee parallel writing, as there are always locks involved somewhere. The issue is where the serialisation occurs. The lower down, the better. This is because an operating system can do things like read-write reordering, and there is little point in trying to duplicate this functionality at a higher level. Probably the best way to guarantee non-serialised writing is to use a cluster, but that was getting away from the question we were posed. We just need to consider writing from multiple clients at once.
Normal table-based databases often allow table, page, and row level locking. Table locking is pretty useless these days and does not allow for scaling. Page level locks are pretty good, but the best performance in scalable systems normally comes from row level locking.
Kowari and TKS don't have tables, but it got me wondering what the parallels are, and if we can apply them. (The short answer is no, though we do have solutions).
When I was discussing Kowari indexes in the last couple of days I mentioned that the graph was stored as a set of 6 indexes. These indexes form our "tables". Unfortunately, changing a single node in a statement means that statement must be moved in 5 indexes. It can't simply be "locked" and modified in position (except in one of the indexes).
As an aside, this touches on where the Kowari/TKS dual phase commit happens. The database relies on all the indexes holding the same data. The dual phase commit lets us update all 6 indexes in the first phase of the commit and to confirm this in the second phase. If one of the indexes fails to update in the first phase, then the data in all indexes gets rolled back. The second phase is atomic on the file system, so it either happens or it doesn't. This is done internally, so there is no API for it, but it is certainly happening. We couldn't guarantee integrity of the data if it didn't work this way.
Back to considering locking in Kowari: it could theoretically be possible to lock the source and destination nodes in each index's AVL tree when performing a move of a statement. It would be very messy if a source node were nearly empty and needed removal, or if the destination were full and needed splitting. Splitting nodes (i.e. adding a new node) and removal in AVL trees isn't usually so bad, but they can have repercussions all the way to the root of the tree! (This reminds me of a little rant I have about AVL tree rebalancing. See below.)
Unfortunately, locking nodes in a Kowari/TKS write phase could be very problematic, particularly if we were trying to share the phase. Kowari and TKS use a phase-tree to handle transactions. This fits in perfectly with the AVL tree structure. As soon as a node is to be modified, that node and all parent nodes above it get copied. This means that we can make as many modifications as we want without disturbing the original data. Committing a phase simply means adopting the root of that phase's tree as the new root of the index (this is an atomic, and near-instantaneous operation). It also means that we can have numerous transactions which all have their own unique view of the data.
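Here is a toy sketch of the path-copying idea, using a plain (unbalanced) binary search tree in memory rather than the real on-disk AVL trees:

import java.util.concurrent.atomic.AtomicReference;

// Toy sketch of path-copying ("phase tree") insertion: modifying a node copies
// it and every ancestor, so the old root still sees the original data, and a
// commit is just an atomic swap of the published root.
final class PhaseTree {
    static final class Node {
        final long value;
        final Node left, right;
        Node(long value, Node left, Node right) {
            this.value = value; this.left = left; this.right = right;
        }
    }

    private final AtomicReference<Node> root = new AtomicReference<>();

    // Build a new tree sharing all unmodified nodes with the old one.
    static Node insert(Node node, long value) {
        if (node == null) return new Node(value, null, null);
        if (value < node.value) return new Node(node.value, insert(node.left, value), node.right);
        if (value > node.value) return new Node(node.value, node.left, insert(node.right, value));
        return node;                        // already present: nothing copied
    }

    Node currentPhase() { return root.get(); }

    // "Commit": adopt the phase's root as the new root of the index.
    void commit(Node newRoot) { root.set(newRoot); }
}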
We only want to be writing to the most recent phase tree, as only one tree's root can be made the "new" root. DM has the clever idea of giving each writing thread its own tree, and keeping a transaction log of all modifications. Then the first transaction phase to be committed gets to make the root of its tree the new root. After the first transaction is committed, all other phases will be dropped, but their transaction logs can then be replayed against the newly committed phase. Naturally, these transactions will be rolled back if there is a conflict during playback.
While this works, it does involve the overhead of the second (and subsequent) committed phase having to write statements to its own tree, and then abandoning them to write them to the newly committed tree. I believe that some of this can be eliminated if all executing transactions are informed of a commit, so they can be played against the new phase. This means that any conflicting transaction can be aborted early. More importantly, it will also mean that each transaction can get a new phase tree before the partial playback. Consequently, whichever transaction is the next to be committed will simply get to make its phase tree the new root, meaning that all of its new modifications get kept, and don't require playback. DM suggested a further modification which only updates phases like this after a minimum time period, to prevent too much overhead accumulating on long running transactions from numerous short transactions.
Next comes the node pool. At this point it is a single point of contention, though there is not much overhead associated with it. Still, it will eventually need attention.
The simplest idea is to have a node pool manager that operates like our current node pool. Any transaction which needs to allocate nodes can ask the manager for its own private node pool. The manager can allocate a slab of node numbers (in groups of, say, 1000) and can also provide a list of nodes from the free list that it holds. Like the current node pool, these temporary node pools will allocate from their free lists first, and then hand out nodes from an incrementing counter. Once a temporary node pool is exhausted, the transaction can request a new one. If a transaction is committed or rolled back then all the unused nodes will go onto the node pool manager's free list.
This idea will cycle many more nodes through the free lists, but they should also get used up pretty quickly, so the free list won't grow without bounds.
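In outline it would look something like this (block size and all names are arbitrary):

import java.util.ArrayDeque;
import java.util.Deque;

// Outline of the node pool manager idea: a transaction borrows a temporary
// pool holding a block of fresh node IDs plus some recycled ones, and returns
// anything unused when it commits or rolls back.
final class NodePoolManager {
    private static final long BLOCK_SIZE = 1000;

    private long nextNode = 1;                          // global incrementing counter
    private final Deque<Long> freeList = new ArrayDeque<>();

    synchronized TempNodePool newTempPool() {
        Deque<Long> recycled = new ArrayDeque<>();
        // hand out up to a block's worth of recycled nodes first...
        while (!freeList.isEmpty() && recycled.size() < BLOCK_SIZE) {
            recycled.add(freeList.pop());
        }
        // ...plus a contiguous block of brand new node numbers
        long start = nextNode;
        nextNode += BLOCK_SIZE;
        return new TempNodePool(this, recycled, start, nextNode);
    }

    synchronized void returnUnused(Deque<Long> unused) {
        freeList.addAll(unused);
    }
}

final class TempNodePool {
    private final NodePoolManager manager;
    private final Deque<Long> recycled;
    private long next;
    private final long limit;

    TempNodePool(NodePoolManager manager, Deque<Long> recycled, long next, long limit) {
        this.manager = manager; this.recycled = recycled; this.next = next; this.limit = limit;
    }

    // No locking needed here: the pool is private to one transaction.
    long allocate() {
        if (!recycled.isEmpty()) return recycled.pop();
        if (next < limit) return next++;
        throw new IllegalStateException("exhausted: ask the manager for a new pool");
    }

    // On commit or rollback, anything unused goes back to the shared free list.
    void release() {
        Deque<Long> unused = new ArrayDeque<>(recycled);
        while (next < limit) unused.add(next++);
        manager.returnUnused(unused);
    }
}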
I also considered the idea of using interleaved node pools. This involves a fixed set of n node pools, each of which hands out nodes which are all equal mod n. Then instead of asking a single node pool for the next available node, a transaction can ask one of n node pools. Unfortunately, while this works for up to several times n threads, it won't scale for large numbers of writing threads. However, it might be worth considering just because there may never be that many writing threads, and it does provide some extra options. For a start, there need be no centralized free list. To determine which free list a deleted node should be returned to, a thread need merely take that node number mod n. It also means that starting new transactions would not incur any extra overhead from creating a temporary node pool.
I only mention this as an idea, because I really feel that the scaling problems rule it out.
Rant About AVL Deletions
By the way... has anyone ever noticed that all these sites out there that discuss rebalancing AVL trees seem to always get it wrong when deleting nodes? Everyone knows how to rebalance after inserting, but I've read a lot of sites which discuss this, and every single one of them got removal wrong. Many of these were course notes from university subjects! Oh, they MOSTLY work, but throw a heap of random data at them, and start adding and removing nodes at random, and after about your millionth modification of the tree you'll suddenly discover that it is out of balance.
It was a couple of years ago now, but it took DM and me days to find a problem we had with removals. That was the last time I trusted a university lecturer's algorithm. ;-) Surprisingly though, we did a search and found lots of implementations, and they were all subtly wrong. We had to work it out for ourselves.
Hmmmm, it's been a while now, but maybe I should write up a web page on how to remove nodes from AVL trees. OTOH, maybe everyone realised the error of their ways (or maybe they just downloaded Kowari) and now has it right, but I somehow doubt it.
Posted by Paula at Monday, May 17, 2004 0 comments
Friday, May 14, 2004
A couple of important internal meetings today, so not a lot of coding happened.
I spent time debugging some grammar code, and then I went hunting through the SableCC generated code in order to find the entry point to the query layer. It seemed a little complex in there, though I suppose it made sense to the guy who designed it.
Once the command has been parsed into a grammar tree, an execution method is called on the tree. This in turn keeps calling methods on the individual parts of the tree, but I didn't work out how that gets into AbstractDatabaseSession before I was called into a meeting, and then my coding was all over for the day. I might try and get some of this going over the weekend, but I won't panic if I don't get time.
My first priority over this weekend is a set of calculations for a cousin who is building rowing machines (well, he's Dad's cousin, making him my first cousin, once removed, right?). He's upgrading his range of exercise equipment, which means that the computers have to be modified. Ultimately he'd like to put a USB interface on the rower, but for the moment he just needs a computer that can handle all of the dynamics of the new design. So I'm doing the calculations for him, and then possibly doing the USB design in the near future. He needs figures for Monday so I'll probably be busy with them for a few hours on Saturday at least.
Posted by Paula at Friday, May 14, 2004 0 comments
Thursday, May 13, 2004
Addendum
I noticed someone referring to the MySQL indexes I described last Monday. I thought I'd better make an extra comment on this.
These indexes allow triples (or quads in this case) to be easily found given any combination of subject, predicate, object and model. However, the indexes don't come for free! For a start, having this number of indexes will seriously slow down loading data. Administrators of large databases normally drop indexes to do insertions, and then add them again afterwards for just this reason. Unfortunately, adding indexes like this to a large amount of data will take a long time as well (just not as long as adding to indexed tables).
Also important is that these indexes all take up space themselves. Significant space. Disk is cheap, so this may not be an issue, but it is definitely something to keep in mind.
Kowari approaches things a little differently. It stores data in 6 different orders, corresponding to:
- Subject, Predicate, Object, Model
- Predicate, Object, Subject, Model
- Object, Subject, Predicate, Model
- Model, Subject, Predicate, Object
- Model, Predicate, Object, Subject
- Model, Object, Subject, Predicate
So Kowari is able to re-use its indexes for different types of queries while MySQL needs new ones for each different type. For instance, the Kowari index of (Subject, Predicate, Object, Model) fulfills the requirements of the following MySQL indexes:
- Subject
- Subject,Predicate
- Subject,Predicate,Object
- Subject,Predicate,Object,Model
When we were back at using just 3 elements, there were 6 possible indexes available. To search for any combination of subject, predicate and object could be done with either of 2 mutually exclusive sets of 3 indexes. We chose:
- Subject, Predicate, Object
- Predicate, Object, Subject
- Object, Subject, Predicate
When we moved to a fourth element (the model), we went through all the combinations available for the indexes (all 24 of them: 4! = 24), and it turned out that there are several sets of indexes which can be used to search for any combination of elements. This time the sets each had 6 indexes in them. We chose the set that most closely resembled our current indexes (notice that they match the 3-element indexes, with the model on the end, and again with the model on the front). Not only did that set of indexes make more intuitive sense for us (some of the other sets had the elements in a haphazard order) but it allowed us to reuse a lot of code without modification.
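To make the reuse concrete, here is an illustrative sketch of index selection: given which of subject, predicate, object and model are bound in a query pattern, pick the ordering whose prefix covers exactly the bound positions. The orderings are the six listed above; the selection code itself is mine, not Kowari's.

import java.util.EnumSet;
import java.util.List;
import java.util.Set;

enum Col { SUBJECT, PREDICATE, OBJECT, MODEL }

// Each index is usable when the bound columns of a pattern form a prefix of
// its column order.
final class IndexChooser {
    static final List<List<Col>> INDEXES = List.of(
        List.of(Col.SUBJECT, Col.PREDICATE, Col.OBJECT, Col.MODEL),
        List.of(Col.PREDICATE, Col.OBJECT, Col.SUBJECT, Col.MODEL),
        List.of(Col.OBJECT, Col.SUBJECT, Col.PREDICATE, Col.MODEL),
        List.of(Col.MODEL, Col.SUBJECT, Col.PREDICATE, Col.OBJECT),
        List.of(Col.MODEL, Col.PREDICATE, Col.OBJECT, Col.SUBJECT),
        List.of(Col.MODEL, Col.OBJECT, Col.SUBJECT, Col.PREDICATE));

    // Return an index whose leading columns are exactly the bound columns.
    static List<Col> choose(Set<Col> bound) {
        if (bound.isEmpty()) return INDEXES.get(0);     // nothing bound: any index will do
        for (List<Col> index : INDEXES) {
            if (EnumSet.copyOf(index.subList(0, bound.size())).equals(bound)) return index;
        }
        throw new AssertionError("the six orderings cover every combination");
    }
}

// e.g. choose(EnumSet.of(Col.SUBJECT, Col.PREDICATE)) returns the S,P,O,M index.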
The cost of all these indexes is drive space, although it is much less than a relational database would use. It also means that data must be inserted into 6 different indexes, and it must be ordered as it goes in. This is the main reason why bulk insertions are limited in speed, though recently DM has done a lot to improve this. (One possible improvement would be to load the data unordered, and then order it for each of the indexes in the background, much as I mentioned in previous blogs for the hashtable based indexes. This is the equivalent of an SQL database dropping its indexes for an insert and adding the indexes afterwards.) The purpose of the hashtable based indexes that I've discussed in the last few days has been to try and speed up these load times. It would also speed up queries to O(1), as they are currently O(log n) due to the binary searching of the index.
DM has noted on several occasions that usage patterns mean we only tend to use 4 out of our 6 indexes. We could chop 30% off our storage requirements and loading time if we were to remove them, but if we ever need them again, then searching would be painfully slow. If we write a new open source version of a triplestore, and Tucana chooses to adopt it for Kowari/TKS, then this would not be an issue anyway.
Posted by Paula at Thursday, May 13, 2004 0 comments
iTQL
It's amazing what 6 hours sleep will do. And to think, 4 months ago I thought I needed nearly 8 to get a good night's sleep. :-)
I refactored my code from yesterday. I replaced the triples with constraints, as I said I would. The main difference in structure is that a constraint has a model and possibly a NOT statement. Model quantifiers are not legal for these constraints, as we only want to iterate through the transitive predicates within the models in the from clause, and further modifiers would not be of much use.
Since the grammar is looking for a constraint with an optional model, and an optional NOT statement, I had to include code to test for the model or NOT modifier and throw a QueryException if either is found. Similarly, I was able to test for appropriate placement of variables. The TransitiveConstraint object from yesterday then had to change quite a bit to handle the encapsulated constraints. I was able to wrap the TransitiveConstraint object around the constraint for the transitive predicate in order to meet most of the ConstraintExpression interface. This refactoring involved changing much more code than I expected, but it's done now.
Now that I have these constraints available I can use them directly when they get down to the query layer. I can allow the query to go ahead as usual with just the transitive predicate constraint, and it will find each instance for me. I can then use the anchor (if it exists) to start the run through these results, and find my starting position. Otherwise I can just iterate over the whole result set to look for the transitive statements I'm after.
I'm disappointed that it took so long to get out of the grammar code, but it should all go smoothly from here.
Second Design Revisited
Now that I'm more awake than yesterday I can recall the details of this design much better. :-)
To recap: There are 7 index files. Each file is a hashmap which takes one of the 7 possible combinations of Subject, Predicate and Object as its keys. Each value from the hashmap is an index into a data file.
The 3 indexes based on the "single element" hash functions (Subject, Predicate or Object hashes) point to an entry in their data files. These entries contain 3 items: a next pointer; a previous pointer; an index into the next data file. These entries form a circular doubly-linked list that covers every possible value for their key (so the "subject" data file holds a linked list with elements for each possible subject). A small "Root" file with 3 indexes in it holds the heads of these linked lists. These heads are needed for finding a starting point for each of the linked lists.
The index into the next data file is for the "pair-index" data file which is based on the pair of elements that starts with the element used in the first index. Going with the example of the "subject" index, its data file holds indexes into the "subject-predicate" data file. The entries found in this file also form a circular linked list, only this time there are several lists in the file. Each list represents subject-predicate pairs with a fixed subject. So if there are n subjects in a graph, there will be n linked lists in the subject-predicate data file. Similarly, entries in the predicate data file point to lists in the predicate-object data file, and entries in the object data file point to lists in the object-subject data file. Lists in the predicate-object file are for fixed predicates, and lists in the object-subject file are for fixed objects.
Sticking to the example of the subject-predicate pair, if only the subject is known then the hash is done on the subject to find an entry in the "subject" data file. Then the entry from the "subject" data file will find the list of subject-predicate pairs in the "subject-predicate" data file. Alternatively, if both the subject and predicate are known, then a hash can be done on the 2 of them together, and the subject-predicate index hashmap can be looked up. This will find the subject-predicate entry immediately in the subject-predicate data file (the fact that this entry is in a linked list is of no interest if the entry was found using both data elements, as there is no need to find pairs with a different predicate).
The entries in the index-pair data files point to entries in the "Triple" data file. The entries in this file hold more information than in the previous data files. For a start, they hold the entire triple. They also hold "next" and "prev" pointers like the other data files, but in this case they hold 6 such pointers - a pair for the next/prev subject, the next/prev object, and a pair for the next/prev predicate. So each "triple" entry participates in three linked lists, with each list varying by only one element (lists may be only 1 element long).
The data files which represent the single element and dual element data don't actually carry the value of the elements in the node. This is because the value can be looked up in the triple data file. So if a subject were looked up and it led to a list of subject-predicates, then for any of these entries it is possible to look through to the triple file and see what the subject-predicate of that triple is (the object could be anything, depending on where in the linked list the subject-predicate index pointed to). This does lead to an extra disk read in order to know what data we have, but in almost every case we need to get to a triple before we can start using the data anyway.
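Written out as plain Java classes (the real files would be fixed-width binary records, and all the field names here are just mine), the three kinds of entry look roughly like this:

class SingleEntry {              // one per distinct subject (or predicate, or object)
  long next, prev;               // circular doubly-linked list over all values of this type
  long pairListHead;             // index into the pair data file (e.g. subject-predicate)
}

class PairEntry {                // one per distinct pair, e.g. (subject,predicate)
  long next, prev;               // circular list of pairs sharing the same first element
  long tripleListHead;           // index into the triple data file
}

class TripleEntry {              // one per stored triple
  long subject, predicate, object;
  long nextSubject, prevSubject;     // list where only the subject varies (fixed predicate,object)
  long nextPredicate, prevPredicate; // list where only the predicate varies (fixed object,subject)
  long nextObject, prevObject;       // list where only the object varies (fixed subject,predicate)
}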
Operations are performed as follows:
Searching
- Take the known nodes and hash.
- Look up the appropriate index for these nodes, and go to the correct place in the associated data file.
- Return an iterator which knows about the searched-for nodes, and has the index into the correct data file.
Iterating for a single known node means indirecting through the first data file into the list in the second data file. For each element of this list, indirect through to the list in the triple data file. Then iterate through this list via the appropriate "next" pointer.
Iterating for a pair of known nodes is the same, only now the iterator starts with an index to a list in the second data file, and indirects through these into the lists in the triple file.
Iterators for all three nodes known are trivial, as they can find the triple directly (with the hashmap index for the triple file), and don't have to iterate at all.
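As an illustration only, here is how a lookup on a known subject would walk those structures, pretending the files from the sketch above had been loaded into arrays in memory:

import java.util.ArrayList;
import java.util.List;

class SecondDesignSearch {
  // Assumes the subject hashmap lookup has already produced subjectEntryIdx.
  static List<long[]> triplesForSubject(SingleEntry[] subjectFile, PairEntry[] pairFile,
                                        TripleEntry[] tripleFile, int subjectEntryIdx) {
    List<long[]> result = new ArrayList<long[]>();
    long pair = subjectFile[subjectEntryIdx].pairListHead;   // head of this subject's (S,P) list
    long firstPair = pair;
    do {
      long t = pairFile[(int) pair].tripleListHead;          // triples sharing this subject-predicate
      long firstTriple = t;
      do {
        TripleEntry e = tripleFile[(int) t];
        result.add(new long[] { e.subject, e.predicate, e.object });
        t = e.nextObject;                                    // only the object varies in this list
      } while (t != firstTriple);                            // the lists are circular
      pair = pairFile[(int) pair].next;                      // move to the next predicate for this subject
    } while (pair != firstPair);
    return result;
  }
}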
Inserting
- Find a free entry in the triple file (using a free list - described for the first design), and put the triple there. Remember where it went.
- Hash the full triple, and put the triple location in the data file into the correct position in the full triple hash table.
- For each of the paired-indexes:
- Hash the pair from the triple to test if the entry already exists in the appropriate index file. If so, then look up the location in the paired-index data file for the triple list based on that pair. Modify next/prev pointers of the found triple (and the one before it) and the newly stored triple to insert the new triple in the list. The insert is now over for that index-pair, and the single index work (below) can be skipped.
- The pair was not in the hashtable at this point, so find a free entry in the paired-index data file and point it to the element in the triple file. Point the triple's appropriate next and prev pointers back at the triple, so it forms a list of length 1.
- For each of the single-indexes:
- Hash the single node value from the triple to test if the entry already exists in the appropriate index file. If so, then look up the location in the single-index data file for the list of pairs based on that node. Modify next/prev pointers of the pair found at the head of this list (and the one before it) and the newly stored pair entry to insert this new entry in the list. The insert is now over for that single node.
- The single node was not in the hashtable at this point, so find a free entry in the single-index data file and point it to the element in the index-pair data file. Point this element's appropriate next and prev pointers back at itself, so it forms a list of length 1.
- Go to the root file and find the head of the list in the data file. Adjust next/prev pointers to insert the new entry in the single-index data file.
Deleting
- Find the triple with the triple index and remove it from the hashmap
- Remove the triple in the triple data file from its three linked lists (by modifying next/prev pointers for it and for the next/prev triples). Add the triple's index to the free list for the file.
- Look up all 3 of the pair-indexes for this triple. If any of them pointed to the removed triple entry then update to the next triple in the list.
- If a triple had no next triple for any of its indexes, then perform the following steps on those indexes, otherwise the deletion is over.
- Go back to the pair-index hashmap and remove the entry. Go to the data file entry and remove it from its linked list. Add the entry to the free list for the file.
- Look up the single-index hashmap for this pair. If it pointed to the removed pair-index entry then update to the next pair-index entry in the list.
- If the pair-index data entry had no next entry then perform the following steps on the associated single-index hashmap and data file, otherwise the deletion for this index is over.
- Go back to the single-index hashmap and remove the entry. Go to the data file entry and remove it from its linked list. Add the entry to the free list for the file.
- Look up the Root for this node type. If it pointed to the single node that was just deleted then point it to the next one instead. If there is no next one, then set the root to -1.
Note that most of the time, the deletion code above will exit after the fourth step, making most deletions very fast.
Discussion on Second Design
This design is more circuitous than the first design, but it saves the space of two pointers per triple when back pointers are used in the linked lists.
Using less space with back pointers is specifically supposed to provide faster deletions, which it does. Most deletions require 4 reads, and 3 writes (ignoring the free list operation which is another read and write each). However, as I said yesterday, faster deletions are not really essential, except when deleting models. So maybe the complexities of this design don't make the minor space saving worth it.
Lookups are trivial, with only one disk read, but iterators are more complex than the first design.
Insertions are not too bad, with a full database taking 7 reads and 4 writes (ignoring the free list, which should be a single read and write, though the read could be cached). A database without much data takes up to another 24 writes and 12 reads (that was a quick count, but I think I got them all, excluding the 6 read/writes on the free lists). The figures can't be accurate due to hashing collisions, available data in the free list, and the order that data is inserted and removed. The worst case figures can also be ignored, as the database would quickly reduce the number of IO operations towards the minimum as more data was inserted.
Like the first design, no ordering is used here, but can be added into the linked lists in an identical way to the first design.
Deletions in the first design are quite slow as they need to traverse linked lists, unless we choose to nearly double the size of the data file with back pointers on the lists. However, we normally only see deletions on entire models, and if we can do that by simply dropping the directory we might be fine.
Insertions and iterating over datasets are easier in the first design, and I think they would be faster - particularly the insertions.
Overall I think I prefer the first design. We'll see which direction DM wants to go in. I think it will depend on what features we want to have fast, and what we're prepared to compromise on.
Dates
It's actually 12:45am, but I'm posting the time as 11:59pm so I can remember that this is all for the 13th, and not the 14th. I did the same for yesterday.
Posted by Paula at Thursday, May 13, 2004 0 comments
Wednesday, May 12, 2004
Luc
I love the little boy, but last night I got about 2 hours of sleep. I'll probably come back to this tomorrow and wonder what state of mind I was in when I wrote it. :-)
I know there will be spelling and grammar problems here tonight. I'll come back and fix them later.
iTQL Transitive Queries
Transitive constraints didn't quite go as planned today. I've successfully extracted the triples used in a transitive constraint, but where I thought I could simply extract the data I need and pass it down to the query layer, I ran into some problems.
The grammar has defined that the contents of a trans(...) element is a triple, optionally followed by another triple. Note that this specifically avoids the Constraint defined in the grammar, because these allow an optional model to be specified, which we don't want here. So all I need to do is take the elements found in the triple, and store them in a class that meets the ConstraintExpression interface so it can be passed down to the query layer.
I've created the appropriate Constraint class called TransitiveConstraint, based on Constraint (both extend ConstraintExpression). Constraint uses 4 ConstraintElements (three parts of a triple and a model) in the constructor. TransitiveConstraint needs the compulsory triple which describes the transitive predicate, and the optional triple to describe the anchor. So the plan was to use the 3 parts of a triple in the same manner that Constraint uses its four parts, and to provide the anchor as a complete Constraint object.
The SableCC classes provide triples in an implementation class called ATriple. This contains all the information needed for the TransitiveConstraint. Unfortunately, neither the triple nor its elements can be passed down to the query layers as this layer is not supposed to know anything about the SableCC grammar classes, so the elements of the ATriple class have to be converted to a usable form. This is done with an existing function called toNode(). This method converts from a PTripleElement to a org.jrdf.graph.Node, which is exactly what is required. After conversion, all the nodes can be passed to TransitiveConstraint and we can get on with the query code.
This is where I came unstuck. When the toNode method encounters a variable element it converts it to a BlankNode. A temporary HashMap is used to map variable names to BlankNodes so they will be re-used when the same name is encountered, but this mapping is lost as soon as the complete conversion is done, and there is no way to keep it around.
I don't want a BlankNode when I convert a variable to a org.jrdf.graph.Node. Even if I try to use one I lose the name, which makes it useless for me. Normal Constraint objects must get around this somehow, but I couldn't see how. After some time I asked SR and AN what they knew about it. SR didn't know, and after spending half an hour with him AN suddenly realised that I was looking at a problem that he is currently working on himself. He assured me that it won't be long before he will have some solution for keeping all information about variables around.
AN commented that he found it strange that both ML and I would run into exactly the same issue within a day of each other, given that our work is completely unrelated. To date I haven't had occasion to discuss a single thing with him. That might be about to change!
I decided to move my efforts into the query layer until AN has finished what he's doing. If I finish there first then I'll change the iTQL grammar to accept a full constraint inside trans(...), and I'll use the code in ItqlInterpreter to enforce the extra conditions. After considering this for a while I realised that this will have the advantage of letting me use the normal constraint code when I get down into the query layer, so it's starting to seem like a good idea. I can't easily do this while using a triple instead of a constraint, as the constraint objects are specifically written to be built from constraint grammar objects, and not triple grammar objects. The one thing that I don't like about it is that the grammar file won't be a true representation of the real grammar.
Maybe it will look clearer in the morning after a decent night's sleep.
Triplestore Disk Data Structure
DM and I discussed another option for storing triples on disk. It's worthwhile writing our options down so I have something to refer back to. I mentioned the first pattern last Saturday (May 08, 2004), but I'll try and flesh it out a little.
Both patterns are based on hash tables. We've done disk based hash tables before, so we already have code for that. The hashing functions may take a little work, but they shouldn't be too hard. We'll try and come up with something reasonable, and then we can keep metrics to see how many collisions we're getting, so it should be possible to test a variety of functions to get the best performance.
First Design
The first design uses 7 hashmaps as indexes. These maps are completely disk based, and when small enough would definitely be candidates for memory mapping. When they get too full they are increased in size, and rehashed. This is where C has the advantage over Java, as an existing memory map of a file can be unmapped and mapped again to the larger file. Java doesn't allow this to be done explicitly. The best we have been able to come up with is to close the file and set the reference to the MappedByteBuffer to null, and then loop on the garbage collector until it picks up the buffer and unmaps it. We can tell when this has happened if we keep a WeakReference to the buffer and test it after each garbage collection. This all works, but it's painful, and has major performance implications, particularly under Windows.
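For the record, the workaround looks roughly like this (class and file names invented for the example):

import java.io.RandomAccessFile;
import java.lang.ref.WeakReference;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class RemappableHashFile {
  private RandomAccessFile file;
  private MappedByteBuffer buffer;   // the only strong reference we keep to the mapping

  public void remapTo(long newSize) throws Exception {
    WeakReference<MappedByteBuffer> ref = new WeakReference<MappedByteBuffer>(buffer);
    buffer = null;                   // drop our strong reference to the mapping
    file.close();                    // closing does not unmap; that only happens on collection

    while (ref.get() != null) {      // loop until the collector has actually reclaimed it
      System.gc();                   // only a request, hence the loop
      System.runFinalization();
      Thread.sleep(50);
    }

    // Only now is it safe to reopen and map the larger file.
    file = new RandomAccessFile("index.hashmap", "rw");
    buffer = file.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, newSize);
  }
}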
The hashmaps take hashes of the following parts of triples as keys:
- (Object)
- (Object,Subject)
- (Predicate)
- (Predicate,Object)
- (Subject)
- (Subject,Predicate)
- (Subject,Predicate,Object)
These indexes let us look up triples based on any combination of Subject/Predicate/Object that we may have. All we need is a decent hashing function that takes the appropriate parts of the triple.
Each bucket in the hashmaps stores the complete key (to detect collisions) and an index into a large data file.
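The hashing itself only needs to combine one, two or three 64 bit node IDs into a key, something along these lines (the mixing constants are arbitrary placeholders; the real functions would be picked by measuring collision rates as described above):

class TripleHashing {
  static long hash1(long a)                 { return mix(a); }
  static long hash2(long a, long b)         { return mix(mix(a) ^ (b * 0x9E3779B97F4A7C15L)); }
  static long hash3(long a, long b, long c) { return mix(hash2(a, b) ^ (c * 0xC2B2AE3D27D4EB4FL)); }

  // A simple 64 bit finaliser (xor-shift and multiply) to spread the bits around.
  static long mix(long x) {
    x ^= x >>> 33; x *= 0xFF51AFD7ED558CCDL;
    x ^= x >>> 33; x *= 0xC4CEB9FE1A85EC53L;
    x ^= x >>> 33;
    return x;
  }

  // Map a hash to a bucket in an on-disk table of the given size.
  static int bucket(long hash, int tableSize) {
    return (int) ((hash & Long.MAX_VALUE) % tableSize);
  }
}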
The first design uses just one large data file. Files can now be addressed with 64 bit addresses in all important operating systems (Linux, OS X, Windows, Solaris), so this file can be as large as we like. Indeed, under Linux we could even be accessing the raw device, though I doubt we'd gain much from doing that.
The elements of the data file are all identical. They hold a complete triple, and 6 indexes which point to other elements in the file. These indexes allow each triple to be a part of 6 distinct linked lists. The first linked list contains all the triples which match on Object. The second list matches on Object and Subject. This goes on as per all of the indexes above, with the exception of (Subject,Predicate,Object), since there is only ever one triple which contains that set of values.
A separate file is kept to list all unused elements in the data file, with the exception of the block of consecutive unused elements at the end of the file. This file is called the "Free list". A single index is also stored indicating the start of the unused space at the end of the data file.
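Laid out as byte offsets, a data file element would look something like this (the ordering of the pointers is my own choice; only the sizes matter):

class TripleRecord {
  static final int SUBJECT     = 0;    // 3 x 64 bit node IDs
  static final int PREDICATE   = 8;
  static final int OBJECT      = 16;
  static final int NEXT_O      = 24;   // next triple with the same Object
  static final int NEXT_OS     = 32;   // next triple with the same Object,Subject
  static final int NEXT_P      = 40;   // next triple with the same Predicate
  static final int NEXT_PO     = 48;   // next triple with the same Predicate,Object
  static final int NEXT_S      = 56;   // next triple with the same Subject
  static final int NEXT_SP     = 64;   // next triple with the same Subject,Predicate
  static final int RECORD_SIZE = 72;   // doubly linking would add 6 "prev" pointers: 120 bytes
}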
This design leads to the following operations:
Searching
- Take the known nodes, and hash them with the function appropriate for that number of nodes.
- Go to the appropriate hashmap, and look up the index into the data file.
- Return an iterator which has that index.
Iterating just means going to the data file to return the triple found at that index, and then setting the index to the linked list pointer appropriate for the data being traversed. This continues until the end of the list is reached.
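A sketch of what such an iterator might look like over the element layout above (hypothetical code, not the real thing; -1 marks the end of a list):

import java.nio.ByteBuffer;

final class TripleListIterator {
  private static final int RECORD_SIZE = 72;
  private final ByteBuffer data;          // the (memory mapped) data file
  private final int nextPointerOffset;    // which of the six "next" pointers to follow
  private long index;                     // current element, as returned by the hashmap lookup

  TripleListIterator(ByteBuffer data, long startIndex, int nextPointerOffset) {
    this.data = data;
    this.index = startIndex;
    this.nextPointerOffset = nextPointerOffset;
  }

  boolean hasNext() { return index != -1; }

  long[] next() {
    int base = (int) (index * RECORD_SIZE);
    long[] triple = { data.getLong(base), data.getLong(base + 8), data.getLong(base + 16) };
    index = data.getLong(base + nextPointerOffset);   // step along the list for this index
    return triple;
  }
}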
Inserting
Starting with the full (Subject,Predicate,Object) index:
- Take the nodes and hash them.
- Go to the (Subject,Predicate,Object) hash table and insert an entry if it is not already there. If it was there then there is no need to continue as the insertion is not valid.
- Allocate a new element in the data file. This is done by checking if the free list has any elements listed. If not, then use the pointer to the beginning of the space at the end of the file, and increment that pointer to the next unused element.
- Write the triple to the data file in the allocated element.
- Remember this index for the subsequent indexes.
For the subsequent indexes the procedure is similar:
- Take the appropriate nodes and hash them.
- Go to the appropriate hash table and see if there is already an entry for those nodes. Insert the entry if it is not already there; otherwise remember the existing index into the block file and then set it to the index of the element that was allocated above.
- Go to the remembered data file element. If an entry was inserted in the last step, then there is no linked list yet, so set the appropriate pointer to -1 (or to the current element for a circularly linked list). If there was a pre-existing entry then set the appropriate pointer to this entry so that this forms the new head of the list.
Strictly speaking, the above steps are slightly out of order. The data element that forms the new head of a linked list should point to the old head before the hashfile is changed to point to the new element. This is done to maintain file integrity, though if this is important then flushes are needed at the correct time to prevent out-of-order writing to disk.
Deleting
For each index:
- Take the appropriate nodes and hash them.
- Go to the correct hashmap index and find the index to the data file.
- Iterate along the linked list in the data file until the correct triple is found.
- Change the pointers appropriate to the index around this element in order to remove it. If the triple was the first element in the list, then change the hashmap to point to the new head.
- If the list had only the one triple in it, then remove the entry from the hashmap.
- The first time through, remember the index to the triple being removed.
Once the triple is removed from all the indexes and linked lists, then add its index to the free list.
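To show where the work goes, here is a sketch of unlinking a triple from one index's singly linked list, using the same element layout as above (illustrative only). It returns the new head of the list so the caller can update, or remove, the hashmap entry:

import java.nio.ByteBuffer;

class TripleListUnlinker {
  private static final int RECORD_SIZE = 72;

  // Walks the list from 'head', remembering the previous element so it can be re-pointed.
  static long removeFromList(ByteBuffer data, long head, int nextOffset, long s, long p, long o) {
    long prev = -1;
    long current = head;
    while (current != -1) {
      int base = (int) (current * RECORD_SIZE);
      long next = data.getLong(base + nextOffset);
      boolean match = data.getLong(base) == s
                   && data.getLong(base + 8) == p
                   && data.getLong(base + 16) == o;
      if (match) {
        if (prev == -1) return next;                                  // triple was the head of the list
        data.putLong((int) (prev * RECORD_SIZE) + nextOffset, next);  // unlink from mid-list
        return head;                                                  // head is unchanged
      }
      prev = current;
      current = next;
    }
    return head;   // not found; shouldn't happen for a valid delete
  }
}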
Discussion on First Design
Insertions are likely to be very fast with this scheme. Very little reading and writing occurs for each index, and the block file requires a single write with no reading (or two writes and a read if the linked lists are to be made circular).
Finds are very fast with only 2 disk reads. I have no problems with any overhead involved with iterating, as the calling code is often more complex. If it became an issue then a background thread could even be used to read ahead on the list.
As you can see, deleting takes a little effort, as it requires traversal of all 6 linked lists. This could be prevented completely if the lists were changed to be doubly linked, but then the size of the data file would jump from 72 bytes per element to 120 bytes (all numbers are 64 bit. This is needed to make the node IDs scalable, and to allow the indexes to refer to files of any size). This is a significant increase in size and at this stage we don't believe we need it.
Slow deletions can be a problem in systems where entire groups of statements are to be deleted. For instance, in Kowari/TKS we usually see this as the dropping of an entire model. There seems to be little call to drop groups of statements by other criteria.
Fortunately this isn't a problem here, as we plan on using the file system to our advantage, letting us keep different models in different directories. That means that dropping a whole model is just a matter of dropping the directory.
To digress for a moment, in the past I've noticed a reluctance in proprietary software to use the filesystem's structure to store data for a program. In the rare case where multiple directories are used, they often have obscure names, and the program stores information internally about what goes into which directory. For instance, iTunes (and the iPod) splits songs up into a collection of obscurely named directories. This allows it to find files relatively quickly by using a hash. People have worked the algorithm out, but it was non-trivial.
Programmers don't want users playing around in the program's internal data structures, and for good reason. Making things obscure discourages this. Otherwise, if I had a collection of data called "Foo" and I wanted to change its name to "Bar", then if I see a directory called "Foo" I might think I could change it, unaware that the program needs to change some internal data as well. I'd deserve what I get, but people still do this, and programmers seem to think they should be preventing it.
Maybe programmers of proprietary software also think they are protecting some important secrets. I don't know. But this code is destined to be open sourced, so secrets are no issue. It's also likely to only be run by those who know what they're doing, so I have no problem in using a directory name to tell the program what model is held in it. Besides, it means that anyone changing a directory name or deleting a directory will see what they expected to see. ;-)
Back to the design...
The main problem with this design is that the triples are not ordered. In the normal run of things this is irrelevant. The JRDF interface certainly does not need it. However, any efficient joins on the returned data will require sorting. One of the strengths of Kowari/TKS that makes it so scalable is that all of its data is sorted. Unfortunately this leads to loading RDF being slower. This design will load data faster because it doesn't need sorting at all, but scalable joins are still going to need sorted data.
Sorting can be done in 3 different ways:
- At load time by ordering the linked lists. This will slow loading down significantly, but it should still be reasonably fast in comparison to Kowari/TKS.
- At query time. This makes a certain amount of sense, as joined data often needs to be re-sorted anyway. If sorting is ever done, then it should be cached to reduce duplication of effort.
- In a background thread. This isn't as fast as doing it at load time, but it lets loading be a lot faster. Stability becomes more important and harder if this data is to be modified or queried while it is being sorted.
The second and third options are the most likely at this point. I'd like to see what the speed difference is when compared against the first option, and I may end up doing tests to compare these.
Second Design
It's getting late, so I'll be terse here...
This is a similar design to the first, only now there are 7 data files. Each hashmap index refers only to its own data file. With the exception of the data file belonging to the (Subject,Predicate,Object) index, each data file contains only the first data item from the nodes used to form the index. ie. The data file for the (Subject,Predicate) index only has subject nodes in its elements. The data file for the (Object,Subject) index only has object nodes in its elements.
The data file elements form a single linked list with other elements in the same index. They also hold an index into the next data file. This is done in the following pattern:
Subject -> Subject,Predicate -> Full Triples
Object -> Object,Subject -> Full Triples
Predicate -> Predicate,Object -> Full Triples
Searching for a Subject gives a pointer to the first data file which tells us all subsequent subjects (via the linked list) and the start of the linked list in the Subject,Predicate index which matches our query. Each element in the Subject,Predicate index points to the next Predicate that we need (we already know the Subject), and also points to the head of a list in the FullTriples data file. The list in the FullTriples file will entirely match the Subject,Predicate which pointed to it.
All indexes end up pointing to the FullTriples file. The elements in this file contain the full triple, and 3 pointers for finding the triple with the next Subject, the next Predicate, or the next Object. The previous indexes used to get us here provide the rest of the needed data, so this is fine. It also involves less overhead for making the list doubly linked, allowing for faster deletions.
This design would have a slower loading speed, but not dramatically. Reading would be slightly faster. The real speed benefits come from deletions.
I'm getting vague on the tradeoffs of this design, though I had them in mind 8 hours ago. I'll write more on this later when I've had some sleep.
String Pools
You'll note that there has been no mention of string pools. Well that is not essential yet, but it will certainly be needed before this system can look at RDF. To start with we may borrow the StringPool structure from Kowari, but I'm keen to give a trie structure a go, as a fully indexed string pool would do wonders for us. I also want practice with trie structures, as I believe it would be possible to make them phasable, making them candidates for insertion into Kowari.
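For the record, the sort of structure I have in mind is the classic character trie, sketched here purely in memory (a disk based version for the string pool would store children as file offsets, which is what would make it phasable):

import java.util.HashMap;
import java.util.Map;

class StringPoolTrie {
  private static class Node {
    final Map<Character, Node> children = new HashMap<Character, Node>();
    long nodeId = -1;                          // -1 means no string ends at this node
  }

  private final Node root = new Node();

  void put(String s, long nodeId) {
    Node n = root;
    for (int i = 0; i < s.length(); i++) {
      Character c = Character.valueOf(s.charAt(i));
      Node child = n.children.get(c);
      if (child == null) { child = new Node(); n.children.put(c, child); }
      n = child;
    }
    n.nodeId = nodeId;
  }

  long get(String s) {                         // returns -1 if the string isn't in the pool
    Node n = root;
    for (int i = 0; i < s.length(); i++) {
      n = n.children.get(Character.valueOf(s.charAt(i)));
      if (n == null) return -1;
    }
    return n.nodeId;
  }
}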
Posted by Paula at Wednesday, May 12, 2004 0 comments
Tuesday, May 11, 2004
Blogspot
It's getting on towards midnight, and I sat down a little while ago intending to code. Instead I'm surfing. Darn. At least I've now discovered the problem that I and others have occasionally seen with Blogspot. From the "Known Issues" page:
Blog*Spot blogs occasionally render a massive amount of gibberish in Mozilla (due to a complex bug involving gzip). The current workaround is to force-reload the page.
Says it all really.
JRDF Mem
Well I should really write some unit tests for this JRDF in memory library. I finished the coding last night, but went straight to bed before testing, and we all know that testing/debugging is the time consuming part. I'd better get it done for the simple reason that I want to start implementing this other project with DM. At least JRDF is a pretty simple interface, so it won't be hard.
One hassle I have with what I've done is that I've kept all the classes apart. However, the iterators are just screaming out to be made inner classes of the GraphImpl class. I've resisted it so I can write JUnit tests for them. I also like to keep classes as small as I can to aid readability and maintainability. I've worked with code that has made too much use of inner classes in the past, and the resulting confusion was not pretty.
It's not really a big deal. After all, an inner class is just an external class with different visibility and a built-in reference to the enclosing class. I've just had to give package scope to things that might have been private, and I have to pass a GraphImpl reference in the constructors of the iterators.
If I get it working soon I'll give it to AN and see if he wants to put it up on the JRDF site at Sourceforge.
Posted by Paula at Tuesday, May 11, 2004 0 comments
iTQL Grammar
After a slow start today the grammar is coming along nicely. I've finally worked out what SableCC is doing with the grammar file. I now have code that successfully sees transitive predicates and an optional anchor for chaining.
Each expression in the grammar file gets turned into a class. The ITQLInterpreter class has methods which take a full expression and hand it off to the parser to convert it into a tree of objects, the types of which were defined by SableCC from the grammar. The code then passes this tree into methods which break it up and recursively call into methods which break it up further. These methods check the type of the object they were given (there will normally only be two or three possibilities) with instanceof tests and then extract sub-parts out of these objects with accessor methods that SableCC provided in the classes. Each sub-part is then sent off to another function which is designed specifically for its type, to be broken up in the same way. The methods at the bottom of this recursive stack build primitive ConstraintExpression objects and return them.
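A made-up fragment in the same style shows the pattern (none of these node classes are the real generated ones; they just stand in for what SableCC produces):

ConstraintExpression buildConstraintExpression(PConstraintExpression node) {
  if (node instanceof AAndConstraintExpression) {
    AAndConstraintExpression and = (AAndConstraintExpression) node;   // conjunction alternative
    return new ConstraintConjunction(
        buildConstraintExpression(and.getLhs()),
        buildConstraintExpression(and.getRhs()));
  }
  if (node instanceof AOrConstraintExpression) {
    AOrConstraintExpression or = (AOrConstraintExpression) node;      // disjunction alternative
    return new ConstraintDisjunction(
        buildConstraintExpression(or.getLhs()),
        buildConstraintExpression(or.getRhs()));
  }
  // Bottom of the recursion: a simple constraint becomes a primitive Constraint object.
  return buildConstraint((AConstraint) node);
}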
I'm in the process of building a new ConstraintExpression class to take the transitive data and return it up the call stack. So long as I meet that interface it should go down to the lower query layers with no problems. Once it's there I can start the code that builds the chain.
If I'd stuck to it with no other distractions I could have had it done by today, but I'm pretty sure I can get through it for tomorrow.
Distributed Queries
TA forwarded this message from the RDF interest group mailing list. We've been looking at putting scalable distributed queries into the system, and I think he's interested in making sure we are aware of what others have been doing.
TKS/Kowari had a distributed query system that worked, but sub-optimally. It went through a phase where it wasn't supported but I'm pretty sure it's working fine at the moment. However, it has never been scalable, and we need to do that.
The type of distributed query described in the email above is what we have normally referred to as "load balancing". We don't support this and it's not yet on our roadmap, but DM and I have been keen to do it for a few years, and it may be time to consider it again. In the meantime we might give it a go in the open source code which we're designing. It wouldn't be in the first release of the code, but we could design it in so we can do it afterwards.
The Kowari/TKS idea of distributed queries allows data from differing data sources to be correlated together. While this works now, it's done by asking models on each server for the relevant data, shipping it all back to the server that the client is connected to, and doing the joins there. This is sub-optimal, as it means moving far more data over the network than is necessary.
I looked at this a couple of years ago, but other priorities came up at the time. It looks like I have to pick it up again.
A contrived example of the kind of query we want to perform might be to ask for the licence plates of all people who have travelled to the middle east recently, and who have had conversations with a known individual. The conversation data is stored by the NSA, the travel info is stored by the FBI, and the license plates are with the DMV. We can assume that the NSA and FBI are allowed to see each other's data (see? It's really contrived!), and the DMV's data. But the DMV can't see anything from the two agencies. Note that security like this is a TKS feature, and is not available in Kowari. (We have to sell something, right?) If anyone wanted to then I guess it could be added to Kowari, since it is an Open Source project, but I doubt that will happen.
My 2-year-old plan for this went something like:
1. When we get a query, break it down into its constituent join operations on constraints, and work out which operations apply to which models. This creates subqueries for us which can be expressed as valid iTQL.
2. Any queries to be performed locally can be executed locally. If this forms the full query then return the data to whomever asked for it. Otherwise this value is kept to provide the numbers for step 3, and cached for a future call back to this machine during step 5.
3. Send out constraints to all servers holding the models in the query, and request the count. Constraints have an optional model, which can restrict which servers it will get sent to. eg. we only need to send a constraint like <$car isOwnedBy $person> to a model on the DMV server. This can be done as a variation on iTQL sub-query where the select is changed to a "count".
4. Get back the count of all sub-queries from each server. Use this to make a decision of which server should do the joins. This is based on the following:
a. The server with the highest count is the nominally preferred server for joins.
b. If the winning server (with the highest count) has low bandwidth AND the combined results for the join from all the other servers have a larger count than that server's single result, then choose another server to do the join. This rule needs to be weighted depending on the bandwidth, and maybe the processing power, of the server.
c. If the winning server does not have permission to see another server's data, then go to the next most eligible server. e.g. The DMV server may not see data from the NSA servers, but the NSA and FBI are both allowed to see that data.
Part (b) is based on the idea of keeping the traffic moving on or off a slow machine to a minimum.
5. If the current machine is the winning server then send out all subqueries that aren't for models on this server to the other servers, and join the results when they are returned. If another machine is the winning server then send it the full query. That server will start at step 1.
This plan means that being a "remote server" and simply answering local queries is essentially the same operation. Step 4 keeps bandwidth usage to a minimum.
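As a sketch only, the decision in step 4 might look something like this (all the names and the weighting are invented, the real thing needs proper security metadata behind rule c, and the fall-back to the local server is my own addition):

import java.util.Map;
import java.util.Set;

class JoinServerChooser {
  static class ServerInfo {
    boolean lowBandwidth;
    Set<String> readable;                       // servers whose data this one may see
    boolean canReadAllOf(Set<String> servers) { return readable.containsAll(servers); }
  }

  static String choose(Map<String, Long> countsByServer, Map<String, ServerInfo> info, String localServer) {
    long total = 0;
    for (long c : countsByServer.values()) total += c;

    String best = localServer;                  // fall back to joining locally
    long bestCount = -1;
    for (Map.Entry<String, Long> e : countsByServer.entrySet()) {
      String server = e.getKey();
      long own = e.getValue();
      ServerInfo s = info.get(server);

      // (c) the joining server must be allowed to see every other participant's data
      if (!s.canReadAllOf(countsByServer.keySet())) continue;

      // (b) a slow link makes shipping everyone else's results to this server a bad trade
      if (s.lowBandwidth && (total - own) > own) continue;

      // (a) otherwise prefer the server holding the most matching results
      if (own > bestCount) { best = server; bestCount = own; }
    }
    return best;
  }
}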
Kowari has the potential to do all of this, sans the security work.
In case it isn't clear that distributed queries for Kowari/TKS don't do what the email was talking about, I should point out the comment about selecting the servers to perform the query. It should be clear from the plan above that knowledge about which stores we are using is essential, and the query doesn't make a lot of sense if we don't know. So having a process to choose which server to ask is not an available option.
If we are load balancing, then choosing the server to query makes sense. Hopefully this will eventually make it onto our roadmap.
Note that distributed queries (as I defined them above) and load balanced queries are not mutually exclusive. Indeed, we always envisaged that the two would work together.
Blogging
I started this blog as a way of keeping notes. I record what I'm doing at work, and I also record some other things I find of interest. It started as a .plan file, but blogging is de rigueur these days, so I transferred it exactly 2 weeks ago. I thought it might be handy for others at work (including AN who I'm working closely with at the moment).
It also crossed my mind that maybe there would be others interested in what was happening inside an open source project like Kowari. I didn't think that would be much of a concern, but I decided to refer to my colleagues by their initials, to protect their modesty if nothing else. So now you know why I refer to AN, DM, AM, and so on. :-) (AN wants to know who RMI is)
As for the non-open source parts of my work (for instance, on TKS), then I figured it would just generate interest, and not be giving away "trade secrets". After all, even companies like Microsoft have internal developers blogging about their work on the OS, and MS claim that security on the OS source code base is highly important.
Now AN has had a hit tracker on his blog for some time, and in the past he has had interesting things to say about it, so I decided to give it a go too. Anyone visiting since yesterday should have seen the icon near the bottom of this page. It mostly accumulates statistics, but it does show the last 20 hits. Out of curiosity I looked at it after 24 hours, and I have to admit, for a 2 week old blog I was surprised that anyone had looked at it at all! So, ummm, hi everyone. :-)
Posted by Paula at Tuesday, May 11, 2004 0 comments