Friday, July 30, 2004

Compression
I still didn't get to testing the N3 import code today.

TJ has been concerned about the time taken to return large amounts of data over RMI. He has wanted to compress this data for some time, and so he started doing so today.

When the code first failed for him I came in to have a look, and discovered that he hadn't been looking at the right file. I offered a few points of advice, but then ended up staying for a bit longer. A bit longer went on until mid-afternoon when TJ had a conference call, at which point I continued on trying to do compression.

Using some standard examples, TJ was trying to compress all communication with an RMI object by making sure that the object was an extension of UnicastRemoteObject, and then he provided his own socket factories in the constructor. This worked well, so long as he didn't try to wrap the data streams with a compression object.

To start with, TJ had tried to use ZipInputStream and ZipOutputStream, but these are designed for file compression, and the decompression call immediately complained about the item number. This was resolved by moving to an appropriate compression utility such as java.util.zip.DeflaterOutputStream or java.util.zip.InflaterInputStream.
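For reference, the socket factory approach follows the standard RMI pattern, which looks roughly like this sketch. The class names and the remote interface here are placeholders rather than the real Kowari/TKS code; the point is just that every connection to the remote object gets its streams wrapped in the deflater/inflater pair.

  import java.io.IOException;
  import java.io.InputStream;
  import java.io.OutputStream;
  import java.io.Serializable;
  import java.net.ServerSocket;
  import java.net.Socket;
  import java.rmi.Remote;
  import java.rmi.RemoteException;
  import java.rmi.server.RMIClientSocketFactory;
  import java.rmi.server.RMIServerSocketFactory;
  import java.rmi.server.UnicastRemoteObject;
  import java.util.zip.DeflaterOutputStream;
  import java.util.zip.InflaterInputStream;

  // A socket whose streams are wrapped with compression.
  class CompressedSocket extends Socket {
    private InputStream in;
    private OutputStream out;
    CompressedSocket() {}
    CompressedSocket(String host, int port) throws IOException { super(host, port); }
    public synchronized InputStream getInputStream() throws IOException {
      if (in == null) in = new InflaterInputStream(super.getInputStream());
      return in;
    }
    public synchronized OutputStream getOutputStream() throws IOException {
      if (out == null) out = new DeflaterOutputStream(super.getOutputStream());
      return out;
    }
  }

  // Client-side factory: serialized and shipped to clients by RMI.
  class CompressedClientFactory implements RMIClientSocketFactory, Serializable {
    public Socket createSocket(String host, int port) throws IOException {
      return new CompressedSocket(host, port);
    }
  }

  // Server-side factory: accepted connections also use the compressing socket.
  class CompressedServerFactory implements RMIServerSocketFactory {
    public ServerSocket createServerSocket(int port) throws IOException {
      return new ServerSocket(port) {
        public Socket accept() throws IOException {
          Socket socket = new CompressedSocket();
          implAccept(socket);
          return socket;
        }
      };
    }
  }

  // Stand-in remote interface; the real remote classes are different.
  interface PageSource extends Remote {}

  class PageSourceImpl extends UnicastRemoteObject implements PageSource {
    PageSourceImpl() throws RemoteException {
      // port 0 = any free port; the factories make every connection a compressed one
      super(0, new CompressedClientFactory(), new CompressedServerFactory());
    }
  }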

As soon as the correct compression streams were in place, it would all come to a halt. This was happening because the server was not sending back its response. I realised that this was because of buffering on the compression stream, and that it just needed to be flushed to make sure the data was sent through. However, RMI normally works across a socket which doesn't need to be flushed, so it never bothers to flush.

Externalizing
After thinking about it, I decided that there was really only one time that data needed to be compressed, and that was in the return of AnswerPageImpl objects across the network. If these could be serialized to a compressed form then most of the network traffic would be reduced. So I moved onto implementing readObject and writeObject. TJ pointed out that profiling had already shown significant time being spent in reflection during serialization, so I pulled back and went on to readExternal and writeExternal in the Externalizable interface instead. This is because a lot of metadata about the class is avoided when using Externalization instead of Serialization.

Now that I was writing to the compression stream myself, I was able to flush the data stream, but again the server would just sit there and not respond to the client. I talked with DM about this, and tried a few other things, before it occurred to me to close the stream instead of flushing it. This worked correctly. So now I know that flushing a DeflaterOutputStream does nothing. It won't be prepared to send data until it is closed.
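The pattern that ended up working is roughly the following sketch. The page class and its contents are hypothetical stand-ins for AnswerPageImpl; the detail that matters is that the DeflaterOutputStream is closed, not merely flushed, before the compressed bytes are handed over.

  import java.io.ByteArrayInputStream;
  import java.io.ByteArrayOutputStream;
  import java.io.Externalizable;
  import java.io.IOException;
  import java.io.ObjectInput;
  import java.io.ObjectInputStream;
  import java.io.ObjectOutput;
  import java.io.ObjectOutputStream;
  import java.util.zip.DeflaterOutputStream;
  import java.util.zip.InflaterInputStream;

  // Hypothetical page class; the real AnswerPageImpl fields are not shown here.
  public class CompressedPage implements Externalizable {
    private Object[] rows = new Object[0];

    public CompressedPage() {}  // required by Externalizable

    public void writeExternal(ObjectOutput out) throws IOException {
      ByteArrayOutputStream zData = new ByteArrayOutputStream();
      ObjectOutputStream objectData =
          new ObjectOutputStream(new DeflaterOutputStream(zData));
      objectData.writeObject(rows);
      // close(), not flush(): the deflater only emits its final data on close
      objectData.close();
      byte[] bytes = zData.toByteArray();
      out.writeInt(bytes.length);
      out.write(bytes);
    }

    public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException {
      byte[] bytes = new byte[in.readInt()];
      in.readFully(bytes);
      ObjectInputStream objectData = new ObjectInputStream(
          new InflaterInputStream(new ByteArrayInputStream(bytes)));
      rows = (Object[]) objectData.readObject();
      objectData.close();
    }
  }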

Compression Performance
Once compression on the returned data was working correctly TJ and I could start timing the responses. TJ loaded WordNet into the database, and performed a query which returned all the data to the client. The queries went over a 100Mb network, through 2 switches.

With compression we were able to return the data in about 1 minute 40 seconds (1:40), with a compression ratio of about 80%.

Without compression we retrieved the same data in 1:13. Something was obviously wrong here. The compression/decompression was being run on 2.8GHz P4 processors, and they should not have even blinked at this kind of operation.

This led to me checking all sorts of things, ranging from compression ratios and times of execution, through to buffer sizes and compression levels.

Double Buffering
I had initially created an object writing stream with the following code:

  ByteArrayOutputStream data = new ByteArrayOutputStream();
  ObjectOutputStream objectData = new ObjectOutputStream(new DeflaterOutputStream(data));
  // write to objectData
I could then use data.toByteArray() to get the resulting bytes. However, this wouldn't tell me how much compression we were achieving. So I changed it to look like this:
  ByteArrayOutputStream data = new ByteArrayOutputStream();
  ObjectOutputStream objectData = new ObjectOutputStream(data);
  ... // write to objectData
  objectData.close();  // make sure everything written reaches the byte array
  ByteArrayOutputStream zData = new ByteArrayOutputStream();
  DeflaterOutputStream deflater = new DeflaterOutputStream(zData);
  deflater.write(data.toByteArray());  // compress the raw bytes
  deflater.close();
The result of this was that data.toByteArray() could provide the raw, uncompressed data, while zData.toByteArray() provided the compressed data.
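With both arrays available, the compression ratio for the metrics is just a couple of lines, something like:

  byte[] raw = data.toByteArray();
  byte[] compressed = zData.toByteArray();
  // e.g. a ratio of ~80% means the compressed form is only ~20% of the original size
  double ratio = 1.0 - ((double) compressed.length / raw.length);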

While the presence of both sets of bytes allowed me to obtain more information about the compression rates we'd achieved, I assumed that there was some cost, since an extra byte buffer was being unnecessarily created. I assumed incorrectly. When I compared the times of both methods, the code with the two ByteArrayOutputStream objects was consistently 3 seconds faster (over the ~1:40 time period). Having both objects available is handy, especially for metrics, so I was more than happy to get this result.

Why did the use of a second buffer speed things up? I can only guess that writing the entire data stream in one hit was faster than compressing it by parts, as the first method does.

Bad Compression
Out of frustration with the longer times taken to return compressed data, I started playing with the compression level and buffer size.

Looking at the Deflater code, I discovered that the default buffer size is 512. I was not able to get any real performance change out of varying this size.

The other thing to try was the compression level. Neither the documentation nor the provided source shows what the default level is. So I did some comparisons with compression set to 1 (minimal) and 0 (no compression). The results here surprised me.
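Both knobs can be set explicitly through the DeflaterOutputStream constructor that takes a Deflater and a buffer size. Continuing with the same java.util.zip classes, the comparisons amounted to something like:

  // level 0 is Deflater.NO_COMPRESSION, 1 is Deflater.BEST_SPEED, 9 is Deflater.BEST_COMPRESSION
  int level = Deflater.BEST_SPEED;
  int bufferSize = 512;                   // the same value as the default
  ByteArrayOutputStream zData = new ByteArrayOutputStream();
  DeflaterOutputStream deflater =
      new DeflaterOutputStream(zData, new Deflater(level), bufferSize);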

Minimal compression still gave me a compression ratio of about 80%. This could mean that the default level is 1, or close to it.

Even more surprising, a compression level of 0 took almost exactly the same length of time to execute as a compression level of 1, i.e. ~1:40. Zero compression should have taken almost no time, and yet the code was still over 20 seconds slower than not compressing at all. So there is significant overhead simply in calling the compression library, while the actual compression work was not taking an excessively long time.

So where is the overhead in compression coming from?

The Deflater classes use JNI calls to do the actual compression (even when the compression level is 0). So the problem has to be either in crossing the JNI boundary, or in the compression algorithm itself.

Java relies pretty heavily on JNI, and it is normally quite efficient. I've certainly never seen it behaving slowly in the past. That leads me to think that the compression algorithm is at fault, possibly in scanning the data to be compressed (which would happen even with a compression level of 0). DM pointed out that Sun ignore the zlib libraries on *nix platforms and that they implement these libraries for themselves. Given the performance I've seen here I'm guessing that they have done a terrible job of it.

Compromise
While it offers no benefit yet, compression may still be of value across a slower network. I've introduced a system property for the compression level, and I use this to enable compression when the level is above 0. If the level is undefined, or defined as 0, then the uncompressed byte array is used as the serialized data, compression is not performed at all, and there is no performance penalty.
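The selection logic amounts to something like the following. The property name is illustrative only, and compress() stands in for the deflating code described above:

  // The property name is illustrative; a level of 0 (or no property) means no compression.
  int level = Integer.getInteger("answer.page.compression.level", 0).intValue();
  byte[] wireData;
  if (level > 0) {
    wireData = compress(rawBytes, level);   // deflate with the requested level
  } else {
    wireData = rawBytes;                    // use the uncompressed byte array untouched
  }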

Fortunately, having the uncompressed byte array hanging around was speeding things up, so there was no performance penalty in using it.

For the moment, the system property is necessarily defined at the server side. This is because it is the server which sends the pages back over the wire, and it is at this point that the compression is performed. If compression is to be used seriously in the future then the client should be the one to set the level. This means that client requests for pages should include the desired compression level as a parameter. This is easy, but unnecessary for the moment.

Future Speedups
There are two ways I can speed up returning pages like this.

The first is to use a decent compression algorithm. The ones that come with Java 1.4 are obviously too inefficient to be of use. Most implementations are unlikely to be blindingly fast unless they are done in native code, in which case there is the major problem of portability.

On the other hand, I may be too quick to assume that a pure Java compression would not be as fast. This should be the subject for some future study.

The second option is to improve serialization of the AnswerPageImpl class. Profiling has shown that at least a third of the time taken to return a query is spent performing reflection to serialize this code. Since we know exactly the kind of data that can be found in a page, it would be much more efficient to write only what we need to the data stream. This would save on both the amount of data sent, and the reflection needed to find this data. It is this approach which I'll be taking in the near future.
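As a rough sketch of the shape this could take (the fields here are hypothetical, and the real AnswerPageImpl contents differ), the idea is to write nothing but primitives and strings, so no reflection is involved:

  import java.io.Externalizable;
  import java.io.IOException;
  import java.io.ObjectInput;
  import java.io.ObjectOutput;

  // Hypothetical shape only: the real page data is richer than a grid of strings.
  public class PageData implements Externalizable {
    private String[][] cells = new String[0][0];

    public PageData() {}

    public void writeExternal(ObjectOutput out) throws IOException {
      out.writeInt(cells.length);
      out.writeInt(cells.length == 0 ? 0 : cells[0].length);
      for (int r = 0; r < cells.length; r++) {
        for (int c = 0; c < cells[r].length; c++) {
          out.writeUTF(cells[r][c]);  // primitives and strings only: no reflection needed
        }
      }
    }

    public void readExternal(ObjectInput in) throws IOException {
      int rows = in.readInt();
      int cols = in.readInt();
      cells = new String[rows][cols];
      for (int r = 0; r < rows; r++) {
        for (int c = 0; c < cols; c++) {
          cells[r][c] = in.readUTF();
        }
      }
    }
  }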

Lucene Index Locks
After Wednesday night's blog I was able to find all the build problems and compile everything. The problems were due to using the right import statements, in the wrong files. Well, it was getting late. As I said at the time, I just needed a break in order to find the problem.

I didn't get to Thursday's blog since I didn't get to do much productive work. Before proceeding with testing the N3 loader I wanted to make sure that I had a stable system and hadn't broken anything. So of course, when the Lucene tests started failing I had to start wondering what I'd broken, and how.

My first thought was that I could not have possibly broken the Lucene code, as the N3 loading code was in a completely unrelated area. However, I've been mistaken in the past when I thought that two pieces of code were completely unrelated, so I wasn't prepared to dismiss the possibility of the problem being in my code.

The test failure was caused by Lucene failing to obtain a lock on a file. Perhaps it was some file access I'd done which prevented the lock from being acquired.

Each time I ran the tests it took quite a long period to get through them all. I was able to track the Lucene tests down to the store-test target, so I tried to concentrate on running this test only. I quickly discovered that the Lucene tests would always pass when the tests were restricted to the set of tests in store-test, and they would always fail when run as the full set of tests for Kowari. So now I started to think that it was some test which ran before the Lucene tests which was causing the problem. Perhaps I'd influenced this other test in some way.

To confirm that it was definitely my code which was causing the problem, I got a fresh checkout of Kowari, and tried a full build with all the tests. All the tests passed, further indicating that it was my changes which had caused a problem, though I still couldn't work out how.

I asked advice from AN and TJ, but they couldn't help. TJ was able to tell me that these errors had been seen before, but was unable to say what they were indicative of.

Finally I started copying my files from my normal Kowari checkout to the clean checkout, one file at a time. I did this to see which file could be causing the problem. By the end of the day I had copied over every file, and the fresh checkout could still run all of the tests with no errors. I finally started comparisons between all the files in both checkouts, and there were no differences.

So at the end of the day I had two sets of identical files. One would build and pass its tests every time. The other would build but fail its Lucene test every time. This morning I even used find to run through all the files and do an MD5 comparison, but they came up the same.

All I could do in the end was to abandon the checkout with the failing test. I have no idea why two identical sets of files can behave differently. Time to stick to the working checkout and move on.

Wednesday, July 28, 2004

Nearly There With N3
I was feeling tired and belligerent last night as I wrote my blog. Consequently I wrote that I was going to use a regex to parse N3. This isn't as bad to implement as it sounds, but I still had no real intention of doing it. Going about it in this way would mean re-inventing the wheel. Worse yet, it means that every little corner case of N3 that I hadn't considered was going to be plaguing me for months.

I wrote what I did last night in the hope that someone would say, "You moron! Why aren't you using the XXX parser?" I'd done a quick look for parsers, but the ones I'd found were too heavily tied to their applications and using them was going to be more work than they were worth, so I was hoping to be directed at something more portable. The response I received was from AN (and it was more pleasant than calling me a "moron") when he suggested I use the Jena N3 parser.

Fortunately, the Jena parser is event based, making it easy to hook into it. It only took me half an hour to knock up some quick code that would parse and print an example N3 file (less than that to use the built-in N3EventPrinter class, but I needed to try it for myself). So I ended up thinking that I should be able to make up for the last few days when I feel I haven't really done all that much. But it's now after 10:30pm, and I've only just now finished the main structure of it.

There was a LOT of glue to put this parser together with Kowari. I was able to base a lot of it on the RDF-XML parser that Kowari already has, but there were still significant differences. All up, it took over 700 lines of code.

The biggest hassle was converting from Antlr AST nodes to Literal nodes or URI references. I didn't have all of the Antlr code available to me (I just had the jar, but I could possibly find the source if need be), and I couldn't find any documentation easily. The N3EventPrinter offers some tantalizing hints, but it leaves a lot open to the imagination. Also there is a plethora of Antlr types, and I'm pretty sure that most of them aren't applicable to N3, but I couldn't work out which ones were. There are a few obvious ones, particularly anonymous nodes, literals, and URIs, but beyond that I don't really know.

At this point I'm thinking that the best approach might be to just "suck it and see".

I'm still a way off running though, as the compiler is complaining that it doesn't know about classes I've imported and put into the classpath. Once I got to that point I thought it worth blogging where I am and then taking a break.

Tuesday, July 27, 2004

N3 Parsing
AN pointed me to the class he'd been talking about, which is RDFSyntaxLoader. He'd given me the impression that I'd be finding a class that did a significant portion of the work, but when I saw the name I realised that this was not the case. Instead, it demonstrated the use of an IntFile as a map from anonymous node IDs to internal node IDs, and a StringToLongMap as a map from blank node names to internal nodes. These are important, as they can be disk-backed, which means that the maps are not bound by memory so they can scale.

If I'm only having to load anonymous nodes from our own N3 files then I don't need the name mapping, though the IntFile map is still needed. If I'm loading more general N3 files then I'll need to map by name, but it will also depend on whether I can spot anonymous node formats.

I'm just using a regex to parse the first part of each line. It's easy enough to pick everything out from the < and > characters. Unescaping text is reasonably straightforward as well, particularly as many escaped characters are only legal for literals, meaning that they don't affect subjects and predicates. What I haven't done yet is the last part of the line, which can be either a resource or a literal. It shouldn't be too hard, given that the first and last characters tell you exactly what the type is (a resource or a literal).
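For what it's worth, the kind of pattern I mean looks roughly like this. It's a sketch only, covering just the simple <subject> <predicate> object . form; blank nodes and comments aren't handled here:

  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  class N3LineSketch {
    // Subject and predicate of a simple N-Triples style line: <s> <p> object .
    private static final Pattern LINE =
        Pattern.compile("^\\s*<([^>]*)>\\s+<([^>]*)>\\s+(.*?)\\s*\\.\\s*$");

    static void parse(String line) {
      Matcher m = LINE.matcher(line);
      if (!m.matches()) return;        // blank nodes, comments etc. are not handled here
      String subject = m.group(1);
      String predicate = m.group(2);
      String object = m.group(3);      // starts with '<' for a resource, '"' for a literal
    }
  }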

Syntax and Semantics
I spent a little time responding to suggestions from AN about how we implement a NOT operator in iTQL. There's the semantic issue of negation vs. an inverse, and that is still being argued. SR pointed out that depending on the semantic we choose, the word "not" may not even be appropriate.

There's also the syntactic issue of how to express the construct in iTQL. Should it be a unary operator, preceding constraints in brackets, or should it be a "magical" predicate like <tucana:is>? Of course, different syntactic choices imply different semantics, so this question is not completely divorced from the first. Until semantics are determined then syntax can't be decided either.

You'd think that this would mean the semantics should be settled first, and the syntax decided afterwards, but these things are never quite that straightforward. People often like to conceptualise a semantic in terms of some expression, ie. the syntax. So it can be very difficult to pin down exactly what semantic is to be used without expressing it in some kind of formal language. Instead of introducing something as formal as predicate logic, it makes more sense for most people to think of the semantics in terms of the iTQL which is likely to represent it. So the semantics and syntax end up getting decided in parallel.

AN has lots of suggestions for both, and I spent a little time giving him feedback on it.

Short Entry
I know it's not much, but that's it. I didn't get as much done as I'd have liked today as I had to spend some time with a physiotherapist treating an injury, and it took longer than I anticipated.

Monday, July 26, 2004

N3 Reading
Not a lot to talk about today.

The N3 export is now working, after taking longer than I expected. Part of the effort (and this includes Friday) involved changing some of the exporting code to handle multiple export modules. This was a little further reaching than I thought, since the original code was heavily tied to class implementations without any use of interfaces. Note to self: have a chat about it to RT.

I should mention that RT is still very new at coding, so he's still learning a lot. However, this was my first real opportunity to go through his code, and overall I'm impressed. Better than I was at his age.

After a little more discussion on the command to be used for exporting, AN and TJ confirmed that it should definitely be based on the backup. Whether or not this is the cleanest thing to do, it means that the iTQL language does not need to be changed significantly, which in turn restricts the modifications to the documentation, tests, and training material for TKS. I have to admit that this alone makes it the right way to go.

At the moment the type of export is entirely based on the filename extension. Files are assumed to be for N3 format if they end in ".n3" or ".nt". As a compromise, a future extension will add a modifier to the backup command which allows an override of the file format.

N3 Writing
I'm now partway through N3 importing. I'm expecting to hook in an N3 reader and not have to do a lot more, but I've yet to see if it will be that simple. For a start, the N3 reader was not where I expected to see it, so I'm still hunting for it. More importantly I don't know if it will import blank nodes correctly.

Blank nodes are currently being exported in a format of: <_node##>. Since there is no standard for this in N3 (or am I wrong there?) then I'm happy to leave it like this. I'm just going to have to make sure that they get loaded correctly. If I can't make an importing module read this correctly then it won't be difficult to load the N3 myself. That's one of the big advantages of the format.

Sunday, July 25, 2004

Directions
As I said on Thursday, there are lots of things to do next, and I was trying to work out the best direction to take. AN suggested that exporting and importing N3 would be a good idea, as other people need it as well. This suited me well, as it let me put the decision off for another day.

After creating an "export" command in the iTQL grammar, I learned that the "backup" command already writes either binary backups, or RDF. Since this function is already overloaded, it made sense to back out the "export" command, and just overload the "backup" command a little more. I didn't quite finish, but it will soon write N3 files if the output filename has a .n3 extension.

XQuery
The main reason I didn't get a trivial thing like N3 exporting done was a longish video conversation with SR. We discussed a number of things, including the appropriateness of XQuery for the DAWG.

It seems that XQuery can query RDF data with few technical problems. I had thought that this would be the answer to all potential problems, and that XQuery would therefore be a suitable strawman. However, SR made the important observation that while XQuery can find any required data easily, providing an API based on XQuery is problematic.

The easiest way to describe the problem is to go back to that old comparison with SQL. SQL is very restrictive, partly in the structure for querying, and particularly in the structure of the returned data. However, for programming an API it is this very structure that makes it so useful. Any returned data comes back with elements formatted in rows of typed columns. This allows an API to retrieve any element, typically with statements which iterate over the rows and access the elements by column.

XQuery is much more open, both in its methods of retrieving data and in the format of the returned data. While the flexibility of the query structure is advantageous, the same characteristic is a problem for the returned data. XQuery has the ability to format returned data in almost any manner, from an RDF document through to an Excel spreadsheet. This flexibility is a liability for a potential API, as there is no consistent way to access the returned data. Unlike SQL, it is possible for a user to create a query that returns data in any possible format. Even if a return format is specified, a user can always create a query that doesn't follow it, breaking any API that tries to wrap it.

While a full-featured and flexible query language is useful, one which cannot be effectively wrapped in an API is a liability.

I'm sure SR has more to say on this, and I'm looking forward to seeing how the DAWG approaches his results.

Thursday, July 22, 2004

Comments
For anyone interested in the response I received from Jeff Pollock regarding his submissions to the DAWG, I added a reply to his comment. Comments can get lost down there in the fine print, so I thought I should make a significant reference to it here.

BTW, thanks for responding Jeff. It's always nice when someone pays attention. :-)

Duplicate Variables
AN had a look at the problem of repeating a variable name in a select clause, and spotted a simple way around it. I know it isn't ideal from the perspective of AM, but personally I prefer it as it makes sense to be able to make a query like this. Consequently I was able to change rules 5b and 7b to match the other rules.

In a similar vein, I discovered that count statements also have a problem in select clauses. Count statements are used for "grouping" on variables (as they do in SQL). However, if a count is used without any accompanying variables then a RuntimeException is thrown from the server. Of course, this is appropriately caught and dealt with, but the fact that an exception like this can be thrown is a bug. The only exceptions to be thrown during a query should be thrown from the iTQL interpreter, and no lower. The interpreter should not pass down any kind of construct that is illegal, and if it isn't illegal then the code that executes the query should not throw an exception (except under exceptional circumstances, like a server being unreachable). If anything lower down the stack than the iTQL interpreter throws an exception due to a grammatical construct then it indicates a major problem.

In the meantime I've logged the count problem on Sourceforge.

Tweaking
I tweaked the rules a bit today in order to make them faster. There was a little re-coding, and I re-ordered the tests in the .drl file. It seemed to help, but I'm still not completely happy with it.

Given the way that the system is currently built, the biggest problem is due to all of the rules objects having the same type. Drools needs to test everything in the working memory by label in order to make sure that it has the objects required for a rule. This is really because Drools was designed to work on data which is in its working memory, rather than rules objects which in turn work on the data. It's made me think that it might be a little too heavyweight a framework to provide what we need, but I'm sticking with it for now. After all, it is working.

As I mentioned yesterday, it may be possible to make Drools operate a little more efficiently if each rule is given its own type. However, I don't really like that idea, as it doesn't scale to new rules at all. I'm also thinking that Drools might simply find its variables via instanceof statements, meaning that a label comparison is almost as efficient anyway.

The other change I could make would be to have the rules objects operate on Kowari/TKS at a lower level than iTQL. However, this means that each rule would need to be coded in Java, and not really configurable. So long as efficient iTQL commands are available then I don't see that much would be gained by this approach. For the moment, not all of the rules have efficient iTQL at their disposal, so it's a tempting avenue to take, but I think it should be avoided as long as possible.

Shortcomings
Rule XI is desperately in need of prefix matching to be made available from the Kowari/TKS string pool. This would be relatively easy, given that strings are stored in lexical order, but no one has the time to implement it.

While on the string pool, another problem is a lack of types. This has a major implication for finding anonymous nodes. Since statements are being inserted into the inference graph in bulk, the only way to remove anonymous nodes is to go through after the insert and remove them one at a time. This is the second worst possible solution (the worst would be to filter the inferred statements before inserting them one at a time). Types would let us select all statements containing a "resource" in the appropriate position, skipping the anonymous nodes.

The other immediate problem is tautologies. That is, inferred statements which already exist in the base data. It's a shame that inferred data has to go into a separate model, as redundant insertions into the base model would get silently and efficiently dropped.

Testing and Integration
With RDFS going I'll need to check the tests on the RDF site. Unfortunately I'll need to build a translation layer, as Kowari doesn't yet allow for N3 output (which these tests rely on).

I talked with AN about integrating the inferencing rules with the rest of the system, but he hasn't yet worked out where he'd like them inserted. I know that he doesn't want to make a decision that makes it hard to use the rules for backward chaining, but I think that the choice of Drools is never going to work for that anyway. Hopefully we can discuss that in the morning.

Extending the Rules
RDFS is only the first step on a long road of inferencing. Next will be OWL Lite. I have yet to determine what the rules will look like there, but I know that many of them won't be like the ones for RDFS.

Apparently Jena tried to go with simple rules for OWL, but had to incorporate a lot of Java code to do it instead. I think I'll go through the list of available inferences first, and see which ones map easily to iTQL-based rules. Then I'll see what is needed for the remaining ones (the obvious one that comes to mind is cardinality). These should individually be easy enough to implement, but a generic framework would certainly be better.

Non-Rules Solutions
The whole rules-based structure has not left me feeling impressed with its efficiency. It is possible to make each individual rule more efficient (such as we did with the trans statement), but the overall structure has not received all that much attention. I asked Bob what he knew of alternatives, and he suggested something called tableaux. I know nothing of these yet, but I'm thinking I should check into them shortly.

Wednesday, July 21, 2004

Rules
I finally got all my RDFS rules running today. They all seem to work fine. I'm not sure if I should just move onto OWL Lite rules, or if I should be integrating the code that I have. I'll discuss it with AN tomorrow.

I had a few problems getting the workarounds going today, but I managed to resolve them. The first was due to how Drools performed its tests.

Drools regularly checks if a rule should be run without actually running the rule. This was unexpected, as I thought that it was supposed to efficiently avoid unnecessarily running tests; however, it only seems to avoid running the consequent of the rules. The logged output demonstrated that the tests were performed numerous times with positive results, but without resulting in the consequent being run. This is unfortunate, as the test is a query on the database. Queries are fast, but at the iTQL level there is a lot of overhead involved.

I suspect that the problem was due to the test being tried with the wrong rules, and so the name matching on one rule failed even though all the other tests passed. This is because all rules are of the same type, and they are identified by testing their names. Drools may work faster if every rule is a different type, but that would make for a lot of redundant Java code. Perhaps unique interfaces would be enough to differentiate them for this. I should also move the expensive tests to the bottom of the list, as I believe that the tests are executed in order.

To see when a consequent needed to run, I had been checking to see if the query each rule is based on had changed its count between one test and the next. This approach failed with the tests being called so many times. I addressed this by holding onto the old count until the next insertion was performed for that rule. This seemed to work well.
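In outline, each rule object just remembers the count it saw at its last insertion, something like this sketch (the names are placeholders, not the real classes):

  // Placeholder sketch: remember the row count seen at the last insertion, and only
  // treat the rule as having new data when the current count differs from it.
  class RuleState {
    private long lastInsertedCount = -1;

    // called from the rule's condition, with the count returned by the rule's query
    boolean hasNewData(long currentCount) {
      return currentCount != lastInsertedCount;
    }

    // called from the consequent, once the insertion for this rule has been performed
    void insertionDone(long currentCount) {
      lastInsertedCount = currentCount;
    }
  }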

The other problem was caused by rule XI. There is no way to query for all resources which begin with "rdf:", so I had to get everything and filter my way through the results. This is slow, but since iTQL will shortly be modified to provide appropriate queries for this, it won't stay like this for long, and in the meantime it works. The real problem is that there is no count of statements which match my criteria, and I can't go executing a slow count of these prefixes all the time as it is already quite inefficient.

Since the number of statements which affect rule XI is a subset of the entire model, if there is no change in the number of statements in a model then there will be no need for rule XI to execute. This isn't quite ideal, as every increase in the inferred model means that this number will increase, but since there are only 5 rules which can trigger XI it won't be checked too often.

I just need to make sure that iTQL gets updated soon.

Tuesday, July 20, 2004

XML RPC
I was a little distracted today with the proceedings of the DAWG.

Over the last few days Jeff Pollock has expressed frustration that the DAWG has not made it a formal requirement to commit to a language based on XQuery. Personally, I've been a little surprised at this, given that the group seems quite happy to consider such a proposal, but wants to investigate the ramifications first. The relevant section of the minutes says:
We discussed Proposed XQuery requirement and/or objective without reaching critical mass around any particular wording. ACTION SimonR: write a document discussing tradeoffs with adapting XQuery as an RDF query language for discussion thru the September meeting in Bristol.

The issue seems to be that Jeff insists that a commitment to XQuery as a "requirement or objective" must be made now. Of course, Jeff has his own agenda on this, as does everyone else. That is the point of a committee after all. However, I haven't found his arguments to be persuasive.

Jeff made 4 points for using XQuery as the basis of the query language to be proposed by the DAWG. The first 3 of these are quite valid, but miss the point. In two of them he points out that XQuery is modular, general purpose, and well structured. In the other he points out that XQuery is a W3C specification, and describes the importance of supporting such standards. In other words, he describes many of the strengths of XQuery as a language, with no reference to its applicability to RDF.

However, it was the fourth point that frustrated me. Excerpting from his message:
the output of RDF and OWL (and most likely SWRL) specifications was solidly grounded upon XML inside the SemWeb layer cake

RDF can certainly be represented as XML, but to claim that the RDF specification is "solidly grounded" on XML is incorrect. XML describes a tree structure, while RDF describes a directed graph. Many RDF documents have a simple tree structure and are easily represented in XML, but when an RDF document contains loops it cannot be directly represented in XML. One method to overcome this is to label a branch of the XML with an ID attribute, and to refer to that ID from elsewhere in the document. While this works, it circumvents standard XML structure.

For XQuery to deal with RDF/XML is certainly possible, though not as trivial as one might expect from an XML-based structure. To claim that XQuery is appropriate for RDF and OWL because they are solidly grounded on XML is incorrect.

That said, I have no personal objection to the use of XQuery. I'm looking forward to reading Simon's document on the benefits and problems associated with using it. In the meantime, Jeff should consider his arguments more carefully in future, as his current ones don't carry any weight.

RDFS Rules
My initial idea to work around the problem of duplicated variables described yesterday was to replace the rule objects which use them with a different object type that did the workaround. Once I got to implementing the new class, I realised that I really wanted to make all the rule objects appear the same in the .drl file, so the same tests could be called on them. That meant that both classes should implement a common interface. After considering the operations these classes needed, I finally opted for an abstract base class instead.

The classes are nearly done, but I'm still working on the new insertion code for the workaround class. I also realised that rule XI, which needs namespace string matching, can also be done with an extension of the abstract class. Up until now I'd been avoiding rule XI because we didn't have iTQL that could be used for it, but now that we need to provide workarounds for other missing iTQL functionality, it makes sense to implement this rule as well.

I also spent a little time documenting after-the-fact requirements for the page pre-fetching.

Masters
I had an extended lunch while I went into UQ again. I've been told by both Bob and the nice lady in ITEE postgraduate administration (Kathy) that my proposal will be accepted by the university, but that they are notoriously slow at getting through these things. Kathy explained that she spends a significant portion of every week trying to get a response out of the main university administration on student applications, so I may be waiting a while to get back an official response. In the meantime, the application form from the university says:
I understand that I will be enrolled as a student on the commencement date I specify above, and that I have agreed to start my research project on this date even if I have not received written advice from the University about my admission.

This is a little annoying, as there are a few texts in the Physical Sciences library that I'd like to borrow, but I can't until the application is processed. Having the application accepted would give me a little peace of mind as well. Kathy promised to help. At least I'm more fortunate than overseas applicants, whose visas typically require an acceptance.

I spent most of my time in a discussion with Bob about what I should be doing to start with. At this point he just seems to be happy to help me find my feet while I work out the specific direction I should be going in. He also lent me a few PhD and Masters theses to provide a rough guide of the sorts of things expected eventually.

Probably the most useful remark Bob made was not to let myself get caught up with coding. It usually has little bearing on the thesis, and a student can fool themselves into thinking that they are getting something useful done, when they're really standing still. I still have some ideas which need me to write code, but I'll be careful to keep this warning in mind.

We also spent a little time discussing Bob's current research. He is working with a group out of DSTC along with people from IBM and Sandpiper Software. They are building a translation mapping from one ontology description framework to another. These include UML, OWL, E-R diagrams, and others. It certainly touches on what I'm interested in, although it is in a slightly different direction, particularly given that my emphasis is squarely on OWL. Still, I'm interested in learning a little more, so I'll read up on it in the coming weeks.

Monday, July 19, 2004

Safari vs. Mozilla
I'm blogging in Mozilla on my Linux box tonight, and there are quite a few differences! While using Safari on OS X I noticed that there was a new "upload image" button, and a couple of other things seemed a bit different, but not that much. I realised that Safari wasn't showing up the bold or italics buttons, so I knew that some things were not being displayed, but until tonight I had no idea just how much was missing! I think I'm going to have to try Firefox on OS X to see if it shows anything that Safari doesn't.

Paged Answers
The pre-loading of pages seems to have gone well, which can be a surprise for threaded code. However, I thought it through quite thoroughly, so I'm cautiously optimistic.

I had to spend a little time today writing documentation on the page pre-loading for GN. This was made a little more awkward by the fact that someone seems to have recently introduced a device that is interfering with my Logitech radio keyboard. It made typing a real chore until I was finally able to find an unaffected frequency.

At the moment the paging ahead is done one page at a time, as this is simple and effective. It should be possible to create a queue of outstanding pages, within the limits of memory, but I'm not sure how to go about finding the appropriate length of the queue. It would be counterproductive to make it too long and get an OOM error. For the time being it seems to be working well. TJ is using it for his tests at the moment, so I'm sure I'll hear about any problems soon enough.

Repeated Variables
My logging problem was so simple that I'm now feeling really, really stupid. As the resolver classes have been introduced they have taken over a number of existing classes, and have duplicated them in the resolver packages. This means that we have 2 nearly identical classes, in different packages. My logging was in the wrong one. Unfortunately, the code to attempt the fix was in the correct class, and so it became clear that nothing had been fixed. However, once logging was working it did show the problem.

The query causing the problem was:

select $xxx <rdfs:subPropertyOf> $xxx from <sourcemodel>
where $xxx <rdf:type> <rdf:Property>

During the course of running this query, the AppendAggregateTuples class was considering appending data found in two tuples objects. The first one had 1 variable called $xxx, while the second had 2 variables labeled $xxx and $xxx. The append operation was expecting that each tuples would have the same number of columns, and so it failed at this point.

The problem comes down to an inability to use the same variable twice in the select clause. From a mathematical perspective, this is correct, but it makes it impossible to express the query.

After discussions with AM and TJ, we've decided to allow variables to be aliased. This would allow the above statement to instead be expressed as:
select $xxx <rdfs:subPropertyOf> $yyy from <sourcemodel>
where $xxx <rdf:type> <rdf:Property>
and $yyy <tucana:is> $xxx

This solves most peoples' problems, but it will make automatic translation of the entailment-rdf-mt-20030123.xml document and its ilk that much harder. The original statement was a direct translation of rule 5b:
<rule name="rdfs5b">
  <premise>
    <subject var="xxx"/>
    <predicate uri="&rdf;type"/>
    <object uri="&rdf;Property"/>
  </premise>
  <consequent>
    <subject var="xxx"/>
    <predicate uri="&rdfs;subPropertyOf"/>
    <object var="xxx"/>
  </consequent>
  <triggers_rule>
    <rule name="rdfs2" />
    <rule name="rdfs3" />
    <rule name="rdfs6" />
  </triggers_rule>
</rule>

Recognising the duplicated var="xxx" and replacing the second with a different variable will be annoying, but apparently necessary.

In the meantime, until pairs of variables are permitted in a <tucana:is> statement, I am writing some Java code to do this rule for me. Unfortunately, it means doing the query and performing an insert for each resulting row. This will be extremely inefficient, but it won't have to last long. As soon as the new queries are available, rules 5b and 7b can be updated to take advantage of them.
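In outline, the temporary Java code amounts to something like this sketch. ItqlClient is an invented wrapper standing in for the calls to ItqlInterpreterBean, and the model names are illustrative only:

  import java.util.Iterator;
  import java.util.List;

  // Invented interface, standing in for calls to ItqlInterpreterBean.
  interface ItqlClient {
    List query(String itql);   // returns the bound values of the selected variable
    void insert(String itql);
  }

  class Rule5bWorkaround {
    private final ItqlClient itql;

    Rule5bWorkaround(ItqlClient itql) { this.itql = itql; }

    void run() {
      List properties = itql.query(
          "select $xxx from <sourcemodel> where $xxx <rdf:type> <rdf:Property>;");
      for (Iterator i = properties.iterator(); i.hasNext(); ) {
        String node = (String) i.next();
        // one insert per resulting row: correct, but extremely inefficient
        itql.insert("insert <" + node + "> <rdfs:subPropertyOf> <" + node
            + "> into <inferredmodel>;");
      }
    }
  }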

Split Infinitives
Yes, I know I'm using them. No, I'm quite comfortable with that. While many claim that it is incorrect, find me a single authority on the subject who claims that it is. Bryson agrees with me here.

Ditto for split compound verbs. :-)

Friday, July 16, 2004

Blogger Bugs
So what happened at Blogger yesterday? When I tried clicking "Preview" I was shown a heap of Javascript. Then when I tried to publish, or save as a draft, a new entry was created, but it was empty. So if anyone out there was wondering why I was posting empty entries, now you know.

Fortunately it's all going again today.

Masters
This morning I was thinking about the research I'm about to start, and it occurred to me that I really ought to be writing down everything I come across that may seem relevant to my future thesis. This includes my work, as well as documents I may read.

Then it occurred to me that I'm already doing that here in this blog! I'm not writing a lot about what I'm reading yet, but at least I'm putting down everything at work that may be of relevance.

I always intended this to be a record of my work, so I can go back and review what I've written in the past, but also so other people (particularly at Tucana) can keep up with how I'm going and what I'm doing. I never thought about study, so it's serendipity that it will be valuable there as well.

Speaking of study, I still don't have a response from the university, but my (prospective) supervisor says that he thinks it will be fine. He's an associate professor in ITEE, so the odds are good that he knows what he's talking about. So I decided that I'd better start reading more again, as I've been getting slack in the last couple of weeks.

Clustering
Of course, the first thing I decided to read had nothing to do with RDF or OWL, but I thought it was better than nothing. :-) It's that reasonably well known paper on the Google File System. It does actually have some relevance to the design of TKS, and ultimately Kowari, as we need to start considering scaling the database transparently over large clusters. DM and I have talked about this on several occasions over the last couple of years, so I found it really interesting reading about Google's approach.

I suppose we'll need to start making more money on TKS before we get the go-ahead to implement clustering. After all, we already scale really well, and we need to make sure the company is around in a couple of years' time. I worked in IT at 6 different places before coming to Plugged In/Tucana, and this is by far my favourite. It's a cross between the interesting work, and the fact that I'm working with the most intelligent team I've ever encountered. With the exception of a few months of leave so I could go programming at Saab (when Plugged In were short on contracts and money) I've been here for 4 and a half years. Not bad for the IT industry, and especially not bad for a company that is in "startup" mode.

Logging for Rules
The logging I was hoping to get going today didn't work quite as planned. I was able to get some time on it, but ended up spending the majority of my time working on other things. At this point I still have no logging explaining what is going on with the problem Tuples. At times like this I'm sometimes tempted to open my own file and start writing to it. Hopefully I won't end up that desperate, but there are occasions when it would be quicker and less wasteful of time.

Other Things
A little while ago I implemented the AnswerPage code to reduce RMI calls when iterating over Answer objects. This worked quite well, but results in a pause whenever the end of a page is reached and the next page has to be fetched. To help with this problem TJ made a few requests today. First, I made the page size configurable with a system property, so that these pauses can optionally be made less frequent (by making the page sizes larger). Second, I introduced a background thread to preload the next page. Of course, it was the latter optimization which was more time consuming.

The code to do this started out with all sorts of clever locks to make sure that the pages would be loaded at the correct time, and that there could be no races. Of course, that is always a ridiculous thing to do with threaded code, so I spent the rest of my time paring it back, and considering all possible race conditions. I'm quite pleased with the final code, as it is very simple, and uses only one flag that can be shared between threads.

While access to this flag should be quite safe, AN pointed out a few months ago that just because variables get set in a particular order, there is no guarantee that a CPU will in fact execute the code in the order expected. The only way to guarantee that things occur in a desired order is to issue a write barrier, in this case with a synchronized lock.

The prefetch code only sets the flag to indicate that a page has been loaded by the reading thread, and only after the prefetch thread has terminated and been joined on. This is a reasonably safe bit of code, so I've avoided putting real synchronization in.

One problem with waiting for an outstanding prefetch thread to finish is that it may take longer than the client is prepared to wait. For this reason I've put a timeout on the Thread.join() call. If the join returns as a result of a timeout, then there is no way to check except by looking at the flag which indicates successful completion. This is the one place I could see a race happening, but it doesn't actually matter. The Answer object that the next page is being prefetched from is a stateful object, and if a prefetching thread fails to retrieve a requested page then the Answer will have been moved onto the next page internally. Since there is no way to roll the Answer back to a previous page, the only option on a timeout is to throw an exception (or else a page would silently go missing while the results were being iterated over). I was initially going to try reissuing a failed page prefetch, but this realisation made my life much easier.
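Boiled right down, the pattern looks something like this sketch. The remote pager type is a placeholder for the real paged Answer, and the real code keeps rather more state than this:

  // Stripped-down sketch of the prefetch pattern; RemotePager stands in for the
  // real paged Answer, and error handling is reduced to a single exception.
  class PagePrefetcher {
    private Object prefetchedPage;          // the page fetched by the background thread
    private volatile boolean pageReady;     // the single flag shared between threads
    private Thread prefetchThread;

    void startPrefetch(final RemotePager pager) {
      pageReady = false;
      prefetchThread = new Thread(new Runnable() {
        public void run() {
          prefetchedPage = pager.nextPage();  // remote call; may be slow
          pageReady = true;                   // set only once the page is fully loaded
        }
      });
      prefetchThread.start();
    }

    Object nextPage(long timeoutMillis) throws Exception {
      prefetchThread.join(timeoutMillis);     // wait, but not forever
      if (!pageReady) {
        // The remote Answer has already moved on internally, so the page can't be
        // re-requested: the only safe option is to give up.
        throw new Exception("timed out waiting for the next page");
      }
      return prefetchedPage;
    }
  }

  // Hypothetical remote interface standing in for the real paged Answer.
  interface RemotePager { Object nextPage(); }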

It's late, and Luc didn't give me any sleep last night, so I'm feeling too stupid to do any proof reading. There's probably more to say, but it will have to wait for a time when I can actually keep my eyes open.

Thursday, July 15, 2004

ClassCastExceptions
The rule which caused this exception is 4b. Looking at it with TJ this morning made it quite obvious what the problem was. Going back to the original rule we see:

  4b.     xxx aaa uuu                :  (nt)
          uuu rdf:type rdfs:Resource    (t1)

Only this is wrong. The above statement claims that for every subject/predicate/object statement, the object has a type of rdfs:Resource. This is not true for literals, nor is it true for anonymous nodes. The ClassCastException occurred when rule 4b attempted to put a literal into the above statement as a subject.

To fix this I need to do two things. The first is to prevent literals from being inserted. That will eliminate the ClassCastException. The second is to prevent anonymous nodes from having these statements made about them. These currently aren't causing any exceptions and are erroneously (though harmlessly) ending up as inferred statements.

The best solution for this would be a constraint which allowed statements to be constrained by node type. Only constraints usually describe existing statements, and there are no statements telling us about type, as this is exactly what rule 4b is trying to create. So I can't create the statements describing type, because I don't have any statements which describe type.

One possibility is to create a "magic" constraint, a little like trans, which allows selection of nodes according to type. Unfortunately, among other things, this would need full types support in the String Pool. This is planned, but doesn't exist yet.

The other option is to filter statements according to type. This is an undesirable method, as it does not scale well. For instance, on rule 4b, every statement in the system would need to be returned, and those whose objects are legitimate URI references would be kept. That will work, but will not scale to large systems (it will slow down linearly with the number of statements). For the moment though this is all that is available, so I spent the day going this way.

The simplest filter is to find those instances when a node of type literal is to be inserted as a subject, and to move on to the next node instead. I implemented this by catching the ClassCastExceptions as they came through. This has the added bonus of preventing exceptions like this being thrown in other situations as well, such as when a literal is used as a predicate.
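The idea is no more sophisticated than this sketch. The Stmt type and the insert helper are placeholders for the real Kowari classes; the helper is assumed to cast its argument to a URI reference internally, which is where the ClassCastException comes from:

  // Placeholder types: Stmt and insertTypeStatement() stand in for the real classes.
  interface Stmt { Object getObject(); }

  abstract class ResourceTypeInference {
    // performs the rule 4b insert; assumed to cast its argument to a URI reference
    abstract void insertTypeStatement(Object subjectNode);

    void inferResourceTypes(Stmt[] statements) {
      for (int i = 0; i < statements.length; i++) {
        try {
          // rule 4b: assert "uuu rdf:type rdfs:Resource" with the object as the subject
          insertTypeStatement(statements[i].getObject());
        } catch (ClassCastException e) {
          // the object was a literal, so it can't be a subject: skip this statement
        }
      }
    }
  }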

Of course, the anonymous nodes still need to be filtered out, but it doesn't hurt to leave them in for the moment in order to see if all is working as expected. Removing anonymous nodes needs to be done after globalization shows up their type. AN suggested creating a new query type that does the same as the current type, but which also knows to filter out these nodes when globalizing. It would simply create a different type of Globalized Answer Iterator for the task. This should work, and in discussing it we realised that the current "closable" iterators are not being closed, so I'll make sure that gets done when I get to making these changes.

Once inferencing was working with anonymous nodes still present, I opted to postpone the anonymous node issue. Instead I moved on to allowing the rules to recurse by creating inferences on inferred statements as well as the base statements. Unfortunately, this caused immediate problems.

Empty Tuples
Tuples have a fixed number of columns with names, and a set of rows containing data. A short time ago it was decided that if a Tuples object were to contain no rows, then it could be represented by a constant object known as an Empty Tuples. From a mathematical perspective this works, but practically it has caused me grief on a few occasions, with today being the most recent case.

There are a number of places in the code which assert that two Tuples objects which are to be joined have the same number of columns. However, if a tuples is empty, then the Empty tuples will be presented instead, and the number of columns will be zero.

When the empty tuples object was first introduced it should have been made mandatory that all column comparisons were to be made with a method on the class, rather than by comparing the number of columns returned from getVariables().length. This method would always return true if one of the tuples in the comparison were the Empty Tuples. Since this is not the case, any assertions which expect the number of columns to be equal suddenly started failing. Unfortunately, these did not occur often enough in the code to be seen very frequently, so the problem was largely overlooked.

I've run into the problem before, and it struck again today when a selection against inferred statements returned zero rows. The frustrating thing about assertions like this is that they are runtime exceptions, and so RMI does not know how to deal with them. The result is simply an RMI exception and no description of the problem at all.

Since there are too many places in the code to track down all the assertions and assumptions that the number of columns are equal, the fixes had to be local to this particular piece of code. A quick fixup didn't address all cases, so with some help from AM I added some logging to tell me what objects were being compared, and what their contents look like. As usual when I try to use logging in this system, it didn't come up anywhere for me, so I will need to figure out why before I can get the rules working again.

Once the rules are running on inferred data as well as base data, then I can go back and remove anonymous nodes from the results. After that, I'll talk with AN about integration, and then move onto OWL.

Wednesday, July 14, 2004

Dinner
I didn't log my work last night as I was in a hurry to get out to a restaurant with Anne. It was Bastille Day, and a lovely little French restaurant has just opened just a few doors away.

However, it seems that some of the guys here at work have been hanging on my every word, and I've been chastised for my laxity. So I thought I'd better put in a quick update. Besides, it helps with my intention of logging what I'm doing, so I can come back and have a look at it if need be.

Rules
Yesterday was spent debugging the rules. First of all, I needed to properly load the rules, and I found that I couldn't load up the .drl XML file that holds all the rules and code. I assumed that this meant it needed to be in the jar file as a kind of resource, so I built a script which did all the compiling and jarring for me (some of this was done on Tuesday, and I finished it on Wednesday). It's not big enough to justify setting up Ant at this point.

Unfortunately I was now getting an error telling me that there was no method named buildFromUrl on RuleBaseBuilder. This was a bit confusing, since the example code all used this, so I went grepping. I discovered that Drools has a RuleBaseBuilder class in org.drools and another in org.drools.io. While I can see the convenience of being able to use the same class name in different packages, I've never approved of it when it can be avoided. It forces anyone using these classes to be explicit about packages whenever they use the class. Still, it was easy to fix once I found the problem, and I could start loading the .drl file.

Since Drools does a lot of interpreting of Java code inside of the XML, I was often getting errors telling me that parsing or execution was unsuccessful, but with no information as to what the problem was. One block of code in particular was quite long, and a failure in there was causing me all sort of grief.

Once I had this going, I found that all my &rdf; and &rdfs; entities were not being parsed by iTQL in the way I expected. After speaking with AN I discovered that the GUI automatically loads a file to create aliases for these entities. I had assumed that these aliases were built in, but it makes sense that they're not. For some reason I had also assumed that user aliasing would be done by the GUI (since one helpful effect of aliasing is to reduce typing), but the connection does it instead. So I went through the GUI code, and discovered that it is loading a file named default-pre.itql, which in turn holds the namespace aliases that I needed. I now have the Bootstrap code inserting these aliases for me. Better that than making the queries less readable with lots of redundant namespace info.

As an aside, while setting the Bootstrap class up I discovered that ItqlInterpreterBean does not accept uppercase commands. Way back when I was programming databases I used to put all SQL keywords in uppercase as a matter of course, so this one tripped me up. The error I was receiving was unhelpful too, telling me that it expected an EOF in the statement.

Finally I found that the rules were not executing as I expected. It kept executing the same rules over and over again. Lots of logging later I found the reason. Drools caches the results of its precondition tests in order to determine whether a rule needs to be run, and I know it does this to avoid running rules unnecessarily. However, I had subconsciously assumed that this optimisation was only there to skip redundant tests, and that Drools would still run the tests whenever it thought a rule might need to be executed. Only it doesn't run the tests at all.

The logs finally showed me that all the tests for the rule were being run, with the exception of the test that indicated that the data had not changed and hence no insertion would be necessary. This was because the object in question had not had drools.modifyObject called on it, so the Drools framework saw no need to test any conditions which used that object. Since Drools did call the tests in the other condition statements, the system is obviously parsing the statements to find the objects being tested in them. This is a little more complicated than simply interpreting and running code found in a statement.
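In other words, the fix on my side is just to be explicit in the consequence whenever an object that appears in a condition has changed. Something like this, as I understand the Drools 2 consequence code (the rule object is mine):
// Fragment of a rule consequence (the Java inside the .drl's CDATA block).
// Drools only re-tests conditions that use an object once it has been told the
// object changed, so the consequence has to flag the rule objects it triggers:
drools.modifyObject(triggeredRule);   // triggeredRule is one of my RDFSRule objects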

Once everything was going I ran headlong into a ClassCastException coming out of an insert statement. I've now found out why, but not before leaving yesterday, so I'll leave that description until the next day's entry.

Tuesday, July 13, 2004

iTQL Classpaths
It turns out that the iTQL jar holds almost nothing but other jars, plus a bootstrap class. I haven't looked at this class yet, but I already know that it unpacks the other jars and sets them up in the classpath, before starting the iTQL GUI.

Unfortunately this process is not what I need in order to get an ItqlInterpreterBean object. The unpacking and path setup are fine, but the subsequent execution of the GUI is not needed. The functionality may be separately available if I look carefully, but for the moment it's easier to just unpack the jar manually and set the extracted jars up in my classpath before running. On the other hand, this is all going to take quite a bit of gluing together when it gets fully integrated.

At this point it's all running, but the iTQL insertion queries are failing. Some extra logging tells me that there is an unexpected EOF. KA and ML suggested that the terminating semicolon is not needed when I'm not using the GUI, but that doesn't explain why the select queries are working fine. I'll sort it out in the morning.

UQ
Finally got all the application paperwork done and mailed into UQ. Now I just need to see if they'll accept me. In the meantime I've emailed my prospective supervisor to help me get started, as I need to start picking up my efforts in this regard.

Monday, July 12, 2004

TKS Hooks
All the Java classes are now done for hooking the Drools code into TKS, but I'm still setting up the execution environment. I was expecting to use Session objects directly, but I stuck to iTQL instead. Fortunately there has been a lot of effort to make this very efficient (eg. the trans statement permits all transitive predicates to be found in a single query). Still, if this all works out, then it would be worth profiling to see if it could be done faster by skipping the top layers.
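As an example of what staying at the iTQL layer buys, finding an entire transitive closure is a single statement rather than a loop of rule firings. A sketch only: the trans syntax is from memory, the model URI is made up, the rdfs alias is assumed to be in place, and the bean's method name may not be exact.
// Assumes an ItqlInterpreterBean named itql, as in the Bootstrap code.
String query =
    "select $s $o from <rmi://example/server1#model> " +
    "where trans($s <rdfs:subClassOf> $o);";
String result = itql.executeQueryToString(query);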

I don't really know much about TKS, or how it differs from Kowari. I mean, I know what makes TKS different (since security is one part of that, and I wrote the first implementation of the security layer on my own!), and I know most of the shared code, but I'm not sure what things had to be made different to tie in the extra features. I keep expecting to get tripped up on something, but so far it's OK.

At the moment the code which starts the process is just a test class. I really don't know how or where it should be hooked in, but getting the code running is the most important thing at this stage. I can always copy and paste it if it needs to move.

QT3
A "fink" install of QT3 resulted in an internal compiler error, of all things! Apple left a message in the error handling code which asked me to lodge a bug report. Only fink cleaned up the working files, so I had nothing to report. Still, I was able to report it to the maintainer of the package.

In the meantime I've downloaded the source. I'll set it running before bed, to see if it compiles without fink. Hopefully it will, as I'm still reading that QT programming book, and I also need it to compile MythTV.

Saturday, July 10, 2004

Meetings and Classes
I'm a day late with this one, but yesterday was very unspectacular, so there wasn't much to write about anyway. We had a company-wide meeting, which took up some time. All of that went well, but it really cut into what could be accomplished for the day.

Afterward I started implementing the bootstrap and RDFS rule classes for Drools. They're now mostly finished, with the exception of actually connecting to a database session to execute queries. This needs to be done, but I'm considering whether there is something else I can wrap them around, so I can run a test case separately from a real database. The day moved a little slowly, so I didn't get any further with this by the end of the day.

TV
I spent some time last night trying to watch streaming video from my Linux box to my new PowerBook. It was a resounding failure.

There are a couple of options I can try. Since I'm running MythTV on the Linux server I thought I could try running the client on the PowerBook. This needs QT and other libraries. So I decided to use Fink, since this supposedly removed the burden of getting the software properly configured for a Mac environment. Only Fink failed the build for QT. I've moved onto the "unstable" installation for Fink in the hope that this might have the problem resolved (now isn't that ironic?). It's trying to install as I type.

The other alternative is to use VLC. This would seem good, except that I can't make it work as a server. The documentation for streaming DVB with VLS talks about the need for a .dvbrc file, which I don't have. Examples of this file are apparently in dvblib, but this is not a package I've been able to find (linuxtv talks about a new project called dvblib2, but this has nothing in it yet). I've also looked on the net, but the closest I've found to an example .dvbrc file has been posts in which people show individual lines from their files, and these are always set up for DVB-S.

Tuxzap has options which talk about making VLC compatible files, but all of the lines in the resulting file start with the channel name, causing VLS to fail. The only option I think I have now is to read the source code which reads .dvbrc, but I have no desire to do that at the moment.

I got much closer using dvbstream and VLC as a client. However, while a network connection was established the streaming still didn't work. The best I could get was a green screen. Maybe I should just write to a pipe, and pick it up with NFS?

It'll be nice if I can get TV showing up on the PowerBook. Even just a few years ago, who'd have thought that we could watch TV coming in through radio waves? ;-)

Thursday, July 08, 2004

Rule Implementation
I've now converted the RDFS rules into an appropriate drl format. This includes instantiation of a set of RDFSRule objects, each with the iTQL associated with the statements generated from that rule. Because of the formatting needed for the iTQL, all of this part of the drl file was wrapped in a CDATA section. The conditions for each of the rules are almost entirely used for identifying the correct rules by name.

Each RDFSRule has also been given a reference for the bootstrapping object. This is the only object to be created in actual Java code, and so is not scriptable. It is this object that will provide all of the session code to the rules so that they can perform operations on the dataset.

Now I just need to finish writing the RDFSRule class. It needs to be able to perform the iTQL query given to it, and store the length of the returned result. It will then need to be capable of testing this old result against the new. This test needs to be called in the drl file for the rules and will form the final condition needed for each rule (other than name identification). The consequent of a rule will be to insert the results of its query. For the moment this will include tautologies, but will eventually need to get cleverer.
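To pin the plan down, this is the shape I have in mind for the class. It's a sketch only: every name in it, including the Bootstrap methods it calls, is mine to change once I start writing the real thing.
public class RDFSRule {
  private final String name;          // matched by the rule conditions in the .drl
  private final String query;         // iTQL select for this rule's entailment
  private final String insert;        // iTQL insert for the generated statements
  private final Bootstrap bootstrap;  // provides the session/iTQL access
  private long lastSize = -1;         // size of the previous query result

  public RDFSRule(String name, String query, String insert, Bootstrap bootstrap) {
    this.name = name;
    this.query = query;
    this.insert = insert;
    this.bootstrap = bootstrap;
  }

  public String getName() { return name; }

  // Condition helper: the rule only needs to fire if the query now returns
  // more rows than it did last time (results can only grow, never shrink).
  public boolean hasNewResults() throws Exception {
    long size = bootstrap.resultSize(query);   // hypothetical helper on Bootstrap
    boolean changed = (size != lastSize);
    lastSize = size;
    return changed;
  }

  // Consequence helper: insert the entailed statements (tautologies and all, for now).
  public void insertResults() throws Exception {
    bootstrap.execute(insert);                 // hypothetical helper on Bootstrap
  }
}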

I'll need to profile how these rules are being called, to make sure that these iTQL statements are not being executed too often. I believe that won't happen, but there are a couple of corner cases that are leaving a nagging feeling in the back of my mind.

Computers and Stuff
I finished the UQ application last night, after much procrastinating. Strangely, once I looked at it I discovered that there was nothing left to be done except write an accompanying letter and print everything. I'm wondering why I thought I still had work to do? Of course, I forgot some of this paperwork today, so now I can't mail it until tomorrow. I guess we'll see what UQ has to say about it.

So I was in bed late, and yet there was no way I could sleep. No idea why. Lack of sleep also led to me putting my back out in the gym this morning, so I need to stick to swimming, with no more cycling/running for a few days. Yuck. Consequently, everything described above may sound a little dazed. I'll try and do better tomorrow.

Today wasn't all bad though. The new PowerBook showed up. I'm using it now and it's a lovely little thing. Strangely for Apple, they left out some installation options. For instance, with the new PowerBooks there is no provided way to install X11. A little Google searching helped find the method (others have also commented on this problem). After inserting the "Restore" DVD, there is a "hidden" directory (bash sees it fine, but the Finder doesn't show it) called "System". Under that are several directories, including one with all the packages one might need.

DM also had another handy little hint for me. It seems that every package installed appears in a directory called /Library/Receipts. I'll have to remember this (hence my posting it to this Blog).

Wednesday, July 07, 2004

Drools Rules
I now have a much better grasp of Drools, and feel confident enough to proceed using it. I'm still discovering some of the consequences of particular aspects of the system, so some of my ideas change from time to time. However, I've made a start implementing RDFS, and it seems to be going well... so far.

As I learned some of the specifics of Drools it occurred to me that the natural mapping for this stuff would be to have all of the statements of a graph in the Drools working memory, with the rules operating directly on the statements. This would mean that almost all the work could be done in the .drl description files. Unfortunately this has several problems. For a start, there's no way it would scale. The only way to go about it would be to rip out the working memory implementation in Drools and replace it with a Kowari/TKS graph. This would work, and actually has some merit, but it would also be a BIG job. Perhaps I'll consider it at a later date.

Another problem is that letting Drools loose on the statements would completely sidestep all the good work done to make the trans statements work so efficiently. This is a really bad idea.

At the end of the day, it may be worthwhile taking the algorithms of Drools and applying them directly to Kowari/TKS, rather than using the Drools framework itself. That way it could use the efficiencies of things like trans, but still do all the work in a rules-based environment.

For now though, the rules have to be implemented differently. Each RDFS rule is being represented by a Java object. These objects will know the type of query to make, plus the types of statements to generate based on the query. I'm planning on using a generic class for all of the rules, with the differences being defined in the constructors. Given that I've been able to express all of the RDFS rules in iTQL (except for rule XI), this should not be too hard. It also lets us use such constructs as trans, which will result in a massive reduction in triggered rules (this was the whole point of writing trans). In fact, a conversation with AN today has me thinking that much of the circular triggering of RDFS rules will not need to occur, because we don't need to reiterate rules for anything commutative.

Since we want the rules to be in a configurable XML file, they will need to be instantiated and put into the working memory from inside the .drl file, and not in the Java code that invokes Drools. To do this they need to be inserted by a Bootstrap rule, whose only purpose is to initialise the working set. This Bootstrap rule will also pick up an initialisation object that the Java code will put into the Drools working memory for the purposes of providing access to iTQL and a Kowari/TKS database session.
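That should keep the Java side of the invocation down to a few lines; roughly the following, assuming the Drools 2 API names I've seen in the examples (Bootstrap is my class, not Drools', and drlUrl and itql are as in the earlier sketches):
// Build the rule base from the .drl, hand in the one hand-constructed object,
// and let the Bootstrap rule create and assert the RDFSRule objects itself.
org.drools.RuleBase ruleBase = org.drools.io.RuleBaseBuilder.buildFromUrl(drlUrl);
org.drools.WorkingMemory memory = ruleBase.newWorkingMemory();
memory.assertObject(new Bootstrap(itql, modelUri));   // supplies iTQL/session access
memory.fireAllRules();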

Once going, each RDFS rule will be represented by a Drools rule. The first thing these rules will do is to pick up the Java object associated with the rule (confirming by name), plus any other rules which are triggered as needed. The Java object will then perform its query, and compare the returned size against a cached value. If there is a difference, then the condition for the rule will be met. Otherwise the rule won't have to run. The result from a query can only be the same size or larger than a previous result (since data is only ever added).

The consequence of a rule will be to tell the Java object for that rule to perform the required insert. It then tells the Drools working memory that the Java objects for the rules it triggers have all been updated. There is no need to actually modify any of these objects: the results of their queries may already be different, and if not, then there was no need to execute the triggered rule, and its consequence will not be invoked.

This all seems to be coming together with one exception. The data to be inferred will go into a separate model from the base facts. If it is possible to infer a statement that appears in the base data, then it will end up in the inferred model as a tautology. I'll leave it this way to start with, just to get it working, but it needs to be addressed.

There are two ways to prevent these tautological statements. The first is to let them all go in, and then remove the intersection of the base data and the inferred data. This would not seem to be ideal, and if the intersection is large then a lot of statements will be redundantly added and removed, incurring a significant overhead. The other way is to test each statement for redundancy as it is going in. This means that I can't use an "INSERT INTO.... SELECT FROM" type of statement in iTQL. However, if I use some lower level interfaces it may not be all that expensive. At least it will only result in lots of lookups and no insert/removes. Kowari/TKS was specifically designed to look statements up very quickly, while modifications take much more time.
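Just to sketch the second option: the names below are entirely hypothetical, since I haven't yet looked at which lower-level interface would expose the statement lookup, but the shape of it would be a test per statement rather than a bulk insert-and-remove.
// Hypothetical: 'session', 'Triple' and the two methods stand in for whatever
// the lower-level Kowari/TKS interface actually provides.
for (Iterator i = inferredStatements.iterator(); i.hasNext(); ) {
  Triple t = (Triple) i.next();
  if (!session.contains(baseModel, t)) {   // lookups are cheap in Kowari/TKS
    session.insert(inferredModel, t);      // only write genuinely new statements
  }
}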

I'm still writing the XML, and I plan to have it finished early tomorrow. That then leaves me with the interesting task of building the RDFS Java objects. I'm putting off thinking about that too much until I can't avoid it. :-)

DVD Burners... again
I've had partial success with the new burner. I had a single disc burn correctly under Windows, and another not (it failed halfway through the verification). It turned out that I have 2 types of DVD-R. The first is rated for x4 burning, while the second is x1 only. I did both burns at x2, so it's no wonder that the second one failed. This had me feeling better. Under Windows at least, it looks like the drive works just fine. So far.

Back under Linux was a different story. I was able to blank a DVD-RW with dvdrecord, but I can't use dvdrecord or cdrecord-ProDVD to burn an image. Whenever I try, I get an error straight away. Perhaps I should re-enable scsi-ide and try it again (so I can say dev=0,1,0 instead of dev=ATAPI:0,0,0). I might also take a look for the latest in firmware to see if that helps.

It's for reasons like this that I'm thinking I'd like to go to Mac OSX. After all, who wants to stuff around with hardware configurations when you just need to back up some files?

Tuesday, July 06, 2004

Blogger
I nearly forgot... What happened to Blogger last night?

After I finished last night's entry I published it, and Blogger reported that all was fine. However, when I tried looking at the site, the new entry was not there. Looking at my list of entries, the latest one was there, but if I clicked on "View" I was taken to a 404.

OK, so there was a problem on the Blogger site. So I went to the page where I could report a problem, filled in the form, and pressed Submit. I got into work this morning, and I had bounced mail in my inbox telling me that the recipient of the form I submitted did not exist! That one made me laugh. You'd think that if there was a problem with the system, then the mechanism for reporting such problems would be kept running. It seems strange to think that the sysadmin's email could be on the same machine as the one running Blogger (or the part of Blogger that was failing). Maybe it wasn't, and the failure was more catastrophic than I thought. :-)

It couldn't have been too bad. It all seemed to be working by morning, and I didn't even need to republish.

Drools and Maven
I spent the day playing with Drools, mostly with an eye for porting the RDFS rules into it. After fiddling with JARs for ages I decided that I should just let it go, and use CVS. That meant I needed to build the system, and that meant I needed Maven.

I'd never heard of Maven before, but it seems to be an XML-based build system not unlike Ant (yes, I can see that it's a little different, but in Drools it's being used to do the builds). The main page describes it as a "project management and project comprehension tool".

So I ran the examples through Maven, and these worked fine. But to integrate Drools we need to set up the environment without Maven, and it wasn't as easy as I thought. I took all of the JARs I thought I needed from Drools and put them in the classpath. Then I ran the example... only to find that it didn't seem to be configured correctly. The stack trace didn't help much, but there was an error message which seemed to indicate a problem either in configuration or in the ability to find the namespaces:

no semantic module for namespace 'http://drools.org/rules' (rule)

The Maven XML configuration did not appear to offer any hints on configuration, so that left the possibility of a missing class. I added a few, and then all the JARs provided with Drools, and none of it helped. I thought to try running Maven in verbose mode, but -v didn't seem to do anything. Finally AN suggested I try maven --help which said that -X provides a debug option. Some days I end up feeling really dumb.

Debug on Maven showed a whole heap of classes under a ~/.maven/repository directory. Huh? Where did this come from? Well, Maven seemed to have a LOT of JAR files which Drools didn't come with but obviously needed, including some directory-service-type JARs. This got it going, and has let me proceed with writing my own code that doesn't use Maven. Took me long enough. Hope I'm more productive tomorrow.

LG DVD Burner
I got the drive replaced, but I haven't been too happy about how the new drive has behaved. This time it's an 8042B, rather than an 8041B.

The first thing I did was format a DVD-RAM disc. This worked perfectly (whew!). Then I tried to burn a DVD-R. This failed, but unlike the returned drive it told me instantly that it wouldn't burn the disc. The old drive would take several minutes before failing, so this seemed to be an improvement.

Back to Windows, and I tried to burn a DVD-RW disc with verification. The burn proceeded, but it took a couple of tries before the drive acknowledged the disc. Once the burn finished, the verification didn't even start. This had me concerned about the ability to read the disc, so I tried opening it in Explorer. This too failed, but I thought to look at the properties of the drive, and found the "Burn" options. Changing some of the settings in there gave me a message stating that enabling the drive for reading would mean I would be disabling the ability to read and write DVD-RAM. This looked really promising. Sure enough, I could suddenly read the disc. So I tried a new burn with verification. This started well, but halfway through verification it stopped. No error, but it didn't proceed for over an hour.

I'm wondering about the fact that Windows can't configure the drive for both DVD-RAM and DVD-R at the same time. If this is a drive setting, then this would explain what I saw when I couldn't write to DVD-R after formatting a DVD-RAM. I've rebooted to Linux, and I'll have a go at dvdrecord again later.

So within the space of a couple of days I've become really dissatisfied with LG drives in general. I'll probably start using it more as just a DVD-RAM drive, and let the new notebook (when it gets here) meet all my DVD-R needs. Then I can replace the LG with a better brand of drive that does dual layer once I get a little money together. These drives are getting really cheap lately. The LG I bought originally cost $230, and now it's down to $120. The dual layer version is only $190 (and no, it's more than just a firmware difference). Matsushitas, Sonys and the like all cost more, but I'm pleased to see prices coming down so rapidly.

I'll just have to research if another manufacturer supports all formats (+/-/RAM).

Monday, July 05, 2004

DVD Burners
Tonight will be relatively short. Why? Because I've just spent hours trying to get my DVD burner to burn! I spent quite a few hours on the weekend as well.

I regularly update Debian, so it's always possible that I got a new version of a library or something, so when a backup didn't work for me yesterday I spent many hours trying different methods of writing to a disc. I had another go today, and got more and more frustrated as time went on.

The drive is an LG GSA-4081B. This is a nifty drive that burns DVD+-R(W) and DVD-RAM. The DVD-RAM has been particularly nice, and I've used it often. However, I had quite a bit of unchanging data that I wanted to back up, so I popped in a DVD-R, only to discover that it wouldn't work for me. I have DVD-R discs I've burned with it in the past, so I know it used to work. I've used every driver available for dvdrecord and cdrecord-ProDVD, and on each occasion it tells me that the drive is not ready. I went out and found the latest firmware, to no avail... though it did fix an ongoing issue I've had reading data from the second layer of a DVD in Linux.

Finally I tried doing a mke2fs on a DVD-RAM disc, and that failed as well. Since I use these discs all the time I realised that I now had a real problem. Strangely, it still writes to already formatted discs, so the burning laser is operational. In the end I gave up, and did what I should have tried hours ago... I ran Windows.

Once in Windows I started running the software that came with the drive. It all seemed easy enough, but when it asked for me to insert a disc it wouldn't recognise when one was there. After several ejections and re-insertions of a DVD-R it finally acknowledged a disc, but then immediately came up with an error. I tried the same thing with a DVD-RW disc with the same result.

So now I've found the receipt and I'll be calling UMart in the morning. I hate trying to convince suppliers that I have intermittently faulty equipment.

At least my PowerBook is now in the country. Hopefully I'll see it in a few days. Apple can be a little slow about these orders though, so I'd better not hold my breath.

Rules Engines
I spent most of today reading documentation for Mandarax and Drools. "Silly me" went and started learning the architecture rather than the API before I realised exactly what Drools is. This was definitely not needed for what we plan on doing with it, since Drools is really just a framework for executing rules as needed. Perfect for RDFS. On the other hand, it was useful reading a description of the RETE algorithm.

I haven't really got into Mandarax yet, as I won't be using it to start with. However, it has some great potential for providing backward chaining. For on-the-fly inferencing this will be essential, but it seems that there are no commercial applications out there that do this. Obviously backward chaining is not an easy thing to address, but it is still an important thing for us to get working.

Anyway, I have a little more to learn about Drools tomorrow, and then I'll be attempting to set up some simple rule systems just to make sure I have my head around it. With any luck I'll have the RDFS rules going in a couple of days.

Thursday, July 01, 2004

Lucene Indexes
Today was a little unsatisfying, but gratifyingly short. The problem with Lucene indexes had been addressed by others in the last 2 days, so a CVS update was able to help. Unfortunately the update also brought in another file that was broken. In this case it was due to someone checking in a file when they thought that their build was fine, when a completely clean build would have shown them otherwise.

The problem file was the SubQuery class in the Resolver package. This has been making a transition from using DatabaseSession objects to using ResolverSession objects. For the moment I had to fix it by passing both objects to the constructor. That was enough to get it compiling, and I can leave the rest for AM to clean up, as this is what he's currently working on.

So I tried to run Kowari again, and once more I had issues. This time the Database object is trying to add statements to the system model. However, for some reason it has picked up a read-only phase to do this with, and of course it bombs out with an exception. This is tied up with the transaction implementation for resolvers. Again, this is code that AM is working on, and he doesn't expect it to be finished for a while.

So the remote resolver testing is blocked, with nothing that can be done until AM is finished. While I'd much rather finish what I'm doing first, it does leave me free to move onto inferencing again. At last! :-)

Inferencing
As a first cut on inferencing we want to use a brute force approach and store the inferred data in a separate model. This will be a far cry from what we can really achieve, but it gets us up and running.

To start this, TJ and AN are suggesting that we use an established engine (with a compatible licence) to do the work for us. We can then use this to generate the iTQL that will do the inferencing. I've been given a couple of suggestions, but AN likes DROOLS, so I'll look at it first.

Resolver Factory
Today's episode with the resolver went as slowly as yesterday's. In this case the problem stemmed from the Resolver Factory being unable to provide a Resolver for the remote model being requested. Many stack traces ensued while I tried to work out why this was happening.

Unfortunately I was a little hamstrung by being unable to log anything, as the code in question is being run by JUnit, and logging configuration is therefore unavailable. This meant that I had to put anything I was interested in into the messages being passed around with the thrown exceptions. Consequently I took longer to see the problem than I would have otherwise.

The client code was asking for a model which is found on a server according to the URI rmi://jaws.bne.pisoftware.com/server1 (We've been using the names of Bond henchmen for the desktop machines). The Resolver Factory was reporting that it couldn't connect to a session called "server". It took me some time to spot that it was supposed to be saying "server1" instead. Once I found that, I started looking through what URI had been configured, and what had been requested. This is where the lack of logging caused so many problems, as I couldn't just log the URI value whenever I wanted to see it.

Once I'd gone this far I asked AM for some help, and he suggested that it might have been trying to remove a "#" from a model URI, and was taking off the "1" instead. This made perfect sense, and after only a little searching I found it. This then led to 2 questions. First of all, why was a server URI being passed in when the code clearly expected a model URI? I suspect that this is to do with an assumption made for local models that isn't holding for remote models. It isn't a major problem, but I should work it out, in case it has consequences I don't see yet.

The second question is less important, but more perplexing. The problem code looked like this:
URI server = new URI(modelStr.substring(0, modelStr.indexOf('#') - 1));
Now unless I'm misreading this, if modelStr does not have a "#" character, the indexOf method should be returning -1, and the substring method should be throwing an exception when it gets -2 as the second parameter. Instead it is returning the full modelStr minus the final character. I have seen this before, and it seems to be a bug in the String class, at least on Linux.

So I fixed this to leave the URI string unchanged if it does not contain a "#" character, and to go ahead with the modification above only when the character is present (a sketch of the change is below). This should have led me to the next bug in the chain of Resolver bugs, only things didn't quite go according to plan...
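The guard amounts to something like this. I'm paraphrasing the change rather than quoting it, and I've written the truncation as cutting at the "#" itself, which is presumably what was intended (the original arithmetic also took the character before it):
int hash = modelStr.indexOf('#');
URI server = (hash == -1)
    ? new URI(modelStr)                       // server URI: nothing to strip
    : new URI(modelStr.substring(0, hash));   // model URI: drop the fragment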

Lucene Indexes
Now that I'm using the "new" build script for the resolvers, I have found that some of the ant configuration is not quite right. Other than the path problems I have mentioned in the last few days, it seems to have some fundamentally different behaviour for some classes which I didn't realise had changed.

The main problem I've had has been with a locking file used by Lucene. Whenever I tried to start a Kowari server it would fail to run with an error describing a timeout waiting for a lock file. To remedy this I have taken to using rm /tmp/lucene-*.lock each time I've needed to run a server. This didn't always work, and occasionally I would see a startup error saying that Lucene was unable to delete the /tmp/lucene-xxxxxxxxxxx.lock file because it did not exist. To work around this I would just remove the server directory, do a clean build, and run it all again. Until this afternoon this worked fine. Suddenly it doesn't. I either get a timeout on the lock file if it exists, or an error attempting to delete the lock file if it doesn't. Either way the server fails to start. Clean builds and empty directories don't fix it either.

It looks like I need to learn what Lucene is doing before I can do any further work.