Working notes: 05/01/2005

Monday, May 30, 2005

Traversal
It's late, so I'll be brief.

I need to traverse my way down the constraint tree of a where clause. This is an issue because none of the nodes are named. So the only way to travel down is with a set of conjunctions in the iTQL query. Unfortunately, every level down in the tree means a new conjunction, and the tree is arbitrarily deep. So how do I go down?

There are a few solutions.

The first is to name everything. That would work, but should not be necessary, and would be a pain to use. Besides, an unnamed node should not cause it to all stop working.

Another solution is to use JRDF and traverse my way down manually. This is undesirable as I'm already using iTQL heavily. It also runs into a problem of re-using blank nodes from one query to the next. That should be OK, but strictly speaking is not allowed.

An alternative is to automatically generate iTQL to traverse its way down the tree. This would work, but would also lead to messy results with a lot of work to interpret. It would get even harder if different branches on the tree were different depths.

Thinking about the shape of the tree made me realise that the form of the queries is always in conjunctive normal form. I could take advantage of this, but it will lead to a very inflexible system in future... and I may need some flexibility when OWL gets fully implemented. However, it's always a fallback position.

My final option is to use blank nodes from a previous query as constraint elements for a new query. This should work, particularly as I'm on the server, and if I stick to a single transaction. The problem here is that I need to abandon iTQL and build the queries by hand. Fortunately, this isn't too hard, and need only be done for this part of the rule parser. I also intend to use iTQL to pre-query the simple constraints, and hold references to them for use as I get to the leaves of the constraint expression tree.
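
To make the last option concrete, here is a minimal sketch of walking a tree of arbitrary depth with one lookup per level, where each lookup is constrained on a node returned by the previous one. The class ConstraintTreeWalk and its methods are hypothetical, and the in-memory map is only a stand-in for a real query; none of this is Kowari's API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: one "query" per tree level, keyed on nodes from the previous level. */
class ConstraintTreeWalk {

  /** Maps a node (possibly a blank node id) to its child nodes. */
  private final Map<String, List<String>> children;

  ConstraintTreeWalk(Map<String, List<String>> children) {
    this.children = children;
  }

  /** Stands in for a query constrained on a node returned by an earlier query. */
  List<String> queryChildren(String node) {
    return children.getOrDefault(node, List.of());
  }

  /** Depth-first walk that collects the leaves, whatever the depth of each branch. */
  List<String> leaves(String node) {
    List<String> result = new ArrayList<>();
    List<String> kids = queryChildren(node);
    if (kids.isEmpty()) {
      result.add(node);
    } else {
      for (String kid : kids) {
        result.addAll(leaves(kid));
      }
    }
    return result;
  }
}
```

Because the recursion is in Java rather than in the query, branches of different depths need no special handling.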

ItqlInterpreter
I had an attempt at a spike with some client side code, to see if I could build queries which were constrained on a blank node. Unfortunately, the query to retrieve the blank node kept failing on me. After trying all sorts of variations I tried selecting everything from the model. This worked. So then I constrained on a single column, and sure enough, nothing worked.

After lots of permutations I've discovered that ItqlInterpreter can execute a query easily and correctly. However, the combination of ItqlInterpreter.buildQuery, ItqlInterpreter.getSession() and Session.query() always fails. I was unable to work out why (not in the time I had, anyway). It must work for me on the server side because my session is pre-defined for me.

So I can't test my idea of querying on a blank node. I'll just have to do lots of logging at the server.

Saturday, May 28, 2005

SMOC
Coding over the last couple of days has gone well. I've been using the standard XP cycle of incrementally adding a bit, and running it to see that it works as intended. At this stage "working" is really a matter of making sure that the logs contain the data they are supposed to contain.

It's all been going well, but I'm only part way through. As Andrae likes to say, it's just a Small Matter Of Coding (SMOC). Meaning that the design is done, but the rest of the work is going to take some time. Now that I have the classpath issues resolved, I'm getting through a few hundred lines of code per day, so I'm happy with the current pace. I just hope I don't run into any other major snags.

One issue that I've hit while building the Query objects has been the difference between a krule:Variable and a URIReference. This was bothering me when I wrote the RDFS, and the parsing of it made me consider it again.

Every time I run into a constraint element I can ask for its type. In the case of a variable, I can then use the name to create a new variable. Easy. But when I get a krule:URIReference I need to look for a krule:refersTo property to find the URI to construct the object. I hate if() constructs in code if I can avoid it. :-)

On the other hand, there is no real need for a variable to pick up a new attribute. If it did, then it should really be rdf:value.

This made me think about the krule:refersTo property that I've used for krule:URIReference, and so I looked up rdf:value again. I had thought that this was a datatype property (to use OWL parlance) instead of an object property, but the range is rdfs:Resource. Also compelling is the comment that the use of this property is encouraged to help define a common idiom. All of this was enough to make me change krule:refersTo to rdf:value on krule:URIReference.

You'd think that would be enough to convince me to add a similar value to the variable, but I haven't. :-) The reason for this is usability and readability of the RDF for rules. Ideally I'd not have an extra indirection on the URI for URIReference either, but that has semantic consequences. I'd end up saying that some arbitrary URI has an rdf:type of krule:URIReference, which is just plain wrong.

The problem with having this difference between values and URI references is that each of them uses a different set of conjunctions to get all the data for construction. This means that I need two separate queries. To minimise the queries I'm doing, I'm initialising the whole rule reading procedure with a method that reads in everything of type krule:URIReference with their values, and mapping one to the other. This map is available later to any method that sees a krule:URIReference, and needs to get the URI it refers to.
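
The pre-loaded map could look something like the sketch below. It assumes the initial query hands back rows pairing each krule:URIReference node with the URI it refers to; the Row and UriReferenceIndex classes are hypothetical names for illustration only.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the one-off lookup table built before the rules are read. */
class UriReferenceIndex {

  /** One result row: a krule:URIReference node and the URI string it refers to. */
  static final class Row {
    final String referenceNode;
    final String uri;

    Row(String referenceNode, String uri) {
      this.referenceNode = referenceNode;
      this.uri = uri;
    }
  }

  /** Maps every reference node to its URI so later methods never need a second query. */
  static Map<String, URI> build(List<Row> rows) {
    Map<String, URI> index = new HashMap<>();
    for (Row row : rows) {
      index.put(row.referenceNode, URI.create(row.uri));
    }
    return index;
  }
}
```

Any method that later sees a krule:URIReference just does a map lookup on the node.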

Entailment and Consistency
Inferencing falls into two areas: entailment and consistency. I'm going to need to handle both. I haven't yet thought a lot about consistency checking, and that has worked out well.

Early on, I had thought that there would be a lot of consistency tests to perform, but after my learning exercise with cardinality I realise that consistency is not as strict as I'd thought. In the general case, if there is any possible interpretation that can make a set of data and its ontology correct, then it is consistent. This can rely on the most unlikely of statements which are not in the datastore. It is only when there are some direct conflicts that inconsistency becomes evident. Conflicting cardinality values, sameAs/differentFrom pairs, and the like are the most obvious examples. Fortunately, many of the less obvious examples will show up in the simpler tests after entailment has been performed, so the number of tests to be performed is not as daunting as it may first appear.

In the meantime, I've been learning more about logic at Guido's Wednesday sessions, which has also given me a better understanding of what I need to do.

As a first iteration, I'll be performing consistency checks only after entailment is complete. This will be more efficient, as it will only need to be done once, and will wait until all possible conflicting statements have been generated.

The problem with this approach is that it may be difficult to tell where the conflicting data came from. If entailed data conflicts with other entailed data, then the original data may be difficult to find (this is especially the case when the entailed data was entailed from other entailed data). The only way I can see around this is to perform checks after each entailment operation. This could get very expensive. Maybe it could be set as a debugging flag by the user?

Another thing to try during debugging will be more complicated consistency checks. Further entailments may show an inconsistency with simple tests, but it would be ideal if inconsistencies could be found and acted upon immediately. However, I believe that this approach could be of arbitrary complexity, and only bounded by the amount of work that the developer wishes to put into it. As a result, I don't think I'll be pursuing this for the time being. It would only be of use for debugging anyway, so the case to implement more complex tests would have to be very strong (and involve a lot of $$$). :-)

Even with simple tests, I still need to define the consistency tests. The current rules are for entailment only, so I need to expand the vocabulary to handle them. The structure will be very similar to the entailment rules, so it should be easy to extend my current system to handle both. The main difference is when to run the rules, and to test for the existence of any result rows, rather than inserting results back into the model.
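
The difference between the two kinds of rule can be sketched as follows. This is only an illustration: statements are plain strings of the form "subject predicate object", the RuleKinds class and both example queries are hypothetical, and a real rule would run a query against the model rather than loop over a Java set.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical sketch: the same query shape, used two different ways. */
class RuleKinds {

  /** Entailment: insert every result row into the model; true if anything new was added. */
  static boolean applyEntailment(Set<String> model, Function<Set<String>, List<String>> query) {
    return model.addAll(query.apply(model));
  }

  /** Consistency: the model passes only if the query finds no rows at all. */
  static boolean isConsistent(Set<String> model, Function<Set<String>, List<String>> query) {
    return query.apply(model).isEmpty();
  }

  /** Example entailment query: "x sameAs y" implies "y sameAs x". */
  static List<String> sameAsSymmetry(Set<String> model) {
    List<String> rows = new ArrayList<>();
    for (String statement : model) {
      String[] t = statement.split(" ");
      if (t[1].equals("sameAs")) {
        rows.add(t[2] + " sameAs " + t[0]);
      }
    }
    return rows;
  }

  /** Example consistency query: pairs that are both sameAs and differentFrom. */
  static List<String> sameAsConflicts(Set<String> model) {
    List<String> rows = new ArrayList<>();
    for (String statement : model) {
      String[] t = statement.split(" ");
      if (t[1].equals("sameAs") && model.contains(t[0] + " differentFrom " + t[2])) {
        rows.add(statement);
      }
    }
    return rows;
  }
}
```

Running applyEntailment in a loop until it returns false, and only then calling isConsistent, matches the first iteration described above: a conflict that the simple test misses on the raw data can show up once the entailed statements are present.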

Thursday, May 26, 2005

Unmarshalling
After integrating the rules interfaces I thought I'd run up the Kowari server and see that everything was still OK before proceeding. It wasn't of course (otherwise I wouldn't be talking about it here).

My error message was really helpful: Couldn't answer query.

So I started by grepping for this error message, and found it in several places within ItqlInterpreterSession. This was frustrating, as this means that I was seeing the problem at the entry point to the query process, rather than wherever the error was really occurring.

Of more concern was that each time this message was printed, the cause of the exception was supposed to be printed as well, but I was seeing nothing. I decided to print the message from the exception as well (not just the cause of the exception), only I saw nothing here either. Finally, in frustration, I changed each occurrence of the "Couldn't answer query" message to something unique, so I could tell which one was being printed. At this point the cause of the exception suddenly started to be printed as well.

I must have been running into a dependency problem where it wasn't building the new code modifications. I don't know why it suddenly started to work. If the new strings I'd created were not printed then I'd have known that it was a build script dependency problem, and I'd have done a clean build (I was on the verge of this already).

Given this problem I realised that I needed to perform clean builds more often in this debugging process. Incremental Kowari builds are already very time consuming, so performing a clean build each time meant that the rest of the debugging operation was guaranteed to be very time consuming (my latest clean build took 3 minutes, 17 seconds).

Anyway, I now knew that my problem was an "Unmarshalling" exception. So the problem was in RMI.

My first suspicion was that I'd updated a versioned interface at one end and not the other. The only changed interface was Session so I ran serialver to get the new number (the number generated by this program is essentially a checksum of relevant parts of the interface signature). However, the serial ID was exactly the same as before.

To confirm the problem I was seeing I commented out the new methods (buildRules() and runRules()), and tried to run another query. This took a while, as I had to make sure I got every implementation of the Session interface. Sure enough, it worked. So my problem was definitely in these methods, but where?

Wondering about the other rules classes I worked out which ones are to be transferred across RMI, and made sure that each was serializable (the exceptions were already serializable, but Rules was not). I couldn't see that this would make a difference, as I hadn't yet called a method that would move these objects, but I was trying everything I could think of. While I was at it, I made sure that each of these classes had serial IDs.
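
For reference, this is roughly what making a class safe to send over RMI involves. The RuleSet class below is a hypothetical stand-in (not the real Rules class), and the round trip through a byte array only approximates what RMI marshalling does to arguments and return values.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamClass;
import java.io.Serializable;

/** Hypothetical stand-in for a rules class that travels across RMI. */
class RuleSet implements Serializable {
  /** Explicit ID, so both ends agree even if the class is recompiled. */
  private static final long serialVersionUID = 1L;

  final String name;

  RuleSet(String name) {
    this.name = name;
  }
}

class SerialCheck {

  /** Reads the ID that the serialization machinery will compare at each end. */
  static long serialId(Class<?> type) {
    return ObjectStreamClass.lookup(type).getSerialVersionUID();
  }

  /** Serializes and deserializes a RuleSet in memory. */
  static RuleSet roundTrip(RuleSet rules) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
        out.writeObject(rules);
      }
      try (ObjectInputStream in =
          new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
        return (RuleSet) in.readObject();
      }
    } catch (IOException | ClassNotFoundException e) {
      throw new IllegalStateException("Round trip failed", e);
    }
  }
}
```

A local round trip like this is a quick way to rule out serialization itself before blaming the RMI layer.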

Once I had these changes throughout all the Session implementations I re-ran the code, and had no change in the output. I guess this was to be expected, but I had still been hopeful. :-)

While going through the various classes, it occurred to me that I had not yet made the necessary changes to RemoteSession. I wasn't yet passing the new function calls across RMI, so I thought that these would be safe to leave unimplemented for the moment. To intercept any inadvertent calls when the RMI interface was not complete, the RemoteSessionWrapperSession (blame Simon for the name, though it does make sense) was just throwing an UnsupportedOperationException for the new rules methods. With few other options to try, I decided to add the rules methods to the RemoteSession interface as well (after all, they were about to be needed anyway). Just a few extra lines, and they were implemented on SessionWrapperRemoteSession as well.

Unexpectedly, this did the trick! I still had an exception, but suddenly I had messages going over RMI, and the exceptions were being printed. With this information I discovered that the rule classes were not present in the classpath at the server end. Obviously they'd made it into the classpath during compilation, but when the distribution jar was built they were not being included.

Once rules-base-1.1.0.jar and krule-base-1.1.0.jar were included it all ran fine. I'm now back into the integration of rules with a database session.

Monday, May 23, 2005

Whack-a-mole
Today was an exercise in frustration, but it was ultimately successful.

Every time I removed one dependency problem, another would raise its head. Some of the modules in Kowari have dependencies I'd never considered, and it was amazing just how many of these I had to deal with. Finally, I seemed to have them all under control, only to discover a new problem.

I've been reading data out of the kowari-config.xml file for some time, but I'd never tried to add to it. To start with, I naïvely thought that I could just add a new tag, and the code would automatically see it. This is not the case. I believe that Castor (the site is down ATM, but Google cached it just yesterday, so it can't be far away) is capable of just reading from an XML file and building an interface to what it finds, but that is not the configuration being used in Kowari. Instead, it uses an XSD file to describe the structure, and it builds the interface from that. At one point it was a DTD file and that is still to be found, but that is no longer used, and has fallen well behind.

Thanks to Simon for helping me find this schema.

I still had problems though, as my changes did not seem to be working. I added the new RuleLoader element, but nothing came up in the automatically generated Castor file. After a while I discovered that the field had resulted in its own class files (RuleLoader.java and RuleLoaderDescriptor.java), but the field was still not showing up in the TucanaConfig class. Finally, I worked out what was missing in the XSD file. The TucanaConfig element wraps a list of references to other elements. All I had to do was enter a reference to the RuleLoader element, and it finally appeared in the TucanaConfig class.

Object Passing
One other thing that bothered me was when I tried to pass a session into the run() method of the rules engine. The rules interface can't know about sessions (because of dependency issues), so I thought I'd take the easy way out and pass it as an Object. However, for some reason the compiler refused to let me do this.

It's late, so maybe I missed something, but I got around this by providing a simple wrapper class for passing parameters to a rule framework for running. After all, different framework types might have different requirements, so a generic way of passing parameters is probably a good thing. :-)
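
A wrapper of this kind might look like the sketch below. The RuleParams name and its methods are assumptions made for illustration; the real wrapper may be shaped quite differently.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical parameter holder, so the rules interface need not know about sessions. */
class RuleParams {
  private final Map<String, Object> values = new HashMap<>();

  /** Adds a named parameter and returns this object so calls can be chained. */
  RuleParams with(String name, Object value) {
    values.put(name, value);
    return this;
  }

  /** Fetches a parameter, checking its type at the point of use (null if absent). */
  <T> T get(String name, Class<T> type) {
    return type.cast(values.get(name));
  }
}
```

The framework that knows about sessions would then call something like params.get("session", Session.class), while the interface module only ever sees RuleParams.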

Sunday, May 22, 2005

Dependency Woes
Having had some success with the rules code sitting outside of Kowari (using RMI), I spent the weekend trying to bring it inside the project. Without that it was never going to scale.

Unfortunately, the rules code has led to a dependency nightmare. I need ItqlInterpreter to see an "apply" command, and so build a new rule framework based on the specified model. So the rules code needs access to a Session object in order to read the model. I then need to tell the database session to apply the rules (which in turn will be reading and writing on the session). So I suddenly have a dependency loop between rules and sessions. It is tempting to just bundle them together, but that doesn't quite do it either. As I mentioned the other day, I'm now using an ItqlInterpreter to parse my queries for me, so this leads to a loop between the iTQL module and the Kowari rules module.

The first step to address this has been to define a set of rule interfaces, and pull them into their own module. Then I defined a configurable rule engine item in the Kowari configuration file, to be loaded up via reflection, just like the resolvers and databases are. This solves some of the problems, and it also provides a perfect slot for a RuleML implementation at a later date.
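
Loading an implementation by name from configuration can be sketched like this. The RuleEngine interface, DummyEngine class and RuleEngineLoader are all hypothetical; the class name would come from the configuration file, and only the interface module would need to be visible to the code doing the loading.

```java
/** Hypothetical interface that lives in its own module, free of session dependencies. */
interface RuleEngine {
  String name();
}

/** Hypothetical implementation, named only in configuration. */
class DummyEngine implements RuleEngine {
  public String name() {
    return "dummy";
  }
}

class RuleEngineLoader {

  /** Instantiates the configured class via its no-argument constructor. */
  static RuleEngine load(String className) {
    try {
      Class<? extends RuleEngine> type =
          Class.forName(className).asSubclass(RuleEngine.class);
      return type.getDeclaredConstructor().newInstance();
    } catch (ReflectiveOperationException e) {
      throw new IllegalArgumentException("Cannot load rule engine: " + className, e);
    }
  }
}
```

Since the caller depends only on the interface, the loop between modules is broken at compile time and resolved at run time instead.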

I spent the entire weekend on this, and I'm still at it. Hopefully I'll be done in a few hours, as I really don't want to be bogged down in this any longer.

Friday, May 20, 2005

Documentation
With so many changes coming in recently, I really need to document how they all work. I'm sure that Andrae will be encountering the same problems. However, it's not a trivial thing to do.

When Kowari was still managed by Tucana, Grant was formally writing everything up for us. We'd do the copy, while he would format it, structure the content where appropriate, correct our language, etc, etc. Every so often his files would get run through some proprietary program, producing the HTML files seen on the Kowari web site today.

None of this process is available now.

That leaves me with two choices. First, I can just edit the web pages in place. This is always a bad idea with automatically generated files, but if we're not generating them automatically anymore, then who cares? However, I'd like to think that we WILL be able to get this process going again eventually, and any manual changes to these files will be painful to deal with.

The second option is to find the program that Grant used, and generate everything again. The problems here are:

Cost of the program (I have no idea how much it's worth).
It's not open source.
A bottleneck around whoever gets the job of building the documentation.
Steep learning curve for whoever takes on the job.
Will need a lot of work to make everything look consistent, getting publications out when needed, checking into Sourceforge, etc.

Of course, the advantages are that this is how the manuals currently work, and they are really great.

Either way, something has to be done. Does anyone have any ideas?

ItqlInterpreter
Who had the great idea of integrating the language parser with the execution framework? Let me explain.

When a client wants to execute a query, an ItqlInterpreter object is used. This object parses the iTQL code, and sends the results to a database session for execution. The problem is that it integrates these two steps.

I've written a lot of classes for representing rules when they are brought in from the data store, but the queries are causing me problems. My code is executing inside the database, so it will never need to go through Java RMI (establishing RMI connections is a big bottleneck for Kowari). This means that I have a local DatabaseSession that I want to use. Unfortunately, ItqlInterpreter uses an internal SessionFactory to get the session that it is going to use, meaning that it will always go across RMI.

ItqlInterpreter creates Query objects, and I'd like to just use them on the Session that I have. This is why the coupling of interpreting and execution is such a problem for me.

The easiest way around this would be to allow a SessionFactory to be supplied to ItqlInterpreter, and to hand out the given DatabaseSession from it. However, I've been loath to make this change without input from other people. I haven't seen much of them lately, so I have steered clear of this approach.
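
In outline, the change amounts to injecting the source of sessions rather than creating it internally. The sketch below uses made-up names (QuerySession, SketchInterpreter) and a java.util.function.Supplier in place of a real SessionFactory, so it shows the shape of the idea only, not the actual ItqlInterpreter code.

```java
import java.util.function.Supplier;

/** Hypothetical minimal session: takes a parsed query, returns an answer. */
interface QuerySession {
  String run(String query);
}

/** Hypothetical interpreter with the session source supplied by the caller. */
class SketchInterpreter {
  private final Supplier<QuerySession> sessions;

  SketchInterpreter(Supplier<QuerySession> sessions) {
    this.sessions = sessions;
  }

  /** Parsing step only: no session is touched here. */
  String parse(String text) {
    return text.trim();
  }

  /** Execution step: uses whichever session the caller supplied. */
  String execute(String text) {
    return sessions.get().run(parse(text));
  }
}
```

Code running inside the server could pass a supplier that always returns its local session, while a remote client could pass one that goes over RMI.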

Instead, I've been building up Query objects manually, in the same way that ItqlInterpreter does it. There are numerous problems with this approach:

Takes a lot of work to write simple queries.
The code is hard to write and read, and is therefore not extensible nor flexible.
Difficult code like this is bug prone (OK so far, but I'm debugging a lot).
It takes longer to write.

The advantages are that it means I don't have to change ItqlInterpreter, and it's very fast.

However, given my pace of progress, I'm thinking that I should just rip into ItqlInterpreter and give it a new session factory, even without consulting the other developers.

Top Down
I've started to understand the difficulties in representing hierarchical systems in databases. When building the object structure in memory, it is necessary to build objects from the top, and fill in their properties going down the hierarchy.

It is possible to come bottom up, but this can get arbitrarily complex, particularly when children at the bottom can be of arbitrary types. For instance, if A is a parent of B and C, then to build the object structure from the bottom-up, B and C must be found (actually, all leaf nodes must be found), and then you have to recognise that they share the same parent, and provide both of them when constructing A. This sounds easy enough, but gets awkward in practice, particularly when different branches have different depths.

So it is much easier to build this stuff top down. The problem there is that most of the classes being built expect to have all of their properties pre-built for them before they can be constructed. So that doesn't quite work either.

I'm dealing with it by building a network of objects which represents the structure, and which can be easily converted into the final class instances. It's relatively easy to build this way, and easy to follow, but again, it's more code than I wanted to write. Oh well, I guess that's what this job is all about. :-)
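
A small sketch of the approach, with hypothetical classes: ExprNode is a mutable placeholder that can be created top down, and Expr is the kind of final class that insists on having its children at construction time.

```java
import java.util.ArrayList;
import java.util.List;

/** Final form: needs all of its children before it can be constructed. */
final class Expr {
  final String label;
  final List<Expr> children;

  Expr(String label, List<Expr> children) {
    this.label = label;
    this.children = List.copyOf(children);
  }
}

/** Mutable placeholder that can be created first and filled in while descending. */
class ExprNode {
  final String label;
  final List<ExprNode> children = new ArrayList<>();

  ExprNode(String label) {
    this.label = label;
  }

  /** Attaches a child and returns this node so calls can be chained. */
  ExprNode add(ExprNode child) {
    children.add(child);
    return this;
  }

  /** Converts the placeholder network into final instances, children first. */
  Expr build() {
    List<Expr> built = new ArrayList<>();
    for (ExprNode child : children) {
      built.add(child.build());
    }
    return new Expr(label, built);
  }
}
```

The placeholders are discovered top down, but build() constructs the final objects bottom up, so each constructor gets everything it expects.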

Saturday, May 14, 2005

Work, work, work
This has been a week of resolvers, testing, rules engines, and Carbon on Java.

Where to start? This is why I should be writing this stuff as it happens, rather than as a retrospective, but better late than never.

Prefix Resolver
I backed out the changes that I made to the string pool for prefix searching, but I've kept a copy. It's all been replaced with simpler code that uses Character.MAX_VALUE appended to the prefix string. This seems to work well with the findStringPoolRange() method from the string pool implementations.
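
The trick is easy to show with a sorted set standing in for the string pool. This is only a sketch: PrefixRange is a hypothetical class, java.util.TreeSet is not the Kowari string pool, and it assumes no stored string continues with the character U+FFFF straight after the prefix.

```java
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical sketch of a prefix search expressed as a range search. */
class PrefixRange {

  /** Returns the entries from prefix (inclusive) up to prefix + Character.MAX_VALUE (exclusive). */
  static SortedSet<String> withPrefix(TreeSet<String> pool, String prefix) {
    return pool.subSet(prefix, true, prefix + Character.MAX_VALUE, false);
  }
}
```

Because strings compare by UTF-16 code unit, everything that starts with the prefix sorts between the two bounds, so no special prefix-matching code is needed in the index itself.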

I got this working with the prefixes defined in string literals, which makes conceptual sense, but it has some practical problems. In order to find all elements of a sequence, it is necessary to match a prefix of http://www.w3.org/1999/02/22-rdf-syntax-ns#_. When represented as a literal, this has to appear in its full form. However, if I use a URI instead, then I can take advantage of aliases, something that can't (and shouldn't) be done with string literals. This means that the prefix can be shown instead as <rdf:_>.

While the syntax for this is a lot more practical, it does cheat the semantics a little. After all, there is no URI of <rdf:_>, while there is a URI of <rdf:_1>. But the abbreviation is so much nicer that I'm sticking to it. I'm now supporting both URI and Literal prefixes, so anyone with a problem can continue to use the full expanded form as a string literal.

Backing the old code out, and putting in the dual URI/Literal option took longer than expected (as these things always do), so a lot of the week went in this, and the testing. However, it was worth it, as I can now select on a sequence. To do this I need a prefix model for the resolver (I'll also include the appropriate aliases here for convenience):

  alias <http://tucana.org/tucana#> as tucana;
  alias <http://kowari.org/owl/krule/#> as krule;
  alias <http://www.w3.org/1999/02/22-rdf-syntax-ns#> as rdf;
  alias <http://www.w3.org/2000/01/rdf-schema#> as rdfs;
  alias <http://www.w3.org/2002/07/owl#> as owl;
  create <rmi://localhost/server1#prefix> <tucana:PrefixModel>;

The Kowari Rules schema uses a sequence of variables for a query, so I can use the prefix resolver along with the new <tucana:prefix> predicate to find these for me:

select $query $p $o from <rmi://localhost/server1#rules>
where $query <krule:selectionVariables> $seq
  and $seq <rdf:type> <rdf:Seq>
  and $seq $p $o
  and $p <tucana:prefix> <rdf:_> in <rmi://localhost/server1#prefix>;

This has proven to be really handy for reading the RDF for the rules (which is why I wrote this resolver first). However, the real importance of this is that I now have the tools to do full RDFS entailment. The final RDFS rule can now be encoded as:

insert 
  select $id <rdf:type> <rdfs:ContainerMembershipProperty>
  from <rmi://localhost/server1#model>
  where $x $id $y
    and $id <tucana:prefix> <rdf:_> in <rmi://localhost/server1#prefix>
into <rmi://localhost/server1#model>;
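
For comparison, the same rule over an in-memory collection of triples might be sketched in Java as follows. The ContainerMembershipRule class is hypothetical, and triples are represented as three-element lists of strings purely for illustration.

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical in-memory version of the container membership rule. */
class ContainerMembershipRule {
  static final String RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
  static final String RDFS = "http://www.w3.org/2000/01/rdf-schema#";

  /** Each triple is {subject, predicate, object}. Returns the statements to insert. */
  static Set<List<String>> entail(Collection<List<String>> triples) {
    Set<List<String>> inferred = new LinkedHashSet<>();
    for (List<String> triple : triples) {
      String predicate = triple.get(1);
      // Same test as the prefix constraint: the predicate starts with rdf:_
      if (predicate.startsWith(RDF + "_")) {
        inferred.add(List.of(predicate, RDF + "type", RDFS + "ContainerMembershipProperty"));
      }
    }
    return inferred;
  }
}
```

The set removes duplicates, so a predicate used in many statements still produces a single type statement.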

So all I need now is the rule engine to do it. I spent the latter part of this week on just that. It looks like it's the parsing of the rules that takes most of the work, as the engine itself looks relatively straightforward. I'll see how this comes together in the next week.

Meanwhile, I wrote a series of tests for the prefix matching (which I've mentioned several times in the last few weeks), and after many iterations of correcting both tests and code, I have everything checked in. Whew.

Namespaces
You'll note that the "magic" predicates and model types are all still using the "tucana" namespace. We really need to change that to "kowari". I'm a little hesitant to simply make wholesale changes here, as it will break other code. On the other hand, it is probably a good idea to do it sooner rather than later.

Does anyone have any objections to this change?

Warnings
Checking anything into Sourceforge still needs the full test suite to be run first (this is even more important now that we rarely see each other in person). During the tests I noted that there are some warnings about calls to walk and trans. In each case they are complaining about variables rather than fixed values for a resource.

In the case of walk, this will need to be addressed by allowing a variable, so long as it has been bound to a single value. This will be necessary in order to query for the node where the walking will start.

For trans, a bound variable of any type will need to be supported. This will make it possible to select all transitive statements in OWL.

However, I'm aiming to have the basic rule engine with RDFS completed within the month. While these changes are vital, can I justify doing them now? They are not needed for RDFS, so perhaps I should postpone it. I'm just reluctant to put it off, as the change is important and should only take a few days to do.

Maybe I should put my other projects on the back burner and start using my "after hours" time to get this done instead? That'd be a shame, as I need some sanity in my life.

jCarbonMetadata
As my code for wrapping the Carbon classes became more complete and refined, I decided to put it up on SourceForge. Since it was just supposed to wrap the Carbon metadata classes for Java, I called it jCarbonMetadata. However, I'm starting to wonder if the scope is shifting.

The latest part of this project has been to implement the query class. To properly write the javadoc for this code, I really need to duplicate Apple's documentation, but I don't think I'd be allowed to do that. However, I'm not sure about this. I'm providing access to Apple's classes, so it makes sense to use Apple's docs. Writing my own descriptions sounds like a recipe for disaster, particularly as I'm often doing this stuff after midnight. Linking isn't very practical either.

I've been making progress anyway, writing wrapper functions around many of the MDQuery functions. Just when I thought I was near completion I discovered a method that threw my whole plan into chaos.

The initial idea was to write a wrapper around the three relevant MD classes: MDItem, MDQuery, MDSchema. The methods for these classes all return strings and collection objects from the CoreFoundation framework. Rather than provide a wrapper implementation for each of these objects, I've been converting everything to its Java equivalent and returning that instead. Not only does this mean less work (as I have fewer classes to implement), but it makes more sense for a Java programmer anyway, as the Java classes are the ones they need to be using.

This was all going well until I got to the function called MDQueryCopyValuesOfAttributes. This function returns a CFArray of values. I was expecting to convert this into a java.lang.Object[] until I read the following:
"The array contents may change over time if the query is configured for live-updates."
So if I just copy the results into an array I'll lose this live-update functionality!

There are two approaches I can take here. The first is to just have a static interface, and wear the loss in functionality. The CFArray has to be polled for a change in size anyway, so why not poll the function call instead?

The second approach is to wrap a CFArray in another Java class. While this would take more work, it is a more complete solution.

For the moment I'm hedging my bets, and have written two methods. The first method will return an array, while the second will return a dynamically updated MDQuery object. I have a stub class for this at the moment, though I don't expect to flesh it out until everything else is done. I thought that the method that returns the Object[] might just use the CFArray method, and have this class provide a toArray() method, but in the end I decided that I can cross the JNI boundary less often if I provide separate methods in C.
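
The difference between the two methods can be shown in pure Java. In this sketch the LiveResults class is hypothetical and an ordinary java.util.List stands in for the wrapped CFArray; there is no JNI here, and thread safety is ignored.

```java
import java.util.AbstractList;
import java.util.List;

/** Hypothetical read-only live view over a backing list that may change underneath it. */
class LiveResults extends AbstractList<Object> {
  private final List<Object> backing;

  LiveResults(List<Object> backing) {
    this.backing = backing;
  }

  /** Reads through to the backing list on every call. */
  @Override
  public Object get(int index) {
    return backing.get(index);
  }

  /** Reports the current size, so callers can poll it for changes. */
  @Override
  public int size() {
    return backing.size();
  }

  /** Static copy: later changes to the backing list are not seen. */
  Object[] snapshot() {
    return backing.toArray();
  }
}
```

The snapshot is cheaper to use afterwards, since it never touches the backing store again, while the live view pays for a call on every access but stays current.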

If I'm going to look at CFArray I figured that I might as well find out what else it offers. Most of the functions are as you'd expect, but there is also a function called CFArrayApplyFunction for applying a function to each element of the array (this brings flashbacks to the STL for me). That seems relatively useful, so I've started thinking about the rest of the Core Foundation (CF).

Apple have explicitly said that Carbon is not going away, and indeed that there are some things that can only be done using Carbon. With this in mind, maybe there is room to implement the whole of CF in Java? I don't think any of it would be too hard, though it would take a lot of time.

This made me wonder if I should really be converting all of the CF objects into their Java equivalents. Perhaps I should implement each of these objects in Java, and return those instead. Then they can be converted into the Java classes by appropriate pure Java functions.

Considering this a little more, I decided that even if I implement all of these classes, then I'd still do conversions in C, through a JNI interface. There are two reasons for this. The first is that some objects must be converted in C, such as numbers, and for this reason most of the conversion functions have already been written. The second reason is to minimise crossing the JNI boundary. Converting objects like CFArray and CFDictionary in Java would mean repeatedly calling the same JNI access functions.

As I said earlier, I'll just stick to the three main classes, and consider the remaining CF classes when I get there. I still have other projects on the go, so I'll just have to consider the importance of these things once I get there. For the moment I just want this code for a Kowari resolver.

I'm sure there's other stuff to talk about, but I'd better get back to work.

Sunday, May 08, 2005

Redundant Code
Last night at the reception I was talking to DavidM about my JNI implementation for OSX metadata. David thought that this was unnecessary, as Apple have a full set of Cocoa bindings for Java. If this were true, then the code I wrote would be redundant. While it was great practice, I'd rather that I'd provided something really useful.

BTW, the wedding went really well. No honeymoon yet, but maybe we'll try to organise something later in the year.

Tonight I started looking for documentation on Java-Cocoa, to see if David is right. It turns out that he is (mostly). Damn. :-)

Looking for documentation on Apple's site tells me that Java-Cocoa does exist, but it seemed hard to find. Looking in the Xcode documentation, I found it under:
ADC Home > Reference Library > Documentation > Cocoa > Java
Google searching also finds references to it (often in mailing lists), but typically in disparaging tones. The complaints have been that the code is buggy, and that documentation is poor.

So I went looking for it on the local system. The first place I looked was in /System/Library/Java. The Extensions subdirectory looked promising, but wasn't. However, I soon found all of the "NS" Cocoa classes in com/apple/cocoa/foundation. This included NSMetadataItem, which I thought was the exact equivalent of what I'd just written.

Oh well. I'm still glad for the coding practice. :-}

However, looking at the situation more carefully, it's not so clear cut. It seems that the NSMetadataItem has no constructor that accepts a file. Instead, these objects are instantiated as the result of a metadata query. So from what I can see, the MDItem class from Carbon allows the metadata from a given file to be retrieved, but there is no equivalent functionality in Cocoa. It may exist, but I haven't found it yet.

Once I knew what I was looking for, I started searching for the Java API for this class, but with no luck. Most other classes are available, but none of the metadata classes. Finally, I tried using the javap disassembler on the NSMetadataItem, getting this:

  Compiled from "NSMetadataItem.java"
  public class com.apple.cocoa.foundation.NSMetadataItem extends com.apple.cocoa.foundation.NSObject{
      public native java.lang.Object valueForAttribute(java.lang.String);
      public native com.apple.cocoa.foundation.NSDictionary valuesForAttributes(com.apple.cocoa.foundation.NSArray);
      public native com.apple.cocoa.foundation.NSArray attributes();
      protected com.apple.cocoa.foundation.NSMetadataItem(boolean, int);
      public com.apple.cocoa.foundation.NSMetadataItem();
      static {};
  }

This confirms that there is no constructor which accepts a file path.

I also noted that these classes return NSDictionary and NSArray objects. This makes the wrapping a little thinner than it need be, as Java code would usually use java.util.Map and Object[] for the same purposes. At least it uses Java strings, rather than NSStringReference.

So I'm thinking I'll keep up with this code for two reasons. The first is the more convenient interfaces from java.util. The second is that having a Carbon wrapper seems to offer a couple of functions that are unavailable under Cocoa (like getting metadata from a specific file).

This extra functionality in Carbon seems to be confirmed by looking at the symbols in the "mdls" command (using "nm"). The symbols it uses all indicate that the code is written in, and linked to, Objective C. However, it still uses MDItem from Core Foundation. There shouldn't be a need for this, but with no support for file-specific metadata in Cocoa it becomes essential.

I'll keep working on this code for the time being. Given the lack of documentation, I suppose that Apple are still in the process of working on this, and will probably release something more complete as 10.4 progresses. In the meantime, I find this sort of thing fun, and everyone needs a hobby. :-)

Friday, May 06, 2005

MDItem
Well it's the night before the wedding, and I should be trying to make sure everything is ready. Instead I've finished the first JNI wrapper class for the OS X.4 metadata classes. It hasn't been easy finding the time, as I was trying to work on real Kowari stuff, as well as trying to help Anne with all those last minute details.

The class I've finished is MDItem. This is a direct rip-off of the MDItem class found in Core Foundation. There is a one-to-one correspondence between the classes' methods. There are a couple of minor TODO items, and I should add in a list of the common attributes, but it's basically done.

A lot of work went into the code that converts Core Foundation objects into Java objects, and back again. This code will get re-used heavily, so any further work will be a lot faster.

Speaking of further work, the main class I need to work on now is MDQuery. I expect to start on that next week. After I have that one I should be able to start on the resolver! There's also MDSchema, but I don't think it's as important to get things running. I also can't see the need to try and write a file importer in Java, as the file system is not going to want to start up a JVM. :-)

Anyway, if anyone has OS X.4 and is interested in trying it out, then you can find the Xcode project here. The class comes with a main() to test it out, so you can run the resulting jar like this:

  $ cd build
  $ export DYLD_LIBRARY_PATH=${DYLD_LIBRARY_PATH}:.
  $ java -jar metadata.jar libmetadata.jnilib

The output is in two parts. The first part is all of the attributes, and gives the same results as the mdls command. The second part is a repeat of the first part, but only tries to pick up every second attribute instead (this is to test a different method). You'll note that I ran it on the library itself, but you can run it on any file in the system.
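
As a rough sketch of how the second pass chooses its attributes (this is not the actual test code; the class name AttributeSampler is made up, and I'm assuming the pick starts from the first name in the list):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the published project.
class AttributeSampler {
    // Returns the names at indexes 0, 2, 4, ... (starting at 0 is an assumption).
    static List<String> everySecond(List<String> names) {
        List<String> picked = new ArrayList<>();
        for (int i = 0; i < names.size(); i += 2) {
            picked.add(names.get(i));
        }
        return picked;
    }
}
```

The shorter list would then be handed to the method that fetches several attributes in one call, rather than asking for each attribute individually.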

I hope someone likes it. :-)

Thursday, May 05, 2005

Time
It's still hard to find time to write at the moment. Anne and I are getting married tomorrow, so hopefully I'll have a little more time next week.

In the meantime, we've had several public holidays recently (we get them all at one time of the year). Unfortunately I've only taken advantage of the most recent of them, which was last Monday.

Anyway, with the wedding tomorrow and a lot of work to get done, I won't be writing much today either. I just thought I'd better write something before I get too far out of the habit.

Uni
I did my confirmation last Friday. The internal web page with the administrative details of confirmation has been taken down, so it wasn't until the last moment that I discovered when the paper was due. It was sooner than I had thought, so I didn't have a lot of time for feedback on it. As a result, no one got back to me in time. It was frustrating, but at least the paper was accepted.

Again, being short of time, I didn't get much chance to prepare for the presentation either. I'm normally good with presentations, so I was feeling a bit overconfident about it. But once I was in there I realised how out of practice and underprepared I was. Fortunately, confirmation is evaluated in relation to the quality expected of most students, so from that perspective I did well. All that matters is whether I was accepted or not, and I did that easily.

However, I was very disappointed with myself, as I can do a lot better. I've asked Bob for the opportunity to give several more presentations, to get some more practice.

Resolvers
I backed out some of the comparator changes, as I mentioned last time. However, I've kept them archived, as I still feel that they are more "correct". If I ever run into problems with the current method I'll have some code to fall back onto. This code is quite long, so I wouldn't want to write it again.

Debugging has gone slowly, with everything else I've been doing, but it's all looking good. The main problem I seem to have now is a need for a resolver that will also search along collection lists. I have to think about the consequences of this though, as this would remove duplicate entries. None of the use cases I have contain duplicates, but it would still be a problem in some circumstances. Maybe I should just document this as the required behaviour (which it is) and move on.
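
To illustrate the duplicate problem, here is a minimal sketch. It assumes the collection is the usual rdf:first/rdf:rest linked list, and that the resolver's answer behaves like a set of bindings. The class name, the map-based representation and the node labels are all made up for the example; none of this is Kowari's API.

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only: walks an rdf:first/rdf:rest chain held in two maps.
class CollectionWalk {
    static final String NIL = "rdf:nil";

    // first maps a list node to its member; rest maps a list node to the next node.
    static Set<String> members(String head, Map<String, String> first, Map<String, String> rest) {
        Set<String> found = new LinkedHashSet<>();   // set semantics: repeats collapse
        Set<String> visited = new HashSet<>();       // guards against a malformed, cyclic list
        String node = head;
        while (node != null && !NIL.equals(node) && visited.add(node)) {
            String member = first.get(node);
            if (member != null) {
                found.add(member);
            }
            node = rest.get(node);
        }
        return found;
    }
}
```

A list written as (a b a) comes back as just a and b, which is the loss of duplicates described above.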

In my spare time (like I have any of that) I've also started work on another resolver. This time it's not related to my project, but I think it's going to be important...

OSX
Mac OS X.4 was supposed to arrive last Friday, but it didn't. If I'd realised it might be late (Apple guaranteed delivery on the release day) I wouldn't have pre-paid for it, and would have gone off to an Apple Centre to buy it instead. This is a toy I've been looking forward to. :-)

At least this gave me a chance to play with, and set up, my new Neuston MC-500. Linux does not synchronize its video output too well, so fast-action video scenes can tear a little. Also, MythTV hates sending output to the TV through my old MGA G400 card (I don't play games, so I haven't had a need to upgrade it). So now I have MythTV doing the recording, and the Neuston doing the playback. The user interface with the remote is easy for Anne to use, so she's been able to show Luc a lot of recorded "Wiggles". :-)

But I digress...

The real reason I've been waiting for OSX has been to use Spotlight. This is a hash-based index for meta-data of all the files on disk. For some time I've been thinking of implementing something like this for Linux or OSX, but now that Apple has done it for me I don't need to. :-) Longhorn will be doing something similar as well, so it should not be long before this becomes a common feature for modern file systems.

I was thinking about using Kowari (or a similar system written in C) to implement the filesystem index. Now that I don't have to, I've been thinking of inverting the whole thing, and using the filesystem index as a backend for Kowari. This just requires a resolver that wraps Apple's Spotlight interfaces. That's the resolver I've been writing.

For the moment, I've been using JNI to implement a set of Java classes which provide the meta-data interfaces from Apple's "Core Foundation". This has been a good way to learn Core Foundation. It's also good practice in JNI, as I'm trying to do everything "right". This means a lot of error checking, etc.

Eventually I expect to provide a set of metadata classes for Longhorn as well. However, I should see what functionality is provided there before I try to work out a common interface that will work across filesystems. In the long run, I'd like to have a library that detects the OS, and automatically loads up the correct JNI library. I could even have a fallback class written in Java that returns basic meta data about files (filename, timestamp, owner, etc).
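
A minimal sketch of how that detection and fallback might look. Everything here is hypothetical: the interface and class names are invented, the library name "metadata" is assumed from libmetadata.jnilib, and the fallback only reports name, size and modification time (owner is left out to keep it short).

```java
import java.io.File;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical common interface; not part of any published code.
interface FileMetadataSource {
    Map<String, Object> attributes(File file);
}

// Pure-Java fallback returning only what java.io.File can report.
class BasicFileMetadata implements FileMetadataSource {
    public Map<String, Object> attributes(File file) {
        Map<String, Object> attrs = new LinkedHashMap<>();
        attrs.put("name", file.getName());
        attrs.put("size", file.length());                 // 0 if the file does not exist
        attrs.put("lastModified", file.lastModified());   // 0 if the file does not exist
        return attrs;
    }
}

class MetadataSources {
    // Maps an os.name value to a JNI library name, or null when none is known.
    static String nativeLibraryFor(String osName) {
        String os = osName.toLowerCase(Locale.ROOT);
        if (os.startsWith("mac")) {
            return "metadata";   // assumed name, i.e. libmetadata.jnilib
        }
        return null;
    }

    static FileMetadataSource create() {
        String lib = nativeLibraryFor(System.getProperty("os.name", ""));
        if (lib != null) {
            try {
                System.loadLibrary(lib);
                // A JNI-backed FileMetadataSource would be returned here;
                // this sketch has none, so it falls through to the fallback.
            } catch (UnsatisfiedLinkError e) {
                // Library not installed: use the fallback below.
            }
        }
        return new BasicFileMetadata();
    }
}
```

Calling code would only ever see FileMetadataSource, so a Spotlight or Longhorn implementation could be slotted in later without changing it.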

But at this stage I just have classes which work on Spotlight. Once I've wrapped them in a resolver I'll be adding them to Kowari.

If anyone is interested in this code, then just let me know. It's all being open-sourced, but I just haven't got around to publishing it yet. I've been doing this Spotlight work in my own time (since I'm working on the rules engine in the day), and between the confirmation, the wedding, and last weekend's triathlon, I haven't had a lot of "spare time" at all. :-)
