Friday, November 24, 2006

Patents

Today Slashdot has a story on LSI patenting multiply linked lists. The patent was applied for in 2002 and granted in April this year.

I'm glad to see lots of comments on Slashdot referring to prior art. A decent lawyer could probably argue around plain doubly linked lists, so I was pleased to see more complex examples from the '80s and '90s.

I was particularly bothered by this patent because I would have inadvertently violated it several years ago. My blog describes an RDF DB design which takes advantage of multiple links for each node in a list, so that the lists can be traversed in completely different directions. Unfortunately the posts were made in May 2004, well after the submission date of the patent (2002), so my work doesn't qualify as prior art.

You'd think that my own work (among others') would show that the idea fails the "non-obvious" requirement. Apparently not. There must be a whole swathe of the software development community who may be classified as "brilliant geniuses" by the USPTO, given that we all came up with the same non-obvious idea.

When learning programming, we are all inculcated with standard techniques. We are taught not to reinvent the wheel. Design patterns, algorithms, you name it... the state of the art in programming is to take someone else's system and build on it. Make something new by connecting old things in new ways, or by adding to existing systems. We are NOT taught which techniques to avoid due to patent problems. We are not taught how to search for possible conflicts with existing patents. In fact, the whole point of modern programming is anathema to what patents try to enforce. So it can be really frustrating to see patents on common ideas. It can be even more frustrating when someone decides to patent something that you've worked with. I guess I should be grateful that I needed a more scalable design than the one described in that blog post.

Wednesday, November 22, 2006

Mashups

I like seeing the tools we write being used in the real world. After all, what is the point otherwise? We are trying to create something useful, so it has to be applicable to use-cases that people really want. While my own specialty is Mulgara, this is just a tool for the semantic web. So I'm interested to see these use-cases, regardless of the underlying engine.

With this in mind, I enjoyed reading a post from Danny Ayers today. He describes using various data sources to answer a useful real-world question that he got from yet another source, to wit:
"Where can I buy firewood as a function of price and distance from where I am now?"

While I agree with Danny about the ability to put all of this together, I think he is a little optimistic about the effort to do so:
"... all the tech to answer the question on the Web is readily available as almost-commodity tools, not a lot of effort needed."

It would be nice to think that he is right, but I'm skeptical. Maybe I'm overestimating the effort. After all, I'm still a beginner at programming Ruby. :-)

My favorite line was last:
"Why would anyone put such data on the Web? Perhaps they might want to sell their wood."


BTW, of all the words in the US spelling system, the one that bothers me the most is "favorite". It just looks wrong somehow. Why this word bothers me more than "color", "optimize", or "bank check" is something I can't answer. :-)

Lament for Open Source Work

I miss those heady days of working on open source code 90-100% of the time. In those days I could talk about work as much as I wanted, with no ramifications. In fact, discussing my work on Kowari was a good thing.

I initially started blogging about my work simply so others in my team could find out what I was doing. If they didn't have time, then they didn't have to read it, but if they were interested then it was all there for them to read. I quickly found that organizing (there's that American spelling again) my thoughts sufficiently to write about them helped me to keep track of what I was doing, and to plan for upcoming work.

Somewhat surprisingly (for me), it didn't take long before people I didn't know started to read my blog as well. This was an interesting experience, and led to some interesting conversations and contacts. Ultimately it helped me get my current job!

Unfortunately, my work now is almost exclusively commercial. That means I can't write about it in public. I've also found that I like blogging too much to write about it privately (even in the local Wiki at my company).

I can still write about Mulgara outside of working hours, but then I run into a new problem. Back in the day, I was able to work on Kowari (now Mulgara) during the day, and blog about it at night. These days I work on "other things" during the day, and at night I can either blog, or I can work on Mulgara. It's hard to find the time to do both. It's a little frustrating.

Actually, most of the technical things I've been writing about have been in the Mulgara Wiki, and not in this blog at all. So at least I've written something, but it's disappointing to think that I've lost most of my audience (small as it was). The emails and comments I received were often insightful, very useful, and quite gratifying.

With all that audience missing, who am I writing for today? Well, mostly myself... just like it was in the beginning. I suppose I'll also be read by those people who forgot I was in their RSS feeds (hello to both of you!). :-)

Recently I've noticed that I miss the process of writing. I guess I'll have a flurry of activity in this blog for a little while, until I get overwhelmed with other things again. Hopefully it will be fun while it lasts!

Tuesday, October 10, 2006

Functional Programming


Is functional programming the next big thing? A lot of the reading I've done lately would seem to suggest so, but that might just be the nature of what I've been reading, rather than a genuine shift in the IT community.

Of course, I've been hearing about functional programming for years. I've always skirted around it, since it was not taught in my electrical engineering course and I never pursued computer science academically until recently. (Maybe that's another reason I've been hearing about it more?)

My first exposure to this style of programming was when I learnt the C++ Standard Template Library (STL). Of course, it's nowhere near pure functional programming, but I was amazed at how elegant code could become with the STL. An HP example that I remember seeing demonstrated a simple program which read strings from stdin, sorted them, removed duplicates, and printed the results, all in 2 lines of code. However, if I wanted terse, I could just code in Perl. The beauty of this code lay in its simplicity.

Over time I encountered this approach in many different forms, but it wasn't until I watched the lectures of Abelson and Sussman that I got a formal appreciation of it. Now I understand why programmers recommend learning Lisp, simply for the different way that it makes you think.

Recently I've started seeing many programmers whose blogs I follow commenting on functional programming. It has taken a while, given that the approach was first really pushed by Backus way back in 1978.

One of the big advantages of functional programming is the ability to provide a proof of correctness. There are mechanisms for proving the correctness of imperative programs, but these are largely impractical on real systems. Functional programs, by contrast, have the form of a mathematical expression, making proofs easier to derive and verify.

I used to think of proofs as only being of interest to academics, but I've since changed my tune. Everyone in this industry is aware of how buggy software can be, and how often projects fail to meet deadlines due to errors. Unfortunately it is all too common that these problems (among other things) can lead to projects failing altogether. With this in mind, several companies have been successfully programming using tools for converting formal specifications in Z-notation into running code. The reported bug rate is significantly lower, along with extraordinarily high rates of success. I understand that many clients with critical applications (such as military and space deployments) are choosing to only accept code written in this way. It makes sense to me.

The more I learn about this, the more I appreciate Dijkstra's point of view, particularly on teaching programming. And I'm not the only one lamenting his lack of influence (I apologise to those non-IEEE members who cannot read that link). I understand the view that he was an ivory-tower idealist, but with projects constantly failing in this industry, and our inability to find developers with a proper understanding of programming, I've come to revise my point of view.

With this in mind, some months ago I started learning Haskell. This language was created for academic purposes, and is considered a purely functional language. For that reason, it may sound impractical, though Quake Haskell, along with many other projects, provided me with enough encouragement that it really can be used in the real world.

I'm not very productive in Haskell yet (for instance, I can follow monads, but I don't really get them - though I'm still reading). But I've found that it has improved my understanding of certain programming approaches, and made me a much better developer in the languages I am good at.

In the process of learning Haskell, I started to learn more about another functional language called Erlang. This language was developed in the late 80's by Ericsson, partly in response to some of the things I've been talking about here. Ericsson also released a short film, if you're interested in learning more about it. Erlang has a number of features similar to Haskell, including pattern matching, which is probably the single most important feature of these languages.

If they're similar, then why would I start learning Erlang when I still haven't mastered Haskell? Well, regardless of the success of Haskell in some of the above projects, there are still barriers to its practicality. The best comparison I've seen so far was by Joel Reymont, when he attempted to build a client application for a poker server. Regardless of his complaints about Haskell, he still finished by saying:
"That said, because it forever twisted my brain into a different shape and I think I’m overall a much better coder now."
This reflects my own experience so far, so I'm continuing to make my way through my Haskell books slowly. It's a shame that it has to be as slow as it is, but work, study and Mulgara leave me limited time. Maybe one day I will get competent enough at Haskell and Erlang that I can move on to OCaml.

It's amazing how all of these things interconnect. I'm also spending time on Description Logic and how to do this in RDF/OWL, so I can create a programming engine in Mulgara. While assertions are different to function calls, there is enough similarity that I wish I could do much of this work in Haskell rather than Java. At least Haskell has taught me many of the techniques I need to make this job easier, and much more likely to be correct.

Wednesday, September 13, 2006

Morphisms


This won't make sense to most people, but I'm just taking notes for myself so I don't forget...

I notice in the Wikipedia entry on Category Theory the following:

"Relations among morphisms (such as fg = h) are often depicted using commutative diagrams, with "points" (corners) representing objects and "arrows" representing morphisms. The influence of commutative diagrams has been such that "arrow" and morphism are now synonymous."

This reminds me of the (recently neglected) idea of compositional statements in OWL. I require some formalism to make composition a viable suggestion, and this may be a starting point.

The problem with this approach is that predicates can't really be considered to be morphisms, as it is possible to repeat them many times on a given subject. (Does this mean that the predicates can't be morphisms, or just that they can't be monomorphisms?) Still, I wonder how many category theory ideas are portable into RDF? In a moment of meta-meta insanity, I'm left wondering if there is a morphism from category theory to RDF.

Thursday, August 31, 2006

Nothing

I didn't want to embarrass myself in public, so I looked at the whole owl:Thing vs. owl:Nothing issue again. Something I'd forgotten is that owl:Nothing cannot be instantiated. So it's OK if owl:Nothing seems to be a contradiction (I think), since it is only the instances of a class which create problems, and you can't have an instance of owl:Nothing.

It's after midnight. I'd better get to bed before my head hurts any more.

Wednesday, August 30, 2006

OWL

I'm finally getting off my rear end and writing some code to do OWL in Mulgara. (I hear the gasps of disbelief from the peanut gallery).

So I needed to start with some OWL axioms. You know: owl:Class is an owl:Class, it's rdfs:subClassOf owl:Class, etc. So I went to a handy little file I have named owl.rdf. Now I distinctly recall that the file was not originally named owl.rdf, but I wasn't sure why I'd renamed it. All the same, I've had it for some time, and it looks official, so I decided to work with it.

The easiest way to convert it into Krule axioms was to load this file into Mulgara, and then have an iTQL client build the RDF for me. I had a one line bug (why is it always one line?) which took me a while to track down to a typo in a constructor where I effectively said:
  parameter = this.parameter;   // instead of: this.parameter = parameter;
But finally my output was as expected.

At this stage I noticed something odd in the axioms. Out of curiosity I went back to the owl.rdf file and discovered this:
  <Class rdf:ID="Thing">
<rdfs:label>Thing</rdfs:label>
<unionOf rdf:parseType="Collection">
<Class rdf:about="#Nothing"/>
<Class>
<complementOf rdf:resource="#Nothing"/>
</Class>
</unionOf>
</Class>

<Class rdf:ID="Nothing">
<rdfs:label>Nothing</rdfs:label>
<complementOf rdf:resource="#Thing"/>
</Class>
So on one hand it says that owl:Thing is the union of owl:Nothing and everything, and on the other hand, owl:Thing is the complement of owl:Nothing. I can see a rationale behind both points of view, but they can't both be true at once. owl:Thing either includes owl:Nothing or it doesn't. (Or am I missing something really big here?)

So where did this file come from? Unfortunately, I picked it up with wget. If I'd used Safari, then it would store the URL in metadata which I could view with mdls. Apple metadata might be a nice patch to work on for wget. Fortunately, when I went to the W3C to search for the file, I suddenly remembered where it came from.

RDF uses Uniform Resource Identifiers (URI) to identify resources. Note that these are Identifiers and not the more familiar Uniform Resource Locators (URL). URLs are a subset of URIs, but their purpose is for location rather than purely identification. A URI may look like a URL, but it may not be specifying the location of anything at all. This is often confusing for first-time users of RDF, who expect that the identifiers they see should be taking them somewhere useful.

Now just because RDF says that you don't need something at the address of a given URI, that doesn't mean you can't put something there. In fact, many people do, including the W3C. For this reason, one day I decided to put the URI for the OWL domain into wget and see if I got anything. The result was this RDF file called owl. Once I remembered that, I also remembered changing the name to the more useable owl.rdf.

So here is this file sitting at the W3C, hiding in plain sight. I find it amusing that it can have such a contradiction in it.

Friday, August 25, 2006

Am I Getting Old?


This story has been written elsewhere, but I have my own contributions to make to it. Hopefully the catharsis will help me sleep better. :-)

I've been helping to interview several programmers recently, and I've really been surprised at the results.

To be clear, I do not have a degree in Computer Science. My first degree was in Computer Engineering. That means that I spent many long hours debugging breadboard with a logic analyzer, and the code I was writing in my final year was usually for device drivers to communicate with the hardware I'd built. That isn't to say that I didn't have subjects in computer science - I did.

I did the basic first year CS course, which taught us about records, how to build linked lists, recursive functions, simple sorting algorithms (bubble sort!), and the other essentials. I did subjects on networking, operating systems, concurrency (anyone care to dine with philosophers?), modular programming, complexity theory, and databases. Funnily enough, in my first database course we were not given access to computers, and never learnt SQL. Instead, Terry Halpin taught us how to model data using a system called NIAM. It seemed very theoretical at the time, but it ended up being a vital foundation for my later work in OWL. I also had Computer Engineering subjects which taught me Assembler (in several flavours) and C, how to use them to talk to hardware and/or the operating system, how to model hardware in software, sockets programming, and so on.

I never covered more advanced data structures or algorithms, but then I never did a degree in CS, did I? My physics degree gave me the opportunity to do a couple of extra subjects in programming, but these were vocational, and hardly worth the effort. Instead, I had to gain an understanding of professional programming in my own time. I had a very good mentor to start with, but his main role was in teaching me how to teach myself. Consequently, I now have a large bookshelf full of books covering topics from MFC and GTK, through to Knuth, Type Theory, and Category Theory. There are still holes in my knowledge, but I've covered a lot, and I'm still progressing. In fact, learning after university has meant that I've learnt better, as I learnt out of genuine interest, and not because some lecturer said I needed to learn something in order to pass an exam. For this reason, I have learnt to have less respect for certified courses, and a greater appreciation for enthusiastic amateurs.

The same willingness to keep learning also applies to most of my friends and colleagues. This is important in our industry, both because the field is too large to know it all, and because the systems and protocols implementing the theory are constantly expanding. Fortunately, we don't need to know it all. The fundamental principles in computer science always hold, and form a solid foundation for all future learning. Once you have these down, then any new language or paradigm can be understood and applied quite rapidly.

So when we interview candidates, I look to make sure that they have a grasp of the fundamentals. Instead I keep discovering people with years of experience in J2EE, who are completely unaware of the underlying principles of computer science. It is almost universal that these people have solid vocational training, but very little ability to work outside of established paradigms.

Why do I consider these things important? Isn't it acceptable to simply work with EJBs, SOAP and JDBC? Why should I expect more?

There are several answers to these questions.

To start with, staying within these narrow APIs has left many of these people unaware of significant APIs in Java proper. Most developers we have met do not even know what file mapping is, let alone the fact that it is included in the java.nio package. Concurrency is a subject with subtle complexities, yet these developers are often unaware of the attributes which make up a Thread in Java, including the stack that each thread owns. How can they hope to correctly use a ThreadLocal object, or know how to debug a deadlock?
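For anyone who hasn't seen it, file mapping really is only a few lines of java.nio. Here's a minimal sketch (the file name "data.bin" is just a made-up example):

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MapExample {
  public static void main(String[] args) throws Exception {
    // Open a file and map (up to) its first 1024 bytes directly into memory.
    RandomAccessFile file = new RandomAccessFile("data.bin", "r");
    FileChannel channel = file.getChannel();
    MappedByteBuffer buffer =
        channel.map(FileChannel.MapMode.READ_ONLY, 0, Math.min(1024, channel.size()));

    // Reads now come straight out of the OS page cache, with no explicit read() calls.
    if (buffer.remaining() >= 4) {
      System.out.println("First int in the file: " + buffer.getInt());
    }
    file.close();   // also closes the channel
  }
}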

Another problem with vocational training with a narrow focus is that almost all systems have a chance to interact with each other in some way. Not understanding the surrounding systems leads to an inability to solve some problems. I have yet to meet a candidate who understands what significance a 64 bit operating system has for a JVM, when compared to running on a 32 bit system. This is so important that Mulgara has a completely different configuration for file management when operating on 64 bit systems, with vastly different performance. How can programmers working on scalable "Enterprise" applications be effective at their jobs if they can't understand this difference? The level of ignorance is surprising, given the general interest in 64 bit chips and operating systems.

This extends down to simpler issues as well. As with many languages, some understanding of 2's complement binary arithmetic is required for working with all of the numeric types in Java. For instance, for every signed integer type in Java, the abs(MIN_VALUE) is always 1 greater than the MAX_VALUE. Even when shown this fact, the people I've spoken to don't understand the significance of the expression -MIN_VALUE. This could easily cause problems in code that said:
  if (-x > 0) {...}
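For anyone who wants to see it, this is trivial to demonstrate, and the results below are guaranteed by the language (negating MIN_VALUE simply wraps around):

public class MinValueDemo {
  public static void main(String[] args) {
    int x = Integer.MIN_VALUE;                // -2147483648
    System.out.println(Math.abs(x));          // prints -2147483648: abs() overflows back to MIN_VALUE
    System.out.println(-x > 0);               // prints false: -MIN_VALUE is still MIN_VALUE
    System.out.println(Integer.MAX_VALUE);    // prints 2147483647, one less than the magnitude of MIN_VALUE
  }
}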

We also find an amazing ignorance of complexity. How can a programmer write a scalable application if they believe that linear complexity is to be preferred over logarithmic? We even had one person think that polynomial was best!

The next issue is that Java will not be the dominant paradigm for the rest of our careers. Computing moves fast. Java may be around for some time yet (look at C), but something new is guaranteed to usurp it, and it will happen in the next decade (I feel uncomfortable making a statement with a guaranteed timeframe, but I have history on my side). For instance, many clients of mission critical systems are looking to developers using functional languages with the ability to mathematically prove the correctness of a system. I'm only mentioning it briefly, but the importance of functional languages in this respect cannot be overstated. Proof trees make it possible to prove the correctness of an imperative program, but the results are messier than is practical in a commercial environment.

Finally, and this is one of the most important principles, other languages teach programmers to think differently. This allows programmers to write important functionality in Java that would not be possible with standard J2EE practices. Mulgara needed a C programmer like David in order to have data structures on disk which are as efficient as the ones we have. Standard J2EE techniques would not store anything directly to disk, but would use a relational database. Even if a programmer had chosen to write directly to disk, standard serialization techniques would have been used. Instead, Mulgara uses densely packed data structures that serialize and deserialize quickly, and are structured in complex ways that are kept completely opaque to the calling object oriented code... and it is all done in Java. David's understanding of how operating systems use write-behind algorithms, which hard drive operations are atomic, and how best to interact with the operating system's buffer cache was vital to Mulgara's performance.

Google needed Lisp programmers to write MapReduce, which is one of the most important and fundamental tools in their system. It was Lisp programmers trying to get some semblance of Lambda Calculus into C++ who introduced templates, and ultimately the STL. This in turn led to generic classes and methods in Java, again highlighting the need to understand these other systems in order to better program in Java.

If we want to write important code that does anything beyond standard business logic for commercial applications, then we need people who understand computer science, and not people who have received vocational training in the popular language of the day. They need to understand the mathematics behind programming, the pros and cons of both functional and imperative languages, the features of underlying operating systems, along with many other things.

Finally, if they want to be really good, they need a desire to go out and learn some of this stuff for themselves, because no course can teach it all. It's just a shame that they don't seem to teach the basics any more. And it's the fact that I typed that last statement that makes me realise that I'm getting old. :-)

Thursday, August 24, 2006

One at a Time


I'd love to be writing more here, but instead I'm putting my Mulgara ideas up on the Mulgara Wiki. Anyone interested should look under Development/Indexes.

I can only really manage the time for writing in one place at a time. Not to say that I'm abandoning this blog, but it is more appropriate to put Mulgara design details on the Wiki.

Saturday, August 19, 2006

Mulgara v1.0.0


Mulgara v1.0.0 has been released!

All my spare time has gone into this recently, which should explain the dearth of blog posts. Now that the first release is out, I should have more time to:
  • Write some code.
  • Write more blog entries.
  • Get OWL working in Mulgara.
I hope that future releases will require less effort, since the project is off the ground, and we've established several procedures for this first release.

Thanks to the people who helped me get it out. In particular:
  • Brian, who coded and debugged.
  • David, who reviewed everything from legal files through to web pages, and without whom I'd find myself in a world of trouble.
  • Thomas, who volunteered out of nowhere to design the web pages for us.


For anyone missing my technical musings on the structure of Mulgara, have a look at the Mulgara Wiki. I've just started using this area, so others can contribute, and also because it allows me to structure some of the details.

Saturday, July 08, 2006

Forking Code


Of course, I can't really discuss work online, which has been a good reason to keep quiet lately (since I'm working so much of the time). The other reason is that my spare time is kept occupied with Mulgara.

Anyone looking at Mulgara recently would have some idea of what I've been doing instead of blogging. The short story is that NGC have come to a legal agreement acknowledging that Kowari is open source, and they should not have stopped progress in the project.

There are some who hope we can now abandon the fork, and go back to working on Kowari. In one sense, I'd certainly like that. Kowari has a name for itself, and it's a name I've worked with for years. It helps people who have a commitment to using Kowari, and for whom a name change would be an annoying distraction.

On the other hand, NGC didn't have a legal leg to stand on when they started all of this, and yet they demonstrated that you don't need one to cause problems. The problems have all come down to misunderstandings, but Open Source Software has always been difficult for NGC to understand, and it is always possible they will have difficulty understanding it in the future. I know they also offended some of the Kowari developers while mendaciously describing our work, and that never goes over well with people.

So while it would be a terrible shame to lose the name Kowari, enough people agree that there are good reasons to go ahead with Mulgara. Besides, having started it up and got it running now (many thanks to Luigi and Herzum Software for the support), I feel I have some ownership over the process in the new project.

The other thing people might notice is that I've been working on code. Alas, there are no new features yet, but I'm getting close to a point where they will be forthcoming.

Roughly, the Mulgara work falls into 3 categories:
  • Administration for the project.
  • Forking the Kowari sources.
  • Upgrading Mulgara from Java 1.4 to Java 1.5.


Administration means learning about, and setting up Mailman, learning how to configure and use Subversion, learning the latest changes in Apache configurations (I think I last configured Apache at version 1.3), and setting up HTTPS and WebDAV access. Yuck.

I still haven't worked out how to get WebDAV to force logins over HTTPS. I can enforce logins, and I can get it over HTTPS, but not both. Oh well. I guess it just means that administration has to be done by Jesse or myself for the time being.

Forking Mulgara has had its own difficulties. We just about had it done a few weeks ago, but NGC started insisting that certain recent code belonged to them and should not have gone into CVS at Sourceforge. They've since retracted this claim, but in the meantime we went back to the 1st of August last year and forked from that point. So we had to repeat a lot of the work we'd already done. Very annoying.

One benefit to going back to August has been that we may get to avoid a bug that we think has been introduced since then. A couple of users have reported difficulty loading large RDF/XML files. Since the parsing has all changed since August (not as a fix, but in an attempt to avoid using Jena), this could be the cause of the problem. It will be a good opportunity to check it out anyway.

A number of bugs were introduced when all the packages, files and variables were renamed in the fork. In the course of tracking down all these bugs, I've used the opportunity to move to Java 1.5. It just created a few more bugs to track down. Fortunately, I knew where to find most of these problems from earlier personal attempts to convert to Java 1.5.

Java 1.5


The most obvious issue moving to Java 1.5 has been the new keyword "enum". I've noticed in several other places that "enum" is a popular name for packages and variables. Kowari was no exception. This was easily remedied. Unfortunately Apache has a package called enum which is referenced by code generated by the WSDL compiler. The workaround is to compile just this generated source with the -source 1.4 flag.

The other problem I encountered was in the change to Unicode 4.0.

I never gave a lot of thought to Unicode before Kowari was released as Open Source Software. Until that point I'd been able to get away with ASCII. After all, I don't know any other spoken languages (I really ought to do something about that) and all my work had been in English.

After Kowari had been out for a little while we started getting bug reports and fixes from someone in China complaining about how Kowari's strings were not handling certain characters properly. This woke me up to how international OSS is. I was used to working with people in the United States, but by being involved with something like this on the internet we were exposed to a whole new international community.

This has made me a little more sensitive to the need to support Unicode in all projects. OSS projects need it for the reasons I mentioned above, and commercial projects need it if they ever hope to succeed internationally (a real concern for me when my employer has offices in Italy, the UK, Turkey, and France).

Now 16 bit characters are all well and good, but the latest version of Unicode requires a full 20 bits. These extra characters (called "supplemental" characters) are not used very much, but I've already learnt that we can't ignore the things we are ignorant of. Unfortunately, the previous mechanism for managing these characters would not work in Java 1.5.

Previously we had used a regular expression to look for a pair of "surrogate" characters, which together form a single Supplemental Unicode character. Unfortunately, the character classes which used to be supported for this in Java 1.4 are no longer available. Some research showed that the regex library now wanted to treat these characters as a single character (which is a definite improvement), but checking documentation, and finally Java source code showed that there is no "character class" for referring to the supplemental character set.

I'm short on time here, but for anyone interested, the old regular expression for this was:
"\\p{InHighSurrogates}\\p{InLowSurrogates}"

Now the correct expression uses the numeric range of the surrogates to build a character class of Unicode literals spanning this region (U+10000 to U+10FFFF):
"[\ud800\udc00-\udbff\udfff]"

This seems to be the worst of all worlds. We have to use a pair of 16 bit codes to create a single 20 bit Unicode value (using the surrogate arithmetic to calculate the final value, e.g. the pair U+D800,U+DC00 encodes U+10000). However, we can no longer refer to these codes as a "pair of surrogates", but as a single character (even though we DEFINE that character with a 2-part code). It might have been nice if Sun had created a character class like this for us. Something like:
"\\p{InSupplementals}"

Well, at least it now works in Java 1.5. It just doesn't work in Java 1.4.
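To see the new expression doing its job under Java 1.5, a small test like this is enough (U+10400 is just an arbitrary supplemental character I picked for the demonstration):

import java.util.regex.Pattern;

public class SupplementalTest {
  public static void main(String[] args) {
    // A character class spanning U+10000 to U+10FFFF, written as surrogate pairs.
    Pattern supplemental = Pattern.compile("[\ud800\udc00-\udbff\udfff]");

    // Build a string holding a single supplemental character (U+10400).
    String text = new String(Character.toChars(0x10400));

    System.out.println(text.length());                      // 2: stored as a surrogate pair
    System.out.println(supplemental.matcher(text).find());  // true: the regex sees one code point
  }
}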


Monday, May 15, 2006

Subversion


For anyone not familiar with Subversion (I include myself in this group), the 1.2 version of the code can be retrieved via:
  svn co https://mulgara.org/svn/mulgara/trunk/kowari-1.2

I just thought I'd mention this in case anyone tries to get the whole repository (which includes the entire history of the project). The entire thing comes in at 2.6 GB. The latest branch is a more modest 204 MB.

Thursday, May 11, 2006

Mulgara

The two most important elements of Mulgara.org are now running: Mailing lists and Subversion.

I've discovered that Yum is not the way to install Mailman on Fedora Core 4. It isn't integrated with any MTA which means that it needs a lot of manual configuration. After getting it working with Exim, I discovered that it wouldn't send any messages due to a problem with Sendmail. Moving from Exim to Postfix fixed the problem, but involved reconfiguring Mailman. Overall, a frustrating experience. I'm very grateful to Jesse for his help.

Jesse is the official administrator of mulgara.org, but like me he has other work to get done. Since I'm the one who's actually developing for Mulgara, he was more than happy to give me sudo and let me install the things I needed. The only problem is that he knows what he's doing, and I don't. :-)

Services

So where are we? We have:
  • Mailing lists (running on Mailman).
  • Source control (Subversion).

Still to do:
  • The web page is currently just a place holder. We need to come up with content, and we need to move all the documentation over to it.
  • From a development point of view, we need to move all the org.kowari packages over to org.mulgara. This would be easier with the refactoring tools in Eclipse, but these won't use the required svn mv command. Much of the first round will be a simple search/replace, so I may just be dusting off my perl/sed skills. To anyone with suggestions on how to do this more intelligently, please get in touch!
  • The site needs a signed certificate. Thawte and GoDaddy both have certificates for under $100, but I should check with others about what we really need. I'm guessing we only need the bare minimum.
  • We need a Wiki and bug reporting system. Herzum Software has offered to provide licensed copies of Atlassian's Jira and Confluence for use by the project. I'll have to ask Jesse to install these, as I have no idea about either of them.


Feel free to write to me with any questions, advice or complaints. I'll answer, accept, and listen, respectively.

Wednesday, May 10, 2006

Work, work, work

Yeach. Long hours in the last few weeks have left me feeling wrung out. I tried to make time for family through this (not always successfully), so I didn't have time for anything else. The first thing I tried to catch up on afterwards was exercise. This blog came way down in the list of priorities.

Bugs

I discovered a couple of problems in Kowari in the last couple of weeks. Unfortunately I had goals to push towards, so I haven't been able to look at them yet. I'll get back to them, but in the meantime I'll write them down here for others to look at.

Full Text Searching

One thing I wanted to do (fortunately I didn't need to do it at the time) was full text indexing and querying. This is straightforward using a Lucene model. Unfortunately I couldn't get it to work! All my queries came up with empty results.

I figured that I must have been doing something wrong, so I went to the examples in the documentation. I couldn't get any results there either. So now I think that something must be wrong. I haven't had an opportunity to run the full test suite lately (I really only have a notebook, and it's slow enough without having to run the tests in the background), but I'll have to run it soon to make sure everything is still OK. If it is, then I'll be running Kowari through a debugger to figure out why I can't get any results from Lucene models.

Normally I'd have my big Linux box doing these tests for me, but it's still in a shipping container, and I've been told that I won't see it before June. Ask me if I'm frustrated.

There is one other machine that I could use, and that's the Mulgara server. I haven't discussed Mulgara here, so I'll discuss it more later.

Datatypes

This bug doesn't affect me at all (at the moment), but it's surprising to see, all the same. I discovered this bug when scanning a model which restricted a class using an owl:cardinality of "1". The data type for owl:cardinality is xsd:nonNegativeInteger, so that's what I used. So far, so good.

Then in some test code, I inserted a series of numbers into a Sequence. These were just 1, 2 and 3, and each of them was typed as xsd:int. However, when looking at the full contents of the model I was surprised to see the following numbers in the sequence:
  "3"^^<http://www.w3.org/2001/XMLSchema#int>
"2"^^<http://www.w3.org/2001/XMLSchema#int>
"1"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger>
So if the number was already in the string pool as a "nonNegativeInteger", then it just picks up that value again, rather than re-inserting it.

This will be OK for simple cases like this, but it won't work in the general case. Looking at the hierarchy of XSD data types you can see that int and nonNegativeInteger are both subtypes of integer, but they are not in the same branch. So neither can be used in the place of the other, unless an application specifically calls for an integer (their closest common ancestor).

So having the string pool automatically convert one to the other is going to be a problem one day. I've already started checking literal data types, and need to make an exception for this case. I can certainly conceive of a system which would fail a necessary consistency check because of this.

Now I'm left with the option of fixing it now, or waiting to put it into the XA2 store. I still don't know when we'll have time to get to XA2, so maybe an interim fix is the way to go.

RMI Connections

By far my biggest problem is RMI connections. I've seen posts complaining about how slow Kowari is, which I find really frustrating. Queries in Kowari are nearly instantaneous, and insertions are pretty fast as well. So why do people think it's slow? The answer lies somewhere in the Session interface.

For some reason, when the first query is performed on a Kowari database, the Session can take several seconds to establish a connection. This happens whether the system is remote or local. I have yet to profile this delay, but it all seems to be in the RMI code. I can't imagine that RMI is that bad, so I'm guessing that it's doing a DNS lookup in there, though I may be wrong. Either way, it's an unacceptable delay. We need some way to speed this up, particularly when running on a local machine. I put a lot of work into making Kowari really, really fast, and the frustration I feel at these network layer delays nearly drives me to tears.

Another problem is what happens after an error from the ItqlInterpreterBean. It seems that a single error (often avoidable, as they are normally caused by programming bugs) can render the Session nearly useless. There is some kind of synchronization issue after an Exception is thrown that a Session can't seem to overcome, at least for a couple of operations afterwards. I'm not sure why this is, but it bears investigating.

A lot of these problems seem to come down to RMI. I don't know what the best replacement technology would be (SOAP will also have DNS issues, etc) but we need to look at something. Whatever we choose, we will probably need to bypass it when accessing a local data store. At the moment we do this transparently, but perhaps we should consider something a little more optimized for local connections, since this forms the majority of the use cases I've seen.

Mulgara

As David Wood pointed out, we are forking the Kowari Project. The new project will be called "Mulgara". We felt this was necessary to keep developing the RDF database when there was uncertainty due to the relationship between Kowari and Northrop Grumman.

Some of the problems appear to be caused by a misunderstanding of Open Source Software by Northrop. This has made me reluctant to discuss Mulgara until now, as I didn't want further misunderstandings to develop. Fortunately, we've had legal advice from several quarters confirming our right to fork the project (since the project is licensed under the MPL).

My employer, Herzum Software, has offered to host the new project, putting up a new Linux box for this purpose (running Fedora Core 4). Unfortunately, the administrator for the machine has been overseas for the last couple of weeks, and with my recent long hours neither of us has had time to configure all of the necessary services. I hope to change that this week. In particular, I've discovered that Yum is not a very good tool for installing Mailman (it didn't even notice that there was no MTA installed!)

Some of the services are already going, such as Subversion, but I still need to organize a certificate (at the moment it just provides the "localhost" certificate that comes standard with Fedora Core).

Check back soon and we should have it all running.

Tuesday, April 18, 2006

Commitments


It's quite difficult to blog while working and moving residence at the same time. I've had some big things happening at work at the same time, so the preference is usually to get those things done rather than write.

On the other hand, I've let slip so many things that I wanted to talk about, that I'm getting frustrated with my lack of blogging. I don't have a lot of time tonight, but I thought I should at least get something written.

FAT_MAGIC


Following a link off Slashdot, I found an interesting discussion of otool in OS X. The author had been looking for an ldd replacement for finding the dynamically linked libraries that a binary uses, which is something I've wanted as well. I just didn't know about otool.

otool seems to be a Swiss-army-knife for inspecting Mach binary files. The array of switches seems quite large, particularly since each function seems quite unrelated to most of the others. It makes the tool feel a little overloaded.

The article points out that the file /usr/bin/crt1.o contains the entry point code for all programs in OS X. I've been interested in the structure of universal binaries, particularly dynamic libraries, so this seemed like a good library to apply otool to.

Looking at the numerous parameters, I discovered that:
  otool -f
and
  lipo -detailed_info
produce nearly identical output. I suppose this makes sense, given that both are looking at the fields in standard Mach binary structures, but it's so close that I have to wonder if they might share some code. If that is the case, then I wonder if any other otool flags duplicate the functionality of other programs?

While looking at all the information available from otool, I discovered that all of the programs I was looking at start with a magic number of: 0xcafebabe. I was shocked at this, as this is the same magic number used by Java class files. Why would they share this value?

I suspect that FreeBSD took this number first with their definition of the Mach file format, and Sun's Java team weren't aware of it when they were defining Java. Or perhaps they knew, but didn't care about a small, non-commercial operating system (little did anyone know what would become of FreeBSD in the future!). So where did FreeBSD get this format from? Did they define it themselves, or did they inherit it?

Whatever the reason, it obviously created conflict. I found a reference to it in the change log for the cctools package in the Darwin project:
Changes for the 5.16 release (the cctools-520 release):
- Fixed a bug that caused otool(1) to crash when it was called with a bad fat
file (a java class file). The calls to ofile_first_arch() in ofile.c was not
checking its return value and later causing the code to crash when it should
have returned in case of an error. Radar bug #3670740.
Looking at the code in question, it now looks for bytes at a particular offset to try to avoid mistaking a Java class file for a Mach binary. I don't know if it's justified, but I always feel uneasy when a programmer leaves a note to say that the implemented approach is a hack.

These magic numbers are supposed to help a system differentiate between different file types, particularly executables. Having the same number on two major file types seems like a problem, but it's not one that could be changed now.

Friday, March 31, 2006

Programming


I've actually been doing some programming at home lately, which seriously cuts into the time in which I might blog. At least I've felt productive. That may have to take a back seat for a few days, as we're moving apartments this weekend.

Brian at work has created a small auditing library, which can be called by developers to log when methods are entered and exited. The cool part is that he has turned the results into some interactive SVG that plots the course of the program.

At this point I remembered that Java 1.5 now includes instrumentation. If this could be used to automate the calling of the auditing methods, then it would remove the burden from the programmer (who, let's face it, could get it wrong, or just be too lazy to bother). So I started learning a little about the new interfaces.

It turns out that instrumentation is quite simple. In fact, it's too simple. Look at the size of the API, and you'll see that there are only a couple of methods. All it does is inform you when a class is being loaded, and provides you with read/write access to the bytecode for that class. That's about it. If you want to take it any further, then you have to get your hands dirty and manipulate the bytecode.
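To give an idea of just how small it is, an agent amounts to little more than this (a minimal sketch; the class name and jar name are mine):

import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;

public class AuditAgent {
  // Named by the Premain-Class attribute in the agent jar's manifest,
  // and loaded with: java -javaagent:audit.jar ...
  public static void premain(String agentArgs, Instrumentation inst) {
    inst.addTransformer(new ClassFileTransformer() {
      public byte[] transform(ClassLoader loader, String className,
                              Class<?> classBeingRedefined,
                              ProtectionDomain protectionDomain,
                              byte[] classfileBuffer) {
        // This is the entire hook: raw bytecode in, raw bytecode out.
        // Anything interesting (like inserting audit calls) means rewriting
        // classfileBuffer yourself, with a library such as ASM.
        return classfileBuffer;   // returning the bytes unchanged (or null) leaves the class alone
      }
    });
  }
}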

Fortunately the articles I found online did just this. The library generally used for this is ASM. There are several others, but ASM has the advantage of being simple and lightweight.

ASM also has one killer feature that I have yet to see in other bytecode manipulation libraries: an Eclipse plugin with some great features. This plugin not only lets you view the bytecode of your current code, but it also lets you switch to a view showing the ASM code required to generate that bytecode. So all you have to do is type in an example of the code you'd like to generate, and the plugin will show you the ASM library calls you'd need to make. This almost makes coding with ASM trivial.

So I've been using ASM to instrument code, and generating pretty SVG plots of the call stack over time. I've learnt quite a bit about Java class loading in the process. It's also moved me one step closer to being able to generate classes out of OWL descriptions (yes, I can use UML, but that's not very serializable now, is it?). Probably the biggest trick was working out how to not audit the code doing the auditing (I hit some infinite recursion on my first attempt). However, this ended up being trivial when I realised that I'm just auditing a stack, which by definition is associated with a thread (Brian's library records method entries and exits based on the current thread). All I needed was some thread-local storage to hold a lock on the auditing process. Then if the auditing code saw that the lock was being held, it skipped the auditing and returned to the original task. The best part is that it's all local to the current thread, so there's no synchronisation required.
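The guard itself only needs a few lines. Something along these lines (this is a sketch of the idea, not Brian's actual code):

public class AuditGuard {
  // One flag per thread: are we already inside the auditing code on this thread?
  private static final ThreadLocal<Boolean> auditing = new ThreadLocal<Boolean>() {
    protected Boolean initialValue() { return Boolean.FALSE; }
  };

  /** Called from instrumented methods. Skips logging while logging is already in progress. */
  public static void enter(String methodName) {
    if (auditing.get().booleanValue()) return;  // already auditing on this thread: bail out
    auditing.set(Boolean.TRUE);
    try {
      // record the method entry against the current thread's call stack here
    } finally {
      auditing.set(Boolean.FALSE);
    }
  }
}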

This kind of bytecode manipulation was amazingly similar to the Windows API interception I was doing a couple of years ago. I enjoyed it then too. It was just as tough to debug as well, since debuggers don't expect to see bytecodes modified from the condition the compiler left them in. The classes drop all line numbers, and tracepoints only start to make sense when the instruction pointer moves to an unaffected class. I've also started running into the following exception when I use instrumentation on code with a logger in it:
java.util.logging.ErrorManager: 5
java.util.MissingResourceException: Can't find bundle for base name sun.text.resources.DateFormatZoneData, locale en_US
I then get shown a deep stack, which would seem to indicate that the method-calling modifications from ASM are all correct. So it looks like the classpath is wrong at this point. Frankly I'm not surprised, as I've seen different classloaders coming in during the startup process of an application (it's odd seeing an exception about a missing class when the code in question is being called from inside the class that can't be found!).

Still, if I avoid the logger, then all seems well.

Wednesday, March 15, 2006

Macs


I've had some new toys to play with in the last few days, so my documentation of Kowari has slowed down. I expect to be back to it in a couple of days (though not tonight, and tomorrow is Anne's birthday).

In the meantime, I've been looking at compiling things in Eclipse on the Mac.

For a start, our new Mac Mini (which runs Java code 4 times faster than my PowerBook!) can't run Eclipse. The lack of an Intel version is listed as a bug, but it hasn't been addressed yet. As I type this I have a set of scripts downloading ALL the Eclipse source code, in the hope that I can find the JNI libraries I need. So far I've found one of the 6 I need.

This was all so I could try compiling a big project from work in an Eclipse environment.

Finding Files


While setting up the files for this project from work, I discovered that a Jar file was missing from the build path. I still like to use locate to look for files by name, as Spotlight only seems to be any good when looking for files by content. However, since the file had only just come in through CVS, it wasn't going to be in the locate database.

Of course, the simple thing to do here was to run find, but it reminded me that I wanted to update the locate database. Of course, I couldn't remember the name, as it's different to the simple updatedb found on Linux. Fortunately "locate updatedb" helped me remember that I was looking for locate.updatedb.

This led me to wonder why the update doesn't happen much on this machine. The computer is often on overnight, yet locate regularly fails to find files I know I've had for a while. I know that something has supplanted cron on the Mac, so I did a man locate.updatedb. This pointed me to the /etc/weekly cron script, so I decided to check who was supposed to call it and when.

/etc/weekly is a symlink to /etc/periodic/500.weekly, so I grepped through /etc and discovered:
    crontab:# The periodic and atrun jobs have moved to launchd jobs
Now this looks like the cron replacement! Running which launchd returned /sbin/launchd, which did little more than tell me it is an admin command (obviously). So I checked out the man pages.

The man pages explain that launchd is a new system to replace rc, though there are still a couple of elements going through the old rc system. launchd is configured with XML (as are most of the modern features in OSX), using files found in the following directories:
     ~/Library/LaunchAgents          Per-user agents provided by the user.
     /Library/LaunchAgents           Per-user agents provided by the administrator.
     /Library/LaunchDaemons          System-wide daemons provided by the administrator.
     /System/Library/LaunchAgents    Mac OS X per-user agents.
     /System/Library/LaunchDaemons   Mac OS X system-wide daemons.
Examples are quicker to look at than documentation, so I checked these out, and discovered that all but the final directory were empty. However, the contents here included a file called com.apple.periodic-monthly.plist which appeared to be what I was after. Looking inside I found that it was all XML, as advertised, including the following fragment:
  <key>ProgramArguments</key>
  <array>
      <string>/usr/sbin/periodic</string>
      <string>weekly</string>
  </array>
Explanations of this and the rest of the file are found in man launchd.plist. But even without looking in the man pages I could see that it was calling a program called periodic to do the work when the time arrived.

So now I knew that launchd did the scheduling, and I could change the time if necessary by using the XML file /System/Library/LaunchDaemons/com.apple.periodic-monthly.plist.

But what if I wanted to run the process manually (like, right now)? So I ran file /usr/sbin/periodic, and was told:
/usr/sbin/periodic: Bourne shell script text executable
A quick look inside this script showed that it takes an argument of "daily", "weekly" or "monthly", with predictable results.

So I could run periodic directly, but the man pages from earlier made me wonder if I could do this with launchctl, the "control" program for launchd.

Looking at the man pages again told me that launchctl could be run interactively, which sounded like a better way to test things out. The program seemed unresponsive until I told it to "load /System/Library/LaunchDaemons", which got things moving along nicely. I'm still not sure if these processes were loaded before I put them in, and I didn't want to reboot to see if the clean machine had them.

I started a few of the processes, but 3 of them would not launch, and the system kept respawning them. I was worried that this try/die loop was going to go on indefinitely (or until I learned how to stop it), but fortunately the system worked out what was happening, and stopped them all.

In summary, it seems that launchd takes non-terminating programs that communicate via stdin/stdout and turns them into logged daemons. Having a process that finishes quickly (like periodic) doesn't seem to be what it's for. So to run periodic manually I needed to call it directly.

That was a long-winded way of making sure my locate database stays up to date, but I'm pleased I learned something about the system while I was at it. :-)

Wednesday, March 08, 2006

Documentation

Apologies to anyone not interested in reading my description of the inner workings of Kowari.

There's obviously a lot happening while I try to get used to a new city and lifestyle, so I haven't returned to my research yet. I'm also working on commercial code, which means I have to be more circumspect about what I'm doing. This all means I haven't had the time or motivation to do new things in my own time, making it hard to write about anything interesting!

As an exercise in re-familiarizing myself with Kowari, and keeping in the habit of writing (which can be hard to start again once you've stopped) I decided that I should write about how each class works. I mentioned this shortly after presenting all of this for NGC, but it's taken me this long to get around to it.

I'll write about more interesting things as they come to hand. For instance, I was told yesterday that a major impediment to the adoption of semantics in some areas has been the lack of a scalable statement store. Argh!

AbstractBlockFile

This class does not add much to the semantics of BlockFile, but it has a few internal details worth mentioning.

AbstractBlockFile contains a static block which determines the byte ordering to be used for all files. This is generally the native format for the platform, but it may be overridden by setting the system property tucana.xa.useByteOrder to native (the default), big_endian, or little_endian. This should usually be left alone, but it may be necessary if files have been moved from one computer to another with a different CPU architecture. Upgrading an Apple from PowerPC to the new Intel Core Duo is a good example of the need for this option.
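As a rough sketch of what that static block amounts to (the property name comes from above; the class name and details are illustrative rather than the actual source):

import java.nio.ByteOrder;

class ByteOrderConfig {
  static final ByteOrder ORDER;

  static {
    String configured = System.getProperty("tucana.xa.useByteOrder", "native");
    if ("big_endian".equalsIgnoreCase(configured)) {
      ORDER = ByteOrder.BIG_ENDIAN;
    } else if ("little_endian".equalsIgnoreCase(configured)) {
      ORDER = ByteOrder.LITTLE_ENDIAN;
    } else {
      ORDER = ByteOrder.nativeOrder();   // "native" (the default), or anything unrecognized
    }
  }
}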

The constructor for AbstractBlockFile performs several checks. It starts by checking if the file is already present in a static set of files. If so, then this JVM has already opened the file, so an exception is thrown. Otherwise, the file is added to the set once it has been opened.

The second check in the constructor is to ensure that the file is a multiple of the block size. If not, then it tries to truncate the file to an appropriate length.

Truncation is an awkward operation for a mapped file, as a mapping could still exist from the file being opened and subsequently closed earlier in the lifecycle of the program. If this is the case, an IOException will be thrown, and the mapping must be removed. Running the garbage collector is usually enough to clean up any unused file mappings, but on Windows it is necessary to run the collector several times for this technique to work.

The code to do this looks like:
for (;;) {
  try {
    // Attempt the truncation; if it succeeds, we're done.
    fc.truncate(fileSize);
    break;
  } catch (IOException ex) {
    if (retries-- == 0) {
      // Out of retries: log the failure and let the exception propagate.
      logger.error(
          "Could not truncate file: \"" + file +
          "\" to size: " + fileSize, ex
      );
      throw ex;
    }
    // The truncation probably failed because an old mapping still exists.
    // Encourage the JVM to release it, then try again.
    System.gc();
    try { Thread.sleep(100); } catch (InterruptedException ie) { }
    System.runFinalization();
  }
}
This technique is duplicated in clear(), close() and delete().

Factory Methods

The openBlockFile() methods each take an IOType to determine which of the implementing classes to return an instance of. However, these methods also check the tucana.xa.forceIOType system property to see if this value should be overridden. If the property is set to mapped or explicit then that will dictate if MappedBlockFile or IOBlockFile should be used. Otherwise, the requested IOType will be used.
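
A rough sketch of how that override might be applied (plain strings stand in here for the IOType class described below, and the class and method names are mine):
// Sketch only: the real openBlockFile() methods take more parameters and
// return MappedBlockFile or IOBlockFile instances.  Plain strings stand in
// for the IOType values.
final class BlockFileFactorySketch {

  static final String FORCE_PROPERTY = "tucana.xa.forceIOType";

  /** Returns the IO type to use, given the type the caller requested. */
  static String effectiveIOType(String requestedType) {
    String forced = System.getProperty(FORCE_PROPERTY);
    if ("mapped".equalsIgnoreCase(forced)) return "mapped";
    if ("explicit".equalsIgnoreCase(forced)) return "explicit";
    // Property unset or unrecognized: honour the caller's request.
    return requestedType;
  }

  public static void main(String[] args) {
    System.out.println(effectiveIOType("explicit"));
  }
}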

It should be noted that it is unusual for an abstract class to have information about its implementing subclasses like this. If another type of implementation were to be added, then this super class would need to be updated as well. This sort of dependency is undesirable in object oriented code, but the benefits here were deemed to make it worthwhile.

Opened Files

The constructor for this class does not open the file, but rather calls the ensureOpen() method. This method makes sure that a file is open, and if it is not, then it attempts to perform the open operation. The operation is called in this way to try to cover occasions where a problem occurred (such as a "rollback" operation), and there is some code in the method to handle these conditions.

Several other methods in AbstractBlockFile also call ensureOpen before performing file operations, in case anything has happened to close access to the file.

Note also that access to ensureOpen and the isOpen set is synchronized.

Tuesday, March 07, 2006

BlockFile

BlockFile is an interface for accessing files full of identically sized "blocks" of data. This interface is like IntFile, in that they both describe the abstract methods for accessing their respective files, and both have two concrete implementations: one for mapped file access, and the other for normal I/O access. In the case of BlockFile these are MappedBlockFile and IOBlockFile.

Unlike IntFile, the concrete implementations for BlockFile share a significant portion of their code. This is shared between the two classes by having them both extend an abstract class called AbstractBlockFile, which in turn implements BlockFile. So MappedBlockFile and IOBlockFile do not implement BlockFile directly.

Redundant Interfaces

This arrangement makes BlockFile superfluous, since AbstractBlockFile can be used in every context where BlockFile might have been. However, using an interface in this way is a well established technique in Kowari. It allows for the rapid development of alternative implementations as the need arises. It also permits an object that fulfils a number of roles to implement more than one interface. These advantages both help the codebase stay flexible for ongoing development.

Finally, these interfaces, redundant or not, provide a consistent way of documenting the method of using a class (its platonic "interface", as opposed to the grammatical construct in the Java language).

Records

The point of a BlockFile is to store a series of records of identical structure in a file. Each record will be exactly the same size. The data structure within that record is left up to the caller. Kowari uses BlockFiles to store numerous data types, including strings (of defined lengths), numbers, tree nodes, and linked lists.

Files which hold records like this are essential to scalable systems, as data from anywhere in the system can be found and modified rapidly. Variable length structures can be more space efficient, but they require that the file be traversed before data can be found, and in-place modification is often impractical. They can be indexed for rapid retrieval, but that does nothing to help with modification.

As a side note, the expected replacement for the Kowari file system will use variable length structures, but with inherent indexing to make traversal faster than it currently is. The problem with modifying data will still exist, but this will be managed in a linear annotation queue, which will be merged back into the main data silently in the background. The result should demonstrate much faster read and write operations, particularly for terabyte scale files.

Methods

setNrBlocks() and getNrBlocks() are used to set and get the length of the block file. Since the file may only be viewed as a collection of blocks, its length is only ever measured in units of blocks.

Like all the other file-manipulating classes, clear() is used to remove all data from the file, and set it to zero length. force() is used to block the process until all data is written to disk. unmap() attempts to remove any existing file mappings, close() closes the file, and delete() closes the file and removes it from the disk. See IntFile for a more detailed explanation.

The remaining methods are all for manipulating the blocks within this file.

Blocks

The Block class was created to abstract away access to the records stored within a BlockFile. The block is always tied back to its location on disk, so any operations on that block will always be to the specific location on disk. This is important for memory mapped files, as it means that a write to a block will be immediately reflected on disk (though disk I/O can be delayed by the operating system) without performing any explicit write operations with the block.

Implementations of BlockFile are also expected to manage Blocks through an object pool. This was particularly important in earlier JVMs, where object allocation and garbage collection were expensive operations. Fortunately, recent versions of Java no longer suffer these effects, and an object pool can even slow a process down.

The other reason to use an object pool for blocks is to prevent the need to re-read any blocks still in memory. This is not an issue for memory mapped files, but if the blocks are accessed with read/write operations then they need to be initialized by reading from disk. The object pool for blocks will re-use a block associated with a particular file address whenever one is available. If one is not available, then the standard pool technique of re-initializing an existing object is used. With Java 1.5, it is this last operation which may be discarded.

As a consequence of the above, the remaining BlockFile methods all accept an ObjectPool as a parameter. This may appear unnecessary, since a pool would usually be associated with an instance of the BlockFile class, but in this case there are several pools which may be used, depending on the needs of the calling code.
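
A rough sketch of the idea behind such a pool follows. This is illustration only, not the real Kowari ObjectPool, and the caller still has to read from disk whenever a block was not re-used by its id:
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only; the real Kowari ObjectPool API differs.
final class BlockPoolSketch {

  static final class Block {
    long blockId;
    byte[] data;
    Block(long blockId, int size) { this.blockId = blockId; this.data = new byte[size]; }
  }

  // Released blocks, keyed by the file position they last held.
  private final Map<Long, Block> released = new LinkedHashMap<Long, Block>();
  private final int blockSize;

  BlockPoolSketch(int blockSize) { this.blockSize = blockSize; }

  /** Get a block for blockId, preferring one that already holds that block's data. */
  Block acquire(long blockId) {
    Block b = released.remove(blockId);
    if (b != null) return b;                 // re-use: no re-read needed
    Iterator<Block> it = released.values().iterator();
    if (it.hasNext()) {                      // fall back: re-initialize any spare
      b = it.next();
      it.remove();
      b.blockId = blockId;
      return b;                              // caller must still read from disk
    }
    return new Block(blockId, blockSize);    // pool empty: allocate a new block
  }

  /** Return a block to the pool once the caller has finished with it. */
  void release(Block b) {
    released.put(b.blockId, b);
  }
}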

Block Methods

allocateBlock() is used to get a Block for a particular file position, defined by a blockId parameter, but not to read the file for any existing data. This is for new positions in the file where the data was previously unused. There is no reason to read the data at this point, and the read operation will be time consuming. Writing the resulting block will put its contents into the file at the given block ID.

readBlock() is similar to allocateBlock(), but the block is expected to already contain data. The Block object is allocated and associated with the appropriate point in the file, and data is read into it.

Since read operations are never explicit in memory mapped files, the allocateBlock() and readBlock() methods are identical in the MappedBlockFile class.

Any blocks obtained from allocateBlock() and readBlock() should be passed to releaseBlock() once they have been finished with. This allows the block to be put back into the object pool. While not strictly necessary, it should improve performance.

writeBlock() writes the contents of a Block's buffer to that Block's location on disk. This performs a write() in IOBlockFile and is an empty operation (a no-op) in MappedBlockFile.

The copyBlock() method sets a new block ID on a given Block. This now means that the same data is set for a new block. On IOBlockFile this is a simple operation, as the buffer remains unchanged. However, on MappedBlockFile the contents of the first block have to be copied into the new block's position. This is unusual, since MappedBlockFile usually does less work in most of its methods.

modifyBlock() and freeBlock() should be removed from this class, as they are operations designed for ManagedBlockFile. That class provides a layer of abstraction over BlockFile and will be described later. The only implementations of these methods are in AbstractBlockFile, where an UnsupportedOperationException is thrown.

Block File Types

The BlockFile interface also includes a static inner class which defines an enumeration of the types of BlockFile. This is the standard enumeration pattern used before Java 1.5.

The defined types are:
  • MAPPED: The file is memory mapped. Reading and writing are implicit, and the read/write methods perform no function.
  • EXPLICIT: The file is accessed with the standard read/write mechanism.
  • AUTO: The file is accessed by memory mapping if the operating system supports a 64 bit address space. Otherwise file access is explicit.
  • DEFAULT: This is set to AUTO, unless the value is overridden on the command line. The override is performed by defining tucana.xa.defaultIOType as mapped, explicit or auto.
The code for determining this setup is in the static initializer of the enumeration class.
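
For anyone who hasn't seen the pre-1.5 pattern, it looks roughly like this. This is a sketch only; apart from the constant names and the system property, the details are mine:
// Sketch of the pre-Java-1.5 "typesafe enumeration" pattern, including a
// DEFAULT value chosen from a system property in a static initializer.
public final class IOTypeSketch {

  public static final IOTypeSketch MAPPED   = new IOTypeSketch("mapped");
  public static final IOTypeSketch EXPLICIT = new IOTypeSketch("explicit");
  public static final IOTypeSketch AUTO     = new IOTypeSketch("auto");
  public static final IOTypeSketch DEFAULT;

  static {
    // The override is read once, when the class is first loaded.
    String prop = System.getProperty("tucana.xa.defaultIOType", "auto");
    if ("mapped".equalsIgnoreCase(prop))        DEFAULT = MAPPED;
    else if ("explicit".equalsIgnoreCase(prop)) DEFAULT = EXPLICIT;
    else                                        DEFAULT = AUTO;
  }

  private final String name;

  // Private constructor: the constants above are the only instances.
  private IOTypeSketch(String name) { this.name = name; }

  public String toString() { return name; }
}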

This enumeration class is used by the factory methods in AbstractBlockFile to determine which concrete implementation should be instantiated.

Thursday, March 02, 2006

Posts

I've noticed that my new blog has slightly different options than my old one. For instance, the new blog has a "title" that I can fill in for every post, but this blog doesn't. I wonder why the difference, and if it's possible to upgrade?

Also, I've been using <h3> tags in the new blog, to match the template I'm using there. I notice that <strong> doesn't do a lot in the current stylesheet for this blog, so I decided to trial using <h3> here as well.

MappedIntFile

The natural approach to mapping a file is to simply map the entire file from beginning to end. Any accesses to this file become simple get() and put() operations, which in turn get passed into direct memory accesses.

In Java this suffers from 2 major problems. The first is the already mentioned issue of the 2GB limit, even for 64 bit operating systems.

The second problem is the inherent difficulty in remapping the file when it grows, since it is impossible to guarantee that a file gets unmapped. A new mapping plus the old one may have a combined size larger than the available address space, so it may be impossible to have both in existence at once. Another issue is that remapping a file that has grown is a very expensive operation, particularly when the garbage collector has to be run repeatedly in an attempt to clean up the older file mappings.

A way to avoid all of these issues is to break the mapping into a series of smaller consecutive mappings:
  • Even though the 2GB limit exists for a single mapping, a 64 bit system may map up to hundreds of GB of files.
  • Expanding a file just means adding a new mapping onto the end of the file. For this reason, files get expanded in "chunks". The old mappings can stay where they are without conflicting with the new mapping or crowding out memory.
  • Since the old mappings never need to be cleaned up, there is no cleanup overhead, particularly the expensive garbage collection runs.
These mappings are all managed in an array of MappedByteBuffer called (funnily enough) mappedByteBuffers. Access to the same data as ints or longs is provided by a pair of associated arrays labeled intBuffers and longBuffers.

The use of these arrays of buffers makes everything more complex in this file, but not excessively so. The constructor calculates the number of buffers needed to cover the entire file, and sets up the arrays to hold them. It then passes the work on to a method called mapFile(). This method loops through the array of buffers, and for each element it sets up the mappings to the appropriate areas of the file. This method is also called when a file is expanded, in which case it copies all the existing mappings from the original array, and appends the new mappings to the end of the new array. Any failed attempt at mapping will result in the garbage collector being run, and the mapping attempted once more. This process is in a loop that will run 10 times (yes, that's a magic number that seems to work well. It should probably be configurable).

A mapped file can be expanded or contracted by calling setSize(). There are currently no use cases to shrink a file, except for cases where the file is shrunk to zero size (with clear()), so smaller sizes than the current size are ignored. Once the size is determined, mapFile() is called again to update the mapping arrays and create the new mappings.

Each of the get and put methods first needs to determine the correct mapping to use (by dividing the absolute offset by the size of the mappings) and the offset into that mapping (by taking the absolute offset mod the size of the mappings). Does this remind any Intel "protected mode" programmers of selectors and offsets? :-)

Once the correct offset within the specified mapping is found, then the access goes ahead as normal.
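
In code, the calculation amounts to something like this (a sketch with an illustrative region size and field names, not the actual internals of MappedIntFile):
import java.nio.LongBuffer;

// Sketch of translating an absolute long index into (buffer, offset).
// REGION_LONGS is illustrative; the real mapping size differs.
final class SegmentedAccessSketch {

  static final int REGION_LONGS = 1 << 20;   // longs per mapped region

  private final LongBuffer[] longBuffers;    // one view per mapped region

  SegmentedAccessSketch(LongBuffer[] longBuffers) {
    this.longBuffers = longBuffers;
  }

  long getLong(long index) {
    int region = (int) (index / REGION_LONGS);   // which mapping to use
    int offset = (int) (index % REGION_LONGS);   // position within that mapping
    return longBuffers[region].get(offset);
  }

  void putLong(long index, long value) {
    int region = (int) (index / REGION_LONGS);
    int offset = (int) (index % REGION_LONGS);
    longBuffers[region].put(offset, value);
  }
}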

Some of this may sound like significant overhead, but it is not. The majority of the extra work is simple integer arithmetic, which will be performed in parallel to other operations on most modern processors.

The result is a MappedIntFile which can be expanded while mapping a large portion of memory, can handle more than 2GB, and is as fast as (or faster than) a single large mapping would have been on every operation.

Sunday, February 26, 2006

Grammar
The week was filled with lots of extra-curricular stuff, mostly (though not entirely) due to a visit from the client whose project I'm working on. But by the end of the week I finally had the grammar code parsing a lot of straightforward English into RDF. I also have a few "hacks" to make it a little more RDF friendly, such as picking up prepositions in predicates.

For instance, when I first parsed "The quick brown fox jumps over the lazy dog." I was getting a subject of fox, a predicate of jumps, and an object of dog. This says something a bit different to the original intent. By picking up the preposition for the adverbial phrase, I instead get a predicate of jumps over, which is what I wanted.

It's still rudimentary, but it's very cool to stick in natural English and get out sensible RDF. I'd love to open source this stuff, but I did it on the company's dime, so it's their call. Probably not, but they want to get more involved with OSS, so maybe.

On the other hand, it wouldn't hurt if the code were never open sourced. I didn't really know what I was doing when hacking the grammar code, so it could be a lot prettier (and more extensible). Releasing dirty code into the wild can just be a recipe for embarrassment. :-)

Advanced Degrees
DavidW posted his result on the What Advanced Degree Should You Get test. OK, so these tests are far from giving a definitive portrayal, but I couldn't fight the temptation to fill it in. What do you know? I got:

You Should Get a PhD in Science (like chemistry, math, or engineering)

You're both smart and innovative when it comes to ideas.
Maybe you'll find a cure for cancer - or develop the latest underground drug.

Not really surprising. Maybe it's telling me to hurry up with this OWL/RDF thing and get back to my Physics postgraduate study (I never intended to be away from it for this long).

Personal
Anne has been keeping a blog about our recent move to Chicago. I've had some of my own thoughts and opinions since getting here, so I finally decided to keep a personal blog about it.

It won't be interesting to anyone who doesn't know me, but I thought I'd mention it in case any of my family ever read these posts. :-)

(Besides, an extra link never hurt a Google ranking. Oh wait... it did).

Tuesday, February 21, 2006

Chicago PD
I was going to complain cathartically tonight, but Anne's done it for me. I've started going to the gym again tonight, so that's helped me feel a bit better, despite my lack of catharsis.

We have a chain on the door now, so hopefully we'll have no more incidents. I'm still feeling paranoid though.

Bottom Up
There is a lot to be said for describing a system from the top down. The overall structure comes into view very quickly, and there is a concentration on the general architectural concepts which are so important for understanding and modifying the code.

However, I have found that describing Kowari at a high level always seems to lead to questions about the details. There are concepts that people seem to struggle with until they see how all the details are managed. This encourages me to approach the system from the ground up. An obvious advantage of this approach is that there are no dependencies on unexplained systems. Conversely, a lot of ground has to be covered before the overall structure comes into view.

There are also a number of operations at the bottom levels which are created to support higher level concepts. While the operations are easy to follow, the need for them may be unclear until the higher level code is viewed.

Possibly the best compromise is to start at the top, explain the requisite lower-level details as needed, and return to the top whenever possible. My main problem here is that the theme of the discussion jumps around a lot, and I may also end up discussing details which people didn't need, while skipping those which were more interesting.

In the end I've decided to go bottom-up. At least there I can be consistent with this approach, and I don't have to be concerned about missing details. More importantly, once I start getting higher up in the architecture, it will be possible to refer back and forth as required (that's an advantage of written explanations over spoken ones). The one piece of architecture that I feel is important to know at this level is the Phase Tree design. This was covered in my last entry.

The descriptions I'll provide for the lower level classes should be considered a supplement to the Javadoc. If anyone would like to see more info on anything, then please let me know.

I'll start in the org.kowari.util package.

IntFile
This class provides the semantics of an array of 64 bit long values, backed by a file. It has operations to set the size of the array (setSize(int)), and to set and get values from any position from 0 to the end of the array (putLong(long, long) and getLong(long)).

A less used set of options is the getters and setters which can be used to treat the structure as an array of 32 bit int values (putInt(long, int) and getInt(long)), or 8 bit byte values (putByte(long, byte) and getByte(long)).

This class also provides storage for an array of "unsigned" 32 bit integers. This is handy for file access, and managing some data types. However, unsigned values are not directly supported in Java (with the exception of the char type). To manage them, they are passed around in 64 bit longs so they get handled correctly, but they are still stored on disk in a 32 bit pattern. The methods here are called putUInt(long, long) and getUInt(long).
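
The conversion involved is a standard trick, something like the following (my own sketch, not the actual Kowari code):
// Sketch of storing an "unsigned" 32 bit value in a 32 bit slot while
// handling it as a long in Java.  Not the actual Kowari implementation.
final class UnsignedIntSketch {

  /** Narrow an unsigned value held in a long down to the 32 bits stored on disk. */
  static int toStoredBits(long unsignedValue) {
    return (int) unsignedValue;              // keeps the low 32 bits
  }

  /** Widen the stored 32 bits back to a non-negative long. */
  static long fromStoredBits(int storedBits) {
    return storedBits & 0xFFFFFFFFL;         // mask prevents sign extension
  }

  public static void main(String[] args) {
    long big = 0xCAFEBABEL;                   // 3405691582: too big for a signed int
    int bits = toStoredBits(big);             // negative when viewed as a signed int
    System.out.println(bits);                 // -889275714
    System.out.println(fromStoredBits(bits)); // 3405691582 again
  }
}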

Concrete Classes
IntFile is an abstract class, and implements minimal functionality. The real work is done in a pair of concrete classes called ExplicitIntFile and MappedIntFile.

MappedIntFile uses the New IO (NIO) functionality introduced in Java 1.4. It memory maps a file, allowing the underlying operating system to manage reading and writing of the required memory buffers. This leverages the efficient management of disk and memory that is essential in modern operating systems, and avoids the traditional copy to or from additional memory buffers.

Unfortunately, memory mapping files in this way is restricted by the size of the address space of the process. On a 32 bit system, this creates a maximum limit of 4GB, though the program and the operating system must use some of this space. 64 bit systems don't do much better, as many 64 bit operating systems impose limits within their address space for various operational reasons. In Java, the addresses for NIO are all 32 bit int values. Since this is a signed value (the sign using up one bit), the maximum size for a single file mapping is 2GB.

All this means that if a Kowari instance needs more space, it has to revert to using standard read/write operations for accessing the files. This is the purpose of the ExplicitIntFile class. Its implementation of each of the put/get methods calls down to read/write methods on the underlying java.nio.channels.FileChannel object. This implementation is not as fast as MappedIntFile, but it can operate on much larger files, particularly when address space is at a premium.
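
To give a flavour of the explicit path, the access boils down to calls like these. This is a simplified sketch: the real class also deals with byte order and partial reads, and its internals may differ.
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

// Illustration of explicit (non-mapped) access to a long at a given index.
final class ExplicitAccessSketch {

  private final FileChannel fc;

  ExplicitAccessSketch(FileChannel fc) { this.fc = fc; }

  long getLong(long index) throws IOException {
    ByteBuffer buffer = ByteBuffer.allocate(8);
    fc.read(buffer, index * 8);       // read 8 bytes at the file position
    buffer.flip();
    return buffer.getLong();
  }

  void putLong(long index, long value) throws IOException {
    ByteBuffer buffer = ByteBuffer.allocate(8);
    buffer.putLong(value).flip();
    fc.write(buffer, index * 8);      // write 8 bytes at the file position
  }
}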

More IntFile
The constructors for each of the concrete implementations of IntFile are all default (not public). Instead, the IntFile class contains factory methods which instantiate the appropriate concrete class according to requirements.

The factory methods are all called open(...), and accept either a String or a java.io.File object. They will open an existing file when one exists, or create a new file otherwise.

First of all, the system properties are checked for a value called "tucana.xa.forceIOType". If it doesn't exist, then it defaults to using MappedIntFile. Otherwise, it makes a choice based on the expected values of "mapped" or "explicit". Any other value will fall back to MappedIntFile and give a warning in the log.

Unfortunately, we have found a bug that occasionally manifests in ExplicitIntFile. DavidM tracked it down, but it is tough to reproduce. I don't have the details, but on hearing this I audited this class, and it all appears correct. Until we have a solution to this problem, the factory method is temporarily using MappedIntFile in all cases (this has been a problem for some users, and needs fixing).

Forcing, Clearing, and Deleting
The remaining methods on IntFile are for managing the file, rather than the data in it.

force() is used to ensure that any data written in the putXXX(long, XXX) operations has been written to disk. This relies on the operating system for the guarantee of completeness. This is relevant to both mapped and explicit IO, as both are affected by write-behind caching. This operation is essential for data integrity, when the system needs to know that all files have reached some fixed state.

clear() leaves the file open, but truncates it to zero length, thereby deleting the entire contents of the array.

close() simply releases the resources used for managing the file, and is primarily used during shutdown of the system. delete() also releases all the resources, but then deletes the file as well.

The final method to consider is unmap(). This method is only relevant to the MappedIntFile class, but it must be called regardless of the implementing class used. This is because the calling code cannot know which implementation of IntFile is being used (this is intentionally hidden, so it can be swapped with ease).

When unmap() is called on MappedIntFile, all the references used for the mapping are explicitly set to null. This allows the Java garbage collector to find the mappings and remove them, thereby freeing up address space.

This is an important operation, but it is difficult to enforce. Java does not permit explicit unmapping of files, since allowing it would permit access to memory which is not allocated (a General Protection Fault on Windows, a segfault on x86 Linux, and a bus error on Sparc Linux and Mac OS X). The closest we can come to forcing this behavior is to make the mapping available to be garbage collected, and run the garbage collector several times until the mapping has been cleaned up. This actually works on most operating systems, but on Windows the collector needs to be run many more times before the mapping is released.

Late Night
There are some specific details in MappedIntFile that need to be addressed, but I'll have to leave here and get some sleep. Hopefully I won't be woken by the Police tonight...

Thursday, February 16, 2006

Expenses
No blogging last night, as I was working on expense reports. Somehow I doubt that work will appreciate that I had to stay up until 1am, but they'd certainly notice if I hadn't. Even so, I had to spend more time on them this afternoon.

As if that weren't enough, I've also filled in all the paperwork required to become a full-time employee. It's a little scary when you have no idea of the tax system in this country. It also doesn't help that I don't have an SSN.

I applied for an SSN last week (my first day in town), but they told me that USCIS (formerly the INS) have not yet entered our names into the system. Apparently that takes 10 working days. Getting an SSN after that will take another 10 working days. So it will take me a month to get the one piece of ID needed to do practically ANYTHING in this country.

It seems a little weird that it's this hard, given that every person who is given a work visa here will require an SSN. So why not process it all together and hand out the SSN with the visa? Not to mention that some official documents require a Resident Alien number, but I have to apply for that separately as well! I'm not sure where to get it either. Possibly USCIS.

In the meantime, Anne can't get an SSN because she's not entitled to work. This is despite the fact that her visa does allow her to work! It turns out that she has to go back to USCIS and apply for permission to work. Once she has that she can apply for her own SSN.

I'd hate to know what will happen to us if we have to talk to a doctor. We're insured, but won't they ask for an SSN?

Bus Error
It turns out that I was a little sloppy with some recent JNI code. If my Dictionary class failed to load in C, then it returned a null, as documented. However, I blithely stored this pointer and went on to use it.

Until now the Dictionary class always loaded, because I always made sure the dictionary files were available. But yesterday I started running a new set of tests in a separate directory, and they immediately reported: bus error.

I forgot how undescriptive a fault in C can be. :-)

It shouldn't have taken long to find this, but I used Eclipse to trace where the fault was occurring. The GUI is always slower on a Mac, but for some reason this time it brought the whole machine to its knees. The trackpad was so jerky and intermittent that I started to worry that it would need fixing. It was only after tolerating this for an hour that I realized that the problem was not the trackpad, but that the machine was struggling too much. Shutting down Eclipse fixed the problem.

I'd be tempted to continue working from the command line, along with VIM, but Eclipse can't stand me fixing files outside of its domain. Why should it care?

At least I discovered the problem easily enough, and fixed it by bringing the dictionary files into the test area. I've also gone into the code, and I now throw exceptions when a C library fails to initialize. I'm disappointed I hadn't done this in the first place, but at least I'll remember now (once bitten, twice shy).
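
The Java side of that check amounts to something like this (a sketch with invented names; the real code differs, and the native method obviously needs its C counterpart before it can run):
// Sketch only: guard a handle returned from native initialization code.
// The class and method names here are invented for illustration.
final class NativeDictionarySketch {

  // Would be implemented in C via JNI; returns 0 when loading fails.
  private static native long nativeInit(String dictionaryPath);

  private final long handle;

  NativeDictionarySketch(String dictionaryPath) {
    long h = nativeInit(dictionaryPath);
    if (h == 0) {
      // Fail loudly here, instead of storing the null pointer and
      // crashing with a bus error (or segfault) later on.
      throw new IllegalStateException(
          "Could not initialize native dictionary from: " + dictionaryPath);
    }
    this.handle = h;
  }
}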

BTW, has anyone else noticed that Intel (and x86 compatible) based systems report a segfault when dereferencing 0, while Apple and Sparc systems report a bus error? I always think of null dereferences as being segfaults, so I used to get confused when I saw bus errors like this. I wonder why there's a difference.