Wednesday, March 16, 2005

Catching Up
Just a quick note to say that, yes, I am still working. Today is Anne's birthday, so last night I was getting a few things ready for it, and tonight we went out for dinner. It just makes it difficult to any writing done at night. We're also going away this weekend, so I can't catch up then. Looks like Monday will be a late night of writing. :-)

Since it's now late, I can't really go through anything in detail. So I'll just give an overview and fill in the detail later.

Joins
Still working through the join code yesterday. I hadn't picked up all of it, but I learnt enough to work out what was needed for the difference code, and have made a design for it.

I've decided not to order the left hand side of the operation, which is a compromise on complexity, but there are good reasons for this decision. It comes down to the fact that differences like this are rarely on large enough data sets for sorting to save any time, and because data to be removed is typically sparse, and not in consecutive node IDs.

I also got to chat with Simon in person today, and I've finally worked out the last missing pieces of the join code, to the extent that I can work out the complexity of the operation (a two operand join comes to n.log(N), where n is the size of the first operand, and N is the size of the database). The implementation of the difference code will have the same complexity, so this seems reasonable. It may be possible to improve this for large enough data sets if another algorithm is used, but it will want some heuristics to work out the changeover point between the two methods. For now, the current design should work just fine.

While I'm discussing optimisations like this, it may also be possible to improve certain types of joins for large enough data sets, but again it will require some heuristics to work out the changeover point between the two algorithms. There shouldn't be a need to go this far until we see really poor performance on our joins (which we've never seen).

So I now understand joins, I have a solid design for the difference operator, and I've written part of the class to do this.

Data Retrieval and Logic
Today the department held the first part of a two part seminar on data retrieval. Today's seminar covered indexing, and query resolution. It was amazing just how much of it was like learning how Lucene works, although there were a couple of new concepts.

The other part of today covered bi-lingual and multi-lingual querying. The assumption is that all documents are indexed according to words. This means that the best way to do a multi-lingual query is to translate the query into other languages, and try to match to documents, rather than indexing different translations of the documents.

However, since I've been looking at semantic analysers recently, I started wondering about indexing on concept (an idea that was mentioned, but not pursued).

The whole thing gave me an idea for a simple document concept analyser, based on using groups of words in a thesaurus as a pseudo-concept. It has its problems, but there are almost NO open source concept analysers out there, so maybe it would be worthwhile implementing anyway. If it's useful then people might even be prepared to work on it to overcome the deficiencies. :-) I'll have to describe it soon.

After the seminar I was going to catch up with Simon for lunch, but it turned out that he had a "Logic Group" meeting with his supervisor, Guido. I need to improve my logic (for the theoretical component of the work I'm currently doing), so I asked Guido if I could sit in. I found it really valuable, as it answered a number of questions I have about notation and some fundamentals of logic syntax.

With the exception of a short discussion with Simon on the operation of "join", the rest of the day was spent on code.

No comments: