Monday, January 09, 2006

The interfaces in UIMA have seemed rather opaque recently.

To process a document, an analysis module is given a CAS (Common Analysis Structure), and returns a CAS. The CAS contains a reference to the original document, and any annotations that have been made so far. A module can ask the CAS for the original document, perform its analysis, and add any annotations back into the CAS. At the end of the process, the UIMA framework returns a CAS object which can be checked for annotation properties.
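To make that flow concrete, here is a toy sketch in plain Java. None of these classes are UIMA's: ToyCas, ToyAnnotation, ToyAnalysisModule and TokenModule are names I made up for illustration, and the real framework does a great deal more (type systems, indexes, descriptors). It only shows the shape of the idea: a container holds the document plus annotations, each module reads the text and adds to it, and the caller inspects the container at the end.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for the processing flow described above. NOT the UIMA API.
final class ToyCasPipeline {

    /** A labelled span of the document text (hypothetical, not UIMA's Annotation). */
    static final class ToyAnnotation {
        final String label;
        final int begin;
        final int end;

        ToyAnnotation(String label, int begin, int end) {
            this.label = label;
            this.begin = begin;
            this.end = end;
        }
    }

    /** Holds the original document and every annotation made so far. */
    static final class ToyCas {
        private final String documentText;
        private final List<ToyAnnotation> annotations = new ArrayList<>();

        ToyCas(String documentText) {
            this.documentText = documentText;
        }

        String getDocumentText() {
            return documentText;
        }

        void add(ToyAnnotation annotation) {
            annotations.add(annotation);
        }

        List<ToyAnnotation> getAnnotations() {
            return annotations;
        }
    }

    /** A module receives the container and adds its results to it. */
    interface ToyAnalysisModule {
        void process(ToyCas cas);
    }

    /** Example module: annotates each whitespace-separated token. */
    static final class TokenModule implements ToyAnalysisModule {
        @Override
        public void process(ToyCas cas) {
            String text = cas.getDocumentText();
            int i = 0;
            while (i < text.length()) {
                // Skip whitespace, then consume one run of non-whitespace.
                while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                    i++;
                }
                int begin = i;
                while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                    i++;
                }
                if (i > begin) {
                    cas.add(new ToyAnnotation("Token", begin, i));
                }
            }
        }
    }

    /** Runs every module over one shared container and hands it back. */
    static ToyCas run(String text, List<ToyAnalysisModule> modules) {
        ToyCas cas = new ToyCas(text);
        for (ToyAnalysisModule module : modules) {
            module.process(cas);
        }
        return cas;
    }

    public static void main(String[] args) {
        ToyCas cas = run("blue smurf", List.of(new TokenModule()));
        for (ToyAnnotation a : cas.getAnnotations()) {
            System.out.println(a.label + " [" + a.begin + ", " + a.end + ")");
        }
        // prints:
        // Token [0, 4)
        // Token [5, 10)
    }
}
```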

Adding annotations to a CAS is easy for a module: it creates an instance of the appropriate Annotation subclass for the current CAS (the CAS object is passed as a parameter to the Annotation's constructor). It is also easy to read these annotations after UIMA has finished. However, the CAS objects in each case are accessed through different classes, with different features. In an analysis module I have a JCas object, but the results of a UIMA run come back as a TCAS object instead. TCAS is actually just a specialization of the standard CAS for handling text, but I didn't see that immediately (I should have paid closer attention).

Fortunately, a JCas object has a method called getCas(), and a CAS has a getJCas() method, as these objects have a 1:1 correspondence. Why have two objects? I think it's to separate the CAS concepts from the Java-specific information needed to manipulate a CAS. But don't quote me.
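A tiny sketch of what a 1:1 pairing like this can look like, again in plain Java with invented names (ToyCas and ToyJCas are not the UIMA classes, and creating the wrapper lazily and caching it is my assumption, not something I have checked in the framework). The point is only that each side can always get back to the other.

```java
// Toy model of two paired views of the same data. NOT the UIMA classes.
final class PairedViewsDemo {

    /** The underlying container. */
    static final class ToyCas {
        private ToyJCas jcas; // created on first request, then reused

        ToyJCas getJCas() {
            if (jcas == null) {
                jcas = new ToyJCas(this);
            }
            return jcas;
        }
    }

    /** The Java-friendly wrapper; it always knows which container it wraps. */
    static final class ToyJCas {
        private final ToyCas cas;

        private ToyJCas(ToyCas cas) {
            this.cas = cas;
        }

        ToyCas getCas() {
            return cas;
        }
    }

    public static void main(String[] args) {
        ToyCas cas = new ToyCas();
        ToyJCas jcas = cas.getJCas();
        System.out.println(jcas.getCas() == cas);  // prints "true"
        System.out.println(cas.getJCas() == jcas); // prints "true"
    }
}
```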

Separating these two classes and handing out different types in different circumstances has tripped me up a bit. Maybe someone can explain the reasoning, but so far I've just found it annoying.

1 comment:

Anonymous said...

Hi. I am an IBM internal user of UIMA but not an SDK developer or official spokesperson or anything. I just ran across this entry in a web search, and here are my 2 cents on this subject:

CAS and TCAS are the original interfaces for interacting with UIMA data. If I wanted to create a feature structure of type org.example.Smurf and set a feature named "color" to "blue" using the CAS interface, I would do something like the following, if I recall correctly:

Type smurfType = myCas.getTypeSystem().getType("org.example.Smurf");
FeatureStructure mySmurf = myCas.createFS(smurfType);
Feature color = smurfType.getFeatureByBaseName("color");
// setFeatureValue expects a FeatureStructure; a string-valued feature uses setStringValue
mySmurf.setStringValue(color, "blue");

This mechanism works, but it is kind of clumsy for routine use and doesn't really make full use of the object-oriented paradigm. Thus the JCas was added to provide a different way to interact with CAS information. Systems that use JCas require that the JCasGen program be used to create a distinct Java class for each type in the type system. These systems then interact directly with those types, e.g.:

Smurf smurf = new Smurf(myJCas);
smurf.setColor("blue");

This is very convenient for typical cases like the ones above. However, it is fairly inconvenient if you want to write a more generic CAS processor that doesn't have a predefined set of types and features it manipulates, e.g., one that accesses an external database to give it names of types and features. In the CAS interaction above, you could just replace the literal strings with string variables that were populated from a database. To accomplish the same thing with JCas would require some fairly complicated Java reflection code (especially to do the "setColor"). Thus JCas exists as a friendly interface for typical use and the original CAS interface is still available for more complex/abstract interactions.
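To illustrate the trade-off, here is a rough sketch in plain Java that does not use UIMA at all. The Smurf class below is a hand-written stand-in for what a generated class might look like, the map stands in for a string-keyed feature structure, and both helper methods are hypothetical names. It just contrasts setting a feature whose name arrives at runtime through a string-keyed interface with doing the same against a typed class, where the setter has to be looked up by reflection.

```java
import java.lang.reflect.Method;
import java.util.HashMap;
import java.util.Map;

// Toy comparison of string-keyed access vs. typed access via reflection.
// Nothing here is UIMA code; all names are made up for illustration.
final class GenericAccessDemo {

    /** Hand-written stand-in for a generated, strongly typed class. */
    public static final class Smurf {
        private String color;

        public void setColor(String color) {
            this.color = color;
        }

        public String getColor() {
            return color;
        }
    }

    /** String-keyed style: a runtime feature name is used directly. */
    static Map<String, String> setDynamically(String feature, String value) {
        Map<String, String> featureStructure = new HashMap<>();
        featureStructure.put(feature, value);
        return featureStructure;
    }

    /** Typed style with runtime names: the class and setter must be found reflectively. */
    static Object setReflectively(String className, String feature, String value) {
        try {
            Class<?> type = Class.forName(className);
            Object instance = type.getDeclaredConstructor().newInstance();
            // Build the conventional setter name, e.g. "color" -> "setColor".
            String setterName = "set"
                    + Character.toUpperCase(feature.charAt(0))
                    + feature.substring(1);
            Method setter = type.getMethod(setterName, String.class);
            setter.invoke(instance, value);
            return instance;
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException(
                    "Cannot set " + feature + " on " + className, e);
        }
    }

    public static void main(String[] args) {
        // Imagine these strings came from an external database.
        String feature = "color";
        String value = "blue";

        System.out.println(setDynamically(feature, value).get(feature)); // prints "blue"

        Smurf smurf = (Smurf) setReflectively(Smurf.class.getName(), feature, value);
        System.out.println(smurf.getColor()); // prints "blue"
    }
}
```

The reflective version also has to guess the setter naming convention and the parameter type, which is exactly the kind of fragility that makes the string-keyed interface the better fit for generic processors.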