Wednesday, September 22, 2010

Trials and Tribulations with the UIMA wrapper for the NCBO annotator

As I've recently learned, UIMA stands for 'Unstructured Information Management Architecture'. UIMA emerged from the bowels of IBM Research and is now a full-fledged, open source Apache software project. From the Apache description:

"Unstructured Information Management applications are software systems that analyze large volumes of unstructured information in order to discover knowledge that is relevant to an end user. An example UIM application might ingest plain text and identify entities, such as persons, places, organizations; or relations, such as works-for or located-at."
I came across UIMA via this article about "A UIMA wrapper for the NCBO annotator" by Christopher Roeder and friends from Colorado and Stanford. For those who are unfamiliar, the annotator is a fairly new web service for identifying terms from biomedical ontologies in text. (Here is a nice little interface you can use to see what it can do.) As I'm always looking for ways to avoid reinventing the wheel, I was hoping I would be able to use this wrapper on top of this well-established framework to quickly build up a nice client for processing some Gene Wiki-related text. It turned out that, aside from my hopes for a quick solution, this is pretty much what I found.

Your results may vary, but here are my key early experiences with the UIMA wrapper for the NCBO annotator:
  1. I started from the BioNLP code available from the folks in Colorado.
  2. I had a lot of trouble building it.
    1. There were missing jar files referred to in the ant build file.
    2. Once everything was there, I could not run the main application via ant. I quickly gave up on ant and started to rebuild the entire project and its dependencies in Eclipse. This was probably a mistake - I should have stuck with the provided ant-based setup and figured it out, but I got impatient...
    3. Once you get into the insides of the system you will see that it is quite complex. If you are not an advanced Java programmer you should not look inside - I thought I was fairly advanced until I got started.
    4. The saving grace throughout this process was that the primary author on the wrapper paper, Chris, was very responsive, patient and helpful.
  3. Once built, it again took me a while to understand how the whole thing was supposed to work.
    1. A key step forward was launching the 'CPE' GUI application - otherwise known as the 'Collection Processing Engine Configurator'. (main class was
    2. From here I finally started to find the 'plug and play' kind of functionality I was hoping for in this framework. From the GUI, you can choose from XML files that configure components of an analysis such as a directory reader, a sentence detector, the NCBO annotator service(!!), and a basic results exporter. These XML files appear in a 'desc' directory that comes with the distribution. Each one maps to a Java class that implements the component and uses the parameters the XML file contains or collects.
  4. Now that I've got it running (this took a few days), here are the main conclusions:
    1. If you aren't well-versed in Java, don't try this at home.
    2. This was definitely slower than writing my own client, BUT!!
    3. This client is probably much better than what I would have naively done because:
      1. it includes error handling
      2. it throttles itself via sentence splitting (enabled via a third-party UIMA component)
      3. it seems to go much faster, though I'm not sure why (perhaps parallel requests via multi-threading, perhaps the sentence splitting step makes the annotator happier)
      4. it has also pushed me much closer to being able to run a large collection of powerful tools such as Open DMAP (whether I ever head down that road, time will tell).
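To make the error handling and sentence splitting points above a bit more concrete, here is a minimal sketch of the general idea in plain Java. This is not the wrapper's actual code: the class name, the `annotateAll` method, and the `annotator` function (a stand-in for a call to the annotator web service) are all hypothetical, and the JDK's `BreakIterator` stands in for the third-party UIMA sentence detector.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

// Hypothetical illustration only; not part of bionlp-uima.
class SentenceThrottledClient {

    // Split text into sentences using the JDK's BreakIterator
    // (an assumed stand-in for the UIMA sentence detector component).
    static List<String> splitSentences(String text) {
        List<String> sentences = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    // Send one sentence at a time to the annotator, retrying failed calls
    // up to maxAttempts times and skipping a sentence that never succeeds.
    static List<String> annotateAll(String text,
                                    Function<String, List<String>> annotator,
                                    int maxAttempts) {
        List<String> results = new ArrayList<>();
        for (String sentence : splitSentences(text)) {
            boolean done = false;
            for (int attempt = 1; attempt <= maxAttempts && !done; attempt++) {
                try {
                    results.addAll(annotator.apply(sentence));
                    done = true;
                } catch (RuntimeException e) {
                    // Treat any failure as transient and try again.
                }
            }
            if (!done) {
                System.err.println("Skipping sentence after " + maxAttempts
                        + " attempts: " + sentence);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        // Fake annotator: "finds" the term 'gene' wherever it appears.
        Function<String, List<String>> fakeAnnotator = s ->
                s.toLowerCase().contains("gene")
                        ? Collections.singletonList("gene")
                        : Collections.<String>emptyList();
        String text = "BRCA1 is a gene. It is linked to cancer.";
        System.out.println(annotateAll(text, fakeAnnotator, 3));
    }
}
```

A real client would replace `fakeAnnotator` with an HTTP call to the service. Keeping each request down to a single sentence means one failure costs only one small retry rather than a whole document.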
In conclusion, it feels like this project (bionlp-uima) is basically one step away from being a powerful, useful tool that the bioinformatics community could really benefit from. That step is really to do the beauty work - to make an application for people rather than just a code collection for hackers. The project reminds me a lot of my all-time favorite open source software project, WEKA - the Waikato Environment for Knowledge Analysis. WEKA contains implementations of a large collection of machine learning algorithms along with tools for experimenting with them. The key difference is that it has a stable click-and-run user interface to provide access to those tools (though you can still access, change, and learn from the large Java stack that runs it). If the BioNLP code were wrapped up in such a framework, I suspect they would get many more users. I would certainly be much happier ;).

(Note: this is my first foray into this kind of unstructured information extraction. If you know of better ways to do it, please do let me know!)