For the past little while I've been working on extracting candidate Gene Ontology annotations from the hypertext of articles in the Gene Wiki. In this work I have been using two of the premier tools for concept recognition in bioinformatics - MetaMap from the National Library of Medicine and the Annotator from the National Center for Biomedical Ontology. As elaborated on in this article, these systems work very differently and, depending on the input text, can yield substantially different results. Since I was at a loss when I first had to decide which tool to use, I thought I'd share some of the results of my experiments in the hope that they might help someone else facing the same decision.
Input: text from about 10,000 Gene Wiki articles (both complete sentences and the titles of pages linked to from a gene page)
Output: concepts from the GO with some form of linguistic match to the text.
Current Rounded Results:
GO concepts detected: Annotator about 20,000, MetaMap about 35,000, Intersection (detected by both tools) about 14,000 (See diagram below)
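To put those rounded numbers in perspective, here is a small Python sketch of the arithmetic behind the overlap. It assumes the three counts can be treated as set sizes (concepts found by the Annotator, by MetaMap, and by both). The function name and dictionary keys are my own, made up for illustration, and the derived figures are only as precise as the rounded inputs.

```python
def overlap_summary(n_a, n_b, n_both):
    """Derive exclusive counts, union size, and overlap ratios from set sizes."""
    union = n_a + n_b - n_both  # inclusion-exclusion
    return {
        "only_a": n_a - n_both,          # found by tool A alone
        "only_b": n_b - n_both,          # found by tool B alone
        "union": union,                  # found by at least one tool
        "jaccard": n_both / union,       # shared fraction of everything found
        "a_also_in_b": n_both / n_a,     # fraction of A's output that B also found
        "b_also_in_a": n_both / n_b,     # fraction of B's output that A also found
    }


# Rounded counts from the results above: A = Annotator, B = MetaMap.
summary = overlap_summary(20_000, 35_000, 14_000)
for name, value in summary.items():
    print(name, round(value, 3) if isinstance(value, float) else value)
```

With these inputs, roughly 6,000 concepts come from the Annotator alone and 21,000 from MetaMap alone, so about 70% of the Annotator's output is also found by MetaMap but only about 40% of MetaMap's output is also found by the Annotator.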
Another important consideration is speed of execution. In my experiments, a locally installed version of MetaMap took about twice as long to run the same jobs as the Annotator Web service, even though the Annotator calls also had to absorb network latency. The output from both tools was fairly easy to parse. (I found it much easier to simply parse the results myself than to work within the UIMA wrappers that are available for both systems.)
So there you go, same same but different.