Friday, July 22, 2011

bio-ontologies 2011 - ontologies need lexicons

I recently returned from my first ISMB in Vienna, Austria, where I presented a SNPedia-Gene Wiki mashup at the Bio-Ontologies special interest group meeting.  (You can get to most of the other papers at the Knowledge Blog.)  Unusually, and happily, the SIG came to a close with a discussion that I, at least, felt was a specific call to arms that may end up seeding a number of research programs for the next little while.  In a nutshell, we need ontologies to provide better support for text-mining tools by providing examples of the terms/phrases that best match the concepts in the ontology as they occur in important, relevant bodies of text like PubMed abstracts.  Read on for the non-nutshell explanation.

There were a number of papers in the SIG that made use of text mining services that performed concept recognition using biomedical ontologies for one reason or another.  It seemed like every other talk mentioned the NCBO Annotator, including ours.  Basically, despite the best efforts of the semantic web community and the world's databases, the great majority of the knowledge in the world is still shared as plain old text.  As a result, if you want to get something useful done immediately, like test for an association between a drug (like Vioxx) and a nasty outcome (like a heart attack), you can do a much better job of it if you can somehow compute with large bodies of text (in this case clinical records) than if you limit yourself to the comparatively small body of knowledge that has been clearly represented in database records (see [1]).

Now, to 'compute with large bodies of text', most approaches in the biomedical domain involve some mapping from the text to terms in an ontology.  Such mappings make it possible to test for associations between the mapped concepts (e.g. drugs and diseases) using fairly straightforward statistical techniques.  The trouble, of course, is that the mapping process generally remains highly error-prone (*).  One of the reasons for this is that text2concept services like the Annotator are actually text2text services.
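To make "fairly straightforward statistical techniques" concrete, here is a minimal sketch of one such test: an odds ratio with a 95% confidence interval computed from a 2x2 table of concept co-occurrences.  This is my own illustration, not the method used in any of the SIG papers.  The concept IDs and the record counts are made-up toy values, and the function names are hypothetical.

```python
import math


def count_table(records, drug, outcome):
    """Build a 2x2 table from records, each a set of mapped concept IDs.

    Returns (a, b, c, d):
        a = drug and outcome      b = drug, no outcome
        c = outcome, no drug      d = neither
    """
    a = b = c = d = 0
    for concepts in records:
        has_drug = drug in concepts
        has_outcome = outcome in concepts
        if has_drug and has_outcome:
            a += 1
        elif has_drug:
            b += 1
        elif has_outcome:
            c += 1
        else:
            d += 1
    return a, b, c, d


def odds_ratio_with_ci(a, b, c, d, z=1.96):
    """Odds ratio and an approximate 95% confidence interval (Woolf method)."""
    # Add 0.5 to every cell if any cell is empty, to avoid division by zero.
    if 0 in (a, b, c, d):
        a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    ratio = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    low = math.exp(math.log(ratio) - z * se)
    high = math.exp(math.log(ratio) + z * se)
    return ratio, low, high


# Toy data only: 300 invented "records" with placeholder concept IDs.
records = (
    [{"DRUG:X", "OUTCOME:Y"}] * 20
    + [{"DRUG:X"}] * 80
    + [{"OUTCOME:Y"}] * 10
    + [set()] * 190
)
table = count_table(records, "DRUG:X", "OUTCOME:Y")
ratio, low, high = odds_ratio_with_ci(*table)
print(table, round(ratio, 2), round(low, 2), round(high, 2))
```

Everything here depends on the concept sets being right in the first place, which is exactly where the mapping errors discussed next come in.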

To find an occurrence of the ontology term GO:0006915 in text, the simplest, fastest way to go about it is to search for its preferred label 'apoptosis' in the input text.  This is basically all the Annotator does (with some tricks to make it go really fast and to catch non-exact matches).  This means that any concept matching system like the Annotator is fundamentally dependent on the labels associated with the concepts in the ontologies that are fed to it.  If the labels don't line up well with actual usage in the texts where you want to perform the mapping, then the system doesn't work very well.  We noticed this in the context of our work on identifying ontology terms in the text of Gene Wiki articles.
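The label-lookup idea described above can be sketched in a few lines.  This is not the Annotator's actual implementation (which, as noted, has its own tricks for speed and non-exact matching).  It is just a naive, case-insensitive dictionary matcher with hypothetical function names, and the example sentence is invented.

```python
import re


def build_matcher(labels):
    """labels: dict mapping concept ID -> list of label strings."""
    lookup = {}
    for concept_id, names in labels.items():
        for name in names:
            lookup.setdefault(name.lower(), set()).add(concept_id)
    if not lookup:
        raise ValueError("no labels to match against")
    # Try longer labels first so a long phrase wins over a label it contains.
    ordered = sorted(lookup, key=len, reverse=True)
    alternation = "|".join(re.escape(name) for name in ordered)
    pattern = re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)
    return pattern, lookup


def annotate(text, pattern, lookup):
    """Return (start, end, matched text, concept ID) for each label hit."""
    hits = []
    for match in pattern.finditer(text):
        for concept_id in sorted(lookup[match.group(0).lower()]):
            hits.append((match.start(), match.end(), match.group(0), concept_id))
    return hits


labels = {"GO:0006915": ["apoptosis"]}
pattern, lookup = build_matcher(labels)

# Invented sentence: one mention uses the label, the other uses a variant.
text = "Caspase activation triggers Apoptosis here; the apoptotic response was absent in knockouts."
hits = annotate(text, pattern, lookup)
print(hits)
```

The matcher finds 'Apoptosis' but is blind to 'apoptotic', because that string is not among the labels it was given.  The quality of the mapping is capped by the quality of the label list.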

In the Gene Wiki, terms from some ontologies are much easier to match than others.  For example, we found that the Annotator picked up Disease Ontology terms with about 90% precision while its precision with the Gene Ontology was less than 50% on exactly the same text.  Others have reported similar results (though in many cases the work is not published because the scores are so low).  In fact, the text mining community seems to have basically given up on the challenge of mapping to the GO.  After the final keynote at the SIG, I asked Martin Krallinger if the BioCreative text mining competition was planning to work on the GO again, as it had in its first iteration several years ago, and he basically said no because it was impossible.

I suggest that a key reason that the GO isn't working very well with the Annotator and other tools is that the labels attached to concepts weren't generated with text mining in mind.  As a result, they sometimes never occur at all in the typical target text (journal articles and abstracts in PubMed) and sometimes occur over and over again but mean something completely different than the concept they are linked to in the ontology.

To make advances in our use of the GO in the context of text mining, we need more useful labels for the concepts.  I suspect this is the case for many of the ontologies in use on the semantic web.  So, here are a couple of (IMO) very achievable things that our community needs done.

1) The semantic web community as a whole would benefit from a consistent, clear RDF structure for capturing textual representations of ontology concepts in OWL/RDF that are intended specifically for use by concept detection services.  An RDF property designed to signal to text mining tools that its target string was inserted specifically for them would make it possible for ontology developers to intentionally aid those who build concept recognition services.  rdfs:label does not solve this problem because the label that the ontology developers might want to see for the concept might not be the same as the string text miners need; the contexts are different.  This might seem like too much work for ontology developers, and it might seem that the NLP people should really be doing this kind of work, but I think it's the most feasible approach.  If you consider the Annotator problem - to build a concept recognition service for any ontology submitted to the BioPortal collection (now over 200 ontologies) - then some level of standardization is required.

2) The biomedical community needs a better lexicon for the Gene Ontology.  Though this is a general problem for any ontology that could be used in a text mining context, the GO is a special case that needs to be dealt with regardless of whether or not we solve the problem for other ontologies.  Even if we end up with a one-off, GO-specific invention, we really need to be able to figure out when the concepts from the GO occur in text.  To simply throw up our hands and say that 'the GO was not designed for text mining' is a pointless cop-out.  This can and should be done.
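As a rough picture of what the first item could look like from a tool builder's side, here is a small sketch.  The property name `ex:textMiningLabel` is purely hypothetical (no such standard property exists as far as I know), the 'apoptotic' string and the `EX:0000001` concept are invented for illustration, and the fallback policy is just one possible choice.

```python
from collections import defaultdict

DISPLAY_LABEL = "rdfs:label"
TEXT_MINING_LABEL = "ex:textMiningLabel"  # hypothetical property, not a standard

# Triples as plain (subject, predicate, object) tuples; values are illustrative.
triples = [
    ("GO:0006915", DISPLAY_LABEL, "apoptosis"),
    ("GO:0006915", TEXT_MINING_LABEL, "apoptotic"),
    ("EX:0000001", DISPLAY_LABEL, "placeholder concept"),
]


def matching_strings(triples):
    """Pick the strings a concept recognizer should search for.

    Prefer strings the ontology developers flagged for text mining and
    fall back to the display label when none were provided.
    """
    by_concept = defaultdict(lambda: {DISPLAY_LABEL: [], TEXT_MINING_LABEL: []})
    for subject, predicate, obj in triples:
        if predicate in (DISPLAY_LABEL, TEXT_MINING_LABEL):
            by_concept[subject][predicate].append(obj)
    return {
        concept_id: props[TEXT_MINING_LABEL] or props[DISPLAY_LABEL]
        for concept_id, props in by_concept.items()
    }


lexicon = matching_strings(triples)
print(lexicon)
```

The point of a dedicated property is that a service like the Annotator could apply one rule like this across every ontology it loads, instead of guessing which annotation properties each ontology happens to use for synonyms.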


* But note that we are seeing that, with enough text, the underlying signal can sometimes drown out the errors from the NLP.  (See "The Unreasonable Effectiveness of Data" for more on this topic at even grander scales.)

[1] "Annotation Analysis for Testing Drug Safety Signals" by Paea LePendu et al., from Bio-Ontologies this year.  (Their paper isn't in the Knowledge Blog for some reason, but I'm sure Paea will send it to you if you ask him.)