Wednesday, June 30, 2010

Probabilistic Partitioning of the Gene Ontology - anyone?

Yesterday, Andrew Su and I were wondering about disjointness in the Gene Ontology. We mused, "are there concepts in the GO that are mutually exclusive from the perspective of the gene products that they are associated with?" In other words, if a gene is anointed with a particular annotation, are there any other annotations that are thus rendered impossible for that gene?

With a small amount of searching, I found that the GO consortium is researching the addition of declarations of disjointness such as "intracellular is disjoint from extracellular". However, such statements about cellular space don't necessarily transfer to statements about the genes that occupy those spaces. This is clearly illustrated by 390 human proteins that are annotated as both 'intracellular' and 'extracellular' (and is stated explicitly in the GOC definition of 'Disjoint from'). So, even when/if that gets finished, it's not going to help with our question. Furthermore, it seems that searching for the strict, perfect boundaries of formal logic might not be the right approach to this anyway.

What we would really like to see is a probability. Given that a gene is annotated with a particular concept from the GO, what is the probability that it could also be annotated with another GO concept? Concept pairs where the probability of co-annotation approaches zero would represent the (probably) disjoint biological classes that we were originally thinking about.

Naively, it seems this would be fairly easy to generate using the GO annotation database. Just build a contingency table for each pair of GO concepts (avoiding sub/superclasses) with counts of the genes annotated with each concept, and compute the statistics. If anyone would like to implement this (or, even better, implement it in a non-naive way that accounts correctly for priors and chooses the best test statistic) we would:

  1. be very happy to use the results in the context of another investigation that is already in progress and that motivated the question...
  2. be very happy to share authorship on any papers that might come out of the probabilistic partitioning service or the mysterious study already in progress...
If this work has already been accomplished and I've missed it, please let me know! If you are interested in working on it (or have already started working on it on your own) also please let me know!
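
To make the naive version a little more concrete, here is a minimal Python sketch of the contingency-table idea. Everything in it is an assumption for illustration: the gene names and annotations are made up, the function names are hypothetical, and the one-sided hypergeometric (Fisher-style) test is just one possible choice of statistic, not necessarily the best one. It also does not filter out sub/superclass pairs or account for priors or missing annotations, all of which a real implementation would need.

```python
from math import comb

# Hypothetical toy data: gene -> set of GO concept labels (not real annotations).
annotations = {
    "geneA": {"intracellular", "kinase activity"},
    "geneB": {"intracellular"},
    "geneC": {"extracellular"},
    "geneD": {"extracellular", "kinase activity"},
    "geneE": {"intracellular", "extracellular"},
    "geneF": {"intracellular", "kinase activity"},
}


def contingency(annotations, term_a, term_b):
    """Count genes as (both, only_a, only_b, neither) for one concept pair."""
    both = only_a = only_b = neither = 0
    for terms in annotations.values():
        has_a, has_b = term_a in terms, term_b in terms
        if has_a and has_b:
            both += 1
        elif has_a:
            only_a += 1
        elif has_b:
            only_b += 1
        else:
            neither += 1
    return both, only_a, only_b, neither


def p_b_given_a(table):
    """Naive estimate of P(annotated with B | annotated with A)."""
    both, only_a, _, _ = table
    n_a = both + only_a
    return both / n_a if n_a else float("nan")


def depletion_p_value(table):
    """One-sided hypergeometric tail: chance of an overlap this small or
    smaller if the two concepts were assigned to genes independently."""
    both, only_a, only_b, neither = table
    n = both + only_a + only_b + neither
    n_a, n_b = both + only_a, both + only_b
    smallest = max(0, n_a + n_b - n)  # smallest overlap the margins allow
    tail = sum(comb(n_a, k) * comb(n - n_a, n_b - k)
               for k in range(smallest, both + 1))
    return tail / comb(n, n_b)


table = contingency(annotations, "intracellular", "extracellular")
print(table)                     # (1, 3, 2, 0)
print(p_b_given_a(table))        # 0.25
print(depletion_p_value(table))  # 0.2
```

A small co-annotation probability together with a small depletion p-value would flag a pair as a candidate for "(probably) disjoint"; with sparse or incomplete annotations, a small probability alone says very little.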

Partially related efforts but not quite what we are after...


Mark Wilkinson said...

given the modularity of DNA/protein, together with recombination, I suspect that you might find very few 100% exclusive terms. GO terms, in many cases, annotate sub-functions of a protein (~~ protein domains); moreover, different (evolutionarily) domains can have ~the same function, at least at the level of GO Annotation. So... I would imagine that evolution has tried most combinations :-) What would you conclude if you didn't see a particular combination? That they were exclusionary in their function, or that the combination had never "been tried" by the organism?

Benjamin Good said...

And my response.. again..

I should probably write another post to clarify why I wrote this post. I will when I get the data! But, to (sort of) answer your question, my hope is that given a pair of GO concepts, I can make a prediction with a quantitative level of confidence about the likelihood that a particular protein will be annotated with both. This gives me a tool to use when analyzing data (for example from a new system for predicting novel GO annotations...). If my system suggests that a particular gene product should have a new annotation, but it already has a collection of well-established annotations, I want to be able to measure how 'surprised' I should be about the new suggestion.

To me, the key to making this useful is getting the statistics right. I'm sure there are many cases where GO concepts will appear to be disjoint given a simple count of co-occurrence simply because the annotations haven't been captured yet (or are incorrect). The trick is not really to find disjoint sets but to come up with a reliable way of measuring the confidence in a prediction of disjointness (or its reverse).

Back to your question. What would I conclude for a pair of concepts that appeared, with a very high probability, to be disjoint with respect to the proteins that could be annotated with them? As I said at my defense: "that's a hard question!"... In fact I think it's actually two questions. 1) Is the protein pair really truly exclusionary in their function? and 2) How, from an evolutionary perspective, did that come to pass?

Certainly the hope is that the answer to 1) is "probably true", but I think more data will be needed to answer either of these questions well. What I'm talking about here boils down to a similarity function, and one of the things we can do is start comparing that function to sequence similarity, the way Phil Lord et al. did in the paper I mentioned in the post.

But first let's see the data...

Chris Mungall said...

See Val Wood's presentation (Val-Matrix-QC.pdf):

She uses the term "annotation intersections" which corresponds with what you're talking about.

Note in the slides expressions such as P1/\P2 should be read more formally in OWL as
(G and annotated_to some P1)/\(G and annotated_to some P2)