Friday, May 4, 2007

open kimono part 2

In a comment on open kimono part 1, "Luke" said that perhaps the question is more about the benefits of blogging than the costs. As I'm pretty new to this scene, I think I will experiment a bit and tell you my results as they come in before going straight for my own predictions.

Experiment number 1 - Below is a problem I'm working on in the area of ontology quality evaluation. Let's see if blogging it out before trying to publish it / finish my research helps, hurts, or doesn't matter.

Last year Barry Smith published an article indicating that the quality of a category in a terminology is best estimated through an inspection of the characteristics of the instances that get [manually] assigned to it. Basically, to evaluate a category (concept, class, set, whatever...) you try to find a pattern that is both consistent across the instances in the category and unique (the pattern is not present in instances not assigned to that category). Before any DL people start shouting: of course, if everyone wrote ontologies with logical restrictions on membership for each class, we wouldn't need to worry about this, as the desiderata would be enforced by a reasoner. But since we live in a world where the most powerful ontology-developing force (OBO) is not just reticent but seems actively opposed to the description logic approach, it's still an important consideration. So..

A clever undergrad (Gavin Ha) and I have produced a program that:

  1. reads in an OWL/RDF document including both classes and instances
  2. builds a table with instances as rows and classes and other properties of the instances as the columns
  3. prunes out unwanted columns (depending..)
  4. sends the table off to algorithms from WEKA that attempt to build classifiers (e.g. decision trees) for a specified class within the ontology based on the attributes of the instances.
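
Steps 2 and 3 above can be sketched roughly as follows. This is a minimal Python illustration, not the actual program: the (instance, property, value) triple format, the "is_" column prefix, and the function names build_table and prune_constant_columns are all assumptions made for the example, and "prune columns that never vary" is just one plausible pruning rule.

```python
from collections import defaultdict

def build_table(triples):
    """Turn (instance, property, value) triples into one row per instance.

    Hypothetical layout: class membership (rdf:type) becomes a boolean
    column named "is_<Class>"; every other property becomes its own column.
    """
    rows = defaultdict(dict)
    for instance, prop, value in triples:
        if prop == "rdf:type":
            rows[instance]["is_" + value] = True
        else:
            # Simplification: a repeated property keeps only its last value.
            rows[instance][prop] = value
    columns = sorted({col for row in rows.values() for col in row})
    table = {}
    for instance, row in rows.items():
        # Missing class columns default to False, other columns to None.
        table[instance] = {
            col: row.get(col, False if col.startswith("is_") else None)
            for col in columns
        }
    return columns, table

def prune_constant_columns(columns, table):
    """Drop columns whose value is identical for every instance."""
    kept = [c for c in columns if len({row[c] for row in table.values()}) > 1]
    pruned = {inst: {c: row[c] for c in kept} for inst, row in table.items()}
    return kept, pruned
```

A column that has the same value for every instance cannot help separate members from non-members, which is why it is a safe first candidate for pruning.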

The idea is that if we can find a pattern, the class is 'good'; if not, the class is 'less good'. We can then quantify the goodness based on the number of instances the pattern correctly classifies.
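
To make the scoring idea concrete, here is a small Python sketch. It swaps in a very simple one-attribute rule learner (in the spirit of OneR) for the WEKA classifiers, and it scores on the same instances it learned from, so treat it as an illustration of "fraction of instances the pattern classifies correctly" rather than a proper evaluation. The function name and the row format (a list of dicts with a boolean membership column) are assumptions.

```python
from collections import Counter, defaultdict

def best_single_attribute_rule(rows, target):
    """Find the attribute whose values best predict membership in `target`.

    rows:   list of dicts, one per instance
    target: name of the boolean membership column
    Returns (attribute, rule, fraction_correct), where rule maps each
    attribute value to the predicted membership.
    """
    best = (None, {}, 0.0)
    attributes = {key for row in rows for key in row if key != target}
    for attr in sorted(attributes):
        # Count, for each attribute value, how often instances are in/out.
        votes = defaultdict(Counter)
        for row in rows:
            votes[row.get(attr)][row[target]] += 1
        # Predict the majority membership for each attribute value.
        rule = {value: counts.most_common(1)[0][0]
                for value, counts in votes.items()}
        correct = sum(rule[row.get(attr)] == row[target] for row in rows)
        score = correct / len(rows)
        if score > best[2]:
            best = (attr, rule, score)
    return best
```

A class whose best rule gets every instance right would score 1.0; a class where no attribute does better than guessing the majority label would score close to the majority fraction, which is one reason the choice of negative set matters so much.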

So far, it works great on a positive control containing about 25 classes with 500 instances strictly obeying predetermined rules for class membership. It discovers those rules blinded and scores the classes appropriately (e.g. %correct for instances that should be in or out of the class). One of the nice things about this is that, depending on the classifier chosen, the rules discovered in the data can be automatically represented as OWL restrictions and thus given to ontology engineers as suggestions.

Now, of course, we are trying this out on bits of the Gene Ontology to see if there are any patterns that can be mined from the databases that correspond to GO classes. Trouble is, the GO is both large and, in many areas, relatively sparsely instantiated. The question that arises is, for a given class (term..), when we try to find a pattern that discriminates its members from those outside the class, how do we choose the negative set? In our test case, the negative set for any particular class was simply everything outside of it, but when the set in question contains only 10-100 things and everything else contains a million things, this isn't really a good way to go.

I'm thinking of starting with only the closest neighbors of the class in question (kind of like the logic behind SVMs).
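
One way to read "closest neighbors" is: take the negative set from sibling classes, i.e. classes that share at least one parent with the class being evaluated. The Python sketch below shows that reading only; it is an assumption about what the neighborhood could be, not a settled design, and the function name and the two dictionaries are hypothetical. It also assumes each class's member set already includes whatever instances should count for it (e.g. those inherited from subclasses).

```python
def sibling_negatives(parents, members, target):
    """Pick negative examples from classes that share a parent with `target`.

    parents: dict mapping class -> set of parent classes (a DAG, as in GO)
    members: dict mapping class -> set of instance ids
    Returns instances of sibling classes that are not members of `target`.
    """
    target_parents = parents.get(target, set())
    siblings = {
        cls for cls, cls_parents in parents.items()
        if cls != target and cls_parents & target_parents
    }
    negatives = set()
    for cls in siblings:
        negatives |= members.get(cls, set())
    # An instance annotated to both the target and a sibling is not a negative.
    return negatives - members.get(target, set())
```

This keeps the negative set on the same order of magnitude as the positive set in sparsely instantiated regions, at the cost of only testing whether the class can be told apart from its nearest relatives.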

Any thoughts?


Morgan Langille said...

I think GO would be one of the hardest cases to show that your technique works (which is great if it does work) due to how noisy the data is. Would using something like GO Slim be a little easier?

Benjamin Good said...

Though I'd love to feed all of UniProt and GenBank and the GO into my program and have it work, that is probably not feasible immediately. The plan is to go after small chunks of the GO first and then combine the results to give a more global picture. I think the immediate problem isn't really how "noisy" the data is, but rather how scarce it is. To make this work, we really need quite a few instances of each class we evaluate. Because of that, I think starting with a small subset of the GO, but with all of the genes/annotations I can find for the terms in that set, is the way to go.

So, any suggestions for a small subset to focus on? An area likely to have strong correlations to sequence features perhaps??

Morgan Langille said...

The obvious one is protein localization (or in GO terms Cellular Component?).

Leon French said...

Hey Ben, was surfing around your blog here, pretty insightful.

About this topic, some work from Alberta comes to mind.

They were classifying protein sequences into the whole GO tree, basically what Morgan mentioned.

The features used are not so clear to me now; I think they're sequence and annotation based.

Benjamin Good said...

(from Leon)
paper 1
paper 2