In a comment on "open kimono part 1", "Luke" suggested that perhaps the question is more what the benefits of blogging are than what the costs are. As I'm pretty new to this scene, I'll experiment a bit and report my results as they come in, rather than going straight to my own predictions.
Experiment number 1 - Below is a problem I'm working on in the area of ontology quality evaluation. Let's see if blogging it out before trying to publish / finish the research helps, hurts, or doesn't matter.
Last year Barry Smith published an article suggesting that the quality of a category in a terminology is best estimated by inspecting the characteristics of the instances that get [manually] assigned to it. Basically, to evaluate a category (concept, class, set, whatever...) you try to find a pattern that is both consistent across the instances in the category and unique to it (the pattern is not present in instances assigned elsewhere). Before any DL people start shouting - of course, if everyone wrote ontologies with logical restrictions on membership for each class, we wouldn't need to worry about this, as the desiderata would be enforced by a reasoner. But, living in a world where the most powerful ontology-developing force (OBO) is not just reticent but seems actively opposed to the description logic approach, it's still an important consideration. So..
A clever undergrad (Gavin Ha) and I have produced a program that:
- reads in an OWL/RDF document including both classes and instances
- builds a table with instances as rows and classes and other properties of the instances as the columns
- prunes out unwanted columns (depending..)
- sends the table off to algorithms from WEKA that attempt to build classifiers (e.g. decision trees) for a specified class within the ontology, based on the attributes of the instances.
The idea is that if we can find a pattern, the class is 'good'; if not, the class is 'less good', and we can quantify the goodness by the number of instances the pattern correctly classifies.
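To make the pipeline concrete, here is a toy sketch of the steps above, with scikit-learn's DecisionTreeClassifier standing in for the WEKA algorithms. The triples, class names, and property names are all invented for illustration; this is not the actual program.

```python
# Toy sketch of the table-building + classification pipeline.
# scikit-learn's DecisionTreeClassifier stands in for WEKA here;
# all data and names below are made up for illustration.
from sklearn.tree import DecisionTreeClassifier

# A few hand-made (subject, predicate, object) triples: three
# instances, two classes, and some instance-level properties.
triples = [
    ("i1", "rdf:type", "Enzyme"),
    ("i1", "catalyzes", "reactionA"),
    ("i2", "rdf:type", "Enzyme"),
    ("i2", "catalyzes", "reactionB"),
    ("i3", "rdf:type", "Transporter"),
    ("i3", "transports", "ion"),
]

# Build the instance-by-attribute table: one row per instance, one
# boolean column per (predicate, object) pair the instance might have.
instances = sorted({s for s, _, _ in triples})
columns = sorted({(p, o) for _, p, o in triples if p != "rdf:type"})
table = [[int((i, p, o) in triples) for (p, o) in columns] for i in instances]

# For a target class, the label is class membership; the classifier
# tries to find a pattern in the attributes that discriminates members.
target = "Enzyme"
labels = [int((i, "rdf:type", target) in triples) for i in instances]
clf = DecisionTreeClassifier().fit(table, labels)

# The 'goodness' score: fraction of instances the learned pattern
# classifies correctly (here, evaluated on the training data itself).
score = clf.score(table, labels)
```

On a real ontology the evaluation would of course use held-out instances (cross-validation) rather than the training score shown here.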
So far, it works great on a positive control containing about 25 classes with 500 instances strictly obeying predetermined rules for class membership. It discovers those rules blinded and scores the classes appropriately (e.g. %correct for instances that should be in or out of the class). One of the nice things about this is that, depending on the classifier chosen, the rules discovered in the data can be automatically represented as OWL restrictions and thus given to ontology engineers as suggestions.
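The rule-to-restriction step can be sketched very simply: a learned decision rule of the form "instances with property P to some O belong to class C" maps naturally onto an OWL existential restriction. The function name and the Manchester-syntax-style output below are my own illustration, not the program's actual output format.

```python
# Illustrative only: turn one learned decision rule into a suggested
# OWL restriction, rendered in Manchester-syntax style for an
# ontology engineer to review. Names here are hypothetical.
def rule_to_restriction(cls, prop, filler):
    # e.g. the tree split "catalyzes some Reaction -> Enzyme"
    return f"{cls} SubClassOf: {prop} some {filler}"

suggestion = rule_to_restriction("Enzyme", "catalyzes", "Reaction")
```

The engineer can then accept the suggestion as a formal restriction or reject it as an artifact of the data.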
Now, of course, we are trying this out on bits of the gene ontology to see if there are any patterns that can be mined from the databases that correspond to GO classes. Trouble is, the GO is both large and, in many areas, relatively sparsely instantiated. The question that arises is: for a given class (term..), when we try to find a pattern that discriminates its members from those outside the class, how do we choose the negative set? In our test case, the negative set for any particular class was simply everything outside of it, but when the set in question contains only 10-100 things and everything else contains a million things, this isn't really a good way to go.
I'm thinking of starting with only the closest neighbors of the class in question (kind of like the logic behind SVMs).
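One way to implement that "closest neighbors first" idea: walk breadth-first outward from the target class through the is_a hierarchy and take negatives from the nearest classes until the negative set is roughly the size of the positive set. The toy hierarchy, instance names, and function below are made up; this is a sketch of the sampling strategy, not the actual code.

```python
# Sketch: sample negatives from the classes nearest the target in the
# hierarchy (siblings before distant cousins). Data is hypothetical.
from collections import deque

parents = {"kinase": "enzyme", "phosphatase": "enzyme",
           "enzyme": "protein", "transporter": "protein"}
children = {}
for c, p in parents.items():
    children.setdefault(p, []).append(c)

members = {"kinase": ["k1", "k2"], "phosphatase": ["p1", "p2"],
           "transporter": ["t1", "t2", "t3"]}

def sample_negatives(target, n):
    """Collect up to n instances from classes near `target`, breadth-first."""
    negatives, seen, queue = [], {target}, deque([target])
    while queue and len(negatives) < n:
        cls = queue.popleft()
        if cls != target:
            negatives.extend(members.get(cls, []))
        # expand to the parent and the children (hierarchy neighbors)
        for nb in [parents.get(cls)] + children.get(cls, []):
            if nb and nb not in seen:
                seen.add(nb)
                queue.append(nb)
    return negatives[:n]

negs = sample_negatives("kinase", 2)
```

With this toy hierarchy, negatives for "kinase" come from its sibling "phosphatase" before the more distant "transporter", which is the behavior I'm after.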