Friday, May 6, 2011

Integrating the Gene Wiki with traditional publishing?

While we (Andrew Su and I) like to talk about the successes of the Gene Wiki - articles like the one for Reelin that represent arguably the best consolidated body of text associated with the gene - there remain some rather glaring holes in its content. A couple months ago I had a look for under-developed articles linked to genes with extensive numbers of publications. With a small bit of hacking I uncovered a list of 2,553 genes that were linked directly to more than 20 PubMed citations (using NCBI's gene2pubmed) but had fewer than 100 words of text in their Gene Wiki article. (Up to the previous period, this post contained 105 words.) From this list I found 151 genes with more than 100 PubMed citations and fewer than 100 words of wiki text.
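For anyone who wants to try something similar, here is a rough sketch of that kind of filter in Python. It is not the script used for the numbers above. It assumes gene2pubmed rows are tab-separated as tax_id, GeneID, PubMed_ID with a header line starting with "#", and that you already have the Gene Wiki article text in a dictionary keyed by gene ID (the `wiki_text` argument and both function names are hypothetical). Genes with no article at all are treated as having zero words, which is also an assumption.

```python
from collections import defaultdict

HUMAN_TAX_ID = "9606"  # NCBI taxonomy ID for Homo sapiens


def count_citations(gene2pubmed_lines, tax_id=HUMAN_TAX_ID):
    """Count distinct PubMed IDs per gene from gene2pubmed-style rows."""
    pmids = defaultdict(set)
    for line in gene2pubmed_lines:
        # Skip the header line and any blank lines.
        if line.startswith("#") or not line.strip():
            continue
        row_tax, gene_id, pmid = line.rstrip("\n").split("\t")[:3]
        if row_tax == tax_id:
            pmids[gene_id].add(pmid)
    return {gene_id: len(ids) for gene_id, ids in pmids.items()}


def find_stubs(citations, wiki_text, min_citations=20, max_words=100):
    """Return (gene_id, citations, words) for well-cited genes with short articles.

    wiki_text is a hypothetical dict of gene ID -> article text; a missing
    gene is assumed to have an empty article.
    """
    stubs = []
    for gene_id, n_citations in citations.items():
        n_words = len(wiki_text.get(gene_id, "").split())
        if n_citations > min_citations and n_words < max_words:
            stubs.append((gene_id, n_citations, n_words))
    # Most-cited genes first, since those are the most glaring gaps.
    return sorted(stubs, key=lambda stub: -stub[1])
```

Calling `find_stubs(citations, wiki_text, min_citations=100)` would give the stricter cut, in the spirit of the 151-gene list.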

An example is the PIN1 gene. When the analysis was run, this gene was linked to 154 citations in PubMed yet had only 2 sentences in the Gene Wiki. So... how do we fill in these gaps? This is, of course, the fundamental question associated with wikis or any other attempt to harness community intelligence, and there is no easy answer. One model that we are very interested in was pioneered by Alex Bateman and colleagues at the journal RNA Biology. When hopeful authors submit an article about a new RNA family to the journal, it is a condition of publication that they contribute an article to Wikipedia about that family. Aside from being a generally good thing to do as far as sharing knowledge with the world, these articles are subsequently used to manage the annotations for RNA families in the Rfam database (e.g. snoZ107_R87). After a few years of operation, the Rfam team published an article that, among other things, celebrated the success of the Wikipedia connection. So, how might we expand upon this model to tackle the challenges facing the Gene Wiki?

The beauty of this approach is that it does not rely on any changes to the incentive system currently operational in science. Scientists need to publish in peer-reviewed journals. Rather than complaining about the inefficiency of this outdated process and suggesting social changes with no obvious way to achieve them, let's see what we can do to make the system work for us as it stands. Let's create a way for scientists to obtain real publications in real journals and have Gene Wiki article content generated as a natural part of the process. Here is one idea.

A Gene Wiki Meta-Journal Special Edition
In this model we would work with a number of smallish, topic-focused journals to commission short review articles about the molecular function and phenotypic relevance of individual human genes. (By phenotypic relevance I mean the connection between the gene and something that non-scientists might care about, such as a role in a disease or a connection to some human attribute such as height, hair color or athletic performance.) These review articles would be published in journals appropriately matched to the key phenotypes associated with the gene. For example, we might imagine requesting a review article in the Journal of Investigative Dermatology about the gene Filaggrin because the most important variations in this gene have been shown to relate to skin conditions such as eczema. Each of these phenotypically targeted gene review articles would be linked from and would link back to a central article that described the meta-journal concept - ideally in a journal with an audience broad enough to span each of the more niche-specific journals that participated in the experiment. Following the RNA Biology model, a condition of publication would be to update or create the relevant Gene Wiki article with content from the submitted review article.
While logistically challenging, this approach appeals to me because it continues with the theme of tapping into the 'Long Tail'. If we can distribute the labor among a larger number of journals, we ought to be able to connect with a larger number of individual contributors. In addition, it might be appealing to the editors of and contributors to more niche-specific journals to participate in a project with broad visibility. As an alternative, we might consider attempting to organize a gene-focused special edition similar to the annual Nucleic Acids Research database edition in one gene-focused journal (e.g. Genome Biology), but it seems unlikely that this approach would have the same potential breadth of impact. Also, following a phenotypic rather than molecular orientation aligns well with Wikipedia's notability criterion and hence might help to generate article content that would meet with less resistance from current Wikipedia editors.
If you have any thoughts on this idea (or if you have better ideas!) I would love to hear from you.


Leon French said...

You could try hooking up a document summarization tool to automatically create the gene wiki pages.

You could check out this paper -
"Automatically Generating Wikipedia Articles: A Structure-Aware Approach"
by Christina Sauper and Regina Barzilay. Proceedings of ACL, 2009.

Benjamin Good said...

Hi Leon. You know, I would have agreed with you before I got deeply involved in this wiki business. Oddly, one characteristic shared by all of the large, famous wikis that have failed to attract a critical mass of editors is that they were seeded through extensive text mining. (See wikiproteins, wikigenes, SNPedia.) Now, each of these sites does have a lot of useful content for sure (in fact quite a lot in each case), but at the end of the day they aren't tapping into community intelligence and thus they are not meeting their stated goal.

I don't know if the auto-text contributes to the recruiting difficulties or not but I'm starting to guess that it does. One reason could be that when you look at auto-generated text you can tell right away that you are looking at a robot author rather than a human author and perhaps that turns people off more than a mostly empty page does.

One thing it should theoretically help with is SEO (i.e. some relevant text should be better than nothing), but frankly that's the one problem the Gene Wiki really doesn't have.

Paul Gardner said...

One of the other advantages of the deal with RNA Biology is that Landes have generously made the RNA family articles open-access and free to publish. Other publishers may not be so generous.

Benjamin Good said...

Hi Paul, we have started conversations with a couple publishers and you are correct about the licensing concerns - still, it looks pretty hopeful. They are definitely interested in exploring the concept.