Friday, February 20, 2015

Building a Garden of Biological Knowledge

On March 13, 2013, I wrote up an idea in my notebook that I called 'Pubmed Daily'.  The concept was to build a system that would leverage large-scale crowdsourcing/citizen science and machine learning to produce a high-quality, structured representation of the knowledge in every abstract in PubMed on the same day that the abstract appeared online.  Nearly two years later, based mainly on the labor of outreach coordinator Ginger Tsueng, group leader Andrew Su, and programmer Max Nanis, the idea is just starting to bear fruit (albeit small, perhaps grape-sized, fruit).  As the bits and pieces start to come together, I thought it would be worthwhile to share the high-level vision as it exists now.

A Garden of Biological Knowledge

We want to build an information management system (or systems) that supports the work of three key groups of people: bioinformaticians such as Andrew, biologists such as Hudson Freeze, and patient advocates such as the parents and friends of Bertrand Might, a child with a rare genetic disorder related to the NGLY1 gene.  The overarching goal is to produce more rapid biomedical advances based on more effective use of existing knowledge in the processes of hypothesis generation and high-throughput data analysis.

The thinking goes like this.  Given a high-quality, structured knowledge base such as the Gene Ontology (GO), Andrew and people like him can make many different kinds of discovery and analytical tools that can help scientists such as Hudson work more effectively (and they do: there are thousands of tools that use the GO).  The problem is that the generation of knowledge bases like the GO is a long, slow, expensive process that in no way keeps pace with the advance of knowledge as represented in the literature.  Information extraction systems like DeepDive and SemRep can theoretically go a long way toward addressing this problem.  However, humans remain more effective (though obviously dramatically slower) readers.

Can we use computers to seed a garden of knowledge that can be tended and grown by citizen scientists?

Given a compelling argument, clear instructions, and an effective user interface, we think that large numbers of people from the general public could be assembled to work on improving the results of a biomedical information extraction system.  We, and other groups, have been experimenting with various related tasks using Amazon Mechanical Turk and are now confident that "the crowd" can, in aggregate, do text-processing work at or above expert level.  A recent conversation with a leader of the Zooniverse project, a collection of online citizen science projects with more than a million people contributing, leads us to believe that it would be possible to attract tens to hundreds of thousands of people to participate in an effort like this.  Recently, we took the first steps toward testing that assumption via a short but successful test run of the Mark2Cure Web application.
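To make "in aggregate" a little more concrete, here is a minimal sketch of one common way to combine crowd annotations: simple vote counting.  This is only an illustration under stated assumptions, not a description of how Mark2Cure or our Mechanical Turk experiments actually combine answers.  The function name, the worker IDs, and the (start, end, type) span format are all hypothetical.

```python
from collections import Counter


def aggregate_annotations(annotations_by_worker, min_votes):
    """Keep each annotation that at least `min_votes` workers agree on.

    `annotations_by_worker` maps a (hypothetical) worker ID to the list of
    spans that worker marked, each span being a (start, end, type) tuple.
    """
    votes = Counter()
    for spans in annotations_by_worker.values():
        # Count each worker at most once per span.
        votes.update(set(spans))
    return sorted(span for span, n in votes.items() if n >= min_votes)


# Made-up example data: three workers marking disease mentions in one abstract.
workers = {
    "worker_a": [(0, 5, "disease"), (20, 25, "disease")],
    "worker_b": [(0, 5, "disease")],
    "worker_c": [(0, 5, "disease"), (40, 47, "disease")],
}

consensus = aggregate_annotations(workers, min_votes=2)
print(consensus)  # [(0, 5, 'disease')]
```

Raising `min_votes` trades recall for precision: fewer annotations survive, but the ones that do have more people behind them.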

Can we build a new generation of tools for working with structured biomedical knowledge at massive scales and use these to empower the rising community of citizen scientists?

Aside from the knowledge bases themselves, we are also interested in building better tools for navigating this information and in putting them into the hands of both professionals like Hudson and the many very intelligent, highly motivated people from other domains who might also be able to find something important given the chance (e.g., Matthew Might).

Can a large volunteer work force help teach computers to read?

By engaging volunteers at scale, we hope to provide the developers of information extraction algorithms with the data they need to raise their approaches to human levels of quality.

We have a long, challenging road ahead on this project, but the path is starting to take shape and the future looks bright!