i9606

1 Cyclotron Road

2018-03-25T11:18:00.001-07:00

I've recently started working as a contractor for the BBOP group at Lawrence Berkeley National Laboratory. To my son's great disappointment (and probably my father's as well) I am not working with the cyclotron, but instead am now working with the Gene Ontology (GO) project. I am involved in a transition that might be put most simply as a migration from 'GO as a gene tagging system' to 'GO as an activity flow modeling system'. For a preview of what the new approach looks like, see a MAP kinase cascade example on the nascent GO Causal Activity Modeling environment (AKA Noctua).

Though it is a departure from the last 10+ years of working on crowdsourcing in bioinformatics, it's also a much-needed re-centering of focus around the core reasons why I got involved in crowdsourcing in the first place. I have this crazy notion that knowledge, once ripped at great cost from nature's powerful grasp, should be cherished, shared, and used in the production of more knowledge. All of the crowdsourcing work was originally motivated by the goal of finding new, potentially better, ways to make that process happen. I'm still interested in finding ways to include (much) larger audiences in the process of curating and growing our collective knowledge base. But, for now, I am happy and proud to have the opportunity to simply dig in with my own hands to help take the GO to the next level.

Perhaps someday (like maybe when my youngest starts kindergarten..) I will have a chance to cross the semantic and social web streams once again.

Leaving Scripps

2017-06-19T21:50:00.000-07:00

After seven years in the Su Lab with six spent at Scripps Research, I've decided to move on to something new. There are obviously many factors that go into a decision like this, but I'll leave you with just the one that is probably the most important. I want to support my wife's desire to re-enter the workforce. I think that this is the right move for our family - at least strongly enough to run the experiment! According to one smart fellow from Princeton, this concept of fathers taking on the 'lead parent' role is a good idea. Here's hoping it works out for us!

Science Game Lab: tool for the unification of biomedical games with a purpose

2017-06-16T14:28:00.000-07:00

Scripps team: Benjamin M. Good, Ginger Tsueng, Andrew I Su

Playmatics team: Sarah Santini, Margaret Wallace, Nicholas Fortugno, John Szeder, Patrick Mooney

With helpful ideas from: Jerome Waldispuhl, Melanie Stegman

Abstract

Games with a purpose and other kinds of citizen science initiatives demonstrate great potential for advancing biomedical science and improving STEM education. Articles documenting the success of projects such as Fold.it and Eyewire in high impact journals have raised wide interest in new applications of the distributed human intelligence that these systems have tapped into. However, the path from a good idea to a successful citizen science game remains highly challenging. Apart from the scientific difficulties of identifying suitable problems and appropriate human-powered solutions, the games still need to be created, need to be fun, and need to reach a large audience that remains engaged for the long term. Here, we describe Science Game Lab (SGL) (https://sciencegamelab.org), a platform for bootstrapping the production, facilitating the publication, and boosting both the fun and the value of the user experience for scientific games with a purpose.

Introduction

Ever since the Fold.it project famously demonstrated that teams of human game players could often outperform supercomputers at the challenging problem of 3D protein structure prediction, so-called ‘games with a purpose’ have seen increasing attention from the biomedical research community. A few other games in this genre include: Phylo for multiple sequence alignment, EteRNA for RNA structure design, Eyewire for mapping neural connectivity, The Cure for breast cancer prognosis prediction, Dizeez for gene annotation, and MalariaSpot for image analysis. Apart from tapping into human intelligence at scale, these efforts have also produced valuable educational opportunities. Many of these games are now used to introduce their underlying concepts in classroom settings where games in all forms are increasingly working their way into curriculums. Concomitant with the rise of these ‘serious games’, citizen science efforts such as the Zooniverse and Mark2Cure have sought similar aims but have packaged their work as volunteer tasks, analogous to unpaid crowdsourcing tasks, rather than as elements of games.

Many of these initiatives have succeeded in independently addressing challenging technical problems through human computation, improving science education, and generally raising scientific awareness. However, despite so much interest from the scientific community and a booming ecosystem of game developers, relatively few of these games are in operation now. Recognizing the opportunity, various groups have attempted to push the area forward through new funding opportunities and through various ‘game jams’ such as the one that produced the game ‘Genes in Space’ for use in analyzing microarray data in cancer. Here, we take a different approach towards expanding the ecosystem of games with a scientific purpose. Rather than attempting to seed the genesis of specific new game-changing games, we hope to lower the barrier to entry for new games and related citizen science tasks to generally promote the development of the entire field. With this high-level aim in mind, we developed Science Game Lab (SGL) to make it easier for developers to create successful scientific games or game-like learning and volunteer experiences. Specifically, SGL is intended to address the challenges of recruiting players and volunteers, keeping them engaged for the long term, and reducing the development costs associated with creating a scientific gaming experience.

The Science Game Lab Web application

SGL is a unique, open-source portal supporting the integration of games and volunteer experiences meant to advance science and science education (https://sciencegamelab.org). Unlike other related sites that act more like volunteer management and/or project directory services, such as SciStarter and Science Game Center, SGL is not simply a listing of related websites. Rather, it is an attempt to create a user experience that takes place directly within the SGL context yet still incorporates content from third parties. The system is largely inspired by game industry portals such as Kongregate that enable developers to incorporate their games directly into a unified metagame experience.

Players can use the portal to find and play games with their achievements within the games tracked on site-wide high score lists and achievement boards (Figure 1). Players can earn the SGL points that drive these leaderboards for actions taken in different games. In this way, SGL provides developers with access to a metagame that can be used to encourage players in addition to the incentives offered within individual games (Figure 2). This metagame can also be used by the system administrators to help direct the player community’s attention to particular games or particular tasks within games. For example, actions taken on new games might earn more points than actions taken on more established games as a way to ‘spread the wealth’ generated by successful games.

Figure 1. SGL home page demonstrating site-wide high score list, game listing, and links to achievements, help, and user profile information.

Figure 2. Badges displayed on user’s profile page. Available badges not yet achieved are greyed out.

Developers interact with SGL by incorporating a small JavaScript library into their application and using the SGL ‘developer dashboard’ to pair up events in their game with points, badges and quests managed by the SGL server. At this time, SGL only supports games that operate online as Web applications. The games are hosted by the developers and rendered in the SGL context within an iframe. The SGL iframe includes a ‘heads up display’ that gives game players real-time feedback on events sent back to the SGL server such as earning points, gathering badges, or progressing through the stages of a quest (Figure 3). This display provides developers with the ability to add game mechanics to sites that are not overtly games. For example, WikiPathways incorporated a pathway editing tutorial into SGL, using the heads up display to reward users with SGL points and badges for completing various stages of the tutorial. The tutorial also took advantage of the SGL quest-building tool (Figure 4). Games are submitted by developers for approval by SGL administrators. Once approved, the games appear in the public view and can be accessed by any player.

Figure 3. The heads up display provided by the SGL iframe. Shows events captured by the API and provides users with immediate feedback.

Figure 4. Tasks in SGL can be grouped into quests. The figure shows a particular user’s progress through various quests available within the system.
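
The pairing of game events with points and badges described above can be sketched in a few lines of Python. This is only an illustration of the idea, not SGL's actual code or API: the names Reward, Player, REWARDS, GAME_MULTIPLIER and record_event, the event names, and the point values are all assumptions made up for this example. The per-game multiplier stands in for the idea that administrators might award more points for actions in newer games.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Reward:
    """Points and an optional badge awarded for one kind of game event."""
    points: int
    badge: Optional[str] = None


@dataclass
class Player:
    name: str
    points: int = 0
    badges: set = field(default_factory=set)


# Hypothetical dashboard configuration: (game, event) -> reward.
REWARDS = {
    ("pathway_tutorial", "step_completed"): Reward(points=10),
    ("pathway_tutorial", "tutorial_finished"): Reward(points=50, badge="Pathway Editor"),
}

# Hypothetical site-wide multipliers, e.g. to steer players towards newer games.
GAME_MULTIPLIER = {"pathway_tutorial": 2}


def record_event(player: Player, game: str, event: str) -> int:
    """Apply the reward configured for an event and return the points earned."""
    reward = REWARDS.get((game, event))
    if reward is None:
        return 0  # events with no configured reward are ignored
    earned = reward.points * GAME_MULTIPLIER.get(game, 1)
    player.points += earned
    if reward.badge:
        player.badges.add(reward.badge)
    return earned
```

In this sketch a game would call record_event each time something noteworthy happens, and a leaderboard would simply sort players by their points.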

Discussion

If a critical initial mass of effective games can be integrated, SGL could strongly benefit new developers by providing immediate access to a large player population. Site-level status, identity and community features can help with the even greater challenge of long-term player engagement, a noted problem in the field. Within the context of science-related gaming, such status icons might eventually be used as practically useful, real-world marks of achievement in line with the notion of ‘Open Badges’. As demonstrated by the WikiPathways tutorial application, SGL can be used to replace the need for developers to host their own login systems, user tracking databases, and reward systems - all of which can be accomplished using the SGL developer tools. Citizen scientists are not homogeneous in their motivations. Designing to be inclusive of gamers and non-gamers can be challenging. By offering an alternative means of experiencing a web-based citizen science application, SGL allows developers to cater to both their gaming and non-gaming contributor audience. Together, these features unite to raise the overall potential for growth within the world of citizen science and scientific gaming.

Future directions

SGL is currently functional, but so far has attracted only a small number of developers willing to integrate their content into the portal. Future work would need to address the challenge of raising the perceived value of integration with the site while lowering the perceived difficulty. Looking forward, key challenges for the future of SGL include better support for:

games meant for mobile devices
development of quests that span multiple games
teachers to build SGL-focused lesson plans and track student progress
creating new ‘SGL-native’ games
integration with external authentication systems

None of these are insurmountable challenges, but they all require significant continued investment in software development. As an open source project, we encourage contributions from anyone who shares in our vision of spreading and doing science through the grand unifying principle of fun.

Building communities of knowledge with Wikidata

2017-06-16T14:01:00.000-07:00

As the Wikimedia Movement works to define its strategy for the next fifteen years, it is worthwhile to consider how its recent product Wikidata may fit into that strategy. As its homepage states,

“Wikidata is a free and open knowledge base that can be read and edited by both humans and machines.” https://www.wikidata.org/

Wikidata is a particular kind of database designed to capture statements about items in the world with references that support those statements. Because Wikidata is a database, its contents are meant to be viewed in the context of software that retrieves the data through queries and then renders the data to meet the needs of a user in a certain context. The same data can thus be viewed on Wikidata-specific pages such as https://www.wikidata.org/wiki/Q13561329 and in the infoboxes of Wikipedia articles such as https://en.wikipedia.org/wiki/Reelin. Importantly, Wikidata content can also be used in applications outside of the Wikimedia family such as http://wikigenomes.org.
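
To make "statements with references" a little more concrete, here is a minimal Python sketch of how a piece of software might retrieve an item and list its statements. It assumes the JSON layout that Wikidata serves for an item through its Special:EntityData pages (claims grouped by property, each with a mainsnak and an optional list of references). The function names, the User-Agent string, and the hand-made SAMPLE_ENTITY are illustrative assumptions, not anything taken from the Gene Wiki code.

```python
import json
from urllib.request import Request, urlopen


def entity_data_url(item_id: str) -> str:
    """URL of the JSON description of one Wikidata item."""
    return f"https://www.wikidata.org/wiki/Special:EntityData/{item_id}.json"


def fetch_entity(item_id: str) -> dict:
    """Download one item (needs network access; not exercised in the sample below)."""
    request = Request(entity_data_url(item_id),
                      headers={"User-Agent": "example-script/0.1"})
    with urlopen(request) as response:
        return json.load(response)["entities"][item_id]


def summarize_statements(entity: dict) -> list:
    """Return (property, value, number_of_references) for each statement with a value."""
    rows = []
    for prop, statements in entity.get("claims", {}).items():
        for statement in statements:
            snak = statement["mainsnak"]
            if snak.get("snaktype") != "value":
                continue  # skip 'no value' / 'unknown value' statements
            value = snak["datavalue"]["value"]
            if isinstance(value, dict) and "id" in value:
                value = value["id"]  # statements that point at another item
            rows.append((prop, value, len(statement.get("references", []))))
    return rows


# Hand-made sample in the assumed layout, so the sketch runs without network access.
SAMPLE_ENTITY = {
    "claims": {
        "P31": [{
            "mainsnak": {"snaktype": "value", "property": "P31",
                         "datavalue": {"type": "wikibase-entityid",
                                       "value": {"entity-type": "item", "id": "Q7187"}}},
            "references": [{"snaks": {}}],
        }],
    }
}
```

An infobox, a genome browser, or a query service would each start from data shaped like this and render it differently for their own users.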

Examples of Wikidata use now include:

managing interlanguage links across all the Wikipedias
providing content for hundreds of infobox templates
answering queries across topics ranging from potential new cancer drugs to today’s birthdays to subway stations in Paris.

The molecular biology community (and in particular the Gene Wiki group) has embraced Wikidata as a global platform for knowledge integration and distribution. To help envision how Wikidata may fit into the strategic vision of the Wikimedia movement, it is worth taking a look at how and why this particular community is using Wikidata.

History of the Gene Wiki initiative

The sequencing of the human genome at the beginning of this century and the consequent rush of data and new technology for producing even more data fundamentally changed how research in biology is conducted. Before the year 2000, research typically proceeded with a single gene focus. A typical PhD thesis would entail the analysis of the genetics or function of one gene or protein at a time. A few years after the first genome, however, it became possible to measure the activity of tens of thousands of genes at once, resulting in an omnipresent problem of generating interpretations of experimental results containing hundreds of genes. While a scientist may come to grasp the literature surrounding a single gene quite well, it is not possible to know everything there is to know about all 20,000+ genes in the genome - particularly when this knowledge is expanding on a minute by minute basis. As a consequence, there arose a need to produce summaries of what was known about each gene so that researchers could quickly grasp its nature and easily find links to more detailed references as needed. By 2008, many different research groups had published wikis attempting to allow the scientific community to generate the required articles, e.g. WikiProteins, WikiGenes, and the Gene Wiki. The Gene Wiki project was unique among this group as it anchored itself directly to Wikipedia and, likely as a result of that decision, has enjoyed long term success. This initiative works within the English Wikipedia community to encourage and support the collection of articles about human genes. Its main contributions are the infobox seen on the right hand side of these articles and software for generating new article stubs using that template.

Wikidata and the Gene Wiki project

For the past several years, the Gene Wiki core team (funded by an NIH grant) has focused primarily on seeding Wikidata with biomedical knowledge. In comparison to managing data via direct inclusion and parsing of infobox templates as before, this makes the data much easier to maintain automatically and, importantly, opens it up for use by other applications. As a result, Wikipedia isn’t the only application that can use this structured information. One of the first products of that process was a new module (Infobox_gene) that draws all the needed data to render the gene infobox dynamically from Wikidata, greatly reducing the technical challenge of keeping the data presented there in sync with primary sources.

In addition to the relatively simple collection of gene identifiers and links off to key public databases that are presented in the infoboxes, Wikidata now has an extensive and growing network of knowledge linking genes to proteins, proteins to drugs, drugs to diseases, diseases to pathogens, pathogens to places, places to events, events to people, and so on and so on. This unique, open, referenced, knowledge graph may eventually become the closest thing to ‘the sum of all human knowledge’. Capturing knowledge in this structured form makes it possible to use it in all kinds of applications, each with their own community-specific user experiences. As a case in point, the Gene Wiki group created Wikigenomes based primarily on data loaded into Wikidata. This was followed quickly by Chlambase, an application specifically focused on distributing and collecting knowledge about different Chlamydia genomes. These applications provide domain-specific user interface components such as genome browsers that are needed to present the relevant information effectively and thereby attract the attention of specialist users. These users, in turn, have the opportunity to contribute their knowledge back to the broader community through contributions to Wikidata that can be mediated by the same software.

Wikidata and the world

The molecular biology research community, as represented by the Gene Wiki project, is an early adopter of Wikidata as a community platform for the collaborative curation and distribution of structured knowledge, but it is not alone. The same fundamental patterns are already being applied by other communities, e.g. those interested in digital preservation and open bibliography. In each case, we see communities working to transition from the current dominant paradigm of private knowledge management towards the knowledge commons approach made possible by Wikidata. This is not unlike the transition from the world of the Encyclopedia Britannica to the world of Wikipedia. The only important difference is that the knowledge in question is structured in a way that makes it easier to reuse in different ways and in different applications.

Wikidata provides a mechanism for massively increasing the global good generated by the Wikimedia Foundation’s work by capturing knowledge in a form that can be agilely used to empower all manner of software with the sum of human knowledge.

Happy St. Patrick's Day

2017-03-17T14:25:00.003-07:00

Cognitive computing and the National Library of Medicine

2017-01-03T21:07:00.001-08:00

"IBM Watson for Drug Discovery helps researchers and organizations discover potential new drug targets and additional drug indications. IBM’s cloud-based, enterprise solution analyzes scientific knowledge and data to reveal patterns and connections that accelerate the formation of new hypotheses, increasing the likelihood and pace of scientific breakthroughs."

It bothers me that there is no true open source, open access version of this kind of system. Should it? Or should we accept that it costs a lot of money to put together software like this and that there is nothing wrong with making a profit on building good software?

The issue to me is that the root content of the product being sold is knowledge and knowledge is more useful (for producing more of itself) when more people have access to it. It is impossible to imagine the impact that PubMed/MEDLINE has had on the advance of biomedical science. Researchers simply could not do their work without it. As our collective knowledge base expands, tools for using that knowledge will inevitably need to look more and more like Watson and less like digital paper libraries.

Will we ever see the U.S. National Library of Medicine or its equivalent in other countries move into the age of cognitive computing? Is it solely up to industry to fill the increasingly obvious gap? I guess it depends where we want to place that power.

Introducing Knowledge.Bio

2015-10-23T12:48:00.000-07:00

I just prepared the following poster abstract for the upcoming Big Data 2 Knowledge all-hands meeting at NIH. Please play with the tool it describes and let us know what you think (it is a work in progress!). Also, if you have a chance, please stop by the poster and say hello!

Knowledge.Bio: an Interactive Tool for Literature-based Discovery

Personal knowledge graph showing literature-derived connections between Sepiapterin Reductase (SPR) and 5-Hydroxytryptophan (a treatment for patients with deleterious mutations in SPR).

Benjamin M. Good, Ph.D.1; Richard M. Bruskiewich, Ph.D.2; Kenneth C. Huellas-Bruskiewicz2; Farzin Ahmed2; Andrew I. Su, Ph.D.1

1 The Scripps Research Institute, La Jolla, CA, USA. 2 STAR Informatics / Delphinai Corporation, Port Moody, BC, Canada

PubMed now indexes roughly 25 million articles and is growing by more than a million per year. The scale of this “Big Knowledge” repository renders traditional, article-based modes of user interaction unsatisfactory, demanding new interfaces for integrating and summarizing widely distributed knowledge. Natural language processing (NLP) techniques coupled with rich user interfaces can help meet this demand, providing end-users with enhanced views into public knowledge, stimulating their ability to form new hypotheses.

Knowledge.Bio provides a Web interface for exploring the results from text-mining PubMed. It works with subject, predicate, object assertions (triples) extracted from individual abstracts and with predicted statistical associations between pairs of concepts. While agnostic to the NLP technology employed, the current implementation is loaded with triples from the SemRep-generated SemmedDB database and putative gene-disease pairs obtained using Leiden University Medical Center’s ‘Implicitome’ technology.

Users of Knowledge.Bio begin by identifying a concept of interest using text search. Once a concept is identified, associated triples and concept-pairs are displayed in tables. These tables have text-based and semantic filters to help refine the list of triples to relations of interest. The user then selects relations for insertion into a personal knowledge graph implemented using cytoscape.js. The graph is used as a note-taking or ‘mind-mapping’ structure that can be saved offline and then later reloaded into the application. Clicking on edges within a graph or on the ‘evidence’ element of a triple displays the abstracts where that relation was detected, thus allowing the user to judge the veracity of the statement and to read the underlying articles.
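
The workflow above (find a concept, filter its triples, add selected relations to a personal graph, inspect the evidence, save and reload) can be sketched in Python. This is a toy illustration under stated assumptions, not the Knowledge.Bio implementation, which builds its graph with cytoscape.js: the names Triple, filter_triples and PersonalGraph are invented here, and the concepts and abstract identifiers in TRIPLES are placeholders rather than real extracted data.

```python
import json
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str
    abstracts: tuple  # identifiers of the abstracts where the relation was detected


def filter_triples(triples, concept, predicates=None):
    """Return triples that mention `concept`, optionally limited to some predicates."""
    wanted = concept.lower()
    return [
        t for t in triples
        if wanted in (t.subject.lower(), t.obj.lower())
        and (predicates is None or t.predicate in predicates)
    ]


class PersonalGraph:
    """A note-taking graph: nodes are concepts, edges are the selected triples."""

    def __init__(self):
        self.edges = []

    def add(self, triple):
        if triple not in self.edges:
            self.edges.append(triple)

    def nodes(self):
        return sorted({t.subject for t in self.edges} | {t.obj for t in self.edges})

    def evidence(self, subject, obj):
        """Abstract identifiers behind every edge from `subject` to `obj`."""
        return [a for t in self.edges
                if (t.subject, t.obj) == (subject, obj)
                for a in t.abstracts]

    def dumps(self):
        """Serialize the graph so it can be saved offline."""
        return json.dumps([t._asdict() for t in self.edges])

    @classmethod
    def loads(cls, text):
        """Rebuild a graph from text produced by dumps()."""
        graph = cls()
        for row in json.loads(text):
            row["abstracts"] = tuple(row["abstracts"])
            graph.add(Triple(**row))
        return graph


# Placeholder data, not real text-mining output.
TRIPLES = [
    Triple("DrugA", "TREATS", "DiseaseX", ("abstract-1", "abstract-2")),
    Triple("GeneB", "ASSOCIATED_WITH", "DiseaseX", ("abstract-3",)),
    Triple("DrugA", "INTERACTS_WITH", "GeneC", ("abstract-4",)),
]
```

Keeping the abstract identifiers on each edge is what lets a reader click through from a relation to the text that supports it.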

Knowledge.Bio is a free, open-source application that can provide deep, personal, concise, shareable views into the “Big Knowledge” scattered across the biomedical literature. It is available at http://knowledge.bio, with source code at https://bitbucket.org/starinformatics/gbk.

Poof it works - using wikidata to build Wikipedia articles about genes

2015-10-21T12:10:00.000-07:00

Infobox for ARF6, rendered entirely from content in Wikidata

The Gene Wiki team has been hard at work filling Wikidata with useful content about genes, diseases, and drugs using the new and improved ProteinBoxBot. Now, we are starting to see the fruits of this labor in the context of Wikipedia.

The Gene Wiki project has programmatically created and maintained the infoboxes to the right of all the articles in Wikipedia about human genes since about 2008 [Huss 2008]. This process has entailed the construction of a unique template containing all of the relevant data for each gene. For example, here is the code for the template for the ARF6 gene. As Wikipedia previously had no database, that is where the data was stored. Altering that content programmatically involves parsing that template as a string. It's ugly (sorry Jon) and there are more than 11,000 of these templates to maintain (one per gene in Wikipedia).

Now, the same data can be represented in Wikidata, a queryable, open graph of claims about the world backed by references and specified by qualifiers [Vrandečić 2014]. Now that the content needed to render the infobox is all there, we can convert 11,000+ complex templates that require string parsing to maintain to a single, re-usable template for all of them.
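
To give a feel for what "parsing that template as a string" means, here is a small Python sketch. It is not the bot's actual code: the template name Example_gene_box, its parameters, and the placeholder values are hypothetical, and the parser is deliberately naive (it splits on every pipe character, so it would break on values that contain links or nested templates). That fragility is exactly the kind of thing that goes away when the data lives in Wikidata and one shared template renders it.

```python
def parse_template(wikitext: str):
    """Split a simple template call into its name and a dict of parameters."""
    body = wikitext.strip()
    if not (body.startswith("{{") and body.endswith("}}")):
        raise ValueError("not a template call")
    name, *params = [part.strip() for part in body[2:-2].split("|")]
    fields = {}
    for param in params:
        if not param:
            continue  # tolerate a trailing pipe
        key, _, value = param.partition("=")
        fields[key.strip()] = value.strip()
    return name, fields


def render_template(name: str, fields: dict) -> str:
    """Write the template call back out, one parameter per line."""
    lines = ["{{" + name]
    lines += [f" | {key} = {value}" for key, value in fields.items()]
    lines.append("}}")
    return "\n".join(lines)


# Hypothetical per-gene template with placeholder values.
OLD_STYLE = """{{Example_gene_box
 | Symbol = ARF6
 | ExampleDatabaseId = 0000
}}"""

name, fields = parse_template(OLD_STYLE)
fields["ExampleDatabaseId"] = "1111"  # update one value and write it back
UPDATED = render_template(name, fields)
```

With the data in Wikidata, the per-article wikitext shrinks to the single call {{infobox gene}} and updates happen in one place.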

The first cut at the new template is {{infobox gene}}. If you put that on any article about a human gene, you ought to get the complete infobox for the article without any further ado. Poof! You can view it in action on this revision for ARF6. We haven't rolled out the new template across all the articles yet, but hope to see that happen in the coming months. Remaining issues include: better error-handling in the template code, better ways to give users the ability to edit the associated data in Wikidata, and updates to all of the code that produces Gene Wiki articles. If you want to help, chime in on the Module:Wikidata thread.

Recruiting NLP-crowdsourcing-semantic-web postdoc or staff scientist

2015-07-07T11:50:00.000-07:00

Our laboratory at The Scripps Research Institute in beautiful San Diego, California is recruiting a talented individual to help us use crowdsourcing to push the boundaries of biomedical information extraction and its applications. We are looking for someone with experience in natural language processing (statistical or linguistic), machine learning, and knowledge representation. This person would work to integrate efforts across several related projects.

Ongoing and nascent projects include:

Collaboration with the NASA tournament laboratory at Harvard University to use TopCoder innovation contests to improve the BANNER named entity recognition tool using data collected with Amazon Mechanical Turk.
Collaboration with the Biosemantics group at Leiden University on applications of PubMed-scale knowledge discovery for human genetic diseases.
Collaboration with UCSD and Stanford to implement an open-access, biomedically-focused installation of the DeepDive knowledge base extraction system.
Collaboration with the BeFree team from the Integrative Biomedical Informatics Group of the Institut Hospital del Mar d’Investigacions Mèdiques (IMIM) to use the Crowdflower crowdsourcing platform to generate ground truth corpora for training a relation extraction system to be entered in the 2015 BioCreative challenge.
The development of the knowledge.bio application for browsing networks of extracted knowledge and gathering user feedback for improving the underlying knowledge base.
Ongoing development of the citizen science application Mark2Cure with an emphasis on its integration with all of the projects listed above.

Sound like fun? Ready to jump in?

Contact Andrew Su and/or Benjamin Good for more information.

p.s. We have other openings in related areas!

crowdsourcing machine learning NLP challenge

2015-03-04T11:48:00.002-08:00

There is a TopCoder contest running right now that involves machine learning, crowdsourced data, and natural language processing. There is $41,000 up for grabs! It will be distributed in many smaller prizes so there are plenty of opportunities to win something.

You need to register by tomorrow (March 4, 2015) to participate!

More details here:
https://www.topcoder.com/blog/registration-for-banner-is-now-open/

Return of OntoLoki - on a Freaky Friday

2015-02-24T09:39:00.000-08:00

One of my favorite traditions initiated in Mark Wilkinson's laboratory (where I did my PhD) was "Freaky Friday". Following the spirit of Google's "20% time", lab members were encouraged to take out some time on Friday to pursue their own crazy ideas - leaving their core projects aside for a few hours. This was fun and ended up producing some useful things like "Tag clouds for summarizing web search results" (one of my most cited articles and only 2 pages long!) and "The Entity Describer". In this spirit, I took out some time last Freaky Friday to do something very far away from the things I should really be doing. Sadly I didn't hack something fun together. Instead, per Dr. Wilkinson's 50th request for me to do this, I performed a Frankensteinian resurrection. Digging deep into the grave of my PhD dissertation, I found the one unpublished chapter, cleaned it up, and deposited it into the arXiv. The project, called OntoLoki, is now undead! Check out this undead manuscript about automatic quality evaluation for ontologies.

I am OntoLoki! Feed me your ontologies for evaluation!

OntoLoki: an automatic, instance-based method for the evaluation of biological ontologies on the Semantic Web

Benjamin M. Good
Gavin Ha
Chi K. Ho
Mark D. Wilkinson

Building a Garden of Biological Knowledge

2015-02-20T16:06:00.000-08:00

On March 13, 2013, I wrote up an idea in my notebook that I called 'Pubmed Daily'. The concept was to build a system that would leverage large-scale crowdsourcing/citizen science and machine learning to produce a high-quality, structured representation of the knowledge in every abstract in PubMed on the same day that the abstract appeared online. Nearly two years later, based mainly on the labor of outreach coordinator Ginger Tsueng, group leader Andrew Su, and programmer Max Nanis, the idea is just starting to bear fruit (albeit small, perhaps grape-sized fruits...). As the bits and pieces start to come together, I thought it would be worthwhile to share the high-level vision as it exists now.

A Garden of Biological Knowledge

We want to build an information management system (or systems) that supports the work of three key groups of people: bioinformaticians such as Andrew, biologists such as Hudson Freeze, and patient advocates such as the parents and friends of Bertrand Might, a child with a rare genetic disorder related to the NGLY1 gene. The over-arching goal is to produce more rapid biomedical advances based on more effective use of existing knowledge in the processes of hypothesis generation and high-throughput data analysis.

The thinking goes like this. Given a high-quality, structured knowledge base such as the Gene Ontology (GO), Andrew and people like him can make many different kinds of discovery and analytical tools that can help scientists such as Hudson work more effectively (and they do, there are thousands of tools that use the GO). The problem is that the generation of knowledge bases like the GO is a long, slow, expensive process that in no way keeps pace with the advance of knowledge as represented in the literature. Information extraction systems like DeepDive and SemRep can theoretically go a long way to addressing this problem. However, humans remain more effective (though obviously dramatically slower) readers.

Can we use computers to seed a garden of knowledge that can be tended and grown by citizen scientists?

Given a compelling argument, clear instructions, and an effective user interface, we think that large numbers of people from the general public could be assembled to work on improving the results of a biomedical information extraction system. We, and other groups, have been experimenting with various related tasks using the Amazon Mechanical Turk and are now confident that "the crowd" can, in aggregate, do text-processing work at or above expert level. A recent conversation with a leader of the Zooniverse project, a collection of online citizen science projects with more than a million people contributing, leads us to believe that it would be possible to attract tens to hundreds of thousands of people to participate in an effort like this. Recently, we took the first steps towards testing that assumption via a short but successful test run of the Mark2Cure Web application.
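
One simple way to picture how "the crowd" can work in aggregate is majority voting over redundant answers. The Python sketch below is a stand-in to illustrate the idea, not the aggregation method used in our experiments: the function name, the threshold parameter, and the example labels are all assumptions. Each text snippet is shown to several contributors, and a label is kept only when more than a chosen fraction of them agree.

```python
from collections import Counter


def aggregate_annotations(votes_by_item, min_agreement=0.5):
    """Keep the most common label per item when its share of votes exceeds min_agreement."""
    consensus = {}
    for item, votes in votes_by_item.items():
        if not votes:
            continue  # nobody has annotated this item yet
        label, count = Counter(votes).most_common(1)[0]
        if count / len(votes) > min_agreement:
            consensus[item] = label
    return consensus


# Hypothetical example: three contributors label two snippets.
votes = {
    "snippet-1": ["disease", "disease", "gene"],
    "snippet-2": ["gene", "disease"],
}
result = aggregate_annotations(votes)
# snippet-1 is kept as "disease" (2 of 3 agree); snippet-2 is dropped (a tie).
```

Items without a clear consensus could be routed to more contributors or to an expert, which is where the redundancy of a large volunteer pool pays off.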

Can we build a new generation of tools for working with structured biomedical knowledge at massive scales and use these to empower the rising community of citizen scientists?

Aside from the knowledge bases themselves, we are also interested in building better tools for navigating this information and in putting them into the hands of both professionals like Hudson and the many very intelligent, highly motivated people from other domains that might also be able to find something important given the chance (e.g. Matthew Might).

Can a large volunteer work force help teach computers to read?

By engaging volunteers at scale, we hope to provide the developers of information extraction algorithms with the data they need to raise their approaches to human levels of quality.

We have a long, challenging road ahead on this project, but the path is starting to take shape and the future looks bright!

Hackathon recap: Network of Biothings in San Diego

2014-11-11T16:27:00.000-08:00

This past weekend, Nov. 7-9, the Neuroscience Information Framework, the Su Lab, NDEx, the International Society for Biocuration and the San Diego Center for Systems Biology hosted the second Network of Biothings Hackathon. The event took place at Atkinson Hall (Calit2) on the UC San Diego campus. For the record, and to enhance what can already be seen in the Twitter story of this event, here is what went down.

Friday evening about 30 people arrived at Atkinson Hall where they:

consumed Indian food and American beverages,
met each other and talked over project ideas,
each decided on a project to work on for the weekend

Saturday the group:

ate bagels for breakfast, sandwiches and leftover Indian food for lunch, and pizza for dinner
hacked furiously to develop 8 distinct team projects

Sunday morning, we came together for more bagels, some last-minute hacking and rapid presentation formation. At 10:30am we got started with eight 10-minute project presentations. Here is a very brief list of the projects with links out to all of the code that was developed. (More details are available on the ideas page.)

BioPolymer: A set of embeddable web components to display and edit bio data. Initially data from MyGene & MyVariant. Team: Mark Fortner, Keiichiro Ono

http://bitbucket.org/mark_fortner/biopolymer

SameSame: a dynamic tool for finding and visualizing the degree of similarity between any set of biomedical concepts. Team: Benjamin Good, Maulik Kamdar, Alan Higgins, Alex Williams

https://bitbucket.org/sulab/samesameweb

CIViC - Clinical Interpretation of Variants in Cancer: Crowdsourcing and web interface for curation of clinically actionable evidence in cancer. Team: Obi Griffith, Adam Coffman, Martin Jones, Karthik G, Jon Cristensen, Julianne Schneider

Citizen Science: An app to enable people to extract structured ‘facts’ (subject predicate object triples) from unstructured text. Here is the project presentation. Team: Richard Good, Hannes Niedner, Andrew Su

BRAT-CS: https://github.com/Network-of-BioThings/brat-citizenscience
iOS annotator: https://bitbucket.org/sulab/citizenscience/

SBiDer: Synthetic Biocircuit Developer: A web-app to search a database of operons [functional biochemical pathways] to use them in novel ways [to make synthetic organisms such as the ones used to make this Malaria treatment]. Team: Justin Huang, Kwat Yeerna, Fernando Contreras, Joaquin Reyna, Jenhan Tao

NDEx: The NDEx project provides a public website where scientists and organizations can share, store, manipulate, and publish biological network knowledge. Team: Dexter Pratt

fiSSEA: A framework that integrates MyGene.info and MyVariant.info to retrieve functional prediction annotations (or any type of annotation) for knowledge discovery, specifically implementing CADD scores for “functional impact SNP Set Enrichment Analysis”. Team: Adam Mark, Erick Scott, Chunlei Wu

https://github.com/Network-of-BioThings/fiSSEA

MyGene.Info Taxonomy Query: Added detailed taxonomy information to mygene.info. Allows queries based on taxonomy ID and advanced queries based on hierarchies of taxonomic nodes. Team: Greg Stupp, Chunlei Wu

https://bitbucket.org/stuppie/metaproteomics/ (see Taxonomy Parser)

Hackathon Trophy

And the winners were...

#1 Citizen Science

#2 SBiDer

There were a lot of very exciting things about this event. In addition to a strong core of academics from multiple universities, we also had local app and algorithm developers from industry. While the US west coast lean was powerful, we did have representatives from St. Louis and one that came all the way from Canada. We also had a very strong undergraduate team (taking second place!). All of the projects clearly made real progress over the weekend with some excellent cross-pollination of code and ideas.

And the coolest thing about this event??

My Dad was on the winning team :).. Go Dad!

Special thanks to Martin Jones for running around with me picking up bagels, pizza, drinks, and snacks for everyone and to Jeffrey Grethe for keeping everything running smoothly at the event, pushing around tables, helping clean up after the hackers, and calmly handling everything that needed to be done.

What is a hackathon?

2014-11-04T11:53:00.001-08:00

As an organizer of the upcoming Network of Biothings Hackathon at UC San Diego, I've been asked by a number of people what a hackathon is exactly. I'm repurposing one of those responses here (original posted on the San Diego iOS developers meetup group).

The main idea is that a variety of different people come together to meet each other and make something together - almost always something open source. In the case of our hackathon, the purpose is specifically to engender new collaborations. For the academics these could translate into new research programs and new collaborative grant proposals. For industry folks, these could turn into new products. Ideally, a hackathon can bring together the elements of a great new team. For example, I'm a back-end database guy with an understanding of bioinformatics. I'd love to find a front-end web or app developer to help make my data and algorithms useful to the rest of the world.

Here are a few questions I've fielded:

Why do developers pay to build apps for somebody else?

The money goes to pay for food, drinks, facilities fees, and to a small extent the prize. Developers don't pay to develop for someone else, they pay to meet other people and to eat.. No one is under any obligation to give their code to anyone else. You would be welcome to come by and work on your own project. In fact, we are actively trying to get more project ideas posted for our hackathon. (Note that the fee we are asking for is only $40 and many hackathons with larger sponsors are free).

Why do developers put in their time to do work for free?

The main point here is team formation. If you have a great app, this is also a way to advertise it - especially if it wins a prize.

Do the teams who paid to participate build apps and one is chosen as winner?

Yes, this is the basic idea.

Are the rest thrown away?

Nothing is thrown away. The participants maintain ownership of all code that is written. (Though open source is very strongly encouraged...)

Am I too young/old to participate?

Nope! All are welcome.

In conclusion, hackathons are fun, social events for people who like to build new things, meet new people, and perhaps change the world. Sign up for ours and find out for yourself!

Network of Biothings Hackathon at UC San Diego

2014-10-06T14:22:00.000-07:00

Can you code?

Are you interested in the intersection of computer science and biology (bioinformatics) ?

Do you want to meet interesting people?

Are you excited about building new pieces of software that could change the face of science and medicine?

Do you want to win a cash prize for your open source code?

Then it's clearly time to:

Location: UC San Diego on the 5th floor of the CALIT2 building.
Sign up: sign up
Schedule:
Friday, November 7

6-10pm : Welcome social / project team formation

Saturday, November 8

9am-? : Hacking !

Sunday, November 9

9am-10:30am: Final hacking / presentation preparation
10:30am-11:30am: Pitches and Demos
11:45am: Prize announcements

Click here for additional details.

Conference proceedings are citable, stop double-dipping

2014-10-01T10:24:00.000-07:00

Over the last several years there has been a trend among conferences such as the Bio-Ontologies Special Interest Group meeting at ISMB and the Semantic Web applications and tools for life sciences (SWAT4LS) to invite article submissions from people who want to present at the conference and then to subsequently invite presenters to expand their article into an "official" publication in an associated journal. Since the PDFs of the articles submitted to the conference usually end up online, as they should, this results in a situation where first reviewers and later readers are often confronted with two versions of essentially the same article - typically with the same title, author list, and often the same abstract. This causes problems for reviewers, as this kind of overlap with prior work (even from the same authors) would typically be grounds for rejection - yet because of this bizarre arrangement with the conference, reviewers are supposed to treat the original article as if it were a pre-print of the expanded one, despite the fact that it is a citable entity on the Web and is never referred to as a preprint anywhere.

Conference organizers, please stop this madness. Here are three models that would be better.

Following the International Biocuration Conference model, invite submissions directly to the partner journal first and then choose presenters from the successful submissions and independently submitted abstracts. No confusion. One good, citable paper. Probably higher quality conference submissions.
Put the articles submitted to the conference in a pre-print server such as arXiv or bioRxiv and continue with the concept of an expanded article in a journal.
Do what the computer science community does and recognize contributions to conference proceedings as citable articles and do away with the attempt to get an 'official' journal publication in addition to the conference citation.

Network of BioThings: Hackathon 2 San Diego (when?)

2014-08-13T10:51:00.000-07:00

The Network of Biothings, first announced in December of 2013, is being imagined by a loose, self-organizing consortium of people who share the vision of uniting and linking the world's biological and medical knowledge. In support of this vision, The Su Laboratory, with partners at UCSD, is gearing up to host the second Network of Biothings Hackathon. The first hackathon was an exciting and very educational event that sparked some useful projects such as http://myvariant.info/. We are hoping to build on that momentum with an even more successful second event.

If you would like to participate in Hackathon 2, you can begin by helping us solve the most challenging problem of all: picking dates for a hackathon! Please fill in dates that you would be available to come hack with us in San Diego at this poll:

http://doodle.com/z2irpfma6apyavpk

Why should you bother?
When faced with challenges such as selecting the best treatment for a patient or coming up with the next candidate drug target for a rare disease, we are now presented with an unbelievable wealth of data including: full genome sequencing, mRNA expression, miRNA expression, methylation, metabolomics, proteomics, clinical, imaging, and on and on. In order for this new data to be useful, we depend on networks of knowledge. For example, we may be able to detect that a particular gene is acting unusually in a patient, but we need to know something about that gene's biological function before we can use the new information to inform a clinical decision. Many valuable databases continue to arise that help address this fundamental challenge, but there is a clear consensus that most knowledge - especially the vast amount that is shared through the literature - is not accessible in any coherent form. With your help, that coherent form - whatever it ends up looking like - could arise from the Network of Biothings.

Zooniverse Proposal: Excavating a network of concepts related to Chordoma from the biomedical literature

2014-07-28T14:06:00.001-07:00

The Zooniverse team, in collaboration with other members of the Oxford community including the Faculty of English Language and Literature, has recently started an initiative about Constructing Scientific Communities. As part of this initiative, they announced an open call for proposals. Here is our proposal (originating from our work on the Mark2Cure project).

Title: Excavating a network of concepts related to Chordoma from the biomedical literature

Abstract: The life sciences are currently faced with a rapidly growing array of technologies for measuring the molecular states of living things. From sequencing platforms that can assemble the complete genome sequence of a complex organism involving billions of nucleotides in a few days to imaging systems that can just as rapidly churn out millions of snapshots of cells, biology is truly faced with a data deluge. To translate this information into new knowledge that can guide the search for new medicines, biomedical researchers increasingly need to build on the existing knowledge of the broad community. Prior knowledge can help guide searches through the masses of new data. Unfortunately, most biomedical knowledge is represented solely in the text of journal articles. Given that more than a million such articles are published every year, the challenge of using this knowledge effectively is substantial. Ideally, knowledge such as the interrelations between genes, drugs, biological processes and diseases would be represented in a structured form that enabled queries like: “show me all the genes related to this disease or related to any drugs used to treat this disease”. Currently systems exist that attempt to extract this information automatically from text, but the quality of their output remains far below what can be obtained by human readers. We propose to construct a scientific community focused on translating the knowledge in the biomedical literature into structured forms suitable for effective access, aggregation and querying. Specifically we propose to excavate a network of concepts related to Chordoma, leveraging an existing relationship we have with the Chordoma Foundation. Chordoma is a rare, devastating form of cancer that develops along the skull and bones of the spine. There are tens of thousands of articles about this disease, related diseases, related genes, and related biological processes. 
Extracting the network of knowledge represented in these texts will enable our group and others to more effectively identify existing drugs that might be repurposed to treat Chordoma and to produce hypotheses about genes that might be targets for new drugs.

Please provide details of the images, video or sounds which form the basis of your project, and the task or tasks you envisage volunteers carrying out. As well as a description, include details of format and any copyright restrictions.

The subject matter for this task will be the abstracts of biomedical research articles housed in the PubMed database [1]. PubMed currently has more than 23 million abstracts and is growing at a rate of approximately 1 million new articles every year. From these, we have identified a set of approximately 50,000 articles related to Chordoma that would form the basis of this project. This set was selected by: 1) searching PubMed for Chordoma (produces 3333 articles), 2) searching within these articles for genes (produces 63 genes), 3) searching for articles related to those genes (produces an average of 731 articles per gene). These abstracts can easily be accessed via an open Web API [2]. PubMed displays abstracts based on ‘fair use’ agreements with the many journals that supply them. Some journals do maintain official hold over the copyrights for these abstracts, but in practice the abstracts are free for public use. (The full text of the articles is a different matter, though many new open access articles are available without restrictions.)
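As a rough illustration of how the first and third selection steps could be scripted against the E-utilities API [2], here is a minimal Python sketch. The function names are hypothetical, the gene query term is only an example, and the sketch only builds the ESearch request URLs rather than calling the service. The size estimate simply adds up the figures quoted above and ignores any overlap between the per-gene result sets.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_search_url(term, retmax=100):
    """Build an ESearch URL asking PubMed for IDs of articles matching `term`."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return ESEARCH_URL + "?" + urlencode(params)


def estimate_corpus_size(seed_articles, gene_count, mean_articles_per_gene):
    """Upper-bound estimate: seed articles plus the articles found per gene.

    Overlap between the per-gene result sets is ignored (an assumption),
    so the number of distinct articles may be smaller.
    """
    return seed_articles + gene_count * mean_articles_per_gene


# Step 1: the seed query for the disease itself.
seed_url = build_pubmed_search_url("chordoma")

# Step 3: one query per gene found in the seed articles (illustrative term).
gene_url = build_pubmed_search_url("brachyury")

# Figures quoted in the text: 3333 seed articles, 63 genes, 731 articles per gene.
print(estimate_corpus_size(3333, 63, 731))  # 49386, i.e. roughly 50,000
```

Step 2 (finding gene mentions within the seed articles) is not shown, since the proposal does not say which tool was used for it.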

The tasks involved in this proposal include the annotation of key kinds of biomedical entities in the text of these abstracts. Specifically, we will ask participants to identify words or phrases that correspond to diseases, genes, chemical entities and biological processes. After highlighting the specific phrase corresponding to one of these concepts, the volunteers will then be asked to find the highlighted concept in an existing ontology (a hierarchically organized controlled vocabulary) that we would provide. The first task is often referred to as ‘concept detection’ and the second as ‘concept normalization’. (If limited to concept detection, e.g. if the concept normalization interface were too costly to engineer in this iteration, the project would still be a very valuable contribution.) See the ‘egas’ web application for an example tool that supports these tasks: http://bioinformatics.ua.pt/egas/.

1. PubMed [http://www.ncbi.nlm.nih.gov/pubmed]

2. NCBI E-utilities [http://www.ncbi.nlm.nih.gov/books/NBK25501/]

Provide a brief description of the research which will be enabled by the crowdsourcing project. *

Please write up to 1000 words for a non-specialist audience. Include references in the text.

Precisely identifying occurrences of diseases, genes, biological processes, and chemical entities in biomedical text will help to drive both biomedical research and research into natural language processing (NLP). In the long run, we anticipate that NLP technology will eventually mature to the point where manual text annotation tasks such as that proposed here are not necessary. However, progress towards that objective is slow and is hampered by the need for large, manually annotated “gold standard” corpora with which to train machine learning systems and evaluate computational predictions [3]. The annotations captured through this project will form an invaluable resource for the NLP community to use to hone their algorithms. Further, while NLP technology advances in steps that can take decades to unfold, we can make immediate use of the products of this project to advance research on Chordoma. Here, we describe the Chordoma research that could be enabled by this project (leaving discussion of the project’s impact on NLP research to the next section on automated processing routines).

Modern approaches to drug development often begin with the identification of specific genes that are ‘targets’ for the drug. Once a particular gene has been identified, drugs can be designed that repress or enhance its activity and thereby treat the intended disease. The identification of good gene targets is a critically important step in drug development because the process of creating and testing a particular drug is incredibly costly in terms of both time and money. In fact it has been estimated that, in general, it takes more than a decade and costs more than a billion dollars to bring a single drug to market - with many drugs failing at the final stages of the process [4].

In Chordoma, mutations in a gene called ‘Brachyury’ are present in more than 90% of afflicted patients [5]. This information makes it one center of the search for drug targets. Several studies have shown that if Brachyury is repressed in Chordoma cell lines (reproducing cell populations derived from Chordoma tumors), the characteristics that the cells share with malignant tumors, such as their capacity to proliferate, are significantly decreased [6]. Brachyury represents one promising drug target for Chordoma therapy (and for other cancers [7,8]), but we are still far from a cure. No drugs have been approved by the United States Food and Drug Administration for the treatment of Chordoma, and while Brachyury is clearly an important component, it does not act alone. Genes work together in complex relationships, often referred to as ‘biological pathways’, to produce both healthy and diseased phenotypes. Other genes “turn on” the expression of Brachyury, which in turn activates or represses the expression of other genes downstream. Many different members of this cascade could prove to be effective drug targets. It is also important to keep in mind that these genes have normal, important functions that may make them unsuitable for drug targeting. Understanding this network of interacting genes and the biological processes that they carry out is thus a crucial step in the rational selection of candidate genes.

A thorough map of the genes and biological processes related to Chordoma would be a powerful tool for research and is a challenge well-suited to a large community. While some of the required information is present in databases such as that provided by the Gene Ontology consortium [9], which catalogues the function of genes, most remains represented in the text of scientific articles. By tagging the occurrences of the crucial concepts (genes, diseases, chemicals, and biological processes) in these articles, we can build a network that links them together. This network could then be used by scientists to guide their choice for the next experiments to execute in their search for cures.
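To make the idea of linking tagged concepts into a network concrete, here is a minimal Python sketch. It assumes the simplest possible linking rule, namely that two concepts are connected whenever they are tagged in the same abstract. That rule, the function name, and the example annotations are all illustrative assumptions rather than the project's actual method.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence_network(tagged_abstracts):
    """Count how many abstracts mention each pair of concepts together.

    `tagged_abstracts` maps an abstract ID to the set of concepts tagged in it.
    Returns a Counter keyed by alphabetically sorted (concept_a, concept_b) pairs.
    """
    edges = Counter()
    for concepts in tagged_abstracts.values():
        for pair in combinations(sorted(concepts), 2):
            edges[pair] += 1
    return edges


# Made-up annotations for three abstracts (IDs and tags are illustrative only).
tagged = {
    "abstract-1": {"chordoma", "brachyury"},
    "abstract-2": {"chordoma", "brachyury", "angiogenesis"},
    "abstract-3": {"angiogenesis", "chordoma"},
}
network = build_cooccurrence_network(tagged)
print(network[("brachyury", "chordoma")])     # 2
print(network[("angiogenesis", "chordoma")])  # 2
```

The edge counts give a crude measure of how strongly two concepts are associated in the literature; a real system would likely weight or filter them further.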

In addition to finding novel target genes for the development of new drugs, another important direction for Chordoma research is the search for existing drugs that might be effective on this disease. The challenge here again is one of selecting which of tens of thousands of available drugs to test. This process, called “drug repositioning”, could also be enhanced through the provision of an effective map of the biological knowledge network surrounding Chordoma. Existing drugs that treat related diseases (such as other forms of cancer) or that target proteins in biological processes known to be important to Chordoma progression (such as angiogenesis) form potential candidates for repositioning. Once again, the quality and breadth of the network of knowledge related to Chordoma will have a direct impact on the success of identifying such drugs.

3. Deleger L, Li Q, Lingren T, Kaiser M, Molnar K, Stoutenborough L, Kouril M, Marsolo K, Solti I: Building gold standard corpora for medical natural language processing tasks. AMIA Annu Symp Proc 2012, 2012:144-153.

4. DiMasi JA, Grabowski HG: The cost of biopharmaceutical R&D: is biotech different?. Managerial and Decision Economics 2007, 28(4):469-479.

5. Pillay N, Plagnol V, Tarpey PS, Lobo SB, Presneau N, Szuhai K, Halai D, Berisha F, Cannon SR, Mead S et al: A common single-nucleotide variant in T is strongly associated with chordoma. Nat Genet 2012, 44(11):1185-1187.

6. Presneau N, Shalaby A, Ye H, Pillay N, Halai D, Idowu B, Tirabosco R, Whitwell D, Jacques TS, Kindblom LG et al: Role of the transcription factor T (brachyury) in the pathogenesis of sporadic chordoma: a genetic and functional-based study. The Journal of pathology 2011, 223(3):327-335.

7. Imajyo I, Sugiura T, Kobayashi Y, Shimoda M, Ishii K, Akimoto N, Yoshihama N, Kobayashi I, Mori Y: T-box transcription factor Brachyury expression is correlated with epithelial-mesenchymal transition and lymph node metastasis in oral squamous cell carcinoma. International journal of oncology 2012, 41(6):1985-1995.

8. Roselli M, Fernando RI, Guadagni F, Spila A, Alessandroni J, Palmirotta R, Costarelli L, Litzinger M, Hamilton D, Huang B et al: Brachyury, a driver of the epithelial-mesenchymal transition, is overexpressed in human lung tumors: an opportunity for novel interventions against lung cancer. Clin Cancer Res 2012, 18(14):3868-3879.

9. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT et al: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 2000, 25(1):25-29.

What automatic processing routines exist which attempt to solve the problem being addressed? Why can't they be used instead of humans? *

In order to avoid wasting the time of volunteers, we only support projects that require human classification. Please include references where possible

Many computational approaches for identifying concepts in text exist, but none of them provides accuracy that is comparable to manual annotation on the problems being addressed in this project. The performance of concept recognition algorithms varies substantially based on the types of concepts sought. Performance is typically measured based on Precision (true positives / (false positives + true positives)), Recall (true positives / (true positives + false negatives)), and summarized as the ‘F measure’ (the harmonic mean of Precision and Recall). Specifically, we are interested in identifying occurrences of diseases, genes, chemicals of interest, and biological processes. A recent study identified the best performing of three modern tools for concept recognition across a variety of concepts [10]. They found the best performing tool and parameter combination for recognizing genes (proteins) produced an F score of 0.57, for chemical entities an F score of 0.56 and for biological processes an F score of 0.42. An advanced system specifically optimized for disease recognition recently reported an F measure of 0.81 [11]. For every case, humans can significantly outperform existing methods. And as described previously, the breadth and accuracy of the network strongly influence how useful it is to research scientists.
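For readers who prefer code to formulas, the three measures defined above can be written as a short Python sketch. The function names and the example counts are made up for illustration and are not taken from the cited studies.

```python
def precision(true_pos, false_pos):
    """Fraction of predicted annotations that are correct."""
    total = true_pos + false_pos
    return true_pos / total if total else 0.0


def recall(true_pos, false_neg):
    """Fraction of gold-standard annotations that were found."""
    total = true_pos + false_neg
    return true_pos / total if total else 0.0


def f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Illustrative counts: 8 correct annotations, 2 spurious ones, 8 missed ones.
p = precision(8, 2)   # 0.8
r = recall(8, 8)      # 0.5
print(round(f_measure(p, r), 3))  # 0.615
```

Because the harmonic mean is pulled toward the lower of the two values, a tool cannot reach a high F measure by maximizing precision or recall alone.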

10. Funk C, Baumgartner W, Jr., Garcia B, Roeder C, Bada M, Cohen KB, Hunter LE, Verspoor K: Large-scale biomedical concept recognition: an evaluation of current automatic annotators and their parameters. BMC Bioinformatics 2014, 15:59.

11. Leaman R, Islamaj Dogan R, Lu Z: DNorm: disease name normalization with pairwise learning to rank. Bioinformatics 2013, 29(22):2909-2917.

If possible, estimate the minimum number of times a task must be performed on a given element of data to be useful for science (assuming all tasks are performed by competent citizen scientists; once might be enough for exceptionally clear tasks, more times could be required for fuzzier tasks or lots may be necessary if accurate estimates of uncertainties are needed). How many total tasks must be completed before your research goals are achievable?

This is difficult but any estimate helps.

Based on preliminary data, we estimate the minimum number of times an individual task must be performed to produce useful results at 5, though additional iterations would improve quality. These estimates are based on studies that we recently conducted using Amazon’s Mechanical Turk crowdsourcing system. Our results indicate that non-specialist, minimally paid workers in this marketplace can successfully identify occurrences of diseases in PubMed abstracts. (We have not yet tested other entity types.) Using a simple aggregation strategy based on unweighted voting, we found that these workers could reproduce a gold standard disease mention corpus [12] with an F measure of 0.86. We found that increasing the number of workers per document continuously increased the quality of the output but that quality increased only minimally after 15 workers per document. Using just 5 workers per document we achieved an F score of 0.82 on the same corpus.
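A minimal Python sketch of what unweighted voting over worker annotations might look like follows. It assumes each annotation is a (start, end) character span and that a span is kept when at least a chosen number of workers marked it. The span representation, the function name, and the threshold of 3 out of 5 are illustrative assumptions, not the exact procedure used in our experiments.

```python
from collections import Counter


def aggregate_by_vote(worker_annotations, min_votes):
    """Keep every annotation marked by at least `min_votes` workers.

    `worker_annotations` is a list with one collection of annotations per
    worker; each worker counts at most once per annotation (unweighted).
    """
    votes = Counter()
    for annotations in worker_annotations:
        votes.update(set(annotations))
    return {span for span, count in votes.items() if count >= min_votes}


# Five hypothetical workers marking disease mentions as (start, end) spans.
workers = [
    {(10, 18), (40, 52)},
    {(10, 18)},
    {(10, 18), (40, 52), (70, 75)},
    {(40, 52)},
    {(10, 18)},
]
consensus = aggregate_by_vote(workers, min_votes=3)
print(sorted(consensus))  # [(10, 18), (40, 52)]
```

Raising `min_votes` trades recall for precision, which is one reason adding workers per document can keep improving quality up to a point.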

It is possible that the Zooniverse infrastructure would reduce the number of completions per task required for high quality. We anticipate that more sophisticated aggregation algorithms that take into account information about individual worker quality could improve performance and that a more refined user interface and instruction set could also boost scores. Further, we expect to attract more dedicated, high quality contributors from the citizen science community than from the Mechanical Turk platform. It is also worth noting here that, despite the financial incentives that drive the Mechanical Turk system, many of the workers expressed a strong attachment to the project that was clearly highly motivational. In fact, some workers were asking if they could continue to complete these tasks outside of the Mechanical Turk context simply because they wanted to contribute to our efforts.

While there is no fixed threshold for the number of documents above which we could claim a complete reconstruction of the network of knowledge surrounding Chordoma, we estimate that 10,000 would provide an effective start, 50,000 would provide good coverage, and 100,000 would cover the domain in reasonable depth. With more than 23 million articles already indexed in PubMed and 20,000 new articles arriving every week, there is an effectively unlimited range of potential work that could be performed based on this concept recognition and normalization model.

12. Dogan RI, Leaman R, Lu Z: NCBI disease corpus: a resource for disease name recognition and concept normalization. J Biomed Inform 2014, 47:1-10.

Who will make use of the results? Is any further funding necessary?

Researchers from the natural language processing community would use this data to train and to test their computational methods. Bioinformatics scientists would also use this data to refine computational methods for identifying candidate drug targets and suggesting opportune existing drugs for repositioning. The Chordoma research community, along with biomedical researchers in related domains, would make use of the concept network identified through this work via interactive software. Given the data, generic network visualization tools such as Cytoscape [13] could be immediately applied. Ideally Web-based applications specifically devoted to browsing and querying this network would also be delivered to this community. Additional funding would be useful in delivering such focused tools, but given the data, research groups such as ours would likely be able to use other funding sources to produce the required end-user applications. We also note that we run a reasonably well-funded bioinformatics lab, so we can devote our own time and effort toward the success of this collaboration based on funding for related projects.

13. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T: Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 2003, 13(11):2498-2504.

All data from Zooniverse projects must be eventually made public. What final format (catalogue, annotated image, query tool) would be needed? What are the anticipated final outcomes (e.g., papers, catalogues)? Are the results likely to be of interest to researchers beyond your own field?

The raw and refined results from this project would be delivered as bulk data exports in a format suitable for use by computational scientists (NLP and bioinformatics) and through a tool (or tools) that allowed biomedical researchers to interact with the extracted concept network without the need for programming skills. Aside from these, we would expect to publish research articles about the process of composing this knowledge network in collaboration with citizen scientists.

We anticipate that the results of this project would be of broad interest to all communities that must process large amounts of unstructured text. Essentially identical processes might be applied to tasks in widely varying fields including both other sciences and the humanities. One additional aspect to consider regarding this project is that every document processed is already annotated with its date of publication. As a result, it would be possible to develop views that exposed the evolution of the concept network over time. This historical perspective might prove to be of interest to a variety of communities - especially those interested in epistemology and the history of science.

Are there potential extensions to the project that you have in mind?

We envision extensions to this project in terms of

1) expanding the number of different kinds of concepts identified

2) expanding to annotate different document sets targeted at different diseases

3) adding the ability for volunteers to specify relationships between the entities that they identify

4) providing volunteers with increasingly powerful computational tools for pre-processing the texts and for verifying the final annotations

The primary goal of each of our projects is to enable research, but they have significant educational impact as well. Engaging the community is an excellent way of ensuring they remain committed to producing results for you. Are there members of your team willing to write blog posts, join forum discussions on scientific topics or otherwise take part in outreach? Does the project tie in with any public engagement or education activities you are already involved with? *

Some form of continuous engagement is prerequisite for a successful project

As we hope is evident on our group’s blog, http://sulab.org/blog/, and our Twitter streams (@bgood, @andrewsu), we are avidly working on scientific outreach on a daily basis. In fact, Ginger Tsueng, one of our project team members, has recently been hired explicitly to manage community outreach for our research group. Ginger and all other members of our team would plan to actively engage with the community by all means at our disposal. This project follows directly in line with several ongoing community intelligence efforts run by our group including the Gene Wiki [14], BioGPS [15], and http://genegames.org. Our preliminary work on the annotation problem has been operated under the moniker Mark2Cure at http://mark2cure.org.

14. Good BM, Clarke EL, de Alfaro L, Su AI: The Gene Wiki in 2011: community intelligence applied to human gene annotation. Nucleic Acids Res 2012, 40(Database issue):D1255-1261.

15. Wu C, Orozco C, Boyer J, Leglise M, Goodale J, Batalov S, Hodge CL, Haase J, Janes J, Huss JW, 3rd et al: BioGPS: an extensible and customizable portal for querying and organizing gene annotation resources. Genome Biol 2009, 10(11):R130.

The Cure at Salk Cancer Day Symposium

2014-04-21T08:42:00.000-07:00

Karthik G. and I will be presenting a poster tomorrow at the Salk Institute's Cancer Day Symposium. We will be presenting data from a year with the scientific discovery game The Cure. You can read more about those results on the arXiv.

If you are coming, please stop by for a chat! We would especially love the chance to discuss the new, collaborative decision tree-building interface that Karthik has created. Who knows if the conference wifi will work, so please try it now!

The Cure: Making a game of gene selection for breast cancer survival prediction from goodb

Microtask crowdsourcing and biocuration

2014-04-10T13:25:00.002-07:00

Thanks to the hard work of my coauthors @x0xMaximus and @andrewsu, I was able to nab the award for the best presentation at The Seventh International Biocuration Conference from the International Society of Biocuration. The slides for the presentation and the poster are available from slideshare.

Yay team!

I think the presentation garnered the interest it did because many of the people in the audience had heard the term "crowdsourcing" before, but had never seen a real example of a specific application - let alone one in science. I was surprised by the number of people I spoke to who had no idea what the Amazon Mechanical Turk was - never mind that it might be applicable to some of the problems they were working on. We had a decent result to talk about, but much more importantly, we taught the audience about a powerful new tool that they might be able to use in their own work.

For those who do want to try scientific applications of microtask crowdsourcing, I'd like to emphasize that it's probably not going to be an easy process. The result we presented was from the third iteration of our system and represents several months of developer time. While resources are emerging that should make it much faster to get started (e.g. [1-4]), expect to engage in an iterative cycle to get your system dialed in!

If you do want to give crowdsourcing a try for biocuration or other scientific objectives, (1) we would love to hear about it! and (2) it might be worth a quick look at our review of the domain [5]. Microtask systems such as the one we worked with here are just one of many ways that scientific challenges can be opened up to much broader communities.

References

1. Our code: mark2cure
2. Soltilab mention tagger for crowdflower
3. GATE crowdsourcing plugin
4. Crowd Watson from IBM
5. Good, Benjamin M., and Andrew I. Su. "Crowdsourcing for bioinformatics." Bioinformatics 29.16 (2013): 1925-1933.

NIH Grant proposal for sale!

2014-03-21T16:24:00.000-07:00

Last November, Andrew and I submitted an R21 proposal for consideration by the NIH. Today, we received the summary statement. Since a lot of work went into writing it, I feel compelled to share it regardless of whether it's ever funded (which currently seems like a longshot, but you never know). Perhaps someone will find a useful idea in there and the world will somehow be better for the work that we did. The summary statement is at the bottom - maybe that will also be useful to other folks thinking about begging for a living. The proposal involves ideas for extensions to the games and tools that, regardless of the lack of funding specifically for them, are slowly appearing at http://genegames.org/cure/ (thanks in part to the Google Summer of Code).

(Note that the following is a resubmission. The introduction section is a response to the previous critiques and scores listed there in the table. The scores for this proposal are on the bottom of this post.)

Crowdsourcing Genomic Predictors of Disease Progression Using Serious Games

INTRODUCTION
This proposal sits at the interface between breast cancer, scientific knowledge, genomic data and community coordination. We hypothesize that data-driven attempts to make predictions of breast cancer prognosis can benefit from prior knowledge, and that current approaches for capturing knowledge from unstructured sources are inadequate. We suggest that, if properly coordinated, a motivated community could help address this challenge. In order to provide incentives and organization, we propose to create a “serious game”, also known as a “game with a purpose”. This game would serve as a focal point for community action oriented around understanding and predicting breast cancer prognosis, but could easily be generalized to other complex phenotypes. The game would attract the attention and focus the efforts of participants ranging from expert cancer biologists to students just learning about the field.

Table 1 (9-point rating scale; 1 = exceptional, 9 = poor)

This resubmission is a substantial rewrite of our original proposal. We made changes based on additional preliminary data and the critiques offered by panel members. As summarized in Table 1, Reviewer 2 was the most critical reviewer, noting that our prototype game “lacked a playability factor”. This concern was cited as a primary weakness for all evaluation criteria except Environment. The three other reviewers echoed this concern, as Approach was consistently judged to be the weakest area. The feedback from all four reviewers can be summarized as a need to significantly improve the game mechanics. They note, and we agree, that the game must be capable of attracting and holding the attention of a large audience. Therefore, we have made the following changes to our proposal.

We have incorporated substantial new preliminary data. Despite the shortcomings of our current prototype, our preliminary data demonstrate that the proposed concept is fundamentally sound. In the twelve months since our original proposal was submitted, over 1,200 people independently discovered and collectively played 10,500 rounds of our prototype game. New players continue to register every week. This player population provides a key new resource, not available at the time the original proposal was submitted, for iteratively refining new game designs. We have adapted the proposal to clarify plans to apply a user-centered design strategy consisting of repeated cycles of evaluation with new and existing players followed by adaptations and further testing.
Our plans for our full-length game are better described. Our prototype game was designed for a relatively narrow group of players with both substantial biological knowledge and a desire to play a casual game. Based on preliminary data and interviews with players, we altered our proposal to place a greater emphasis on stratifying the challenges in the game to better suit players with different degrees of expertise. In the current proposal, we added a significant new focus on providing a range of game levels to better meet the educational needs of beginners (see Specific Aim #1) and the tools to explore data desired by the most advanced player-scientists (see Specific Aim #2). We expect these changes to make substantial improvements in both the “fun factor” that the initial review perceived to be lacking and the value of the data collected from the more knowledgeable players.
Our proposed budget now includes specific funds dedicated to consultants in game design (see Key Personnel).
We have assembled an extensive network of colleagues with interest and expertise in scientific game development. Within this network, we have established agreements to work towards cross-pollination of our different player communities and to provide each other with invaluable discussions during early stages of development. (See letters of support from Stegman, Waldispuhl, MacLean, Himmelstein, and Khatib).

Aside from comments related to gamification, Reviewer 1 commented on a lack of clinical and translational expertise on our team, which we have addressed by recruiting additional support from colleagues at TSRI (See letters of support from Leyland-Jones, Salomon and Schork). The only additional comment received was encouragement from Reviewer 4 who said: “Definitely resubmit if this version does not receive funding.”

SPECIFIC AIMS
Breast cancer is the most common cancer in women. Molecular signatures for predicting prognosis and drug response could greatly improve the quality of care. Computational analyses of full genome expression datasets have indeed identified such signatures. However, these signatures leave much to be desired in terms of their accuracy, reproducibility in validation studies and biological interpretability. Following similar trends in society, leaders of the research community have recently used crowdsourcing to focus the attention of many new data scientists on this problem through open competitions such as the Sage DREAM7 prediction challenge. While this very young approach has already yielded innovations, it has so far only been used to expand the search for and organize the work of data mining specialists. What is not known is how to expand the reach of crowdsourcing approaches aimed at identifying molecular signatures beyond data scientists to include other members of the scientific community and even of the general public. How can we recruit and organize people who can directly process the unstructured knowledge constantly accumulating in the literature to compose their own novel theories? How do we coordinate the efforts of experts, recruit and train students, and bring the minds of immunologists, developmental biologists, ecologists, economists, engineers, and interested citizen scientists to bear on this crucial problem?

Our long-term goal is to identify a collection of re-usable design patterns that leverage human knowledge and reasoning at the scale of the Web to improve the process of identifying molecular patterns associated with complex biological phenotypes. The overall objective of this proposal is to generate a better predictor of breast cancer prognosis. Our central hypothesis is that a scientific discovery game can capture knowledge and human reasoning that can be combined with existing machine learning methods to produce more effective predictors. We arrived at this hypothesis based on (1) recent successes in scientific crowdsourcing such as the DREAM challenges, (2) impressive results from similar games with a purpose such as Foldit, Fraxinus, and Phylo and (3) accumulating evidence of the value of prior knowledge in the discovery of complex predictive patterns in cancer. Further, we have already succeeded in attracting more than 1,200 players - hundreds of whom had postgraduate degrees - to play a simple prototype game (see Preliminary Data). When completed, the proposed expansions and improvements to this discovery-oriented game will allow us to collect an unprecedented database of manually-generated, hypothetical connections between molecular and clinical variables and breast cancer prognosis. This will offer the potential to create better predictors by providing machine learning methods with information not otherwise accessible. Perhaps equally important, this approach stands to greatly increase public engagement in and understanding of the challenges of modern “big data” biomedical science. We will achieve these goals through the following specific aims.

Aim #1: Attract large numbers of people with wide-ranging backgrounds to learn about and to join in the process of identifying signatures of breast cancer prognosis.
Working Hypothesis: A compelling, web-based game will incentivize, educate and focus the efforts of many citizens, scientists and citizen scientists.

Aim #2: Capture a large volume of structured expert knowledge linking genes and clinical variables with breast cancer prognosis.
Working Hypothesis: Within the population that is attracted to a scientific discovery game, we will identify a sub-population of players that are either knowledgeable (e.g. cancer researchers) or are intelligent and dedicated enough to become knowledgeable (e.g. patient advocates). We can identify such expertise based on actions taken in the game and provide these special players with access to expert-level tools that will allow them to compose, test and share the hypotheses that we seek to collect.

This work will produce and validate a new process for organizing large communities of volunteer knowledge workers. Using this framework, which alone will be valuable as a reusable methodology, we expect to generate novel prognostic signatures with both good predictive performance and greater biological relevance than those that currently exist. These signatures will stand to improve the state of the art in breast cancer prognosis and thereby improve treatment efficacy. In addition, the framework can be re-used to develop predictive signatures of drug response and other complex phenotypes.

RESEARCH STRATEGY
Significance. Many studies attempt to use genomic information to predict progression and treatment response for cancers and other complex diseases. Such predictors are of interest because, if sufficiently accurate, they could be used to personalize therapy and to provide insight into the molecular underpinnings of disease. Despite extended and intense research in a variety of areas, there are few clinically useful genomic predictors. Of the few that exist, the Oncotype DX® predictor for breast cancer prognosis is among the most widely used [1]. However, its effective application is limited to ER-positive, lymph node-negative tumors, and research into the development of more accurate prognostic predictors across all the subtypes remains highly active [2]. As a case in point, in the summer of 2012, ten years after the first major attempts to produce genomic predictors of breast cancer prognosis [3], Sage Bionetworks launched a large-scale public contest to spur research in this area because suitably accurate predictors still had not been found [4]. Though there has unquestionably been progress in the past decade, it has been incremental at best.

We suggest that a fundamentally new methodology is needed to make significant strides on this difficult problem. In this proposal, we introduce a new approach that taps into the massive reservoir of biological knowledge currently trapped in unstructured text and in the minds of scientists. Since 2000, more than 160,000 publications related to breast cancer have been added to PubMed (http://tinyurl.com/brsince2000). Our approach provides a new mechanism for marshaling this knowledge for the purpose of building better predictors. If successful, it will produce a new collection of more accurate, more interpretable predictors of breast cancer progression. These findings would be significant for the following reasons.

More accurate prognoses can be used to more effectively personalize treatment.
More interpretable predictors improve approval chances for clinical tests and inspire further research.
Many similarly structured problems could be addressed using the proposed approach.

Innovation. While many variations exist, the standard paradigm for translating high throughput experimental data (e.g. whole genome RNA expression profiles) into predictors of disease progression follows this basic pattern: (1) assemble a discovery/training dataset, (2) rank attributes according to some univariate statistic, (3) filter all but the top N attributes arbitrarily, (4) select a classification algorithm, (5) evaluate performance in cross-validation experiments and on external test datasets. Emphasis is placed on single-dataset analysis and pre-existing biological knowledge is only considered post hoc. The predictors generated with this approach consistently have problems in secondary and tertiary validation studies and in the stability of the genes selected using different training datasets [5]. Recently, methods driven by structured prior knowledge in the form of protein-protein interaction networks [6, 7], pathway databases [8, 9] and information gathered from pan-cancer datasets [10, 11] have been introduced. These methods guide the search for predictive gene sets towards cohesive groups related to each other and to the predicted phenotype through biological mechanism. In doing so, they have improved the stability of the gene selection process and the biological relevance of the identified signatures. These techniques hint at the potential of strategies that marry a top-down approach based on established knowledge with a bottom-up approach based directly on experimental data, but they have not yet produced substantially greater accuracy than other approaches. We contend that this is due in part to the lack of relevant structured knowledge to compute with. The proposed research seeks to provide a new mechanism to rapidly and inexpensively capture targeted biological knowledge that can be used directly to improve the inference of genomic predictors.
This innovative approach opens up access to knowledge not currently represented in any structured database and offers a high-throughput mechanism to apply human reasoning to the predictor inference challenge.
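
As a rough illustration of steps (2) and (3) of the standard paradigm described above, the following sketch ranks features by a univariate statistic and keeps the top N. It is not code from the proposal: the toy expression values, the gene names, the helper names, and the choice of "absolute difference between class means" as the statistic are all assumptions made for the example. Real pipelines typically use a t-statistic or a similar measure.

```python
from statistics import mean

def rank_features(samples, labels):
    """Step 2: rank feature names by a univariate statistic.

    Assumption: the statistic is the absolute difference between the
    mean value in class 1 and the mean value in class 0.
    """
    scores = {}
    for feature in samples[0]:
        pos = [s[feature] for s, y in zip(samples, labels) if y == 1]
        neg = [s[feature] for s, y in zip(samples, labels) if y == 0]
        scores[feature] = abs(mean(pos) - mean(neg))
    # Highest-scoring features first.
    return sorted(scores, key=scores.get, reverse=True)

def top_n(ranked, n):
    """Step 3: keep only the top N attributes (N is chosen arbitrarily)."""
    return ranked[:n]

# Toy data: four samples, three hypothetical genes, binary outcome labels.
samples = [
    {"GENE_A": 5.0, "GENE_B": 1.0, "GENE_C": 2.0},
    {"GENE_A": 4.8, "GENE_B": 1.2, "GENE_C": 2.5},
    {"GENE_A": 1.0, "GENE_B": 1.1, "GENE_C": 2.2},
    {"GENE_A": 1.3, "GENE_B": 0.9, "GENE_C": 3.0},
]
labels = [1, 1, 0, 0]

ranked = rank_features(samples, labels)
selected = top_n(ranked, 2)
```

Note that nothing in this procedure consults prior biological knowledge, which is the gap the proposed approach aims to fill.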

Preliminary Data. In Sept. 2012, we released a simple proof-of-concept game called ‘The Cure’ (http://genegames.org/cure/). In this game, 1,250 genes are randomly distributed (twice) into 100 game boards, each with 25 genes. On each board, the player competes with a computer opponent to select the highest scoring set of 5 genes (Fig. 1). Each player’s score is determined by using labeled training data to infer and test decision tree classifiers that predict 10-year survival using expression data from just the selected genes. The better the gene set performs in generating predictive decision trees, the higher the score. When the player defeats their opponent, they move on to play another board and multiple players play each board. Information from the Gene Ontology, RefSeq, Entrez Gene and PubMed is provided through the game interface (see black tabs in Fig. 1) to aid players in selecting their genes. Players are also free to make use of external knowledge sources.
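
The board layout described above (1,250 genes, each dealt twice, giving 100 boards of 25 genes) can be sketched as follows. This is a minimal reconstruction under assumptions, not the game's actual code: the function name, the placeholder gene identifiers, and the "shuffle, then cut into consecutive chunks" strategy are all hypothetical.

```python
import random

def deal_boards(genes, board_size=25, passes=2, seed=None):
    """Place every gene on `passes` boards by shuffling and chunking.

    Each pass shuffles the full gene list and cuts it into consecutive
    boards, so a gene never appears twice on the same board.
    """
    if len(genes) % board_size != 0:
        raise ValueError("gene count must be a multiple of board_size")
    rng = random.Random(seed)
    boards = []
    for _ in range(passes):
        shuffled = list(genes)
        rng.shuffle(shuffled)
        for start in range(0, len(shuffled), board_size):
            boards.append(shuffled[start:start + board_size])
    return boards

# Placeholder identifiers standing in for the 1,250 genes used in the game.
genes = [f"GENE{i}" for i in range(1250)]
boards = deal_boards(genes, seed=0)
```

With these numbers, each pass yields 50 boards, and two passes yield the 100 boards mentioned in the text.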

Figure 1. Prototype game. The player and the computer alternate in choosing genes from the board and adding them to their hand (bottom row and top row, respectively) until each player has selected five genes. The tabbed display at right provides hyperlinked information from the Gene Ontology and NCBI Gene RIFs and shows decision trees formed from the players’ selected genes.

Figure 2. Top 10 genes and enriched disease terms linked to top 82 genes derived from game play data.

Figure 3. 10-year survival accuracy. SVMs trained using prior published gene sets and the game-derived 82-gene set. X-axis: Griffith 2013 train/test set [13]. Y-axis: Oslo validation set with Metabric training set [4].

Between Sept. 7, 2012 and Oct. 28, 2013, 1,227 players registered and collectively played 10,549 games. 35% of the players reported completion of a post-graduate degree and 33% indicated expertise in cancer biology. Beyond initial announcements in Sept. 2012, we did not promote the game in any way, yet new player registrations have continued with the most per month (192) received in May, 2013.

We analyzed games from players with both a Ph.D. and knowledge of cancer and found a set of 82 genes chosen at frequencies above chance (p < 0.001). Using disease-annotation enrichment analysis [12], we found that the 82-gene set was highly enriched for genes related to cancer (Fig. 2). We also compared the performance of classifiers trained using these 82 genes versus classifiers trained using prior published predictor genes (see [3, 10, 13-15]). The game-derived gene set resulted in comparable accuracy to prior gene sets in two large breast cancer profiling data sets (Fig. 3). While this simple prototype game did not produce a statistically significantly better classifier than prior approaches, these preliminary data did demonstrate that many knowledgeable people will play games oriented around breast cancer prognosis and that valuable information can be captured from the results of their play.
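
The text above does not say which test was used to decide that a gene was chosen "above chance". One plausible reading, sketched here purely as an assumption, is a one-sided exact binomial test: if a player picks 5 of the 25 genes on a board at random, each gene has a 5/25 = 0.2 chance of being picked each time it is shown. The function names are hypothetical, and the sketch ignores the fact that the computer opponent also removes genes from the board.

```python
from math import comb

def binomial_upper_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Assumed chance rate: a random player takes 5 of the 25 genes on a board.
CHANCE = 5 / 25

def chosen_above_chance(times_chosen, times_shown, alpha=0.001):
    """True if the gene was picked more often than the assumed chance rate allows."""
    return binomial_upper_tail(times_chosen, times_shown, CHANCE) < alpha
```

For example, a gene picked in 20 of 40 appearances would pass this threshold, while one picked in 8 of 40 (exactly the expected count) would not.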

Specific Aim #1: Attract large numbers of people with wide-ranging backgrounds to learn about and to join in the process of identifying signatures of breast cancer prognosis.

Introduction. The value of any crowdsourcing initiative is directly proportionate to the number of participants. The more people that get involved, the more overall work that can be accomplished and the greater the chances of discovering exceptional individuals that can independently make important contributions. The objective of this aim is to produce a system capable of focusing the attention of thousands of people on the challenge of identifying prognostic molecular signatures for breast cancer. To attain this objective, we will test the working hypothesis that a compelling, Web-based game can incentivize, educate and direct the efforts of many citizens, scientists and citizen scientists. Our approach centers on applying a user-centered design process [16] to iteratively construct a game that appeals to a broad audience. There are two rationales for seeking a large, heterogeneous player population. 1. The more people we attract to the game overall, the more people with expert knowledge will be identified. 2. Non-expert players can contribute useful work based on their ability to read, their ability and desire to learn, and their innate ability to translate information into new hypotheses.

When the proposed studies have been completed, we anticipate having access to a very large population motivated to help in the process of understanding breast cancer and focused by the tasks defined in the game. The first expected product would be new ranked gene lists based on the knowledge and information processing abilities of this community as expressed through their actions in the game.

Justification and feasibility. Serious games or “games with a purpose” are a new form of crowdsourcing that explicitly employs fun as a primary incentive for volunteer participation [17]. These kinds of games have recently been used to help solve several complex biological problems. The most successful example is Foldit (http://fold.it), a game that has recruited more than 300,000 players and solved an impressive string of challenges in computational protein folding [18-20]. Of the Foldit players’ many achievements, perhaps the most notable is the design of a new protein folding algorithm with performance competitive with professionally created solutions [21]. Another successful game, Phylo, has attracted more than 12,000 players who have improved upon very large multiple sequence alignments [22], and recently Fraxinus has harnessed Facebook users to improve genome assembly [23]. Foldit, Fraxinus and Phylo have demonstrated that many people are interested in playing biological games and that these games can result in tangible contributions to research.

Research Design. The game that is the subject of this specific aim will provide an entertaining and educational experience that allows players to interact directly with genomic datasets related to breast cancer. Building on what was learned from the prototype, it will add the concept of levels and will introduce a number of new game mechanics to incentivize play. Before detailing the specific phases of planned work, we introduce some of the core concepts of the new design.

Levels. By stratifying the game into different levels of difficulty we will provide all players with games that are rewarding and fun. As players move up to higher levels, they will earn access to greater numbers of features (Fig. 4). In the first training stage, players will make simple choices, answering questions like “which factor is more useful for predicting prognosis, ethnicity or the number of lymph nodes infiltrated by cancer cells?”. As they answer questions correctly, they will move up to new levels with greater complexity. The prototype game depicted in Fig. 1 shows what a middle level might look like. In that case, players choose groups of 5 predictive genes from a set of 25. In the highest levels, players will have full access to all available clinical and genomic features. Through this progression we will seek to keep players in a positive cognitive state known as “flow” in which they experience a “feeling of complete and energized focus in an activity” [24] (Fig. 4).

Figure 4. Designing game levels to keep players in state of flow.

Incentives (“gamification”). In addition to the strong underlying motivations to contribute to cancer research and to learn, the game will provide a variety of other compelling incentives for participation. Players will be able to earn status based on their position on leaderboards, feel a sense of accomplishment and discovery as they advance through levels, and bond with other members of the player community via discussion boards and chat functions. As in the prototype game, individual challenges will involve competitions with computerized opponents, but these competitions will also be enabled to occur directly between players.

Aggregation. Each of the choices made by players in these games can be considered a ‘vote’ for the selected feature (e.g. gene). While some votes will be random, their aggregation will reduce such noise and will reflect consensus knowledge in a unique, computable form (see Preliminary Data and [18]).
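
The vote aggregation idea can be illustrated with a short sketch. It assumes that each completed hand is simply a list of selected feature names and that every hand counts equally. The proposal does not specify a weighting scheme, so the function names and the toy hands below are hypothetical.

```python
from collections import Counter

def aggregate_votes(hands):
    """Count one vote per feature per hand across all players."""
    votes = Counter()
    for hand in hands:
        for feature in set(hand):  # ignore accidental duplicates within a hand
            votes[feature] += 1
    return votes

def rank_by_votes(votes):
    """Order features by vote count, breaking ties alphabetically."""
    return sorted(votes, key=lambda feature: (-votes[feature], feature))

# Toy hands; the gene symbols are used only as labels for the example.
hands = [
    ["AURKA", "TOP2A", "ESR1"],
    ["AURKA", "ESR1", "GAPDH"],
    ["AURKA", "TOP2A", "ACTB"],
]
votes = aggregate_votes(hands)
ranking = rank_by_votes(votes)
```

A feature picked at random by one player contributes a single vote, so it sinks in the ranking as more hands are collected, which is the noise-reduction effect described above.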

Planned activities. This work will be carried out in the following four phases.

Phase 1. Develop control questions. We will begin by working with experts in breast cancer genomics (see letters of support from Leyland-Jones and Griffith) to devise a set of questions covering the core elements of what is known in the field. These questions will be used in the early stages of the game to provide both educational material and a method for gauging player expertise.

Phase 2. Implement predictor evaluation service. The levels of the game created for this Specific Aim will focus on feature selection. In this phase, a web service will be constructed that accepts as input a set of features (e.g. the expression levels of genes) and responds with an empirical assessment of the value of that feature set. The service will incorporate multiple breast cancer datasets to improve the ability to assess predictor generalizability and will be provided via API. Initial datasets will include (at least) the 1,992 multi-subtype, multi-treatment samples from [25], the 858 untreated, ER+, LN- samples from [15] and the 1,809 mixed samples from [26]. This API will be used by the game but will also be made available to the public. The datasets used in the server will be divided into training sets used to provide players with feedback and separate test sets used for subsequent evaluations.
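
To make the shape of such a service concrete, here is a small sketch of the core request handling: a feature set comes in as JSON and an empirical score goes out as JSON. Everything here is an assumption for illustration. The proposal does not define the API, so the JSON field names, the function names, the toy data, and the use of a nearest-centroid classifier as the scoring model are all hypothetical stand-ins.

```python
import json
from statistics import mean

def centroid(rows, features):
    """Mean value of each requested feature over a group of samples."""
    return [mean(row[f] for row in rows) for f in features]

def squared_distance(row, center, features):
    return sum((row[f] - c) ** 2 for f, c in zip(features, center))

def evaluate_feature_set(features, train, holdout):
    """Fit a nearest-centroid model on `train`, report accuracy on `holdout`.

    Each sample is a (feature_values, label) pair. Nearest-centroid is a
    stand-in; any classifier could be substituted here.
    """
    by_label = {}
    for row, label in train:
        by_label.setdefault(label, []).append(row)
    centers = {lab: centroid(rows, features) for lab, rows in by_label.items()}
    correct = 0
    for row, label in holdout:
        predicted = min(
            centers, key=lambda lab: squared_distance(row, centers[lab], features)
        )
        correct += predicted == label
    return {
        "features": list(features),
        "n_evaluated": len(holdout),
        "accuracy": correct / len(holdout),
    }

def handle_request(body, train, holdout):
    """Hypothetical handler: JSON request body in, JSON response out."""
    features = json.loads(body)["features"]
    return json.dumps(evaluate_feature_set(features, train, holdout))

# Toy data with two made-up features and two outcome labels.
train = [
    ({"G1": 5.0, "G2": 1.0}, "good"),
    ({"G1": 4.5, "G2": 2.0}, "good"),
    ({"G1": 1.0, "G2": 1.5}, "poor"),
    ({"G1": 1.5, "G2": 1.2}, "poor"),
]
holdout = [({"G1": 4.8, "G2": 1.1}, "good"), ({"G1": 0.9, "G2": 1.9}, "poor")]

response = json.loads(handle_request('{"features": ["G1"]}', train, holdout))
```

Keeping the player-facing feedback data separate from the final test data, as the text describes, would correspond to calling the same function with a different `holdout` set that players never see scores for.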

Phase 3. Iterative Design. The new game will be developed in an iterative cycle of user-centered design. Game designs will be tested both in the existing player population on the Web (see Preliminary Data) and in small groups composed of local colleagues, students, and collaborating cancer experts. Small local groups will allow us to test ideas using mockups before implementing them and to maximize our understanding of individual player reactions [27]. Web-based experiments will allow us to measure progress at the community level in terms of metrics such as virality (n new players), stickiness (time spent on the game) and production (amount and quality of information collected). Critically, we will adopt an agile development strategy focused on quickly adapting game designs based on periodic (e.g. bi-weekly) in-person and community-level evaluations.

Phase 4. Promotion. When the game meets our expectations for playability and knowledge acquisition, we will use a variety of resources at our disposal to attract players. In particular, we will leverage the BioGPS website (viewed by more than 125,000 unique visitors annually) to solicit players via postings on the home page and emails to the more than 5,000 registered users [28]. In addition, leaders of several other scientific gaming initiatives have agreed to collaboratively promote all of our games (See letters of support from Waldispuhl, MacLean, Stegman, Himmelstein, Khatib). Finally, we are partnering with educational initiatives to promote the game’s use in massively open online courses (see letter of support from Taly).

Expected Outcomes. When the proposed studies for this aim have been completed, we expect to have produced a Web-based game oriented around the challenge of breast cancer prognosis that appeals to players ranging from novices to experts. This game will be both a useful educational tool and an effective mechanism to capture knowledge. Using the data captured from game play, we expect to produce a novel, knowledge-driven ranking of genes and clinical features with respect to their value for predicting breast cancer prognosis. This ranking will improve upon the ranking generated in the preliminary study based on the opportunity to collect substantially more data and on a far more refined approach for filtering the data based on player expertise made possible by the new, early stage levels.

This community-generated, ranked list will provide a useful source of prior knowledge for selecting features for classifier construction.

Potential problems and alternative strategies. A clear danger in any crowdsourcing initiative is the mentality that ‘if you build it, they will come’. If no one ends up playing this game, the value from this project will be minimal. However, based on our success in attracting more than 1,000 players to a single-level, unadvertised, hastily composed skeleton of the proposed game, we feel strongly that this event is very unlikely. If the game does not garner a sufficiently large audience (which we would assess based on the total amount of knowledge captured by the system within the first 2 months after the launch), we would shift our model by deploying the system in different contexts. Because the system will be developed for the Web, it would be easy to change the deployment from a standalone Web application into e.g. a Facebook game like Fraxinus [23] or a mobile application. We could also shift the game to focus more deeply on either the educational needs of students or on the needs of small communities of experts. Finally, the infrastructure for the game will be agnostic to the specific datasets being used and thus it will be possible to use the game for other biomedical challenges such as predicting organ transplant rejection (see letter of support from Salomon).

Specific Aim #2: Capture a large volume of structured expert knowledge linking genes and clinical variables with breast cancer prognosis

Introduction. Crowdsourcing discussions often focus on the “Long Tail” of small-scale contributors, which is the large number of contributors who add individually small (but collectively large) amounts of value. Yet the ecosystem of any successful system also contains a complementary contributor pool (the “Short Head”) that is composed of a few key community members who individually produce a large quantity of high-quality work. These individuals also motivate and guide the other contributors. The objective of this aim is to maximize the contributions from the experts in the player community. We will test the working hypothesis that experts will gain both professional value and enjoyment from the use of a tool, embedded in the sociotechnical context of a scientific discovery game, that will allow them to rapidly test their own hypotheses regarding connections between combinations of molecular features, clinical variables and breast cancer prognosis. Successful completion of this aim will (1) provide the research community with a tool that makes it easy to test complex hypotheses without the need to write programs and (2) enable the construction of a new kind of classifier that integrates knowledge spread across many hypotheses. These achievements would be important because they would increase the possibility of individual experts identifying novel signatures and because they would make it possible to create an ensemble classifier likely to outperform any independent signature. At a high level, the work proposed in this aim will reduce barriers to information flow between scientific communities by increasing the accessibility of big data for non-data scientists and increasing data scientists’ access to structured expert knowledge (see letter of support from Margolin).

Justification and feasibility. We propose to capture hypotheses structured as decision trees. For example, “if AURKA expression is high and TOP2A expression is low, then risk of recurrence is low”. This, and many more complex hypotheses, can be expressed as decision trees and tested automatically using available datasets. Decision trees provide a familiar, visual way to represent complex logical functions that may span many features and are the representation of choice for communicating complex diagnoses and treatment regimens among the medical community. They can be induced from data automatically [29], designed by experts or constructed using hybrid systems. Research has shown that involving people in the process of inferring decision trees can improve their predictive performance, decrease their size and increase their explanatory power [30-32]. The central technical product of the proposed research is envisioned to be a Web-based interactive decision tree builder.
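As a rough illustration of how such a hypothesis could be encoded and tested automatically, the sketch below writes the AURKA/TOP2A example as a small tree in Python. The `Leaf` and `Split` classes, the 0.5 cut-offs, the default "high" risk on all other paths and the toy patient records are assumptions made for illustration only; none of them are taken from the prototype.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


@dataclass
class Leaf:
    label: str  # predicted class, e.g. "low" or "high" risk of recurrence


@dataclass
class Split:
    feature: str      # feature tested at this node
    threshold: float  # hypothetical cut-off separating "low" from "high"
    low: "Node"       # subtree followed when value <= threshold
    high: "Node"      # subtree followed when value > threshold


Node = Union[Leaf, Split]
Sample = Dict[str, float]


def predict(node: Node, sample: Sample) -> str:
    """Walk from the root to a leaf and return the leaf's label."""
    while isinstance(node, Split):
        node = node.high if sample[node.feature] > node.threshold else node.low
    return node.label


def accuracy(tree: Node, labelled: List[Tuple[Sample, str]]) -> float:
    """Fraction of labelled samples that the tree classifies correctly."""
    correct = sum(predict(tree, sample) == label for sample, label in labelled)
    return correct / len(labelled)


# "If AURKA expression is high and TOP2A expression is low, then risk is low."
# Every other path predicts "high" here, which is an assumption.
hypothesis = Split(
    "AURKA", 0.5,
    low=Leaf("high"),
    high=Split("TOP2A", 0.5, low=Leaf("low"), high=Leaf("high")),
)

# Made-up expression values on a 0-1 scale with made-up outcomes.
toy_patients = [
    ({"AURKA": 0.9, "TOP2A": 0.1}, "low"),
    ({"AURKA": 0.9, "TOP2A": 0.8}, "high"),
    ({"AURKA": 0.2, "TOP2A": 0.3}, "high"),
    ({"AURKA": 0.7, "TOP2A": 0.2}, "high"),  # a case the hypothesis gets wrong
]

print(accuracy(hypothesis, toy_patients))
```

Keeping the tree as plain data, separate from the evaluation function, is one way the same hypothesis could be scored against several datasets without change.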

Interviews with several cancer biologists who interacted with the prototype game exposed a consistent desire for greater control over the process of building the trees shown in the game (see letters of support from Griffith and Morin). Motivated by these interviews, we developed a prototype Web interface for building decision tree classifiers (Fig. 5). This proof of concept established the feasibility of the proposed research in our hands and has met with an enthusiastic response from early testers.

Figure 5. Prototype Decision Tree Builder. This tool allows search, selection and placement of features as split nodes as well as real-time performance evaluations on training data.

Research Design. The goals of this specific aim will be achieved in the following 4 phases.

Phase 1. Gather tree-building requirements and evaluation criteria. During this phase, we will work directly with experts in breast cancer to manually define decision trees that reflect their current thought processes for linking clinical and molecular data to prognosis (see letters of support from Griffith and Leyland-Jones). This work will overlap with the first phase of Specific Aim #1 in that these trees will capture knowledge that we can use to train and evaluate all players. In addition, they will provide clear guidance as to the parameters that our proposed tree-building interface must be capable of representing. We will also use this phase to incorporate services for evaluating the hypotheses represented by these trees into the evaluation service described in Specific Aim #1, Phase 2. Working closely with clinical and translational experts at this stage will ensure that our evaluations are appropriate, for example that the chosen datasets accurately reflect relevant patient populations, and will set a solid foundation for the rest of the proposed research.

Phase 2. Tree-building tool development. This phase will focus on iterative design, implementation and testing of a Web-based tool for constructing and evaluating decision trees. The tool will be evaluated based on the ability of users to reproduce the expert-derived trees captured in Phase 1 and on their comprehension of the evaluations displayed by the system. The interface will allow users to search through features to use in split nodes within their trees. Individual features will include both clinical parameters (e.g. lymph node status) and molecular measurements (e.g. gene expression). In addition, meta-features that integrate signal from multiple smaller features, such as the genomic instability index [33] or the aggregate expression of pathways, will be available as independent split nodes. For example, the system would allow users to test the hypothesis that low levels of BCL2 expression are predictive of poor outcome when tumors also have a high level of genomic instability. Coupled with the evaluation server mentioned in Aim 1, Phase 2, this tool will allow all scientists to use these high-throughput datasets to test the accuracy of their own biological models, a task that would otherwise be impossible without significant bioinformatics expertise.
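To make the idea of a meta-feature concrete, the following sketch computes an aggregate pathway score and then applies the BCL2 and genomic instability hypothesis from the paragraph above. Using the mean as the aggregate, the cut-off values, the "proliferation" gene grouping and the toy tumor record are all assumptions chosen for illustration; the proposal does not specify them.

```python
from statistics import mean
from typing import Dict, Iterable, List

Sample = Dict[str, float]


def pathway_score(sample: Sample, genes: Iterable[str]) -> float:
    """Aggregate a pathway as the mean expression of its member genes.

    Using the mean is an assumption; other aggregates would work the same way.
    """
    return mean(sample[g] for g in genes)


def with_meta_features(sample: Sample, pathways: Dict[str, List[str]]) -> Sample:
    """Return a copy of the sample with one extra feature per pathway."""
    enriched = dict(sample)
    for name, genes in pathways.items():
        enriched[name] = pathway_score(sample, genes)
    return enriched


def bcl2_instability_rule(sample: Sample,
                          bcl2_cut: float = 0.4,
                          instability_cut: float = 0.6) -> str:
    """Low BCL2 predicts poor outcome only when genomic instability is high."""
    if sample["BCL2"] <= bcl2_cut and sample["genomic_instability"] > instability_cut:
        return "poor"
    return "good"


# Toy, made-up values on a 0-1 scale.
tumor = {"BCL2": 0.2, "genomic_instability": 0.8,
         "AURKA": 1.0, "TOP2A": 0.5, "MKI67": 0.0}
pathways = {"proliferation": ["AURKA", "TOP2A", "MKI67"]}  # hypothetical grouping

enriched = with_meta_features(tumor, pathways)
print(enriched["proliferation"])         # meta-feature usable as a split node
print(bcl2_instability_rule(enriched))   # outcome predicted by the hypothesis
```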

Phase 3. Gamification. The code created in Phase 2 will be used to build the upper levels of the game described in Specific Aim #1. The game will provide incentives for use of the application and a rich social context for knowledge transfer among the player community. Players will be encouraged to share the trees that they create and to explain how and why they composed them the way they did. The results produced by the evaluations (e.g. the percent correct on training data) will be used to generate scores for a leaderboard. To reduce overfitting, scores will reward both high accuracy and low complexity. Following the pattern outlined in Aim #1, level progression in the tree-building stages of the game will orient around increasing the number of variables and levels of control available to the players.
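One simple way a score could reward accuracy while discouraging complexity is sketched below. The linear penalty per split node and the default penalty of 0.02 are assumptions; the proposal states only that both accuracy and low complexity will be rewarded, not the formula.

```python
def leaderboard_score(accuracy: float, n_split_nodes: int, penalty: float = 0.02) -> float:
    """Reward accuracy and charge a fixed penalty per split node (assumed form)."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    # Clamp at zero so that very large trees never receive a negative score.
    return max(0.0, accuracy - penalty * n_split_nodes)


# A compact tree can outrank a slightly more accurate but much larger one.
small_tree = leaderboard_score(accuracy=0.80, n_split_nodes=3)
large_tree = leaderboard_score(accuracy=0.85, n_split_nodes=12)
print(small_tree > large_tree)
```

Under these assumed numbers, the three-node tree scores higher than the twelve-node tree despite its lower training accuracy, which is the behavior the overfitting safeguard is meant to produce.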

Phase 4. Aggregation. All of the hypotheses of the user community, represented as decision trees, along with detailed information about each user’s actions taken in the game will be captured. Using this data, we will compose ensemble classifiers, akin to random forests [34]. In effect, individual players will act as highly intelligent components of a larger machine learning system. At a simple level, each of the trees submitted by users of the tree-building tool can be used as an independent member of a decision forest. To make a prediction, each tree in the forest gets one vote and the class with the most votes is selected. During this phase of research, we will work to optimize such ensembles by (1) filtering out data from players with low levels of expertise and (2) tuning the system to ensure adequate quality and diversity among the trees.
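The voting scheme described above, together with the expertise filter, might look roughly like the sketch below. The `Submission` record, the 0-1 expertise estimate, the three toy trees and their cut-offs are hypothetical; how expertise would actually be measured is left open here.

```python
from collections import Counter
from typing import Callable, Dict, List, NamedTuple

Sample = Dict[str, float]


class Submission(NamedTuple):
    player: str
    expertise: float               # hypothetical 0-1 expertise estimate
    tree: Callable[[Sample], str]  # any callable mapping a sample to a class


def forest_predict(submissions: List[Submission], sample: Sample,
                   min_expertise: float = 0.0) -> str:
    """One vote per tree; the class with the most votes is selected."""
    votes = Counter(s.tree(sample) for s in submissions
                    if s.expertise >= min_expertise)
    if not votes:
        raise ValueError("no trees pass the expertise filter")
    # Ties are broken in favour of the class that was voted for first.
    return votes.most_common(1)[0][0]


# Three toy player-built trees, written as plain functions for brevity.
submissions = [
    Submission("novice_a", 0.2,
               lambda s: "low" if s["AURKA"] > 0.5 and s["TOP2A"] <= 0.5 else "high"),
    Submission("novice_b", 0.3,
               lambda s: "high" if s["BCL2"] <= 0.4 else "low"),
    Submission("expert_c", 0.9,
               lambda s: "high" if s["genomic_instability"] > 0.6 else "low"),
]

tumor = {"AURKA": 0.9, "TOP2A": 0.1, "BCL2": 0.8, "genomic_instability": 0.8}

print(forest_predict(submissions, tumor))                     # all three trees vote
print(forest_predict(submissions, tumor, min_expertise=0.5))  # expert trees only
```

In this toy case the unfiltered forest and the expertise-filtered forest disagree, which illustrates why the filtering threshold is itself something to tune during this phase.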

Expected Outcomes. Upon completion of the work proposed in this Aim, we expect to produce (1) a valuable piece of open-source, Web-based software for constructing and evaluating decision trees using biomedical data, (2) a community hub where researchers, citizen scientists and gamers can interact and explore new hypotheses relating molecular and clinical variables to breast cancer prognosis, (3) a large database of hypotheses structured as decision trees along with information about the level of expertise of their creators, and (4) ensemble classifiers that integrate the information collected from the community. The individual hypotheses and their aggregates into ensembles will provide an excellent opportunity to advance the state of the art in predicting breast cancer prognosis.

Potential problems and alternative strategies. Despite our preliminary data indicating the contrary (e.g. see letters of support from Taly, Griffith, and Morin), one potential problem may be a difficulty in attracting experts to the tree-building tool. While our main focus will be on the deployment of this tool within the game, an alternative strategy would be to refocus our effort on the development of a public, scientist-focused website like http://kmplot.com/. Following this strategy we would also allow users to keep uploaded trees and datasets private and thus reduce potential fears of “scooping”. Another problem may be that, even with the substantial human effort that could be captured by this system, the accuracy of predictors may not improve significantly. While our initial focus will be on identifying predictors that do improve on accuracy, an alternative is to re-align the game to provide greater rewards for constructing explanations of predictors that tie them closely to biological mechanism.

FUTURE DIRECTIONS.

Our high-level goal is to harness the power of the Web to help meet the challenges of the genome era. If successful, there are several natural extensions to this work that we will pursue. First, we will add the capacity for scientists to upload their own datasets and to expand the number of datatypes represented in the game framework (see letter of support from Morin). Second, we will expand the ecosystem of tasks that players can accomplish as part of these games. For example, we are already exploring mechanisms through which players can help to translate text from scientific articles into structures such as concept networks that could be used for predictor inference (e.g. see [6]). Third, we will explore other approaches from the domain of visual analytics that can be incorporated into our interface. Emerging techniques that, for example, enable users to visualize complex decision boundaries created by machine learning systems (e.g. [35, 36]) may help less experienced users make more important contributions.

TIMELINE

This proposal will require funding for two years. The work to achieve both specific aims will be done in parallel. It will begin with a period of high engagement with breast cancer experts to capture their knowledge for use in player evaluations and training in the lower levels of the game and to establish requirements for the tree-building interface. We will also use this period to develop the evaluation server that will be used to drive the games and assess performance. Next we will focus on the iterative development of games described in Aim #1 and the tree-building interface described in Aim #2. When the implementation of all of the levels of the game has reached a stable state we will focus on promotion. This work will conclude with an evaluation of all of the data collected, culminating with the assessment of the predictive quality of the ensemble predictor described in Aim #2.

BIBLIOGRAPHY & REFERENCES CITED

1. Ross JS, Hatzis C, Symmans WF, Pusztai L, Hortobagyi GN. Commercialized multigene predictors of clinical outcome for breast cancer. Oncologist. 2008;13(5):477-93.
2. Weigelt B, Pusztai L, Ashworth A, Reis-Filho JS. Challenges translating breast cancer gene signatures into the clinic. Nat Rev Clin Oncol. 2012;9(1):58-64.
3. van 't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AA, Mao M, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT, Schreiber GJ, Kerkhoven RM, Roberts C, Linsley PS, Bernards R, Friend SH. Gene expression profiling predicts clinical outcome of breast cancer. Nature. 2002;415(6871):530-6.
4. Margolin AA, Bilal E, Huang E, Norman TC, Ottestad L, Mecham BH, Sauerwine B, Kellen MR, Mangravite LM, Furia MD, Vollan HK, Rueda OM, Guinney J, Deflaux NA, Hoff B, Schildwachter X, Russnes HG, Park D, Vang VO, et al. Systematic analysis of challenge-driven improvements in molecular prognostic models for breast cancer. Sci Transl Med. 2013;5(181):181re1.
5. Xu JZ, Wong CW. Hunting for robust gene signature from cancer profiling data: sources of variability, different interpretations, and recent methodological developments. Cancer Lett. 2010;296(1):9-16.
6. Dutkowski J, Ideker T. Protein networks as logic functions in development and cancer. PLoS Comput Biol. 2011;7(9):e1002180. (PMC3182870)
7. Winter C, Kristiansen G, Kersting S, Roy J, Aust D, Knösel T, Rümmele P, Jahnke B, Hentrich V, Rückert F, Niedergethmann M, Weichert W, Bahra M, Schlitt H, Settmacher U, Friess H, Büchler M, Saeger H-D, Schroeder M, et al. Google goes cancer: improving outcome prediction for cancer patients by network-based ranking of marker genes. PLoS Comput Biol. 2012;8(5):e1002511.
8. Bild A, Yao G, Chang J, Wang Q, Potti A, Chasse D, Joshi M-B, Harpole D, Lancaster J, Berchuck A, Olson J, Marks J, Dressman H, West M, Nevins J. Oncogenic pathway signatures in human cancers as a guide to targeted therapies. Nature. 2006;439(7074):353-7.
9. Su J, Yoon B-J, Dougherty E. Accurate and reliable cancer classification based on probabilistic inference of pathway activity. PLoS One. 2009;4(12):e8161.
10. Cheng WY, Ou Yang TH, Anastassiou D. Development of a prognostic model for breast cancer survival in an open challenge environment. Sci Transl Med. 2013;5(181):181ra50.
11. Cheng WY, Ou Yang TH, Anastassiou D. Biomolecular events in cancer revealed by attractor metagenes. PLoS Comput Biol. 2013;9(2):e1002920. (PMC3581797)
12. Wang J, Duncan D, Shi Z, Zhang B. WEB-based GEne SeT AnaLysis Toolkit (WebGestalt): update 2013. Nucleic Acids Res. 2013;41(Web Server issue):W77-83. (PMC3692109)
13. Lauss M, Kriegner A, Vierlinger K, Visne I, Yildiz A, Dilaveroglu E, Noehammer C. Consensus genes of the literature to predict breast cancer recurrence. Breast Cancer Res Treat. 2008;110(2):235-44.
14. Paik S. Development and clinical utility of a 21-gene recurrence score prognostic assay in patients with early breast cancer treated with tamoxifen. Oncologist. 2007;12(6):631-5.
15. Griffith O, Pepin F, Enache O, Heiser L, Collisson E, Spellman P, Gray J. A robust prognostic signature for hormone-positive node-negative breast cancer. Genome Medicine. 2013;5(10):92.
16. Gould J, Lewis C. Designing for usability: key principles and what designers think. Commun ACM. 1985;28(3):300-11.
17. von Ahn L, Dabbish L. Designing games with a purpose. Commun ACM. 2008;51(8):58-67.
18. Good B, Su A. Crowdsourcing for bioinformatics. Bioinformatics. 2013;29(16):1925-33.
19. Cooper S, Khatib F, Treuille A, Barbero J, Lee J, Beenen M, Leaver-Fay A, Baker D, Popovic Z, Players F. Predicting protein structures with a multiplayer online game. Nature. 2010;466(7307):756-60. (PMC2956414)
20. Khatib F, DiMaio F, Cooper S, Kazmierczyk M, Gilski M, Krzywda S, Zabranska H, Pichova I, Thompson J, Popovic Z, Jaskolski M, Baker D. Crystal structure of a monomeric retroviral protease solved by protein folding game players. Nat Struct Mol Biol. 2011;18(10):1175-7.
21. Khatib F, Cooper S, Tyka MD, Xu K, Makedon I, Popovic Z, Baker D, Players F. Algorithm discovery by protein folding game players. Proceedings of the National Academy of Sciences of the United States of America. 2011;108(47):18949-53. (PMC3223433)
22. Kawrykow A, Roumanis G, Kam A, Kwak D, Leung C, Wu C, Zarour E, Sarmenta L, Blanchette M, Waldispuhl J. Phylo: a citizen science approach for improving multiple sequence alignment. PLoS One. 2012;7(3):e31362. (PMC3296692)
23. MacLean D. Changing the rules of the game. eLife. 2013;2.
24. Chen J. Flow in games (and everything else). Commun ACM. 2007;50(4):31-4.
25. Curtis C, Shah SP, Chin SF, Turashvili G, Rueda OM, Dunning MJ, Speed D, Lynch AG, Samarajiwa S, Yuan Y, Graf S, Ha G, Haffari G, Bashashati A, Russell R, McKinney S, Group M, Langerod A, Green A, et al. The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature. 2012;486(7403):346-52. (PMC3440846)
26. Gyorffy B, Lanczky A, Eklund AC, Denkert C, Budczies J, Li Q, Szallasi Z. An online survival analysis tool to rapidly assess the effect of 22,277 genes on breast cancer prognosis using microarray data of 1,809 patients. Breast Cancer Res Treat. 2010;123(3):725-31.
27. Barrington L, Turnbull D, Lanckriet G. Game-powered machine learning. Proceedings of the National Academy of Sciences. 2012;109(17):6411-6.
28. Wu C, Macleod I, Su AI. BioGPS and MyGene.info: organizing online, gene-centric information. Nucleic Acids Res. 2013;41(Database issue):D561-5. (PMC3531157)
29. Quinlan JR. Induction of Decision Trees. Machine Learning. 1986;1(1):81-106.
30. Ankerst M, Elsen C, Ester M, Kriegel H-P. Visual classification: an interactive approach to decision tree construction. Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining; San Diego, California, USA: ACM; 1999.
31. Ware M, Frank E, Holmes G, Hall M, Witten IH. Interactive machine learning: letting users build classifiers. Int J Hum-Comput Stud. 2002;56(3):281-92.
32. van den Elzen S, van Wijk JJ, editors. BaobabView: Interactive construction and analysis of decision trees. Visual Analytics Science and Technology (VAST), 2011 IEEE Conference on; 2011: IEEE.
33. Bilal E, Dutkowski J, Guinney J, Jang IS, Logsdon BA, Pandey G, Sauerwine BA, Shimoni Y, Moen Vollan HK, Mecham BH, Rueda OM, Tost J, Curtis C, Alvarez MJ, Kristensen VN, Aparicio S, Borresen-Dale AL, Caldas C, Califano A, et al. Improving breast cancer survival analysis through competition-based multidimensional modeling. PLoS Comput Biol. 2013;9(5):e1003047. (PMC3649990)
34. Breiman L. Random Forests. Machine Learning. 2001;45(1):5-32.
35. Migut M, Worring M, editors. Visual exploration of classification models for risk assessment. Visual Analytics Science and Technology (VAST), 2010 IEEE Symposium on; 2010 25-26 Oct. 2010.
36. Poulet F. Towards Effective Visual Data Mining with Cooperative Approaches. In: Simoff S, Böhlen M, Mazeika A, editors. Visual Data Mining: Springer Berlin Heidelberg; 2008. p. 389-406.

Summary Statement (from NIH review panel)

Impact Score: 40

RESUME AND SUMMARY OF DISCUSSION: The proposed studies seek to continue the development of a serious game approach for biological discovery. If successful, these studies may bring new insights into predictors of breast cancer prognosis and are therefore viewed as very significant. This submission addresses the majority of concerns raised during the previous review by including a professional game designer and increasing the numbers of players. The panel noted that the investigators are well-qualified but it was thought that a more direct involvement of breast cancer experts would strengthen the team. The environment was judged to be excellent. The panel thought that the use of serious games was a novel approach to discover signatures for breast cancer prognosis. The panel felt that the preliminary results supported the feasibility of the proposed approaches. However, it was pointed out that the prototype results were similar to classifiers developed from just using prior knowledge about cancer genes. This raised the concern that new knowledge may not come from the proposed work. The panel felt that it was unclear how large a population of breast cancer specialists would be needed as players to make the results meaningful. Overall enthusiasm was again mixed among the panel members and, by averaging scores, the proposal is expected to have moderate impact in the fields of bioinformatics and disease.

CRITIQUE 1:

Significance: 2

Investigator(s): 2

Innovation: 1

Approach: 2

Environment: 1

Overall Impact: This proposal aims to create a framework for engaging the larger scientific community to participate in “scientific discovery games” to help generate better predictors of breast cancer prognosis. It is a resubmission of a previous submission. The PIs addressed the key previous critique about the “playability” of the game by engaging a professional game designer and with a demonstration of a large number of participants in the current prototype (>1200). Aim 1 is focused on attracting people to the game and Aim 2 is focused on knowledge capture linking genes to clinical variables.

1. Significance:

Strengths

Improvements in predictors of breast cancer prognosis could certainly have tremendous impact in cancer treatment.

The PIs argue that existing shortcomings in genomic predictors of disease prognosis are due to the lack of a structured knowledge about the disease from which analyses can build from, a challenge that the proposal tackles with the crowdsourcing approach.

Novel methods in developing approaches to crowdsource the development of cancer predictors could have impact in many applications.

Weaknesses

The proposal has a little bit of a mixed focus; it is both an evaluation of how to make a good “game” as well as directly focused on using a good game to identify good genomic predictors of breast cancer prognosis. More clarity to the identity of the proposal would help with plans for its execution.

The proposal makes reference to other scientific gaming activities, but there should be a more direct discussion of what features work and don’t work in those other settings and what about the gaming aspects of this proposal are different, other than simply its focus on building genomic predictors of disease prognosis.

2. Investigator(s):

Strengths

The PI has led many innovative crowdsourcing approaches in biology, including the gene wiki.

The proposal will engage a professional game designer (an “expert on fun”).

Weaknesses

More direct engagement of a breast cancer expert (beyond letters of support) would help guide the team with some critical decisions in the game design.

3. Innovation:

Strengths

Crowdsourcing the generation of predictors of breast cancer prognosis is certainly innovative.

Weaknesses

None noted.

4. Approach:

Strengths

A key critique in the previous submission was the “playability” which was directly addressed with the engagement of a professional game designer.

More than 1200 people have played 10,500 rounds of the prototype game, supportive of the feasibility and quality of the approach.

Strategies to advertise are excellent, with posting to the BioGPS site, engagement of other scientific gaming initiatives, and coupling efforts with MOOCs.

Weaknesses

The crowdsourcing value is a function of the size and the diversity of the group. The proposal does not focus on the “quality” of the participants directly. While there is mention of the “short head” of valuable contributors, this seems like such an important element of the game design and what could be learned that it should be a more direct focus of the work. Perhaps the “experts” contribute much less to the overall knowledge than would be expected. I don’t know if the diversity of backgrounds and scientific training needs to be high or low, but the proposal should evaluate this feature directly.

The proposal discusses great preliminary data on the number of participants, but a key feature would be the number of return participants. How many are coming back? How can you incentivize the return of players?

Structuring the data as decision trees seems appropriate for the game design and a direct way to capture expert knowledge. However, there are assumptions in what is “low” or “high” that should be explicitly explored since such cut-offs can significantly influence results.

There’s no discussion on how a gaming decision is deemed “right” or not. What about a predictor that selects for 60% of the correct diagnoses versus one that is 40% accurate compared to two predictors that are 70% vs. 30% correct?

5. Environment:

Strengths

Resources and environment at TSRI are excellent for the proposed work.

Weaknesses

None noted.

CRITIQUE 2:

Significance: 4

Investigator(s): 1

Innovation: 2

Approach: 4

Environment: 1

Overall Impact: This is an interesting proposal to develop a “serious game” in which user gameplay is used to develop molecular signatures for breast cancer prognosis. The approach has not been used before with this type of bioinformatics data, and it is a somewhat different type of scientific application than the sequence and structure optimization challenges that have been the basis of successful games. Innovation is a strength. I get a much stronger sense of their vision what the game will be like for the second aim, where a concrete and appealing description of how the user will build classifiers and interact with the game is given, than in the first. Significance derives from their ability to attract users, a significant proportion of them experts, and uncertainty about the appeal of the first phase approach and its ability to attract significantly more users than it has so far is a potential concern.

1. Significance:

Strengths

Addresses a clinically significant problem – development of more accurate predictors of breast cancer prognosis from experimental data.

Weaknesses

The project is by nature pretty speculative. A few science games have made a good showing, but in less high-stakes, basic science areas such as protein structure prediction and genome assembly. Can crowdsourced science provide insights that are specific and reliable enough to be used in diagnostics with direct impact on care? The prototype work does not suggest a significant step up over other classifiers.

If the game does not successfully attract users then significance will be greatly reduced – the prototype results were similar to standard classifiers developed based on prior knowledge.

2. Investigator(s):

Strengths

PI and co-I specialize in biomedical data warehousing/data mining and in cognitive science, respectively.

Dedicated funds have been included to support a part-time consultant specializing in game design (Peay) as recommended in the previous review cycle.

Weaknesses

None stated

3. Innovation:

Strengths

Serious games are becoming more common, but this is still a fairly novel approach to development of disease classifiers based on molecular data.

Weaknesses

None stated

4. Approach:

Strengths

In general, the gameification approach is popular and I have no trouble believing they will find a base of participants. Undergraduate bioinformatics students love Foldit.

The decision tree building tool that they propose to develop in Specific Aim 2 – allowing users to build a hypothesis in the form of a decision tree classifier and then validate it (or not) using the aggregated information in the database seems like it would be a fairly useful interactive data mining tool for researchers trying to integrate large volumes of data and the kind of tool that could be useful in a non-game environment. It should incentivize experts to participate and to contribute data.

Weaknesses

Preliminary results with a limited set of users in a prototype game resulted in a classifier that performed very similarly to classifiers just using prior information about cancer genes. Isn’t a general population with access to and experience of, perhaps, college level textbooks in the life sciences, going to produce a “crowd wisdom” that is centered around exactly such prior knowledge?

Despite creating a mechanism for non-expert users to participate through a leveling system, the game still relies heavily on qualified specialist participants to generate valuable insights.

The proposal doesn’t give me a good feel for what the first phase game mechanics will be like– what comes after the prototype but before the decision tree building? That experience isn’t as well described. That step seems to be critical to engaging users initially.

5. Environment:

Strengths

They have an enthusiastic group of supporters (see letters) who are willing to participate in the development of expert-level classifiers which is key to the second specific aim.

They have identified a group of scientific game developers who are willing to cross-promote The Cure to a community of users who have already demonstrated interest in scientific games.

The host institution is well equipped to sustain the proposed project.

Weaknesses

None stated

CRITIQUE 3:

Significance: 3

Investigator(s): 3

Innovation: 4

Approach: 4

Environment: 3

Overall Impact: The authors propose to improve prediction in breast cancer based on clinical genomics data through a crowdsourcing approach of serious game playing. The approach to solving the problem is interesting and the investigators appear to be a “dream-team” for this project. My concern is that this may be only an interesting exercise on how collective mind works. Of course, if it is effective (and the gained knowledge is translatable to a method or algorithm), it would be of great importance to the community.

1. Significance:

Strengths

The problem of better prediction in breast cancer is an important problem.

Crowdsourcing is becoming an interesting tool for tricky, ill-understood problems in general.

Weaknesses

None.

2. Investigator(s):

Strengths

The investigators are best suited for this project, with background in breast cancer and designing games for crowdsourcing.

Weaknesses

None.

3. Innovation:

Strengths

The approach of using serious games to improve prediction is very interesting.

Weaknesses

None.

4. Approach:

Strengths

The preliminary data with the current implementation has already attracted many players.

The strategy and game plan appears sound as demonstrated by the authors in their earlier work.

Weaknesses

How large is the population of specialized players (breast cancer biologists) to make this game meaningful?

5. Environment:

Strengths

The host institutes are well-suited.

Weaknesses

None.

Here are the summaries of the scores for apps 1 and 2 for comparison.

Scores from original submission
(9-point rating scale; 1 = exceptional, 9 = poor)
Impact score = 55

Scores from resubmission
Impact score = 40

23andme and the FDA

2013-11-26T09:42:00.000-08:00

Apparently the FDA wants 23andme to stop selling genetic testing kits. I think this is a really bad thing.

It seems that if they could, the FDA would block access to mirrors because of the detrimental effects they might have on segments of the population that might take the data provided to them there, become unhappy, eat more cheetos and then die at a faster rate than the mirrorless...

There is no doubt that some people might look at the data provided by 23andme and related services and make poor decisions about their health, and it's a huge challenge to translate this kind of data into clear medical advice given what is known now. But (a) it's my f'ing genome, I should be able to look at it if I want to, and (b) they are an information service, not a healthcare service, and they make that distinction very carefully - they don't say "you should take this drug.." they say "you are at greater risk for ..., so you should go talk to your doctor...". Sorry, but there needs to be some accountability on the part of the consumer.

Trying to stop personal genomics companies like this from operating until every bit of information they show has run through the FDA will only improve one thing - the economies of other countries without these kinds of problems. Not to mention the fact that without data collection strategies like this, we will likely never be able to generate the data that would allow these services to get to the point of making a major positive impact on healthcare. For example, here is a proof-of-concept paper from 23andme that has been followed up by many new discoveries made possible by their service: http://www.ncbi.nlm.nih.gov/pubmed/21858135

Molecular Predictor Repository? (not gene set repository)

2013-04-27T23:30:00.000-07:00

I have a simple question. Say that I have the results from a gene expression analysis done in my laboratory or pulled from a public repository. Say the sample has something to do with cancer (or I think that it might). Say I read about so-called 'signatures' that have been found to be associated with key phenotypes related to cancer. (Here is a list of 13 signatures like this).

How do I now test to see which, if any, of these signatures are showing up in my sample?

I have my input (e.g. the Affy CEL file from my experiment). How do I get the output that indicates that my sample shows an active wound response, suggests poor outcomes in breast cancer patients, looks like lung-specific metastasis, etc.?

This should be relatively easy, no? I've got data about human gene expression, these people have made useful predictive models that take human gene expression as input. Where is the website?

Some people have directed me to useful resources like GeneSigDB that provide curated repositories of "gene signatures". However, these "signatures" are just sets of genes, they are not predictive models. If all that we needed were gene sets, no one would ever need to train a random forest classifier or a support vector machine on the data associated with those gene sets. Sets of phenotypically related genes are great, but I need the full predictive model.
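To make the distinction concrete, here is a minimal sketch in Python. The gene symbols, weights, and threshold are made up for illustration and do not come from any published signature, and the simple weighted sum is only a stand-in for whatever model (random forest, SVM, etc.) the authors of a signature actually trained.

```python
# A "signature" as stored in a gene set repository: just names, no parameters.
gene_set = {"GENE_A", "GENE_B", "GENE_C"}  # hypothetical gene symbols

# A predictive model needs more: per-gene weights and a decision threshold.
# All values here are invented for illustration.
model = {
    "weights": {"GENE_A": 0.8, "GENE_B": -0.5, "GENE_C": 1.2},
    "threshold": 1.0,
}


def predict(expression, model):
    """Return (score, call) for one sample.

    expression: dict mapping gene symbol -> normalized expression value.
    Genes missing from the sample contribute 0 to the score.
    """
    score = sum(w * expression.get(g, 0.0) for g, w in model["weights"].items())
    return score, score >= model["threshold"]


# One hypothetical sample; GENE_D is measured but not used by the model.
sample = {"GENE_A": 2.0, "GENE_B": 1.0, "GENE_C": 0.5, "GENE_D": 3.0}
score, active = predict(sample, model)
print(f"score={score:.2f} signature_active={active}")
```

With only `gene_set` in hand there is nothing to run against the sample; it is the weights and threshold (or their equivalent in a trained classifier) that turn a list of genes into an answer.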

The only system that I know of that seems to have the capacity to answer my question (had the model builders used it) is the Synapse platform. For example, if you are good at R, you should be able to use Synapse to execute any of the models submitted to the recent breast cancer prognosis challenge. This is a great step forward for the community (though it recapitulates pretty much everything from the more generic world of scientific workflow systems like Taverna).

But still.. a) comparatively very few published predictive models are in Synapse and b) should I really have to know R to answer that question?

respond!

GSoC recap for Crowdsourcing Biology team at TSRI

2012-12-05T17:55:00.000-08:00

(As presented on the Google Open Source Blog)

The Crowdsourcing Biology team at the Scripps Research Institute participated in the Google Summer of Code for the first time this year. Five students contributed to efforts to harness the power of community intelligence to advance biomedical science.

Maximilian Ludvigsson took the first steps in the creation of Semantic BioGPS. BioGPS is a user-extensible Web portal that provides easy access to information about genes from hundreds of different websites. Maximilian produced a tool that allows BioGPS users to annotate regions of gene-centric Web pages to state, computationally, what different areas of the page ‘mean’. These semantic annotations enable scripts to extract structured content about genes from these Web pages, paving the way for a new version of BioGPS that provides integrated views across multiple data sources.

Karthik G developed an interactive network visualization for the data linking genes to diseases in the GeneWiki+. The GeneWiki+ is a Semantic MediaWiki (SMW) installation that dynamically integrates data about human genes from Wikipedia and from SNPedia. While SMW queries provide a great way for programmers and advanced wiki users to interact with data, the graphical network that Karthik created gives ordinary biologists a new, intuitive, and sometimes beautiful way to explore connections between genes and disease.

Clarence Leung began the development of a new version of the crowdsourcing game Dizeez. In this new two-player game, players are challenged to get their partner to guess a particular disease by prompting them with related genes. This game follows in the tradition of ‘games with a purpose’ such as Foldit and the ESP game by producing novel, validated gene-disease associations as a result of game play.

Shivansh Srivastava worked on migrating BioGPS’s gene report layout windowing system from ExtJS to both a jQuery windowing environment and a Yahoo User Interface-based approach. This view in BioGPS provides biologists with a customizable environment for accessing gene-centric data from a diverse collection of sources. Shivansh’s efforts provided BioGPS developers with insight into the technical limitations of each solution, as compared to the current BioGPS ExtJS codebase.

Kevin Wu developed a scalable and efficient system for storing and analyzing biologically meaningful sets of genes. Accessible via a RESTful HTTP interface, the system uses MongoDB for storage and custom code for distributed computing that executes statistical comparisons across thousands of gene sets in parallel. For any particular gene set, Kevin’s code makes it possible to rapidly identify similar gene sets and to calculate the ‘enrichment’ (a statistical measure of overlap) of that gene set with respect to any other. This work will soon be integrated into BioGPS to allow users to save their own gene sets and to query for similar gene sets from others.
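One common way to compute this kind of enrichment is a hypergeometric test on the overlap between two gene sets. The sketch below illustrates that approach under stated assumptions; it is not Kevin's actual implementation, and the function name, toy gene sets, and background size are all hypothetical.

```python
from math import comb


def enrichment_p_value(query, reference, background_size):
    """Upper-tail hypergeometric p-value for the overlap of two gene sets.

    Gives the probability of seeing at least the observed overlap when
    len(query) genes are drawn at random from a background of
    background_size genes, of which len(reference) are in the reference set.
    Assumes both sets are drawn from that same background.
    """
    query, reference = set(query), set(reference)
    k = len(query & reference)  # observed overlap
    n, K, N = len(query), len(reference), background_size
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / total


# Toy example: two small sets sharing two genes, in a background of 20 genes.
p = enrichment_p_value({"A", "B", "C"}, {"B", "C", "D", "E"}, background_size=20)
print(f"p = {p:.4f}")
```

A small p-value indicates more overlap than expected by chance. A system comparing one gene set against thousands of others would also need to correct for multiple testing, which this sketch leaves out.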

Thanks to all of our excellent students for their great contributions and to Google for sponsoring this unique program. We are looking forward to participating in the GSoC for many years to come!

Return to Moscone

2012-11-06T15:43:00.001-08:00

I'm sitting in the main hall at the enormous Moscone conference center in San Francisco awaiting the first plenary at ASHG2012 and remembering the last time I was here. Back in spring of 2007, I saw Jeff Bezos and several other Web luminaries speak here about the Web 2.0 phenomenon - what it was and how they were planning to make money on it. The buzz throughout that conference was Twitter, though I admit I hadn't really noticed it before then, did not understand it, and was very skeptical that it would amount to anything. That was the meeting that really inspired me to start writing here. 5 and a half years and 214 posts later, it's clear that, because of that, it was probably one of the most significant meetings in my professional career. Who knows? Perhaps this genetics business will prove even more inspirational.