Building intelligent systems for biology |
As one step in testing this general hypothesis, on Sept. 7, 2012, we released a game called ‘The Cure’. The objective of this game is to build a better (more intelligent) predictor of breast cancer survival time based on gene expression and copy number variation information from tumor samples. We selected this particular objective to align with the SAGE Breast Cancer Prognosis challenge.
In this game, available at http://genegames.org/cure/, the player competes with a computer opponent to select the highest scoring set of five genes from a board containing 25 different genes. The boards are assembled in advance to include genes judged statistically ‘interesting’ using the METABRIC dataset provided for the SAGE Challenge.
Below is a game in progress. I’m on the bottom and my opponent, Barney, is on the top. We alternate turns selecting a card (a gene) from the board and adding it to our hand. When we each complete a 5 card hand, the round finishes and whoever has the most points wins. Scores are determined by using training data to automatically infer and test decision tree classifiers that predict survival time. The trees can use both RNA expression and CNV data for the selected genes to infer predictive rules. The better the gene set performs in generating predictive decision trees, the higher the score. When the player defeats their opponent, they move on to play another board. (Multiple players play each board.)
A game of the The Cure. Barney (the bad guy) is winning, I am looking at the CPB1 gene and, using the search feature, I have highlighted all genes that have the word cancer in any of their metadata in pink. |
Promotion, players and play
The Cure was promoted on launch day via a presentation by Andrew Su at Genome Informatics 2012, via Twitter and in several blog posts. As we first described in a post published on the Sage community site, more than 120 players registered and collectively played more than 2000 games in the first week that the game was alive - with much of this activity happening within the first few days. Nearly half of the players self-reported having PhDs and half claimed knowledge of cancer biology. Following the initial buzz, game-playing activity slowed down to what is now a slow but persistent trickle.
Games played at The Cure since launch |
Predicting breast cancer prognosis
Aside from entertainment, the point of this particular game is to assemble a predictor for breast cancer prognosis. The main hypothesis is that biological knowledge, accessible from players, can be used to help select good sets of genes to use to train predictive models using machine learning algorithms. The premise is that injecting distributed biological knowledge (which can not entirely be learned from any one training set) will help reduce overfitting by identifying the gene sets with biologically consistent associations with disease progression.
The data collected from game play includes information about the players (education, knowledge of cancer, profession) and the complete history of the genes that each player selects for each board that they play. While we are still considering methods for making use of this data (such as the Human Guided Forest), we used the following protocol to build a predictor to submit to the SAGE challenge.- Filter out games from players that indicated no knowledge of cancer biology.
- Rank each gene according to the ratio of the number of times that it was selected by different players to the number of times that it appeared in any played game.
- Select the top 20 genes according to this ranking.
- Insert this 20 gene ‘signature’ into the ‘Attractor Metagene’ algorithm that has dominated the SAGE challenge. To do this, we kept all of the code related to the use of clinical variables unchanged, but replaced the genes selected by the Attractor team with the genes selected by our game players.
Game-selected genes |
The predictor generated with this protocol scored 69% correct on survival concordance index on the Sage challenge test dataset, just 3% behind the best submitted predictor and significantly above the median of hundreds of submitted models. (You can see the ranked results on the challenge leaderboard - search for team HIVE - and, with a free registration, you can inspect the model directly within the Synapse system operated by SAGE.)
In experiments conducted within the training dataset, we were able to consistently generate decision tree predictors of 10-year survival with an accuracy of 65% in 10-fold cross-validation using only genomic data (no clinical information). This was substantially better than classifiers produced using randomly selected genes (55%). Using an exhaustive search through the top 10 genes, we found 10 different unique gene combinations that, when aggregated, produced statistically significant (FDR < 0.05) indicators of survival within: (1) the training dataset used in the game, (2) a validation cohort from the same study, and (3) an independent validation set from a completely different study.
Final Results from METABRIC round of BCC challenge |
!! Update, the mode submitted using the The Cure data (Team HIVE) scored 0.70 on the official test dataset for the METABRIC round of this competition, putting it at #43 of of 171 submitted models !!
Conclusions
These early results from The Cure show clearly that biologists with knowledge that is relevant to cancer biology will play scientific games, and that combined with even basic analytical techniques, meaningful knowledge for inferring predictors of disease progression can be captured from their play. We suggest that this might open the door to a new form of ‘crowdsourcing’ that operates with much smaller, more specific crowds than are typically considered.Data
The data collected from the game so far is available as an SQL dump in our repository. This is the entire database used to drive and track the game with the exception of personal information such as email and IP addresses.
Implementation
The code that operates The Cure is freely available on our BitBucket account. It consists of a Java server application (running in Tomcat) that handles database interaction, board generation, and integration with the WEKA machine learning library. WEKA is used to dynamically train and test decision trees (though we could easily use other models) while the game is running. The interface is almost entirely CSS and JavaScript that communicates with the server via JSON requests. We would be thrilled if some one wanted to use this code to build another classification game!
Trees
One aspect of the code-base that may be useful in a variety of different projects is the code that translates the Java objects that represent decision trees in WEKA into the Web-ready visualizations presented to the players. This is accomplished via server-side translation into a JSON structure that is rendered in the browser using code that builds on the D3 javascript visualization library.
Credits
Thanks to Max Nanis, Salvatore Loguercio, Chunlei Wu, Ian Macleod and Andrew Su for all of your help making The Cure. Thanks in particular to Max who authored 99% of everything you see when you play the game.
Barney
The opponent in The Cure came from a Wikipedia Commons image from the game "You have to Burn the Rope". Thanks for sharing!
0 comments:
Post a Comment