Karthik G. and I will be presenting a poster tomorrow at the Salk Institute's Cancer Day Symposium. We will be presenting data from a year with the scientific discovery game The Cure. You can read more about those results on the arXiv.
If you are coming, please stop by for a chat! We would especially love the chance to discuss the new, collaborative decision tree-building interface that Karthik has created. Who knows if the conference wifi will work, so please try it now!
Monday, April 21, 2014
Thursday, April 10, 2014
Thanks to the hard work of my coauthors @x0xMaximus and @andrewsu, I was able to nab the award for the best presentation at The Seventh International Biocuration Conference from the International Society of Biocuration. The slides for the presentation and the poster are available from slideshare.
I think the presentation garnered the interest it did because many of the people in the audience had heard the term "crowdsourcing" before, but had never seen a real example of a specific application - let alone one in science. I was surprised by the number of people I spoke to who had no idea what Amazon Mechanical Turk was - never mind that it might be applicable to some of the problems they were working on. We had a decent result to talk about, but much more importantly, we taught the audience about a powerful new tool that they might be able to use in their own work.
For those who do want to try scientific applications of microtask crowdsourcing, I'd like to emphasize that it's probably not going to be an easy process. The result we presented was from the third iteration of our system and represents several months of developer time. While resources are emerging that should make it much faster to get started (e.g. [1-4]), expect to engage in an iterative cycle to get your system dialed in!
If you do want to give crowdsourcing a try for biocuration or other scientific objectives, (1) we would love to hear about it, and (2) it might be worth a quick look at our review of the domain. Microtask systems such as the one we worked with here are just one of many ways that scientific challenges can be opened up to much broader communities.
Friday, March 21, 2014
(Note that the following is a resubmission. The introduction section is a response to the previous critiques and scores listed there in the table. The scores for this proposal are on the bottom of this post.)
Crowdsourcing Genomic Predictors of Disease Progression Using Serious Games
This proposal sits at the interface between breast cancer, scientific knowledge, genomic data and community coordination. We hypothesize that data-driven attempts to make predictions of breast cancer prognosis can benefit from prior knowledge, and that current approaches for capturing knowledge from unstructured sources are inadequate. We suggest that, if properly coordinated, a motivated community could help address this challenge. In order to provide incentives and organization, we propose to create a “serious game”, also known as a “game with a purpose”. This game would serve as a focal point for community action oriented around understanding and predicting breast cancer prognosis, but could easily be generalized to other complex phenotypes. The game would attract the attention and focus the efforts of participants ranging from expert cancer biologists to students just learning about the field.
|Table 1 (9-point rating scale (1 = exceptional; 9 = poor) )|
additional preliminary data and the critiques offered by panel members. As summarized in Table 1, Reviewer 2 was the most critical reviewer, noting that our prototype game “lacked a playability factor”. This concern was cited as a primary weakness for all evaluation criteria except Environment. The three other reviewers echoed this concern, as Approach was consistently judged to be the weakest area. The feedback from all four reviewers can be summarized as a need to significantly improve the game mechanics. They note, and we agree, that the game must be capable of attracting and holding the attention of a large audience. Therefore, we have made the following changes to our proposal.
- We have incorporated substantial new preliminary data. Despite the shortcomings of our current prototype, our preliminary data demonstrate that the proposed concept is fundamentally sound. In the twelve months since our original proposal was submitted, over 1,200 people independently discovered and collectively played 10,500 rounds of our prototype game. New players continue to register every week. This player population provides a key new resource, not available at the time the original proposal was submitted, for iteratively refining new game designs. We have adapted the proposal to clarify plans to apply a user-centered design strategy consisting of repeated cycles of evaluation with new and existing players followed by adaptations and further testing.
- Our plans for our full-length game are better described. Our prototype game was designed for a relatively narrow group of players with both substantial biological knowledge and a desire to play a casual game. Based on preliminary data and interviews with players, we altered our proposal to place a greater emphasis on stratifying the challenges in the game to better suit players with different degrees of expertise. In the current proposal, we added a significant new focus on providing a range of game levels to better meet the educational needs of beginners (see Specific Aim #1) and the tools to explore data desired by the most advanced player-scientists (see Specific Aim #2). We expect these changes to make substantial improvements in both the “fun factor” that the initial review perceived to be lacking and the value of the data collected from the more knowledgeable players.
- Our proposed budget now includes specific funds dedicated to consultants in game design (see Key Personnel).
- We have assembled an extensive network of colleagues with interest and expertise in scientific game development. Within this network, we have established agreements to work towards cross-pollination of our different player communities and to provide each other with invaluable discussions during early stages of development. (See letters of support from Stegman, Waldispuhl, Maclean, Himmelstein, and Khatib).
Aside from comments related to gamification, Reviewer 1 commented on a lack of clinical and translational expertise on our team, which we have addressed by recruiting additional support from colleagues at TSRI (See letters of support from Leyland-Jones, Salomon and Schork). The only additional comment received was encouragement from Reviewer 4 who said: “Definitely resubmit if this version does not receive funding.”
Breast cancer is the most common cancer in women. Molecular signatures for predicting prognosis and drug response could greatly improve the quality of care. Computational analyses of full genome expression datasets have indeed identified such signatures. However, these signatures leave much to be desired in terms of their accuracy, reproducibility in validation studies and biological interpretability. Following similar trends in society, leaders of the research community have recently used crowdsourcing to focus the attention of many new data scientists on this problem through open competitions such as the Sage DREAM7 prediction challenge. While this very young approach has already yielded innovations, it has so far only been used to expand the search for and organize the work of datamining specialists. What is not known is how to expand the reach of crowdsourcing approaches aimed at identifying molecular signatures beyond data scientists to include other members of the scientific community and even of the general public. How can we recruit and organize people who can directly process the unstructured knowledge constantly accumulating in the literature to compose their own novel theories? How do we coordinate the efforts of experts, recruit and train students, and bring the minds of immunologists, developmental biologists, ecologists, economists, engineers, and interested citizen scientists to bear on this crucial problem?
Our long-term goal is to identify a collection of re-usable design patterns that leverage human knowledge and reasoning at the scale of the Web to improve the process of identifying molecular patterns associated with complex biological phenotypes. The overall objective of this proposal is to generate a better predictor of breast cancer prognosis. Our central hypothesis is that a scientific discovery game can capture knowledge and human reasoning that can be combined with existing machine learning methods to produce more effective predictors. We arrived at this hypothesis based on (1) recent successes in scientific crowdsourcing such as the DREAM challenges, (2) impressive results from similar games with a purpose such as Foldit, Fraxinus, and Phylo and (3) accumulating evidence of the value of prior knowledge in the discovery of complex predictive patterns in cancer. Further, we have already succeeded in attracting more than 1,200 players - hundreds of whom had postgraduate degrees - to play a simple prototype game (see Preliminary Data). When completed, the proposed expansions and improvements to this discovery-oriented game will allow us to collect an unprecedented database of manually-generated, hypothetical connections between molecular and clinical variables and breast cancer prognosis. This will offer the potential to create better predictors by providing machine learning methods with information not otherwise accessible. Perhaps equally important, this approach stands to greatly increase public engagement in and understanding of the challenges of modern “big data” biomedical science. We will achieve these goals through the following specific aims.
Aim #1: Attract large numbers of people with wide-ranging backgrounds to learn about and to join in the process of identifying signatures of breast cancer prognosis.
Working Hypothesis: A compelling, web-based game will incentivize, educate and focus the efforts of many citizens, scientists and citizen scientists.
Aim #2: Capture a large volume of structured expert knowledge linking genes and clinical variables with breast cancer prognosis.
Working Hypothesis: Within the population that is attracted to a scientific discovery game, we will identify a sub-population of players that are either knowledgeable (e.g. cancer researchers) or are intelligent and dedicated enough to become knowledgeable (e.g. patient advocates). We can identify such expertise based on actions taken in the game and provide these special players with access to expert-level tools that will allow them to compose, test and share the hypotheses that we seek to collect.
This work will produce and validate a new process for organizing large communities of volunteer knowledge workers. Using this framework, which alone will be valuable as a reusable methodology, we expect to generate novel prognostic signatures with both good predictive performance and greater biological relevance than those that currently exist. These signatures will stand to improve the state of the art in breast cancer prognosis and thereby improve treatment efficacy. In addition, the framework can be re-used to develop predictive signatures of drug response and other complex phenotypes.
Significance. Many studies attempt to use genomic information to predict progression and treatment response for cancers and other complex diseases. Such predictors are of interest because, if sufficiently accurate, they could be used to personalize therapy and to shed light on the molecular underpinnings of disease. Despite extended and intense research in a variety of areas, there are few clinically useful genomic predictors. Of the few that exist, the Oncotype DX® predictor for breast cancer prognosis is among the most widely used. However, its effective application is limited to ER-positive, lymph node-negative tumors, and research into the development of more accurate prognostic predictors across all the subtypes remains highly active. As a case in point, in the summer of 2012, ten years after the first major attempts to produce genomic predictors of breast cancer prognosis, SAGE Bionetworks launched a large-scale public contest to spur research in this area because suitably accurate predictors still had not been found. Though there has unquestionably been progress in the past decade, it has been incremental at best.
We suggest that a fundamentally new methodology is needed to make significant strides on this difficult problem. In this proposal, we introduce a new approach that taps into the massive reservoir of biological knowledge currently trapped in unstructured text and in the minds of scientists. Since 2000, more than 160,000 publications related to breast cancer have been added to PubMed (http://tinyurl.com/brsince2000). Our approach provides a new mechanism for marshaling this knowledge for the purpose of building better predictors. If successful, it will produce a new collection of more accurate, more interpretable predictors of breast cancer progression. These findings would be significant for the following reasons.
- More accurate prognoses can be used to more effectively personalize treatment.
- More interpretable predictors improve approval chances for clinical tests and inspire further research.
- Many similarly structured problems could be addressed using the proposed approach.
Innovation. While many variations exist, the standard paradigm for translating high throughput experimental data (e.g. whole genome RNA expression profiles) into predictors of disease progression follows this basic pattern: (1) assemble a discovery/training dataset, (2) rank attributes according to some univariate statistic, (3) filter out all but the top N attributes, with N chosen arbitrarily, (4) select a classification algorithm, and (5) evaluate performance in cross-validation experiments and on external test datasets. Emphasis is placed on single-dataset analysis and pre-existing biological knowledge is only considered post hoc. The predictors generated with this approach consistently have problems in secondary and tertiary validation studies and in the stability of the genes selected using different training datasets. Recently, methods driven by structured prior knowledge in the form of protein-protein interaction networks [6, 7], pathway databases [8, 9] and information gathered from pan-cancer datasets [10, 11] have been introduced. These methods guide the search for predictive gene sets towards cohesive groups related to each other and to the predicted phenotype through biological mechanism. In doing so, they have improved the stability of the gene selection process and the biological relevance of the identified signatures. These techniques hint at the potential of strategies that marry a top-down approach based on established knowledge with a bottom-up approach based directly on experimental data, but they have not yet produced substantially greater accuracy than other approaches. We contend that this is due in part to the lack of relevant structured knowledge to compute with. The proposed research seeks to provide a new mechanism to rapidly and inexpensively capture targeted biological knowledge that can be used directly to improve the inference of genomic predictors.
This innovative approach opens up access to knowledge not currently represented in any structured database and offers a high-throughput mechanism to apply human reasoning to the predictor inference challenge.
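As a rough illustration of the standard five-step paradigm described above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the data are synthetic, the univariate statistic is a simple standardized difference in class means, and the classifier is a nearest-centroid rule. None of it is taken from the studies cited in this proposal.

```python
import random
from statistics import mean, pstdev


def make_dataset(n_samples=40, n_genes=50, n_informative=3, shift=2.5, seed=1):
    """Step 1: assemble a (synthetic) training set with binary outcome labels."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = i % 2
        row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
        for g in range(n_informative):
            row[g] += shift * label  # only the first few genes carry signal
        X.append(row)
        y.append(label)
    return X, y


def univariate_scores(X, y):
    """Step 2: score each gene by a standardized difference in class means."""
    scores = []
    for g in range(len(X[0])):
        a = [row[g] for row, lab in zip(X, y) if lab == 0]
        b = [row[g] for row, lab in zip(X, y) if lab == 1]
        spread = (pstdev(a) + pstdev(b)) or 1e-9
        scores.append(abs(mean(a) - mean(b)) / spread)
    return scores


def top_n(scores, n):
    """Step 3: keep only the indices of the n highest-scoring genes."""
    order = sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)
    return order[:n]


def centroid(rows, genes):
    return [mean(r[g] for r in rows) for g in genes]


def predict(row, genes, c0, c1):
    """Step 4: nearest-centroid classifier restricted to the selected genes."""
    d0 = sum((row[g] - c) ** 2 for g, c in zip(genes, c0))
    d1 = sum((row[g] - c) ** 2 for g, c in zip(genes, c1))
    return 0 if d0 <= d1 else 1


def loo_accuracy(X, y, n_top=3):
    """Step 5: leave-one-out cross-validation, reselecting genes in each fold."""
    correct = 0
    for i in range(len(X)):
        train_X, train_y = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        genes = top_n(univariate_scores(train_X, train_y), n_top)
        c0 = centroid([r for r, lab in zip(train_X, train_y) if lab == 0], genes)
        c1 = centroid([r for r, lab in zip(train_X, train_y) if lab == 1], genes)
        correct += predict(X[i], genes, c0, c1) == y[i]
    return correct / len(X)


X, y = make_dataset()
accuracy = loo_accuracy(X, y, n_top=3)
print(f"leave-one-out accuracy: {accuracy:.2f}")
```

Note that no prior biological knowledge enters anywhere in this loop: the genes are chosen purely from the one training set, which is the limitation the paragraph above points to.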
Preliminary Data. In Sept. 2012, we released a simple proof-of-concept game called ‘The Cure’ (http://genegames.org/cure/). In this game, 1,250 genes are randomly distributed (twice) into 100 game boards, each with 25 genes. On each board, the player competes with a computer opponent to select the highest scoring set of 5 genes (Fig. 1). Each player’s score is determined by using labeled training data to infer and test decision tree classifiers that predict 10-year survival using expression data from just the selected genes. The better the gene set performs in generating predictive decision trees, the higher the score. When the player defeats their opponent, they move on to play another board and multiple players play each board. Information from the Gene Ontology, RefSeq, Entrez Gene and PubMed is provided through the game interface (see black tabs in Fig. 1) to aid players in selecting their genes. Players are also free to make use of external knowledge sources.
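The board arithmetic above (1,250 genes, each dealt twice, gives 2,500 slots, i.e. 100 boards of 25) can be sketched in a few lines of Python. This is only an illustration under an assumed reading of "randomly distributed (twice)" as two independent shuffles; the function name and the placeholder gene identifiers are hypothetical, and the game's actual dealing code may differ.

```python
import random
from collections import Counter


def deal_boards(genes, board_size=25, passes=2, seed=0):
    """Shuffle the gene list `passes` times, cutting each shuffle into boards."""
    if len(genes) % board_size != 0:
        raise ValueError("gene count must be a multiple of the board size")
    rng = random.Random(seed)
    boards = []
    for _ in range(passes):
        shuffled = list(genes)
        rng.shuffle(shuffled)
        for start in range(0, len(shuffled), board_size):
            boards.append(shuffled[start:start + board_size])
    return boards


genes = [f"GENE{i:04d}" for i in range(1250)]  # placeholder identifiers
boards = deal_boards(genes)
appearances = Counter(g for board in boards for g in board)
print(len(boards), "boards;", "each gene appears", set(appearances.values()), "times")
```

Dealing each pass from a full shuffle guarantees that no board contains the same gene twice and that every gene is seen on exactly two boards.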
|Figure 2. Top 10 genes and enriched disease terms linked to top 82 genes derived from game play data.|
|Figure 3. 10yr survival accuracy. SVMs trained using prior published gene sets and game-derived 82-gene set. X-axis: Griffith 2013 train/test set; Y-axis: Oslo validation set with Metabric training set.|
Specific Aim #1: Attract large numbers of people with wide-ranging backgrounds to learn about and to join in the process of identifying signatures of breast cancer prognosis.
|Figure 4. Designing game levels to keep players in state of flow.|
- Ross JS, Hatzis C, Symmans WF, Pusztai L, Hortobagyi GN. Commercialized multigene predictors of clinical outcome for breast cancer. Oncologist. 2008;13(5):477-93.
- Weigelt B, Pusztai L, Ashworth A, Reis-Filho JS. Challenges translating breast cancer gene signatures into the clinic. Nature reviews Clinical oncology. 2012;9(1):58-64.
- van 't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AA, Mao M, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT, Schreiber GJ, Kerkhoven RM, Roberts C, Linsley PS, Bernards R, Friend SH. Gene expression profiling predicts clinical outcome of breast cancer. Nature. 2002;415(6871):530-6.
- Margolin AA, Bilal E, Huang E, Norman TC, Ottestad L, Mecham BH, Sauerwine B, Kellen MR, Mangravite LM, Furia MD, Vollan HK, Rueda OM, Guinney J, Deflaux NA, Hoff B, Schildwachter X, Russnes HG, Park D, Vang VO, et al. Systematic analysis of challenge-driven improvements in molecular prognostic models for breast cancer. Sci Transl Med. 2013;5(181):181re1.
- Xu JZ, Wong CW. Hunting for robust gene signature from cancer profiling data: sources of variability, different interpretations, and recent methodological developments. Cancer Lett. 2010;296(1):9-16.
- Dutkowski J, Ideker T. Protein networks as logic functions in development and cancer. PLoS Computational Biology. 2011;7(9):e1002180. (PMC3182870)
- Winter C, Kristiansen G, Kersting S, Roy J, Aust D, Knösel T, Rümmele P, Jahnke B, Hentrich V, Rückert F, Niedergethmann M, Weichert W, Bahra M, Schlitt H, Settmacher U, Friess H, Büchler M, Saeger H-D, Schroeder M, et al. Google goes cancer: improving outcome prediction for cancer patients by network-based ranking of marker genes. PLoS computational biology. 2012;8(5):e1002511.
- Bild A, Yao G, Chang J, Wang Q, Potti A, Chasse D, Joshi M-B, Harpole D, Lancaster J, Berchuck A, Olson J, Marks J, Dressman H, West M, Nevins J. Oncogenic pathway signatures in human cancers as a guide to targeted therapies. Nature. 2006;439(7074):353-7.
- Su J, Yoon B-J, Dougherty E. Accurate and reliable cancer classification based on probabilistic inference of pathway activity. PloS one. 2009;4(12):e8161.
- Cheng WY, Ou Yang TH, Anastassiou D. Development of a prognostic model for breast cancer survival in an open challenge environment. Sci Transl Med. 2013;5(181):181ra50.
- Cheng WY, Ou Yang TH, Anastassiou D. Biomolecular events in cancer revealed by attractor metagenes. PLoS Comput Biol. 2013;9(2):e1002920. (PMC3581797)
- Wang J, Duncan D, Shi Z, Zhang B. WEB-based GEne SeT AnaLysis Toolkit (WebGestalt): update 2013. Nucleic Acids Res. 2013;41(Web Server issue):W77-83. (PMC3692109)
- Lauss M, Kriegner A, Vierlinger K, Visne I, Yildiz A, Dilaveroglu E, Noehammer C. Consensus genes of the literature to predict breast cancer recurrence. Breast Cancer Res Treat. 2008;110(2):235-44.
- Paik S. Development and clinical utility of a 21-gene recurrence score prognostic assay in patients with early breast cancer treated with tamoxifen. Oncologist. 2007;12(6):631-5.
- Griffith O, Pepin F, Enache O, Heiser L, Collisson E, Spellman P, Gray J. A robust prognostic signature for hormone-positive node-negative breast cancer. Genome Medicine. 2013;5(10):92.
- Gould J, Lewis C. Designing for usability: key principles and what designers think. Commun ACM. 1985;28(3):300-11.
- von Ahn L, Dabbish L. Designing games with a purpose. Commun ACM. 2008;51(8):58-67.
- Good B, Su A. Crowdsourcing for bioinformatics. Bioinformatics. 2013;29(16):1925-33.
- Cooper S, Khatib F, Treuille A, Barbero J, Lee J, Beenen M, Leaver-Fay A, Baker D, Popovic Z, Players F. Predicting protein structures with a multiplayer online game. Nature. 2010;466(7307):756-60.
- Khatib F, DiMaio F, Cooper S, Kazmierczyk M, Gilski M, Krzywda S, Zabranska H, Pichova I, Thompson J, Popovic Z, Jaskolski M, Baker D. Crystal structure of a monomeric retroviral protease solved by protein folding game players. Nat Struct Mol Biol. 2011;18(10):1175-7.
- Khatib F, Cooper S, Tyka MD, Xu K, Makedon I, Popovic Z, Baker D, Players F. Algorithm discovery by protein folding game players. Proceedings of the National Academy of Sciences of the United States of America. 2011;108(47):18949-53. (PMC3223433)
- Kawrykow A, Roumanis G, Kam A, Kwak D, Leung C, Wu C, Zarour E, Sarmenta L, Blanchette M, Waldispuhl J. Phylo: a citizen science approach for improving multiple sequence alignment. PloS one. 2012;7(3):e31362. (PMC3296692)
- MacLean D. Changing the rules of the game. eLife. 2013;2.
- Chen J. Flow in games (and everything else). Commun ACM. 2007;50(4):31-4.
- Curtis C, Shah SP, Chin SF, Turashvili G, Rueda OM, Dunning MJ, Speed D, Lynch AG, Samarajiwa S, Yuan Y, Graf S, Ha G, Haffari G, Bashashati A, Russell R, McKinney S, Group M, Langerod A, Green A, et al. The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature. 2012;486(7403):346-52. (PMC3440846)
- Gyorffy B, Lanczky A, Eklund AC, Denkert C, Budczies J, Li Q, Szallasi Z. An online survival analysis tool to rapidly assess the effect of 22,277 genes on breast cancer prognosis using microarray data of 1,809 patients. Breast Cancer Res Treat. 2010;123(3):725-31.
- Barrington L, Turnbull D, Lanckriet G. Game-powered machine learning. Proceedings of the National Academy of Sciences. 2012;109(17):6411-6.
- Wu C, Macleod I, Su AI. BioGPS and MyGene.info: organizing online, gene-centric information. Nucleic Acids Res. 2013;41(Database issue):D561-5. (PMC3531157)
- Quinlan JR. Induction of Decision Trees. Machine Learning. 1986;1(1):81-106.
- Mihael A, Christian E, Martin E, Hans-Peter K. Visual classification: an interactive approach to decision tree construction. Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining; San Diego, California, USA: ACM; 1999.
- Malcolm W, Eibe F, Geoffrey H, Mark H, Ian HW. Interactive machine learning: letting users build classifiers. Int J Hum-Comput Stud. 2002;56(3):281-92.
- van den Elzen S, van Wijk JJ, editors. BaobabView: Interactive construction and analysis of decision trees. Visual Analytics Science and Technology (VAST), 2011 IEEE Conference on; 2011: IEEE.
- Bilal E, Dutkowski J, Guinney J, Jang IS, Logsdon BA, Pandey G, Sauerwine BA, Shimoni Y, Moen Vollan HK, Mecham BH, Rueda OM, Tost J, Curtis C, Alvarez MJ, Kristensen VN, Aparicio S, Borresen-Dale AL, Caldas C, Califano A, et al. Improving breast cancer survival analysis through competition-based multidimensional modeling. PLoS Comput Biol. 2013;9(5):e1003047. (PMC3649990)
- Breiman L. Random Forests. Machine Learning. 2001;45(1):5-32.
- Migut M, Worring M, editors. Visual exploration of classification models for risk assessment. Visual Analytics Science and Technology (VAST), 2010 IEEE Symposium on; 2010 25-26 Oct. 2010.
- Poulet F. Towards Effective Visual Data Mining with Cooperative Approaches. In: Simoff S, Böhlen M, Mazeika A, editors. Visual Data Mining: Springer Berlin Heidelberg; 2008. p. 389-406.
Summary Statement (from NIH review panel)
Here are the summaries of the scores for apps 1 and 2 for comparison.
|Scores from original submission|
(9-point rating scale (1 = exceptional; 9 = poor) )
Impact score = 55
|Impact score = 40|
Tuesday, November 26, 2013
Apparently the FDA wants 23andme to stop selling genetic testing kits. I think this is a really bad thing.
It seems that if they could, the FDA would block access to mirrors because of the detrimental effects they might have on segments of the population that might take the data provided to them there, become unhappy, eat more cheetos and then die at a faster rate than the mirrorless...
There is no doubt that some people might look at the data provided by 23andme and related services and make poor decisions about their health, and it's a huge challenge to translate this kind of data into clear medical advice given what is known now. But (a) it's my f'ing genome, and I should be able to look at it if I want to, and (b) they are an information service, not a healthcare service, and they make that distinction very carefully - they don't say "you should take this drug.." they say "you are at greater risk for ..., so you should go talk to your doctor...". Sorry, but there needs to be some accountability on the part of the consumer.
Trying to stop personal genomics companies like this from operating until every bit of information they show has run through the FDA will only improve one thing - the economies of other countries without these kinds of problems. Not to mention the fact that without data collection strategies like this, we will likely never be able to generate the data that would allow these services to get to the point of making a major positive impact on healthcare. For example, here is a proof-of-concept paper from 23andme that has been followed up by many new discoveries made possible by their service: http://www.ncbi.nlm.nih.gov/pubmed/21858135
Saturday, April 27, 2013
I have a simple question. Say that I have the results from a gene expression analysis done in my laboratory or pulled from a public repository. Say the sample has something to do with cancer (or I think that it might). Say I read about so-called 'signatures' that have been found to be associated with key phenotypes related to cancer. (Here is a list of 13 signatures like this).
How do I now test to see which, if any, of these signatures are showing up in my sample?
I have my input (e.g. the Affy CEL file from my experiment). How do I get the output that indicates that my sample shows an active wound response, suggests poor outcomes in breast cancer patients, looks like lung-specific metastasis, etc.?
This should be relatively easy, no? I've got data about human gene expression, these people have made useful predictive models that take human gene expression as input. Where is the website?
Some people have directed me to useful resources like GeneSigDB that provide curated repositories of "gene signatures". However, these "signatures" are just sets of genes, they are not predictive models. If all that we needed were gene sets, no one would ever need to train a random forest classifier or a support vector machine on the data associated with those gene sets. Sets of phenotypically related genes are great, but I need the full predictive model.
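To make the distinction concrete, here is a small Python sketch of what a gene set gives you versus what a predictive model gives you. The gene names, weights, intercept and threshold are all made up for illustration; this is a generic linear score, not any published signature.

```python
# What a gene-set repository gives you: names only (hypothetical symbols).
gene_set = {"GENE_A", "GENE_B", "GENE_C"}

# What a usable predictor needs in addition: fitted parameters (made-up values).
model = {
    "weights": {"GENE_A": 0.8, "GENE_B": -0.5, "GENE_C": 0.3},
    "intercept": -0.2,
    "threshold": 0.0,
}


def predict(model, expression):
    """Apply a simple linear score to one sample's expression values."""
    score = model["intercept"] + sum(
        weight * expression[gene] for gene, weight in model["weights"].items()
    )
    return "poor prognosis" if score > model["threshold"] else "good prognosis"


sample = {"GENE_A": 1.5, "GENE_B": 0.4, "GENE_C": -0.1}
print(predict(model, sample))
```

With only `gene_set` in hand there is nothing to call `predict` with; the weights and threshold are exactly the part that the gene-set repositories leave out.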
The only system that I know of that seems to have the capacity to answer my question (had the model builders used it) is the Synapse platform. For example, if you are good at R, you should be able to use Synapse to execute any of the models submitted to the recent breast cancer prognosis challenge. This is a great step forward for the community (though it recapitulates pretty much everything from the more generic world of scientific workflow systems like Taverna).
But still: (a) comparatively few published predictive models are in Synapse, and (b) should I really have to know R to answer that question?
Wednesday, December 5, 2012
Maximilian Ludvigsson took the first steps in the creation of Semantic BioGPS. BioGPS is a user-extensible Web portal that provides easy access to information about genes from hundreds of different websites. Maximilian produced a tool that allows BioGPS users to annotate regions of gene-centric Web pages to state, computationally, what different areas of the page ‘mean’. These semantic annotations enable scripts to extract structured content about genes from these Web pages, paving the way for a new version of BioGPS that provides integrated views across multiple data sources.
Karthik G developed an interactive network visualization for the data linking genes to diseases in the GeneWiki+. The GeneWiki+ is a Semantic MediaWiki (SMW) installation that dynamically integrates data about human genes from Wikipedia and from SNPedia. While SMW queries provide a great way for programmers and advanced wiki users to interact with data, the graphical network that Karthik created gives ordinary biologists a new, intuitive, and sometimes beautiful way to explore connections between genes and disease.
Clarence Leung began the development of a new version of the crowdsourcing game Dizeez. In this new two-player game, players are challenged to get their partner to guess a particular disease by prompting them with related genes. This game follows in the tradition of ‘games with a purpose’ such as Foldit and the ESP game by producing novel, validated gene-disease associations as a result of game play.
Shivansh Srivastava worked on migrating BioGPS’s gene report layout windowing system from ExtJS to both a jQuery windowing environment and a Yahoo User Interface-based approach. This view in BioGPS provides biologists with a customizable environment for accessing gene-centric data from a diverse collection of sources. Shivansh’s efforts provided BioGPS developers with insight into the technical limitations of each solution, as compared to the current BioGPS ExtJS codebase.
Kevin Wu developed a scalable and efficient system for storing and analyzing biologically meaningful sets of genes. Accessible via a RESTful HTTP interface, the system uses MongoDB for storage and custom code for distributed computing that executes statistical comparisons across thousands of gene sets in parallel. For any particular gene set, Kevin’s code makes it possible to rapidly identify similar gene sets and to calculate the ‘enrichment’ (a statistical measure of overlap) of that gene set with respect to any other. This work will soon be integrated into BioGPS to allow users to save their own gene sets and to query for similar gene sets from others.
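For readers unfamiliar with gene set 'enrichment', here is a minimal Python sketch of one common way to compute it: a one-sided hypergeometric test on the overlap between two sets, alongside a Jaccard similarity for ranking similar sets. The choice of test, the function names and the toy gene identifiers are assumptions for illustration; the summary above does not say which statistic Kevin's system uses, and this sketch leaves out the MongoDB storage and distributed execution entirely.

```python
from math import comb


def enrichment_p_value(set_a, set_b, universe_size):
    """P(overlap >= observed) if set_a were drawn at random from the universe."""
    a, b = set(set_a), set(set_b)
    n, K, N = len(a), len(b), universe_size
    if max(n, K) > N:
        raise ValueError("gene sets cannot be larger than the universe")
    k = len(a & b)
    # Sum the hypergeometric probabilities for overlaps of size k or more.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / comb(N, n)


def jaccard(set_a, set_b):
    """Overlap size divided by union size, a simple similarity for ranking."""
    a, b = set(set_a), set(set_b)
    return len(a & b) / len(a | b) if (a or b) else 0.0


query = {"g1", "g2", "g3", "g4", "g5"}       # toy identifiers
candidate = {"g1", "g2", "g3", "g4", "g5"}
p = enrichment_p_value(query, candidate, universe_size=10)
print(f"p = {p:.5f}, jaccard = {jaccard(query, candidate):.2f}")
```

A small p-value means the overlap is larger than expected by chance given the sizes of the two sets and of the gene universe.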
Thanks to all of our excellent students for their great contributions and to Google for sponsoring this unique program. We are looking forward to participating in the GSoC for many years to come!