Monday, July 28, 2014

Zooniverse Proposal: Excavating a network of concepts related to Chordoma from the biomedical literature

The Zooniverse team, in collaboration with other members of the Oxford community including the Faculty of English Language and Literature, has recently started an initiative about Constructing Scientific Communities.  As part of this initiative, they announced an open call for proposals.  Here is our proposal (originating from our work on the Mark2Cure project).

Title: Excavating a network of concepts related to Chordoma from the biomedical literature

Abstract:  The life sciences are currently faced with a rapidly growing array of technologies for measuring the molecular states of living things.  From sequencing platforms that can assemble the complete genome sequence of a complex organism involving billions of nucleotides in a few days to imaging systems that can just as rapidly churn out millions of snapshots of cells, biology is truly faced with a data deluge.  To translate this information into new knowledge that can guide the search for new medicines, biomedical researchers increasingly need to build on the existing knowledge of the broad community.  Prior knowledge can help guide searches through the masses of new data.  Unfortunately, most biomedical knowledge is represented solely in the text of journal articles.  Given that more than a million such articles are published every year, the challenge of using this knowledge effectively is substantial.  Ideally, knowledge such as the interrelations between genes, drugs, biological processes and diseases would be represented in a structured form that enabled queries like: “show me all the genes related to this disease or related to any drugs used to treat this disease”.  Currently systems exist that attempt to extract this information automatically from text, but the quality of their output remains far below what can be obtained by human readers.  We propose to construct a scientific community focused on translating the knowledge in the biomedical literature into structured forms suitable for effective access, aggregation and querying.  Specifically we propose to excavate a network of concepts related to Chordoma, leveraging an existing relationship we have with the Chordoma Foundation.  Chordoma is a rare, devastating form of cancer that develops along the skull and bones of the spine.  There are tens of thousands of articles about this disease, related diseases, related genes, and related biological processes.  
Extracting the network of knowledge represented in these texts will enable our group and others to more effectively identify existing drugs that might be repurposed to treat Chordoma and to produce hypotheses about genes that might be targets for new drugs.  

Please provide details of the images, video or sounds which form the basis of your project, and the task or tasks you envisage volunteers carrying out. As well as a description, include details of format and any copyright restrictions.

The subject matter for this task will be the abstracts of biomedical research articles housed in the PubMed database [1].  PubMed currently has more than 23 million abstracts and is growing at a rate of approximately 1 million new articles every year.  From these, we have identified a set of approximately 50,000 articles related to Chordoma that would form the basis of this project.  This set was selected by: 1) searching PubMed for Chordoma (produces 3333 articles), 2) searching within these articles for genes (produces 63 genes), 3) searching for articles related to those genes (produces an average of 731 articles per gene).  These abstracts can easily be accessed via an open Web API [2].  PubMed displays abstracts based on ‘fair use’ agreements with the many journals that supply them.  Some journals do maintain official hold over the copyrights for these abstracts, but in practice the abstracts are free for public use.  (The full text of the articles is a different matter, though many new open access articles are available without restrictions.)
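As a rough check on the corpus size, the three selection steps above can be tallied, and the first search can be written as an E-utilities ESearch request.  The sketch below only builds the request URL and never contacts the service.  The helper names are ours, chosen for illustration, and the tally assumes that the article sets do not overlap, so it should be read as an upper bound.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities ESearch endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(term, retmax=20):
    """Build a PubMed ESearch URL (illustrative helper; no request is sent)."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return ESEARCH + "?" + urlencode(params)


def estimate_corpus_size(seed_articles, n_genes, mean_articles_per_gene):
    """Seed articles plus gene-related articles, assuming no overlap."""
    return seed_articles + n_genes * mean_articles_per_gene


# Figures taken from the selection procedure described above.
total = estimate_corpus_size(3333, 63, 731)
print(total)  # 49386, i.e. roughly 50,000
print(esearch_url("chordoma"))
```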

The tasks involved in this proposal include the annotation of key kinds of biomedical entities in the text of these abstracts.  Specifically, we will ask participants to identify words or phrases that correspond to diseases, genes, chemical entities and biological processes.  After highlighting the specific phrase corresponding to one of these concepts, the volunteers will then be asked to find the highlighted concept in an existing ontology (a hierarchically organized controlled vocabulary) that we would provide.  The first task is often referred to as ‘concept detection’ and the second as ‘concept normalization’.  (If limited to concept detection, e.g. if the concept normalization interface proved too costly to engineer in this iteration, the project would still be a very valuable contribution.)  See the ‘egas’ web application for an example tool that supports these tasks.
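To make the two tasks concrete, here is a minimal sketch of the record a single annotation might produce.  The field names, the `detect` and `normalize` helpers, and the vocabulary identifier `EXAMPLE:0001` are all hypothetical placeholders, not part of any existing tool or ontology.  In the real project a volunteer would make both choices by hand, and the string search here merely stands in for the highlighting step.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Annotation:
    start: int                         # character offset where the phrase begins
    end: int                           # offset just past the end of the phrase
    text: str                          # the highlighted phrase itself
    concept_type: str                  # "disease", "gene", "chemical" or "process"
    ontology_id: Optional[str] = None  # filled in by concept normalization


def detect(abstract, phrase, concept_type):
    """Concept detection: record where a highlighted phrase sits in the text."""
    start = abstract.find(phrase)
    if start < 0:
        raise ValueError("phrase not found in abstract")
    return Annotation(start, start + len(phrase), phrase, concept_type)


def normalize(annotation, vocabulary):
    """Concept normalization: attach the matching vocabulary entry, if any."""
    annotation.ontology_id = vocabulary.get(annotation.text.lower())
    return annotation


# Toy sentence and a one-entry placeholder vocabulary (both invented).
abstract = "Brachyury is expressed in chordoma."
vocabulary = {"chordoma": "EXAMPLE:0001"}
mention = normalize(detect(abstract, "chordoma", "disease"), vocabulary)
print(mention)
```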

2.      NCBI E-utilities

Provide a brief description of the research which will be enabled by the crowdsourcing project. *
Please write up to 1000 words for a non-specialist audience. Include references in the text.

Precisely identifying occurrences of diseases, genes, biological processes, and chemical entities in biomedical text will help to drive both biomedical research and research into natural language processing (NLP).  We anticipate that NLP technology will eventually mature to the point where manual text annotation tasks such as the one proposed here are no longer necessary.  However, progress towards that objective is slow and is hampered by the need for large, manually annotated “gold standard” corpora with which to train machine learning systems and evaluate computational predictions [3].  The annotations captured through this project will form an invaluable resource for the NLP community to use to hone their algorithms.  Further, while NLP technology advances in steps that can take decades to unfold, we can make immediate use of the products of this project to advance research on Chordoma.  Here, we describe the Chordoma research that could be enabled by this project (leaving discussion of the project’s impact on NLP research to the next section on automated processing routines).

Modern approaches to drug development often begin with the identification of specific genes that are ‘targets’ for the drug.  Once a particular gene has been identified, drugs can be designed that repress or enhance its activity and thereby treat the intended disease.  The identification of good gene targets is a critically important step in drug development because the process of creating and testing a particular drug is incredibly costly in terms of both time and money.  In fact it has been estimated that, in general, it takes more than a decade and costs more than a billion dollars to bring a single drug to market - with many drugs failing at the final stages of the process [4].

In Chordoma, mutations in a gene called ‘Brachyury’ are present in more than 90% of afflicted patients [5].  This makes Brachyury one center of the search for drug targets.  Several studies have shown that if Brachyury is repressed in Chordoma cell lines (reproducing cell populations derived from Chordoma tumors), the characteristics that mark the cells as malignant, such as their capacity to proliferate, are significantly reduced [6].  Brachyury represents one promising drug target for Chordoma therapy (and for other cancers [7,8]), but we are still far from a cure.  No drugs have been approved by the United States Food and Drug Administration for the treatment of Chordoma, and while Brachyury is clearly an important component, it does not act alone.  Genes work together in complex relationships, often referred to as ‘biological pathways’, to produce both healthy and diseased phenotypes.  Other genes “turn on” the expression of Brachyury, which in turn activates or represses the expression of other genes downstream.  Many different members of this cascade could prove to be effective drug targets.  It is also important to keep in mind that these genes have normal, important functions that may make them unsuitable for drug targeting.  Understanding this network of interacting genes and the biological processes that they carry out is thus a crucial step in the rational selection of candidate genes.

A thorough map of the genes and biological processes related to Chordoma would be a powerful tool for research and is a challenge well-suited to a large community.  While some of the required information is present in databases such as that provided by the Gene Ontology consortium [9], which catalogues the function of genes, most remains represented in the text of scientific articles.  By tagging the occurrences of the crucial concepts (genes, diseases, chemicals, and biological processes) in these articles, we can build a network that links them together.  This network could then be used by scientists to guide their choice for the next experiments to execute in their search for cures.  
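One simple way to link tagged concepts is to connect any two that are mentioned in the same abstract and to count how often that happens.  The sketch below assumes this co-occurrence rule purely for illustration.  The proposal does not commit to a particular linking method, and the three toy abstracts are invented.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_network(tagged_abstracts):
    """Count, for each pair of concepts, the abstracts that mention both."""
    edges = Counter()
    for concepts in tagged_abstracts:
        # Sort so that each pair is always stored in the same order.
        for a, b in combinations(sorted(set(concepts)), 2):
            edges[(a, b)] += 1
    return edges


def neighbors(edges, concept):
    """All concepts linked to `concept`, with their co-occurrence counts."""
    linked = {}
    for (a, b), count in edges.items():
        if a == concept:
            linked[b] = count
        elif b == concept:
            linked[a] = count
    return linked


# Each inner list holds the concepts tagged in one (invented) abstract.
tagged = [
    ["chordoma", "brachyury"],
    ["chordoma", "brachyury", "angiogenesis"],
    ["angiogenesis", "chordoma"],
]
network = cooccurrence_network(tagged)
print(neighbors(network, "chordoma"))
```

Raw counts are a crude measure of relatedness; a scientist browsing the network would still need to read the underlying abstracts to learn how two concepts are related.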

In addition to finding novel target genes for the development of new drugs, another important direction for Chordoma research is the search for existing drugs that might be effective against this disease.  The challenge here again is one of selecting which of tens of thousands of available drugs to test.  This process, called “drug repositioning”, could also be enhanced through the provision of an effective map of the biological knowledge network surrounding Chordoma.  Existing drugs that treat related diseases (such as other forms of cancer) or that target proteins in biological processes known to be important to Chordoma progression (such as angiogenesis) are potential candidates for repositioning.  Once again, the quality and breadth of the network of knowledge related to Chordoma will have a direct impact on the success of identifying such drugs.

3.      Deleger L, Li Q, Lingren T, Kaiser M, Molnar K, Stoutenborough L, Kouril M, Marsolo K, Solti I: Building gold standard corpora for medical natural language processing tasks. AMIA Annu Symp Proc 2012, 2012:144-153.
4.      DiMasi JA, Grabowski HG: The cost of biopharmaceutical R&D: is biotech different?. Managerial and Decision Economics 2007, 28(4):469-479.
5.      Pillay N, Plagnol V, Tarpey PS, Lobo SB, Presneau N, Szuhai K, Halai D, Berisha F, Cannon SR, Mead S et al: A common single-nucleotide variant in T is strongly associated with chordoma. Nat Genet 2012, 44(11):1185-1187.
6.      Presneau N, Shalaby A, Ye H, Pillay N, Halai D, Idowu B, Tirabosco R, Whitwell D, Jacques TS, Kindblom LG et al: Role of the transcription factor T (brachyury) in the pathogenesis of sporadic chordoma: a genetic and functional-based study. The Journal of pathology 2011, 223(3):327-335.
7.      Imajyo I, Sugiura T, Kobayashi Y, Shimoda M, Ishii K, Akimoto N, Yoshihama N, Kobayashi I, Mori Y: T-box transcription factor Brachyury expression is correlated with epithelial-mesenchymal transition and lymph node metastasis in oral squamous cell carcinoma. International journal of oncology 2012, 41(6):1985-1995.
8.      Roselli M, Fernando RI, Guadagni F, Spila A, Alessandroni J, Palmirotta R, Costarelli L, Litzinger M, Hamilton D, Huang B et al: Brachyury, a driver of the epithelial-mesenchymal transition, is overexpressed in human lung tumors: an opportunity for novel interventions against lung cancer. Clin Cancer Res 2012, 18(14):3868-3879.
9.      Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT et al: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 2000, 25(1):25-29.

What automatic processing routines exist which attempt to solve the problem being addressed? Why can't they be used instead of humans? *
In order to avoid wasting the time of volunteers, we only support projects that require human classification. Please include references where possible

Many computational approaches for identifying concepts in text exist, but none of them provides accuracy that is comparable to manual annotation on the problems being addressed in this project.  The performance of concept recognition algorithms varies substantially based on the types of concepts sought.  Performance is typically measured by Precision (true positives / (true positives + false positives)) and Recall (true positives / (true positives + false negatives)), and summarized as the ‘F measure’ (the harmonic mean of Precision and Recall).  Specifically we are interested in identifying occurrences of diseases, genes, chemicals of interest, and biological processes.  A recent study identified the best performing of three modern tools for concept recognition across a variety of concepts [10].  They found that the best performing tool and parameter combination for recognizing genes (proteins) produced an F score of 0.57, for chemical entities an F score of 0.56 and for biological processes an F score of 0.42.  An advanced system specifically optimized for disease recognition recently reported an F measure of 0.81 [11].  In every case, humans can significantly outperform existing methods.  And as described previously, the breadth and accuracy of the network strongly influence how useful it is to research scientists.
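For readers unfamiliar with these measures, the definitions above translate directly into a few lines of code.  The counts in the example are made up to show the arithmetic and do not come from any of the cited studies.

```python
def precision(true_pos, false_pos):
    """Fraction of predicted mentions that are correct."""
    predicted = true_pos + false_pos
    return true_pos / predicted if predicted else 0.0


def recall(true_pos, false_neg):
    """Fraction of real mentions that were found."""
    actual = true_pos + false_neg
    return true_pos / actual if actual else 0.0


def f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Invented counts: 6 correct mentions, 2 spurious ones, 6 missed.
p = precision(6, 2)   # 0.75
r = recall(6, 6)      # 0.5
f = f_measure(p, r)   # 0.6
print(p, r, round(f, 2))
```

Because it is a harmonic mean, the F measure is pulled towards the lower of the two scores, so a tool cannot score well by maximizing Precision at the expense of Recall or the reverse.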

10.    Funk C, Baumgartner W, Jr., Garcia B, Roeder C, Bada M, Cohen KB, Hunter LE, Verspoor K: Large-scale biomedical concept recognition: an evaluation of current automatic annotators and their parameters. BMC Bioinformatics 2014, 15:59.
11.    Leaman R, Islamaj Dogan R, Lu Z: DNorm: disease name normalization with pairwise learning to rank. Bioinformatics 2013, 29(22):2909-2917.

If possible, estimate the minimum number of times a task must be performed on a given element of data to be useful for science (assuming all tasks are performed by competent citizen scientists; once might be enough for exceptionally clear tasks, more times could be required for fuzzier tasks or lots may be necessary if accurate estimates of uncertainties are needed). How many total tasks must be completed before your research goals are achievable?
This is difficult but any estimate helps.

Based on preliminary data we estimate the minimum number of times an individual task must be performed to produce useful results at 5, though additional iterations would improve quality.  These estimates are based on studies that we recently conducted using Amazon’s Mechanical Turk crowdsourcing system.  Our results indicate that non-specialist, minimally paid workers in this marketplace can successfully identify occurrences of diseases in PubMed abstracts.  (We have not yet tested other entity types.)  Using a simple aggregation strategy based on unweighted voting, we found that these workers could reproduce a gold standard disease mention corpus [12] with an F measure of 0.86.  We found that increasing the number of workers per document continuously increased the quality of the output but that quality increased only minimally after 15 workers per document.  Using just 5 workers per document we achieved an F score of 0.82 on the same corpus.
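Unweighted voting can be sketched as follows: every worker's highlights count equally, and a highlighted span is kept when enough workers marked it.  The vote threshold and the character-offset representation of spans are assumptions made for this illustration, not the exact settings used in our Mechanical Turk experiments, and the example highlights are invented.

```python
from collections import Counter


def aggregate_by_vote(worker_spans, min_votes):
    """Keep each (start, end) span marked by at least `min_votes` workers."""
    votes = Counter()
    for spans in worker_spans:
        # A set ensures that one worker cannot vote twice for the same span.
        votes.update(set(spans))
    return {span for span, count in votes.items() if count >= min_votes}


# Invented highlights from 5 workers on one abstract, as (start, end) offsets.
workers = [
    [(26, 34), (50, 58)],
    [(26, 34)],
    [(26, 34), (50, 58)],
    [(26, 34), (10, 12)],
    [(50, 58)],
]
consensus = aggregate_by_vote(workers, min_votes=3)
print(sorted(consensus))  # [(26, 34), (50, 58)]
```

Raising `min_votes` trades Recall for Precision, which is one reason the number of workers per document matters.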

It may be possible that the Zooniverse infrastructure would reduce the number of completions per task required for high quality.  We anticipate that more sophisticated aggregation algorithms that take into account information about individual worker quality could improve performance and that a more refined user interface and instruction set could also boost scores.  Further, we expect to attract more dedicated, high quality contributors from the citizen science community than from the Mechanical Turk platform.  It is also worth noting here that, despite the financial incentives that drive the Mechanical Turk system, many of the workers expressed a strong attachment to the project that was clearly highly motivational.  In fact, some workers asked if they could continue to complete these tasks outside of the Mechanical Turk context simply because they wanted to contribute to our efforts.

While there is no fixed threshold for the number of documents above which we could claim a complete reconstruction of the network of knowledge surrounding Chordoma, we estimate that 10,000 would provide an effective start, 50,000 would provide good coverage, and 100,000 would cover the domain in reasonable depth.  With more than 23 million articles already indexed in PubMed and 20,000 new articles arriving every week, there is an effectively unlimited range of potential work that could be performed based on this concept recognition and normalization model.  

12.    Dogan RI, Leaman R, Lu Z: NCBI disease corpus: a resource for disease name recognition and concept normalization. J Biomed Inform 2014, 47:1-10.

Who will make use of the results? Is any further funding necessary?

Researchers from the natural language processing community would use this data to train and to test their computational methods.  Bioinformatics scientists would also use this data to refine computational methods for identifying candidate drug targets and suggesting opportune existing drugs for repositioning.  The Chordoma research community, along with biomedical researchers in related domains, would make use of the concept network identified through this work via interactive software.  Given the data, generic network visualization tools such as Cytoscape [13] could be immediately applied.  Ideally Web-based applications specifically devoted to browsing and querying this network would also be delivered to this community.  Additional funding would be useful in delivering such focused tools, but given the data, research groups such as ours would likely be able to use other funding sources to produce the required end-user applications.  We also note that we run a reasonably well-funded bioinformatics lab, so we can devote our own time and effort toward the success of this collaboration based on funding for related projects.

13.    Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T: Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 2003, 13(11):2498-2504.

All data from Zooniverse projects must be eventually made public. What final format (catalogue, annotated image, query tool) would be needed? What are the anticipated final outcomes (e.g., papers, catalogues)? Are the results likely to be of interest to researchers beyond your own field?

The raw and refined results from this project would be delivered as bulk data exports in a format suitable for use by computational scientists (NLP and bioinformatics) and through a tool (or tools) that allowed biomedical researchers to interact with the extracted concept network without the need for programming skills.  Aside from these, we would expect to publish research articles about the process of composing this knowledge network in collaboration with citizen scientists.

We anticipate that the results of this project would be of broad interest to all communities that must process large amounts of unstructured text.  Essentially identical processes might be applied to tasks in widely varying fields including both other sciences and the humanities.  One additional aspect to consider regarding this project is that every document processed is already annotated with its date of publication.  As a result, it would be possible to develop views that exposed the evolution of the concept network over time.  This historical perspective might prove to be of interest to a variety of communities - especially those interested in epistemology and the history of science.  

Are there potential extensions to the project that you have in mind?

We envision extensions to this project in terms of:
1) expanding the number of different kinds of concepts identified
2) expanding to annotate different document sets targeted at different diseases
3) adding the ability for volunteers to specify relationships between the entities that they identify
4) providing volunteers with increasingly powerful computational tools for pre-processing the texts and for verifying the final annotations

The primary goal of each of our projects is to enable research, but they have significant educational impact as well. Engaging the community is an excellent way of ensuring they remain committed to producing results for you. Are there members of your team willing to write blog posts, join forum discussions on scientific topics or otherwise take part in outreach? Does the project tie in with any public engagement or education activities you are already involved with? *
Some form of continuous engagement is prerequisite for a successful project

As we hope is evident on our group’s blog and our Twitter streams (@bgood, @andrewsu), we are avidly working on scientific outreach on a daily basis.  In fact, Ginger Tsueng, one of our project team members, has recently been hired explicitly to manage community outreach for our research group.  Ginger and all other members of our team plan to actively engage with the community by all means at our disposal.  This project follows directly in line with several ongoing community intelligence efforts run by our group, including the Gene Wiki [14] and BioGPS [15].  Our preliminary work on the annotation problem has been operated under the moniker Mark2Cure.

14.    Good BM, Clarke EL, de Alfaro L, Su AI: The Gene Wiki in 2011: community intelligence applied to human gene annotation. Nucleic Acids Res 2012, 40(Database issue):D1255-1261.
15.    Wu C, Orozco C, Boyer J, Leglise M, Goodale J, Batalov S, Hodge CL, Haase J, Janes J, Huss JW, 3rd et al: BioGPS: an extensible and customizable portal for querying and organizing gene annotation resources. Genome Biol 2009, 10(11):R130.