Wednesday, March 4, 2015

crowdsourcing machine learning NLP challenge

There is a TopCoder contest running right now that involves machine learning, crowdsourced data, and natural language processing.  There is $41,000 up for grabs!  It will be distributed in many smaller prizes so there are plenty of opportunities to win something.

You need to register by tomorrow (March 4, 2015) to participate!  

More details here:
https://www.topcoder.com/blog/registration-for-banner-is-now-open/


Tuesday, February 24, 2015

Return of OntoLoki - on a Freaky Friday

One of my favorite traditions initiated in Mark Wilkinson's laboratory (where I did my PhD) was "Freaky Friday".  Following the spirit of Google's "20% time", lab members were encouraged to set aside some time on Friday to pursue their own crazy ideas - leaving their core projects aside for a few hours.  This was fun and ended up producing some useful things like "Tag clouds for summarizing web search results" (one of my most cited articles and only 2 pages long!) and "The Entity Describer".  In that spirit, I spent some time last Freaky Friday doing something very far from the things I should really be doing.  Sadly, I didn't hack something fun together.  Instead, per Dr. Wilkinson's 50th request that I do this, I performed a Frankensteinian resurrection.  Digging deep into the grave of my PhD dissertation, I found the one unpublished chapter, cleaned it up, and deposited it in the arXiv.  The project, called OntoLoki, is now undead!  Check out this undead manuscript about automatic quality evaluation for ontologies.

I am OntoLoki!  Feed me your ontologies for evaluation!

OntoLoki: an automatic, instance-based method for the evaluation of biological ontologies on the Semantic Web

Benjamin M. Good
Gavin Ha
Chi K. Ho
Mark D. Wilkinson


Friday, February 20, 2015

Building a Garden of Biological Knowledge

On March 13, 2013, I wrote up an idea in my notebook that I called 'Pubmed Daily'.  The concept was to build a system that would leverage large-scale crowdsourcing/citizen science and machine learning to produce a high-quality, structured representation of the knowledge in every abstract in PubMed on the same day that the abstract appeared online.  Nearly two years later, based mainly on the labor of outreach coordinator Ginger Tsueng, group leader Andrew Su, and programmer Max Nanis, the idea is just starting to bear fruit (albeit small, perhaps grape-sized, fruit).  As the bits and pieces start to come together, I thought it would be worthwhile to share the high-level vision as it exists now.

A Garden of Biological Knowledge


We want to build an information management system (or systems) that supports the work of three key groups of people: bioinformaticians such as Andrew, biologists such as Hudson Freeze, and patient advocates such as the parents and friends of Bertrand Might, a child with a rare genetic disorder related to the NGLY1 gene.  The over-arching goal is to produce more rapid biomedical advances based on more effective use of existing knowledge in the processes of hypothesis generation and high-throughput data analysis.

The thinking goes like this.  Given a high-quality, structured knowledge base such as the Gene Ontology (GO), Andrew and people like him can build many different kinds of discovery and analytical tools that help scientists such as Hudson work more effectively (and they do; there are thousands of tools that use the GO).  The problem is that the generation of knowledge bases like the GO is a long, slow, expensive process that in no way keeps pace with the advance of knowledge as represented in the literature.  Information extraction systems like DeepDive and SemRep can theoretically go a long way toward addressing this problem.  However, humans remain more effective (though obviously dramatically slower) readers.

Can we use computers to seed a garden of knowledge that can be tended and grown by citizen scientists?

Given a compelling argument, clear instructions, and an effective user interface, we think that large numbers of people from the general public could be assembled to work on improving the results of a biomedical information extraction system.  We, and other groups, have been experimenting with various related tasks using the Amazon Mechanical Turk and are now confident that "the crowd" can, in aggregate, do text-processing work at or above expert level.  A recent conversation with a leader of the Zooniverse project, a collection of online citizen science projects with more than a million people contributing, leads us to believe that it would be possible to attract tens to hundreds of thousands of people to participate in an effort like this.  Recently, we took the first steps towards testing that assumption via a short but successful test run of the Mark2Cure Web application.

Can we build a new generation of tools for working with structured biomedical knowledge at massive scales and use these to empower the rising community of citizen scientists?

Aside from the knowledge bases themselves, we are also interested in building better tools for navigating this information and in putting them into the hands of both professionals like Hudson and the many very intelligent, highly motivated people from other domains who might also be able to find something important given the chance (e.g. Matthew Might).

Can a large volunteer work force help teach computers to read?

By engaging volunteers at scale, we hope to provide the developers of information extraction algorithms with the data they need to raise their approaches to human levels of quality.

We have a long, challenging road on this project, but the path ahead is starting to take shape and the future looks bright!


Tuesday, November 11, 2014

Hackathon recap: Network of Biothings in San Diego

This past weekend, Nov. 7-9, the Neuroscience Information Framework, the Su Lab, NDex, The International Society for Biocuration and the San Diego Center for Systems Biology hosted the second Network of Biothings Hackathon.  The event took place at Atkinson Hall (Calit2) on the UC San Diego campus.  For the record, and to enhance what can already be seen in the Twitter story of this event, here is what went down.

Friday evening about 30 people arrived at Atkinson hall where they:
  1. consumed Indian food and American beverages, 
  2. met each other and talked over project ideas, 
  3. each decided on a project to work on for the weekend
Saturday the group:
  1. ate bagels for breakfast, sandwiches and leftover Indian food for lunch, and pizza for dinner
  2. hacked furiously to develop 8 distinct team projects
Sunday morning, we came together for more bagels, some last-minute hacking, and rapid presentation formation.  At 10:30am we got started with eight 10-minute project presentations.  Here is a very brief list of the projects with links out to all of the code that was developed.  (More details are available on the ideas page.)
  1. BioPolymer: A set of embeddable web components to display and edit bio data. Initially data from MyGene & MyVariant. Team: Mark Fortner, Keiichiro Ono
  2. SameSame: a dynamic tool for finding and visualizing the degree of similarity between any set of biomedical concepts.  Team: Benjamin Good, Maulik Kamdar, Alan Higgins, Alex Williams
  3. CIViC - Clinical Interpretation of Variants in Cancer:  Crowdsourcing and web interface for curation of clinically actionable evidence in cancer.  Team: Obi Griffith, Adam Coffman, Martin Jones, Karthik G, Jon Cristensen, Julianne Schneider
  4. Citizen Science: An app to enable people to extract structured ‘facts’ (subject predicate object triples) from unstructured text.  Here is the project presentation.  Team: Richard Good, Hannes Niedner, Andrew Su
  5. SBiDer: Synthetic Biocircuit Developer: A web-app to search a database of operons [functional biochemical pathways] to use them in new and novel ways [to make synthetic organisms such as the ones used to make this Malaria treatment].  Team Justin Huang, Kwat Yeerna, Fernando Contreras, Joaquin Reyna, Jenhan Tao
  6. NDex: The NDEx project provides a public website where scientists and organizations can share, store, manipulate, and publish biological network knowledge. Team Dexter Pratt
  7. fiSSEA: A framework that integrates MyGene.info and MyVariant.info to retrieve functional prediction annotations (or any type of annotation) for knowledge discovery, specifically implement CADD scores for “functional impact SNP Set Enrichment Analysis". Team: Adam Mark, Erick Scott, Chunlei Wu
  8. MyGene.Info Taxonomy Query: Added detailed taxonomy information to mygene.info. Allows queries based on taxonomy ID and advanced queries based on hierarchies of taxonomic nodes.  Team: Greg Stupp, Chunlei Wu
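As a rough illustration of the kind of taxonomy-filtered lookup this last project enables, here is a minimal sketch against the current MyGene.info query API; the gene symbol, taxonomy ID, and printed fields are illustrative, and the hierarchical taxonomy queries the team added are not shown.

    # Sketch: query MyGene.info for a gene symbol restricted to one NCBI taxonomy ID.
    # The symbol and taxonomy ID below are illustrative.
    import requests

    resp = requests.get("https://mygene.info/v3/query", params={
        "q": "symbol:cdk2",
        "species": "10090",   # NCBI taxonomy ID for mouse
    }).json()

    for hit in resp.get("hits", []):
        print(hit.get("symbol"), hit.get("taxid"), hit.get("name"))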
Hackathon Trophy


And the winners were...
#1 Citizen Science
#2 SBiDer!



There were a lot of very exciting things about this event.  In addition to a strong core of academics from multiple universities, we also had local app and algorithm developers from industry.  While the group leaned heavily toward the US west coast, we did have representatives from St. Louis and one participant who came all the way from Canada.  We also had a very strong undergraduate team (which took second place!).  All of the projects clearly made real progress over the weekend, with some excellent cross-pollination of code and ideas.
And the coolest thing about this event??
My Dad was on the winning team :).. Go Dad!
Special thanks to Martin Jones for running around with me picking up bagels, pizza, drinks, and snacks for everyone and to Jeffrey Grethe for keeping everything running smoothly at the event, pushing around tables, helping clean up after the hackers, and calmly handling everything that needed to be done.  

Tuesday, November 4, 2014

What is a hackathon?

As an organizer of the upcoming Network of Biothings Hackathon at UC San Diego, I've been asked by a number of people what a hackathon is exactly.  I'm repurposing one of those responses here (originally posted on the San Diego iOS developers meetup group).

The main idea is that a variety of different people come together to meet each other and make something together - almost always something open source. In the case of our hackathon, the purpose is specifically to engender new collaborations. For the academics these could translate into new research programs and new collaborative grant proposals. For industry folks, these could turn into new products.  Ideally, a hackathon can bring together the elements of a great new team. For example, I'm a back-end database guy with an understanding of bioinformatics. I'd love to find a front-end web or app developer to help make my data and algorithms useful to the rest of the world.

Here are a few questions I've fielded:

  • why do developers pay to build apps for somebody else? 
The money goes to pay for food, drinks, facilities fees, and to a small extent the prize. Developers don't pay to develop for someone else; they pay to meet other people and to eat. No one is under any obligation to give their code to anyone else. You would be welcome to come by and work on your own project. In fact, we are actively trying to get more project ideas posted for our hackathon. (Note that the fee we are asking for is only $40, and many hackathons with larger sponsors are free.)
  • why do developers put their time to do work for free?
The main point here is team formation. If you have a great app, this is also a way to advertise it - especially if it wins a prize.
  • do the teams who paid to participate build apps and one is chosen as winner? 
Yes, this is the basic idea.
  • are the rest thrown away? 
Nothing is thrown away. The participants maintain ownership of all code that is written. (Though open source is very strongly encouraged...)
  • Am I too young/old to participate?
Nope!  All are welcome.

In conclusion, hackathons are fun, social events for people who like to build new things, meet new people, and perhaps change the world.  Sign up for ours and find out for yourself!

Monday, October 6, 2014

Network of Biothings Hackathon at UC San Diego

Can you code?

Are you interested in the intersection of computer science and biology (bioinformatics)?

Do you want to meet interesting people?

Are you excited about building new pieces of software that could change the face of science and medicine?

Do you want to win a cash prize for your open source code?

Then it's clearly time to:
  • Location:  UC San Diego on the 5th floor of the CALIT2 building
  • Sign up:  sign up 
  • Schedule:
  • Friday, November 7
    • 6-10pm : Welcome social / project team formation
  • Saturday, November 8
    • 9am-? : Hacking !
  • Sunday, November, 9 
    • 9am-10:30: Final hacking / presentation preparation
    • 10:30-11:30: Pitches and Demos
    • 11:45: Prize announcements

Click here for additional details.

Wednesday, October 1, 2014

Conference proceedings are citable, stop double-dipping

Over the last several years there has been a trend among conferences such as the Bio-Ontologies Special Interest Group meeting at ISMB and Semantic Web Applications and Tools for Life Sciences (SWAT4LS) to invite article submissions from people who want to present at the conference, and then subsequently to invite presenters to expand their article into an "official" publication in an associated journal.  Since the PDFs of the articles submitted to the conference usually end up online, as they should, first reviewers and later readers are often confronted with two versions of essentially the same article - typically with the same title, author list, and often the same abstract.  This causes problems for reviewers, as this kind of overlap with prior work (even from the same authors) would typically be grounds for rejection - yet because of this bizarre arrangement with the conference, reviewers are supposed to treat the original conference article as if it were a pre-print of the journal version, despite the fact that it is a citable entity on the Web and is never referred to as a preprint anywhere.

Conference organizers, please stop this madness.  Here are three models that would be better.

  1. Following the International Biocuration Conference model, invite submissions directly to the partner journal first and then choose presenters from the successful submissions and independently submitted abstracts.  No confusion.  One good, citable paper.  Probably higher-quality conference submissions, too.
  2. Put the articles submitted to the conference in a pre-print server such as arXiv or BioRxiv and continue with the concept of an expanded article in a journal.
  3. Do what the computer science community does and recognize contributions to conference proceedings as citable articles and do away with the attempt to get an 'official' journal publication in addition to the conference citation.

Wednesday, August 13, 2014

Network of BioThings: Hackathon 2 San Diego (when?)

The Network of Biothings, first announced in December of 2013, is being imagined by a loose, self-organizing consortium of people who share the vision of uniting and linking the world's biological and medical knowledge.  In support of this vision, The Su Laboratory, with partners at UCSD, is gearing up to host the second Network of Biothings Hackathon.  The first hackathon was an exciting and very educational event that sparked some useful projects such as http://myvariant.info/.  We are hoping to build on that momentum with an even more successful second event.

If you would like to participate in Hackathon 2, you can begin by helping us solve the most challenging problem of all: picking dates for a hackathon!  Please fill in dates that you would be available to come hack with us in San Diego at this poll:

http://doodle.com/z2irpfma6apyavpk

Why should you bother?
When faced with challenges such as selecting the best treatment for a patient or coming up with the next candidate drug target for a rare disease, we are now presented with an unbelievable wealth of data including: full genome sequencing, mRNA expression, miRNA expression, methylation, metabolomics, proteomics, clinical, imaging, and on and on.  In order for this new data to be useful, we depend on networks of knowledge.  For example, we may be able to detect that a particular gene is acting unusually in a patient, but we need to know something about that gene's biological function before we can use the new information to inform a clinical decision.  Many, many valuable databases continue to arise that help address this fundamental challenge, but there is a clear consensus that most knowledge - especially the vast amount that is shared through the literature - is not accessible in any coherent form.  With your help, that coherent form - whatever it ends up looking like - could arise from the Network of Biothings.


Monday, July 28, 2014

Zooniverse Proposal: Excavating a network of concepts related to Chordoma from the biomedical literature


The Zooniverse team, in collaboration with other members of the Oxford community including the Faculty of English Language and Literature, has recently started an initiative about Constructing Scientific Communities.  As part of this initiative, they announced an open call for proposals.  Here is our proposal (originating from our work on the Mark2Cure project).

Title: Excavating a network of concepts related to Chordoma from the biomedical literature

Abstract:  The life sciences are currently faced with a rapidly growing array of technologies for measuring the molecular states of living things.  From sequencing platforms that can assemble the complete genome sequence of a complex organism involving billions of nucleotides in a few days to imaging systems that can just as rapidly churn out millions of snapshots of cells, biology is truly faced with a data deluge.  To translate this information into new knowledge that can guide the search for new medicines, biomedical researchers increasingly need to build on the existing knowledge of the broad community.  Prior knowledge can help guide searches through the masses of new data.  Unfortunately, most biomedical knowledge is represented solely in the text of journal articles.  Given that more than a million such articles are published every year, the challenge of using this knowledge effectively is substantial.  Ideally, knowledge such as the interrelations between genes, drugs, biological processes and diseases would be represented in a structured form that enabled queries like: “show me all the genes related to this disease or related to any drugs used to treat this disease”.  Currently systems exist that attempt to extract this information automatically from text, but the quality of their output remains far below what can be obtained by human readers.  We propose to construct a scientific community focused on translating the knowledge in the biomedical literature into structured forms suitable for effective access, aggregation and querying.  Specifically we propose to excavate a network of concepts related to Chordoma, leveraging an existing relationship we have with the Chordoma Foundation.  Chordoma is a rare, devastating form of cancer that develops along the skull and bones of the spine.  There are tens of thousands of articles about this disease, related diseases, related genes, and related biological processes.  Extracting the network of knowledge represented in these texts will enable our group and others to more effectively identify existing drugs that might be repurposed to treat Chordoma and to produce hypotheses about genes that might be targets for new drugs.  

Please provide details of the images, video or sounds which form the basis of your project, and the task or tasks you envisage volunteers carrying out. As well as a description, include details of format and any copyright restrictions.

The subject matter for this task will be the abstracts of biomedical research articles housed in the PubMed database [1].  PubMed currently has more than 23 million abstracts and is growing at a rate of approximately 1 million new articles every year.  From these, we have identified a set of approximately 50,000 articles related to Chordoma that would form the basis of this project.  This set was selected by: 1) searching PubMed for Chordoma (produces 3333 articles), 2) searching within these articles for genes (produces 63 genes), 3) searching for articles related to those genes (produces an average of 731 articles per gene).  These abstracts can easily be accessed via an open Web API [2].  PubMed displays abstracts based on 'fair use' agreements with the many journals that supply them.  Some journals do maintain official hold over the copyrights for these abstracts, but in practice the abstracts are free for public use.  (The full text of the articles is a different matter, though many new open access articles are available without restrictions.)
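For readers who want to see what "easily accessed" means in practice, here is a minimal sketch of pulling abstracts through the E-utilities API [2]; the search term, batch size, and output handling are illustrative and not the exact pipeline used to assemble the 50,000-article set.

    # Sketch: retrieve Chordoma-related PubMed abstracts via the NCBI E-utilities.
    # The search term and retmax value below are illustrative only.
    import requests

    EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

    # Step 1: esearch returns the PubMed IDs matching a query term.
    search = requests.get(EUTILS + "/esearch.fcgi", params={
        "db": "pubmed", "term": "chordoma", "retmax": 100, "retmode": "json",
    }).json()
    pmids = search["esearchresult"]["idlist"]

    # Step 2: efetch returns the abstracts for those IDs as plain text.
    abstracts = requests.get(EUTILS + "/efetch.fcgi", params={
        "db": "pubmed", "id": ",".join(pmids),
        "rettype": "abstract", "retmode": "text",
    }).text

    print(len(pmids), "abstracts retrieved")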

The tasks involved in this proposal include the annotation of key kinds of biomedical entities in the text of these abstracts.  Specifically, we will ask participants to identify words or phrases that correspond to diseases, genes, chemical entities and biological processes.  After highlighting the specific phrase corresponding to one of these concepts, the volunteers will then be asked to find the highlighted concept in an existing ontology (a hierarchically organized controlled vocabulary) that we would provide.  The first task is often referred to as 'concept detection' and the second as 'concept normalization'.  (If limited to concept detection, e.g. if the concept normalization interface were too costly to engineer in this iteration, the project would still be a very valuable contribution.)  See the 'egas' web application for an example tool that supports these tasks: http://bioinformatics.ua.pt/egas/.
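To make the two tasks concrete, the record below sketches what one volunteer contribution could capture; the field names, PubMed ID, offsets, and ontology identifier are all illustrative rather than the schema of any particular annotation tool.

    # Sketch: one volunteer annotation, combining concept detection (the highlighted span)
    # with concept normalization (the ontology term chosen for it).  All values are illustrative.
    annotation = {
        "pmid": "12345678",          # PubMed ID of the abstract (illustrative)
        "start": 102, "end": 110,    # character offsets of the highlighted phrase
        "text": "chordoma",          # the phrase the volunteer highlighted
        "entity_type": "disease",    # disease, gene, chemical entity, or biological process
        "concept_id": "DOID:3302",   # e.g. a Disease Ontology term (identifier shown is illustrative)
    }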

1.      PubMed [http://www.ncbi.nlm.nih.gov/pubmed/]
2.      NCBI E-utilities [http://www.ncbi.nlm.nih.gov/books/NBK25501/]

Provide a brief description of the research which will be enabled by the crowdsourcing project. *
Please write up to 1000 words for a non-specialist audience. Include references in the text.

Precisely identifying occurrences of diseases, genes, biological processes, and chemical entities in biomedical text will help to drive both biomedical research and research into natural language processing (NLP).  In the long run, we anticipate that NLP technology will eventually mature to the point where manual text annotation tasks such as that proposed here are not necessary.  However, progress towards that objective is slow and is hampered by the need for large, manually annotated “gold standard” corpora with which to train machine learning systems and evaluate computational predictions [3].  The annotations captured through this project will form an invaluable resource for the NLP community to use to hone their algorithms.  Further, while NLP technology advances in steps that can take decades to unfold, we can make immediate use of the products of this project to advance research on Chordoma.  Here, we describe the Chordoma research that could be enabled by this project (leaving discussion of the project’s impact on NLP research to the next section on automated processing routines).

Modern approaches to drug development often begin with the identification of specific genes that are ‘targets’ for the drug.  Once a particular gene has been identified, drugs can be designed that repress or enhance its activity and thereby treat the intended disease.  The identification of good gene targets is a critically important step in drug development because the process of creating and testing a particular drug is incredibly costly in terms of both time and money.  In fact it has been estimated that, in general, it takes more than a decade and costs more than a billion dollars to bring a single drug to market - with many drugs failing at the final stages of the process [4].

In Chordoma, mutations in a gene called 'Brachyury' are present in more than 90% of afflicted patients [5].  This makes it a natural center of the search for drug targets.  Several studies have shown that if Brachyury is repressed in Chordoma cell lines (reproducing cell populations derived from Chordoma tumors), the pathological characteristics of these malignant cells, such as their capacity to proliferate, are significantly decreased [6].  Brachyury represents one promising drug target for Chordoma therapy (and for other cancers [7,8]), but we are still far from a cure.  No drugs have been approved by the United States Food and Drug Administration for the treatment of Chordoma, and while Brachyury is clearly an important component, it does not act alone.  Genes work together in complex relationships, often referred to as 'biological pathways', to produce both healthy and diseased phenotypes.  Other genes "turn on" the expression of Brachyury, which in turn activates or represses the expression of other genes downstream.  Many different members of this cascade could prove to be effective drug targets.  It is also important to keep in mind that these genes have normal, important functions that may make them unsuitable for drug targeting.  Understanding this network of interacting genes and the biological processes that they carry out is thus a crucial step in the rational selection of candidate genes.

A thorough map of the genes and biological processes related to Chordoma would be a powerful tool for research and is a challenge well-suited to a large community.  While some of the required information is present in databases such as that provided by the Gene Ontology consortium [9], which catalogues the function of genes, most remains represented in the text of scientific articles.  By tagging the occurrences of the crucial concepts (genes, diseases, chemicals, and biological processes) in these articles, we can build a network that links them together.  This network could then be used by scientists to guide their choice for the next experiments to execute in their search for cures.  
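As a rough illustration of how the tagged concepts could be linked together, the sketch below connects any two normalized concepts that appear in the same abstract and weights the link by how many abstracts they share; this simple co-occurrence rule and the example identifiers are assumptions made for illustration, not our final network-building method.

    # Sketch: build a weighted concept co-occurrence network from per-abstract annotations.
    # The aggregation rule (shared abstract = edge) and the concept IDs are illustrative.
    from itertools import combinations
    import networkx as nx

    # Normalized concept IDs found in each abstract, keyed by PubMed ID (illustrative data).
    concepts_by_abstract = {
        "12345678": {"DOID:3302", "GENE:T", "GO:0001525"},
        "23456789": {"DOID:3302", "GENE:T"},
    }

    graph = nx.Graph()
    for pmid, concepts in concepts_by_abstract.items():
        for a, b in combinations(sorted(concepts), 2):
            if graph.has_edge(a, b):
                graph[a][b]["weight"] += 1   # one more abstract mentions both concepts
            else:
                graph.add_edge(a, b, weight=1)

    print(graph.edges(data=True))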

In addition to finding novel target genes for the development of new drugs, another important direction for Chordoma research is the search for existing drugs that might be effective on this disease.  The challenge here again is one of selecting which of tens of thousands of available drugs to test.  This process, called “drug repositioning”, could also be enhanced through the provision of an effective map of the biological knowledge network surrounding Chordoma.  Existing drugs that treat related diseases (such as other forms of cancer) or that target proteins in biological processes known to be important to Chordoma progression (such as angiogenesis) form potential candidates for repositioning.  Once again, the quality and breadth of the network of knowledge related to Chordoma will have a direct impact on the success of identifying such drugs.  

3.      Deleger L, Li Q, Lingren T, Kaiser M, Molnar K, Stoutenborough L, Kouril M, Marsolo K, Solti I: Building gold standard corpora for medical natural language processing tasks. AMIA Annu Symp Proc 2012, 2012:144-153.
4.      DiMasi JA, Grabowski HG: The cost of biopharmaceutical R&D: is biotech different?. Managerial and Decision Economics 2007, 28(4):469-479.
5.      Pillay N, Plagnol V, Tarpey PS, Lobo SB, Presneau N, Szuhai K, Halai D, Berisha F, Cannon SR, Mead S et al: A common single-nucleotide variant in T is strongly associated with chordoma. Nat Genet 2012, 44(11):1185-1187.
6.      Presneau N, Shalaby A, Ye H, Pillay N, Halai D, Idowu B, Tirabosco R, Whitwell D, Jacques TS, Kindblom LG et al: Role of the transcription factor T (brachyury) in the pathogenesis of sporadic chordoma: a genetic and functional-based study. The Journal of pathology 2011, 223(3):327-335.
7.      Imajyo I, Sugiura T, Kobayashi Y, Shimoda M, Ishii K, Akimoto N, Yoshihama N, Kobayashi I, Mori Y: T-box transcription factor Brachyury expression is correlated with epithelial-mesenchymal transition and lymph node metastasis in oral squamous cell carcinoma. International journal of oncology 2012, 41(6):1985-1995.
8.      Roselli M, Fernando RI, Guadagni F, Spila A, Alessandroni J, Palmirotta R, Costarelli L, Litzinger M, Hamilton D, Huang B et al: Brachyury, a driver of the epithelial-mesenchymal transition, is overexpressed in human lung tumors: an opportunity for novel interventions against lung cancer. Clin Cancer Res 2012, 18(14):3868-3879.
9.      Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT et al: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 2000, 25(1):25-29.

What automatic processing routines exist which attempt to solve the problem being addressed? Why can't they be used instead of humans? *
In order to avoid wasting the time of volunteers, we only support projects that require human classification. Please include references where possible

Many computational approaches for identifying concepts in text exist, but none of them provides accuracy that is comparable to manual annotation on the problems being addressed in this project.  The performance of concept recognition algorithms varies substantially based on the types of concepts sought.  Performance is typically measured based on Precision (true positives / (false positives + true positives)), Recall (true positives / (true positives + false negatives)) and summarized as the 'F measure' (the harmonic mean of Precision and Recall).  Specifically we are interested in identifying occurrences of diseases, genes, chemicals of interest, and biological processes.  A recent study identified the best performing of three modern tools for concept recognition across a variety of concepts [10].  They found that the best performing tool and parameter combination for recognizing genes (proteins) produced an F score of 0.57, for chemical entities an F score of 0.56, and for biological processes an F score of 0.42.  An advanced system specifically optimized for disease recognition recently reported an F measure of 0.81 [11].  In every case, humans can significantly outperform existing methods.  And as described previously, the breadth and accuracy of the network strongly influence how useful it is to research scientists.
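For reference, the scores quoted above combine precision and recall in the standard way; the snippet below is a minimal sketch of the calculation, with made-up counts.

    # Precision, Recall, and F measure as used for the scores quoted above.
    def precision_recall_f(tp, fp, fn):
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f_measure = 2 * precision * recall / (precision + recall)  # harmonic mean
        return precision, recall, f_measure

    # Made-up counts: 70 correct mentions found, 30 spurious, 40 missed.
    print(precision_recall_f(tp=70, fp=30, fn=40))  # -> (0.70, 0.636..., 0.666...)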

10.    Funk C, Baumgartner W, Jr., Garcia B, Roeder C, Bada M, Cohen KB, Hunter LE, Verspoor K: Large-scale biomedical concept recognition: an evaluation of current automatic annotators and their parameters. BMC Bioinformatics 2014, 15:59.
11.    Leaman R, Islamaj Dogan R, Lu Z: DNorm: disease name normalization with pairwise learning to rank. Bioinformatics 2013, 29(22):2909-2917.

If possible, estimate the minimum number of times a task must be performed on a given element of data to be useful for science (assuming all tasks are performed by competent citizen scientists; once might be enough for exceptionally clear tasks, more times could be required for fuzzier tasks or lots may be necessary if accurate estimates of uncertainties are needed). How many total tasks must be completed before your research goals are achievable?
This is difficult but any estimate helps.

Based on preliminary data, we estimate the minimum number of times an individual task must be performed to produce useful results at 5, though additional iterations would improve quality.  These estimates are based on studies that we recently conducted using Amazon's Mechanical Turk crowdsourcing system.  Our results indicate that non-specialist, minimally paid workers in this marketplace can successfully identify occurrences of diseases in PubMed abstracts.  (We have not yet tested other entity types.)  Using a simple aggregation strategy based on unweighted voting, we found that these workers could reproduce a gold standard disease mention corpus [12] with an F measure of 0.86.  We found that increasing the number of workers per document continuously increased the quality of the output, but that quality increased only minimally after 15 workers per document.  Using just 5 workers per document we achieved an F score of 0.82 on the same corpus.
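The unweighted voting mentioned above can be sketched as follows, assuming each worker's annotations have been reduced to a set of candidate text spans; the data layout and the majority threshold are illustrative assumptions, not the exact aggregation code we used.

    # Sketch: aggregate worker annotations by unweighted majority vote.
    # A candidate span is kept if at least half of the workers marked it.
    # The data layout and threshold are illustrative.
    from collections import Counter

    def aggregate_spans(worker_spans, threshold=0.5):
        """worker_spans: one set of (start, end) disease spans per worker."""
        n_workers = len(worker_spans)
        votes = Counter(span for spans in worker_spans for span in spans)
        return {span for span, count in votes.items() if count / n_workers >= threshold}

    workers = [{(10, 18), (40, 52)}, {(10, 18)}, {(10, 18), (60, 66)}]
    print(aggregate_spans(workers))  # -> {(10, 18)}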

It may be possible that the Zooniverse infrastructure would reduce the number of completions per task required for high quality.  We anticipate that more sophisticated aggregation algorithms that take into account information about individual worker quality could improve performance, and that a more refined user interface and instruction set could also boost scores.  Further, we expect to attract more dedicated, high quality contributors from the citizen science community than from the Mechanical Turk platform.  It is also worth noting here that, despite the financial incentives that drive the Mechanical Turk system, many of the workers expressed a strong attachment to the project that was clearly highly motivational.  In fact, some workers asked if they could continue to complete these tasks outside of the Mechanical Turk context simply because they wanted to contribute to our efforts.

While there is no fixed threshold for the number of documents above which we could claim a complete reconstruction of the network of knowledge surrounding Chordoma, we estimate that 10,000 would provide an effective start, 50,000 would provide good coverage, and 100,000 would cover the domain in reasonable depth.  With more than 23 million articles already indexed in PubMed and 20,000 new articles arriving every week, there is an effectively unlimited range of potential work that could be performed based on this concept recognition and normalization model.  

12.    Dogan RI, Leaman R, Lu Z: NCBI disease corpus: a resource for disease name recognition and concept normalization. J Biomed Inform 2014, 47:1-10.

Who will make use of the results? Is any further funding necessary?

Researchers from the natural language processing community would use this data to train and to test their computational methods.  Bioinformatics scientists would also use this data to refine computational methods for identifying candidate drug targets and for suggesting existing drugs that are promising candidates for repositioning.  The Chordoma research community, along with biomedical researchers in related domains, would make use of the concept network identified through this work via interactive software.  Given the data, generic network visualization tools such as Cytoscape [13] could be immediately applied.  Ideally, Web-based applications specifically devoted to browsing and querying this network would also be delivered to this community.  Additional funding would be useful in delivering such focused tools, but given the data, research groups such as ours would likely be able to use other funding sources to produce the required end-user applications.  We also note that we run a reasonably well-funded bioinformatics lab, so we can devote our own time and effort toward the success of this collaboration based on funding for related projects.

13.    Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T: Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 2003, 13(11):2498-2504.

All data from Zooniverse projects must be eventually made public. What final format (catalogue, annotated image, query tool) would be needed? What are the anticipated final outcomes (e.g., papers, catalogues)? Are the results likely to be of interest to researchers beyond your own field?

The raw and refined results from this project would be delivered as bulk data exports in a format suitable for use by computational scientists (NLP and bioinformatics) and through a tool (or tools) that allowed biomedical researchers to interact with the extracted concept network without the need for programming skills.  Aside from these, we would expect to publish research articles about the process of composing this knowledge network in collaboration with citizen scientists.

We anticipate that the results of this project would be of broad interest to all communities that must process large amounts of unstructured text.  Essentially identical processes might be applied to tasks in widely varying fields including both other sciences and the humanities.  One additional aspect to consider regarding this project is that every document processed is already annotated with its date of publication.  As a result, it would be possible to develop views that exposed the evolution of the concept network over time.  This historical perspective might prove to be of interest to a variety of communities - especially those interested in epistemology and the history of science.  

Are there potential extensions to the project that you have in mind?

We envision extensions to this project in terms of:
1) expanding the number of different kinds of concepts identified
2) expanding to annotate different document sets targeted at different diseases
3) adding the ability for volunteers to specify relationships between the entities that they identify
4) providing volunteers with increasingly powerful computational tools for pre-processing the texts and for verifying the final annotations

The primary goal of each of our projects is to enable research, but they have significant educational impact as well. Engaging the community is an excellent way of ensuring they remain committed to producing results for you. Are there members of your team willing to write blog posts, join forum discussions on scientific topics or otherwise take part in outreach? Does the project tie in with any public engagement or education activities you are already involved with? *
Some form of continuous engagement is prerequisite for a successful project

As we hope is evident on our group's blog, http://sulab.org/blog/, and our twitter streams (@bgood, @andrewsu), we are avidly working on scientific outreach on a daily basis.  In fact, Ginger Tsueng, one of our project team members, was recently hired explicitly to manage community outreach for our research group.  Ginger and all other members of our team would plan to actively engage with the community by all means at our disposal.  This project follows directly in line with several ongoing community intelligence efforts run by our group, including the Gene Wiki [14], BioGPS [15], and http://genegames.org.  Our preliminary work on the annotation problem has been conducted under the moniker Mark2Cure at http://mark2cure.org.

14.    Good BM, Clarke EL, de Alfaro L, Su AI: The Gene Wiki in 2011: community intelligence applied to human gene annotation. Nucleic Acids Res 2012, 40(Database issue):D1255-1261.
15.    Wu C, Orozco C, Boyer J, Leglise M, Goodale J, Batalov S, Hodge CL, Haase J, Janes J, Huss JW, 3rd et al: BioGPS: an extensible and customizable portal for querying and organizing gene annotation resources. Genome Biol 2009, 10(11):R130.

Monday, April 21, 2014

The Cure at Salk Cancer Day Symposium

Karthik G. and I will be presenting a poster tomorrow at the Salk Institute's Cancer Day Symposium.  The poster covers data from a year of running the scientific discovery game The Cure.  You can read more about those results on the arXiv.

If you are coming, please stop by for a chat!  We would especially love the chance to discuss the new, collaborative decision tree-building interface that Karthik has created.  Who knows if the conference wifi will work, so please try it now!