Friday, October 23, 2015

Introducing Knowledge.Bio

I just prepared the following poster abstract for the upcoming Big Data 2 Knowledge all-hands meeting at NIH.  Please play with the tool it describes and let us know what you think (it is a work in progress!).  Also, if you have a chance, please stop by the poster and say hello!

Knowledge.Bio: an Interactive Tool for Literature-based Discovery 
Personal knowledge graph showing literature-derived connections between Sepiapterin Reductase (SPR) and 5-Hydroxytryptophan (a treatment for patients with deleterious mutations in SPR).
Benjamin M. Good, Ph.D.1; Richard M. Bruskiewich, Ph.D. 2; Kenneth C. Huellas-Bruskiewicz2; Farzin Ahmed2; Andrew I. Su, Ph.D.1
1 The Scripps Research Institute, La Jolla, CA, USA. 2 STAR Informatics / Delphinai Corporation, Port Moody, BC, Canada

PubMed now indexes roughly 25 million articles and is growing by more than a million per year.  The scale of this “Big Knowledge” repository renders traditional, article-based modes of user interaction unsatisfactory, demanding new interfaces for integrating and summarizing widely distributed knowledge.  Natural language processing (NLP) techniques coupled with rich user interfaces can help meet this demand, providing end-users with enhanced views into public knowledge, stimulating their ability to form new hypotheses.

Knowledge.Bio provides a Web interface for exploring the results of text-mining PubMed.  It works with subject-predicate-object assertions (triples) extracted from individual abstracts and with predicted statistical associations between pairs of concepts.  While agnostic to the NLP technology employed, the current implementation is loaded with triples from the SemRep-generated SemMedDB database and with putative gene-disease pairs obtained using Leiden University Medical Center’s ‘Implicitome’ technology.
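
To make these two kinds of input concrete, here is a minimal Python sketch of how a mined triple and a predicted concept pair might be represented and filtered.  The class names, field names, predicate label, and example records are illustrative assumptions, not the actual Knowledge.Bio schema.

```python
# A minimal, illustrative sketch (NOT the actual Knowledge.Bio schema) of the
# two kinds of records the interface works with: SemMedDB-style triples mined
# from single abstracts and Implicitome-style statistical concept pairs.
from dataclasses import dataclass

@dataclass
class Triple:
    subject: str    # e.g. "5-Hydroxytryptophan"
    predicate: str  # a SemRep predicate such as "TREATS" or "ASSOCIATED_WITH"
    obj: str        # e.g. "Sepiapterin reductase deficiency"
    pmid: str       # PubMed ID of the abstract the assertion was extracted from

@dataclass
class ConceptPair:
    concept_a: str
    concept_b: str
    score: float    # strength of the predicted gene-disease association

# Purely illustrative example records (placeholder PMID and score):
triples = [Triple("5-Hydroxytryptophan", "TREATS",
                  "Sepiapterin reductase deficiency", "12345678")]
pairs = [ConceptPair("SPR", "dystonia", 0.87)]

# A text/semantic filter of the kind used to narrow the triple tables might
# boil down to something like this:
def filter_triples(triples, predicate=None, text=None):
    keep = []
    for t in triples:
        if predicate and t.predicate != predicate:
            continue
        if text and text.lower() not in (t.subject + " " + t.obj).lower():
            continue
        keep.append(t)
    return keep
```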

Users of Knowledge.Bio begin by identifying a concept of interest using text search.  Once a concept is identified, associated triples and concept-pairs are displayed in tables.  These tables have text-based and semantic filters to help refine the list of triples to relations of interest.  The user then selects relations for insertion into a personal knowledge graph implemented using cytoscape.js.  The graph is used as a note-taking or ‘mind-mapping’ structure that can be saved offline and then later reloaded into the application.  Clicking on edges within a graph or on the ‘evidence’ element of a triple displays the abstracts where that relation was detected, thus allowing the user to judge the veracity of the statement and to read the underlying articles.
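
To illustrate the data flow into the graph (a hedged sketch, not the application's actual code; the tuple layout and the "evidence" key are assumptions made for the example), selected relations can be serialized into the elements structure that cytoscape.js consumes: nodes carry an id, and edges carry a source, a target, and here a label plus supporting PMIDs.  Saving and reloading this JSON mirrors the note-taking workflow described above.

```python
# Sketch: turn selected (subject, predicate, object, pmid) relations into the
# elements structure rendered by cytoscape.js.  Purely illustrative.
import json

def to_cytoscape_elements(selected):
    """selected: iterable of (subject, predicate, obj, pmid) tuples."""
    nodes, edges = {}, []
    for subject, predicate, obj, pmid in selected:
        for concept in (subject, obj):
            nodes[concept] = {"data": {"id": concept}}
        edges.append({"data": {
            "id": f"{subject}|{predicate}|{obj}",
            "source": subject,
            "target": obj,
            "label": predicate,
            "evidence": [pmid],  # PMIDs to display when the user clicks the edge
        }})
    return {"elements": {"nodes": list(nodes.values()), "edges": edges}}

selected = [("5-Hydroxytryptophan", "TREATS",
             "Sepiapterin reductase deficiency", "12345678")]  # placeholder PMID
print(json.dumps(to_cytoscape_elements(selected), indent=2))
```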

Knowledge.Bio is a free, open-source application that can provide deep, personal, concise, shareable views into the “Big Knowledge” scattered across the biomedical literature.  It is freely available online, along with its source code.

Wednesday, October 21, 2015

Poof it works - using wikidata to build Wikipedia articles about genes

Infobox for ARF6, rendered entirely from Wikidata content
The Gene Wiki team has been hard at work filling wikidata with useful content about genes, diseases, and drugs using the new and improved ProteinBoxBot.  Now, we are starting to see the fruits of this labor in the context of Wikipedia.

The Gene Wiki project has programmatically created and maintained the infoboxes to the right of all the articles in Wikipedia about human genes since about 2008 [Huss 2008].  This process has entailed the construction of a unique template containing all of the relevant data for each gene.  For example, here is the code for the template for the ARF6 gene.  Because Wikipedia previously had no underlying database, the template itself was where the data was stored, and altering that content programmatically means parsing the template as a string.  It's ugly (sorry Jon) and there are more than 11,000 of these templates to maintain (one per gene in Wikipedia).

Now the same data can be represented in Wikidata, a queryable, open graph of claims about the world, backed by references and specified by qualifiers [Vrandečić 2014].  With the content needed to render the infobox all in place, we can replace 11,000+ complex templates that require string parsing to maintain with a single, reusable template that serves all of them.

The first cut at the new template is {{infobox gene}}.  If you put that on any article about a human gene, you ought to get the complete infobox for the article without any further ado.  Poof!  You can view it in action on this revision for ARF6.  We haven't rolled out the new template across all the articles yet, but hope to see that happen in the coming months.  Remaining issues include: better error-handling in the template code, better ways to give users the ability to edit the associated data in wikidata, and updates to all of the code that produces gene wiki articles.  If you want to help, chime in on the module:wikidata thread.
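
For readers who want to poke at the underlying data directly, here is a rough Python sketch of pulling a gene's statements out of Wikidata through its public MediaWiki API.  To be clear, this is not how {{infobox gene}} works internally (the template reads the statements via Lua in module:wikidata at page-render time), and the search string and the choice of property P351 (Entrez Gene ID) are illustrative assumptions.

```python
# Sketch: fetch gene claims from Wikidata's public API.  It simply shows that
# the data now lives in a queryable store instead of 11,000+ hand-maintained
# templates; the real infobox reads the same statements through Lua.
import requests

API = "https://www.wikidata.org/w/api.php"

# 1. Find a candidate Wikidata item for the gene by label search.
hits = requests.get(API, params={
    "action": "wbsearchentities", "search": "ARF6", "language": "en",
    "type": "item", "format": "json"}).json()["search"]
qid = hits[0]["id"]  # naive: takes the first hit; a real tool would disambiguate

# 2. Fetch its claims and print any Entrez Gene ID statements (property P351).
entity = requests.get(API, params={
    "action": "wbgetentities", "ids": qid, "props": "claims",
    "format": "json"}).json()["entities"][qid]

for statement in entity.get("claims", {}).get("P351", []):
    print(qid, "Entrez Gene ID:", statement["mainsnak"]["datavalue"]["value"])
```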

Tuesday, July 7, 2015

Recruiting NLP-crowdsourcing-semantic-web postdoc or staff scientist

Our laboratory at The Scripps Research Institute in beautiful San Diego, California is recruiting a talented individual to help us use crowdsourcing to push the boundaries of biomedical information extraction and its applications.  We are looking for someone with experience in natural language processing (statistical or linguistic), machine learning, and knowledge representation.  This person would work to integrate efforts across several related projects.
Ongoing and nascent projects include:
Sound like fun? Ready to jump in?
Contact Andrew Su and/or Benjamin Good for more information.

p.s. We have other openings in related areas!

Wednesday, March 4, 2015

crowdsourcing machine learning NLP challenge

There is a TopCoder contest running right now that involves machine learning, crowdsourced data, and natural language processing.  There is $41,000 up for grabs!  It will be distributed in many smaller prizes so there are plenty of opportunities to win something.

You need to register by tomorrow (March 4, 2015) to participate!  

More details here:

Tuesday, February 24, 2015

Return of OntoLoki - on a Freaky Friday

One of my favorite traditions initiated in Mark Wilkinson's laboratory (where I did my PhD) was "Freaky Friday".  Following the spirit of Google's "20% time", lab members were encouraged to take some time on Friday to pursue their own crazy ideas, leaving their core projects aside for a few hours.  This was fun and ended up producing some useful things like "Tag clouds for summarizing web search results" (one of my most cited articles and only 2 pages long!) and "The Entity Describer".  In this spirit, I set aside some time last Freaky Friday to do something very far away from the things I should really be doing.  Sadly, I didn't hack something fun together.  Instead, per Dr. Wilkinson's 50th request for me to do this, I performed a Frankensteinian resurrection.  Digging deep into the grave of my PhD dissertation, I found the one unpublished chapter, cleaned it up, and deposited it in the arXiv.  The project, called OntoLoki, is now undead!  Check out this undead manuscript about automatic quality evaluation for ontologies.

I am OntoLoki!  Feed me your ontologies for evaluation!

OntoLoki: an automatic, instance-based method for the evaluation of biological ontologies on the Semantic Web

Benjamin M. Good
Gavin Ha
Chi K. Ho
Mark D. Wilkinson

Friday, February 20, 2015

Building a Garden of Biological Knowledge

On March 13, 2013, I wrote up an idea in my notebook that I called 'Pubmed Daily'.  The concept was to build a system that would leverage large-scale crowdsourcing/citizen science and machine learning to produce a high-quality, structured representation of the knowledge in every abstract in PubMed on the same day that the abstract appeared online.  Nearly two years later, based mainly on the labor of outreach coordinator Ginger Tsueng, group leader Andrew Su, and programmer Max Nanis, the idea is just starting to bear fruit (albeit small, perhaps grape-sized, fruit).  As the bits and pieces start to come together, I thought it would be worthwhile to share the high-level vision as it exists now.

A Garden of Biological Knowledge

We want to build an information management system (or systems) that supports the work of three key groups of people: bioinformaticians such as Andrew, biologists such as Hudson Freeze, and patient advocates such as the parents and friends of Bertrand Might, a child with a rare genetic disorder related to the NGLY1 gene.  The over-arching goal is to produce more rapid biomedical advances based on more effective use of existing knowledge in the processes of hypothesis generation and high-throughput data analysis.

The thinking goes like this.  Given a high-quality, structured knowledge base such as the Gene Ontology (GO), Andrew and people like him can make many different kinds of discovery and analytical tools that help scientists such as Hudson work more effectively (and they do; there are thousands of tools that use the GO).  The problem is that the generation of knowledge bases like the GO is a long, slow, expensive process that in no way keeps pace with the advance of knowledge as represented in the literature.  Information extraction systems like DeepDive and SemRep can theoretically go a long way toward addressing this problem.  However, humans remain more effective (though obviously dramatically slower) readers.

Can we use computers to seed a garden of knowledge that can be tended and grown by citizen scientists?

Given a compelling argument, clear instructions, and an effective user interface, we think that large numbers of people from the general public could be assembled to work on improving the results of a biomedical information extraction system.  We, and other groups, have been experimenting with various related tasks using the Amazon Mechanical Turk and are now confident that "the crowd" can, in aggregate, do text-processing work at or above expert level.  A recent conversation with a leader of the Zooniverse project, a collection of online citizen science projects with more than a million people contributing, leads us to believe that it would be possible to attract tens to hundreds of thousands of people to participate in an effort like this.  Recently, we took the first steps towards testing that assumption via a short but successful test run of the Mark2Cure Web application.
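
As a toy illustration of what "in aggregate" can mean (this is not Mark2Cure's actual aggregation method, and the spans, labels, and threshold below are invented for the example), noisy entity-mention annotations from several workers can be combined by simple voting:

```python
# Toy majority-vote aggregation of entity-mention annotations produced by
# several workers reading the same abstract.  Illustrative only.
from collections import Counter

def aggregate_mentions(worker_annotations, min_fraction=0.5):
    """worker_annotations: one set of (start, end, label) spans per worker."""
    n_workers = len(worker_annotations)
    votes = Counter(span for spans in worker_annotations for span in spans)
    return {span for span, count in votes.items()
            if count / n_workers >= min_fraction}

# Three hypothetical workers marking mentions in the same sentence:
workers = [
    {(10, 15, "gene")},
    {(10, 15, "gene"), (30, 37, "disease")},
    {(10, 15, "gene")},
]
print(aggregate_mentions(workers))  # only the span all three agreed on survives
```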

Can we build a new generation of tools for working with structured biomedical knowledge at massive scales and use these to empower the rising community of citizen scientists?

Aside from the knowledge bases themselves, we are also interested in building better tools for navigating this information and in putting them into the hands of both professionals like Hudson and the many very intelligent, highly motivated people from other domains who might also be able to find something important given the chance (e.g. Matthew Might).

Can a large volunteer work force help teach computers to read?

By engaging volunteers at scale, we hope to provide the developers of information extraction algorithms with the data they need to raise their approaches to human levels of quality.

We have a long, challenging road ahead on this project but the path ahead is starting to take shape and the future looks bright!

Tuesday, November 11, 2014

Hackathon recap: Network of Biothings in San Diego

This past weekend, Nov. 7-9, the Neuroscience Information Framework, the Su Lab, NDex, The International Society for Biocuration and the San Diego Center for Systems Biology hosted the second Network of Biothings Hackathon.  The event took place at Atkinson Hall (Calit2) on the UC San Diego campus.  For the record, and to enhance what can already be seen in the Twitter story of this event, here is what went down.

Friday evening, about 30 people arrived at Atkinson Hall, where they:
  1. consumed Indian food and American beverages, 
  2. met each other and talked over project ideas, 
  3. each decided on a project to work on for the weekend
Saturday the group:
  1. ate bagels for breakfast, sandwiches and leftover Indian food for lunch, and pizza for dinner
  2. hacked furiously to develop 8 distinct team projects
Sunday morning, we came together for more bagels, some last-minute hacking, and rapid presentation formation.  At 10:30am we got started with eight 10-minute project presentations.  Here is a very brief list of the projects, with links out to all of the code that was developed.  (More details are available on the ideas page.)
  1. BioPolymer: A set of embeddable web components to display and edit bio data. Initially data from MyGene & MyVariant. Team: Mark Fortner, Keiichiro Ono
  2. SameSame: a dynamic tool for finding and visualizing the degree of similarity between any set of biomedical concepts.  Team: Benjamin Good, Maulik Kamdar, Alan Higgins, Alex Williams
  3. CIViC - Clinical Interpretation of Variants in Cancer:  Crowdsourcing and web interface for curation of clinically actionable evidence in cancer.  Team: Obi Griffith, Adam Coffman, Martin Jones, Karthik G, Jon Cristensen, Julianne Schneider
  4. Citizen Science: An app to enable people to extract structured ‘facts’ (subject predicate object triples) from unstructured text.  Here is the project presentation.  Team: Richard Good, Hannes Niedner, Andrew Su
  5. SBiDer: Synthetic Biocircuit Developer: A web-app to search a database of operons [functional biochemical pathways] to use them in new and novel ways [to make synthetic organisms such as the ones used to make this Malaria treatment].  Team: Justin Huang, Kwat Yeerna, Fernando Contreras, Joaquin Reyna, Jenhan Tao
  6. NDex: The NDEx project provides a public website where scientists and organizations can share, store, manipulate, and publish biological network knowledge. Team: Dexter Pratt
  7. fiSSEA: A framework that integrates and retrieves functional prediction annotations (or any type of annotation) for knowledge discovery, specifically implementing CADD scores for “functional impact SNP Set Enrichment Analysis”. Team: Adam Mark, Erick Scott, Chunlei Wu
  8. MyGene.Info Taxonomy Query: Added detailed taxonomy information to MyGene.info.  Allows queries based on taxonomy ID and advanced queries based on hierarchies of taxonomic nodes.  Team: Greg Stupp, Chunlei Wu
Hackathon Trophy

And the winners were...
#1 Citizen Science
#2 SBiDer

There were a lot of very exciting things about this event.  In addition to a strong core of academics from multiple universities, we also had local app and algorithm developers from industry.  While the US west coast tilt was strong, we did have representatives from St. Louis and one participant who came all the way from Canada.  We also had a very strong undergraduate team (taking second place!).  All of the projects clearly made real progress over the weekend, with some excellent cross-pollination of code and ideas.
And the coolest thing about this event??
My Dad was on the winning team :).. Go Dad!
Special thanks to Martin Jones for running around with me picking up bagels, pizza, drinks, and snacks for everyone and to Jeffrey Grethe for keeping everything running smoothly at the event, pushing around tables, helping clean up after the hackers, and calmly handling everything that needed to be done.  

Tuesday, November 4, 2014

What is a hackathon?

As an organizer of the upcoming Network of Biothings Hackathon at UC San Diego, I've been asked by a number of people what a hackathon is, exactly.  I'm repurposing one of those responses here (originally posted on the San Diego iOS developers meetup group).

The main idea is that a variety of different people come together to meet each other and make something together - almost always something open source. In the case of our hackathon, the purpose is specifically to engender new collaborations. For the academics these could translate into new research programs and new collaborative grant proposals. For industry folks, these could turn into new products.  Ideally, a hackathon can bring together the elements of a great new team. For example, I'm a back-end database guy with an understanding of bioinformatics. I'd love to find a front-end web or app developer to help make my data and algorithms useful to the rest of the world.

Here are a few questions I've fielded:

  • why do developers pay to build apps for somebody else? 
The money goes to pay for food, drinks, facilities fees, and to a small extent the prize. Developers don't pay to develop for someone else; they pay to meet other people and to eat. No one is under any obligation to give their code to anyone else. You would be welcome to come by and work on your own project. In fact, we are actively trying to get more project ideas posted for our hackathon. (Note that the fee we are asking for is only $40, and many hackathons with larger sponsors are free.)
  • why do developers put their time to do work for free?
The main point here is team formation. If you have a great app, this is also a way to advertise it - especially if it wins a prize.
  • do the teams who paid to participate build apps and one is chosen as winner? 
Yes, this is the basic idea.
  • are the rest thrown away? 
Nothing is thrown away. The participants maintain ownership of all code that is written. (Though open source is very strongly encouraged...)
  • Am I too young/old to participate?
Nope!  All are welcome.

In conclusion, hackathons are fun, social events for people who like to build new things, meet new people, and perhaps change the world.  Sign up for ours and find out for yourself!

Monday, October 6, 2014

Network of Biothings Hackathon at UC San Diego

Can you code?

Are you interested in the intersection of computer science and biology (bioinformatics) ?

Do you want to meet interesting people?

Are you excited about building new pieces of software that could change the face of science and medicine?

Do you want to win a cash prize for your open source code?

Then it's clearly time to:
  • Location:  UC San Diego on the 5th floor of the CALIT2 building
  • Sign up:  sign up 
  • Schedule:
  • Friday, November 7
    • 6-10pm : Welcome social / project team formation
  • Saturday, November 8
    • 9am-? : Hacking !
  • Sunday, November, 9 
    • 9am-10:30: Final hacking / presentation preparation
    • 10:30-11:30: Pitches and Demos
    • 11:45: Prize announcements

Click here for additional details.

Wednesday, October 1, 2014

Conference proceedings are citable, stop double-dipping

Over the last several years there has been a trend among conferences such as the Bio-Ontologies Special Interest Group meeting at ISMB and Semantic Web Applications and Tools for Life Sciences (SWAT4LS) to invite article submissions from people who want to present at the conference and then subsequently to invite presenters to expand their article into an "official" publication in an associated journal.  Since the PDFs of the articles submitted to the conference usually end up online, as they should, this results in a situation where first reviewers, and later readers, are confronted with two versions of essentially the same article - typically with the same title, author list, and often the same abstract.  This causes problems for reviewers, as this kind of overlap with prior work (even from the same authors) would normally be grounds for rejection - yet because of this bizarre arrangement with the conference, reviewers are supposed to treat the original conference article as if it were a preprint of the journal version, despite the fact that it is a citable entity on the Web and is never referred to as a preprint anywhere.

Conference organizers, please stop this madness.  Here are three models that would be better.

  1. Following the International Biocuration Conference model, invite submissions directly to the partner journal first and then choose presenters from the successful submissions and independently submitted abstracts.  No confusion.  One good, citable paper.  Probably higher quality conference submissions.
  2. Put the articles submitted to the conference in a pre-print server such as arXiv or bioRxiv and continue with the concept of an expanded article in a journal.
  3. Do what the computer science community does and recognize contributions to conference proceedings as citable articles and do away with the attempt to get an 'official' journal publication in addition to the conference citation.