Sunday, March 25, 2018

1 Cyclotron Road

I've recently started working as a contractor for the BBOP group at Lawrence Berkeley National Labs.  To my son's great disappointment (and probably my father's as well) I am not working with the cyclotron, but instead am now working with the Gene Ontology (GO) project.  I am involved in a transition that might be put most simply as a migration from 'GO as a gene tagging system' to 'GO as an activity flow modeling system'.  For a preview of what the new approach looks like see a MAP kinase cascade example on the nascent GO Causal Activity Modeling environment (AKA Noctua).

Though it is a departure from the last 10+ years of working on crowdsourcing in bioinformatics, it's also a much-needed re-centering of focus around the core reasons why I got involved in crowdsourcing in the first place.  I have this crazy notion that knowledge, once ripped at great cost from nature's powerful grasp, should be cherished, shared, and used in the production of more knowledge.  All of the crowdsourcing work was originally motivated by the goal of finding new, potentially better, ways to make that process happen.  I'm still interested in finding ways to include (much) larger audiences in the process of curating and growing our collective knowledge base.  But, for now, I am happy and proud to have the opportunity to simply dig in with my own hands to help take the GO to the next level.

Perhaps someday (like maybe when my youngest starts kindergarten..) I will have a chance to cross the semantic and social web streams once again.

Monday, June 19, 2017

Leaving Scripps

After seven years in the Su Lab with six spent at Scripps Research, I've decided to move on to something new. There are obviously many factors that go into a decision like this, but I'll leave you with just the one that is probably the most important. I want to support my wife's desire to re-enter the workforce. I think that this is the right move for our family - at least strongly enough to run the experiment! According to one smart fellow from Princeton, this concept of fathers taking on the 'lead parent' role is a good idea. Here's hoping it works out for us!


Friday, June 16, 2017

Science Game Lab: tool for the unification of biomedical games with a purpose

Scripps team: Benjamin M. Good, Ginger Tsueng, Andrew I Su
Playmatics Team: Sarah Santini, Margaret Wallace, Nicholas Fortugno, John Szeder, Patrick Mooney
With helpful ideas from: Jerome Waldispuhl, Melanie Stegman

Games with a purpose and other kinds of citizen science initiatives demonstrate great potential for advancing biomedical science and improving STEM education.  Articles documenting the success of projects such as Foldit and Eyewire in high impact journals have raised wide interest in new applications of the distributed human intelligence that these systems have tapped into.  However, the path from a good idea to a successful citizen science game remains highly challenging.  Apart from the scientific difficulties of identifying suitable problems and appropriate human-powered solutions, the games still need to be created, need to be fun, and need to reach a large audience that remains engaged for the long term.  Here, we describe Science Game Lab (SGL), a platform for bootstrapping the production, facilitating the publication, and boosting both the fun and the value of the user experience for scientific games with a purpose.

Ever since the Foldit project famously demonstrated that teams of human game players could often outperform supercomputers at the challenging problem of 3D protein structure prediction, so-called ‘games with a purpose’ have seen increasing attention from the biomedical research community.  A few other games in this genre include: Phylo for multiple sequence alignment, EteRNA for RNA structure design, Eyewire for mapping neural connectivity, The Cure for breast cancer prognosis prediction, Dizeez for gene annotation, and MalariaSpot for image analysis.  Apart from tapping into human intelligence at scale, these efforts have also produced valuable educational opportunities.  Many of these games are now used to introduce their underlying concepts in classroom settings where games in all forms are increasingly working their way into curriculums.  Concomitant with the rise of these ‘serious games’, citizen science efforts such as the Zooniverse and Mark2Cure have sought similar aims but have packaged their work as volunteer tasks, analogous to unpaid crowdsourcing tasks, rather than as elements of games.

Many of these initiatives have succeeded in independently addressing challenging technical problems through human computation, improving science education, and generally raising scientific awareness.  However, with so much interest from the scientific community and a booming ecosystem of game developers, there are actually relatively few of these games in operation now.  Recognizing the opportunity, various groups have attempted to push the area forward through new funding opportunities and through various ‘game jams’ such as the one that produced the game ‘genes in space’ for use in analyzing microarray data in cancer.  Here, we take a different approach towards expanding the ecosystem of games with a scientific purpose.  Rather than attempting to seed the genesis of specific new game-changing games, we hope to lower the barrier to entry for new games and related citizen science tasks to generally promote the development of the entire field.  With this high-level aim in mind, we developed Science Game Lab (SGL) to make it easier for developers to create successful scientific games or game-like learning and volunteer experiences.  Specifically, SGL is intended to address the challenges of recruiting players and volunteers, keeping them engaged for the long term, and reducing the development costs associated with creating a scientific gaming experience.

The Science Game Lab Web application
SGL is a unique, open-source portal supporting the integration of games and volunteer experiences meant to advance science and science education.  Unlike other related sites that act more like volunteer management and/or project directory services, such as SciStarter and Science Game Center, SGL is not simply a listing of related websites.  Rather, it is an attempt to create a user experience that takes place directly within the SGL context yet still incorporates content from third parties.  The system is largely inspired by game industry portals such as Kongregate that enable developers to incorporate their games directly into a unified metagame experience.

Players can use the portal to find and play games, and their achievements within those games are tracked on site-wide high score lists and achievement boards (Figure 1).  Players can earn the SGL points that drive these leaderboards for actions taken in different games.  In this way, SGL provides developers with access to a metagame that can be used to encourage players in addition to the incentives offered within individual games (Figure 2).  This metagame can also be used by the system administrators to help direct the player community’s attention to particular games or particular tasks within games.  For example, actions taken on new games might earn more points than actions taken on more established games as a way to ‘spread the wealth’ generated by successful games.
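To make the ‘spread the wealth’ idea concrete, here is a minimal Python sketch of a site-wide leaderboard in which actions in a newer game are worth more than actions in an established one.  It is not the SGL implementation: the game names, the multiplier values, and the `build_leaderboard` function are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical per-game multipliers: a newer game earns more SGL points per action.
GAME_MULTIPLIER = {"new_game": 2.0, "established_game": 1.0}


def build_leaderboard(events, multipliers):
    """Sum weighted points per player and return (player, total) pairs, highest first."""
    totals = defaultdict(float)
    for player, game, base_points in events:
        # Games without a configured multiplier count at face value.
        totals[player] += base_points * multipliers.get(game, 1.0)
    # Sort by descending score, then by name so that ties are ordered deterministically.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


# Made-up events: (player, game, base points for the action).
events = [
    ("ada", "established_game", 10),
    ("bob", "new_game", 6),
    ("ada", "new_game", 2),
]
leaderboard = build_leaderboard(events, GAME_MULTIPLIER)
```

Administrators could then steer attention toward a game simply by changing its multiplier, without touching the game itself.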

Figure 1.  SGL home page demonstrating site-wide high score list, game listing, and links to achievements, help, and user profile information.
Figure 2.  Badges displayed on user’s profile page.  Available badges not yet achieved are greyed out.
Developers interact with SGL by incorporating a small JavaScript library into their application and using the SGL ‘developer dashboard’ to pair up events in their game with points, badges and quests managed by the SGL server.  At this time, SGL only supports games that operate online as Web applications.  The games are hosted by the developers and rendered in the SGL context within an iframe.  The SGL iframe provides a ‘heads up display’ that gives game players real-time feedback on events sent back to the SGL server, such as earning points, gathering badges, or progressing through the stages of a quest (Figure 3).  This display provides developers with the ability to add game mechanics to sites that are not overtly games.  For example, Wikipathways incorporated a pathway editing tutorial into SGL, using the heads up display to reward users with SGL points and badges for completing various stages of the tutorial.   The tutorial also took advantage of the SGL quest-building tool (Figure 4).  Games are submitted by developers for approval by SGL administrators.  Once approved, the games appear in the public view and can be accessed by any player.
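The pairing of game events with points, badges and quests can be sketched roughly as follows.  This is a server-side illustration in Python, not the actual SGL JavaScript library or its API: the event names, the `Reward` and `PlayerState` classes, and the badge name are assumptions invented for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Reward:
    points: int = 0
    badge: Optional[str] = None


@dataclass
class PlayerState:
    points: int = 0
    badges: set = field(default_factory=set)
    completed_steps: list = field(default_factory=list)


# Hypothetical pairing of game events with rewards, as a developer might
# configure it in a dashboard.
EVENT_REWARDS = {
    "tutorial.step1_done": Reward(points=10),
    "tutorial.step2_done": Reward(points=10),
    "tutorial.finished": Reward(points=50, badge="Pathway Editor"),
}

# A quest is modeled here as an ordered list of event names.
TUTORIAL_QUEST = ["tutorial.step1_done", "tutorial.step2_done", "tutorial.finished"]


def handle_event(state, event_name):
    """Apply the reward for one event and return a feedback message for the display."""
    reward = EVENT_REWARDS.get(event_name)
    if reward is None:
        return f"ignored unknown event: {event_name}"
    # Note: in this sketch a repeated event earns its points again.
    state.points += reward.points
    if reward.badge:
        state.badges.add(reward.badge)
    if event_name not in state.completed_steps:
        state.completed_steps.append(event_name)
    return f"+{reward.points} points"


def quest_progress(state, quest):
    """Return the fraction of quest steps the player has completed."""
    done = sum(1 for step in quest if step in state.completed_steps)
    return done / len(quest)


player = PlayerState()
for name in ["tutorial.step1_done", "tutorial.step2_done"]:
    handle_event(player, name)
```

The message returned by `handle_event` stands in for what a heads up display would show the player; a real integration would send the event from the game's page to the server instead of calling a local function.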

Figure 3.  The heads up display provided by the SGL iframe.  Shows events captured by the API and provides users with immediate feedback.   

Figure 4.  Tasks in SGL can be grouped into quests.  The figure shows a particular user’s progress through various quests available within the system.

If a critical initial mass of effective games can be integrated, SGL could strongly benefit new developers by providing immediate access to a large player population.  Site-level status, identity and community features can help with the even greater challenge of long-term player engagement, a noted problem in the field.  Within the context of science-related gaming, such status icons might eventually be used as practically useful, real-world marks of achievement in line with the notion of ‘Open Badges’.  As demonstrated by the Wikipathways tutorial application, SGL can be used to replace the need for developers to host their own login systems, user tracking databases, and reward systems - all of which can be accomplished using the SGL developer tools. Citizen scientists are not homogeneous in their motivations. Designing to be inclusive of gamers and non-gamers can be challenging. By offering an alternative means of experiencing a web-based citizen science application, SGL allows developers to cater to both their gaming and non-gaming contributor audience. Together, these features unite to raise the overall potential for growth within the world of citizen science and scientific gaming.

Future directions
SGL is currently functional, but so far has attracted only a small number of developers willing to integrate their content into the portal.  Future work would need to address the challenge of raising the perceived value of integration with the site while lowering the perceived difficulty.  Looking forward, key challenges for the future of SGL include better support for:
  • games meant for mobile devices
  • development of quests that span multiple games
  • teachers to build SGL-focused lesson plans and track student progress
  • creating new ‘SGL-native’ games
  • integration with external authentication systems

None of these are insurmountable challenges, but they all require significant continued investment in software development.  Because SGL is an open source project, we encourage contributions from anyone who shares in our vision of spreading and doing science through the grand unifying principle of fun.

Building communities of knowledge with Wikidata

As the Wikimedia Movement works to define its strategy for the next fifteen years, it is worthwhile to consider how its recent product Wikidata may fit into that strategy.  As its homepage states,

“Wikidata is a free and open knowledge base that can be read and edited by both humans and machines.”

Wikidata is a particular kind of database designed to capture statements about items in the world with references that support those statements.  Because Wikidata is a database, its contents are meant to be viewed in the context of software that retrieves the data through queries and then renders the data to meet the needs of a user in a certain context.  The same data can thus be viewed on Wikidata-specific pages and in the infoboxes of Wikipedia articles.  Importantly, Wikidata content can also be used in applications outside of the Wikimedia family.
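As a rough illustration of 'statements with references, retrieved by query and rendered for a context', here is a small Python sketch.  It is a toy model rather than the Wikidata data model or API: the `Statement` and `Reference` classes, the item names, and all of the values are invented placeholders.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Reference:
    stated_in: str  # name of the supporting source (placeholder values below)


@dataclass
class Statement:
    item: str
    prop: str
    value: str
    references: list = field(default_factory=list)


# Made-up statements; the items and values are placeholders, not real data.
STATEMENTS = [
    Statement("example gene", "symbol", "GENEA", [Reference("example database")]),
    Statement("example gene", "chromosome", "1", [Reference("example database")]),
    Statement("other gene", "symbol", "GENEB"),
]


def query(statements, item):
    """Retrieve every statement about one item."""
    return [s for s in statements if s.item == item]


def render_detail_page(statements):
    """Verbose view: each statement together with its supporting references."""
    lines = []
    for s in statements:
        sources = ", ".join(r.stated_in for r in s.references) or "no reference"
        lines.append(f"{s.item} | {s.prop} | {s.value} | source: {sources}")
    return lines


def render_infobox(statements):
    """Compact view: property/value pairs only, as an infobox might show them."""
    return {s.prop: s.value for s in statements}


rows = query(STATEMENTS, "example gene")
detail = render_detail_page(rows)
infobox = render_infobox(rows)
```

Both views are produced from the same query result, which is the point: the data is stored once and each application decides how to present it.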

Examples of Wikidata use now include:

The molecular biology community (and in particular the Gene Wiki group) has embraced Wikidata as a global platform for knowledge integration and distribution.  To help envision how Wikidata may fit into the strategic vision of the WMF movement, it is worth taking a look at how and why this particular community is using Wikidata.  

History of the Gene Wiki initiative
The sequencing of the human genome at the beginning of this century and the consequent rush of data and new technology for producing even more data fundamentally changed how research in biology is conducted.  Before the year 2000, research typically proceeded with a single gene focus.  A typical PhD thesis would entail the analysis of the genetics or function of one gene or protein at a time.  A few years after the first genome, however, it became possible to measure the activity of tens of thousands of genes at once, resulting in an omnipresent problem of generating interpretations of experimental results containing hundreds of genes.  While a scientist may come to grasp the literature surrounding a single gene quite well, it is not possible to know everything there is to know about all 20,000+ genes in the genome - particularly when this knowledge is expanding on a minute by minute basis.  As a consequence, there arose a need to produce summaries of what was known about each gene so that researchers could quickly grasp its nature and easily find links to more detailed references as needed.  By 2008, many different research groups had published wikis attempting to allow the scientific community to generate the required articles, e.g. WikiProteins, WikiGenes, and the Gene Wiki.  The Gene Wiki project was unique among this group as it anchored itself directly to Wikipedia and, likely as a result of that decision, has enjoyed long term success.  This initiative works within the English Wikipedia community to encourage and support the collection of articles about human genes.  Its main contributions are the infobox seen on the right hand side of these articles and software for generating new article stubs using that template.

Wikidata and the Gene Wiki project

For the past several years, the Gene Wiki core team (funded by an NIH grant) has focused primarily on seeding Wikidata with biomedical knowledge.  In comparison to managing data via direct inclusion and parsing of infobox templates as before, this makes the data much easier to maintain automatically and, importantly, opens it up for use by other applications.  As a result, Wikipedia isn’t the only application that can use this structured information.   One of the first products of that process was a new module (Infobox_gene) that draws all the needed data to render the gene infobox dynamically from Wikidata, greatly reducing the technical challenge of keeping the data presented there in sync with primary sources.  

In addition to the relatively simple collection of gene identifiers and links off to key public databases that are presented in the infoboxes, Wikidata now has an extensive and growing network of knowledge linking genes to proteins, proteins to drugs, drugs to diseases, diseases to pathogens, pathogens to places, places to events, events to people, and so on and so on.  This unique, open, referenced knowledge graph may eventually become the closest thing to ‘the sum of all human knowledge’.  Capturing knowledge in this structured form makes it possible to use it in all kinds of applications, each with its own community-specific user experience.  As a case in point, the Gene Wiki group created Wikigenomes based primarily on data loaded into Wikidata.  This was followed quickly by Chlambase, an application specifically focused on distributing and collecting knowledge about different Chlamydia genomes.  These applications provide domain-specific user interface components such as genome browsers that are needed to present the relevant information effectively and thereby attract the attention of specialist users.  These users, in turn, have the opportunity to contribute their knowledge back to the broader community through contributions to Wikidata that can be mediated by the same software.
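One thing a linked network like this makes possible is following a chain of relations from one kind of entity to another.  The Python sketch below runs a breadth-first search over a tiny, made-up set of edges; the node names and predicates are placeholders rather than real Wikidata items or properties, and a real application would query Wikidata instead of an in-memory list.

```python
from collections import deque

# Made-up directed edges: (subject, predicate, object).
EDGES = [
    ("gene:G1", "encodes", "protein:P1"),
    ("protein:P1", "target of", "drug:D1"),
    ("drug:D1", "treats", "disease:X1"),
    ("disease:X1", "caused by", "pathogen:B1"),
]


def build_adjacency(edges):
    """Index the edges by subject so that neighbors can be looked up quickly."""
    adjacency = {}
    for subject, predicate, obj in edges:
        adjacency.setdefault(subject, []).append((predicate, obj))
    return adjacency


def find_path(adjacency, start, goal):
    """Breadth-first search; returns the list of hops from start to goal, or None."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for predicate, neighbor in adjacency.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, path + [(node, predicate, neighbor)]))
    return None


adjacency = build_adjacency(EDGES)
gene_to_disease = find_path(adjacency, "gene:G1", "disease:X1")
```

Because the edges in this sketch are directed, searching from the disease back to the gene finds nothing; a real knowledge graph would typically let you follow relations in both directions.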

Wikidata and the world
The molecular biology research community, as represented by the Gene Wiki project, is an early adopter of Wikidata as a community platform for the collaborative curation and distribution of structured knowledge, but it is not alone.  The same fundamental patterns are already being applied by other communities, e.g. those interested in digital preservation and open bibliography.  In each case, we see communities working to transition from the current dominant paradigm of private knowledge management towards the knowledge commons approach made possible by Wikidata.  This is not unlike the transition from the world of the Encyclopedia Britannica to the world of Wikipedia.  The only important difference is that the knowledge in question is structured in a way that makes it easier to reuse in different ways and in different applications.

Wikidata provides a mechanism for massively increasing the global good generated by the Wikimedia Foundation’s work by capturing knowledge in a form that can be agilely used to empower all manner of software with the sum of human knowledge.  

Friday, March 17, 2017

Happy St. Patrick's Day

Tuesday, January 3, 2017

Cognitive computing and the National Library of Medicine

"IBM Watson for Drug Discovery helps researchers and organizations discover potential new drug targets and additional drug indications. IBM’s cloud-based, enterprise solution analyzes scientific knowledge and data to reveal patterns and connections that accelerate the formation of new hypotheses, increasing the likelihood and pace of scientific breakthroughs."

It bothers me that there is no true open source, open access version of this kind of system.  Should it?  Or should we accept that it costs a lot of money to put together software like this and that there is nothing wrong with making a profit on building good software?

The issue to me is that the root content of the product being sold is knowledge and knowledge is more useful (for producing more of itself) when more people have access to it.  It is impossible to imagine the impact that PubMed/MEDLINE has had on the advance of biomedical science.  Researchers simply could not do their work without it.  As our collective knowledge base expands, tools for using that knowledge will inevitably need to look more and more like Watson and less like digital paper libraries.

Will we ever see the U.S. National Library of Medicine or its equivalent in other countries move into the age of cognitive computing?  Is it solely up to industry to fill the increasingly obvious gap?  I guess it depends where we want to place that power.

Friday, October 23, 2015

Introducing Knowledge.Bio

I just prepared the following poster abstract for the upcoming Big Data 2 Knowledge all-hands meeting at NIH.  Please play with the tool it describes and let us know what you think (it is a work in progress!).  Also, if you have a chance, please stop by the poster and say hello!

Knowledge.Bio: an Interactive Tool for Literature-based Discovery 
Figure: Personal knowledge graph showing literature-derived connections between Sepiapterin Reductase (SPR) and 5-Hydroxytryptophan (a treatment for patients with deleterious mutations in SPR).
Benjamin M. Good, Ph.D. (1); Richard M. Bruskiewich, Ph.D. (2); Kenneth C. Huellas-Bruskiewicz (2); Farzin Ahmed (2); Andrew I. Su, Ph.D. (1)
(1) The Scripps Research Institute, La Jolla, CA, USA. (2) STAR Informatics / Delphinai Corporation, Port Moody, BC, Canada

PubMed now indexes roughly 25 million articles and is growing by more than a million per year.  The scale of this “Big Knowledge” repository renders traditional, article-based modes of user interaction unsatisfactory, demanding new interfaces for integrating and summarizing widely distributed knowledge.  Natural language processing (NLP) techniques coupled with rich user interfaces can help meet this demand, providing end-users with enhanced views into public knowledge, stimulating their ability to form new hypotheses.

Knowledge.Bio provides a Web interface for exploring the results from text-mining PubMed.  It works with subject, predicate, object assertions (triples) extracted from individual abstracts and with predicted statistical associations between pairs of concepts.  While agnostic to the NLP technology employed, the current implementation is loaded with triples from the SemRep-generated SemmedDB database and putative gene-disease pairs obtained using Leiden University Medical Center’s ‘Implicitome’ technology.  

Users of Knowledge.Bio begin by identifying a concept of interest using text search.  Once a concept is identified, associated triples and concept-pairs are displayed in tables.  These tables have text-based and semantic filters to help refine the list of triples to relations of interest.  The user then selects relations for insertion into a personal knowledge graph implemented using cytoscape.js.  The graph is used as a note-taking or ‘mind-mapping’ structure that can be saved offline and then later reloaded into the application.  Clicking on edges within a graph or on the ‘evidence’ element of a triple displays the abstracts where that relation was detected, thus allowing the user to judge the veracity of the statement and to read the underlying articles.
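The workflow above (find triples for a concept, filter them, add the chosen ones to a personal graph, then save and reload it) can be sketched in a few lines of Python.  This is an illustration only, not Knowledge.Bio's code: the `Triple` and `PersonalGraph` classes are hypothetical, the concepts and abstract identifiers are placeholders, and the real application keeps its graph in cytoscape.js rather than in a Python dictionary.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    subject: str
    predicate: str
    obj: str
    evidence: tuple  # placeholder identifiers of the supporting abstracts


def filter_triples(triples, concept, text=None, predicates=None):
    """Keep triples that mention the concept and pass the optional filters."""
    results = []
    for t in triples:
        if concept not in (t.subject, t.obj):
            continue
        if predicates is not None and t.predicate not in predicates:
            continue
        if text is not None and text.lower() not in f"{t.subject} {t.predicate} {t.obj}".lower():
            continue
        results.append(t)
    return results


class PersonalGraph:
    """A note-taking graph: each edge remembers the evidence behind it."""

    def __init__(self):
        self.edges = {}  # (subject, predicate, obj) -> evidence tuple

    def add(self, triple):
        self.edges[(triple.subject, triple.predicate, triple.obj)] = triple.evidence

    def evidence_for(self, subject, predicate, obj):
        return self.edges.get((subject, predicate, obj), ())

    def save(self):
        """Serialize the graph to a JSON string that can be stored offline."""
        return json.dumps([
            {"subject": s, "predicate": p, "object": o, "evidence": list(ev)}
            for (s, p, o), ev in self.edges.items()
        ])

    @classmethod
    def load(cls, text):
        graph = cls()
        for row in json.loads(text):
            key = (row["subject"], row["predicate"], row["object"])
            graph.edges[key] = tuple(row["evidence"])
        return graph


# Made-up triples with placeholder concepts and evidence identifiers.
TRIPLES = [
    Triple("concept A", "TREATS", "concept B", ("abstract-1", "abstract-2")),
    Triple("concept A", "INTERACTS_WITH", "concept C", ("abstract-3",)),
    Triple("concept D", "TREATS", "concept B", ("abstract-4",)),
]

hits = filter_triples(TRIPLES, "concept A", predicates={"TREATS"})
graph = PersonalGraph()
for triple in hits:
    graph.add(triple)
restored = PersonalGraph.load(graph.save())
```

Keeping the evidence attached to each edge is what lets a click on that edge lead back to the abstracts behind it, so the user can judge the statement for themselves.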

Knowledge.Bio is a free, open-source application that can provide deep, personal, concise, shareable views into the “Big Knowledge” scattered across the biomedical literature.  It is available at, with source code at

Wednesday, October 21, 2015

Poof it works - using wikidata to build Wikipedia articles about genes

Figure: Infobox for ARF6, rendered entirely from Wikidata content.
The Gene Wiki team has been hard at work filling Wikidata with useful content about genes, diseases, and drugs using the new and improved ProteinBoxBot.  Now, we are starting to see the fruits of this labor in the context of Wikipedia.

The Gene Wiki project has programmatically created and maintained the infoboxes to the right of all the articles in Wikipedia about human genes since about 2008 [Huss 2008].  This process has entailed the construction of a unique template containing all of the relevant data for each gene.  For example, here is the code for the template for the ARF6 gene.  As Wikipedia previously had no database, that is where the data was stored.  Altering that content programmatically involves parsing that template as a string.  It's ugly (sorry Jon) and there are more than 11,000 of these templates to maintain (one per gene in Wikipedia).

Now, the same data can be represented in Wikidata, a queryable, open graph of claims about the world backed by references and specified by qualifiers [Vrandečić 2014].  Now that the content needed to render the infobox is all there, we can convert 11,000+ complex templates that require string parsing to maintain to a single, re-usable template for all of them.
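The gist of 'one re-usable template instead of one template per gene' can be shown with a small Python sketch.  This is not the actual template or module code: the claim store, the field names, and the `render_gene_infobox` function are hypothetical, and the values are placeholders rather than real gene data.

```python
# Hypothetical claim store: item -> field name -> list of values.
CLAIMS = {
    "gene-A": {"symbol": ["GENEA"], "entrez id": ["0001"], "chromosome": ["1"]},
    "gene-B": {"symbol": ["GENEB"], "entrez id": ["0002"]},
}

# The fields the infobox displays, in order. This list is an assumption.
INFOBOX_FIELDS = ["symbol", "entrez id", "chromosome"]


def render_gene_infobox(item_id, claims, fields=INFOBOX_FIELDS):
    """Render the infobox rows for any gene from the shared claim store."""
    item_claims = claims.get(item_id)
    if item_claims is None:
        raise KeyError(f"no claims found for {item_id}")
    rows = []
    for name in fields:
        values = item_claims.get(name, [])
        # Show a visible marker instead of failing when a field has no data.
        rows.append(f"{name}: {', '.join(values) if values else '(missing)'}")
    return "\n".join(rows)


box_a = render_gene_infobox("gene-A", CLAIMS)
box_b = render_gene_infobox("gene-B", CLAIMS)
```

The same function serves every gene; updating a value means editing the claim store once, with no string parsing of per-gene templates.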

The first cut at the new template is {{infobox gene}}.  If you put that on any article about a human gene, you ought to get the complete infobox for the article without any further ado.  Poof!  You can view it in action on this revision for ARF6.  We haven't rolled out the new template across all the articles yet, but hope to see that happen in the coming months.  Remaining issues include: better error-handling in the template code, better ways to give users the ability to edit the associated data in Wikidata, and updates to all of the code that produces gene wiki articles.  If you want to help, chime in on the module:wikidata thread.

Tuesday, July 7, 2015

Recruiting NLP-crowdsourcing-semantic-web postdoc or staff scientist

Our laboratory at The Scripps Research Institute in beautiful San Diego, California is recruiting a talented individual to help us use crowdsourcing to push the boundaries of biomedical information extraction and its applications.   We are looking for someone with experience in natural language processing (statistical or linguistic), machine learning, and knowledge representation.  This person would work to integrate efforts across several related projects.
Ongoing and nascent projects include:
Sound like fun? Ready to jump in?
Contact Andrew Su and/or Benjamin Good for more information.

p.s. We have other openings in related areas!

Wednesday, March 4, 2015

crowdsourcing machine learning NLP challenge

There is a TopCoder contest running right now that involves machine learning, crowdsourced data, and natural language processing.  There is $41,000 up for grabs!  It will be distributed in many smaller prizes so there are plenty of opportunities to win something.

You need to register by tomorrow (March 4, 2015) to participate!  

More details here: