Friday, June 16, 2017

Building communities of knowledge with Wikidata

As the Wikimedia Movement works to define its strategy for the next fifteen years, it is worthwhile to consider how its recent product Wikidata may fit into that strategy.  As its homepage states,

“Wikidata is a free and open knowledge base that can be read and edited by both humans and machines.”

Wikidata is a particular kind of database designed to capture statements about items in the world, with references that support those statements.  Because Wikidata is a database, its contents are meant to be viewed through software that retrieves the data with queries and then renders it to meet the needs of a user in a particular context.  The same data can thus be viewed on Wikidata-specific pages and in the infoboxes of Wikipedia articles.  Importantly, Wikidata content can also be used in applications outside of the Wikimedia family.
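To make the idea of "statements with references, rendered differently for different contexts" a little more concrete, here is a minimal sketch in Python.  It is a toy model only: the class names (Item, Statement), the helper functions, and all of the data are invented for illustration and are not Wikidata's actual data model or API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Statement:
    """One claim about an item: a property, a value, and supporting references."""
    prop: str
    value: str
    references: tuple = ()


@dataclass
class Item:
    """An entity in the toy knowledge base, identified by a placeholder id."""
    item_id: str
    label: str
    statements: list = field(default_factory=list)


def query(item, prop):
    """Return every statement on the item that uses the given property."""
    return [s for s in item.statements if s.prop == prop]


def render_infobox(item, props):
    """Render only the selected properties as compact 'property: value' lines."""
    lines = [item.label]
    for prop in props:
        for s in query(item, prop):
            lines.append(f"{prop}: {s.value}")
    return "\n".join(lines)


def render_full_page(item):
    """Render every statement together with its references."""
    lines = [f"{item.label} ({item.item_id})"]
    for s in item.statements:
        refs = "; ".join(s.references) or "no reference"
        lines.append(f"{s.prop} = {s.value} [{refs}]")
    return "\n".join(lines)


# Placeholder data, invented purely for illustration.
example = Item(
    item_id="ITEM-1",
    label="Example gene",
    statements=[
        Statement("found in taxon", "Example species", ("Example database",)),
        Statement("encodes", "Example protein"),
    ],
)

print(render_infobox(example, ["encodes"]))
print(render_full_page(example))
```

The point of the sketch is that the stored statements are the same in both cases; only the rendering function changes with the context in which the data is shown.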

Examples of Wikidata use now include:

The molecular biology community (and in particular the Gene Wiki group) has embraced Wikidata as a global platform for knowledge integration and distribution.  To help envision how Wikidata may fit into the strategic vision of the Wikimedia movement, it is worth taking a look at how and why this particular community is using Wikidata.

History of the Gene Wiki initiative
The sequencing of the human genome at the beginning of this century, and the consequent rush of data and of new technology for producing even more data, fundamentally changed how research in biology is conducted.  Before the year 2000, research typically proceeded with a single-gene focus.  A typical PhD thesis would entail the analysis of the genetics or function of one gene or protein at a time.  A few years after the first genome, however, it became possible to measure the activity of tens of thousands of genes at once, resulting in the omnipresent problem of interpreting experimental results involving hundreds of genes.  While a scientist may come to grasp the literature surrounding a single gene quite well, it is not possible to know everything there is to know about all 20,000+ genes in the genome, particularly when this knowledge is expanding on a minute-by-minute basis.  As a consequence, there arose a need to produce summaries of what was known about each gene so that researchers could quickly grasp its nature and easily find links to more detailed references as needed.  By 2008, many different research groups had published wikis attempting to allow the scientific community to generate the required articles, e.g. WikiProteins, WikiGenes, and the Gene Wiki.  The Gene Wiki project was unique among this group in that it anchored itself directly to Wikipedia and, likely as a result of that decision, has enjoyed long-term success.  This initiative works within the English Wikipedia community to encourage and support the collection of articles about human genes.  Its main contributions are the infobox seen on the right-hand side of these articles and software for generating new article stubs using that template.

Wikidata and the Gene Wiki project

For the past several years, the Gene Wiki core team (funded by an NIH grant) has focused primarily on seeding Wikidata with biomedical knowledge.  Compared with the earlier approach of embedding data directly in infobox templates and parsing it back out, this makes the data much easier to maintain automatically and, importantly, opens it up to other applications, so that Wikipedia is no longer the only application that can use this structured information.  One of the first products of that process was a new module (Infobox_gene) that draws all the data needed to render the gene infobox dynamically from Wikidata, greatly reducing the technical challenge of keeping the data presented there in sync with primary sources.

In addition to the relatively simple collection of gene identifiers and links to key public databases that are presented in the infoboxes, Wikidata now has an extensive and growing network of knowledge linking genes to proteins, proteins to drugs, drugs to diseases, diseases to pathogens, pathogens to places, places to events, events to people, and so on.  This unique, open, referenced knowledge graph may eventually become the closest thing to ‘the sum of all human knowledge’.  Capturing knowledge in this structured form makes it possible to use it in all kinds of applications, each with its own community-specific user experience.  As a case in point, the Gene Wiki group created Wikigenomes based primarily on data loaded into Wikidata.  This was followed quickly by Chlambase, an application specifically focused on distributing and collecting knowledge about different Chlamydia genomes.  These applications provide domain-specific user interface components, such as genome browsers, that are needed to present the relevant information effectively and thereby attract the attention of specialist users.  These users, in turn, have the opportunity to contribute their knowledge back to the broader community through contributions to Wikidata that can be mediated by the same software.
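As a rough illustration of why linked, structured statements are so reusable, the sketch below stores a handful of subject-relation-object triples and follows them from a gene to a pathogen.  Everything here is an assumption made for the example: the triples use placeholder names rather than real biomedical facts, the relation labels are illustrative, and the functions are hypothetical helpers, not part of any Wikidata tool.

```python
from collections import deque

# Hypothetical triples (subject, relation, object) with placeholder names.
TRIPLES = [
    ("gene A", "encodes", "protein A"),
    ("drug B", "physically interacts with", "protein A"),
    ("drug B", "treats", "disease C"),
    ("pathogen D", "causes", "disease C"),
]


def neighbours(node):
    """Yield (relation, other) pairs for edges touching the node, in either direction."""
    for subject, relation, obj in TRIPLES:
        if subject == node:
            yield relation, obj
        elif obj == node:
            yield relation, subject


def find_path(start, goal):
    """Breadth-first search; return the list of nodes from start to goal, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for _, other in neighbours(path[-1]):
            if other not in seen:
                seen.add(other)
                queue.append(path + [other])
    return None


print(find_path("gene A", "pathogen D"))
```

Because each link is an explicit statement, an application can chain them together in ways that no single contributor had to anticipate, which is much harder to do with knowledge locked in free text.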

Wikidata and the world
The molecular biology research community, as represented by the Gene Wiki project, is an early adopter of Wikidata as a community platform for the collaborative curation and distribution of structured knowledge, but it is not alone.  The same fundamental patterns are already being applied by other communities, e.g. those interested in digital preservation and open bibliography.  In each case, we see communities working to transition from the currently dominant paradigm of private knowledge management towards the knowledge commons approach made possible by Wikidata.  This is not unlike the transition from the world of the Encyclopedia Britannica to the world of Wikipedia.  The only important difference is that the knowledge in question is structured in a way that makes it easier to reuse in different ways and in different applications.

Wikidata provides a mechanism for massively increasing the global good generated by the Wikimedia Foundation’s work by capturing knowledge in a form that can be readily reused to empower all manner of software with the sum of human knowledge.