Saturday, October 8, 2011

Stepping towards a Semantic Wikipedia

(Update: check out our publication in Database for a full-length, peer-reviewed version of this article.)
It is now possible to specify the nature of the relationships between things described by Wikipedia articles directly in the context of the article. The image below is a screenshot, taken a few moments ago, of the Phospholamban article on Wikipedia (with excited arrows added). The infobox at the top right is dynamically generated from semantic markup in the article using a Wikipedia user script written by my colleague Sal and accessible from his user page.

Semantic markup now live in Wikipedia

How it works
Wikilinks in the article have been annotated with the kind of relationship they indicate using the Semantic Wiki Link (SWL) template. The template allows any Wikipedia editor (including you!) to specify the type of connection that exists between the article where the link is placed and the target of the link. This information is encoded following the microformat pattern: essentially, the meaning of each link is recorded in class attributes on the elements that wrap it.
This works as follows:
  1. Editor inserts a SWL into a Wikipedia article with this syntax:
    • {{SWL | target=protein kinase A | label=PKA | type=substrate_for}}
    • This means "the concept where you see this link is related to protein kinase A (labeled PKA) with the relationship type "substrate for". So, in the example above, it says: "Phospholamban is a substrate for PKA".
    • The {{ }} syntax denotes a Wikipedia template. Templates can take parameters (separated here by |'s) and use them to dynamically produce new wikitext which, in turn, is rendered as HTML when the page is loaded.
  2. When the page is rendered, the template generates the following semi-semantic HTML markup (with some formatting omitted for clarity):
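    Roughly like this (a reconstructed sketch based on the description in step 3; the exact element layout and link rendering are assumptions):
    <span class="SWL"><span class="substrate_for"><a href="/wiki/Protein_kinase_A" title="Protein kinase A">PKA</a></span></span>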

  3. Programs, like the script that generated the infobox and added the green highlighting, can look for the SWL class attribute and then extract the meaning of each SWL link from the class of its first child element - here "substrate_for". (A sketch of such a script follows this list.)
  4. In addition, when the template is processed it adds a category to the article it is placed on that corresponds to the relationship type. (See, for example, the category for substrate.) This category provides a logical grouping (e.g. all things that serve as a biochemical substrate) but, perhaps more importantly, it provides a place to record the meaning of the relationship. That meaning can be defined as text, but it can also be defined through reference to external sources such as ontologies on the semantic web.
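To make step 3 concrete, here is a minimal sketch in JavaScript (emphatically not Sal's actual SWLinfobox.js) of how a program running in the browser might collect these relationships, assuming the markup shape sketched in step 2:

// Rough sketch only: gather SWL relationships from the current page.
// Assumes each SWL is an element with class "SWL" whose first child element
// carries the relationship type as its class and wraps the rendered wikilink.
function collectSWLs(doc) {
  var relations = [];
  var swls = doc.getElementsByClassName('SWL');
  for (var i = 0; i < swls.length; i++) {
    var typed = swls[i].firstElementChild;                     // e.g. <span class="substrate_for">
    var link = typed && typed.getElementsByTagName('a')[0];    // the rendered wikilink
    if (!typed || !link) { continue; }
    relations.push({
      type: typed.className,                                   // "substrate_for"
      target: link.getAttribute('title') || link.textContent,  // "Protein kinase A"
      label: link.textContent                                   // "PKA"
    });
  }
  return relations;
}
// Example: print every relationship found on the page to the browser console.
console.log(collectSWLs(document));

From a list like that, a script can build an infobox, highlight the annotated links, or feed the data to something else entirely.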


Why it's awesome
This pattern makes it possible for the vast number of Wikipedia users to simply and easily contribute machine-readable content to the Web. This enormous user community collaboratively created the world's largest encyclopedia and one of the most valuable websites on the planet. Who better to help build the Semantic Web? While the microformat-like implementation technically leaves much to be desired in terms of robustness and precision, it is a solution that can work. This is demonstrated by the success of projects like Google's recent recipe search, which are based entirely on simple microformats.
How you can help
This is a new idea that not everyone on Wikipedia will be thrilled to see. Some will argue that SWLs clutter the markup and, because Wikipedia itself does not natively support semantic links, do not provide enough value to justify the cost. You can help by:
  • Using the template to enhance articles.
  • Writing code that makes use of the added meaning, such as user scripts, aggregators, or scripts that import the relationships into other structured repositories like FreeBase or DBpedia (see the sketch after this list).
  • Helping define the nature of the semantic links (at their associated category pages) and mapping them to properties defined in ontologies.
  • Discussing (and voting for) the idea on the various 'talk pages' on Wikipedia.
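As a rough illustration of the second point above, here is a sketch of how relationships collected with something like collectSWLs (from the earlier sketch) might be serialized as RDF triples for import into a structured repository. The URI patterns are placeholders chosen for illustration, not an agreed-upon mapping:

// Hypothetical serialization of collected SWL relations to N-Triples.
// The resource and property URI schemes below are assumptions, not a standard.
function toNTriples(subjectTitle, relations) {
  var resource = function (title) {
    return '<http://dbpedia.org/resource/' + title.replace(/ /g, '_') + '>';
  };
  var property = function (type) {
    return '<http://example.org/swl/' + type + '>';   // placeholder property namespace
  };
  return relations.map(function (r) {
    return resource(subjectTitle) + ' ' + property(r.type) + ' ' + resource(r.target) + ' .';
  }).join('\n');
}
// e.g. toNTriples('Phospholamban', collectSWLs(document)) would yield lines like:
// <http://dbpedia.org/resource/Phospholamban> <http://example.org/swl/substrate_for> <http://dbpedia.org/resource/Protein_kinase_A> .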

Why it's awesome again
Did I mention that this pattern makes it possible for the vast number of Wikipedia users to simply and easily contribute machine-readable content to the Web? That's pretty cool if you think about it...

Update: To make the user script work for you so you can see the infobox, do this:

  • Create a Wikipedia user account if you don't have one already
  • Go to (or create) your user page. (e.g., my user name there is i9606, so my user page is located at http://en.wikipedia.org/wiki/User:I9606)
  • Edit your user page and add this to it:
[[/common.js]]
  • Visit your new common.js page and add this to it:
importScript('User:Sal9000/SWLinfobox.js');
  • When that is saved, you should be all set. Now go visit an enhanced page like Phospholamban and look for the green box at the upper right corner.
The script will run whenever you access a Wikipedia page while you are logged in to your account. It's a lot like GreaseMonkey (which I've had some fun with in the past), but it's not tied to your browser and will only work on Wikipedia. If enough people like a user script, it can be added to the default set of Wikipedia user preferences... which would be pretty cool ;). The best part? You (or a programmer friend of yours) can write your own script and make it do whatever you want with the data!

12 comments:

Chris Evelo said...

And still...

Weren't computers meant to help humans? Not the other way around? Your example about a substance being a substrate for a kinase could just as well be solved by using existing biochemical databases. This information is also already being structured into useful semantic resources (e.g. by OpenPhacts.org).

Don't get me wrong, it is wonderful to have such information available from central resources like Wikipedia. But wouldn't it be better to first add the already available machine-readable information in an automated way, and then ask users to just correct it if it is wrong? Then machines would also help humans to make data available for machine reading, and not the other way around.

Egon Willighagen said...

Ben, this looks brilliant!

Chris, two thoughts here:

1. Human language is still our primary way of communication, and databases always come next. Thus, the semantics must really be part of the former, not the latter.

2. The second thing is fact duplication. By having both the human-readable and the machine-readable in *one* document, you significantly reduce the chance that they become contradictory.

Benjamin Good said...

Hi Chris,

Just to clarify a little bit. This is not an attempt to replace curated databases. It is an attempt to make it possible (and easy) for people who would never be able to contribute to a typical database to share their knowledge effectively. The many efforts to mine facts from text illustrate the point that there is a lot of important knowledge that is not yet in structured form, and this is just one more way to try to get to it.

If natural language processing was a perfected discipline, this kind of thing really wouldn't be needed. We could simply let people write and then let our programs interpret the text. As a user of NLP tools, I can tell you that we aren't there yet - not even close. By empowering users with the ability to specify machine readable facts, we can do much better.

Again - not a replacement, just another useful tool in the ecosystem.

Benjamin Good said...

Egon, that is the best comment I've ever received on a blog post, thanks! I nominate you to review this when it's submitted as a manuscript ;)

BTW - In case there is any confusion this wasn't my brilliant idea, it came from Andrew Su. I'm just the messenger.

Chris Evelo said...

Hi Ben, Egon,

Don't get me wrong, I think this is a very important effort. I like it that it enables anyone to structure knowledge and thus make it available for semantic approaches or for inclusion in existing databases. Sorry if my first comment sounded negative.

However, in some fields, especially in biology we really have a lot of high quality curated information in databases. And that is primarily because we do have expert curator teams, lots of them! Two years ago I was at a biocurator conference and there really were hundreds of people there that do curation of biological databases full time. Thanks to them we have databases like UniProt (well actually the SwissProt part), Reactome and IntAct to name just a few.

Now I think it is a waste of effort to have other people recreate that knowledge from scratch; it would be much better if they could see the existing structured knowledge already and could improve it where applicable. I also think that it would just be good to have access to that data at, for instance, Wikipedia, because people might write applications that use it from there instead of from the existing databases. What I really advocate is thus making a coordinated effort to see whether we can pre-fill the data structures that Andrew made. I think that should be done in a coordinated way, in the sense that we try a page, review how well it goes (how relevant the added information is, for instance), and only after we are happy with that move on to the next. We might, for instance, decide not to add metabolizing enzymes to compounds that are mentioned on pages about chemical production or usage in paints or so.

The same argument (look at what is already there first) also goes for filling databases with text-mining results. Only use text mining as such if there is no expert-curated information available. Otherwise I think you should just use the text-mining results as suggestions to curators (expert or community) to update existing knowledge.

One other thing that is really needed is a feedback mechanism to feed the information that was added by Wikipedia users back to the expert curators that maintain databases. That really is something to think about.

We have an example where we tried something like this. In collaboration with the Reactome team we have created a bidirectional converter for Reactome pathways to WikiPathways. So users of WikiPathways can not only use Reactome pathways but also change them. Reactome curators can look at the changes and decide to incorporate them in the expert-curated versions (which are tagged at WikiPathways). So that really makes it a round trip.
The other side of this is that we are integrating WikiPathways pathways into Wikipedia. That makes them more useful, and Wikipedia editors can decide to edit those pathways (not yet any from Reactome, but that could happen); in that case they do the real editing on WikiPathways.

So two things really:
1) Make high quality curated data available for editors before you ask them to structure anything.
2) Think about a mechanism to feed community curated data back to expert curators maintaining databases.

[sorry for the length]

Benjamin Good said...

Hi Chris. So I totally agree with your two points at the bottom, and we are working on them. For 1), see the infobox on all gene pages in Wikipedia. It contains up-to-date info drawn directly from the major databases. For 2), we will be processing and delivering information drawn from these new annotations in a form that anyone, including database curators, can use. Our experience so far is that most databases aren't terribly interested, but we hope that will change with time. We hope to present this idea to that community at the next Biocuration meeting (April 2012 in Washington DC).

I think the real cool part of this idea is that it isn't really under our control. It's simply an enabling tech/standard that I hope will be useful across many domains. If you want to influence how it's used in biology, please get in and help! We'd love a userscript that could dynamically suggest SWLs and check for redundancy...

Andrew said...

Quoting Chris Evelo:
One other thing that is really needed is a feedback mechanism to feed the information that was added by Wikipedia users back to the expert curators that maintain databases. That really is something to think about.


Spot on! We absolutely want to go this route. Finding a partner on the side of the "official" biocuration community that agrees has been the hard part... ;) We'll hopefully be at the Biocuration meeting in April to evangelize in this regard.

Your first point about whether to represent knowledge from curated databases in Wikipedia is also well taken, and we've been debating this internally as well. The arguments to do that are as you state -- the community doesn't have to repeat what's already been done, and everything is in one place.

However, the argument not to add curated content essentially revolves around the realization that Wikipedia will never be the ideal platform to deal with structured data. Even if we got that data into Wikipedia, doing interesting queries and analyses would be cumbersome and limited at best. Therefore, we're really focusing on getting structured knowledge out. If we can properly represent semantic relationships, then they can be exported to a proper database (e.g., OpenPhacts). And it's over there that the majority of computational scientists will want to interact with the data.

Gentlemen, thanks for the thoughts and discussion! (more welcome...)

Egon Willighagen said...

I would not suggest replacing existing databases with wiki pages. In fact, they should not be replaced. We do have a number of really good, curated databases, but we also have tons of small knowledge bases where no critical mass is present. Just look at the recent NAR issue and count how many of them get the same amount of curation as the usual suspects get.

Curation is expensive, so we must make it as cheap as possible. Wikis are excellent for that, in my opinion. Small, niche knowledge bases do not typically have a clear structure yet, making an RDBMS less suitable. A wiki allows a much more explorative approach, and SMW provides machine readability at the same time. RDFa in the output and the use of templates (think DBPedia) serve the same role.

Samuel and I have been pondering this for a year and a half or a bit more now, and have been using a mix. Samuel has created RDFIO, an SMW extension: we prepopulate the wiki with knowledge from structured data using RDFIO, and do annotation, etc., with the wiki parts. This way you get the best of both worlds.

Check his presentation at SMWCon a week ago or so:

http://saml.rilspace.org/my-smwcon-fall-2011-talk

The use of templates is indeed very helpful here, and (user)scripts are very good at showing the advantage.

Ben, have you considered semantifying the doi template, creating a rdfs:seeAlso to the CrossRef RDF?

Benjamin Good said...

Hi Egon,

I wonder if you could elaborate a little on your suggestion regarding the DOI template and CrossRef?

Benjamin Good said...

Chris - regarding your comment "What I really advocate is thus to make a coordinated effort to see whether we can pre-fill the datastructures that Andrew made": we'd be happy to work on this and would love help if you can provide it ;).

Andy Mabbett said...

Interesting ideas.

You're absolutely right that Wikipedia should emit machine-readable metadata where practical.

I lead the microformat project on Wikipedia, and have found opposition to in-line templates such as those you're using.

You need to tread slowly and carefully, to ensure community support.

Happy to chat, sometime.

Benjamin Good said...

In case anyone is still following the comment chain here: there is an ongoing discussion at the Wikipedia village pump about this proposal. Feel free to chime in there!