Wednesday, March 10, 2010

bio2rdf 2 biogps

Last Friday I had the pleasure of having lunch with Andrew Su of the Genomics Institute of the Novartis Research Foundation. Among other things, he introduced me to one of his projects, BioGPS. BioGPS is an interesting, fairly minimalist approach to gene-based data integration. Essentially, it lets you register gene-related 'plugins' that other users can assemble into a page, much like an iGoogle home page. Each plugin amounts to an HTML-producing URL that takes one of a variety of gene IDs as a parameter. So you might have one plugin for NCBI Gene, another for KEGG, and so on, and they are all displayed together on a very smooth, interactive iframe canvas.

This is clearly useful to many people (they get about 150,000 pageviews/month), but its flexibility is limited by the way it currently accesses information: simply by gathering HTML from existing web pages. Since many pages have overlapping content, the aggregate view inevitably contains duplication that wastes screen space. As others have said before, a little bit of semantic web could go a long way toward improving this resource - and, because of the way the system is built and the way SPARQL endpoints work, it's very easy to do.

So, the idea is that you could take a SPARQL endpoint (one that can return HTML as an output option), write a query with a gene as a parameter, and capture the URL that contains the query. You then have a very specific kind of plugin that shows precisely the information you want and nothing else. By assembling a collection of these, you could produce a view on the gene information space that is tailored to individual needs.
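
Here is a rough Python sketch of that pattern: wrap a SPARQL query in a GET URL that could then be registered as a plugin. The endpoint address `http://example.org/sparql` is a made-up placeholder, and the `format` parameter is an assumption about how the endpoint selects HTML output (the `query` parameter comes from the SPARQL protocol); check what your endpoint actually expects.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_plugin_url(endpoint: str, query: str, result_format: str = "text/html") -> str:
    """Return a GET URL that runs `query` against `endpoint`.

    `format` is an assumed parameter name for choosing HTML output;
    other endpoints may use a different name or an Accept header.
    """
    params = {"query": query, "format": result_format}
    # urlencode percent-encodes the braces, quotes and spaces in the query
    return f"{endpoint}?{urlencode(params)}"


# Hypothetical endpoint and a toy query, for illustration only
example_query = 'select ?s where { ?s ?p ?o . ?o bif:contains "VEGF" }'
plugin_url = build_plugin_url("http://example.org/sparql", example_query)
print(plugin_url)

# Decoding the URL gives back the original query text unchanged
decoded = parse_qs(urlsplit(plugin_url).query)["query"][0]
print(decoded == example_query)
```

The point is just that the whole query travels inside the URL, so anything that can display a URL in an iframe can display the query results.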

I made a simple example of this pattern with the plugin "OMIM disorders where gene is linked to pathogenesis" which you can see in their plugin library.

It hits this endpoint

with this query
 PREFIX omim:
 PREFIX rdfs:
 select distinct ?OMIM_disorder
 where {
   ?s omim:PATHOGENESIS ?o .
   ?o bif:contains "VEGF" .
   ?s rdf:type omim:GeneticDisorder .
   ?s rdfs:label ?OMIM_disorder
 }

where the VEGF would be replaced by the gene that you were researching.
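
As a minimal sketch of that substitution step, the query can be kept as a template and the gene symbol dropped in. Two things here are my assumptions and not part of the original plugin: the OMIM namespace `http://example.org/omim#` is a placeholder (use whatever the endpoint really defines), and the character check on the gene symbol is a simple guard I added so that user input cannot break out of the quoted string.

```python
import re

# {{ and }} are literal braces; {omim_ns} and {gene} get filled in.
# rdf: and bif: are left undeclared, as in the query above; the
# endpoint is presumed to predefine them.
QUERY_TEMPLATE = """\
PREFIX omim: <{omim_ns}>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select distinct ?OMIM_disorder where {{
  ?s omim:PATHOGENESIS ?o .
  ?o bif:contains "{gene}" .
  ?s rdf:type omim:GeneticDisorder .
  ?s rdfs:label ?OMIM_disorder
}}"""

# Placeholder namespace, NOT the real one
HYPOTHETICAL_OMIM_NS = "http://example.org/omim#"


def fill_gene(gene: str, omim_ns: str = HYPOTHETICAL_OMIM_NS) -> str:
    """Insert a gene symbol into the template, rejecting odd characters."""
    if not re.fullmatch(r"[A-Za-z0-9_.-]+", gene):
        raise ValueError(f"unexpected characters in gene symbol: {gene!r}")
    return QUERY_TEMPLATE.format(omim_ns=omim_ns, gene=gene)


print(fill_gene("BRCA1"))
```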

Here is an example BioGPS view composed of four plugins.  My bio2rdf-sparql example is there on the top right.

To really do this properly, I think you would want to build a little helper application that helps users compose the queries and offers some basic formatting options for presenting the results of these SPARQLing BioGPS plugins.
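
One very rough way such a helper might work is to keep a small catalog of plain-language choices and compose the SPARQL behind the scenes. Everything in this sketch is hypothetical (the `PluginChoice` class, the `CATALOG` and the `compose_query` function); the single catalog entry just reuses the predicate and class from my OMIM example, and the PREFIX declarations are left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PluginChoice:
    label: str      # what the user sees in a menu
    predicate: str  # property whose text is searched for the gene
    rdf_class: str  # type of the things listed in the result


# Hypothetical catalog; a real helper would offer many entries
CATALOG = {
    "disorders": PluginChoice(
        label="OMIM disorders where gene is linked to pathogenesis",
        predicate="omim:PATHOGENESIS",
        rdf_class="omim:GeneticDisorder",
    ),
}


def compose_query(choice_key: str, gene: str) -> str:
    """Build the SPARQL for a menu choice; the user never sees it."""
    # Crude guard: letters, digits and hyphens only
    if not gene.replace("-", "").isalnum():
        raise ValueError(f"unexpected gene symbol: {gene!r}")
    choice = CATALOG[choice_key]
    return (
        "select distinct ?result where { "
        f"?s {choice.predicate} ?o . "
        f'?o bif:contains "{gene}" . '
        f"?s rdf:type {choice.rdf_class} . "
        "?s rdfs:label ?result }"
    )


# The menu shows only labels; the query is composed under the hood
for key, choice in CATALOG.items():
    print(choice.label)
    print(compose_query(key, "VEGF"))
```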


Egon Willighagen said...

I guess what you want is autocomplete for the given SPARQL endpoint (or a few, or complemented with a few common namespaces) and the Manchester syntax... ?

Benjamin Good said...

I think some autocompleting may be useful, but I think the interface would have to hide most if not all of the fact that you were composing a SPARQL query. BioGPS is really making an attempt to allow biologists (who might have trouble configuring their email client... but are still the ones that will be winning the Nobel Prizes) to create the plugins. So, under the hood the sparqling plugin-building widget would be composing the query, but on the surface they would be operating in a language that they understood.

Thanks for the comment, Egon. The BioGPS system really reminded me of the Greasemonkey stuff that you and I were working on a few years back...