Friday, September 12, 2008

freebase ED and sparql

So what do you do when the two papers you would like to finish and submit are sitting in the hands of co-authors?  Kayaking? Sleeping? Surfing? Today, no.  Hacking? Today, yes.

While I wait, I decided to finally start working on bridges between freebase and our semantic tagging repository for ED, for use after the data is collected. To get started, I wrote the code to answer this question: "what URIs have been tagged with the organism classification X or any of the sub-classifications of X?"  For example, has anyone tagged anything with magnoliopsida or any of its lower classifications, such as arabidopsis?

To do this, I needed to utilize the 'Higher classifications' (or, of course, 'Lower classifications') property of the Organism Classification Type.  Unfortunately, there is, thus far, no such thing as a generic transitive property in freebase as far as I can tell, so I built a brute-force, recursive query that implements one myself.  I send the following MQL query, with the '???' replaced with my starting point (e.g. 'magnoliopsida'), to freebase as part of the request URL:
"higher_classification" : "???",
"name" : null,
"guid" : null,
"type" : "/biology/organism_classification"

Freebase responds with the lower classifications of my query and then I repeat the process with these until either a maximum depth is reached or it bottoms out. If you know a better way to do this please let me know.
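The recursion can be sketched like this. This is a minimal Python version, not the actual code: the Freebase HTTP call is abstracted behind a pluggable `fetch` function, and the function names are mine. It also guards against cycles and runaway depth, as described above.

```python
def lower_classifications(seed_guid, fetch, max_depth=10):
    """Collect the seed plus all transitive lower classifications.

    `fetch` takes a parent guid/name and returns the direct children as a
    list of {"name": ..., "guid": ...} dicts -- e.g. by sending the MQL
    query shown above to Freebase with '???' set to that parent.
    Recursion stops when a branch bottoms out, when a topic has already
    been visited (cycle guard), or at max_depth.
    """
    seen = {}  # guid -> True; dict keeps insertion order

    def walk(guid, depth):
        if depth > max_depth or guid in seen:
            return
        seen[guid] = True
        for child in fetch(guid):
            walk(child["guid"], depth + 1)

    walk(seed_guid, 0)
    return list(seen)
```

With a real `fetch` wired to Freebase, calling `lower_classifications('magnoliopsida', fetch)` would return the guids needed for the SPARQL step below.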

Once I have all of the guids for all of the lower classifications of my query, I send these over to get URIs tagged with any of them via a SPARQL query like this:
prefix tag: <...>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select ?tagging ?tag
where {
  { ?tagging tag:associatedTag ?tag .
    ?tag rdfs:isDefinedBy <...> . }
  UNION
  { ?tagging tag:associatedTag ?tag .
    ?tag rdfs:isDefinedBy <...> . }
  UNION
  { ?tagging tag:associatedTag ?tag .
    ?tag rdfs:isDefinedBy <...> . }
}

The query has as many UNIONs as topics to check for. (Note that you have to put URIs in SPARQL queries inside angle brackets - blogger was making this difficult for me to include.) It works well enough, but if there are too many, I hit the max URL size limit (HTTP 414), so I set it up to send them in chunks and then reassemble the results.
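The UNION-building and chunking steps can be sketched roughly like this. The `tag:` prefix is a stand-in (the real tag-ontology URI isn't shown above), and the function names are mine:

```python
RDFS = "http://www.w3.org/2000/01/rdf-schema#"

def build_union_query(guid_uris, tag_prefix="http://example.org/tags/"):
    """Build one SPARQL query with a UNION branch per Freebase guid URI.

    tag_prefix is a placeholder -- substitute the tag ontology that
    actually defines tag:associatedTag in the repository.
    """
    branch = "{{ ?tagging tag:associatedTag ?tag . ?tag rdfs:isDefinedBy <{0}> . }}"
    body = "\n  UNION\n  ".join(branch.format(u) for u in guid_uris)
    return (
        "prefix tag: <%s>\n" % tag_prefix
        + "prefix rdfs: <%s>\n" % RDFS
        + "select ?tagging ?tag\nwhere {\n  " + body + "\n}"
    )

def chunked(items, size):
    """Split the guid list into chunks so each query URL stays under the
    server's size limit (avoiding the HTTP 414 mentioned above)."""
    for i in range(0, len(items), size):
        yield items[i:i + size]
```

One query per chunk, then the result sets get concatenated on the client side.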

Hacky? Yes. Successful for demo purposes? So far...

Any ideas about optimizing such activities most appreciated.

On the todo list:
  1. Assemble the must-tag list of web services for the upcoming biomoby/ED jamboree
  2. Build up an API-like library of queries like the above and normal queries like 'get all the URIs tagged by user X' so that we can more easily put up reasonable human interfaces for users of ED2.0. (Thanks to those that have already started using it!).  Note that any developers out there already have access to all of the data needed to build ED applications via HTTP calls to freebase and to our repository.  The library I speak of will be used by us and probably made public, but the real idea is for external developers to utilize SPARQL/MQL directly as that provides the most flexibility.
  3. Create mappings between bio-ontology classes and freebase topics.  Likely follow Shawn Simister's model for approaching this integration.  (He has some excellent ideas about SPARQL/MQL integration).
  4. Prepare for kayaking trip tomorrow
  5. Graduate before they cut off my funding...


bgood said...

Made it generic. Method now takes a seed topic like 'cellular_component', a transitive property associated with that kind of topic like 'broader_group', and an official type of the topic like '/biology/gene_ontology_group' and returns a list of topic names and GUIDs. Yay, I've got my very own transitivity 'reasoner' for freebase. I wonder how long I can play with it until they start throttling me...
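The generic query envelope described in that comment might look roughly like this. The parameter values are the ones named above; the function name and exact shape are my own sketch, and the real Freebase property ids may differ:

```python
def transitive_query(seed, prop, mql_type):
    """Build the MQL query for the direct children of `seed` along the
    transitive property `prop`, restricted to topics of `mql_type`.
    Feeding each result's guid back in gives the transitive closure."""
    return [{
        prop: seed,          # e.g. "broader_group" : "cellular_component"
        "name": None,        # ask Freebase to fill in the topic name
        "guid": None,        # ...and its guid, for the next round
        "type": mql_type,    # e.g. "/biology/gene_ontology_group"
    }]
```

For the organism-classification case earlier in the post, the same builder would be called as `transitive_query('magnoliopsida', 'higher_classification', '/biology/organism_classification')`.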