Wednesday, October 17, 2007

Where is the API?

Yes, I am procrastinating. I should be sleeping, working on the OntoLoki automatic ontology evaluation system, or preparing for our meeting with the SWAN team tomorrow morning; but instead, I am perusing Project Prospect and thinking about what a journal should look like. This is largely because of my disappointment in reading this nascent blog post in which Ian Mulvany (a person whom I think I respect and the leader of a project I obviously find fascinating) suggests that enforcing the application of naming standards for chemical entities at the time of publication would a) be too hard for authors, b) not provide much benefit, and c) be better left as a voluntary step - all of which I absolutely disagree with.

This, and the comments on the post, led me to Project Prospect - which seems to be the first publisher to really take the idea of semantic enhancement of online manuscripts seriously.

Project Prospect provides semantic annotation (e.g. labeling GO terms in manuscripts) and uses this to provide some enhanced navigation patterns and some additional information (e.g. definitions) for any of the annotations. Doing a pretty nice job at this was apparently enough to win them the 2007 ALPSP/Charlesworth Award for Publishing Innovation. While this is certainly a nice addition and a step in what I think is the right direction, it is 1) overwhelmingly similar to the much older and much more flexible Conceptual Open Hypermedia Service (COHSE) from the University of Manchester and 2) does not seem to provide any capacity for semantic integration across the manuscripts in the collection.

Is this really the best we can do?

What I would like to see is a journal with an API. An API that would let me ask it questions like "what genes are present in articles published in this journal that are annotated with go:0005576 or any of its children, contain the word 'vaccine', and are described in the article as being upregulated?". Right now, we can approach this sort of question with text mining, but, with extensions to work like that done to enable the hypermedia browsing described above (which fundamentally depends on solid entity identification and annotation within the document), this question (which spans multiple manuscripts) could be answered with a relatively straightforward query.
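To make the idea concrete, here is a toy sketch of the kind of query such a journal API could answer directly, once entities are annotated at publication time. Everything here - the GO fragment, the articles, the field names - is invented for illustration; a real system would query a proper triple store rather than Python dictionaries.

```python
# A tiny invented fragment of the GO hierarchy: term -> direct children.
go_children = {
    "go:0005576": ["go:0005615"],
    "go:0005615": [],
}

def descendants(term, hierarchy):
    """Return the term plus all of its transitive children."""
    found = {term}
    for child in hierarchy.get(term, []):
        found |= descendants(child, hierarchy)
    return found

# Each article carries machine-readable annotations, not just raw text.
articles = [
    {"id": "art1", "go_terms": {"go:0005615"}, "keywords": {"vaccine"},
     "genes": {("IL6", "upregulated")}},
    {"id": "art2", "go_terms": {"go:0005634"}, "keywords": {"vaccine"},
     "genes": {("TP53", "downregulated")}},
]

def query(go_term, keyword, regulation):
    """Genes from articles annotated with go_term (or a child of it),
    containing the keyword, where the gene has the given regulation."""
    wanted = descendants(go_term, go_children)
    hits = set()
    for art in articles:
        if art["go_terms"] & wanted and keyword in art["keywords"]:
            hits |= {g for g, reg in art["genes"] if reg == regulation}
    return hits

print(query("go:0005576", "vaccine", "upregulated"))  # {'IL6'}
```

The point is that the hierarchy walk plus two set intersections replaces an entire text-mining pipeline - precisely because the annotation work was done once, at publication.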

It's time for journals to step up and stop wasting talented researchers' time writing text-mining algorithms. Let's build a journal with a proper API, one with standards-compliant methods for both writing content to it and querying the content inside it programmatically. Such a journal would not only improve human navigation and understanding of its individual textual documents, but would also enable entirely new modes of interaction with the integrated knowledge spanning all of its semantic content.

6 comments:

Anonymous said...

GoPubMed (http://www.gopubmed.org/) does something like that, with text mining and some string matching between GO & MeSH terms and the PubMed articles' contents. But it's an add-on instead of being integrated with the whole publishing scheme of things. Alternatively, if publishers do not want to take the first step, maybe someone could write an application that authors can run on their desktop (or securely online) that checks the manuscript's contents against x bio-ontologies and subsequently lets the author choose which of the matches are ok, which to discard, and, possibly, which ones are "missing" from any of those x ontologies. That, at least, would relieve the burden of fully manual annotation, would require only a one-off programming effort where adding a check for another ontology is just more of the same (instead of reinventing the wheel), and has some built-in flexibility where the results of the semi-automatic annotation can feed back into maintaining and improving the ontologies.
Maybe publishers will be willing to invest in updating their camera-ready submission procedures once they see that this kind of thing works?
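The commenter's semi-automatic annotator could be sketched roughly like this. The ontology contents, IDs, and the accept() callback are all invented placeholders; a real tool would load full OBO/MeSH files and show an interactive prompt instead of a lambda.

```python
import re

# Invented mini-ontologies: label -> term identifier.
ontologies = {
    "GO":   {"apoptosis": "go:0006915", "extracellular region": "go:0005576"},
    "MeSH": {"vaccines": "D014612"},
}

def find_candidates(text):
    """Scan the text against every loaded ontology. Adding another
    ontology is just another dict entry, not new code."""
    candidates = []
    for onto_name, terms in ontologies.items():
        for label, term_id in terms.items():
            if re.search(r"\b" + re.escape(label) + r"\b", text, re.IGNORECASE):
                candidates.append((onto_name, label, term_id))
    return candidates

def annotate(text, accept):
    """Keep only the matches the author confirms; rejected or missing
    matches could feed back to the ontology maintainers as gaps."""
    return [c for c in find_candidates(text) if accept(c)]

text = "Apoptosis markers were elevated after vaccines were administered."
# Stand-in for the interactive choice: the author keeps only GO hits.
chosen = annotate(text, accept=lambda c: c[0] == "GO")
print(chosen)  # [('GO', 'apoptosis', 'go:0006915')]
```

The accept step is where the "fully manual" burden shrinks to a review task, which is the heart of the suggestion.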

Ian Mulvany said...

Hi Ben,

I'd like to take issue with your interpretation of my comments, and perhaps add a little context to my points as well. I don't hold an editorial position at Nature, so I am not going to make comments about what we might or might not do with the journal in terms of the submission process, as I'm just not in a position to implement such decisions, and I don't want my ideas to be considered by whoever might read them as possibly indicating NPG policy.

What I personally think should happen is that there should be an abstract, journal-independent semantic XML schema for articles, and that authors should only ever have to write this one format and publishers should only accept this one format. I worked for 5 years for another publisher, Springer, which has a stable of about 1500 academic journals. Even within this one publisher there were no cross-company standards when it came to journal schemas.

Journals get traded, bought and exchanged between publishing houses. With each change there is a considerable cost associated with getting the journal workflow integrated with whatever system is being used in the given publishing house. Having a single format would simplify this process immensely. This one format would of course be style independent. At the moment there are two de facto standards, Word and TeX/LaTeX. The majority of the work of publishing houses is in conversion of these two formats to in-house XML, and then later conversion of that XML onwards to print, HTML, back to Word, and so on. What it would take for the adoption of one schema would be industry-wide acceptance, but when you are dealing with thousands of journals with archives stretching back in many cases over 150 years, I expect that such an adoption is not likely, no matter how great it would be, and it would be great. If I were in a position to, I would move to force such an adoption. Authors resubmitting papers to other journals would not have to re-write, data and semantic information would be available; after the initial transition cost it would be win-win. Sadly I'm not in such a position.

It was within this context that I thought that industry-wide adoption of any given standard, not just in the context of chemistry, might take a while. It's not that any given change is hard, but that there is an intrinsic inertia that stems understandably from the existence of present workflows. I do think some publishers are interested in changing things, but they are doing so at their own pace and with their own priorities. Given this situation I was trying to say that waiting for publishers to get a clue is not going to be the solution.

I just can't imagine a number of publishers sitting in a room trying to figure out which standards to adopt, and I don't think it would necessarily be a good thing. I think the conversation has to come from scientists and societies.

Tell us what your needs are and we will see where we can go. That's the reason having lunch with people like Egon and having these kinds of conversations with you are so important. It turns out that in regards to the InChI situation Nature is already working on something, and Tony Hammond came up to me the day after I posted on Nascent to wag the finger and tell me off for not knowing about that already. I don't know when it will be finished but we'll be sure to announce when it is.

Another option is to get on the editorial boards of journals and push for change there. These journals may be owned by publishers, but in a very real sense they are your journals, and your submissions to them are what keep them alive.

Society publications are a very lucrative growth area for academic publishers, and for journals belonging to large societies a large publisher is going to be willing to go a long way to please the society in contract negotiations. There is one pressure point where a society could say that they need standard x or y semantically available for the use of their community.

So that's kind of my take on the hardness of the problem of getting adoption.

The "not much benefit" point I was trying to make was just that an NPG editorial decision and adoption would not provide full coverage of the literature, which it wouldn't. I completely see the benefit of having semantic markup in papers. I should think that it is self-evident, so I am not going to elaborate on the point here. I was possibly being overly cautious in my Nascent post, as I am simply not in a position to make those calls.

As for it being better to leave the issue as a voluntary step, well, I think the points I make in this comment cover my opinion more fully. I have high hopes for automated methods of creating such semantic markup, and even if everyone switched to submitting perfectly annotated papers, automated methods will be required to convert the existing literature. Of course people won't voluntarily switch en masse, as people are rightly lazy.

I think your idea of a journal with an API is brilliant. We like APIs in this department, and if it's OK with you I'm going to see if we can take the idea and give it legs. One of the main themes of my boss Timo Hannay has been the confluence of databases and journals. A journal could be considered a poorly structured database of text, but increasingly papers exist only to point to real databases, data sets, or pieces of code. That these all cannot be richly interwoven is criminal, and that is definitely the thinking in web production. It would be only natural to provide an API to such a rich data structure.

It also sounds like something that should form part of a proposal for DataNet. Perhaps we should start a discussion on the Nature Network group?

Benjamin Good said...

Thanks for the clarification, Ian. The "...but that is an editorial decision..." clearly had a lot more meaning than I gathered when I read your post. It sure seems like it would be a little frustrating to be a blog author in an environment where you didn't feel completely free to say exactly what you thought. Guess that's a minus for the business world.

Certainly, feel free to give the idea legs and run with it. Not that I could stop you if you wanted to anyway ;). It would be nice if you kept me posted on any progress though.

I've joined the DataNet group and will now invite Marijke to do the same. (I like your idea, Marijke; it's similar to something I've done in the past with community ontology construction, and I think it would probably be a great way to bootstrap the process.)

I'll be looking for work at some point next year - being part of a DataNet-type project somehow would be fantastic.

Pedro Beltrao said...

Regarding cross-publisher standards, I would just like to add one thought. It is not required that all publishers accept the standards from the start. If a group of well-known publishers settles on standards that benefit the journals sharing them, then others will ask to join in. I think one additional problem is that, even within each publisher, the technological development, the editorial roles, and the economic management all have separate mindsets.
Right now I think PLoS, BMC and Nature could really group together and completely detach themselves from any other competitor in science publishing by collaborating on some of these standards.

Anonymous said...

In my naïveté: wouldn't some "semantic annotation standards" (at least at the what-level, not the technical how-to) be orthogonal to the myriad layout standards of the different publishers that Ian Mulvany is referring to? Analogous to LAV data integration, there could be a shared model, and each publisher would have its own mappings to its implementation peculiarities. Then publishers would not have to completely overhaul their current non-aligned procedures - just add on and take it from there.
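The LAV-style arrangement the commenter describes could be sketched as one shared annotation model with a per-publisher mapping into it. The publisher record formats and field names below are entirely invented; the point is only that each publisher writes one mapping and the query side never changes.

```python
# Shared model: every article reduces to {"title": str, "terms": set}.

def springer_mapping(record):
    # Hypothetical Springer-style record with nested metadata.
    return {"title": record["ArticleTitle"],
            "terms": set(record["Semantics"]["GOTerms"])}

def npg_mapping(record):
    # Hypothetical NPG-style record with flat, comma-separated terms.
    return {"title": record["title"],
            "terms": set(record["go_csv"].split(","))}

# Each publisher registers its own mapping; adding a publisher is a
# local change, with no overhaul of anyone else's workflow.
mappings = {"springer": springer_mapping, "npg": npg_mapping}

def to_shared_model(source, record):
    return mappings[source](record)

a = to_shared_model("springer",
                    {"ArticleTitle": "On apoptosis",
                     "Semantics": {"GOTerms": ["go:0006915"]}})
b = to_shared_model("npg",
                    {"title": "Vaccine response", "go_csv": "go:0005576"})
print(a["terms"] | b["terms"])  # one model, queryable across publishers
```

Cross-publisher queries then run against the shared model only, which is exactly the "add-on, not overhaul" property being argued for.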

Anonymous said...

Dear all,

My name is Michael R. Alvers; I'm CEO of Transinsight, which provides www.GoPubMed.org. We found this blog rather late, but I think it is still worth commenting. I absolutely like the "journal with an API" idea! It is needed in order to shift the post-writing text-mining analysis to a "match with the background knowledge (e.g. an ontology)" approach - as you write.
When it comes to proteins this is of great importance, since the average number of synonyms for a single protein is 5.5. That's why we developed Transwiki. The system (not online) has an auto-completion mode while writing (e.g. blogging). If, for example, the author starts typing "card", the system looks up the ontology (GO & MeSH) and offers the MeSH term "cardiovascular diseases" as the first choice in a popup (right below the end of the concept in the text); the next 4 most likely concepts are also shown, and the parent concept(s) are displayed in a popup right above the end of the concept.
This system fosters two things: first, it unifies writing on the basis of a controlled vocabulary (GO & MeSH & others if available), and second, it speeds up writing (in complex areas) significantly due to the auto-completion (hit return if it is the right concept to insert it into the text at the cursor position). We talked about this with Elsevier. Let's see how we proceed.
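The auto-completion behaviour Michael describes can be sketched in a few lines: given a prefix, return the top-ranked controlled-vocabulary matches plus the parent of the best match. The tiny vocabulary, parent links, and likelihood ranks below are invented stand-ins for the real GO/MeSH data.

```python
# label -> (parent label, usage rank; lower rank = more likely).
vocabulary = {
    "cardiovascular diseases": ("diseases", 1),
    "cardiac arrest": ("heart diseases", 2),
    "cardiomyopathies": ("heart diseases", 3),
}

def complete(prefix, limit=5):
    """Return up to `limit` matching concepts, most likely first,
    plus the parent concept of the top match (for the second popup)."""
    matches = sorted(
        (label for label in vocabulary if label.startswith(prefix)),
        key=lambda label: vocabulary[label][1],
    )[:limit]
    parent = vocabulary[matches[0]][0] if matches else None
    return matches, parent

suggestions, parent = complete("card")
print(suggestions[0], "| parent:", parent)
# cardiovascular diseases | parent: diseases
```

A production version would use a trie or prefix index over the full vocabularies and rank by corpus frequency, but the contract - prefix in, ranked concepts and parent out - is the same one the popup needs.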