Thursday, December 6, 2007

official reviews of E.D.

After more than three months, I've just received notification that the E.D. manuscript has been rejected for publication in the semantic mashup edition of JBI. I provide the reviews below and pose the question to you, the ether - what should I do now? Should I carry out some user studies and resubmit? Should I make amendments to the text as suggested by the second reviewer and resubmit? Should I send it to a different journal? Should I give up on it and finish other pending projects?

?


Dear Mr. Good,

Experts in the field have now reviewed your paper, referenced above. Based on their comments and the number of submissions, we regret to inform you that we are unable to accept your manuscript for publication in the Special Issue "Semantic BioMed Mashup" of the Journal of Biomedical Informatics. One of the major concerns is that more work (e.g., a better use case) is needed for increasing the substance of the paper. You may consider revising the paper according to the reviewers' comments and re-submitting it to a regular JBI issue in the future.


We have attached the reviewers' comments below to help you to understand the basis for our decision. We hope that their thoughtful comments will help you in future submissions to the JBI and in your future studies.

Sincerely,
JBI Editorial Office

Reviewers' comments:

Reviewer #1: This paper could be a good workshop paper but it is not suitable for journal publication. The paper describes a prototype interface for a system that could potentially be useful, but provides no evaluation whatsoever. It doesn't even say if there are any users of the system. As the authors rightly note, there are many open questions, and even a simple user evaluation would have taken some steps in addressing those questions. Otherwise, it looks like an ad hoc exercise.

For instance, do the users use the suggested tags correctly, or does not being able to see the context for a definition lead them to select wrong terms? What happens if there are several tags from different vocabularies? Is the extra selection step too cumbersome, such that users won't bother? What is the agreement between users? How about the agreement with manually generated tags such as MeSH headings?

Without at least some evaluation, I don't think the paper can be a journal paper.

However, the research goal is worthwhile and the approach interesting, so I would strongly encourage the authors to pursue it!

Additional comments: you compare the number of MeSH tags and Connotea annotations and suggest that the difference in the number of tags per item is somehow an indication of quality. I have a hard time understanding how the number of tags corresponds to the *quality* of those tags. All this shows is a difference in scale.

I think the section motivating adding controlled vocabularies to social tagging systems ("Linking taggers and their tags...") is too one-sided. No potential drawbacks are discussed. What if users don't understand the tags from these controlled vocabularies and use them incorrectly? Would it be worse than not using them at all? Do users need to understand the vocabularies? Will non-professional users know the vocabularies enough to use them without any special training? All this discussion must be present in the paper.

You say that you couldn't use the NCI Thesaurus and WordNet because they are too big for "Semantic Web technologies". This is not true. Many Semantic Web tools can process these easily; so it is a limitation of your technology.


Reviewer #2: This paper presents an application, the Entity Describer (ED), for
generating and storing controlled semantic annotations on biomedical
resources, as an extension of the Connotea social tagging system.
The authors briefly review semantic annotation (i.e., professional
indexing) in biomedicine and social tagging of Web resources, before
comparing the two. While professional annotation results in more
complete, standard and accurate sets of annotations, it is also not
sustainable due to its cost. The authors argue that the quality of
annotation through social tagging would improve if the taggers used
standard terminologies rather than homegrown tags. In order to explore
this hypothesis, they combined an existing social tagging system,
Connotea, with some controlled terminologies, including MeSH and GO. The
application is built -- using Semantic Web technologies -- as a mashup
of Connotea, terminologies and a database of annotations. The ED
modifies the Connotea interface to help users select terms from
controlled vocabularies and stores these annotations in a database,
while maintaining the usual features of Connotea. A prototype of this
application has been developed. The authors propose this annotation
model as an alternative to professional indexing and automatic
indexing. Future work includes making additional terminologies available
and applying ED to other tagging systems than Connotea.


This paper on enriching social tagging with controlled terminologies
through Semantic Web technologies is undoubtedly relevant to this
special issue. The paper is interesting and clearly written, easily
accessible to a readership that would not be familiar with the Semantic
Web. The references are appropriate.
This reviewer has essentially minor reservations about this manuscript
regarding the overall organization, statement of objectives, and the
discussion. These points could be addressed easily. The only major
reservation is the absence of proper discussion of the limitations of
this work.

Overall organization
The paper is composed of ten sections. Although logically flowing, this
succession of small sections might be distracting to the reader as it
fails to reflect the overarching organizational structure of the
paper. I would recommend grouping the first 3 sections under
Introduction/background and the next 4 under Materials and
Methods. Discussion and future work could be grouped.

Statement of objectives.
Again, it does not become clear until section 4 what the objectives of
this work are. I would recommend adding a short introduction to present
the issues of professional indexing and social tagging and stating that
the application presented proposes to reconcile them.

Discussion and future work.
In the future work section, rather than a litany of issues, it would be
useful to regroup the issues around terminology-related and
system-related issues. The issue of extension to other terminological
systems is presented in a somewhat naive manner. For example, it does
not look like the authors have fully appreciated the issues in making
the UMLS available through this system (e.g., size, lack of explicit
subclass relations, intellectual property restrictions, etc.)

Insufficient discussion of the limitations of this work.
The discussion is extremely short. The limitations of this work are not
clearly mentioned (lack of an evaluation [or even a metric for an
evaluation], scalability issues, etc.). Another limitation of this
approach, which you never mention, is that professional indexing
relies not only on a controlled terminology, but also on a set of
indexing rules used to further control the use of the indexing
terms. Finally, in your OWLization of MeSH, you briefly mention
converting broader/narrower links into subClassOf properties, without
raising any issues. What about Liver subClassOf Abdomen? I understand
this shortcut helps you meet the requirement that OWL be used in the
framework of this mashup. This is nonetheless highly inappropriate and
deserves being addressed in the discussion. The short paragraph about
using SKOS instead of OWL should be expanded and moved to the
discussion. The work of Guus Schreiber's group on representing MeSH in
SKOS should be acknowledged. http://thesauri.cs.vu.nl/eswc06/
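The reviewer's concern can be sketched in a few lines of code. This is an illustrative example only: the term names follow the reviewer's Liver/Abdomen example, and the `broader` links and function are hypothetical, not taken from the paper's actual MeSH conversion.

```python
# Illustrative sketch: reading MeSH broader/narrower links as owl:subClassOf
# licenses class-subsumption inferences that are wrong for partonomic links.
# The links below are invented, following the reviewer's Liver/Abdomen example.

broader = {"Liver": "Abdomen", "Abdomen": "Body Regions"}

def entailed_classes(term, broader):
    """Under subClassOf semantics, an instance of `term` is also an
    instance of every ancestor reached via the broader links."""
    ancestors = []
    while term in broader:
        term = broader[term]
        ancestors.append(term)
    return ancestors

# A particular liver would be entailed to *be* an abdomen (and a body region):
print(entailed_classes("Liver", broader))  # ['Abdomen', 'Body Regions']
```

Modelling the same links with skos:broader avoids this problem, since SKOS relates concepts rather than classes and carries no instance-level subsumption semantics.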

Technical comment about MeSH.
Figure 7 shows hippocampus in MeSH identified by
"A08.186.211.577.405". Using tree numbers instead of the unique ID
(D006624) is bad practice. In this example, it so happens that
hippocampus occurs in only one hierarchy and has therefore only one tree
number. Most MeSH descriptors, however, have several tree numbers.
A side effect of this practice is that URI based on tree numbers would
result in multiple, non-reconcilable identifiers for the same MeSH
descriptor, leading to seemingly distinct annotations for the same
descriptor.
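The identifier problem the reviewer describes can be made concrete with a small sketch. The descriptor, its tree numbers, and the URI scheme below are all invented for illustration; they are not real MeSH values.

```python
# A hypothetical multi-tree MeSH descriptor (all values invented)
descriptor = {"ui": "D000000", "tree_numbers": ["A01.111.222", "C04.333.444"]}

def uris_from_tree_numbers(d):
    """Minting URIs from tree numbers yields one identifier per hierarchy
    position, so one descriptor ends up with several irreconcilable URIs."""
    return [f"http://example.org/mesh/{tn}" for tn in d["tree_numbers"]]

def uri_from_unique_id(d):
    """Minting the URI from the descriptor's unique ID gives exactly one
    identifier, so annotations of the same concept stay comparable."""
    return f"http://example.org/mesh/{d['ui']}"

print(len(uris_from_tree_numbers(descriptor)))  # 2 distinct URIs, same concept
print(uri_from_unique_id(descriptor))           # one canonical URI
```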


Minor comments
- Introduction: The sentence "Examples of semantic annotation
... UniProt [1-3]." would fit better between the first two sentences,
than at the end of the first paragraph.
- Introduction, 2nd paragraph: Arguably, the semantic annotation of
MEDLINE citations with MeSH describes *topics* more than it describes
*entities*.
- p. 3: "The act of adding a resource to a social tagging collection" Do
you mean "The act of adding a tag to a resource"? The tagging *event*
is a process and cannot be composed of entities such as a tagger,
etc. Please rephrase.
- p. 8: "formal training in classification" Do you mean "formal training
in *annotation* (or indexing)"? Classifying resources is a different
issue.
- p. 8: "main subject descriptors". The "official" MeSH terminology
refers to "main headings" (or "descriptors"), with the "major
descriptors" (marked by an asterisk) denoting the main topics in the
article. It is probably safer to stick to this terminology. In this
case, 12.7 must be the average number of descriptors, not major
descriptors.
- p. 9: Since your goal is to compare dispersion of the number of
descriptors in MEDLINE and Connotea, where the means are different,
the coefficient of variation should be used instead of (or in
addition to, for the purpose of the comparison) the raw standard
deviation values. For details, see:
http://en.wikipedia.org/wiki/Coefficient_of_variation
- p. 10: This section should introduce the notion of "controlled
vocabularies" or "controlled terminologies".
- p. 10: Arguably, what will decrease is not so much the quality of the
annotations as it is their *homogeneity*.
- p. 11: Please provide some background on GreaseMonkey.
- p. 12: It is unclear why the term "Hip" in MeSH (D006615) is not
retrieved as part of the list of terms suggested for the entry "hip".
- p. 15, first line: "if a non-annotation property and was used..."
Remove "and".
- p. 15, later: "it would render the knowledge base OWL-Full". you
probably mean: "it would require OWL-Full for the representation of
the KB"
- p. 17, bullet 1: It is unclear what the justification is for
suggesting the addition of these particular terminologies.
- p. 18, bullet 3: tree-like interfaces would be extremely inconvenient
for rendering biomedical terminologies with a high degree of
multiple inheritance.
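The coefficient-of-variation suggestion in the p. 9 comment above is straightforward to apply. A minimal sketch, using made-up tag counts rather than the paper's actual data:

```python
import statistics

def coefficient_of_variation(values):
    """CV = sample standard deviation / mean. Being unitless, it allows
    comparing dispersion across datasets whose means differ."""
    return statistics.stdev(values) / statistics.mean(values)

# Illustrative tag counts per item for two hypothetical datasets
medline_counts = [11, 13, 12, 14, 12]   # high mean, modest spread
connotea_counts = [1, 3, 2, 5, 4]       # low mean, similar absolute spread

print(coefficient_of_variation(medline_counts))   # small CV (~0.09)
print(coefficient_of_variation(connotea_counts))  # much larger CV (~0.53)
```

The raw standard deviations of the two lists are similar; it is the CV that reveals the difference in relative dispersion.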

7 comments:

Pedro Beltrão said...

I think the user evaluation would be great but hard to do. Do you think you can get enough users to do a test? How much time and personal engagement on your part do you think it would take?

Benjamin Good said...

Never having conducted a scientific usability study before, I'm not really sure how many users I would need to make it robust, so it's difficult to say. I suspect it would take at least a couple weeks to do the background research and design the study. Might really be worthwhile in the long run, but this many years into my PhD I'm getting a bit impatient I guess.

It's also a bit artificial, as the model suggested in the paper (social tagging) is one in which people tag for personal reasons as part of their normal daily activities - a pattern that would be difficult to reproduce in a controlled setting. I think the most appropriate analysis would be done directly on the usage data that is being collected (that's been the plan all along). Trouble is, I don't have many people using it yet. I was hoping to generate interest with the publication so that I could produce enough traffic to do a proper analysis...

Ryan S said...

This sounds like fitting content for the Code4Lib journal, but you won't draw a lot of readers from the biology community...

Matthias Samwald said...

The request for "user evaluations" by reviewers of papers such as this one is quite common. However, I am doubtful how realistic it is -- I also don't see such evaluations in most published papers.
Maybe the reviewers (including ourselves, of course) should try to evaluate the practicability of such developments, instead of insisting on empirical user evaluations that most authors simply cannot deliver.

Benjamin Good said...

I don't disagree with the comments indicating that the paper would be more interesting with some actual data about use included with it. It's really a bit of a chicken-and-egg problem though. How do we get (real, non-paid) users for a system like this without a good mechanism for advertising it, like a publication? Also, and I think this is part of the point you were making, the request for evaluations of the software kind of misses the main point - which was to introduce the (I think at least somewhat novel) ideas in the paper to a community that might be interested in them. If I were allowed to resubmit it, I could have easily explored these more deeply in the text as the second reviewer suggested.

All in all, these reviews were more helpful than many I've seen in the past and will certainly help to improve future iterations. The second reviewer in particular clearly spent a lot of time and provided some really useful comments. At the same time, though, I'm quite disappointed I didn't have the opportunity to submit an amended version to that special edition, as it would have been an ideal venue.

Hilary said...

Sorry to hear that the publication didn't work out!

You might have already moved on from this project, but a couple of thoughts...

Could you use a service like Mechanical Turk to generate user data? You could possibly set up three groups (those posting freeform tags, those using the ED, and those told to work with an online version of MeSH) asking users in each group to tag a predefined set of documents (perhaps with a minimum of 5 tags per document?).

You could then do some sort of graph theoretic analysis (rather than asking about user satisfaction). What is the connectivity of the different datasets? Average degree of a tagged paper? Etc. Of course you'd need to explain what these metrics mean, or why they're useful...
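To make one of those metrics concrete, here is a minimal sketch of an "average degree" computation over the tagging data. The (user, document, tag) triples and the function are invented for illustration, not drawn from any real dataset:

```python
from collections import defaultdict

# Invented (user, document, tag) triples standing in for collected tagging data
taggings = [
    ("u1", "doc1", "hippocampus"),
    ("u2", "doc1", "memory"),
    ("u1", "doc2", "hippocampus"),
    ("u3", "doc2", "memory"),
]

def average_document_degree(taggings):
    """Average number of distinct tags attached to each document --
    one simple connectivity measure for comparing tagging groups."""
    tags_per_doc = defaultdict(set)
    for _user, doc, tag in taggings:
        tags_per_doc[doc].add(tag)
    return sum(len(t) for t in tags_per_doc.values()) / len(tags_per_doc)

print(average_document_degree(taggings))  # 2.0
```

Running the same measure over the freeform, ED, and MeSH groups would let you compare connectivity without asking anything about user satisfaction.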

One downside is that you'd need to pay the users, but it might work well. Dolores Lab just did an experiment asking users to categorize colors with freeform words (using Mechanical Turk) and managed to accumulate a lot of data.

You could also perhaps compare these results with data generated through the use of an ontology mining program.

bgood said...

Hi Hilary,

Thanks for your suggestions! I have thought about a Mechanical Turk style experiment, but I'm concerned it really wouldn't be useful for my particular area of interest (in regards to my thesis work) unless I could limit the Turks to include only scientists and limit the papers they tagged to those they know something about. We have some rough plans to conduct a controlled, in person study with the new version of ED but I hesitate again because it really destroys the true context that such an application would be applied in and thus renders the results of questionable utility. I'd be really interested in hearing more about your ideas about a graph-theoretic comparison. I have an inkling that smells good, but please do elaborate with the meal if you don't mind ;).