Monday, April 7, 2008

tag pollution

As part of some research on Connotea, I've discovered some rather unfortunate behavior in the way EndNote and Connotea interact. The following series of events appears to be common, and it has negative consequences for the usefulness of Connotea (at least for me):

  1. A person uses the EndNote search/import mechanism to bring one or more references from PubMed into their personal collection.
  2. EndNote pulls much of the MEDLINE annotation into the record it creates. In particular, it adds all of the MeSH descriptors, along with their qualifiers, assigned to the citation.
  3. The person exports their references to a text file and uploads them to Connotea using the import mechanism.
  4. The Connotea import mechanism mangles the MeSH terms as follows:
  • In PubMed records, and in the EndNote records, slashes separate descriptors such as "Transcription Factors" from qualifiers such as "antagonists & inhibitors" and "metabolism". For example, you might see a keyword listed as "Transcription Factors/antagonists & inhibitors/*metabolism". When imported, Connotea strips the slashes and thus adds the tag "Transcription Factorsantagonists & inhibitors*metabolism" to the post.
  • MeSH terms sometimes contain commas, as in "Models, Genetic". When imported, these compound terms get split into separate tags ("Models" and "Genetic").
In addition, it appears that quite a few people have managed to import the "Research Support" aspect of PubMed records as well. This is why you see more than a thousand bookmarks with the rather misleading tag "Non-U.S. Gov't", often also tagged with the seemingly contradictory "U.S. Gov't". (This happens when the research in the paper had both U.S. and non-U.S. funding.)
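
To make the slash and comma handling concrete, here is a minimal sketch of how an importer might read one of these keywords without mangling it. It is not Connotea's code; the names `MeshKeyword` and `parse_mesh_keyword` are hypothetical, and it assumes the "descriptor/qualifier/*qualifier" layout shown in the example above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MeshKeyword:
    """Hypothetical container for one MeSH keyword from an exported record."""
    descriptor: str
    qualifiers: List[str] = field(default_factory=list)
    major: bool = False  # True if any part carried the "*" major-topic marker


def parse_mesh_keyword(raw: str) -> MeshKeyword:
    """Split a keyword such as 'Descriptor/qualifier/*qualifier'.

    Assumptions (not taken from any importer's real code):
    - "/" separates the descriptor from its qualifiers;
    - a leading "*" on any part marks a major topic;
    - commas are part of the term and are never used to split it.
    """
    parts = [p.strip() for p in raw.split("/") if p.strip()]
    if not parts:
        raise ValueError("empty MeSH keyword")
    major = any(p.startswith("*") for p in parts)
    cleaned = [p.lstrip("*").strip() for p in parts]
    return MeshKeyword(descriptor=cleaned[0], qualifiers=cleaned[1:], major=major)
```

With this, "Models, Genetic" stays a single descriptor, and "Transcription Factors/antagonists & inhibitors/*metabolism" yields one descriptor, two qualifiers, and a major-topic flag instead of one run-together tag.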

You may be thinking to yourself, "who gives a rat's nether regions about this?" Well, I do, and so should you. If the import were handled properly, it would improve the accuracy and usefulness of the tags for biomedically relevant papers, and it really doesn't seem like it would be too difficult to get right.

Frankly, the data in Connotea would be much more interesting to me if the MeSH tags could not be imported at all.  Since these terms can be accessed for these citations programmatically and freely already (using e.g. the NCBI E-utilities), they add nothing of value at the collective level.  Without this import mechanism, people would be forced to tag these references themselves, thus adding their own interpretations to the collective pot. 
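
As a rough illustration of that programmatic access, the sketch below builds an EFetch request for a PubMed ID and pulls the MeSH headings out of the returned XML. The function names are my own, and the element and attribute names (`MeshHeading`, `DescriptorName`, `QualifierName`, `MajorTopicYN`) reflect my understanding of PubMed's XML format, so check them against the current E-utilities documentation before relying on this.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL of the NCBI E-utilities EFetch service.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(pmid):
    """Build an EFetch URL asking for one PubMed record as XML."""
    return EFETCH + "?" + urlencode({"db": "pubmed", "id": pmid, "retmode": "xml"})


def mesh_headings(xml_text):
    """Extract (descriptor, is_major, [(qualifier, is_major), ...]) tuples.

    Assumes each MeshHeading element holds one DescriptorName and zero or
    more QualifierName children, each with a MajorTopicYN="Y"/"N" attribute.
    """
    root = ET.fromstring(xml_text)
    headings = []
    for heading in root.iter("MeshHeading"):
        descriptor = heading.find("DescriptorName")
        if descriptor is None:
            continue
        qualifiers = [
            (q.text, q.get("MajorTopicYN") == "Y")
            for q in heading.findall("QualifierName")
        ]
        headings.append(
            (descriptor.text, descriptor.get("MajorTopicYN") == "Y", qualifiers)
        )
    return headings


def fetch_mesh_headings(pmid):
    """Fetch a record over the network and return its MeSH headings."""
    with urlopen(efetch_url(pmid)) as response:
        return mesh_headings(response.read())
```

Calling `fetch_mesh_headings` with a PubMed ID needs network access; `mesh_headings` can be tried offline on any saved EFetch response.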

There is, of course, a conflict here between the individual benefit to the tagger of not having to tag all the documents and the value to the collective of that individual's intellectual labor. A better import mechanism might improve the situation at both levels by a) doing a better job of enhancing the user's collection with external annotations and b) gently encouraging them to add their own two cents to each post.

Help save the world, stop tag pollution now!

p.s. You will often see a * prepended to tags imported in this manner, such as "*Genes". This indicates that the *'d term is a major topic (as opposed to a minor topic) of the manuscript according to MEDLINE indexing.

p.p.s. I don't mean to pick on Connotea; other services such as CiteULike and BibSonomy have similar problems, with different syntactic mangling.