Monday, November 17, 2008

Coverage of PubMed by Citeulike

In the life science domain, the total number of items described by social tagging systems is currently tiny in comparison to the number of resources described by institutions. To illustrate, the MEDLINE bibliographic database contains over 16 million references while, as of November 9, 2008, Citeulike, the largest of the academic social tagging services, contained references to only 203,314 documents with known PubMed identifiers.

Though it is important to see where things stand today, the interesting aspect of these new systems right now is their potential for growth. Given the large numbers of contributors (and very large numbers of potential contributors), it seems possible that their coverage might eventually meet or surpass that of resource-constrained institutional mechanisms. In 2007, the NLM reported that it indexed 670,943 citations for the MEDLINE database, which equates, on average, to about 56,000 citations per month. To estimate whether social tagging services might someday reach the same level of throughput as the NLM indexing service, we compared the monthly rates of growth for MEDLINE and for Citeulike (on PubMed citations) over the last several years and used these data to make some predictions of future trends. Here is what we came up with.

The figure plots the numbers of distinct PubMed citations described by users of Citeulike and by NLM indexers each month and, using exponential smoothing, plots an extrapolation of the observed trends several years into the future. Based on the data obtained so far, we find that the number of biomedical resources described per month by Citeulike users is increasing more rapidly than the number indexed per month for MEDLINE and that, if current trends continue, Citeulike's monthly throughput would catch up with MEDLINE's around the year 2014 - at which point both systems would be describing approximately 70,000 biomedical citations per month. As the rapidly expanding confidence intervals illustrate, there is insufficient data to provide strong evidence for the precise point of intersection or even that Citeulike will continue to grow; however, it seems plausible that Citeulike and other scientifically oriented social tagging services will continue to expand in their coverage of the life sciences domain at a faster rate than institutional systems and thus will eventually catch up to the point where every document indexed by a professional is also tagged for personal use by a scientist (or 10).
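For readers who want to try this kind of extrapolation themselves, here is a minimal sketch in Python. It is not the code behind the figure: the post only says "exponential smoothing", so the choice of Holt's linear (double) exponential smoothing, the smoothing weights, the function names, and above all the monthly counts below are assumptions made up for illustration. It also produces no confidence intervals, which is the part of the figure that matters most for interpreting the forecast.

```python
def holt_forecast(series, alpha, beta, horizon):
    """Forecast `horizon` steps ahead with Holt's linear exponential smoothing.

    alpha smooths the level, beta smooths the trend (both in 0..1).
    This is one common variant; the post does not say which one was used.
    """
    if len(series) < 2:
        raise ValueError("need at least two observations")
    level = series[0]
    trend = series[1] - series[0]
    for value in series[1:]:
        prev_level = level
        level = alpha * value + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return [level + step * trend for step in range(1, horizon + 1)]


def first_crossover(fast, slow):
    """Return the 1-based forecast step at which `fast` first reaches `slow`."""
    for step, (f, s) in enumerate(zip(fast, slow), start=1):
        if f >= s:
            return step
    return None  # no crossover within the forecast horizon


# SYNTHETIC monthly counts of distinct PMIDs, invented for this example.
# They are NOT the real Citeulike or MEDLINE numbers.
citeulike_monthly = [1000, 2000, 3000, 4000]
medline_monthly = [55950, 56000, 56050, 56100]

horizon = 120  # months to extrapolate
citeulike_forecast = holt_forecast(citeulike_monthly, 0.5, 0.5, horizon)
medline_forecast = holt_forecast(medline_monthly, 0.5, 0.5, horizon)

crossover = first_crossover(citeulike_forecast, medline_forecast)
print("months until crossover:", crossover)
```

With so few points the result is extremely sensitive to the smoothing weights and to the last few observations, which is the same caveat the confidence intervals in the figure are meant to convey.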

So, what are we going to do with all of that data?



Duncan Hull said...

hi ben, can't quite read the graphic, do you have a bigger one?

bgood said...

Still tiny, but I hope that is a little better.
The y axis is the number of distinct pmids per month and the x axis is time.

keet said...

your question at the end sounds rhetorical, as if you already have an answer to it (or at least some good ideas). I'm looking forward to seeing them.

one comment on the figure (which did not make it into the analysis in the post) is that I did note the yellow "estimated lower bound" line, which goes down by 2010...

bgood said...

It wasn't really rhetorical, though I do have ideas of course. I'd love to hear what other people think about it. The most obvious things are recommendation systems, but you already knew that. I'm very curious how a folksonomic scientific classification system would work in comparison/cooperation with other extant search mechanisms if we had the volume of data that this ~very rough forecast predicts we will have in a couple of years. (Right now it's pretty thin.) While I don't think we will be replacing NLM indexing with Connotea any time soon, I do think that we will be seeing more and more search/browse/filter applications in scientific contexts that take advantage of the metadata collated in social tagging systems as well as the metadata assembled by institutions.

Yup, as I said, "there is insufficient data to provide strong evidence for the precise point of intersection or even that Citeulike will continue to grow". It's entirely possible that Citeulike etc. is a fad and will die out one way or another. I don't really think that will happen, but I can't prove otherwise. The line is just a guess using a very small amount of data, and I included the confidence interval to emphasize that, beyond 1 year from now, this is a very rough prediction - in no way a guarantee. But it's fun to speculate :)

bgood said...

In case you want to see the raw data and play with it yourself online, it's up here