Saturday, October 16, 2010

How many articles in PubMed contain information about genes?

In Figure 1 from this article about the Open DMAP information extraction system, the authors suggest that, as of 2007, about 40% of the articles indexed in PubMed contained some information about genes that could be extracted from their abstracts. That seems like a lot. If anyone has any data to corroborate or disprove that statement, please let me know.

--Correction. The article said 40% of recent articles had a gene mention. See comments from Joachim below that suggest a better estimate overall of 19% for articles with abstracts.--


Benjamin Good said...

Friendfeed discussion with some more data points

Joachim said...

We have been running gene recognition software over MEDLINE's baseline 2010 abstracts and picked up 1.7M PubMed IDs, out of the 10M total, in which we think we found a gene. That would be (very) roughly 20%. I do not have the precision/recall stats of the recogniser at hand, but extrapolating our result via precision/recall, it might turn out that between 20% and 40% of the abstracts alone contain gene mentions.

Right now we are in the process of writing up our results, which are freely downloadable ( If you would like to figure this out more precisely, then please get in touch with me.

Benjamin Good said...

Thanks Joachim,
Could you clarify what your corpus was? Does 'MEDLINE's baseline 2010 abstracts' mean all of them up until 2010, or something else? If it does, I think it should be more like 20 million, not 10?
I would certainly like to know more about this work and to figure it out more precisely as you say.

Joachim said...

Yes, MEDLINE's baseline 2010 contains titles and abstracts up until 2010 ( You are also right that there are almost 20M entries in the baseline catalog (18.5M), but not all of them have an abstract associated with them. So, we only focused on the 10M entries for which there is an abstract.

I will try to get hold of the precision/recall of the gene-/species-recogniser pair we used, and get back to you about that. Feel free to download the unprocessed text-mining data under:

Joachim said...

The precision/recall of the gene recogniser are 90% and 73% respectively ( We used LINNAEUS to determine the correct species for gene mentions, which comes with 90% and 98% precision/recall (

Very naively, I would now say that MEDLINE's baseline 2010 has 1.7M * 90% / 73% * 90% / 98% = 1.9M gene mentions in its titles + abstracts. That would mean that 19% of titles/abstracts -- for which there is an abstract -- have a gene mention, or when considering all 18.5M MEDLINE documents, we get 10%.
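
For anyone who wants to check or vary the numbers, the naive arithmetic above can be sketched in a few lines of Python. This is only an illustration of the back-of-the-envelope correction described in this comment, not a proper statistical estimate; the function name corrected_count is made up for the example, and the inputs are the rounded figures quoted in this thread.

```python
# Naive precision/recall correction, using the rounded figures from this thread.
# corrected_count is a hypothetical helper name, introduced only for illustration.

def corrected_count(observed, precision, recall):
    """Scale an observed hit count: keep the true positives (x precision),
    then compensate for the hits that were missed (/ recall)."""
    return observed * precision / recall

observed_docs = 1.7e6                 # PubMed IDs flagged by the gene recogniser
gene_p, gene_r = 0.90, 0.73           # gene recogniser precision / recall
species_p, species_r = 0.90, 0.98     # LINNAEUS precision / recall

# Apply the gene correction first, then the species correction.
estimate = corrected_count(
    corrected_count(observed_docs, gene_p, gene_r),
    species_p, species_r,
)

docs_with_abstract = 10e6             # baseline entries that have an abstract
all_docs = 18.5e6                     # all entries in the 2010 baseline

share_with_abstract = estimate / docs_with_abstract
share_all = estimate / all_docs

print(f"estimated documents with a gene mention: {estimate / 1e6:.1f}M")
print(f"share of entries with an abstract: {share_with_abstract:.0%}")
print(f"share of all baseline entries: {share_all:.0%}")
```

Changing the precision/recall values shows how sensitive the final percentage is to the recogniser's error rates.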

This is only a rough estimate, but certainly nowhere near the originally suggested 40%.

Benjamin Good said...

Thanks again, very helpful data point. I added it to the thread that appeared about this post on Friendfeed