Solution of 6.6.1, Lord of the Fruit Flies

Identify the number of papers in PubMed that has Drosophila virilis in the title or abstract



In [1]:

    
from Bio import Entrez
import re

We construct an esearch request and use the NCBI history function in order to refer to this search in our subsequent efetch call.



In [2]:

    
# Always tell NCBI who you are (edit the e-mail below!)
Entrez.email = "your_name@yourmailhost.com"
handle = Entrez.esearch(db="pubmed", 
                        term="Drosophila virilis[Title/Abstract]",
                        usehistory="y")
record = Entrez.read(handle)
# generate a Python list with all Pubmed IDs of articles about D. virilis
id_list = record["IdList"]
record["Count"]









    Out[2]:





'543'



In [3]:

    
webenv = record["WebEnv"]
query_key = record["QueryKey"]

Retrieve the PubMed entries using our search history



In [4]:

    
handle = Entrez.efetch(db="pubmed",
                       rettype="medline", 
                       retmode="text", 
                       retstart=0,
retmax=543, webenv=webenv, query_key=query_key)



In [5]:

    
out_handle = open("D_virilis_pubs.txt", "w")
data = handle.read()
handle.close()
out_handle.write(data)
out_handle.close()

Count the number of contributions per author

We construct a dictionary with all authors as keys and the number of contributions as value.



In [6]:

    
with open("D_virilis_pubs.txt") as datafile:
    author_dict = {}
    for line in datafile:
        if re.match("AU", line):
            # capture author
            author = line.split("-", 1)[1]
            # remove leading and trailing whitespace
            author = author.strip()
            # if key is present, add 1
            # if it's not present, initialize at 1
            author_dict[author] = 1 + author_dict.get(author, 0)

Find the top five researchers

Dictionaries do not have a natural order but we can sort a dictionary based on the values using the function sorted. We retrieve the number of contributions per author from our author_dict using author_dict.get and use it as value in the sorted function. sorted returns a list that can be indexed to return only the top 5 of researchers.



In [7]:

    
for author in sorted(author_dict, key = author_dict.get, reverse = True)[:5]:
    print(author, ":", author_dict[author])









    



Gruntenko NE : 36
Evgen'ev MB : 30
Hoikkala A : 24
Raushenbakh IIu : 24
Korochkin LI : 22