Building the dataset of research papers

The Entrez module, a part of the Biopython library, will be used to interface with PubMed.
You can download Biopython from here.

This notebook covers several of the steps described in the Biopython Tutorial, specifically in Chapter 9, Accessing NCBI’s Entrez databases.


In [13]:
from Bio import Entrez

# NCBI requires you to set your email address to make use of NCBI's E-utilities
Entrez.email = "Your.Name.Here@example.org"

The datasets will be saved as serialized Python objects, compressed with bzip2. Saving/loading them will therefore require the pickle and bz2 modules.


In [14]:
import pickle, bz2, os
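
As a minimal sketch of the save/load round trip described above, the two helpers below wrap pickle and bz2 in context managers so the file is always closed. The names `save_compressed` and `load_compressed` are hypothetical and are not used elsewhere in this notebook.

```python
import bz2
import os
import pickle
import tempfile

def save_compressed(obj, path):
    """Pickle `obj` and write it to `path`, compressed with bzip2."""
    with bz2.open(path, 'wb') as f:
        pickle.dump(obj, f)

def load_compressed(path):
    """Read back an object written by `save_compressed`."""
    with bz2.open(path, 'rb') as f:
        return pickle.load(f)

# round trip through a temporary file
tmp_path = os.path.join(tempfile.mkdtemp(), 'example.pkl.bz2')
save_compressed([26933487, 24977986], tmp_path)
restored = load_compressed(tmp_path)
```

Using `with` ensures the compressed stream is flushed and closed before the file is read again.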

EInfo: Obtaining information about the Entrez databases


In [15]:
# accessing extended information about the PubMed database
pubmed = Entrez.read( Entrez.einfo(db="pubmed"), validate=False )[u'DbInfo']

# list of possible search fields for use with ESearch:
search_fields = { f['Name']:f['Description'] for f in pubmed["FieldList"] }

In search_fields, we find 'TIAB' ('Free text associated with Abstract/Title') as a possible search field to use in searches.


In [16]:
search_fields


Out[16]:
{'AFFL': "Author's institutional affiliation and address",
 'ALL': 'All terms from all searchable fields',
 'AUCL': 'Author Cluster ID',
 'AUID': 'Author Identifier',
 'AUTH': 'Author(s) of publication',
 'BOOK': 'ID of the book that contains the document',
 'CDAT': 'Date of completion',
 'CNTY': 'Country of publication',
 'COLN': 'Corporate Author of publication',
 'CRDT': 'Date publication first accessible through Entrez',
 'DSO': 'Additional text from the summary',
 'ECNO': 'EC number for enzyme or CAS registry number',
 'ED': "Section's Editor",
 'EDAT': 'Date publication first accessible through Entrez',
 'EID': 'Extended PMID',
 'EPDT': 'Date of Electronic publication',
 'FAUT': 'First Author of publication',
 'FILT': 'Limits the records',
 'FINV': 'Full name of investigator',
 'FULL': 'Full Author Name(s) of publication',
 'GRNT': 'NIH Grant Numbers',
 'INVR': 'Investigator',
 'ISBN': 'ISBN',
 'ISS': 'Issue number of publication',
 'JOUR': 'Journal abbreviation of publication',
 'LANG': 'Language of publication',
 'LAUT': 'Last Author of publication',
 'LID': 'ELocation ID',
 'MAJR': 'MeSH terms of major importance to publication',
 'MDAT': 'Date of last modification',
 'MESH': 'Medical Subject Headings assigned to publication',
 'MHDA': 'Date publication was indexed with MeSH terms',
 'OTRM': 'Other terms associated with publication',
 'PAGE': 'Page number(s) of publication',
 'PAPX': 'MeSH pharmacological action pre-explosions',
 'PDAT': 'Date of publication',
 'PID': 'Publisher ID',
 'PPDT': 'Date of print publication',
 'PS': 'Personal Name as Subject',
 'PTYP': 'Type of publication (e.g., review)',
 'PUBN': "Publisher's name",
 'SI': 'Cross-reference from publication to other databases',
 'SUBH': 'Additional specificity for MeSH term',
 'SUBS': 'CAS chemical name or MEDLINE Substance Name',
 'TIAB': 'Free text associated with Abstract/Title',
 'TITL': 'Words in title of publication',
 'TT': 'Words in transliterated title of publication',
 'UID': 'Unique number assigned to publication',
 'VOL': 'Volume number of publication',
 'WORD': 'Free text associated with publication'}

ESearch: Searching the Entrez databases

To have a look at the kind of data we get when searching the database, we'll perform a search for papers authored by Haasdijk:


In [17]:
example_authors = ['Haasdijk E']
example_search = Entrez.read( Entrez.esearch( db="pubmed", term=' AND '.join([a+'[AUTH]' for a in example_authors]) ) )
example_search


Out[17]:
{'RetStart': '0', 'TranslationSet': [], 'TranslationStack': [{'Field': 'Author', 'Explode': 'N', 'Count': '31', 'Term': 'Haasdijk E[Author]'}, 'GROUP'], 'QueryTranslation': 'Haasdijk E[Author]', 'RetMax': '20', 'IdList': ['26933487', '24977986', '24901702', '24852945', '24708899', '24252306', '23580075', '23144668', '22174697', '22154920', '21870131', '21760539', '20662596', '20602234', '20386726', '18579581', '18305242', '17913916', '17804640', '17686042'], 'Count': '31'}
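
The search term passed to ESearch above is simply each value followed by a field tag in square brackets, joined with a boolean operator. A small sketch of that string construction follows. `build_term` is a hypothetical helper, and the second author name is a made-up placeholder used only to show the joining.

```python
def build_term(values, field, operator='AND'):
    """Tag each value with an Entrez search field and join them with `operator`."""
    tagged = ['%s[%s]' % (value, field) for value in values]
    return (' %s ' % operator).join(tagged)

single = build_term(['Haasdijk E'], 'AUTH')
# 'Doe J' is a placeholder name, not taken from the data above
combined = build_term(['Haasdijk E', 'Doe J'], 'AUTH')
```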

Note that the values returned are not native Python strings:


In [18]:
type( example_search['IdList'][0] )


Out[18]:
Bio.Entrez.Parser.StringElement

The part of the query's result we are most interested in is accessible through the 'IdList' key:


In [19]:
example_ids = [ int(id) for id in example_search['IdList'] ]
print(example_ids)


[26933487, 24977986, 24901702, 24852945, 24708899, 24252306, 23580075, 23144668, 22174697, 22154920, 21870131, 21760539, 20662596, 20602234, 20386726, 18579581, 18305242, 17913916, 17804640, 17686042]

PubMed IDs dataset

We will now assemble a dataset composed of research articles containing the keyword "malaria" in either their titles or abstracts.


In [20]:
search_term = 'malaria'

In [21]:
Ids_file = 'data/' + search_term + '__Ids.pkl.bz2'

In [22]:
if os.path.exists( Ids_file ):
    Ids = pickle.load( bz2.BZ2File( Ids_file, 'rb' ) )
else:
    # determine the number of hits for the search term
    search = Entrez.read( Entrez.esearch( db="pubmed", term=search_term+'[TIAB]', retmax=0 ) )
    total = int( search['Count'] )
    
    # `Ids` will be incrementally assembled, by performing multiple queries,
    # each returning at most `retrieve_per_query` entries.
    Ids_str = []
    retrieve_per_query = 10000
    
    for start in range( 0, total, retrieve_per_query ):
        print('Fetching IDs of results [%d,%d)' % ( start, min( start+retrieve_per_query, total ) ) )
        s = Entrez.read( Entrez.esearch( db="pubmed", term=search_term+'[TIAB]', retstart=start, retmax=retrieve_per_query ) )
        Ids_str.extend( s[ u'IdList' ] )
    
    # convert Ids to integers (and ensure that the conversion is reversible)
    Ids = [ int(id) for id in Ids_str ]
    
    for (id_str, id_int) in zip(Ids_str, Ids):
        if str(id_int) != id_str:
            raise Exception('Conversion of PubMed ID %s from string to integer is not reversible.' % id_str )
    
    # Save list of Ids
    pickle.dump( Ids, bz2.BZ2File( Ids_file, 'wb' ) )
    
total = len( Ids )
print('%d documents contain the search term "%s".' % ( total, search_term ) )


67028 documents contain the search term "malaria".
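
The paging logic in the loop above can be looked at in isolation. The sketch below uses a hypothetical helper, `batch_ranges`, to list the half-open index ranges covered by each query. It clamps the end of the last range to the total, so the final batch is reported with its true size.

```python
def batch_ranges(total, batch_size):
    """Yield (start, stop) pairs covering range(total) in steps of `batch_size`."""
    for start in range(0, total, batch_size):
        yield start, min(start + batch_size, total)

# 67028 hits fetched 10000 at a time, as in the cell above
ranges = list(batch_ranges(67028, 10000))
```

Here `ranges[0]` is `(0, 10000)` and the last of the seven entries is `(60000, 67028)`.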

Taking a look at what we just retrieved, here are the first 5 elements of the Ids list:


In [23]:
Ids[:5]


Out[23]:
[27749938, 27749907, 27748596, 27748303, 27748294]

ESummary: Retrieving summaries from primary IDs

To have a look at the kind of metadata we get from a call to Entrez.esummary(), we now fetch the summary of one of the papers in our dataset (using one of the PubMed IDs we obtained in the previous section):


In [12]:
example_paper = Entrez.read( Entrez.esummary(db="pubmed", id='27749938') )[0]

def print_dict( p ):
    for k,v in p.items():
        print(k)
        print('\t', v)

print_dict(example_paper)


RecordStatus
	 Unknown status
History
	 {'received': '2016/06/09 00:00', 'pubmed': ['2016/10/18 06:00'], 'accepted': '2016/09/29 00:00', 'medline': ['2016/10/18 06:00']}
DOI
	 10.1371/journal.pone.0164685
ESSN
	 1932-6203
AuthorList
	 ['Adde A', 'Roux E', 'Mangeas M', 'Dessay N', 'Nacher M', 'Dusfour I', 'Girod R', 'Briolant S']
PmcRefCount
	 0
NlmUniqueID
	 101285081
LastAuthor
	 Briolant S
References
	 []
LangList
	 ['English']
EPubDate
	 2016 Oct 17
Source
	 PLoS One
PubStatus
	 
PubTypeList
	 ['Journal Article']
HasAbstract
	 1
FullJournalName
	 PloS one
Item
	 []
PubDate
	 2016 Oct 17
Pages
	 e0164685
Issue
	 10
SO
	 2016 Oct 17;11(10):e0164685
ELocationID
	 doi: 10.1371/journal.pone.0164685
Title
	 Dynamical Mapping of Anopheles darlingi Densities in a Residual Malaria Transmission Area of French Guiana by Using Remote Sensing and Meteorological Data.
ISSN
	 
ArticleIds
	 {'eid': '27749938', 'pii': 'PONE-D-16-23217', 'pubmed': ['27749938'], 'rid': '27749938', 'medline': [], 'doi': '10.1371/journal.pone.0164685'}
Id
	 27749938
Volume
	 11

For now, we'll keep just some basic information for each paper: title, list of authors, publication year, and DOI.

In case you are not familiar with the DOI system, know that the paper above can be accessed through the link http://dx.doi.org/10.1371/journal.pone.0164685 (which is http://dx.doi.org/ followed by the paper's DOI).


In [13]:
( example_paper['Title'], example_paper['AuthorList'], int(example_paper['PubDate'][:4]), example_paper['DOI'] )


Out[13]:
('Dynamical Mapping of Anopheles darlingi Densities in a Residual Malaria Transmission Area of French Guiana by Using Remote Sensing and Meteorological Data.',
 ['Adde A', 'Roux E', 'Mangeas M', 'Dessay N', 'Nacher M', 'Dusfour I', 'Girod R', 'Briolant S'],
 2016,
 '10.1371/journal.pone.0164685')
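
Building the link from a DOI is plain string concatenation. A minimal sketch follows. `doi_url` is a hypothetical helper, and returning None for a missing DOI is an assumption made for this example.

```python
def doi_url(doi, resolver='http://dx.doi.org/'):
    """Return the resolver link for `doi`, or None when no DOI is available."""
    if not doi:
        return None
    return resolver + doi

link = doi_url('10.1371/journal.pone.0164685')
missing = doi_url('')
```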

Summaries dataset

We are now ready to assemble a dataset containing the summaries of all the paper Ids we previously fetched.

To reduce the memory footprint, and to ensure the saved datasets won't depend on Biopython being installed to be properly loaded, values returned by Entrez.read() will be converted to their corresponding native Python types.
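
One way to picture such a conversion is a small recursive function that rebuilds strings, lists, and dictionaries as their built-in types. This is only a sketch: `to_native` is a hypothetical helper, and the `Fake...` classes are stand-ins for the parser's element types (which subclass the built-ins) so that the example runs without Biopython.

```python
def to_native(value):
    """Recursively rebuild str, list and dict subclasses as plain built-ins."""
    if isinstance(value, str):
        return str(value)
    if isinstance(value, dict):
        return {to_native(k): to_native(v) for k, v in value.items()}
    if isinstance(value, list):
        return [to_native(v) for v in value]
    return value

# stand-ins for parser element types; not the real Biopython classes
class FakeString(str):
    pass

class FakeList(list):
    pass

class FakeDict(dict):
    pass

record = FakeDict({
    'Title': FakeString('An example title'),
    'AuthorList': FakeList([FakeString('Adde A'), FakeString('Roux E')]),
})
converted = to_native(record)
```

After conversion every container and string in `converted` is a plain dict, list, or str, so unpickling it does not require the defining classes.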


In [14]:
Summaries_file = 'data/' + search_term + '__Summaries.pkl.bz2'

In [15]:
if os.path.exists( Summaries_file ):
    Summaries = pickle.load( bz2.BZ2File( Summaries_file, 'rb' ) )
else:
    # `Summaries` will be incrementally assembled, by performing multiple queries,
    # each returning at most `retrieve_per_query` entries.
    Summaries = []
    retrieve_per_query = 500
    
    print('Fetching Summaries of results: ')
    for start in range( 0, len(Ids), retrieve_per_query ):
        if (start % 10000 == 0):
            print('')
            print(start, end='')
        else:
            print('.', end='')
        
        # build comma separated string with the ids at indexes [start, start+retrieve_per_query)
        query_ids = ','.join( [ str(id) for id in Ids[ start : start+retrieve_per_query ] ] )
        
        s = Entrez.read( Entrez.esummary( db="pubmed", id=query_ids ) )
        
        # out of the retrieved data, we will keep only a tuple (title, authors, year, DOI), associated with the paper's id.
        # (all values converted to native Python formats)
        f = [
            ( int( p['Id'] ), (
                str( p['Title'] ),
                [ str(a) for a in p['AuthorList'] ],
                int( p['PubDate'][:4] ),                # keeps just the publication year
                str( p.get('DOI', '') )            # papers for which no DOI is available get an empty string in their place
                ) )
            for p in s
            ]
        Summaries.extend( f )
    
    # Save Summaries, as a dictionary indexed by Ids
    Summaries = dict( Summaries )
    
    pickle.dump( Summaries, bz2.BZ2File( Summaries_file, 'wb' ) )


Fetching Summaries of results: 

0...................
10000...................
20000...................
30000...................
40000...................
50000...................
60000..............
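
To make the per-paper extraction easier to follow, here it is applied to a single summary written out as a plain dictionary, including the case where no DOI is present. `summary_tuple` is a hypothetical helper mirroring the list comprehension above. The example values are copied from the summary printed earlier.

```python
def summary_tuple(p):
    """Return (id, (title, authors, year, DOI)) using native Python types."""
    return (
        int(p['Id']),
        (
            str(p['Title']),
            [str(a) for a in p['AuthorList']],
            int(p['PubDate'][:4]),      # 'YYYY Mon DD' -> YYYY
            str(p.get('DOI', '')),      # empty string when no DOI is available
        ),
    )

example = {
    'Id': '27749938',
    'Title': 'Dynamical Mapping of Anopheles darlingi Densities in a Residual '
             'Malaria Transmission Area of French Guiana by Using Remote '
             'Sensing and Meteorological Data.',
    'AuthorList': ['Adde A', 'Roux E', 'Mangeas M', 'Dessay N',
                   'Nacher M', 'Dusfour I', 'Girod R', 'Briolant S'],
    'PubDate': '2016 Oct 17',
    'DOI': '10.1371/journal.pone.0164685',
}
pmid, fields = summary_tuple(example)

# the same record with its DOI removed
without_doi = {k: v for k, v in example.items() if k != 'DOI'}
_, fields_without_doi = summary_tuple(without_doi)
```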

Let us take a look at the first 3 retrieved summaries:


In [16]:
{ id : Summaries[id] for id in Ids[:3] }


Out[16]:
{27748596: ('Identification of a potential anti-malarial drug candidate from a series of 2-aminopyrazines by optimization of aqueous solubility and potency across the parasite life-cycle.',
  ['Le Manach C',
   'Nchinda AT',
   'Paquet T',
   'Gonzalez Cabrera D',
   'Younis Adam Y',
   'Han Z',
   'Bashyam S',
   'Zabiulla M',
   'Taylor D',
   'Lawrence N',
   'White KL',
   'Charman SA',
   'Waterson D',
   'Witty MJ',
   'Wittlin S',
   'Botha ME',
   'Nondaba SH',
   'Reader J',
   'Birkholtz LM',
   'Jimenez-Diaz MB',
   'Martínez-Martínez MS',
   'Ferrer-Bazaga S',
   'Angulo-Barturen I',
   'Meister S',
   'Antonova-Koch Y',
   'Winzeler EA',
   'Street LJ',
   'Chibale K'],
  2016,
  '10.1021/acs.jmedchem.6b01265'),
 27749907: ('Safety and Immunogenicity of Pfs25-EPA/Alhydrogel®, a Transmission Blocking Vaccine against Plasmodium falciparum: An Open Label Study in Malaria Naïve Adults.',
  ['Talaat KR',
   'Ellis RD',
   'Hurd J',
   'Hentrich A',
   'Gabriel E',
   'Hynes NA',
   'Rausch KM',
   'Zhu D',
   'Muratova O',
   'Herrera R',
   'Anderson C',
   'Jones D',
   'Aebig J',
   'Brockley S',
   'MacDonald NJ',
   'Wang X',
   'Fay MP',
   'Healy SA',
   'Durbin AP',
   'Narum DL',
   'Wu Y',
   'Duffy PE'],
  2016,
  '10.1371/journal.pone.0163144'),
 27749938: ('Dynamical Mapping of Anopheles darlingi Densities in a Residual Malaria Transmission Area of French Guiana by Using Remote Sensing and Meteorological Data.',
  ['Adde A',
   'Roux E',
   'Mangeas M',
   'Dessay N',
   'Nacher M',
   'Dusfour I',
   'Girod R',
   'Briolant S'],
  2016,
  '10.1371/journal.pone.0164685')}

EFetch: Downloading full records from Entrez

Entrez.efetch() is the function that will allow us to obtain paper abstracts. Let us start by taking a look at the kind of data it returns when we query PubMed's database.


In [17]:
q = Entrez.read( Entrez.efetch(db="pubmed", id='27749938', retmode="xml") )

q is a list, with each member corresponding to a queried id. Because here we only queried for one id, its results are then in q[0].


In [18]:
type(q), len(q)


Out[18]:
(Bio.Entrez.Parser.ListElement, 1)

At q[0] we find a dictionary containing two keys, the contents of which we print below.


In [19]:
type(q[0]), q[0].keys()


Out[19]:
(Bio.Entrez.Parser.DictionaryElement,
 dict_keys(['MedlineCitation', 'PubmedData']))

In [20]:
print_dict( q[0][ 'PubmedData' ] )


History
	 [DictElement({'Year': '2016', 'Month': '6', 'Day': '9'}, attributes={'PubStatus': 'received'}), DictElement({'Year': '2016', 'Month': '9', 'Day': '29'}, attributes={'PubStatus': 'accepted'}), DictElement({'Hour': '6', 'Minute': '0', 'Year': '2016', 'Month': '10', 'Day': '18'}, attributes={'PubStatus': 'pubmed'}), DictElement({'Hour': '6', 'Minute': '0', 'Year': '2016', 'Month': '10', 'Day': '18'}, attributes={'PubStatus': 'medline'})]
ArticleIdList
	 [StringElement('27749938', attributes={'IdType': 'pubmed'}), StringElement('10.1371/journal.pone.0164685', attributes={'IdType': 'doi'}), StringElement('PONE-D-16-23217', attributes={'IdType': 'pii'})]
PublicationStatus
	 epublish

The key 'MedlineCitation' maps into another dictionary. In that dictionary, most of the information is contained under the key 'Article'. To minimize the clutter, below we show the contents of 'MedlineCitation' excluding its 'Article' member, and below that we then show the contents of 'Article'.


In [21]:
print_dict( { k:v for k,v in q[0][ 'MedlineCitation' ].items() if k!='Article' } )


OtherID
	 []
DateRevised
	 {'Year': '2016', 'Month': '10', 'Day': '18'}
GeneralNote
	 []
OtherAbstract
	 []
DateCreated
	 {'Year': '2016', 'Month': '10', 'Day': '17'}
SpaceFlightMission
	 []
MedlineJournalInfo
	 {'Country': 'United States', 'MedlineTA': 'PLoS One', 'NlmUniqueID': '101285081', 'ISSNLinking': '1932-6203'}
CitationSubset
	 []
PMID
	 27749938
KeywordList
	 []

In [22]:
print_dict( q[0][ 'MedlineCitation' ][ 'Article' ] )


ArticleTitle
	 Dynamical Mapping of Anopheles darlingi Densities in a Residual Malaria Transmission Area of French Guiana by Using Remote Sensing and Meteorological Data.
Abstract
	 {'AbstractText': [StringElement("Local variation in the density of Anopheles mosquitoes and the risk of exposure to bites are essential to explain the spatial and temporal heterogeneities in the transmission of malaria. Vector distribution is driven by environmental factors. Based on variables derived from satellite imagery and meteorological observations, this study aimed to dynamically model and map the densities of Anopheles darlingi in the municipality of Saint-Georges de l'Oyapock (French Guiana). Longitudinal sampling sessions of An. darlingi densities were conducted between September 2012 and October 2014. Landscape and meteorological data were collected and processed to extract a panel of variables that were potentially related to An. darlingi ecology. Based on these data, a robust methodology was formed to estimate a statistical predictive model of the spatial-temporal variations in the densities of An. darlingi in Saint-Georges de l'Oyapock. The final cross-validated model integrated two landscape variables-dense forest surface and built surface-together with four meteorological variables related to rainfall, evapotranspiration, and the minimal and maximal temperatures. Extrapolation of the model allowed the generation of predictive weekly maps of An. darlingi densities at a resolution of 10-m. Our results supported the use of satellite imagery and meteorological data to predict malaria vector densities. Such fine-scale modeling approach might be a useful tool for health authorities to plan control strategies and social communication in a cost-effective, targeted, and timely manner.", attributes={'NlmCategory': 'UNASSIGNED'})]}
PublicationTypeList
	 [StringElement('JOURNAL ARTICLE', attributes={'UI': ''})]
Journal
	 {'ISOAbbreviation': 'PLoS ONE', 'Title': 'PloS one', 'JournalIssue': DictElement({'Issue': '10', 'PubDate': {'Year': '2016'}, 'Volume': '11'}, attributes={'CitedMedium': 'Internet'}), 'ISSN': StringElement('1932-6203', attributes={'IssnType': 'Electronic'})}
ArticleDate
	 [DictElement({'Year': '2016', 'Month': 'Oct', 'Day': '17'}, attributes={'DateType': 'Electronic'})]
ELocationID
	 [StringElement('10.1371/journal.pone.0164685', attributes={'ValidYN': 'Y', 'EIdType': 'doi'})]
Pagination
	 {'MedlinePgn': 'e0164685'}
Language
	 ['ENG']
AuthorList
	 ListElement([DictElement({'Identifier': [], 'AffiliationInfo': [{'Identifier': [], 'Affiliation': "Unité d'Entomologie Médicale, Institut Pasteur de la Guyane, Cayenne, French Guiana."}], 'ForeName': 'Antoine', 'LastName': 'Adde', 'Initials': 'A'}, attributes={'ValidYN': 'Y'}), DictElement({'Identifier': [], 'AffiliationInfo': [{'Identifier': [], 'Affiliation': 'UMR ESPACE-DEV, Institut de Recherche pour le Développement, Montpellier, France.'}], 'ForeName': 'Emmanuel', 'LastName': 'Roux', 'Initials': 'E'}, attributes={'ValidYN': 'Y'}), DictElement({'Identifier': [], 'AffiliationInfo': [{'Identifier': [], 'Affiliation': 'UMR ESPACE-DEV, Institut de Recherche pour le Développement, Montpellier, France.'}], 'ForeName': 'Morgan', 'LastName': 'Mangeas', 'Initials': 'M'}, attributes={'ValidYN': 'Y'}), DictElement({'Identifier': [], 'AffiliationInfo': [{'Identifier': [], 'Affiliation': 'UMR ESPACE-DEV, Institut de Recherche pour le Développement, Montpellier, France.'}], 'ForeName': 'Nadine', 'LastName': 'Dessay', 'Initials': 'N'}, attributes={'ValidYN': 'Y'}), DictElement({'Identifier': [], 'AffiliationInfo': [{'Identifier': [], 'Affiliation': "Centre d'Investigation Clinique et Epidémiologie Clinique Antilles Guyane, Centre hospitalier Andrée-Rosemon, Cayenne, French Guiana."}], 'ForeName': 'Mathieu', 'LastName': 'Nacher', 'Initials': 'M'}, attributes={'ValidYN': 'Y'}), DictElement({'Identifier': [], 'AffiliationInfo': [{'Identifier': [], 'Affiliation': "Unité d'Entomologie Médicale, Institut Pasteur de la Guyane, Cayenne, French Guiana."}], 'ForeName': 'Isabelle', 'LastName': 'Dusfour', 'Initials': 'I'}, attributes={'ValidYN': 'Y'}), DictElement({'Identifier': [], 'AffiliationInfo': [{'Identifier': [], 'Affiliation': "Unité d'Entomologie Médicale, Institut Pasteur de la Guyane, Cayenne, French Guiana."}], 'ForeName': 'Romain', 'LastName': 'Girod', 'Initials': 'R'}, attributes={'ValidYN': 'Y'}), DictElement({'Identifier': [], 'AffiliationInfo': [{'Identifier': [], 
'Affiliation': "Unité d'Entomologie Médicale, Institut Pasteur de la Guyane, Cayenne, French Guiana."}, {'Identifier': [], 'Affiliation': 'Direction Interarmées du Service de Santé en Guyane, Cayenne, French Guiana.'}, {'Identifier': [], 'Affiliation': "Unité de Parasitologie et d'Entomologie Médicale, Institut de Recherche Biomédicale des Armées, Marseille, France."}, {'Identifier': [], 'Affiliation': 'Unité de Recherche en Maladies Infectieuses Tropicales Emergentes, Faculté de Médecine La Timone, Marseille, France.'}], 'ForeName': 'Sébastien', 'LastName': 'Briolant', 'Initials': 'S'}, attributes={'ValidYN': 'Y'})], attributes={'CompleteYN': 'Y', 'Type': 'authors'})

A paper's abstract can therefore be accessed with:


In [23]:
{ int(q[0]['MedlineCitation']['PMID']) : str(q[0]['MedlineCitation']['Article']['Abstract']['AbstractText'][0]) }


Out[23]:
{27749938: "Local variation in the density of Anopheles mosquitoes and the risk of exposure to bites are essential to explain the spatial and temporal heterogeneities in the transmission of malaria. Vector distribution is driven by environmental factors. Based on variables derived from satellite imagery and meteorological observations, this study aimed to dynamically model and map the densities of Anopheles darlingi in the municipality of Saint-Georges de l'Oyapock (French Guiana). Longitudinal sampling sessions of An. darlingi densities were conducted between September 2012 and October 2014. Landscape and meteorological data were collected and processed to extract a panel of variables that were potentially related to An. darlingi ecology. Based on these data, a robust methodology was formed to estimate a statistical predictive model of the spatial-temporal variations in the densities of An. darlingi in Saint-Georges de l'Oyapock. The final cross-validated model integrated two landscape variables-dense forest surface and built surface-together with four meteorological variables related to rainfall, evapotranspiration, and the minimal and maximal temperatures. Extrapolation of the model allowed the generation of predictive weekly maps of An. darlingi densities at a resolution of 10-m. Our results supported the use of satellite imagery and meteorological data to predict malaria vector densities. Such fine-scale modeling approach might be a useful tool for health authorities to plan control strategies and social communication in a cost-effective, targeted, and timely manner."}

A paper for which no abstract is available will simply not contain the 'Abstract' key in its 'Article' dictionary:


In [24]:
print_dict( Entrez.read( Entrez.efetch(db="pubmed", id='17782550', retmode="xml") )[0]['MedlineCitation']['Article'] )


ArticleTitle
	 EVOLUTION OF LOCOMOTIVES IN AMERICA.
PublicationTypeList
	 [StringElement('Journal Article', attributes={'UI': 'D016428'})]
Journal
	 {'ISOAbbreviation': 'Science', 'Title': 'Science (New York, N.Y.)', 'JournalIssue': DictElement({'Issue': '3', 'PubDate': {'Year': '1880', 'Month': 'Jul', 'Day': '17'}, 'Volume': '1'}, attributes={'CitedMedium': 'Print'}), 'ISSN': StringElement('0036-8075', attributes={'IssnType': 'Print'})}
ArticleDate
	 []
ELocationID
	 []
Pagination
	 {'MedlinePgn': '35'}
Language
	 ['ENG']

Some of the ids in our dataset refer to books from the NCBI Bookshelf, a collection of freely available, downloadable, on-line versions of selected biomedical books. For such ids, Entrez.efetch() returns a slightly different structure, where the keys [u'BookDocument', u'PubmedBookData'] take the place of the [u'MedlineCitation', u'PubmedData'] keys we saw above.

Here is an example of the data we obtain for the id corresponding to the book The Social Biology of Microbial Communities:


In [25]:
r = Entrez.read( Entrez.efetch(db="pubmed", id='24027805', retmode="xml") )

In [26]:
print_dict( r[0][ 'PubmedBookData' ] )


History
	 [DictElement({'Hour': '6', 'Minute': '0', 'Year': '2013', 'Month': '9', 'Day': '13'}, attributes={'PubStatus': 'pubmed'}), DictElement({'Hour': '6', 'Minute': '0', 'Year': '2013', 'Month': '9', 'Day': '13'}, attributes={'PubStatus': 'medline'}), DictElement({'Hour': '6', 'Minute': '0', 'Year': '2013', 'Month': '9', 'Day': '13'}, attributes={'PubStatus': 'entrez'})]
ArticleIdList
	 [StringElement('24027805', attributes={'IdType': 'pubmed'})]
PublicationStatus
	 ppublish

In [27]:
print_dict( r[0][ 'BookDocument' ] )


Sections
	 [{'Section': [], 'SectionTitle': StringElement('THE NATIONAL ACADEMIES', attributes={'book': 'nap13500', 'part': 'fm.s1'})}, {'Section': [], 'SectionTitle': StringElement('PLANNING COMMITTEE FOR A WORKSHOP ON THE MICROBIOME IN HEALTH AND DISEASE', attributes={'book': 'nap13500', 'part': 'fm.s2'})}, {'Section': [], 'SectionTitle': StringElement('FORUM ON MICROBIAL THREATS', attributes={'book': 'nap13500', 'part': 'fm.s3'})}, {'Section': [], 'SectionTitle': StringElement('BOARD ON GLOBAL HEALTH', attributes={'book': 'nap13500', 'part': 'fm.s5'})}, {'Section': [], 'SectionTitle': StringElement('Reviewers', attributes={'book': 'nap13500', 'part': 'fm.s7'})}, {'Section': [], 'SectionTitle': StringElement('Acknowledgments', attributes={'book': 'nap13500', 'part': 'fm.ack'})}, {'Section': [], 'SectionTitle': StringElement('Workshop Overview', attributes={'book': 'nap13500', 'part': 'workshop'})}, {'Section': [], 'SectionTitle': StringElement('Appendixes', attributes={'book': 'nap13500', 'part': 'nap13500.appgroup1'})}]
ItemList
	 []
PMID
	 24027805
Abstract
	 {'AbstractText': ['On March 6 and 7, 2012, the Institute of Medicine’s (IOM’s) Forum on Microbial Threats hosted a public workshop to explore the emerging science of the “social biology” of microbial communities. Workshop presentations and discussions embraced a wide spectrum of topics, experimental systems, and theoretical perspectives representative of the current, multifaceted exploration of the microbial frontier. Participants discussed ecological, evolutionary, and genetic factors contributing to the assembly, function, and stability of microbial communities; how microbial communities adapt and respond to environmental stimuli; theoretical and experimental approaches to advance this nascent field; and potential applications of knowledge gained from the study of microbial communities for the improvement of human, animal, plant, and ecosystem health and toward a deeper understanding of microbial diversity and evolution.'], 'CopyrightInformation': 'Copyright © 2012, National Academy of Sciences.'}
LocationLabel
	 []
Book
	 {'ELocationID': [], 'BookTitle': StringElement('The Social Biology of Microbial Communities: Workshop Summary', attributes={'book': 'nap13500'}), 'CollectionTitle': StringElement('The National Academies Collection: Reports funded by National Institutes of Health', attributes={'book': 'napcollect'}), 'Isbn': ['9780309264327', '0309264324'], 'Publisher': {'PublisherLocation': 'Washington (DC)', 'PublisherName': 'National Academies Press (US)'}, 'PubDate': {'Year': '2012'}, 'AuthorList': [ListElement([DictElement({'Identifier': [], 'AffiliationInfo': [], 'CollectiveName': 'Institute of Medicine (US) Forum on Microbial Threats'}, attributes={'ValidYN': 'Y'})], attributes={'Type': 'authors', 'CompleteYN': 'Y'})]}
ArticleIdList
	 [StringElement('NBK114831', attributes={'IdType': 'bookaccession'}), StringElement('10.17226/13500', attributes={'IdType': 'doi'})]
AuthorList
	 []
Language
	 ['eng']
KeywordList
	 []
PublicationType
	 [StringElement('Review', attributes={'UI': 'D016454'})]

For a book from the NCBI Bookshelf, the abstract can be accessed as follows:


In [28]:
{ int(r[0]['BookDocument']['PMID']) : str(r[0]['BookDocument']['Abstract']['AbstractText'][0]) }


Out[28]:
{24027805: 'On March 6 and 7, 2012, the Institute of Medicine’s (IOM’s) Forum on Microbial Threats hosted a public workshop to explore the emerging science of the “social biology” of microbial communities. Workshop presentations and discussions embraced a wide spectrum of topics, experimental systems, and theoretical perspectives representative of the current, multifaceted exploration of the microbial frontier. Participants discussed ecological, evolutionary, and genetic factors contributing to the assembly, function, and stability of microbial communities; how microbial communities adapt and respond to environmental stimuli; theoretical and experimental approaches to advance this nascent field; and potential applications of knowledge gained from the study of microbial communities for the improvement of human, animal, plant, and ecosystem health and toward a deeper understanding of microbial diversity and evolution.'}
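
The two record layouts seen so far can be handled by one function. The sketch below works on plain dictionaries shaped like the records printed above. `extract_abstract` is a hypothetical helper. Since 'AbstractText' is a list, the sketch joins all of its entries with a space instead of keeping only the first. That is an assumption about how multi-part abstracts should be handled, not something the cells above do.

```python
def extract_abstract(record):
    """Return (pmid, abstract) for an article or book record; '' if no abstract."""
    if 'MedlineCitation' in record:
        pmid = record['MedlineCitation']['PMID']
        container = record['MedlineCitation']['Article']
    elif 'BookDocument' in record:
        pmid = record['BookDocument']['PMID']
        container = record['BookDocument']
    else:
        raise ValueError('Unrecognized record type (keys: %s)' % sorted(record.keys()))
    parts = container.get('Abstract', {}).get('AbstractText', [])
    return int(pmid), ' '.join(str(part) for part in parts)

# plain-dict stand-ins with the same nesting as the parsed records
article = {'MedlineCitation': {'PMID': '27749938',
                               'Article': {'Abstract': {'AbstractText': ['First part.', 'Second part.']}}},
           'PubmedData': {}}
article_without_abstract = {'MedlineCitation': {'PMID': '17782550', 'Article': {}},
                            'PubmedData': {}}
book = {'BookDocument': {'PMID': '24027805',
                         'Abstract': {'AbstractText': ['Book abstract.']}},
        'PubmedBookData': {}}

results = [extract_abstract(r) for r in (article, article_without_abstract, book)]
```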

Abstracts dataset

We can now assemble a dataset mapping paper ids to their abstracts.


In [29]:
Abstracts_file = 'data/' + search_term + '__Abstracts.pkl.bz2'

In [30]:
import http.client
from collections import deque

if os.path.exists( Abstracts_file ):
    Abstracts = pickle.load( bz2.BZ2File( Abstracts_file, 'rb' ) )
else:
    # `Abstracts` will be incrementally assembled, by performing multiple queries,
    # each returning at most `retrieve_per_query` entries.
    Abstracts = deque()
    retrieve_per_query = 500
    
    print('Fetching Abstracts of results: ')
    for start in range( 0, len(Ids), retrieve_per_query ):
        if (start % 10000 == 0):
            print('')
            print(start, end='')
        else:
            print('.', end='')
        
        # build comma separated string with the ids at indexes [start, start+retrieve_per_query)
        query_ids = ','.join( [ str(id) for id in Ids[ start : start+retrieve_per_query ] ] )
        
        # issue requests to the server, until we get the full amount of data we expect
        while True:
            try:
                s = Entrez.read( Entrez.efetch(db="pubmed", id=query_ids, retmode="xml" ) )
            except http.client.IncompleteRead:
                print('r', end='')
                continue
            break
        
        i = 0
        for p in s:
            abstr = ''
            if 'MedlineCitation' in p:
                pmid = p['MedlineCitation']['PMID']
                if 'Abstract' in p['MedlineCitation']['Article']:
                    abstr = p['MedlineCitation']['Article']['Abstract']['AbstractText'][0]
            elif 'BookDocument' in p:
                pmid = p['BookDocument']['PMID']
                if 'Abstract' in p['BookDocument']:
                    abstr = p['BookDocument']['Abstract']['AbstractText'][0]
            else:
                raise Exception('Unrecognized record type, for id %d (keys: %s)' % (Ids[start+i], str(p.keys())) )
            
            Abstracts.append( (int(pmid), str(abstr)) )
            i += 1
    
    # Save Abstracts, as a dictionary indexed by Ids
    Abstracts = dict( Abstracts )
    
    pickle.dump( Abstracts, bz2.BZ2File( Abstracts_file, 'wb' ) )


Fetching Abstracts of results: 

0...................
10000...................
20000...................
30000...................
40000...................
50000...................
60000..............
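
The retry loop in the cell above repeats a request for as long as the server keeps returning incomplete data. A bounded variant is sketched below, assuming one would rather give up after a fixed number of attempts. `fetch_with_retry`, its parameters, and `flaky_fetch` are hypothetical names. The fetch itself is passed in as a function so the example runs without network access.

```python
import http.client
import time

def fetch_with_retry(fetch, max_attempts=5, delay_seconds=1.0):
    """Call `fetch()` until it succeeds, retrying on IncompleteRead."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except http.client.IncompleteRead as error:
            last_error = error
            if attempt < max_attempts:
                # wait a little longer after each failed attempt
                time.sleep(delay_seconds * attempt)
    raise RuntimeError('giving up after %d attempts' % max_attempts) from last_error

# simulated request that fails twice before succeeding
calls = {'n': 0}

def flaky_fetch():
    calls['n'] += 1
    if calls['n'] < 3:
        raise http.client.IncompleteRead(b'')
    return ['record']

result = fetch_with_retry(flaky_fetch, delay_seconds=0)
```

In the notebook's loop, `fetch` would correspond to a function wrapping the Entrez.read( Entrez.efetch(...) ) call for one batch of ids.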

Taking a look at one paper's abstract:


In [31]:
Abstracts[27749938]


Out[31]:
"Local variation in the density of Anopheles mosquitoes and the risk of exposure to bites are essential to explain the spatial and temporal heterogeneities in the transmission of malaria. Vector distribution is driven by environmental factors. Based on variables derived from satellite imagery and meteorological observations, this study aimed to dynamically model and map the densities of Anopheles darlingi in the municipality of Saint-Georges de l'Oyapock (French Guiana). Longitudinal sampling sessions of An. darlingi densities were conducted between September 2012 and October 2014. Landscape and meteorological data were collected and processed to extract a panel of variables that were potentially related to An. darlingi ecology. Based on these data, a robust methodology was formed to estimate a statistical predictive model of the spatial-temporal variations in the densities of An. darlingi in Saint-Georges de l'Oyapock. The final cross-validated model integrated two landscape variables-dense forest surface and built surface-together with four meteorological variables related to rainfall, evapotranspiration, and the minimal and maximal temperatures. Extrapolation of the model allowed the generation of predictive weekly maps of An. darlingi densities at a resolution of 10-m. Our results supported the use of satellite imagery and meteorological data to predict malaria vector densities. Such fine-scale modeling approach might be a useful tool for health authorities to plan control strategies and social communication in a cost-effective, targeted, and timely manner."

To understand how to obtain paper citations with Entrez, we will first assemble a small set of PubMed IDs, and then query for their citations. To that end, we search here for papers published in the PLOS Computational Biology journal that, as before, also have the word "malaria" in either the title or the abstract. ESearch returns at most 20 IDs by default, which is enough for this illustration:


In [32]:
CA_search_term = search_term+'[TIAB] AND PLoS computational biology[JOUR]'
CA_ids = Entrez.read( Entrez.esearch( db="pubmed", term=CA_search_term ) )['IdList']
CA_ids


Out[32]:
['27509368', '27043913', '26890485', '26764905', '25590612', '25187979', '24465196', '24465193', '24348235', '24244127', '24204241', '24146604', '24130474', '23874190', '23785271', '23637586', '23637585', '23093922', '22615546', '22511852']

In [33]:
CA_summ = {
    p['Id'] : ( p['Title'], p['AuthorList'], p['PubDate'][:4], p['FullJournalName'], p.get('DOI', '') )
    for p in Entrez.read( Entrez.esummary(db="pubmed", id=','.join( CA_ids )) )
    }
CA_summ


Out[33]:
{'22511852': ('Evolution of the multi-domain structures of virulence genes in the human malaria parasite, Plasmodium falciparum.',
  ['Buckee CO', 'Recker M'],
  '2012',
  'PLoS computational biology',
  '10.1371/journal.pcbi.1002451'),
 '22615546': ('A spatial model of mosquito host-seeking behavior.',
  ['Cummins B', 'Cortez R', 'Foppa IM', 'Walbeck J', 'Hyman JM'],
  '2012',
  'PLoS computational biology',
  '10.1371/journal.pcbi.1002500'),
 '23093922': ('The dynamics of naturally acquired immunity to Plasmodium falciparum infection.',
  ['Pinkevych M', 'Petravic J', 'Chelimo K', 'Kazura JW', 'Moormann AM', 'Davenport MP'],
  '2012',
  'PLoS computational biology',
  '10.1371/journal.pcbi.1002729'),
 '23637585': ('Biomarker discovery by sparse canonical correlation analysis of complex clinical phenotypes of tuberculosis and malaria.',
  ['Rousu J', 'Agranoff DD', 'Sodeinde O', 'Shawe-Taylor J', 'Fernandez-Reyes D'],
  '2013',
  'PLoS computational biology',
  '10.1371/journal.pcbi.1003018'),
 '23637586': ("Malaria's missing number: calculating the human component of R0 by a within-host mechanistic model of Plasmodium falciparum infection and transmission.",
  ['Johnston GL', 'Smith DL', 'Fidock DA'],
  '2013',
  'PLoS computational biology',
  '10.1371/journal.pcbi.1003025'),
 '23785271': ('Modelling co-infection with malaria and lymphatic filariasis.',
  ['Slater HC', 'Gambhir M', 'Parham PE', 'Michael E'],
  '2013',
  'PLoS computational biology',
  '10.1371/journal.pcbi.1003096'),
 '23874190': ('Improving pharmacokinetic-pharmacodynamic modeling to investigate anti-infective chemotherapy with application to the current generation of antimalarial drugs.',
  ['Kay K', 'Hastings IM'],
  '2013',
  'PLoS computational biology',
  '10.1371/journal.pcbi.1003151'),
 '24130474': ('A network approach to analyzing highly recombinant malaria parasite genes.',
  ['Larremore DB', 'Clauset A', 'Buckee CO'],
  '2013',
  'PLoS computational biology',
  '10.1371/journal.pcbi.1003268'),
 '24146604': ('Prediction of the P. falciparum target space relevant to malaria drug discovery.',
  ['Spitzmüller A', 'Mestres J'],
  '2013',
  'PLoS computational biology',
  '10.1371/journal.pcbi.1003257'),
 '24204241': ('Natural, persistent oscillations in a spatial multi-strain disease system with application to dengue.',
  ['Lourenço J', 'Recker M'],
  '2013',
  'PLoS computational biology',
  '10.1371/journal.pcbi.1003308'),
 '24244127': ('Improving the modeling of disease data from the government surveillance system: a case study on malaria in the Brazilian Amazon.',
  ['Valle D', 'Clark J'],
  '2013',
  'PLoS computational biology',
  '10.1371/journal.pcbi.1003312'),
 '24348235': ('Inferring developmental stage composition from gene expression in human malaria.',
  ['Joice R', 'Narasimhan V', 'Montgomery J', 'Sidhu AB', 'Oh K', 'Meyer E', 'Pierre-Louis W', 'Seydel K', 'Milner D', 'Williamson K', 'Wiegand R', 'Ndiaye D', 'Daily J', 'Wirth D', 'Taylor T', 'Huttenhower C', 'Marti M'],
  '2013',
  'PLoS computational biology',
  '10.1371/journal.pcbi.1003392'),
 '24465193': ('Immune-mediated competition in rodent malaria is most likely caused by induced changes in innate immune clearance of merozoites.',
  ['Santhanam J', 'Råberg L', 'Read AF', 'Savill NJ'],
  '2014',
  'PLoS computational biology',
  '10.1371/journal.pcbi.1003416'),
 '24465196': ('Modeling within-host effects of drugs on Plasmodium falciparum transmission and prospects for malaria elimination.',
  ['Johnston GL', 'Gething PW', 'Hay SI', 'Smith DL', 'Fidock DA'],
  '2014',
  'PLoS computational biology',
  '10.1371/journal.pcbi.1003434'),
 '25187979': ('Seasonally dependent relationships between indicators of malaria transmission and disease provided by mathematical model simulations.',
  ['Stuckey EM', 'Smith T', 'Chitnis N'],
  '2014',
  'PLoS computational biology',
  '10.1371/journal.pcbi.1003812'),
 '25590612': ('The interaction between seasonality and pulsed interventions against malaria in their effects on the reproduction number.',
  ['Griffin JT'],
  '2015',
  'PLoS computational biology',
  '10.1371/journal.pcbi.1004057'),
 '26764905': ('Optimal Population-Level Infection Detection Strategies for Malaria Control and Elimination in a Spatial Model of Malaria Transmission.',
  ['Gerardin J', 'Bever CA', 'Hamainza B', 'Miller JM', 'Eckhoff PA', 'Wenger EA'],
  '2016',
  'PLoS computational biology',
  '10.1371/journal.pcbi.1004707'),
 '26890485': ('Quantifying Transmission Investment in Malaria Parasites.',
  ['Greischar MA', 'Mideo N', 'Read AF', 'Bjørnstad ON'],
  '2016',
  'PLoS computational biology',
  '10.1371/journal.pcbi.1004718'),
 '27043913': ('Identifying Malaria Transmission Foci for Elimination Using Human Mobility Data.',
  ['Ruktanonchai NW', 'DeLeenheer P', 'Tatem AJ', 'Alegana VA', 'Caughlin TT', 'Zu Erbach-Schoenberg E', 'Lourenço C', 'Ruktanonchai CW', 'Smith DL'],
  '2016',
  'PLoS computational biology',
  '10.1371/journal.pcbi.1004846'),
 '27509368': ('Malaria Incidence Rates from Time Series of 2-Wave Panel Surveys.',
  ['Castro MC', 'Maheu-Giroux M', 'Chiyaka C', 'Singer BH'],
  '2016',
  'PLoS computational biology',
  '10.1371/journal.pcbi.1005065')}

Because we restricted our search to papers in an open-access journal, you can then follow their DOIs to freely access their PDFs at the journal's website.
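As a small illustration of following a DOI, the sketch below turns the DOI stored in the last position of each summary tuple into a link through the doi.org resolver. The helper name `doi_url` is hypothetical, and the single record is copied from the output above so that the example runs without querying NCBI.

```python
# One entry copied from CA_summ above: (title, authors, year, journal, DOI)
summary = {
    '22511852': ('Evolution of the multi-domain structures of virulence genes '
                 'in the human malaria parasite, Plasmodium falciparum.',
                 ['Buckee CO', 'Recker M'], '2012', 'PLoS computational biology',
                 '10.1371/journal.pcbi.1002451'),
}


def doi_url(doi):
    """Return the doi.org resolver URL for a DOI, or None when the DOI is missing."""
    return 'https://doi.org/' + doi if doi else None


# The DOI is the fifth element (index 4) of each summary tuple.
urls = {pm_id: doi_url(rec[4]) for pm_id, rec in summary.items()}
print(urls['22511852'])
```

Records for which ESummary returned no DOI hold an empty string in that position, which the helper maps to None.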

We will now issue calls to Entrez.elink() using these PubMed IDs, to retrieve the IDs of papers that cite them. The database from which the IDs will be retrieved is PubMed Central, a free digital database of full-text scientific literature in the biomedical and life sciences.

A complete list of the kinds of links you can retrieve with Entrez.elink() can be found here.


In [36]:
CA_citing = {
    id : Entrez.read( Entrez.elink(
            cmd = "neighbor",               # ELink command mode: "neighbor", returns
                                            #     a set of UIDs in `db` linked to the input UIDs in `dbfrom`.
            dbfrom = "pubmed",              # Database containing the input UIDs: PubMed
            db = "pmc",                     # Database from which to retrieve UIDs: PubMed Central
            LinkName = "pubmed_pmc_refs",   # Name of the Entrez link to retrieve: "pubmed_pmc_refs", gets
                                            #     "Full-text articles in the PubMed Central Database that cite the current articles"
            from_uid = id                   # input UIDs
            ) )
    for id in CA_ids
    }

CA_citing['22511852']


Out[36]:
[{'DbFrom': 'pubmed', 'IdList': ['22511852'], 'LinkSetDb': [{'DbTo': 'pmc', 'LinkName': 'pubmed_pmc_refs', 'Link': [{'Id': '4862716'}, {'Id': '4825093'}, {'Id': '4726288'}, {'Id': '4707858'}, {'Id': '4677286'}, {'Id': '4384290'}, {'Id': '4364229'}, {'Id': '4270465'}, {'Id': '3986854'}, {'Id': '3794903'}, {'Id': '3778436'}, {'Id': '3726600'}]}], 'LinkSetDbHistory': [], 'ERROR': []}]

The list of papers citing paper_id is found in CA_citing[paper_id][0]['LinkSetDb'][0]['Link']. To get it as a plain list of IDs, we can do:


In [40]:
cits = [ l['Id'] for l in CA_citing['22511852'][0]['LinkSetDb'][0]['Link'] ]
cits


Out[40]:
['4862716',
 '4825093',
 '4726288',
 '4707858',
 '4677286',
 '4384290',
 '4364229',
 '4270465',
 '3986854',
 '3794903',
 '3778436',
 '3726600']
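The indexing used above assumes that the paper has at least one citing article. For a paper without citations, 'LinkSetDb' is an empty list, and `[0]` would raise an IndexError. A minimal sketch of a helper that handles both cases follows; the name `linked_ids` is illustrative and not part of Biopython, and the second sample record uses a made-up ID.

```python
def linked_ids(elink_result):
    """Return the linked IDs from a parsed ELink result, or [] if there are none.

    `elink_result` is the list obtained from Entrez.read(Entrez.elink(...)).
    """
    ids = []
    for linkset in elink_result:
        # 'LinkSetDb' is empty when no links of the requested kind exist.
        for linkset_db in linkset.get('LinkSetDb', []):
            ids.extend(link['Id'] for link in linkset_db['Link'])
    return ids


# Data shaped like the output above, truncated to two links.
with_citations = [{'DbFrom': 'pubmed', 'IdList': ['22511852'],
                   'LinkSetDb': [{'DbTo': 'pmc', 'LinkName': 'pubmed_pmc_refs',
                                  'Link': [{'Id': '4862716'}, {'Id': '4825093'}]}],
                   'LinkSetDbHistory': [], 'ERROR': []}]
# Hypothetical record for a paper that nobody cites.
without_citations = [{'DbFrom': 'pubmed', 'IdList': ['0'],
                      'LinkSetDb': [], 'LinkSetDbHistory': [], 'ERROR': []}]

print(linked_ids(with_citations))
print(linked_ids(without_citations))
```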

However, one more step is needed, as what we have now are PubMed Central IDs, and not PubMed IDs. Their conversion can be achieved through an additional call to Entrez.elink():


In [41]:
cits_pm = Entrez.read( Entrez.elink( dbfrom="pmc", db="pubmed", LinkName="pmc_pubmed", from_uid=",".join(cits)) )
cits_pm


Out[41]:
[{'DbFrom': 'pmc', 'IdList': ['4862716', '4825093', '4726288', '4707858', '4677286', '4384290', '4364229', '4270465', '3986854', '3794903', '3778436', '3726600'], 'LinkSetDb': [{'DbTo': 'pubmed', 'LinkName': 'pmc_pubmed', 'Link': [{'Id': '26883585'}, {'Id': '26804201'}, {'Id': '26741401'}, {'Id': '26674193'}, {'Id': '26657042'}, {'Id': '25759421'}, {'Id': '25626688'}, {'Id': '25521112'}, {'Id': '24674301'}, {'Id': '24130474'}, {'Id': '24062941'}, {'Id': '23922996'}]}], 'LinkSetDbHistory': [], 'ERROR': []}]

In [42]:
ids_map = { pmc_id : link['Id'] for (pmc_id,link) in zip(cits_pm[0]['IdList'], cits_pm[0]['LinkSetDb'][0]['Link']) }
ids_map


Out[42]:
{'3726600': '23922996',
 '3778436': '24062941',
 '3794903': '24130474',
 '3986854': '24674301',
 '4270465': '25521112',
 '4364229': '25626688',
 '4384290': '25759421',
 '4677286': '26657042',
 '4707858': '26674193',
 '4726288': '26741401',
 '4825093': '26804201',
 '4862716': '26883585'}
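Note that the zip above pairs IdList and Link entries by position. That is only correct if ELink returns the links in the same order as the input IDs and every PubMed Central ID has exactly one PubMed counterpart. NCBI's ELink documentation describes that UIDs supplied together in a single id parameter are treated as one set, while repeating the id parameter yields one link set per UID. Assuming that Biopython sends a Python list given as `id` as repeated parameters (an assumption worth checking against your installed version), the mapping can be built without relying on order. The sketch below works on already-parsed results; the helper name `pmc_to_pubmed_map` and the unmatched ID are hypothetical.

```python
def pmc_to_pubmed_map(linksets):
    """Map each input ID to its linked ID, given one link set per input ID.

    Inputs without a link are mapped to None, instead of shifting every
    later pair by one position as a positional zip would.
    """
    mapping = {}
    for linkset in linksets:
        source_id = linkset['IdList'][0]
        links = [link['Id'] for db in linkset['LinkSetDb'] for link in db['Link']]
        mapping[source_id] = links[0] if links else None
    return mapping


# Assumed call shape (not executed here). Passing a list instead of a
# comma-joined string is the assumption that gives one link set per ID:
#   linksets = Entrez.read(Entrez.elink(dbfrom="pmc", db="pubmed",
#                                       LinkName="pmc_pubmed", id=cits))

# Offline example: two pairs from the output above, plus a hypothetical
# PubMed Central ID that has no PubMed record.
example = [
    {'IdList': ['4862716'], 'LinkSetDb': [{'Link': [{'Id': '26883585'}]}]},
    {'IdList': ['0000000'], 'LinkSetDb': []},
    {'IdList': ['3726600'], 'LinkSetDb': [{'Link': [{'Id': '23922996'}]}]},
]
print(pmc_to_pubmed_map(example))
```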

And to check these papers:


In [43]:
{   p['Id'] : ( p['Title'], p['AuthorList'], p['PubDate'][:4], p['FullJournalName'], p.get('DOI', '') )
    for p in Entrez.read( Entrez.esummary(db="pubmed", id=','.join( ids_map.values() )) )
    }


Out[43]:
{'23922996': ('Plasmodium falciparum var gene expression homogeneity as a marker of the host-parasite relationship under different levels of naturally acquired immunity to malaria.',
  ['Warimwe GM', 'Recker M', 'Kiragu EW', 'Buckee CO', 'Wambua J', 'Musyoki JN', 'Marsh K', 'Bull PC'],
  '2013',
  'PloS one',
  '10.1371/journal.pone.0070467'),
 '24062941': ('The antigenic switching network of Plasmodium falciparum and its implications for the immuno-epidemiology of malaria.',
  ['Noble R', 'Christodoulou Z', 'Kyes S', 'Pinches R', 'Newbold CI', 'Recker M'],
  '2013',
  'eLife',
  '10.7554/eLife.01074'),
 '24130474': ('A network approach to analyzing highly recombinant malaria parasite genes.',
  ['Larremore DB', 'Clauset A', 'Buckee CO'],
  '2013',
  'PLoS computational biology',
  '10.1371/journal.pcbi.1003268'),
 '24674301': ('Plasmodium falciparum antigenic variation: relationships between widespread endothelial activation, parasite PfEMP1 expression and severe malaria.',
  ['Abdi AI', 'Fegan G', 'Muthui M', 'Kiragu E', 'Musyoki JN', 'Opiyo M', 'Marsh K', 'Warimwe GM', 'Bull PC'],
  '2014',
  'BMC infectious diseases',
  '10.1186/1471-2334-14-170'),
 '25521112': ('Generation of antigenic diversity in Plasmodium falciparum by structured rearrangement of Var genes during mitosis.',
  ['Claessens A', 'Hamilton WL', 'Kekre M', 'Otto TD', 'Faizullabhoy A', 'Rayner JC', 'Kwiatkowski D'],
  '2014',
  'PLoS genetics',
  '10.1371/journal.pgen.1004812'),
 '25626688': ('MDAT- Aligning multiple domain arrangements.',
  ['Kemena C', 'Bitard-Feildel T', 'Bornberg-Bauer E'],
  '2015',
  'BMC bioinformatics',
  '10.1186/s12859-014-0442-7'),
 '25759421': ('Mastering malaria: what helps and what hurts.',
  ['Gupta S'],
  '2015',
  'Proceedings of the National Academy of Sciences of the United States of America',
  '10.1073/pnas.1501786112'),
 '26657042': ('Differential Plasmodium falciparum surface antigen expression among children with Malarial Retinopathy.',
  ['Abdi AI', 'Kariuki SM', 'Muthui MK', 'Kivisi CA', 'Fegan G', 'Gitau E', 'Newton CR', 'Bull PC'],
  '2015',
  'Scientific reports',
  '10.1038/srep18034'),
 '26674193': ('Maintenance of phenotypic diversity within a set of virulence encoding genes of the malaria parasite Plasmodium falciparum.',
  ['Holding T', 'Recker M'],
  '2015',
  'Journal of the Royal Society, Interface',
  '10.1098/rsif.2015.0848'),
 '26741401': ('The role of PfEMP1 as targets of naturally acquired immunity to childhood malaria: prospects for a vaccine.',
  ['Bull PC', 'Abdi AI'],
  '2016',
  'Parasitology',
  '10.1017/S0031182015001274'),
 '26804201': ('Global selection of Plasmodium falciparum virulence antigen expression by host antibodies.',
  ['Abdi AI', 'Warimwe GM', 'Muthui MK', 'Kivisi CA', 'Kiragu EW', 'Fegan GW', 'Bull PC'],
  '2016',
  'Scientific reports',
  '10.1038/srep19882'),
 '26883585': ('Serological Conservation of Parasite-Infected Erythrocytes Predicts Plasmodium falciparum Erythrocyte Membrane Protein 1 Gene Expression but Not Severity of Childhood Malaria.',
  ['Warimwe GM', 'Abdi AI', 'Muthui M', 'Fegan G', 'Musyoki JN', 'Marsh K', 'Bull PC'],
  '2016',
  'Infection and immunity',
  '10.1128/IAI.00772-15')}

Citations dataset

We have now seen all the steps required to assemble a dataset of citations to each of the papers in our dataset.


In [24]:
Citations_file = 'data/' + search_term + '__Citations.pkl.bz2'
Citations = []

At least one server query will be issued per paper in Ids. Because NCBI allows for at most 3 queries per second (see here), this dataset will take a long time to assemble. Should you need to interrupt it, or should the connection fail at some point, it is safe to rerun the cell below until all data is collected.


In [27]:
import http.client

if Citations == [] and os.path.exists( Citations_file ):
    Citations = pickle.load( bz2.BZ2File( Citations_file, 'rb' ) )

if len(Citations) < len(Ids):
    
    i = len(Citations)
    checkpoint = len(Ids) // 10 + 1     # save to hard drive at every 10% of Ids fetched (integer division: with `/`
                                        #     the float result would almost never satisfy `i % checkpoint == 0`)
    
    for pm_id in Ids[i:]:               # either starts from index 0, or resumes from where we previously left off
        
        while True:
            try:
                # query for papers archived in PubMed Central that cite the paper with PubMed ID `pm_id`
                c = Entrez.read( Entrez.elink( dbfrom = "pubmed", db="pmc", LinkName = "pubmed_pmc_refs", id=str(pm_id) ) )
                
                c = c[0]['LinkSetDb']
                if len(c) == 0:
                    # no citations found for the current paper
                    c = []
                else:
                    c = [ l['Id'] for l in c[0]['Link'] ]
                    
                    # convert citations from PubMed Central IDs to PubMed IDs
                    p = []
                    retrieve_per_query = 500
                    for start in range( 0, len(c), retrieve_per_query ):
                        query_ids = ','.join( c[start : start+retrieve_per_query] )
                        r = Entrez.read( Entrez.elink( dbfrom="pmc", db="pubmed", LinkName="pmc_pubmed", from_uid=query_ids ) )
                        # select the IDs. If no matching PubMed ID was found, [] is returned instead
                        p.extend( [] if r[0]['LinkSetDb']==[] else [ int(link['Id']) for link in r[0]['LinkSetDb'][0]['Link'] ] )
                    c = p
            
            except http.client.BadStatusLine:
                # Presumably, the server closed the connection before sending a valid response. Retry until we have the data.
                print('r')
                continue
            break
        
        Citations.append( (pm_id, c) )
        if (i % 10000 == 0):
            print('')
            print(i, end='')
        if (i % 100 == 0):
            print('.', end='')
        i += 1
        
        if i % checkpoint == 0:
            print('\tsaving at checkpoint', i)
            pickle.dump( Citations, bz2.BZ2File( Citations_file, 'wb' ) )
    
    print('\n done.')
    
    # Save Citations, as a dictionary indexed by Ids
    Citations = dict( Citations )
    
    pickle.dump( Citations, bz2.BZ2File( Citations_file, 'wb' ) )


......................................................................................
30000....................................................................................................
40000....................................................................................................
50000....................................................................................................
60000.......................................................................
 done.
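The retry loop in the cell above repeats the query immediately and without limit whenever the connection is dropped. A gentler alternative is to wait a little longer after each failure and to give up after a fixed number of attempts. The sketch below is one way to do that, not the approach used to build this dataset; the helper name `call_with_retries` and its parameters are hypothetical, and the demonstration uses a simulated failure so that it runs offline.

```python
import time


def call_with_retries(func, retry_on=(Exception,), max_attempts=5,
                      base_delay=1.0, sleep=time.sleep):
    """Call func() until it succeeds, waiting longer after each failure.

    The last exception is re-raised once max_attempts is exhausted, so a
    persistent outage cannot keep the loop running forever.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))   # 1 s, 2 s, 4 s, ...


# Offline demonstration: a function that fails twice before succeeding.
attempts = []
waits = []          # records the requested delays instead of really sleeping


def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("simulated dropped connection")
    return "ok"


result = call_with_retries(flaky, retry_on=(ConnectionError,), sleep=waits.append)
print(result, len(attempts), waits)
```

In the cell above, each Entrez.read( Entrez.elink(...) ) call could be wrapped in a lambda and passed to such a helper, with retry_on=(http.client.BadStatusLine,).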

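To check that we have indeed obtained the data we expected, we look up one of the papers seen in the last section. Paper 24130474 appeared there among the papers citing 22511852, and the list below holds the papers that in turn cite 24130474 (among them 25521112, which also appeared in that section). To compare directly against the IDs listed at the end of the last section, look up Citations[22511852] instead.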


In [29]:
Citations[24130474]


Out[29]:
[27306566, 26456841, 25521112, 25368109, 25303095, 25122340]

Where do we go from here?

Running the code above generates multiple local files containing the datasets we'll be working with. Loading them into memory is just a matter of issuing a call like
data = pickle.load( bz2.BZ2File( data_file, 'rb' ) ).
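The one-line call leaves the compressed file open until the file object is garbage collected. A minimal sketch of save and load helpers that close the file explicitly is shown below; the names `save_dataset` and `load_dataset` are hypothetical, and the round trip uses a temporary directory with one entry copied from the Citations output above.

```python
import bz2
import os
import pickle
import tempfile


def save_dataset(obj, path):
    """Pickle `obj` into a bzip2-compressed file, closing the file afterwards."""
    with bz2.BZ2File(path, 'wb') as f:
        pickle.dump(obj, f)


def load_dataset(path):
    """Load an object previously written by save_dataset."""
    with bz2.BZ2File(path, 'rb') as f:
        return pickle.load(f)


# Round trip in a temporary directory.
sample = {24130474: [27306566, 26456841, 25521112, 25368109, 25303095, 25122340]}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample__Citations.pkl.bz2')
    save_dataset(sample, path)
    restored = load_dataset(path)

print(restored == sample)
```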

The Entrez module will therefore no longer be needed, unless you wish to extend your data processing with additional information retrieved from PubMed.

Should you be interested in looking at alternative ways to handle the data, have a look at the sqlite3 module included in Python's standard library, or Pandas, the Python Data Analysis Library.
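As a rough idea of what the sqlite3 route could look like, the sketch below stores citations as (cited, citing) pairs in a single table and queries them in both directions. The table layout is an assumption made for illustration, the first entry is copied from the Citations output above, and the second entry is a hypothetical paper without citations.

```python
import sqlite3

citations = {
    24130474: [27306566, 26456841, 25521112, 25368109, 25303095, 25122340],
    11111111: [],   # hypothetical ID, for illustration only
}

conn = sqlite3.connect(':memory:')      # pass a file path instead to persist the data
conn.execute('CREATE TABLE citations ('
             'cited INTEGER, citing INTEGER, PRIMARY KEY (cited, citing))')
conn.executemany(
    'INSERT INTO citations VALUES (?, ?)',
    [(cited, citing)
     for cited, citing_list in citations.items()
     for citing in citing_list],
)
conn.commit()

# Number of citing papers recorded for a given PubMed ID.
(count,) = conn.execute(
    'SELECT COUNT(*) FROM citations WHERE cited = ?', (24130474,)).fetchone()
print(count)

# Papers in the table that are cited by 25521112.
cited_by = [row[0] for row in conn.execute(
    'SELECT cited FROM citations WHERE citing = ?', (25521112,))]
print(cited_by)

conn.close()
```

Storing one row per citation makes the reverse lookup (which papers does a given paper cite?) as cheap as the forward one, which a dictionary keyed by cited paper does not offer directly.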