The Entrez module, a part of the Biopython library, will be used to interface with PubMed.
You can download Biopython from here.
In this notebook we will be covering several of the steps taken in the Biopython Tutorial, specifically in Chapter 9 Accessing NCBI’s Entrez databases.
In [13]:
from Bio import Entrez
# NCBI requires you to set your email address to make use of NCBI's E-utilities
Entrez.email = "Your.Name.Here@example.org"
In [14]:
import pickle, bz2, os
In [15]:
# accessing extended information about the PubMed database
pubmed = Entrez.read( Entrez.einfo(db="pubmed"), validate=False )[u'DbInfo']
# list of possible search fields for use with ESearch:
search_fields = { f['Name']:f['Description'] for f in pubmed["FieldList"] }
In search_fields, we find 'TIAB' ('Free text associated with Abstract/Title') as a possible search field to use in searches.
In [16]:
search_fields
Out[16]:
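For instance, the entry for 'TIAB' mentioned above can be looked up directly in that dictionary:

search_fields['TIAB']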
To have a look at the kind of data we get when searching the database, we'll perform a search for papers authored by Haasdijk:
In [17]:
example_authors = ['Haasdijk E']
example_search = Entrez.read( Entrez.esearch( db="pubmed", term=' AND '.join([a+'[AUTH]' for a in example_authors]) ) )
example_search
Out[17]:
Note that the results returned are not in Python's native string format:
In [18]:
type( example_search['IdList'][0] )
Out[18]:
The part of the query's result we are most interested in is accessible through its 'IdList' key:
In [19]:
example_ids = [ int(id) for id in example_search['IdList'] ]
print(example_ids)
We will now assemble a dataset composed of research articles containing the keyword "malaria" in either their titles or abstracts.
In [20]:
search_term = 'malaria'
In [21]:
Ids_file = 'data/' + search_term + '__Ids.pkl.bz2'
In [22]:
if os.path.exists( Ids_file ):
    Ids = pickle.load( bz2.BZ2File( Ids_file, 'rb' ) )
else:
    # determine the number of hits for the search term
    search = Entrez.read( Entrez.esearch( db="pubmed", term=search_term+'[TIAB]', retmax=0 ) )
    total = int( search['Count'] )
    # `Ids` will be incrementally assembled, by performing multiple queries,
    # each returning at most `retrieve_per_query` entries.
    Ids_str = []
    retrieve_per_query = 10000
    for start in range( 0, total, retrieve_per_query ):
        print('Fetching IDs of results [%d,%d]' % ( start, start+retrieve_per_query ) )
        s = Entrez.read( Entrez.esearch( db="pubmed", term=search_term+'[TIAB]', retstart=start, retmax=retrieve_per_query ) )
        Ids_str.extend( s[ u'IdList' ] )
    # convert Ids to integers (and ensure that the conversion is reversible)
    Ids = [ int(id) for id in Ids_str ]
    for (id_str, id_int) in zip(Ids_str, Ids):
        if str(id_int) != id_str:
            raise Exception('Conversion of PubMed ID %s from string to integer is not reversible.' % id_str )
    # Save list of Ids
    pickle.dump( Ids, bz2.BZ2File( Ids_file, 'wb' ) )
total = len( Ids )
print('%d documents contain the search term "%s".' % ( total, search_term ) )
Taking a look at what we just retrieved, here are the first 5 elements of the Ids list:
In [23]:
Ids[:5]
Out[23]:
To have a look at the kind of metadata we get from a call to Entrez.esummary(), we now fetch the summary of one of Haasdijk's papers (using one of the PubMed IDs we obtained in the previous section):
In [12]:
example_paper = Entrez.read( Entrez.esummary(db="pubmed", id='27749938') )[0]
def print_dict( p ):
    for k,v in p.items():
        print(k)
        print('\t', v)

print_dict(example_paper)
For now, we'll keep just some basic information for each paper: title, list of authors, publication year, and DOI.
In case you are not familiar with the DOI system, know that the paper above can be accessed through the link http://dx.doi.org/10.1007/s12065-012-0071-x (which is http://dx.doi.org/
followed by the paper's DOI).
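Such a link can also be assembled directly from the summary record; a small check, reusing the example_paper fetched above:

doi_link = 'http://dx.doi.org/' + example_paper['DOI']
print( doi_link )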
In [13]:
( example_paper['Title'], example_paper['AuthorList'], int(example_paper['PubDate'][:4]), example_paper['DOI'] )
Out[13]:
We are now ready to assemble a dataset containing the summaries of all the paper Ids
we previously fetched.
To reduce the memory footprint, and to ensure the saved datasets can be loaded without Biopython being installed, the values returned by Entrez.read() will be converted to their corresponding native Python types.
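The difference is easy to see on the summary record fetched earlier (a quick check, assuming the example_paper cell above has been run):

print( type( example_paper['Title'] ) )         # a Biopython-specific subclass of str
print( type( str( example_paper['Title'] ) ) )  # plain Python str after conversion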
In [14]:
Summaries_file = 'data/' + search_term + '__Summaries.pkl.bz2'
In [15]:
if os.path.exists( Summaries_file ):
    Summaries = pickle.load( bz2.BZ2File( Summaries_file, 'rb' ) )
else:
    # `Summaries` will be incrementally assembled, by performing multiple queries,
    # each returning at most `retrieve_per_query` entries.
    Summaries = []
    retrieve_per_query = 500
    print('Fetching Summaries of results: ')
    for start in range( 0, len(Ids), retrieve_per_query ):
        if (start % 10000 == 0):
            print('')
            print(start, end='')
        else:
            print('.', end='')
        # build comma separated string with the ids at indexes [start, start+retrieve_per_query)
        query_ids = ','.join( [ str(id) for id in Ids[ start : start+retrieve_per_query ] ] )
        s = Entrez.read( Entrez.esummary( db="pubmed", id=query_ids ) )
        # out of the retrieved data, we will keep only a tuple (title, authors, year, DOI), associated with the paper's id
        # (all values converted to native Python formats)
        f = [
            ( int( p['Id'] ), (
                str( p['Title'] ),
                [ str(a) for a in p['AuthorList'] ],
                int( p['PubDate'][:4] ),    # keeps just the publication year
                str( p.get('DOI', '') )     # papers for which no DOI is available get an empty string in their place
            ) )
            for p in s
        ]
        Summaries.extend( f )
    # Save Summaries, as a dictionary indexed by Ids
    Summaries = dict( Summaries )
    pickle.dump( Summaries, bz2.BZ2File( Summaries_file, 'wb' ) )
Let us take a look at the first 3 retrieved summaries:
In [16]:
{ id : Summaries[id] for id in Ids[:3] }
Out[16]:
Entrez.efetch()
is the function that will allow us to obtain paper abstracts. Let us start by taking a look at the kind of data it returns when we query PubMed's database.
In [17]:
q = Entrez.read( Entrez.efetch(db="pubmed", id='27749938', retmode="xml") )
q
is a list, with each member corresponding to a queried id. Because we queried for only one id here, the result we want is in q[0].
In [18]:
type(q), len(q)
Out[18]:
At q[0]
we find a dictionary containing two keys, the contents of which we print below.
In [19]:
type(q[0]), q[0].keys()
Out[19]:
In [20]:
print_dict( q[0][ 'PubmedData' ] )
The key 'MedlineCitation'
maps into another dictionary. In that dictionary, most of the information is contained under the key 'Article'
. To minimize the clutter, below we show the contents of 'MedlineCitation'
excluding its 'Article'
member, and below that we then show the contents of 'Article'
.
In [21]:
print_dict( { k:v for k,v in q[0][ 'MedlineCitation' ].items() if k!='Article' } )
In [22]:
print_dict( q[0][ 'MedlineCitation' ][ 'Article' ] )
A paper's abstract can therefore be accessed with:
In [23]:
{ int(q[0]['MedlineCitation']['PMID']) : str(q[0]['MedlineCitation']['Article']['Abstract']['AbstractText'][0]) }
Out[23]:
A paper for which no abstract is available will simply not contain the 'Abstract'
key in its 'Article'
dictionary:
In [24]:
print_dict( Entrez.read( Entrez.efetch(db="pubmed", id='17782550', retmode="xml") )[0]['MedlineCitation']['Article'] )
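Code that needs to handle both cases can therefore test for the key before indexing. A minimal sketch of the pattern (the same one used when assembling the Abstracts dataset below), applied to the q record retrieved earlier:

article = q[0]['MedlineCitation']['Article']
abstract = str( article['Abstract']['AbstractText'][0] ) if 'Abstract' in article else ''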
Some of the ids in our dataset refer to books from the NCBI Bookshelf, a collection of freely available, downloadable, on-line versions of selected biomedical books. For such ids, Entrez.efetch()
returns a slightly different structure, where the keys [u'BookDocument', u'PubmedBookData']
take the place of the [u'MedlineCitation', u'PubmedData']
keys we saw above.
Here is an example of the data we obtain for the id corresponding to the book The Social Biology of Microbial Communities:
In [25]:
r = Entrez.read( Entrez.efetch(db="pubmed", id='24027805', retmode="xml") )
In [26]:
print_dict( r[0][ 'PubmedBookData' ] )
In [27]:
print_dict( r[0][ 'BookDocument' ] )
For a book from the NCBI Bookshelf, the abstract can then be accessed as follows:
In [28]:
{ int(r[0]['BookDocument']['PMID']) : str(r[0]['BookDocument']['Abstract']['AbstractText'][0]) }
Out[28]:
We can now assemble a dataset mapping paper ids to their abstracts.
In [29]:
Abstracts_file = 'data/' + search_term + '__Abstracts.pkl.bz2'
In [30]:
import http.client
from collections import deque
if os.path.exists( Abstracts_file ):
    Abstracts = pickle.load( bz2.BZ2File( Abstracts_file, 'rb' ) )
else:
    # `Abstracts` will be incrementally assembled, by performing multiple queries,
    # each returning at most `retrieve_per_query` entries.
    Abstracts = deque()
    retrieve_per_query = 500
    print('Fetching Abstracts of results: ')
    for start in range( 0, len(Ids), retrieve_per_query ):
        if (start % 10000 == 0):
            print('')
            print(start, end='')
        else:
            print('.', end='')
        # build comma separated string with the ids at indexes [start, start+retrieve_per_query)
        query_ids = ','.join( [ str(id) for id in Ids[ start : start+retrieve_per_query ] ] )
        # issue requests to the server, until we get the full amount of data we expect
        while True:
            try:
                s = Entrez.read( Entrez.efetch(db="pubmed", id=query_ids, retmode="xml" ) )
            except http.client.IncompleteRead:
                print('r', end='')
                continue
            break
        i = 0
        for p in s:
            abstr = ''
            if 'MedlineCitation' in p:
                pmid = p['MedlineCitation']['PMID']
                if 'Abstract' in p['MedlineCitation']['Article']:
                    abstr = p['MedlineCitation']['Article']['Abstract']['AbstractText'][0]
            elif 'BookDocument' in p:
                pmid = p['BookDocument']['PMID']
                if 'Abstract' in p['BookDocument']:
                    abstr = p['BookDocument']['Abstract']['AbstractText'][0]
            else:
                raise Exception('Unrecognized record type, for id %d (keys: %s)' % (Ids[start+i], str(p.keys())) )
            Abstracts.append( (int(pmid), str(abstr)) )
            i += 1
    # Save Abstracts, as a dictionary indexed by Ids
    Abstracts = dict( Abstracts )
    pickle.dump( Abstracts, bz2.BZ2File( Abstracts_file, 'wb' ) )
Taking a look at one paper's abstract:
In [31]:
Abstracts[27749938]
Out[31]:
To understand how to obtain paper citations with Entrez, we will first assemble a small set of PubMed IDs, and then query for their citations. To that end, we search here for papers published in the PLOS Computational Biology journal (as before, having also the word "malaria" in either the title or abstract):
In [32]:
CA_search_term = search_term+'[TIAB] AND PLoS computational biology[JOUR]'
CA_ids = Entrez.read( Entrez.esearch( db="pubmed", term=CA_search_term ) )['IdList']
CA_ids
Out[32]:
In [33]:
CA_summ = {
p['Id'] : ( p['Title'], p['AuthorList'], p['PubDate'][:4], p['FullJournalName'], p.get('DOI', '') )
for p in Entrez.read( Entrez.esummary(db="pubmed", id=','.join( CA_ids )) )
}
CA_summ
Out[33]:
Because we restricted our search to papers in an open-access journal, you can then follow their DOIs to freely access their PDFs at the journal's website.
We will now issue calls to Entrez.elink()
using these PubMed IDs, to retrieve the IDs of papers that cite them.
The database from which the IDs will be retrieved is PubMed Central, a free digital database of full-text scientific literature in the biomedical and life sciences.
A complete list of the kinds of links you can retrieve with Entrez.elink()
can be found here.
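The same information is also available programmatically: the DbInfo record fetched at the start of this notebook contains a 'LinkList' entry describing the links the database supports. A quick check, assuming the pubmed variable defined above is still in memory:

[ l for l in pubmed['LinkList'] if l['Name'] == 'pubmed_pmc_refs' ]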
In [36]:
CA_citing = {
id : Entrez.read( Entrez.elink(
cmd = "neighbor", # ELink command mode: "neighbor", returns
# a set of UIDs in `db` linked to the input UIDs in `dbfrom`.
dbfrom = "pubmed", # Database containing the input UIDs: PubMed
db = "pmc", # Database from which to retrieve UIDs: PubMed Central
LinkName = "pubmed_pmc_refs", # Name of the Entrez link to retrieve: "pubmed_pmc_refs", gets
# "Full-text articles in the PubMed Central Database that cite the current articles"
from_uid = id # input UIDs
) )
for id in CA_ids
}
CA_citing['22511852']
Out[36]:
We have in CA_citing[paper_id][0]['LinkSetDb'][0]['Link'] the list of papers citing paper_id. To get it as just a list of ids, we can do:
In [40]:
cits = [ l['Id'] for l in CA_citing['22511852'][0]['LinkSetDb'][0]['Link'] ]
cits
Out[40]:
However, one more step is needed, as what we have now are PubMed Central IDs, and not PubMed IDs. Their conversion can be achieved through an additional call to Entrez.elink()
:
In [41]:
cits_pm = Entrez.read( Entrez.elink( dbfrom="pmc", db="pubmed", LinkName="pmc_pubmed", from_uid=",".join(cits)) )
cits_pm
Out[41]:
In [42]:
ids_map = { pmc_id : link['Id'] for (pmc_id,link) in zip(cits_pm[0]['IdList'], cits_pm[0]['LinkSetDb'][0]['Link']) }
ids_map
Out[42]:
And to check these papers:
In [43]:
{ p['Id'] : ( p['Title'], p['AuthorList'], p['PubDate'][:4], p['FullJournalName'], p.get('DOI', '') )
for p in Entrez.read( Entrez.esummary(db="pubmed", id=','.join( ids_map.values() )) )
}
Out[43]:
We have now seen all the steps required to assemble a dataset of citations to each of the papers in our dataset.
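The two Entrez.elink() calls can be summarized in a small helper function. The sketch below is only illustrative (the name get_citing_pmids is ours, and the cell that follows inlines the same logic rather than calling it):

def get_citing_pmids( pm_id ):
    # PubMed Central articles citing the paper with PubMed ID `pm_id`
    c = Entrez.read( Entrez.elink( dbfrom="pubmed", db="pmc", LinkName="pubmed_pmc_refs", id=str(pm_id) ) )[0]['LinkSetDb']
    if len(c) == 0:
        return []
    pmc_ids = [ l['Id'] for l in c[0]['Link'] ]
    # map the PubMed Central IDs back to PubMed IDs
    r = Entrez.read( Entrez.elink( dbfrom="pmc", db="pubmed", LinkName="pmc_pubmed", from_uid=','.join(pmc_ids) ) )
    if r[0]['LinkSetDb'] == []:
        return []
    return [ int(l['Id']) for l in r[0]['LinkSetDb'][0]['Link'] ]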
In [24]:
Citations_file = 'data/' + search_term + '__Citations.pkl.bz2'
Citations = []
At least one server query will be issued per paper in Ids. Because NCBI allows at most 3 queries per second (see here), this dataset will take a long time to assemble. Should you need to interrupt it, or should the connection fail at some point, it is safe to simply rerun the cell below until all data has been collected.
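If you ever need to throttle the requests explicitly (for instance, when adapting this code to issue raw E-utilities calls outside Biopython), a short pause between iterations suffices; this sketch is not part of the cell below:

import time
time.sleep( 1.0 / 3 )   # stay under NCBI's limit of ~3 requests per second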
In [27]:
import http.client
if Citations == [] and os.path.exists( Citations_file ):
    Citations = pickle.load( bz2.BZ2File( Citations_file, 'rb' ) )

if len(Citations) < len(Ids):
    i = len(Citations)
    checkpoint = len(Ids) // 10 + 1    # save to hard drive at every 10% of Ids fetched
    for pm_id in Ids[i:]:    # either starts from index 0, or resumes from where we previously left off
        while True:
            try:
                # query for papers archived in PubMed Central that cite the paper with PubMed ID `pm_id`
                c = Entrez.read( Entrez.elink( dbfrom="pubmed", db="pmc", LinkName="pubmed_pmc_refs", id=str(pm_id) ) )
                c = c[0]['LinkSetDb']
                if len(c) == 0:
                    # no citations found for the current paper
                    c = []
                else:
                    c = [ l['Id'] for l in c[0]['Link'] ]
                # convert citations from PubMed Central IDs to PubMed IDs
                p = []
                retrieve_per_query = 500
                for start in range( 0, len(c), retrieve_per_query ):
                    query_ids = ','.join( c[start : start+retrieve_per_query] )
                    r = Entrez.read( Entrez.elink( dbfrom="pmc", db="pubmed", LinkName="pmc_pubmed", from_uid=query_ids ) )
                    # select the IDs. If no matching PubMed ID was found, [] is returned instead
                    p.extend( [] if r[0]['LinkSetDb']==[] else [ int(link['Id']) for link in r[0]['LinkSetDb'][0]['Link'] ] )
                c = p
            except http.client.BadStatusLine:
                # Presumably, the server closed the connection before sending a valid response. Retry until we have the data.
                print('r')
                continue
            break
        Citations.append( (pm_id, c) )
        if (i % 10000 == 0):
            print('')
            print(i, end='')
        if (i % 100 == 0):
            print('.', end='')
        i += 1
        if i % checkpoint == 0:
            print('\tsaving at checkpoint', i)
            pickle.dump( Citations, bz2.BZ2File( Citations_file, 'wb' ) )
    print('\n done.')
    # Save Citations, as a dictionary indexed by Ids
    Citations = dict( Citations )
    pickle.dump( Citations, bz2.BZ2File( Citations_file, 'wb' ) )
To see that we have indeed obtained the data we expected, you can match the ids below with the ids listed at the end of the previous section.
In [29]:
Citations[24130474]
Out[29]:
Running the code above generates multiple local files, containing the datasets we'll be working with. Loading them into memory is a matter of just issuing a call like
data = pickle.load( bz2.BZ2File( data_file, 'rb' ) )
.
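For example, to reload all four datasets in a later session (assuming the same search_term and file names used above):

import pickle, bz2
search_term = 'malaria'
Ids       = pickle.load( bz2.BZ2File( 'data/' + search_term + '__Ids.pkl.bz2',       'rb' ) )
Summaries = pickle.load( bz2.BZ2File( 'data/' + search_term + '__Summaries.pkl.bz2', 'rb' ) )
Abstracts = pickle.load( bz2.BZ2File( 'data/' + search_term + '__Abstracts.pkl.bz2', 'rb' ) )
Citations = pickle.load( bz2.BZ2File( 'data/' + search_term + '__Citations.pkl.bz2', 'rb' ) )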
The Entrez module will therefore no longer be needed, unless you wish to extend your data processing with additional information retrieved from PubMed.
Should you be interested in looking at alternative ways to handle the data, have a look at the sqlite3 module included in Python's standard library, or Pandas, the Python Data Analysis Library.
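As an illustration of the latter, the Summaries dictionary maps directly onto a DataFrame; a sketch, assuming pandas is installed (the column names are our own choice):

import pandas as pd
df = pd.DataFrame.from_dict( Summaries, orient='index', columns=['title', 'authors', 'year', 'doi'] )
df.head()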