PDB entries are experimentally determined models of interesting proteins. Scholarly literature often refers to PDB entries when discussing aspects of macromolecules such as fold, function and folding. So, a basic way to measure the impact of a PDB entry is simply to count the publications associated with it.
Let us run the tutorial_utils notebook to set up the API URL, logger, caller utility, etc. Edit that notebook if you want to set anything up differently.
In [1]:
%run 'tutorial_utils.ipynb'
There are two calls for entry publications - /pdb/entry/publications and /pdb/entry/related_publications. Check them out in the interactive documentation explorer of the API.
The first call provides the articles associated directly with the entry, i.e. the ones which the depositor provided.
The second call provides articles and reviews mined from EuroPMC. These publications either cite the depositor's citations directly or merely mention the PDB entry id in the text of the article without explicitly citing any article. Note that articles with a mention of the PDB id can be mined by EuroPMC only when the full text of the article is freely available.
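As a minimal sketch, the two calls differ only in their path, so both URLs for an entry can be built the same way. The base URL below is an assumption (the tutorial takes `PDBE_API_URL` from tutorial_utils, and its value is not shown here), and `build_publication_urls` is a hypothetical helper used only for illustration.

```python
# Assumed base URL; the tutorial itself takes PDBE_API_URL from tutorial_utils.
ASSUMED_PDBE_API_URL = "https://www.ebi.ac.uk/pdbe/api"


def build_publication_urls(pdb_id, base_url=ASSUMED_PDBE_API_URL):
    """Return the URLs of the two publication calls for one PDB entry."""
    return {
        "publications": base_url + "/pdb/entry/publications/" + pdb_id,
        "related_publications": base_url + "/pdb/entry/related_publications/" + pdb_id,
    }


urls = build_publication_urls("1cbs")
print(urls["publications"])
print(urls["related_publications"])
```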
In this example, we will use entries deposited by 'Kleywegt'. I copied Kleywegt's PDB ids from the PDBe search service; there were 39 as of August 2014. They can also be obtained programmatically; see the search_introduction notebook for details.
In [7]:
pdb_ids_list = [
    "2c10", "2c11", "1wc2", "2c3n", "2c3q", "2c3t", "1xwg", "1usb", "1pkw", "1pkz",
    "1pl1", "1pl2", "1o8v", "2cds", "1hb6", "1hb8", "1hgw", "1hgy", "1egn", "1hbk",
    "2cbs", "3cbs", "2cbr", "1qjw", "1qk0", "1qk2", "2a2u", "2a2g", "1eg1", "1cb2",
    "1fss", "1lbs", "1lbt", "2chr", "1fcc", "1cbq", "1cbr", "1cbs", "1guh",
]
Let us now define a dictionary which will hold publications in three categories for each entry.
In [8]:
import collections
# Each PDB id maps to three sets of article keys, created on first access.
entry_pub_keys = collections.defaultdict(
    lambda: {"cited_by": set(), "appears_without_citation": set(), "depositor_citations": set()}
)
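To illustrate how this dictionary behaves, here is a small sketch using a separate, hypothetical `demo_pub_keys` and placeholder strings instead of real article keys. It shows that a record is created on first access and that adding the same item twice keeps a single copy.

```python
import collections

# Same shape as entry_pub_keys, under a different (hypothetical) name.
demo_pub_keys = collections.defaultdict(
    lambda: {"cited_by": set(), "appears_without_citation": set(), "depositor_citations": set()}
)

# Accessing an unseen PDB id creates its record with three empty sets.
demo_pub_keys["1cbs"]["cited_by"].add("article A")
# Adding the same item again does not create a duplicate.
demo_pub_keys["1cbs"]["cited_by"].add("article A")

print(len(demo_pub_keys["1cbs"]["cited_by"]))   # 1
print(sorted(demo_pub_keys["1cbs"].keys()))
```

One consequence worth keeping in mind: merely reading `demo_pub_keys[some_id]` inserts that id, so `len()` of the dictionary counts every id that has been accessed, not only those with publications.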
Since the two API calls return publications in slightly different formats, we need to store them in a more uniform data structure before further analysis. We should define a unique identifier for an article, so that we do not consider the same article twice.
The PubMed id would have been a good choice, but not all articles are indexed in PubMed, notably those from Acta Cryst!
So, let us create a unique key which is a composite of title, journal name, volume, pages, publication year and pubmed_id.
In [6]:
from collections import namedtuple
ArticleKey = namedtuple("ArticleKey", ["title","journal","volume","pages","year","pubmed_id"])
Let us write functions to create ArticleKey from the calls mentioned above.
In [10]:
def make_entry_citation_key(pub_info):
    # Records from /pdb/entry/publications keep journal details under "journal_info".
    return ArticleKey(
        pub_info["title"],
        pub_info["journal_info"]["pdb_abbreviation"],
        pub_info["journal_info"]["volume"],
        pub_info["journal_info"]["pages"],
        pub_info["journal_info"]["year"],
        pub_info["pubmed_id"],
    )

def make_entry_related_publication_key(pub_info):
    # Records from /pdb/entry/related_publications are flat.
    return ArticleKey(
        pub_info["title"],
        pub_info["journal"],
        pub_info["volume"],
        pub_info["pages"],
        pub_info["year"],
        pub_info["pubmed_id"],
    )
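The sketch below shows why a common key type is useful: two records for the same article, one in each layout, collapse to a single element in a set. All record values here are hypothetical placeholders, not real API output, and the keys are built inline so that the block runs on its own.

```python
from collections import namedtuple

ArticleKey = namedtuple("ArticleKey", ["title", "journal", "volume", "pages", "year", "pubmed_id"])

# Hypothetical record in the nested layout (journal details under "journal_info").
depositor_record = {
    "title": "An example title",
    "journal_info": {"pdb_abbreviation": "J.Example", "volume": "1", "pages": "1-10", "year": 2000},
    "pubmed_id": "0000000",
}
# Hypothetical record for the same article in the flat layout.
related_record = {
    "title": "An example title", "journal": "J.Example", "volume": "1",
    "pages": "1-10", "year": 2000, "pubmed_id": "0000000",
}

info = depositor_record["journal_info"]
key_from_depositor = ArticleKey(
    depositor_record["title"], info["pdb_abbreviation"], info["volume"],
    info["pages"], info["year"], depositor_record["pubmed_id"],
)
key_from_related = ArticleKey(*(related_record[field] for field in ArticleKey._fields))

# Identical field values give equal keys, so the set holds one article.
print(len({key_from_depositor, key_from_related}))   # 1

# Any field spelled differently by the two sources yields a distinct key.
key_other_spelling = key_from_related._replace(journal="Journal of Examples")
print(len({key_from_depositor, key_other_spelling}))  # 2
```

The second case is the limitation of a composite key: it only removes duplicates when every field matches exactly.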
Let us obtain unique articles from /pdb/entry/publications.
In [13]:
for pdb_id in pdb_ids_list:
    pub_url = PDBE_API_URL + "/pdb/entry/publications/" + pdb_id
    try:
        api_pub_data = get_PDBe_API_data(pub_url)[pdb_id]
    except Exception:
        logging.warning("Entry publications could not be obtained for PDB id " + pdb_id)
    else:
        for pub_info in api_pub_data:
            pub_key = make_entry_citation_key(pub_info)
            # Skip records that carry no publication year.
            if pub_key.year is None:
                continue
            entry_pub_keys[pdb_id]["depositor_citations"].add(pub_key)
logging.info("Entry publications obtained for %d entries." % len(entry_pub_keys))
Now let us do the same for /pdb/entry/related_publications.
In [15]:
for pdb_id in pdb_ids_list:
    pub_url = PDBE_API_URL + "/pdb/entry/related_publications/" + pdb_id
    try:
        api_pub_data = get_PDBe_API_data(pub_url)[pdb_id]
    except Exception:
        logging.warning("Entry related publications could not be obtained for PDB id " + pdb_id)
    else:
        # The response is nested: category -> publication type -> list of records.
        for pub_category in api_pub_data:
            for pub_type, publications in api_pub_data[pub_category].items():
                for pub_info in publications:
                    pub_key = make_entry_related_publication_key(pub_info)
                    # Skip records that carry no publication year.
                    if pub_key.year is None:
                        continue
                    entry_pub_keys[pdb_id][pub_category].add(pub_key)
logging.info("Entry related publications obtained for %d entries." % len(entry_pub_keys))
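The nesting that the loop above walks through can be sketched on mock data. The response shape (category, then publication type, then a list of records) is inferred from the loop itself; the type names "articles" and "reviews" and all record values are hypothetical, and `count_dated_publications` is an illustrative helper rather than part of the API.

```python
# Mock response for a single entry, shaped as the loop above assumes.
mock_response = {
    "1cbs": {
        "cited_by": {
            "articles": [{"title": "A", "year": 2001}, {"title": "B", "year": None}],
            "reviews": [{"title": "C", "year": 2003}],
        },
        "appears_without_citation": {
            "articles": [{"title": "D", "year": 2005}],
        },
    }
}


def count_dated_publications(entry_data):
    """Count records per category, ignoring those without a year."""
    counts = {}
    for pub_category, by_type in entry_data.items():
        counts[pub_category] = sum(
            1
            for publications in by_type.values()
            for pub_info in publications
            if pub_info["year"] is not None
        )
    return counts


counts = count_dated_publications(mock_response["1cbs"])
print(counts)   # {'cited_by': 2, 'appears_without_citation': 1}
```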
Let us rank entries according to the number of publications in all three categories, and print a few at the top.
In [16]:
for pub_category in ["depositor_citations", "cited_by", "appears_without_citation"]:
    # Rank entries by the number of publications in the current category.
    def key_func(pdb_id):
        return len(entry_pub_keys[pdb_id][pub_category])
    for pdb_id in sorted(entry_pub_keys.keys(), reverse=True, key=key_func)[0:3]:
        logging.info("PDB id %s has %d citations of type %s." % (pdb_id, len(entry_pub_keys[pdb_id][pub_category]), pub_category))
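For a short list like this one, sorting everything and slicing is perfectly adequate. As a sketch of an alternative, `heapq.nlargest` from the standard library returns the same top-N selection without the explicit slice. The counts below are made-up numbers for hypothetical ids, used only to show the mechanics.

```python
import heapq

# Made-up publication counts for hypothetical PDB ids.
mock_counts = {"1aaa": 4, "1bbb": 17, "1ccc": 9, "1ddd": 17, "1eee": 0}

# The approach used above: sort in descending order, then slice.
top_sorted = sorted(mock_counts, key=mock_counts.get, reverse=True)[:3]

# Alternative: ask directly for the three largest.
top_heap = heapq.nlargest(3, mock_counts, key=mock_counts.get)

print(top_sorted)   # ['1bbb', '1ddd', '1ccc']
print(top_heap)
```

Ties (here "1bbb" and "1ddd") keep their original relative order, because Python's sort is stable even with `reverse=True`.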
Let us plot the number of publications now.
In [18]:
%matplotlib inline
def make_publications_bar_plot(ordered_pdbids, xtick_maker):
    import matplotlib.pyplot as plt
    fig = plt.figure()
    ax = fig.add_subplot(111)
    plt.ylabel("Number of publications")
    pub_categories = ["depositor_citations", "cited_by", "appears_without_citation"]
    bar_colours = ["red", "green", "blue"]
    plot_objects = []
    xticks, xtick_labels = [], []
    for pci, pub_category in enumerate(pub_categories):
        x, y = [], []
        for pi, pdb_id in enumerate(ordered_pdbids):
            # One group of bars per entry, with a gap of one unit between groups.
            x_pos = 5 + pci + pi * (1 + len(pub_categories))
            x.append(x_pos)
            # Label each group under its middle bar.
            if pci == 1:
                xticks.append(x_pos)
                xtick_labels.append(xtick_maker(pdb_id))
            y.append(len(entry_pub_keys[pdb_id][pub_category]))
        plot_objects.append(ax.bar(x, y, color=bar_colours[pci]))
    ax.legend([po[0] for po in plot_objects], pub_categories)
    ax.set_xticks(xticks)
    xticks_obj = ax.set_xticklabels(xtick_labels)
    plt.setp(xticks_obj, rotation=90)
    plt.ylim([0, 200])
    plt.show()
# Pass a list: in Python 3, dict.keys() returns a view that cannot be indexed.
make_publications_bar_plot(list(entry_pub_keys.keys()), lambda pid: pid)
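The x coordinates in the plotting function follow a simple grouped-bar layout, which is easier to see in isolation. The sketch below reproduces only that arithmetic; `grouped_bar_positions` is a hypothetical helper, not part of the tutorial code.

```python
def grouped_bar_positions(n_entries, n_categories, offset=5):
    """Return one list of x positions per category.

    Each entry occupies n_categories adjacent slots followed by one empty
    slot, so consecutive groups are (1 + n_categories) units apart.
    """
    group_width = 1 + n_categories
    return [
        [offset + ci + ei * group_width for ei in range(n_entries)]
        for ci in range(n_categories)
    ]


positions = grouped_bar_positions(n_entries=2, n_categories=3)
print(positions)      # [[5, 9], [6, 10], [7, 11]]
print(positions[1])   # tick positions: the middle bar of each group
```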
Does the number of publications correlate with the year of deposition of the entry? The data we already fetched from the API includes publication years, the earliest of which should be close to the year of deposition.
Let us reorder the X axis of the plot by this year.
In [41]:
def earliest_year(pdb_id):
    # 3000 acts as a sentinel for entries without any dated publication.
    ret_year = 3000
    for pub_cat in entry_pub_keys[pdb_id]:
        for pub_key in entry_pub_keys[pdb_id][pub_cat]:
            if pub_key.year is None:
                logging.warning("Publication with missing year! " + str(pub_key))
                continue
            entry_year = int(pub_key.year)
            if ret_year > entry_year:
                ret_year = entry_year
    return ret_year
pdbids_with_publications = sorted(entry_pub_keys.keys(), key=earliest_year)
make_publications_bar_plot(pdbids_with_publications, lambda pid: pid + ":" + str(earliest_year(pid)))
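The plot gives a visual impression; one possible way to put a number on it is a Pearson correlation coefficient between the earliest year and the publication count. The sketch below implements the standard formula by hand and runs it on made-up numbers that are deliberately perfectly linear, so it says nothing about the real entries.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up data: counts fall by a fixed amount for each later year.
mock_years = [1994, 1998, 2002, 2006]
mock_counts = [40, 30, 20, 10]

r = pearson_correlation(mock_years, mock_counts)
print(round(r, 6))   # -1.0
```

With the real data, the inputs would be `earliest_year(pdb_id)` and the size of one publication set per entry, leaving out entries for which `earliest_year` returns the sentinel 3000. A value near -1 would indicate that older entries tend to have more publications.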