Impact of PDB entries

PDB entries are experimentally determined models of interesting proteins. Scholarly literature often refers to PDB entries when discussing aspects of macromolecules such as fold, function and folding. A basic way to measure the impact of a PDB entry is therefore simply to count the publications associated with it.

Getting started

Let us run the tutorial_utils notebook to set up the API URL, logger, caller utility, etc. Edit that notebook if you want to set anything up differently.



In [1]:

    
%run 'tutorial_utils.ipynb'

Publications from the API

There are two calls for entry publications - /pdb/entry/publications and /pdb/entry/related_publications. Check them out in the interactive documentation explorer of the API.

The first call provides articles associated with the entry directly, i.e. the ones which the depositor provided.

The second call provides articles and reviews mined from EuroPMC. These publications either cite the depositor's citations directly or merely mention the PDB entry id in the text of the article without explicitly citing any article. Note that articles with a mention of the PDB id can be mined by EuroPMC only when the full text of the article is freely available.
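
As a minimal sketch of how the two call URLs are assembled: the base URL below is the one that appears in the log output of this notebook, and the helper name publication_urls is hypothetical (it is not part of tutorial_utils). In the notebook itself, the base comes from PDBE_API_URL.

```python
# Base URL as it appears in this notebook's log output; in the notebook the
# value is provided by PDBE_API_URL from tutorial_utils.
API_BASE = "http://www.ebi.ac.uk/pdbe/api"

def publication_urls(pdb_id):
    """Return the URLs of the two publication calls for one PDB id.

    Hypothetical helper, for illustration only.
    """
    return {
        "publications": API_BASE + "/pdb/entry/publications/" + pdb_id,
        "related_publications": API_BASE + "/pdb/entry/related_publications/" + pdb_id,
    }

urls = publication_urls("1cbs")
print(urls["publications"])
print(urls["related_publications"])
```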

In this example, we will use entries deposited by 'Kleywegt' - I have copied Kleywegt's PDB ids using the PDBe search service. There are 39 as of August 2014. These can be obtained programmatically too; see the search_introduction notebook for details.



In [7]:

    
pdb_ids_list = [
    "2c10", "2c11", "1wc2", "2c3n", "2c3q", "2c3t", "1xwg", "1usb", "1pkw", "1pkz",
    "1pl1", "1pl2", "1o8v", "2cds", "1hb6", "1hb8", "1hgw", "1hgy", "1egn", "1hbk",
    "2cbs", "3cbs", "2cbr", "1qjw", "1qk0", "1qk2", "2a2u", "2a2g", "1eg1", "1cb2",
    "1fss", "1lbs", "1lbt", "2chr", "1fcc", "1cbq", "1cbr", "1cbs", "1guh",
]

Fetching publications from API

Let us now define a dictionary which will hold publications in three categories for each entry.



In [8]:

    
import collections
# Each PDB id maps to three sets of article keys, one per publication category.
entry_pub_keys = collections.defaultdict(
    lambda: {"cited_by": set(), "appears_without_citation": set(), "depositor_citations": set()}
)

Since the two API calls return publications in slightly different formats, we need to store them in a more uniform data structure before further analysis. We should define a unique identifier for an article, so that we do not consider the same article twice.

A PubMed id would have been a good choice, but not all articles are indexed in PubMed, notably those from Acta Cryst!

So, let us create a unique key which is a composite of title, journal name, volume, pages, publication year and pubmed_id.



In [6]:

    
from collections import namedtuple
ArticleKey = namedtuple("ArticleKey", ["title","journal","volume","pages","year","pubmed_id"])
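
To see why a namedtuple works as a unique key, here is a small sketch: namedtuples compare and hash by value, so adding the same article twice to a set keeps only one copy. The records below are made up purely for illustration and do not come from the API.

```python
from collections import namedtuple

ArticleKey = namedtuple("ArticleKey", ["title", "journal", "volume", "pages", "year", "pubmed_id"])

# Hypothetical records, invented for this example only.
first = ArticleKey("An example title", "J. Example", "12", "1-10", 1999, None)
duplicate = ArticleKey("An example title", "J. Example", "12", "1-10", 1999, None)
different = ArticleKey("Another example title", "J. Example", "13", "11-20", 2001, "0000000")

unique_articles = set()
for key in (first, duplicate, different):
    # Equal keys hash identically, so the duplicate is silently dropped.
    unique_articles.add(key)

print(len(unique_articles))
```

Note that the comparison is exact: if two sources spell a title or journal name differently, the resulting keys will be treated as different articles.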

Let us write functions to create ArticleKey from the calls mentioned above.



In [10]:

    
def make_entry_citation_key(pub_info) :
    return ArticleKey(
        pub_info["title"],
        pub_info["journal_info"]["pdb_abbreviation"],
        pub_info["journal_info"]["volume"],
        pub_info["journal_info"]["pages"],
        pub_info["journal_info"]["year"],
        pub_info["pubmed_id"],
    )

def make_entry_related_publication_key(pub_info) :
    return ArticleKey(
        pub_info["title"],
        pub_info["journal"],
        pub_info["volume"],
        pub_info["pages"],
        pub_info["year"],
        pub_info["pubmed_id"],
    )
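
The following sketch shows the two functions mapping differently shaped records onto the same key. The field names are the ones used by the functions above; the two sample records are hypothetical and only mimic the shapes those functions expect, they are not real API responses.

```python
from collections import namedtuple

ArticleKey = namedtuple("ArticleKey", ["title", "journal", "volume", "pages", "year", "pubmed_id"])

def make_entry_citation_key(pub_info):
    # Journal details are nested under "journal_info" in this format.
    journal_info = pub_info["journal_info"]
    return ArticleKey(
        pub_info["title"],
        journal_info["pdb_abbreviation"],
        journal_info["volume"],
        journal_info["pages"],
        journal_info["year"],
        pub_info["pubmed_id"],
    )

def make_entry_related_publication_key(pub_info):
    # Journal details are top-level fields in this format.
    return ArticleKey(
        pub_info["title"],
        pub_info["journal"],
        pub_info["volume"],
        pub_info["pages"],
        pub_info["year"],
        pub_info["pubmed_id"],
    )

# Hypothetical sample records describing the same (invented) article.
depositor_record = {
    "title": "An example title",
    "journal_info": {"pdb_abbreviation": "J. Example", "volume": "12", "pages": "1-10", "year": 1999},
    "pubmed_id": "0000000",
}
related_record = {
    "title": "An example title",
    "journal": "J. Example",
    "volume": "12",
    "pages": "1-10",
    "year": 1999,
    "pubmed_id": "0000000",
}

key_a = make_entry_citation_key(depositor_record)
key_b = make_entry_related_publication_key(related_record)
print(key_a == key_b)
```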

Let us obtain unique articles from /pdb/entry/publications



In [13]:

    
for pdb_id in pdb_ids_list:
    pub_url = PDBE_API_URL + "/pdb/entry/publications/" + pdb_id
    try:
        api_pub_data = get_PDBe_API_data(pub_url)[pdb_id]
    except Exception:
        # The call failed or returned no data for this id: log it and move on.
        logging.warning("Entry publications could not be obtained for PDB id " + pdb_id)
    else:
        for pub_info in api_pub_data:
            pub_key = make_entry_citation_key(pub_info)
            # Skip publications without a year; the year is needed later on.
            if pub_key.year is None:
                continue
            entry_pub_keys[pdb_id]["depositor_citations"].add(pub_key)

logging.info("Entry publications obtained for %d entries." % len(entry_pub_keys))

LOG|11-Nov-2014 14:15:09|INFO  Entry publications obtained for 38 entries.

Now let us do the same for /pdb/entry/related_publications



In [15]:

    
for pdb_id in pdb_ids_list:
    pub_url = PDBE_API_URL + "/pdb/entry/related_publications/" + pdb_id
    try:
        api_pub_data = get_PDBe_API_data(pub_url)[pdb_id]
    except Exception:
        # The call failed or returned no data for this id: log it and move on.
        logging.warning("Entry related publications could not be obtained for PDB id " + pdb_id)
    else:
        # The response is nested: category -> publication type -> list of publications.
        for pub_category in api_pub_data:
            for pub_type, publications in api_pub_data[pub_category].items():
                for pub_info in publications:
                    pub_key = make_entry_related_publication_key(pub_info)
                    if pub_key.year is None:
                        continue
                    entry_pub_keys[pdb_id][pub_category].add(pub_key)

logging.info("Entry related publications obtained for %d entries." % len(entry_pub_keys))

LOG|11-Nov-2014 14:16:40|WARNING  Error fetching PDBe-API data! Trial number 0 for call http://www.ebi.ac.uk/pdbe/api/pdb/entry/related_publications/2cds
LOG|11-Nov-2014 14:16:40|WARNING  Error fetching PDBe-API data! Trial number 1 for call http://www.ebi.ac.uk/pdbe/api/pdb/entry/related_publications/2cds
LOG|11-Nov-2014 14:16:40|WARNING  Error fetching PDBe-API data! Trial number 2 for call http://www.ebi.ac.uk/pdbe/api/pdb/entry/related_publications/2cds
LOG|11-Nov-2014 14:16:40|WARNING  Entry related publications could not be obtained for PDB id 2cds
LOG|11-Nov-2014 14:16:41|INFO  Entry related publications obtained for 38 entries.

Ranking entries on impact

Let us rank entries by the number of publications in each of the three categories, and print the top three for each category.



In [16]:

    
for pub_category in ["depositor_citations", "cited_by", "appears_without_citation"]:
    # Number of publications of the current category for a given entry.
    def key_func(pdb_id):
        return len(entry_pub_keys[pdb_id][pub_category])
    # Sort entries by that number, largest first, and report the top three.
    for pdb_id in sorted(entry_pub_keys.keys(), reverse=True, key=key_func)[0:3]:
        logging.info("PDB id %s has %d citations of type %s."
                     % (pdb_id, key_func(pdb_id), pub_category))

LOG|11-Nov-2014 14:18:48|INFO  PDB id 1cbr has 5 citations of type depositor_citations.
LOG|11-Nov-2014 14:18:48|INFO  PDB id 1cbs has 5 citations of type depositor_citations.
LOG|11-Nov-2014 14:18:48|INFO  PDB id 1cbq has 5 citations of type depositor_citations.
LOG|11-Nov-2014 14:18:48|INFO  PDB id 1qk2 has 156 citations of type cited_by.
LOG|11-Nov-2014 14:18:48|INFO  PDB id 1qk0 has 156 citations of type cited_by.
LOG|11-Nov-2014 14:18:48|INFO  PDB id 1qjw has 156 citations of type cited_by.
LOG|11-Nov-2014 14:18:48|INFO  PDB id 1fss has 11 citations of type appears_without_citation.
LOG|11-Nov-2014 14:18:48|INFO  PDB id 1fcc has 10 citations of type appears_without_citation.
LOG|11-Nov-2014 14:18:48|INFO  PDB id 1cbs has 5 citations of type appears_without_citation.
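
Several entries tie in the ranking above (for example, three entries each have 156 cited_by publications), so the order among them depends on the order in which the ids happen to be stored. A minimal sketch of a ranking with an explicit tie-break is given below; the helper name top_entries and the toy data are hypothetical and are not part of the notebook.

```python
def top_entries(pub_keys, category, n=3):
    """Return (pdb_id, count) pairs for the n entries with the most
    publications in the given category. Hypothetical helper."""
    counts = [(pdb_id, len(categories[category])) for pdb_id, categories in pub_keys.items()]
    # Sort by descending count, then by PDB id so that ties come out in a fixed order.
    counts.sort(key=lambda item: (-item[1], item[0]))
    return counts[:n]

# Toy data, invented for illustration; the set members stand in for article keys.
toy_pub_keys = {
    "aaaa": {"cited_by": {"p1", "p2", "p3"}},
    "bbbb": {"cited_by": {"p1"}},
    "cccc": {"cited_by": {"p1", "p2", "p3"}},
    "dddd": {"cited_by": set()},
}

ranking = top_entries(toy_pub_keys, "cited_by", n=2)
print(ranking)
```

In the notebook, the same function could be called as top_entries(entry_pub_keys, "cited_by").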

Let us plot the number of publications now.



In [18]:

    
%matplotlib inline
import matplotlib.pyplot as plt

def make_publications_bar_plot(ordered_pdbids, xtick_maker):
    # Copy into a list so that any iterable (e.g. a dict keys view) can be indexed.
    ordered_pdbids = list(ordered_pdbids)
    pub_categories = ["depositor_citations", "cited_by", "appears_without_citation"]
    bar_colours = ["red", "green", "blue"]
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.set_ylabel("Number of publications")
    plot_objects = []
    xticks, xtick_labels = [], []
    # Each entry gets one bar per category, followed by a one-unit gap.
    group_width = 1 + len(pub_categories)
    for pci, pub_category in enumerate(pub_categories):
        x, y = [], []
        for pi, pdb_id in enumerate(ordered_pdbids):
            x.append(5 + pci + pi * group_width)
            if pci == 1:
                # Label each group once, under its middle bar.
                xticks.append(x[-1])
                xtick_labels.append(xtick_maker(pdb_id))
            y.append(len(entry_pub_keys[pdb_id][pub_category]))
        plot_objects.append(ax.bar(x, y, color=bar_colours[pci]))
    ax.legend([po[0] for po in plot_objects], pub_categories)
    ax.set_xticks(xticks)
    xticks_obj = ax.set_xticklabels(xtick_labels)
    plt.setp(xticks_obj, rotation=90)
    ax.set_ylim([0, 200])
    plt.show()

make_publications_bar_plot(entry_pub_keys.keys(), lambda pid: pid)

Does the number of publications correlate with the year of deposition of the entry? The data we have already fetched from the API includes publication years, the earliest of which would be close to the year of deposition.

Let us reorder X axis of the plot based on this year.



In [41]:

    
def earliest_year(pdb_id):
    # 3000 is a sentinel: it is returned unchanged if the entry has no dated publication.
    ret_year = 3000
    for pub_cat in entry_pub_keys[pdb_id]:
        for pub_key in entry_pub_keys[pdb_id][pub_cat]:
            if pub_key.year is None:
                logging.warning("Publication with missing year! " + str(pub_key))
                continue
            ret_year = min(ret_year, int(pub_key.year))
    return ret_year

# Order the entries by the year of their earliest publication.
pdbids_with_publications = sorted(entry_pub_keys.keys(), key=earliest_year)
make_publications_bar_plot(pdbids_with_publications, lambda pid: pid + ":" + str(earliest_year(pid)))
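
The plot gives a visual impression only. One possible way to put a number on the relationship is a Pearson correlation coefficient between the earliest year and the publication count. The sketch below implements it in plain Python; the function name pearson and the year and count values are hypothetical, chosen for illustration, and are not results from the API.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 values each")
    mean_x = sum(xs) / float(n)
    mean_y = sum(ys) / float(n)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values: earliest publication year and number of citing articles.
years = [1993, 1995, 1999, 2004, 2005]
counts = [50, 40, 30, 20, 10]

r = pearson(years, counts)
print("Pearson r = %.2f" % r)
```

In the notebook, the two sequences could be built from earliest_year(pid) and len(entry_pub_keys[pid]["cited_by"]) for each pid in pdbids_with_publications. Entries without any dated publication (for which earliest_year returns 3000) should be left out first.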

Your turn!

Explore the correlation between the number of publications associated with an entry and the protein studied, co-authors and resolution.
Repeat this exercise on another subset of entries, such as those containing your favourite protein - do you see any patterns in the volume of structural work on your protein as a function of year?