PyPDB advanced demos

This is a set of examples showing how algorithmic querying with PyPDB can be used to perform advanced search tasks. Most of them combine several API functions so that one search operates on the results of a previous search.

Preamble


In [1]:
%pylab inline
from IPython.display import HTML

from pypdb.pypdb import *

import pprint


Populating the interactive namespace from numpy and matplotlib

In [2]:
# Search for PDB IDs related to CRISPR
crispr_query = make_query('crispr')
crispr_results = do_search(crispr_query)

# Run BLAST on the top result
top_result = crispr_results[0]
blast_hits = get_blast2(top_result)


for item in blast_hits[0]:
    pdbdesc = describe_pdb(item)
    print(pdbdesc['title'])


Structure of Thermus Thermophilus Cse3 bound to an RNA representing a product complex
Structure of Thermus Thermophilus Cse3 bound to an RNA representing a pre-cleavage complex
Structure of Thermus Thermophilus Cse3 bound to an RNA representing a product mimic complex
STRUCTURE A OF CRISPR ENDORIBONUCLEASE CSE3 BOUND TO 19 NT RNA
STRUCTURE B OF CRISPR ENDORIBONUCLEASE CSE3 BOUND TO 19 NT RNA
STRUCTURE OF CRISPR ENDORIBONUCLEASE CSE3 BOUND TO 20 NT RNA
Crystal structure of a CRISPR-associated protein from thermus thermophilus
Crystal structure of the E. coli CRISPR RNA-guided surveillance complex, Cascade
Crystal structure of a CRISPR RNA-guided surveillance complex, Cascade, bound to a ssDNA target
Crystal structure of RNA-guided immune Cascade complex from E.coli
Crystal structure of the CRISPR-associated protein Cas6e from Escherichia coli str. K-12

Estimate total number of depositions versus time


In [3]:
# Choose a random sample because we don't want to call the database for every single entry
from random import choice

all_pdbs = get_all()

all_dates = list()

for ii in range(100):
    pdb_desc = describe_pdb( choice(all_pdbs) )
    depdate = (pdb_desc['deposition_date'])
    all_dates.append( int(depdate[:4]) )
    
all_dates = array(all_dates)

figure()
subs_v_time = hist(all_dates, max(all_dates)-min(all_dates))
show(subs_v_time)   

# Show power-law scaling
figure()
subs_v_time_loglog = loglog(subs_v_time[0],'.')
show(subs_v_time_loglog)
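
The histogram above uses max(all_dates)-min(all_dates) as the bin count, which gives zero bins when every sampled entry happens to fall in the same year. A minimal sketch of an alternative that counts entries per calendar year is shown below. It assumes the deposition dates are strings that begin with a four-digit year, as the depdate[:4] slice above implies; count_by_year and the sample dates are hypothetical and are not part of PyPDB.

```python
import numpy as np

def count_by_year(date_strings):
    """Return (years, counts) with exactly one bin per calendar year.

    Assumes each string starts with a four-digit year, e.g. 'YYYY-MM-DD'.
    """
    years = np.array([int(d[:4]) for d in date_strings])
    first, last = years.min(), years.max()
    # Offsetting by the first year lets bincount produce one slot per year,
    # including years with no entries and the single-year case.
    counts = np.bincount(years - first, minlength=last - first + 1)
    return np.arange(first, last + 1), counts

# Made-up dates for illustration only; these are not real PDB records.
sample_dates = ["1998-03-02", "2001-11-15", "2001-01-09", "2003-07-30"]
years, counts = count_by_year(sample_dates)
print(years.tolist())
print(counts.tolist())
```

The returned arrays can be passed directly to bar(years, counts) or fill_between(years, 0, counts) in place of the hist call.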


Graph new CRISPR entries versus time


In [5]:
# Perform search and keep only the four-digit year of each deposition date
all_dates = find_dates('crispr', max_results=500)
all_dates = array([int(depdate[:4]) for depdate in all_dates])
# Use one histogram bin per year spanned by the results
subs_v_time = histogram(all_dates, max(all_dates)-min(all_dates))
# Pair each count with the upper edge of its bin
dates, num_entries = subs_v_time[1][1:], subs_v_time[0]
popgraph = fill_between(dates, 0, num_entries)

# Formatting the plots
xlim([dates[0], dates[-1]] )
gca().xaxis.set_major_formatter(FormatStrFormatter('%d'))
xticks(fontweight='bold')
yticks(fontweight='bold')
xlabel('Year',fontweight='bold')
ylabel('New PDB entries',fontweight='bold')
show(popgraph)


Sweep RMSD matching parameters


In [15]:
point_group = 'C1'
max_distance = 5.0
npts = 20

dist_vals = linspace(0.0, max_distance, npts)
dx = dist_vals[1]-dist_vals[0]

all_ids = []

for dist_val in dist_vals:
    idlist = do_protsym_search(point_group, min_rmsd=dist_val, max_rmsd=(dist_val+dx))
    all_ids.append(idlist)
    
counts = array([len(item) for item in all_ids])
show(semilogy(dist_vals, counts))
title('Total results versus RMSD')
xlabel('RMSD (Å)')
ylabel('Number of results')


Out[15]:
<matplotlib.text.Text at 0x10a076668>
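
In the sweep above, linspace places a window start at max_distance itself, so the last query covers the range from max_distance to max_distance+dx. One possible way to keep every window inside the chosen range is to build the window edges first and pair them up. The sketch below illustrates the idea with a stand-in search function; sweep_windows, fake_search, and the RMSD values are hypothetical and do not call PyPDB.

```python
import numpy as np

def sweep_windows(search, lo, hi, nwindows):
    """Call search(window_min, window_max) on adjacent windows covering [lo, hi]."""
    # nwindows + 1 edges define nwindows windows that share boundaries
    edges = np.linspace(lo, hi, nwindows + 1)
    results = [search(a, b) for a, b in zip(edges[:-1], edges[1:])]
    return edges, results

def fake_search(min_rmsd, max_rmsd):
    # Hypothetical stand-in for a real search: filters made-up RMSD values
    fake_rmsds = [0.2, 0.4, 1.1, 1.3, 1.4, 4.9]
    return [r for r in fake_rmsds if min_rmsd <= r < max_rmsd]

edges, results = sweep_windows(fake_search, 0.0, 5.0, 5)
counts = [len(r) for r in results]
print(counts)
```

With a real search, a lambda such as lambda a, b: do_protsym_search(point_group, min_rmsd=a, max_rmsd=b) could be passed as the search argument, using the same keyword arguments as the cell above.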

Find all associated organisms and structure types


In [56]:
# Search for PDB IDs related to the keyword 'swim'
swim_query = make_query('swim')
swim_results = do_search(swim_query)

# Run BLAST on the top result
top_result = swim_results[0]
blast_hits = get_blast2(top_result)

# Print list of associated taxa
pprint.pprint(list_taxa(blast_hits[0][:5]))
pprint.pprint(list_types(blast_hits[0][:5]))


['Thunnus thynnus',
 'Thunnus thynnus',
 'Thunnus thynnus',
 'Trematomus bernacchii',
 'Trematomus bernacchii']
['protein', 'protein', 'protein', 'protein', 'protein']
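
Because the same organism can appear several times among the hits, it may be more useful to tally the list than to print it. A minimal sketch using collections.Counter follows; the taxa list is copied from the output above rather than fetched, so the example runs without a network connection.

```python
from collections import Counter

# Taxa copied from the printed output above (first five BLAST hits)
taxa = ['Thunnus thynnus',
        'Thunnus thynnus',
        'Thunnus thynnus',
        'Trematomus bernacchii',
        'Trematomus bernacchii']

# Count how many hits belong to each organism, most frequent first
tally = Counter(taxa)
for organism, n in tally.most_common():
    print(organism, n)
```

In a live session, Counter(list_taxa(blast_hits[0])) would apply the same tally to every hit instead of only the first five.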
