pypdb demos

This is a set of basic examples of the usage and outputs of the individual functions included in pypdb. There are generally two types of functions:

  • Functions that perform searches and return lists of PDB IDs
  • Functions that get information about specific PDB IDs

The list of supported search types, and of the kinds of information that can be returned for a given PDB ID, is large (and growing); it is enumerated in full in the docstrings of pypdb.py. The PDB supports a very wide range of query types, so an option that is not yet available can likely be implemented easily by following the structure of the query types that already exist. I appreciate any feedback and pull requests.

Another notebook in this directory, advanced_demos.ipynb, covers more in-depth uses that combine multiple functions, including the tutorial on graphing the popularity of CRISPR that was originally included in this notebook.

Preamble


In [1]:
%pylab inline
from IPython.display import HTML

from pypdb.pypdb import *

import pprint


Populating the interactive namespace from numpy and matplotlib

1. Search functions that return lists of PDB IDs

Get a list of PDBs for a specific search term


In [2]:
search_dict = make_query('actin network')
found_pdbs = do_search(search_dict)
print(found_pdbs)


['1D7M', '3W3D', '4A7H', '4A7L', '4A7N']

Search by a specific modified structure


In [3]:
search_dict = make_query('3W3D',querytype='ModifiedStructuresQuery')
found_pdbs = do_search(search_dict)
print(found_pdbs)


['1H3E', '1OB2', '1OB5', '1QPC', '2D2G', '2D2H', '2D2J', '2JUE', '2LN8', '2N2Q', '2WDC', '2XU8', '3BF7', '3BF8', '3DLS', '3J1O', '3JA8', '3VNY', '3VNZ', '3VO0', '4B3Z', '4B7V', '4C2M', '4C4O', '4CDQ', '4CM5', '4COU', '4COV', '4COW', '4COY', '4COZ', '4CRC', '4D3W', '4MJ3', '4N1R', '4NDJ', '4NDK', '4OIV', '4PA0', '4QDJ', '4QDK', '4QS7', '4QS8', '4QS9', '4RFE', '4RFN', '4RFO', '4RPT', '4RV3', '4S3G', '4TQU', '4TQV', '4U2Y', '4U31', '4U33', '4U3C', '4UDZ', '4UF1', '4UF2', '4UF3', '4UIR', '4UJ3', '4UJ4', '4UJ5', '4UY4', '4UZJ', '4W6V', '4WGK', '4WIH', '4WJ1', '4WJ2', '4WMQ', '4WMY', '4WNJ', '4WNS', '4WO0', '4WT8', '4WUB', '4WWY', '4WXH', '4WXV', '4WZ0', '4WZ2', '4WZ3', '4WZG', '4XAA', '4XAB', '4XAC', '4XBZ', '4XC9', '4XCA', '4XCB', '4XFW', '4XK0', '4XLD', '4XN6', '4XNC', '4XNE', '4XOY', '4XOZ', '4XP0', '4XP2', '4XP3', '4XPF', '4XPG', '4XPH', '4XR9', '4XRJ', '4XSG', '4XSH', '4XSX', '4XSY', '4XSZ', '4Y6N', '4Y6U', '4Y7F', '4Y7G', '4Y9X', '4YCX', '4YD1', '4YD2', '4YD8', '4YH8', '4YLU', '4YMK', '4YTC', '4YTF', '4YTH', '4YTI', '4YU9', '4YVG', '4YVH', '4YWY', '4Z0V', '4Z2B', '4Z8B', '4ZFX', '4ZH2', '4ZH3', '4ZH4', '4ZHC', '4ZHD', '4ZHF', '4ZHG', '4ZHH', '4ZL4', '4ZN8', '4ZOT', '4ZP4', '4ZPH', '4ZPI', '4ZPK', '4ZPR', '4ZQD', '4ZSC', '4ZSD', '5A2E', '5A2F', '5A2T', '5A42', '5A63', '5AAM', '5AFO', '5AIT', '5AIU', '5AK7', '5AK8', '5BNV', '5BNX', '5BO0', '5BOJ', '5BV3', '5BWM', '5C1M', '5C1Z', '5C23', '5C30', '5C33', '5C7U', '5C7W', '5CE7', '5CPF']

Search by Author


In [4]:
search_dict = make_query('Perutz, M.F.',querytype='AdvancedAuthorQuery')
found_pdbs = do_search(search_dict)
print(found_pdbs)


['1CQ4', '1FDH', '1GDJ', '1HDA', '1PBX', '2DHB', '2GDM', '2HHB', '2MHB', '3HHB', '4HHB']

Search by Motif


In [5]:
search_dict = make_query('T[AG]AGGY',querytype='MotifQuery')
found_pdbs = do_search(search_dict)
print(found_pdbs)


['3LEZ', '3SGH', '4F47']

Search by a specific experimental method


In [6]:
search_dict = make_query('SOLID-STATE NMR',querytype='ExpTypeQuery')
found_pdbs = do_search(search_dict)
print(found_pdbs)


['1CEK', '1EQ8', '1M8M', '1MAG', '1MP6', '1MZT', '1NH4', '1NYJ', '1PI7', '1PI8', '1PJD', '1PJE', '1PJF', '1Q7O', '1RVS', '1XSW', '1ZN5', '1ZY6', '2C0X', '2CZP', '2E8D', '2H3O', '2H95', '2JSV', '2JU6', '2JZZ', '2K0P', '2KAD', '2KB7', '2KHT', '2KIB', '2KJ3', '2KLR', '2KQ4', '2KQT', '2KRJ', '2KSJ', '2KWD', '2KYV', '2L0J', '2L3Z', '2LBU', '2LEG', '2LGI', '2LJ2', '2LME', '2LMN', '2LMO', '2LMP', '2LMQ', '2LNL', '2LNQ', '2LNY', '2LPZ', '2LTQ', '2LU5', '2M02', '2M3B', '2M3G', '2M4J', '2M5K', '2M5M', '2M5N', '2M67', '2MC7', '2MCU', '2MCV', '2MCW', '2MCX', '2MEX', '2MJZ', '2MME', '2MMU', '2MPZ', '2MSG', '2MTZ', '2MVX', '2MXU', '2N0R', '2N1E', '2NNT', '2RLZ', '2UVS', '2W0N', '2XKM', '3ZPK']

Search for entries without free ligands


In [4]:
search_dict = make_query('', querytype='NoLigandQuery')
found_pdbs = do_search(search_dict)
print(found_pdbs[:10])


['100D', '101D', '101M', '102D', '102L', '102M', '103L', '103M', '104M', '105M']

Search by protein symmetry group


In [8]:
kk = do_protsym_search('C9', min_rmsd=0.0, max_rmsd=1.0)
print(kk[:5])


['1KZU', '1NKZ', '2FKW', '3B8M', '3B8N']

Information Search functions

While the basic search functions in the previous section return raw lists of PDB IDs, the functions in this section are intended to be more user-facing: they take search keywords and return lists of authors, papers, or dates.

Find most common authors for a given keyword


In [10]:
top_authors = find_authors('crispr',max_results=100)
pprint.pprint(top_authors[:5])


['Doudna, J.A.', 'Jinek, M.', 'Nam, K.H.', 'Ke, A.', 'Li, H.']

Find papers for a given keyword


In [31]:
matching_papers = find_papers('crispr',max_results=3)
pprint.pprint(matching_papers)


['Crystal structure of a CRISPR-associated protein from thermus thermophilus', 'CRYSTAL STRUCTURE OF HYPOTHETICAL PROTEIN SSO1404 FROM SULFOLOBUS SOLFATARICUS P2', 'NMR solution structure of a CRISPR repeat binding protein']
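
The titles above come back in mixed case, and one of them does not contain the search keyword at all, so a case-insensitive filter can be handy when only titles that literally mention the term are wanted. The following is a minimal sketch; `titles_mentioning` is a hypothetical helper, not part of pypdb, and the titles are copied from the output above.

```python
def titles_mentioning(titles, keyword):
    """Keep only the titles that contain the keyword, ignoring case."""
    needle = keyword.casefold()
    return [title for title in titles if needle in title.casefold()]


# Titles copied from the find_papers('crispr', max_results=3) output above.
matching_papers = [
    "Crystal structure of a CRISPR-associated protein from thermus thermophilus",
    "CRYSTAL STRUCTURE OF HYPOTHETICAL PROTEIN SSO1404 FROM SULFOLOBUS SOLFATARICUS P2",
    "NMR solution structure of a CRISPR repeat binding protein",
]

explicit = titles_mentioning(matching_papers, "crispr")
print(len(explicit))
```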

2. Functions that return information about single PDB entries

Get the full PDB file


In [24]:
pdb_file = get_pdb_file('4lza', filetype='cif', compression=True)
print(pdb_file[:200])


data_4LZA
# 
_entry.id   4LZA 
# 
_audit_conform.dict_name       mmcif_pdbx.dic 
_audit_conform.dict_version    4.032 
_audit_conform.dict_location   http://mmcif.pdb.org/dictionaries/ascii/mmcif_pdbx
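
Because the file comes back as plain text, simple single-line items such as `_entry.id` can be pulled out without a dedicated parser. The sketch below is deliberately naive: `parse_simple_items` is a hypothetical helper, it ignores `loop_` tables, multi-line values, and quoting rules, and it is no substitute for a real mmCIF parser. The sample text is copied from the output above.

```python
def parse_simple_items(cif_text):
    """Collect '_category.item value' pairs that fit on a single line.

    Naive by design: column names inside loop_ blocks have no value on
    the same line and are skipped; quoted values keep their quotes.
    """
    items = {}
    for line in cif_text.splitlines():
        line = line.strip()
        if not line.startswith("_"):
            continue
        parts = line.split(None, 1)  # split on the first run of whitespace
        if len(parts) == 2:
            items[parts[0]] = parts[1].strip()
    return items


# First lines of the get_pdb_file('4lza', filetype='cif') output shown above.
header = """data_4LZA
#
_entry.id   4LZA
#
_audit_conform.dict_name       mmcif_pdbx.dic
_audit_conform.dict_version    4.032
"""

items = parse_simple_items(header)
print(items["_entry.id"])
```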

Get a general description of the entry's metadata


In [4]:
describe_pdb('4lza')


Out[4]:
{'citation_authors': 'Malashkevich, V.N., Bhosle, R., Toro, R., Hillerich, B., Gizzi, A., Garforth, S., Kar, A., Chan, M.K., Lafluer, J., Patel, H., Matikainen, B., Chamala, S., Lim, S., Celikgil, A., Villegas, G., Evans, B., Love, J., Fiser, A., Khafizov, K., Seidel, R., Bonanno, J.B., Almo, S.C.',
 'deposition_date': '2013-07-31',
 'expMethod': 'X-RAY DIFFRACTION',
 'keywords': 'TRANSFERASE',
 'last_modification_date': '2013-08-14',
 'nr_atoms': '0',
 'nr_entities': '1',
 'nr_residues': '390',
 'release_date': '2013-08-14',
 'resolution': '1.84',
 'status': 'CURRENT',
 'structureId': '4LZA',
 'structure_authors': 'Malashkevich, V.N., Bhosle, R., Toro, R., Hillerich, B., Gizzi, A., Garforth, S., Kar, A., Chan, M.K., Lafluer, J., Patel, H., Matikainen, B., Chamala, S., Lim, S., Celikgil, A., Villegas, G., Evans, B., Love, J., Fiser, A., Khafizov, K., Seidel, R., Bonanno, J.B., Almo, S.C., New York Structural Genomics Research Consortium (NYSGRC)',
 'title': 'Crystal structure of adenine phosphoribosyltransferase from Thermoanaerobacter pseudethanolicus ATCC 33223, NYSGRC Target 029700.'}
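
Every value in the dictionary above is a string, including the counts, the resolution, and the dates. A small conversion step makes the result easier to filter or sort. This is a sketch under assumptions: `coerce_description` is a hypothetical helper, the field names are taken from the output above, and entries that lack a field (or carry a non-numeric value) are not handled beyond skipping missing keys.

```python
from datetime import date

# Field names copied from the describe_pdb('4lza') output shown above.
INT_FIELDS = ("nr_atoms", "nr_entities", "nr_residues")
FLOAT_FIELDS = ("resolution",)
DATE_FIELDS = ("deposition_date", "release_date", "last_modification_date")


def coerce_description(desc):
    """Return a copy of a describe_pdb-style dict with typed values."""
    out = dict(desc)
    for key in INT_FIELDS:
        if key in out:
            out[key] = int(out[key])
    for key in FLOAT_FIELDS:
        if key in out:
            out[key] = float(out[key])
    for key in DATE_FIELDS:
        if key in out:
            out[key] = date.fromisoformat(out[key])  # expects 'YYYY-MM-DD'
    return out


# A subset of the output above, values unchanged.
sample = {
    "deposition_date": "2013-07-31",
    "nr_atoms": "0",
    "nr_entities": "1",
    "nr_residues": "390",
    "resolution": "1.84",
    "structureId": "4LZA",
}

clean = coerce_description(sample)
print(clean["resolution"] < 2.0, clean["deposition_date"].year)
```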

Get all of the information deposited in a PDB entry


In [35]:
all_info = get_all_info('4lza')
print(all_info)


{'polymer': {'macroMolecule': {'@name': 'Adenine phosphoribosyltransferase', 'accession': {'@id': 'B0K969'}}, '@entityNr': '1', '@type': 'protein', 'polymerDescription': {'@description': 'Adenine phosphoribosyltransferase'}, 'synonym': {'@name': 'APRT'}, '@length': '195', 'enzClass': {'@ec': '2.4.2.7'}, 'chain': [{'@id': 'A'}, {'@id': 'B'}], 'Taxonomy': {'@name': 'Thermoanaerobacter pseudethanolicus ATCC 33223', '@id': '340099'}, '@weight': '22023.9'}, 'id': '4LZA'}

In [9]:
results = get_all_info('2F5N')
first_polymer = results['polymer'][0]
first_polymer['polymerDescription']


Out[9]:
{'@description': "5'-D(*AP*GP*GP*TP*AP*GP*AP*CP*CP*TP*GP*GP*AP*CP*GP*C)-3'"}
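
Note that the two calls above return differently shaped data: for 4LZA, which has a single entity, `'polymer'` is a dict, while for 2F5N it is a list that can be indexed with `[0]`. Code that loops over polymers has to cope with both. One common way to do that is sketched below; `as_list` is a hypothetical helper, not a pypdb function, and the second entry in the multi-entity sample is a placeholder rather than real data.

```python
def as_list(value):
    """Wrap a single item in a list; pass lists through; map None to []."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]


# Trimmed from the get_all_info('4lza') output above: one polymer, stored as a dict.
single = {"polymer": {"@entityNr": "1", "@type": "protein", "@length": "195"},
          "id": "4LZA"}

# Shaped like the 2F5N result: several polymers, stored as a list.
# The first description is from the output above; the second entry is a placeholder.
multi = {"polymer": [
    {"polymerDescription": {
        "@description": "5'-D(*AP*GP*GP*TP*AP*GP*AP*CP*CP*TP*GP*GP*AP*CP*GP*C)-3'"}},
    {"polymerDescription": {"@description": "placeholder"}},
], "id": "2F5N"}

for info in (single, multi):
    polymers = as_list(info.get("polymer"))
    print(info["id"], len(polymers))
```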

Run a BLAST search on an entry

There are two options here. One function, get_blast(), returns a dict() like most of the other functions; however, the large amount of metadata in a BLAST result leads to deeply nested dictionaries. A simpler function, get_blast2(), parses the text of the raw output page and returns a tuple consisting of (1) a ranked list of the PDB IDs that were hits and (2) a list of the actual BLAST alignments and similarity scores.


In [11]:
blast_results = get_blast('2F5N', chain_id='A')
just_hits = blast_results['BlastOutput_iterations']['Iteration']['Iteration_hits']['Hit']
print(just_hits[50]['Hit_hsps']['Hsp']['Hsp_hseq'])


PELPEVETVRRELEKRIVGQKIISIEATYPRMVL--TGFEQLKKELTGKTIQGISRRGKYLIFEIGDDFRLISHLRMEGKYRLATLDAPREKHDHLTMKFADG-QLIYADVRKFGTWELISTDQVLPYFLKKKIGPEPTYEDFDEKLFREKLRKSTKKIKPYLLEQTLVAGLGNIYVDEVLWLAKIHPEKETNQLIESSIHLLHDSIIEILQKAIKLGGSSIRTY-SALGSTGKMQNELQVYGKTGEKCSRCGAEIQKIKVAGRGTHFCPVCQQ
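
Chained indexing like the cell above raises a KeyError or IndexError as soon as one level is missing, for example when a search returns fewer than 51 hits. A tolerant lookup is sketched below. `dig` is a hypothetical helper, the key names are taken from the cell above, and the mock dictionary is a stand-in holding a single hit with a truncated sequence, not real get_blast() output.

```python
def dig(data, path, default=None):
    """Follow a sequence of dict keys / list indices, returning default on a miss."""
    current = data
    for key in path:
        if isinstance(current, dict) and key in current:
            current = current[key]
        elif (isinstance(current, list) and isinstance(key, int)
              and -len(current) <= key < len(current)):
            current = current[key]
        else:
            return default
    return current


# Key names copied from the get_blast() cell above.
HITS_PATH = ("BlastOutput_iterations", "Iteration", "Iteration_hits", "Hit")
HSEQ_PATH = ("Hit_hsps", "Hsp", "Hsp_hseq")

# Mock result with one hit; the sequence is a truncated prefix of the output above.
mock = {"BlastOutput_iterations": {"Iteration": {"Iteration_hits": {"Hit": [
    {"Hit_hsps": {"Hsp": {"Hsp_hseq": "PELPEVETVRR"}}},
]}}}}

print(dig(mock, HITS_PATH + (0,) + HSEQ_PATH))
print(dig(mock, HITS_PATH + (50,) + HSEQ_PATH, default="no such hit"))
```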

In [12]:
blast_results = get_blast2('2F5N', chain_id='A', output_form='HTML')
print('Total Results: ' + str(len(blast_results[0])) +'\n')
pprint.pprint(blast_results[1][0])


Total Results: 84

<pre>
&gt;<a name="45354"></a>2F5P:3:A|pdbid|entity|chain(s)|sequence
          Length = 274

 Score =  545 bits (1404), Expect = e-155,   Method: Composition-based stats.
 Identities = 274/274 (100%), Positives = 274/274 (100%)

Query: 1   MPELPEVETIRRTLLPLIVGKTIEDVRIFWPNIIRHPRDSEAFAARMIGQTVRGLERRGK 60
           MPELPEVETIRRTLLPLIVGKTIEDVRIFWPNIIRHPRDSEAFAARMIGQTVRGLERRGK
Sbjct: 1   MPELPEVETIRRTLLPLIVGKTIEDVRIFWPNIIRHPRDSEAFAARMIGQTVRGLERRGK 60

Query: 61  FLKFLLDRDALISHLRMEGRYAVASALEPLEPHTHVVFCFTDGSELRYRDVRKFGTMHVY 120
           FLKFLLDRDALISHLRMEGRYAVASALEPLEPHTHVVFCFTDGSELRYRDVRKFGTMHVY
Sbjct: 61  FLKFLLDRDALISHLRMEGRYAVASALEPLEPHTHVVFCFTDGSELRYRDVRKFGTMHVY 120

Query: 121 AKEEADRRPPLAELGPEPLSPAFSPAVLAERAVKTKRSVKALLLDCTVVAGFGNIYVDES 180
           AKEEADRRPPLAELGPEPLSPAFSPAVLAERAVKTKRSVKALLLDCTVVAGFGNIYVDES
Sbjct: 121 AKEEADRRPPLAELGPEPLSPAFSPAVLAERAVKTKRSVKALLLDCTVVAGFGNIYVDES 180

Query: 181 LFRAGILPGRPAASLSSKEIERLHEEMVATIGEAVMKGGSTVRTYVNTQGEAGTFQHHLY 240
           LFRAGILPGRPAASLSSKEIERLHEEMVATIGEAVMKGGSTVRTYVNTQGEAGTFQHHLY
Sbjct: 181 LFRAGILPGRPAASLSSKEIERLHEEMVATIGEAVMKGGSTVRTYVNTQGEAGTFQHHLY 240

Query: 241 VYGRQGNPCKRCGTPIEKTVVAGRGTHYCPRCQR 274
           VYGRQGNPCKRCGTPIEKTVVAGRGTHYCPRCQR
Sbjct: 241 VYGRQGNPCKRCGTPIEKTVVAGRGTHYCPRCQR 274
</pre>
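
Since get_blast2() hands back each alignment as text, summary numbers such as the bit score and percent identity can be recovered with regular expressions. The sketch below assumes the header lines always look like the ones printed above; `summarize_alignment` is a hypothetical helper, and the sample is copied from that output.

```python
import re

IDENT_RE = re.compile(r"Identities\s*=\s*(\d+)/(\d+)")
SCORE_RE = re.compile(r"Score\s*=\s*([\d.]+)\s+bits")


def summarize_alignment(text):
    """Extract bit score and identity counts from one alignment block, or None."""
    ident = IDENT_RE.search(text)
    score = SCORE_RE.search(text)
    if ident is None or score is None:
        return None
    identical, aligned = int(ident.group(1)), int(ident.group(2))
    return {
        "bit_score": float(score.group(1)),
        "identical": identical,
        "aligned": aligned,
        "identity": identical / aligned,
    }


# Header lines copied from the first alignment printed above.
sample = """
 Score =  545 bits (1404), Expect = e-155,   Method: Composition-based stats.
 Identities = 274/274 (100%), Positives = 274/274 (100%)
"""

summary = summarize_alignment(sample)
print(summary)
```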

Get PFAM information about an entry


In [37]:
pfam_info = get_pfam('2LME')
print(pfam_info)


{'pfamHit': {'@pfamAcc': 'PF03895.10', '@pfamName': 'YadA_anchor', '@structureId': '2LME', '@pdbResNumEnd': '105', '@pdbResNumStart': '28', '@pfamDesc': 'YadA-like C-terminal region', '@eValue': '5.0E-22', '@chainId': 'A'}}

Get chemical info

This function takes the ID of the chemical (for example 'NAG'), not a PDB ID.


In [39]:
chem_desc = describe_chemical('NAG')
pprint.pprint(chem_desc)


{'describeHet': {'ligandInfo': {'ligand': {'@chemicalID': 'NAG',
                                           '@molecularWeight': '221.208',
                                           '@type': 'D-saccharide',
                                           'InChI': 'InChI=1S/C8H15NO6/c1-3(11)9-5-7(13)6(12)4(2-10)15-8(5)14/h4-8,10,12-14H,2H2,1H3,(H,9,11)/t4-,5-,6-,7-,8-/m1/s1',
                                           'InChIKey': 'OVRNDRQMDRJTHS-FMDGEEDCSA-N',
                                           'chemicalName': 'N-ACETYL-D-GLUCOSAMINE',
                                           'formula': 'C8 H15 N O6',
                                           'smiles': 'CC(=O)N[C@@H]1[C@H]([C@@H]([C@H](O[C@H]1O)CO)O)O'}}}}

Get ligand info if present


In [40]:
ligand_dict = get_ligands('100D')
pprint.pprint(ligand_dict)


{'id': '100D',
 'ligandInfo': {'ligand': {'@chemicalID': 'SPM',
                           '@molecularWeight': '202.34',
                           '@structureId': '100D',
                           '@type': 'non-polymer',
                           'InChI': 'InChI=1S/C10H26N4/c11-5-3-9-13-7-1-2-8-14-10-4-6-12/h13-14H,1-12H2',
                           'InChIKey': 'PFNFFQXMRSDOHW-UHFFFAOYSA-N',
                           'chemicalName': 'SPERMINE',
                           'formula': 'C10 H26 N4',
                           'smiles': 'C(CCNCCCN)CNCCCN'}}}

Get gene ontology info


In [45]:
gene_info = get_gene_onto('4Z0L')
pprint.pprint(gene_info['term'][0])


{'@chainId': 'A',
 '@id': 'GO:0001516',
 '@structureId': '4Z0L',
 'detail': {'@definition': 'The chemical reactions and pathways resulting '
                           'in the formation of prostaglandins, any of a '
                           'group of biologically active metabolites which '
                           'contain a cyclopentane ring.',
            '@name': 'prostaglandin biosynthetic process',
            '@ontology': 'B',
            '@synonyms': 'prostaglandin anabolism, prostaglandin '
                         'biosynthesis, prostaglandin formation, '
                         'prostaglandin synthesis'}}

Get sequence clusters by chain


In [41]:
sclust = get_seq_cluster('2F5N.A')
pprint.pprint(sclust['pdbChain'][:10]) # Just look at the top 10


[{'@name': '4PD2.A', '@rank': '1'},
 {'@name': '3U6P.A', '@rank': '2'},
 {'@name': '4PCZ.A', '@rank': '3'},
 {'@name': '3GPU.A', '@rank': '4'},
 {'@name': '3JR5.A', '@rank': '5'},
 {'@name': '3SAU.A', '@rank': '6'},
 {'@name': '3GQ4.A', '@rank': '7'},
 {'@name': '1R2Z.A', '@rank': '8'},
 {'@name': '3U6E.A', '@rank': '9'},
 {'@name': '2XZF.A', '@rank': '10'}]
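
Each cluster member is named as `ID.chain`, and the rank is stored as a string, so sorting on it directly would put '10' before '2'. A short sketch of turning such a list into plain PDB IDs in rank order follows; `split_chain` is a hypothetical helper, and the entries are copied from the output above in deliberately shuffled order.

```python
def split_chain(name):
    """Split 'XXXX.C' into (pdb_id, chain); chain is None if no '.' is present."""
    pdb_id, sep, chain = name.partition(".")
    return pdb_id, (chain if sep else None)


# Entries copied from the get_seq_cluster('2F5N.A') output above, shuffled.
members = [
    {"@name": "2XZF.A", "@rank": "10"},
    {"@name": "3U6P.A", "@rank": "2"},
    {"@name": "4PD2.A", "@rank": "1"},
]

# Convert the rank to int before sorting so that '10' sorts after '2'.
ordered = sorted(members, key=lambda entry: int(entry["@rank"]))
ids = [split_chain(entry["@name"])[0] for entry in ordered]
print(ids)
```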

Get the representative for a chain


In [46]:
clusts = get_clusters('4hhb.A')
print(clusts)


{'pdbChain': {'@name': '2W72.A'}}

List all taxa associated with a list of IDs


In [13]:
crispr_query = make_query('crispr')
crispr_results = do_search(crispr_query)
pprint.pprint(list_taxa(crispr_results[:10]))


['Thermus thermophilus',
 'Sulfolobus solfataricus P2',
 'Hyperthermus butylicus DSM 5456',
 'unidentified phage',
 'Sulfolobus solfataricus P2',
 'Pseudomonas aeruginosa UCBPP-PA14',
 'Pseudomonas aeruginosa UCBPP-PA14',
 'Pseudomonas aeruginosa UCBPP-PA14',
 'Sulfolobus solfataricus',
 'Thermus thermophilus HB8']
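
The returned list has one entry per ID, so repeated organisms appear several times, and strain-level names such as 'Thermus thermophilus HB8' are distinct from the bare species name. Tallying them with `collections.Counter` gives a quick overview. In the sketch below, collapsing names to their first two words is a crude heuristic assumed here for illustration, not something pypdb does; the list is copied from the output above.

```python
from collections import Counter

# Copied from the list_taxa(crispr_results[:10]) output above.
taxa = [
    "Thermus thermophilus",
    "Sulfolobus solfataricus P2",
    "Hyperthermus butylicus DSM 5456",
    "unidentified phage",
    "Sulfolobus solfataricus P2",
    "Pseudomonas aeruginosa UCBPP-PA14",
    "Pseudomonas aeruginosa UCBPP-PA14",
    "Pseudomonas aeruginosa UCBPP-PA14",
    "Sulfolobus solfataricus",
    "Thermus thermophilus HB8",
]

# Exact-name tally: strains are counted separately.
by_name = Counter(taxa)

# Crude species-level tally: keep only the first two words of each name.
by_species = Counter(" ".join(name.split()[:2]) for name in taxa)

print(by_name.most_common(1))
print(by_species["Sulfolobus solfataricus"])
```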

List data types with a list of IDs


In [14]:
crispr_query = make_query('crispr')
crispr_results = do_search(crispr_query)
pprint.pprint(list_types(crispr_results[:5]))


['protein', 'protein', 'protein', 'protein', 'protein']
