Often we want to see how a ligand binds to a domain and how such binding sites compare across multiple PDB entries.
In this example, we will use the PDBe API to :
|
|
Let us run the tutorial_utils notebook to setup API URL, logger, caller utility, etc. Check out that notebook to setup anything differently.
In [48]:
%run 'tutorial_utils.ipynb'
Let us begin by finding PDB entries that contain retinoic acid (PDB chemical component REA). For this, we will use the API call /pdb/compound/in_pdb.
In [49]:
chem_comp_id = "REA"
cc_entries_URL = PDBE_API_URL + "/pdb/compound/in_pdb/" + chem_comp_id
cc_entries_data = get_PDBe_API_data(cc_entries_URL) # function from tutorial_utils
cc_entries = cc_entries_data[chem_comp_id]
logging.info("There are %d PDB entries containing chemical component %s." % (len(cc_entries), chem_comp_id))
Hmmm, so 31 PDB entries contain REA.
Alternatively, we could have used the PDBe search system too to find these entries. See the search_introduction notebook to know more.
So far so good. Now let's figure the domain composition of these entries by using the SIFTS mappings call. This call returns many types of mappings, such as UniProt accessions, Pfam families, InterPro, SCOP, CATH, etc. We will make one mapping call for each PDB entry id.
In [50]:
def get_mappings() :
all_mappings = {}
for pdb_id in cc_entries :
mappings_data = get_PDBe_API_data(PDBE_API_URL+"/mappings/"+pdb_id)
try :
all_mappings[pdb_id] = mappings_data[pdb_id]
print ".",
except KeyError :
logging.warn("Mappings call failed for entry " + pdb_id)
print ""
logging.info("Mappings fetched for %d of %d entries." % (len(all_mappings), len(cc_entries)))
return all_mappings
mappings = get_mappings()
Which mappings are available? Let us count.
In [51]:
import collections
mapping_counts = collections.defaultdict(lambda:0)
for pdb_id in mappings :
for mapping_type, mappings_info in mappings[pdb_id].items() :
if len(mappings_info) > 0 :
mapping_counts[mapping_type] += 1
for mt, mc in mapping_counts.items() :
logging.info("%3d entries have %s mappings." % (mc, mt))
Let us focus on CATH superfamilies and count how many times each superfamily is mapped onto entries containing the compound of our interest.
In [54]:
dom_type = "CATH"
dom_counts = collections.defaultdict(lambda:0)
dom_family_name = {}
for pdb_id in mappings :
if not mappings[pdb_id].has_key(dom_type) :
continue
else :
for dom_id, mranges in mappings[pdb_id][dom_type].items() :
dom_family_name[dom_id] = mappings[pdb_id][dom_type][dom_id]["homology"]
dom_counts[dom_id] += 1
for dom_id, frequency in dom_counts.items() :
logging.info("%2d entries contain CATH superfamily %11s (%s)." % (frequency, dom_id, dom_family_name[dom_id]))
We see there are 9 entries which contain CATH domain called Lipocalin (2.40.128.20).
Let us now make a reverse associative array from PDB chains to superfamilies, and note that a chain can contain multiple CATH domains.
In [55]:
chain_to_domain = collections.defaultdict(lambda:[])
for pdb_id in mappings :
if mappings[pdb_id].has_key(dom_type) :
for dom_id, mapping_info in mappings[pdb_id][dom_type].items() :
for arange in mapping_info["mappings"] :
chain_to_domain[(pdb_id,arange["chain_id"])].append(dom_id)
for (pdb_id,chain_id), dom_ids in chain_to_domain.items() [0:5] :
logging.info("PDB id %s chain %s contain domains %s" % (pdb_id, chain_id, ", ".join(dom_ids)))
Now let us turn our attention to binding sites information. Let us write a function that will fetch it using binding sites API call.
In [56]:
def get_binding_sites() :
all_sites = collections.defaultdict(lambda:[])
for pdb_id in cc_entries :
api_data = get_PDBe_API_data(PDBE_API_URL+"/pdb/entry/binding_sites/"+pdb_id)
for asite in api_data[pdb_id] :
if any((alig["chem_comp_id"] == chem_comp_id for alig in asite["ligand_residues"])) :
all_sites[pdb_id].append(asite)
return all_sites
Ideally we expect there to be at least one binding site for REA defined in each entry. Let us verify that.
In [57]:
binding_sites = get_binding_sites()
binding_frequency = collections.defaultdict(lambda:0)
for pdb_id, sites_info in binding_sites.items() :
binding_frequency[len(sites_info)] += 1
for num_sites, num_entries in binding_frequency.items() :
logging.info("%2d entries have %2d binding sites of %s." % (num_entries, num_sites, chem_comp_id))
Let us make a list of chains which bind our favourite compound.
In [58]:
binding_chains = set()
for pdb_id, sites_info in binding_sites.items() :
for asite in sites_info :
for res_info in asite["site_residues"] :
binding_chains.add( (pdb_id, res_info["chain_id"]) )
logging.info("%d PDB chains bind %s, e.g. %s ...." % ( \
len(binding_chains), chem_comp_id, \
" ".join([bc[0]+":"+bc[1] for bc in list(binding_chains)[0:5]]) \
))
Now we can find chains common to binding sites and CATH superfamilies to find instances of domains with binding ligands.
In [59]:
domain_of_choice = "2.40.128.20"
for pdb_chain in binding_chains :
if chain_to_domain.has_key(pdb_chain) and domain_of_choice in chain_to_domain[pdb_chain] :
logging.info("PDB chain %s:%s binds %s and contains CATH domains %s" % \
(pdb_chain[0], pdb_chain[1], chem_comp_id, ",".join(chain_to_domain[pdb_chain])))
Hurray! we now have a dataset of chains we can study to understand the binding between retinoic acid and the CATH domain!
There are many improvement you can readily make to this recipe to obtain a better dataset for your study, such as: