Compile a dataset of ligand-binding domains

Often we want to see how a ligand binds to a domain and how such binding sites compare across multiple PDB entries.

In this example, we will use the PDBe API to :
  • find the PDB entries containing retinoic acid (REA),
  • analyze the domain composition of those entries,
  • find the lipocalin CATH domains which interact with REA,
  • find a set of chains that contain lipocalin domain and bind to REA too, and
  • write your own improvements to the demo code.

Getting started

Let us run the tutorial_utils notebook to setup API URL, logger, caller utility, etc. Check out that notebook to setup anything differently.


In [48]:
%run 'tutorial_utils.ipynb'

Entries containing reinoic acid

Let us begin by finding PDB entries that contain retinoic acid (PDB chemical component REA). For this, we will use the API call /pdb/compound/in_pdb.


In [49]:
chem_comp_id = "REA"
cc_entries_URL = PDBE_API_URL + "/pdb/compound/in_pdb/" + chem_comp_id
cc_entries_data = get_PDBe_API_data(cc_entries_URL) # function from tutorial_utils
cc_entries = cc_entries_data[chem_comp_id]
logging.info("There are %d PDB entries containing chemical component %s." % (len(cc_entries), chem_comp_id))


LOG|11-Nov-2014 13:37:31|INFO  There are 31 PDB entries containing chemical component REA.

Hmmm, so 31 PDB entries contain REA.

Alternatively, we could have used the PDBe search system too to find these entries. See the search_introduction notebook to know more.

Obtaining the domains

So far so good. Now let's figure the domain composition of these entries by using the SIFTS mappings call. This call returns many types of mappings, such as UniProt accessions, Pfam families, InterPro, SCOP, CATH, etc. We will make one mapping call for each PDB entry id.


In [50]:
def get_mappings() :
    all_mappings = {}
    for pdb_id in cc_entries :
        mappings_data = get_PDBe_API_data(PDBE_API_URL+"/mappings/"+pdb_id)
        try :
            all_mappings[pdb_id] = mappings_data[pdb_id]
            print ".",
        except KeyError :
            logging.warn("Mappings call failed for entry " + pdb_id)
    print ""
    logging.info("Mappings fetched for %d of %d entries." % (len(all_mappings), len(cc_entries)))
    return all_mappings

mappings = get_mappings()


. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
LOG|11-Nov-2014 13:37:39|INFO  Mappings fetched for 31 of 31 entries.

Which mappings are available? Let us count.


In [51]:
import collections
mapping_counts = collections.defaultdict(lambda:0)

for pdb_id in mappings :
    for mapping_type, mappings_info in mappings[pdb_id].items() :
        if len(mappings_info) > 0 :
            mapping_counts[mapping_type] += 1

for mt, mc in mapping_counts.items() :
    logging.info("%3d entries have %s mappings." % (mc, mt))


LOG|11-Nov-2014 13:37:43|INFO   31 entries have Pfam mappings.
LOG|11-Nov-2014 13:37:43|INFO   31 entries have InterPro mappings.
LOG|11-Nov-2014 13:37:43|INFO   26 entries have CATH mappings.
LOG|11-Nov-2014 13:37:43|INFO    4 entries have EC mappings.
LOG|11-Nov-2014 13:37:43|INFO   31 entries have UniProt mappings.
LOG|11-Nov-2014 13:37:43|INFO   17 entries have SCOP mappings.
LOG|11-Nov-2014 13:37:43|INFO   31 entries have GO mappings.

Let us focus on CATH superfamilies and count how many times each superfamily is mapped onto entries containing the compound of our interest.


In [54]:
dom_type = "CATH"
dom_counts = collections.defaultdict(lambda:0)
dom_family_name = {}

for pdb_id in mappings :
    if not mappings[pdb_id].has_key(dom_type) :
        continue
    else :
        for dom_id, mranges in mappings[pdb_id][dom_type].items() :
            dom_family_name[dom_id] = mappings[pdb_id][dom_type][dom_id]["homology"]
            dom_counts[dom_id] += 1

for dom_id, frequency in dom_counts.items() :
    logging.info("%2d entries contain CATH superfamily %11s (%s)." % (frequency, dom_id, dom_family_name[dom_id]))


LOG|11-Nov-2014 13:39:14|INFO   9 entries contain CATH superfamily 2.40.128.20 (Lipocalin).
LOG|11-Nov-2014 13:39:14|INFO   2 entries contain CATH superfamily 2.60.40.180 (Immunoglobulin-like).
LOG|11-Nov-2014 13:39:14|INFO   3 entries contain CATH superfamily  3.30.50.10 (Erythroid Transcription Factor GATA-1, subunit A).
LOG|11-Nov-2014 13:39:14|INFO  15 entries contain CATH superfamily 1.10.565.10 (Retinoid X Receptor).
LOG|11-Nov-2014 13:39:14|INFO   1 entries contain CATH superfamily 1.10.630.10 (Cytochrome p450).

We see there are 9 entries which contain CATH domain called Lipocalin (2.40.128.20).

Let us now make a reverse associative array from PDB chains to superfamilies, and note that a chain can contain multiple CATH domains.


In [55]:
chain_to_domain = collections.defaultdict(lambda:[])

for pdb_id in mappings :
    if mappings[pdb_id].has_key(dom_type) :
        for dom_id, mapping_info in mappings[pdb_id][dom_type].items() :
            for arange in mapping_info["mappings"] :
                chain_to_domain[(pdb_id,arange["chain_id"])].append(dom_id)

for (pdb_id,chain_id), dom_ids in chain_to_domain.items() [0:5] :
    logging.info("PDB id %s chain %s contain domains %s" % (pdb_id, chain_id, ", ".join(dom_ids)))


LOG|11-Nov-2014 13:39:29|INFO  PDB id 3fal chain D contain domains 1.10.565.10
LOG|11-Nov-2014 13:39:29|INFO  PDB id 3fc6 chain B contain domains 1.10.565.10
LOG|11-Nov-2014 13:39:29|INFO  PDB id 3dzy chain A contain domains 3.30.50.10, 1.10.565.10
LOG|11-Nov-2014 13:39:29|INFO  PDB id 1xls chain B contain domains 1.10.565.10
LOG|11-Nov-2014 13:39:29|INFO  PDB id 1xiu chain B contain domains 1.10.565.10

Obtaining the binding sites

Now let us turn our attention to binding sites information. Let us write a function that will fetch it using binding sites API call.


In [56]:
def get_binding_sites() :
    all_sites = collections.defaultdict(lambda:[])
    for pdb_id in cc_entries :
        api_data = get_PDBe_API_data(PDBE_API_URL+"/pdb/entry/binding_sites/"+pdb_id)
        for asite in api_data[pdb_id] :
            if any((alig["chem_comp_id"] == chem_comp_id for alig in asite["ligand_residues"])) :
                all_sites[pdb_id].append(asite)
    return all_sites

Ideally we expect there to be at least one binding site for REA defined in each entry. Let us verify that.


In [57]:
binding_sites = get_binding_sites()

binding_frequency = collections.defaultdict(lambda:0)

for pdb_id, sites_info in binding_sites.items() :
    binding_frequency[len(sites_info)] += 1
    
for num_sites, num_entries in binding_frequency.items() :
    logging.info("%2d entries have %2d binding sites of %s." % (num_entries, num_sites, chem_comp_id))


LOG|11-Nov-2014 13:39:31|INFO  15 entries have  1 binding sites of REA.
LOG|11-Nov-2014 13:39:31|INFO  12 entries have  2 binding sites of REA.
LOG|11-Nov-2014 13:39:31|INFO   4 entries have  4 binding sites of REA.

Let us make a list of chains which bind our favourite compound.


In [58]:
binding_chains = set()
for pdb_id, sites_info in binding_sites.items() :
    for asite in sites_info :
        for res_info in asite["site_residues"] :
            binding_chains.add( (pdb_id, res_info["chain_id"]) )

logging.info("%d PDB chains bind %s, e.g. %s ...." % ( \
            len(binding_chains), chem_comp_id, \
            " ".join([bc[0]+":"+bc[1] for bc in list(binding_chains)[0:5]]) \
))


LOG|11-Nov-2014 13:39:34|INFO  55 PDB chains bind REA, e.g. 1n4h:A 4nqa:H 3dzy:A 1xls:B 1xiu:B ....

Now we can find chains common to binding sites and CATH superfamilies to find instances of domains with binding ligands.


In [59]:
domain_of_choice = "2.40.128.20"
for pdb_chain in binding_chains :
    if chain_to_domain.has_key(pdb_chain) and domain_of_choice in chain_to_domain[pdb_chain] :
        logging.info("PDB chain %s:%s binds %s and contains CATH domains %s" % \
                     (pdb_chain[0], pdb_chain[1], chem_comp_id, ",".join(chain_to_domain[pdb_chain])))


LOG|11-Nov-2014 13:39:37|INFO  PDB chain 1epb:A binds REA and contains CATH domains 2.40.128.20
LOG|11-Nov-2014 13:39:37|INFO  PDB chain 1rlb:E binds REA and contains CATH domains 2.40.128.20
LOG|11-Nov-2014 13:39:37|INFO  PDB chain 2g78:A binds REA and contains CATH domains 2.40.128.20
LOG|11-Nov-2014 13:39:37|INFO  PDB chain 1cbr:A binds REA and contains CATH domains 2.40.128.20
LOG|11-Nov-2014 13:39:37|INFO  PDB chain 1gx9:A binds REA and contains CATH domains 2.40.128.20
LOG|11-Nov-2014 13:39:37|INFO  PDB chain 1cbs:A binds REA and contains CATH domains 2.40.128.20
LOG|11-Nov-2014 13:39:37|INFO  PDB chain 1rlb:F binds REA and contains CATH domains 2.40.128.20
LOG|11-Nov-2014 13:39:37|INFO  PDB chain 1fem:A binds REA and contains CATH domains 2.40.128.20
LOG|11-Nov-2014 13:39:37|INFO  PDB chain 1epb:B binds REA and contains CATH domains 2.40.128.20
LOG|11-Nov-2014 13:39:37|INFO  PDB chain 1cbr:B binds REA and contains CATH domains 2.40.128.20
LOG|11-Nov-2014 13:39:37|INFO  PDB chain 2fr3:A binds REA and contains CATH domains 2.40.128.20
LOG|11-Nov-2014 13:39:37|INFO  PDB chain 3cwk:A binds REA and contains CATH domains 2.40.128.20

Hurray! we now have a dataset of chains we can study to understand the binding between retinoic acid and the CATH domain!

Your turn!

There are many improvement you can readily make to this recipe to obtain a better dataset for your study, such as:

  • Use validation information to choose only high quality binding sites. You can use the validation API calls to check gloval quality metrics like resolution, Rfree, or more local parameters like residue-wise geometry outliers and electron density fit.
  • Make a non-redundant set of binding sites by choosing only one from each PDB id. Use your favourite sequence alignment software to remove chains with very similar sequences.
  • Choose chains where there is sufficient overlap between binding site residues and domain-mapped residues.