Strained conformation or outlier?

Validation of protein structures is a collection of checks that quantify the expectation of occurrence of features in a model. An unexpected feature is not always a modelling error, and it is sometimes argued that such feature is strained but stabilized by its neighbouring atoms. Additional evidence needs to be gathered in such tricky situation.

In addition to checking electron density itself, it would be useful to check whether similarly strained feature occurs in other models of that protein, e.g.

  • the asymmetric unit of an entry might have more than one copy of the protein, or
  • there might be multiple MODEL records each containing a copy of the protein, or
  • there might be other entries with the same protein.

In general, it would be useful to create a quality profile from multiple models of a protein to see whether a residue systematically exhibits an unusual geometry, or whether the geometry is an error not observed in other models.

In this tutorial, let us see how backbone and sidechain features of residues compare across the multiple copies of GroEL in PDB entry 1kp8 determined in 2003 at resolution of 2Å.

Getting started

Let us run the tutorial_utils notebook to setup API URL, logger, caller utility, etc. Check out that notebook to setup anything differently.


In [2]:
%run 'tutorial_utils.ipynb'

Fetching validation information

Now we will identify the molecule number of GroEL in entry 1kp8 using the /pdb/entry/molecules call.


In [3]:
the_pdbid = "1kp8"

mols_data = get_PDBe_API_data(PDBE_API_URL + "/pdb/entry/molecules/" + the_pdbid)[the_pdbid]
logging.info("PDB entry %s has %d types of molecules." % (the_pdbid, len(mols_data)))

for mol_info in mols_data :
    if mol_info["molecule_type"] == "polypeptide(L)" :
        logging.info( "Molecule no. %d, '%s', is a protein modelled in %d chains %s." % \
            (mol_info["entity_id"], mol_info["molecule_name"], \
            len(mol_info["in_chains"]), ",".join(mol_info["in_chains"])) \
        )


LOG|11-Nov-2014 14:24:38|INFO  PDB entry 1kp8 has 6 types of molecules.
LOG|11-Nov-2014 14:24:38|INFO  Molecule no. 1, '60 kDa chaperonin', is a protein modelled in 14 chains A,B,C,D,E,F,G,H,I,J,K,L,M,N.

So the protein of interest is molecule no. 1 and there are 14 copies of it.

Let us now obtain per-residue information on backbone (Ramachandran) and sidechain quality using the call /validation/rama_sidechain_listing/.


In [4]:
rama_data = get_PDBe_API_data(PDBE_API_URL + "/validation/rama_sidechain_listing/entry/" + the_pdbid) [the_pdbid]

Let us define a container which will hold information about Ramachandran state (outlier, allowed, favoured) and rotamer state (rotamer name or outlier) for each residue of each chain.


In [5]:
import collections
outliers = {
    "rama":collections.defaultdict(lambda: collections.defaultdict(lambda:[])),
    "rota":collections.defaultdict(lambda: collections.defaultdict(lambda:[])),
}

Let us populate this container with Rama and sidechain information from chains modelling molecule no. 1 in the first and only MODEL in the entry.


In [6]:
for mol in rama_data["molecules"] :
    if str(mol["entity_id"]) != "1" :
        continue
    for chain in mol["chains"] :
        model = chain["models"][0]
        logging.info("Chain id %s has Rama sidechain validation for %d residues." % (chain["chain_id"], len(model["residues"])))
        for residue in model["residues"] :
            res_id = (residue["residue_name"],residue["author_residue_number"],residue["author_insertion_code"])
            outliers["rama"] [res_id] [ residue["rama"] ].append( chain["chain_id"] )
            outliers["rota"] [res_id] [ residue["rota"] ].append( chain["chain_id"] )


LOG|11-Nov-2014 14:25:33|INFO  Chain id A has Rama sidechain validation for 524 residues.
LOG|11-Nov-2014 14:25:33|INFO  Chain id C has Rama sidechain validation for 524 residues.
LOG|11-Nov-2014 14:25:33|INFO  Chain id B has Rama sidechain validation for 524 residues.
LOG|11-Nov-2014 14:25:33|INFO  Chain id E has Rama sidechain validation for 524 residues.
LOG|11-Nov-2014 14:25:33|INFO  Chain id D has Rama sidechain validation for 524 residues.
LOG|11-Nov-2014 14:25:33|INFO  Chain id G has Rama sidechain validation for 524 residues.
LOG|11-Nov-2014 14:25:33|INFO  Chain id F has Rama sidechain validation for 524 residues.
LOG|11-Nov-2014 14:25:33|INFO  Chain id I has Rama sidechain validation for 524 residues.
LOG|11-Nov-2014 14:25:33|INFO  Chain id H has Rama sidechain validation for 524 residues.
LOG|11-Nov-2014 14:25:33|INFO  Chain id K has Rama sidechain validation for 524 residues.
LOG|11-Nov-2014 14:25:33|INFO  Chain id J has Rama sidechain validation for 524 residues.
LOG|11-Nov-2014 14:25:33|INFO  Chain id M has Rama sidechain validation for 524 residues.
LOG|11-Nov-2014 14:25:33|INFO  Chain id L has Rama sidechain validation for 524 residues.
LOG|11-Nov-2014 14:25:33|INFO  Chain id N has Rama sidechain validation for 524 residues.

Finding the odd ones

Now let us write a function which will find the residues that have more than one Rama or sidechain state, such that at least one of the states occurs a majority of times.


In [8]:
def print_unusual_residues(val_key, major_state_freq) :
    for res_id, val_info in  outliers[val_key].items() :
        # not interested in all residues at this index to be in the same state
        if len(val_info) == 1 : continue
        # not interested in this residue index if no state occurs at least with required frequency
        if not any( [len(val_info[k]) >= major_state_freq for k in val_info] ) :
            continue
        # write out the popular state and also those occurring less frequently
        val_keys = sorted( val_info.keys(), key = lambda vk:len(val_info[vk]), reverse=True )
        state_frequencies = ", ".join(["%7s"%k+":%2d" % len(val_info[k]) for k in val_keys])
        minor_chains_str = " ".join(["%s:%s"%(k,",".join(val_info[k])) for k in val_keys[1:]])
        logging.info("Residue [%3s %4s%s] has multiple %s states: %s || Unusual chains %s" % \
            (res_id[0], res_id[1], res_id[2], val_key, state_frequencies, minor_chains_str) \
        )

Let us print residues with unusual Ramachandran state that contrasts against counterparts in other chains.


In [9]:
print_unusual_residues("rama", 12)


LOG|11-Nov-2014 14:30:37|INFO  Residue [PRO  462] has multiple rama states: Favored:12, Allowed: 2 || Unusual chains Allowed:E,L
LOG|11-Nov-2014 14:30:37|INFO  Residue [ILE  342] has multiple rama states: Favored:13, Allowed: 1 || Unusual chains Allowed:I
LOG|11-Nov-2014 14:30:37|INFO  Residue [LEU  222] has multiple rama states: Allowed:12, Favored: 2 || Unusual chains Favored:D,G
LOG|11-Nov-2014 14:30:37|INFO  Residue [ALA  383] has multiple rama states: OUTLIER:13, Allowed: 1 || Unusual chains Allowed:A
LOG|11-Nov-2014 14:30:37|INFO  Residue [THR  497] has multiple rama states: Favored:13, Allowed: 1 || Unusual chains Allowed:K
LOG|11-Nov-2014 14:30:37|INFO  Residue [GLU  156] has multiple rama states: Favored:13, Allowed: 1 || Unusual chains Allowed:E
LOG|11-Nov-2014 14:30:37|INFO  Residue [ASP  253] has multiple rama states: OUTLIER:12, Allowed: 2 || Unusual chains Allowed:D,N
LOG|11-Nov-2014 14:30:37|INFO  Residue [SER  154] has multiple rama states: Favored:12, Allowed: 2 || Unusual chains Allowed:B,E

Here we see that Thr K-497, Glu E-156 are in the allowed region whereas residues in other chains are all in the favoured region. These residues need to be reviewed.

Residues Ala 383, Asp 253 are nearly always outliers. This could be a genuine outlier with good reason, or it could be a systematic error!

Similarly, let us print residues where only a small minority of chains have a different sidechain rotamer at a residue position.


In [10]:
print_unusual_residues("rota", 13)


LOG|11-Nov-2014 14:30:54|INFO  Residue [VAL  510] has multiple rota states:       t:13, OUTLIER: 1 || Unusual chains OUTLIER:B
LOG|11-Nov-2014 14:30:54|INFO  Residue [ARG   58] has multiple rota states:  mtm-85:13,  mtt180: 1 || Unusual chains mtt180:B
LOG|11-Nov-2014 14:30:54|INFO  Residue [ASP   52] has multiple rota states:     t70:13,      t0: 1 || Unusual chains t0:B
LOG|11-Nov-2014 14:30:54|INFO  Residue [LYS   75] has multiple rota states: OUTLIER:13,    tttt: 1 || Unusual chains tttt:E
LOG|11-Nov-2014 14:30:54|INFO  Residue [VAL  417] has multiple rota states:       m:13, OUTLIER: 1 || Unusual chains OUTLIER:L
LOG|11-Nov-2014 14:30:54|INFO  Residue [GLU  130] has multiple rota states:    tp10:13,   mt-10: 1 || Unusual chains mt-10:B
LOG|11-Nov-2014 14:30:54|INFO  Residue [VAL  128] has multiple rota states:       t:13,       m: 1 || Unusual chains m:B
LOG|11-Nov-2014 14:30:54|INFO  Residue [GLU  129] has multiple rota states:   mt-10:13,    tp10: 1 || Unusual chains tp10:B
LOG|11-Nov-2014 14:30:54|INFO  Residue [ARG  368] has multiple rota states:  mtm180:13,  mtp180: 1 || Unusual chains mtp180:N
LOG|11-Nov-2014 14:30:54|INFO  Residue [MET  514] has multiple rota states:     mtp:13, OUTLIER: 1 || Unusual chains OUTLIER:E
LOG|11-Nov-2014 14:30:54|INFO  Residue [ARG  231] has multiple rota states:  mtt180:13,  mtp180: 1 || Unusual chains mtp180:J
LOG|11-Nov-2014 14:30:54|INFO  Residue [ASP  490] has multiple rota states:    m-20:13,     t70: 1 || Unusual chains t70:E
LOG|11-Nov-2014 14:30:54|INFO  Residue [ASP  334] has multiple rota states:     t70:13,      t0: 1 || Unusual chains t0:D
LOG|11-Nov-2014 14:30:54|INFO  Residue [SER  424] has multiple rota states:       p:13,       m: 1 || Unusual chains m:E
LOG|11-Nov-2014 14:30:54|INFO  Residue [ARG  284] has multiple rota states:  ptt180:13, OUTLIER: 1 || Unusual chains OUTLIER:M
LOG|11-Nov-2014 14:30:54|INFO  Residue [VAL   74] has multiple rota states:       p:13,       t: 1 || Unusual chains t:B

Note residues Val B-510, Met E-514, Val L-417, Arg M-284 which have rotameric counterparts in all other chains - their sidechain most probably needs remodelling. Also note residues like Lys 75 could be outliers for a good reason, or could just be systematic outliers.

Your turn!

  • Redo this comparison for an entry of your interest. Open the entry and electron density maps in a 3D viewer and discuss your findings.
  • Extend such comparison to other kinds of geometric quality, such as bond lengths, vdW clashes, chirality, etc.
  • Extend the comparison to include same protein from multiple entries - use PDBe search service.