Validation of protein structures is a collection of checks that quantify the expectation of occurrence of features in a model. An unexpected feature is not always a modelling error, and it is sometimes argued that such feature is strained but stabilized by its neighbouring atoms. Additional evidence needs to be gathered in such tricky situation.
In addition to checking electron density itself, it would be useful to check whether similarly strained feature occurs in other models of that protein, e.g.
In general, it would be useful to create a quality profile from multiple models of a protein to see whether a residue systematically exhibits an unusual geometry, or whether the geometry is an error not observed in other models.
In this tutorial, let us see how backbone and sidechain features of residues compare across the multiple copies of GroEL in PDB entry 1kp8 determined in 2003 at resolution of 2Å.
Let us run the tutorial_utils notebook to setup API URL, logger, caller utility, etc. Check out that notebook to setup anything differently.
In [2]:
%run 'tutorial_utils.ipynb'
Now we will identify the molecule number of GroEL in entry 1kp8 using the /pdb/entry/molecules call.
In [3]:
the_pdbid = "1kp8"
mols_data = get_PDBe_API_data(PDBE_API_URL + "/pdb/entry/molecules/" + the_pdbid)[the_pdbid]
logging.info("PDB entry %s has %d types of molecules." % (the_pdbid, len(mols_data)))
for mol_info in mols_data :
if mol_info["molecule_type"] == "polypeptide(L)" :
logging.info( "Molecule no. %d, '%s', is a protein modelled in %d chains %s." % \
(mol_info["entity_id"], mol_info["molecule_name"], \
len(mol_info["in_chains"]), ",".join(mol_info["in_chains"])) \
)
So the protein of interest is molecule no. 1 and there are 14 copies of it.
Let us now obtain per-residue information on backbone (Ramachandran) and sidechain quality using the call /validation/rama_sidechain_listing/.
In [4]:
rama_data = get_PDBe_API_data(PDBE_API_URL + "/validation/rama_sidechain_listing/entry/" + the_pdbid) [the_pdbid]
Let us define a container which will hold information about Ramachandran state (outlier, allowed, favoured) and rotamer state (rotamer name or outlier) for each residue of each chain.
In [5]:
import collections
outliers = {
"rama":collections.defaultdict(lambda: collections.defaultdict(lambda:[])),
"rota":collections.defaultdict(lambda: collections.defaultdict(lambda:[])),
}
Let us populate this container with Rama and sidechain information from chains modelling molecule no. 1 in the first and only MODEL in the entry.
In [6]:
for mol in rama_data["molecules"] :
if str(mol["entity_id"]) != "1" :
continue
for chain in mol["chains"] :
model = chain["models"][0]
logging.info("Chain id %s has Rama sidechain validation for %d residues." % (chain["chain_id"], len(model["residues"])))
for residue in model["residues"] :
res_id = (residue["residue_name"],residue["author_residue_number"],residue["author_insertion_code"])
outliers["rama"] [res_id] [ residue["rama"] ].append( chain["chain_id"] )
outliers["rota"] [res_id] [ residue["rota"] ].append( chain["chain_id"] )
Now let us write a function which will find the residues that have more than one Rama or sidechain state, such that at least one of the states occurs a majority of times.
In [8]:
def print_unusual_residues(val_key, major_state_freq) :
for res_id, val_info in outliers[val_key].items() :
# not interested in all residues at this index to be in the same state
if len(val_info) == 1 : continue
# not interested in this residue index if no state occurs at least with required frequency
if not any( [len(val_info[k]) >= major_state_freq for k in val_info] ) :
continue
# write out the popular state and also those occurring less frequently
val_keys = sorted( val_info.keys(), key = lambda vk:len(val_info[vk]), reverse=True )
state_frequencies = ", ".join(["%7s"%k+":%2d" % len(val_info[k]) for k in val_keys])
minor_chains_str = " ".join(["%s:%s"%(k,",".join(val_info[k])) for k in val_keys[1:]])
logging.info("Residue [%3s %4s%s] has multiple %s states: %s || Unusual chains %s" % \
(res_id[0], res_id[1], res_id[2], val_key, state_frequencies, minor_chains_str) \
)
Let us print residues with unusual Ramachandran state that contrasts against counterparts in other chains.
In [9]:
print_unusual_residues("rama", 12)
Here we see that Thr K-497, Glu E-156 are in the allowed region whereas residues in other chains are all in the favoured region. These residues need to be reviewed.
Residues Ala 383, Asp 253 are nearly always outliers. This could be a genuine outlier with good reason, or it could be a systematic error!
Similarly, let us print residues where only a small minority of chains have a different sidechain rotamer at a residue position.
In [10]:
print_unusual_residues("rota", 13)
Note residues Val B-510, Met E-514, Val L-417, Arg M-284 which have rotameric counterparts in all other chains - their sidechain most probably needs remodelling. Also note residues like Lys 75 could be outliers for a good reason, or could just be systematic outliers.