Comparing structures

In this chapter we will study some of the more common metrics to assess the similarity between proteins structures.

When we compare proteins we could face two related problems:

  1. Protein superposition, comparing two proteins with the same sequence (or in some cases equal lenght).
  2. Protein structural alignment, comparing protein with different sequences.

In this chapter we will talk about the first problem, protein superposition. The second problem is just the combination of the ideas in this chapter and the next one.

We will also discuss the Principal Component Analysis technique in the context of the study of atoms fluctuations.


In [1]:
% matplotlib inline
import sys, pymol
from pymol import cmd, stored
from pymol.cgo import *  # We will use this to draw cones!
stdout = sys.stdout
pymol.finish_launching()
sys.stdout = stdout
from __future__ import division
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
cmd.load('1UBQ.pdb')
cmd.load('1D3Z.pdb')
cmd.remove('not polymer or hydro')
object0 = cmd.get_names()[0]
object1 = cmd.get_names()[1]


 Adjusting settings to improve performance for Intel cards.

RMSD

The root-mean-square deviation (RMSD) is probably the most common metric used to compare two protein structures. It can be used to evalute how structurally similar are two evolutivelly-related proteins or to assess how close a predicted protein structure is from the native structure or even as a reaction coordinate to check if a Molecular Dynamics simulation has reached the stationary state.

The RMSD is the measure of the average distance between the atoms (usually the $C\alpha$ or backbone atoms) of superimposed proteins. The RMSD is defined as:

$$\mathrm{RMSD}=\sqrt{\frac{1}{N}\sum_{i=1}^N\delta_i^2}$$

Where:
$\delta$ is the distance between $N$ pairs of equivalent atoms.


In [3]:
def rmsd_cur(mol0, mol1, sel='*'):
    """
    Computes the root mean square deviation from the current
    coordinates of two pairs of equivalent atoms. Does not
    perform a superposition.
    
    Parameters
    ----------
    mol0 : PyMOL object
    mol1 : PyMOL object
    sel  : PyMOL selection, atoms used to compute rmsd.
           e.g. use ca+c+n for the backbone
    """
    model0 = cmd.get_model('%s and name %s' % (mol0, sel))
    model1 = cmd.get_model('%s and name %s'  % (mol1, sel))
    xyz0 = np.array(model0.get_coord_list())
    xyz1 = np.array(model1.get_coord_list())
    
    rmsd = (np.sum((xyz0 - xyz1 )**2)/len(xyz0))**0.5
    return rmsd

In [4]:
rmsd = rmsd_cur(object0, object1, sel='ca') #'ca+c+n'
print '%.2f' % rmsd


115.06

In [5]:
rmsd = cmd.rms_cur('%s and name ca' % object0, '%s and name ca' % object1)
print '%.2f' % rmsd


115.06

Protein structural superposition

As you saw in the previous section, the RMSD is meaningless if the proteins are not superposed properly. To superpose two protein structures means to rotate and translate one of the proteins respect to the other. The standard approach to solve this problem is the Kabsch algorithm algorithm. Basically, the Kabsch algorithm is an analytical approach to minimize the RMSD of two sets of cartesian coordinates.

The algorithm works in 3 steps:

  1. Calculation of the proper translation
  2. Calculation of the covariance-matrix between both set of coordinates
  3. Calculation of the optimal rotation matrix

Translation

The optimal superposition is obtained by translating one set of coordinates so that both centroids (geometric centers) coincides. The centroid can be computed as the average of the cartesian coordinates. In practice is more convenient two translate both proteins so that their centroid coincides with the origin of the coordinate system. This is done by subtracting, from each set of coordinates, their corresponding centroids.

Computation of the covariance matrix

The second step consist of calculating a covariance matrix $A$. The covariance matrix is the generalization of the variance to multiple dimensions (remember that the variance is a special case of the covariance when the two variables are identical). The spread of a set of points in one dimensional space can be characterized by an scalar call the variance, for a 2D space a $2 \times 2$ matrix is needed for a 3D space you need a $3 \times 3$ matrix and so on. The covariance matrix of $P$ and $Q$ is:

$$A = P^T * Q$$

Computation of the optimal rotation matrix

The third step is the hardest to undertand. In order to obtain the optimal rotation matrix, $R$. It is necesary to calculate the Singular Value Decomposition (SVD) of the covariance matrix $A$. The SVD is a type of factorization (or decomposition) of a matrix into the product of other matrices.

$$A = USW^T$$

The interesting part (for us) is that the rotation matrix we are looking for is:

$$R = UW^T$$

There is one more thing to take care, $R$ could be a matrix with determinant +1 or -1. We only need the $R$ matrix with determinat +1. If we rotate our molecule using the other matrix we will end-up with the mirror-image of our structure!!!


In [6]:
def rmsd_fit(mol0, mol1, sel='*', fit=True):
    """
    Computes the root mean square deviation from two pairs of
    equivalent atoms after superposition.
    
    Parameters
    ----------
    mol0 : PyMOL object
    mol1 : PyMOL object
    sel  : PyMOL selection. atoms used to compute rmsd.
           e.g. use ca+c+n for the backbone
    fit  : bool. If false computes the rmsd after superposition, but without
           updating the coordinates
           
    """
    xyz0 = np.array(cmd.get_model('%s and name %s' % (mol0, sel)).get_coord_list())
    xyz1 = np.array(cmd.get_model('%s and name %s'  % (mol1, sel)).get_coord_list())
    
    xyz0_all = np.array(cmd.get_model('%s' % mol0).get_coord_list())
    xyz1_all = np.array(cmd.get_model('%s'  % mol1).get_coord_list())
    
    # Translation
    X = xyz0 - xyz0.mean(axis=0)
    Y = xyz1 - xyz1.mean(axis=0)
    # Covariation matrix
    Cov_matrix = np.dot(Y.T, X)
    # Optimal rotation matrix
    U, S, Wt = np.linalg.svd(Cov_matrix)
    # Create Rotation matrix R
    R = np.dot(U, Wt)
    # Ensure a right-handed coordinate system
    if np.linalg.det(R) < 0.:
        S[-1] = -S[-1]
        Wt[-1] *= -1
        R = np.dot(U, Wt) 
    if fit:
        # center the first molecule
        stored.sel0 = list(xyz0_all - xyz0.mean(axis=0))
        # rotate and translate the second molecule
        stored.sel1 = list(np.dot((xyz1_all - xyz1.mean(0)), R))
        #update the changes to the coordinates 
        cmd.alter_state(1, mol0,"(x,y,z)=stored.sel0.pop(0)")
        cmd.alter_state(1, mol1,"(x,y,z)=stored.sel1.pop(0)")

    # We compute the RMSD after superposition by using the matrix S. The advantage is 
    # we do not need to actually do the superposition before computing the RMSD.
    #rmsd = (np.exp(np.log(np.sum(X ** 2) + np.sum(Y ** 2)) - 2.0 * np.log(np.sum(S)))/len(X))**0.5
    rmsd = ((np.sum(X ** 2) + np.sum(Y ** 2) - 2.0 * np.sum(S))/len(X))**0.5
    # scales and translates the window to show a selection
    cmd.zoom()
    return rmsd

In [7]:
rmsd = rmsd_fit(object0, object1, sel='ca', fit=True)
print '%.2f' % rmsd


0.52

Caveats of the RMSD

One problem with the RMSD (and the Kabsch algorithm) is that it gives too much importance to regions that do not match well. In other words the RMSD is not robust to outliers. If two proteins are very similar, except for a small portion (say a loop) the RMSD will tend to be larger that what a visual inspection (intuition) will tell you. One way to solve this problem is to iteratively perform a superposition eliminating from the comparison (manually or automatically) portions that deviate too much.

Lets change the _rmsdfit function in order to make the superposition more robust to outliers


In [8]:
def rmsd_fit_outliers(mol0, mol1, sel='*',
              fit=True, cutoff=2.):
    """
    Computes the root mean square deviation from two pairs of
    equivalent atoms after superposition. The superposition is 
    done iterativelly until all outilers are removed. Outliers
    are defined according to the cutoff value.
    
    Parameters
    ----------
    mol0 : PyMOL object
    mol1 : PyMOL object
    sel  : PyMOL selection, atoms used to compute rmsd.
           e.g. use ca+c+n for the backbone
    fit  : bool. If false computes the rmsd after superposition, but with-out
           updating the coordinates
    cutoff: float. outlier rejection cutoff in Angstrom.
    """

    xyz0 = np.array(cmd.get_model('%s and name %s' % (mol0, sel)).get_coord_list())
    xyz1 = np.array(cmd.get_model('%s and name %s'  % (mol1, sel)).get_coord_list())
    
    xyz0_all = np.array(cmd.get_model('%s' % mol0).get_coord_list())
    xyz1_all = np.array(cmd.get_model('%s'  % mol1).get_coord_list())
    
    total_len = len(xyz0)
    while True:
        # Translation
        X = xyz0 - xyz0.mean(0)
        Y = xyz1 - xyz1.mean(0)
        # Covariation matrix
        Cov_matrix = np.dot(Y.T, X)
        # Optimal rotation matrix
        U, S, W = np.linalg.svd(Cov_matrix)
        # Create Rotation matrix R
        R = np.dot(U, W)
        # Ensure a right-handed coordinate system
        if np.linalg.det(R) < 0:
            S[-1] = -S[-1]
            W[-1] *= -1
            R = np.dot(U, W)     
            
        # rotate and translate the molecule
        Y = np.dot((xyz1 - xyz1.mean(0)), R)
        # center the molecule
        X = xyz0 - xyz0.mean(0)
  
        # compute the distaces between equivalent pair of atoms
        d = np.sqrt(np.sum((X - Y) ** 2, 1))
        xyz0 = xyz0[d < cutoff]
        xyz1 = xyz1[d < cutoff]
        
        super_res = len(xyz0)
        if  super_res == total_len:
            break
        else:
            total_len = super_res
        
    if fit:
        # center the first molecule
        stored.sel0 = list(xyz0_all - xyz0.mean(axis=0))
        # rotate and translate the second molecule
        stored.sel1 = list(np.dot((xyz1_all - xyz1.mean(0)), R))
        #update the changes to the coordinates 
        cmd.alter_state(1, mol0,"(x,y,z)=stored.sel0.pop(0)")
        cmd.alter_state(1, mol1,"(x,y,z)=stored.sel1.pop(0)")

    
    rmsd = ((np.sum(X ** 2) + np.sum(Y ** 2) - 2 * np.sum(S)) / len(X))**0.5
    # scales and translates the window to show a selection
    cmd.zoom()
    return rmsd, super_res

In [9]:
rmsd, super_res = rmsd_fit_outliers(object0, object1, cutoff=1.5, sel='ca', fit=True)
print '%.2f and %s residues superimposed' % (rmsd, super_res)


0.40 and 73 residues superimposed

The PyMOL's command super gives access to a sequence-independet structural alignment method (similar to _rmsd_fitoutliers) that can automatically remove pairs of residues that do not align well. The output of super is:

  • RMSD after refinement
  • Number of aligned atoms after refinement
  • Number of refinement cycles
  • RMSD before refinement
  • Number of aligned atoms before refinement
  • Raw alignment score
  • Number of residues aligned

In [10]:
cmd.super('%s and name ca' % object0, '%s  and name ca' % object1, cutoff=2)


Out[10]:
(0.3532576858997345, 71, 3, 0.5214025974273682, 76, 366.1607666015625, 76)

Another approach is to use the Kabsch algorithm (i.e. the minimization of the RMSD) to find a good structural superposition and then evaluate the result with a different metric (one that is more robust to outliers). The process is iterated until some convergence criteria is meet. Such aproach is used by the TM-score algorithm.

Template Modeling score (TM-score)

The TM-score is an algorithm to calculate the structural similarity of two protein models. It is often used to quantitatively assess the accuracy of protein structure predictions relative to the experimental structure.

The TM-score has the following advantages over the RMSD:

  • It is robust to outliers, because weights the closer atom pairs stronger than the distant ones
  • The TM-scores values are independent of the protein lenght.
  • TM-score has the value in (0,1]. Based on statistics, a TM-score $\lt 0.17$ corresponds to a random similarity and a TM-score $\gt 0.5$ generally corresponds to the same fold in SCOP/CATH.

The TM-score is defined as:

\begin{equation} TM\text{-}score = max\left [ \frac{1}{L} \sum_{i}^{n} \frac{1}{1+(\frac{d_i}{d_0})^2} \right ] \end{equation}

where:
$n$ is the lenght of the protein
$L$ is the lenght of superimposed region (by default equal to n)
$d_i$ is the distance between the $i^{th}$ pair of residues
$d_0 = 1.24 \sqrt[3]{L-15} - 1.8$
$Max$ denotes the maximum value after optimal spatial superposition.


In [11]:
def tm_score(mol0, mol1, sel='*'): #Check if TM-align use all atoms!
    """
    Compute TM-score between two set of coordinates
    
    Parameters
    ----------
    mol0 : PyMOL object
    mol1 : PyMOL object
    sel  : PyMOL selection, atoms used to compute rmsd.
           e.g. use ca+c+n for the backbone
    """
    xyz0 = np.array(cmd.get_model('%s and name %s' % (mol0, sel)).get_coord_list())
    xyz1 = np.array(cmd.get_model('%s and name %s'  % (mol1, sel)).get_coord_list())
    
    L = len(xyz0)
    # d0 is less than 0.5 for L < 22 
    # and nan for L < 15 (root of a negative number)
    d0 = 1.24 * np.power(L - 15, 1/3) - 1.8
    d0 = max(0.5, d0) 

    # compute the distance for each pair of atoms
    di = np.sum((xyz0 - xyz1) ** 2, 1) # sum along first axis
    return np.sum(1 / (1 + (di / d0) ** 2)) / L

In [12]:
score = tm_score(object0, object1, sel='ca')
print '%.4f' % score


0.9686

Global Distance Test (GDT)

As the TM-score, this metric was invented as a robust alternative to the RMSD. It is most commonly used duting the Critical Assessment of Structure Prediction (CASP) to compare the results of protein structure prediction to experimentally determined structures.

The GDT score is calculated as the average of the number of $C\alpha$ atoms bellow a defined distance cutoff (after superposition):

  • The GDT_TS (total score) uses the cutoff values, 1, 2, 4 and 8.
  • The GDT_HA (High Accuracy) uses the cutoff values, 0.5, 1, 2 and 4.

Bayesian statistic and robust superposition

A better way to solve the superposition problem, one that is theoretically-well founded, is to use Bayesian statistic. This approach is equivalent to use a weighted RMSD i.e. an RMSD where the pairs that match well have a higher contribution than those that do not match well. In this approach the weights are inferred automatically from the data. Understanding this method will require and introduction to Bayesian statistic, hence we are not going to study it here. But you can read more about this here

Principal Component Analysis

Principal component analysis (PCA) is widely used statistical procedure to reduce the dimensionality of a data-set, without loosing too much information. In this context each measured variable is called a dimension. The idea is to project our high-dimensional data into a a lower space in the most informative way. Within PCA this is achieved in roughly two steps:

  • Find a new coordinate system to represent the data:

    The new axes are called principal components. The principal components are a linear combination of the original axes. Principal components are build in such a way that they capture the maximum variance of the data (with the restriction that they must be orthogonal).

  • Use only the first few principal components to represent, analyze and visualize the data.

    In many practical applications just a few principal components can be used to explain most of the variability contained in the data, effectively reducing the dimensionality of the data-set

To start gaining intuition into how PCA works, lets see the following trivial example when we have two variables perfectly correlated. We could find a new coordinate system (by performing a rotation) that renders one of the coordinates superfluous. If you look at right figure it becomes obvious that the new $x$ axes explain all the variation in our data.


In [13]:
x0 = [1, 2, 3, 4, 5]
y0 = x0

x1 = [1, 2, 3, 4, 5]
y1 = [3] * 5

plt.figure(figsize=(8,4))
plt.subplot(1,2,1)
plt.scatter(x0, y0);
plt.subplot(1,2,2)
plt.scatter(x1, y1);


Of course real data does not look like the toy-data from the previous example. However, PCA works in practice because real data-sets often have an internal structure i.e the variables/dimensions are correlated (at least to some degree) or variables can be expressed as a linear combination of other variables. Thus, PCA is a way to convert a set of (possibly) correlated variables into a set of linearly uncorrelated variables (the principal components).

Notice that while in your original data-set your variables have a meaning (at least to you and hopefully to your public!) when PCA is performed the new axes (the principal components) are the result of a linear combination of the original variables, hence in general, they will be meaningless.


In [16]:
x = np.array([[1, 2, 3, 4, 5]])
y = x * 2
z = x * 4
xy = np.vstack((x, y, z))

cov_mat = np.dot(xy.T, xy)
eig_vals, eig_vecs = np.linalg.eigh(cov_mat)
p = np.abs(eig_vals).argsort()[::-1]
eig_vals = eig_vals[p]
eig_vecs = eig_vecs[p]

PC1 = np.dot(xy, eig_vecs[:,0])
PC2 = np.dot(xy, eig_vecs[:,1])

plt.scatter(PC1, PC2)
plt.xlabel('PC1')
plt.ylabel('PC2');


PCA in Structural Bioinformatics

PCA is routinely used in exploratory data analysis to make sense of complex data-sets. In structural bioinformatics one common application is the analysis of biomolecular structural ensembles specially in molecular dynamics simulations, but also NMR ensembles or even artificial ensembles made by combining several structures of the same biomolecule, but solved independently under different experimental conditions and/or techniques (like X-ray).

PCA has found a niche in the analysis of molecular dynamics, because of its simplicity and because bio-molecular systems are complex (and more complex when they are moving!). Interestingly, it has been found that the vast majority of protein dynamics can be described by a surprisingly low number of collective degrees of freedom.

PCA rests on the assumption that the major collective modes of fluctuation dominate the dynamics (and function of molecules) this assumptions is shared with a theoretical method to generate conformational fluctuations called Normal Mode Analysis NMA.

To compute the PCA, first we will upload a PDB file with more than one conformation. To perform a PCA analysis is important to perform an structural superposition. In general all conformations are superposed onto the first one, but this choice is irrelevant. As with the Kabsh algorithm, the only thing that matters is to remove the rotations and translations and just keep the internal motions.


In [17]:
cmd.delete('all')
#cmd.load('2K39.pdb')
cmd.load('1D3Z_full.pdb')
cmd.remove('not polymer')
# Sometimes is a better idea to remove portions with high flexibility
# otherwise these portions could hide more subtle changes.
#sel = 'name ca and resi 1-70'
sel = 'name ca'
# superimpose all conformations
cmd.intra_fit(sel);
states = cmd.count_states('all')
residues = cmd.count_atoms(sel)
xyz = np.zeros((states, 3*residues))
for state in range(0, states):
    model = cmd.get_model(sel, state=state+1)
    xyz[state] = np.array(np.ravel(model.get_coord_list()))
np.shape(xyz)


Out[17]:
(10, 228)

The PCA algorithm

  1. Translate all conformations to the origin of coordinates
  2. Build the Covariance Matrix
  3. Compute the eigenvectors and eigenvalues
  4. Sort the eigenvectors according to the eigenvalues

Notice that like in the Kabsh algorithm, we need to compute a covariance matrix. The only difference is that now we need to compute the average covariance matrix (along the $N$ conformations). Here we are estimating the covariance matrix of the population from a sample, hence we do not divide by $N$, instead we divide by $N-1$ (if you want to find out why the $-1$ you should read more about the variance). Nevertheless notice that in general you should be performing a PCA from a large number of conformations and when you increase $N$, the difference between $N$ and $N-1$, becomes more and more negligible.

In this case the eigenvectors represent a correlated displacement of a groups of atoms and eigenvalues represent the magnitude of this displacement ($\mathring A^{2}$). We sort the eigenvalues (and their corresponding eigenvectors) because we want the eigenvectors and eigenvalues that explain most of the fluctuations.


In [18]:
def PCA(xyz):
    """
    Compute a Principal Component Analysis from a set 
    of coordinates
    
    Parameters
    ----------
    xyz : Array. Cartesian coordinates
    """
    
    # Translate to the origin of coordinates
    xyz_mean = np.mean(xyz, axis=0)
    xyz_c = xyz - xyz_mean

    # compute the covariance_matrix
    cov_mat = np.dot(xyz_c.T, xyz_c)/(np.shape(xyz_c)[0]-1)
    # Compute the eigenvectors and eigenvalues
    eig_vals, eig_vecs = np.linalg.eigh(cov_mat)
    
    p = np.abs(eig_vals).argsort()[::-1]
    eig_vals = eig_vals[p]
    eig_vecs = eig_vecs[:,p]
    return eig_vals, eig_vecs

In [19]:
eig_vals, eig_vecs = PCA(xyz)

Now we have to visualize and explore the principal components we have just obtained.


In [20]:
plt.figure(figsize=(9,4.5))
plt.subplot(121)
plt.plot(eig_vals)
plt.xlabel('modes')
plt.ylabel('eigenvalues $\AA^2$')
plt.subplot(122)
plt.plot(eig_vals)
plt.xlabel('modes')
plt.ylabel('eigenvalues $\AA^2$')
plt.xlim(0,10)


Out[20]:
(0, 10)

In [21]:
var_exp = np.cumsum(eig_vals/np.sum(eig_vals)*100)
plt.plot(var_exp)
plt.xlim(0,10)
plt.ylim(0,100)
plt.xlabel('modes')
plt.ylabel('Variance explained (%)');


We can compute how much each mode contributes to the rmsd atomic fluctuations


In [22]:
rmsds = []
for i in range(3):
    mod = eig_vecs[:,i].reshape(-1,3)
    rmsd = (np.sum((mod)**2, axis=1)*eig_vals[i])**0.5
    rmsds.append(rmsd)

In [28]:
plt.plot(rmsds[0], label='mode 1')
plt.plot(rmsds[1], label='mode 2')
plt.plot(rmsds[2], label='mode 3')
plt.xlabel('Residues')
plt.ylabel('RMSD')
plt.legend(loc=0);


Sometimes clustering the structures in the PC space can be useful to understand how different conformations are related to each other. In many cases this clustering method can reveal functional relationships that are harder to find by methods like RMSD clustering.


In [27]:
PC1 = np.dot(xyz, eig_vecs[:,0])
PC2 = np.dot(xyz, eig_vecs[:,1])
plt.scatter(PC1, PC2)
plt.xlabel('PC1')
plt.ylabel('PC2');


Another useful and aesthetically appealing way to visualize PCA in biomolecular systems is plotting them directly on the structural model in what is sometimes call a porcupine plot.


In [24]:
def porcupine(xyz, eig_vals, eig_vecs, mode=0, state=1):
    """
    Genereta a porcupine plot to visualize the result of PCA
    onto a 3D structural model
    
    Parameters
    ----------
    xyz : Array. Cartesian coordinates
    eig_vals : Array. Eigenvalues from a PCA
    eig_vecs : Array. Eigenvector from a PCA
    mode : int. Principal Component that will be plotted
    state : int. State onto wich the PC will be plotted
    """
    xyz_mean = np.mean(xyz, axis=0)
    coord0 = np.reshape((xyz[state-1]), (-1,3))
    coord1 = np.reshape(eig_vecs[:,mode]*eig_vals[mode]+xyz_mean, (-1,3))
    cmd.delete('arrows')
    r0, g0, b0 = 0, 0, 1
    r1, g1, b1 = 1, 0, 0
    rd0 = 0.5
    rd1 = 0.05
    obj = []
    for i in range(len(coord0)):
        x0, y0, z0 = coord0[i]
        x1, y1, z1 = coord1[i]
        obj.extend([CONE, x0, y0, z0, x1, y1, z1, rd0, rd1, \
                    r0, g0, b0, r1, g1, b1, 1.0, 0.0 ])
    cmd.load_cgo(obj,'arrows')

In [25]:
cmd.hide('all')
cmd.show('cartoon')
porcupine(xyz, eig_vals, eig_vecs, mode=0)

Further reading

Kabsch, W. (1976). A solution for the best rotation to relate two sets of vectors. Acta. Crystal, 32A:922-923.
Kabsch, W. (1978). A discussion of the solution for the best rotation to related two sets of vectors. Acta. Crystal, 34A:827-828.
RMSD Bosco. A nice post about the Kabsch algorithm.
Y. Zhang & J. Skolnick 2004. Scoring function for automated assessment of protein structure template quality.
J. Xu & Y. Zhang 2010. How significant is a protein structure similarity with TM-score=0.5?
Protein Dynamics Methods and Protocols Experimental and computational state of the art methods for characterizing protein dynamics.
CSB. A python framework for building applications in the domain of structural bioinformatics.
ProDY Protein Dynamics and sequence analysis.
PCA in 3 steps. Blog/IPython-notebook article by Sebastian Raschka.