Distances, angles and geometry

In this chapter we will begin to explore some of the most common structural bioinformatic algorithms and we will apply some of the concepts from previous chapters.

The goal of structural bioinformatics is to understand biomolecules, their properties,functions, dynamics, evolutionary history etc... in terms of structural/geometrical features. In general, structural bioinformatics borrows the idea from classical mechanics that all you need to determinate the state of a system are the coordinates and velocities of its particles. As we will not study dynamical properties of biomolecules in this course we will restrict ourselves to the manipulation of coordinates of static biomolecular models.

In this chapter we will use several Python libraries, including the PyMOL API. So, lets start importing everything we need.


In [1]:
%matplotlib inline
# The next five lines are needed to use the PyMOL API
# all the stuff with the "sys" is just to send the pymol
# standard-output (stdout) to the notebook.
import sys, pymol
from pymol import cmd, stored
stdout = sys.stdout
pymol.finish_launching()
sys.stdout = stdout

from __future__ import division  
import numpy as np
np.set_printoptions(precision=3)
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # 3D plots
import seaborn as sns
from scipy.spatial.distance import cdist  # import cdist function from scipy


 Adjusting settings to improve performance for Intel cards.

The PyMOL API

PyMOL provieds a lot of handy functions to works with molecules, all the things that can be done using the PyMOL GUI (graphical user interface) can be done with its API (and even a couple more!).

The first think we will do is to upload a molecule into PyMOL, we can do that with the load command. Then, we will remove everything that is not a polymer with the remove comand, in this case we are removing oxygen atoms that belong to water molecules. All PyMOL commands start with cmd.


In [2]:
cmd.load('1UBQ.pdb')
cmd.remove('not polymer')

Next, we will create a NumPy array containg the cartesian coordinates for all the C$\alpha$ atoms from the uploaded molecule. For convenience we will use the following PyMOL functions:


In [3]:
model = cmd.get_model('all')
xyz = np.array(model.get_coord_list())
model_ca = cmd.get_model('name ca')
xyz_ca = np.array(model_ca.get_coord_list())
xyz[:3]


Out[3]:
array([[ 27.34 ,  24.43 ,   2.614],
       [ 26.266,  25.413,   2.842],
       [ 26.913,  26.639,   3.531]])

In [4]:
xyz.shape


Out[4]:
(602, 3)

We can use the Linux command grep to check that we are extracting the correct coordinates. We use the Linux command head to print just the very few lines.


In [5]:
!grep 'ATOM  ' 1UBQ.pdb | head


ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67           N  
ATOM      2  CA  MET A   1      26.266  25.413   2.842  1.00 10.38           C  
ATOM      3  C   MET A   1      26.913  26.639   3.531  1.00  9.62           C  
ATOM      4  O   MET A   1      27.886  26.463   4.263  1.00  9.62           O  
ATOM      5  CB  MET A   1      25.112  24.880   3.649  1.00 13.77           C  
ATOM      6  CG  MET A   1      25.353  24.860   5.134  1.00 16.29           C  
ATOM      7  SD  MET A   1      23.930  23.959   5.904  1.00 17.17           S  
ATOM      8  CE  MET A   1      24.447  23.984   7.620  1.00 16.11           C  
ATOM      9  N   GLN A   2      26.335  27.770   3.258  1.00  9.27           N  
ATOM     10  CA  GLN A   2      26.850  29.021   3.898  1.00  9.07           C  

Distances

The euclidian distance between two points in the 3D-space, given the points ($x_0, y_0, z_0$) and ($x_1, y_1, z_1$) is:

$$d=\sqrt{(\Delta x)^2+(\Delta y)^2+(\Delta z)^2}=\sqrt{(x_1-x_0)^2+(y_1-y_0)^2+(z_1-z_0)^2}$$

This formula is just a generalization of the Pythagorean theorem. Other distances exist but we will not talk about those in this course. Using NumPy we can compute the distance between two points as follows.


In [6]:
def distance(coord, a, b):
    """Compute the distance between two points"""
    dist = (np.sum((coord[a]-coord[b])**2))**0.5
    return dist

The distance bewteen the first and second C$\alpha$ is:


In [7]:
d = distance(xyz, 1, 9)
print 'Distance: %.3f' % d


Distance: 3.804

We can get the same result using the following PyMOL function:


In [8]:
d = cmd.get_distance('index 2', 'index 10')
#d = cmd.get_distance('res 1 and name ca', 'res 2 and name ca')
print 'Distance: %.3f' % d


Distance: 3.804

In [9]:
def contact_map(coord, cut_off=None):
    """
    A function to compute contact maps.
    
    Parameters
    ----------
    coord : array. 
        cartesian coordinates
    cut_off: float. 
        contact distance
    
    Returns
    -------
    arrays of zeros and ones.
    
    """
    lenght = len(coord)
    d_matrix = np.zeros((lenght, lenght))
    for row in range(lenght):
        for col in range(lenght):
            dist = distance(coord, row, col)
            if dist <= cut_off:
                d_matrix[row, col] = 1
    return d_matrix

Contact maps

A protein contact map is a matrix $\mathbf{M}$ representing the contacts between aminoacidic residue pairs in a three-dimensional protein structure. Two residues are in contact if their distance is below a certain cutoff $c$:

$$\mathbf{M}_{ij} = \begin{cases} 1 & \delta_{ij} \leq c \\ 0 & otherwise \end{cases}$$

From contacts maps it is possible to identify secondary structure elements:

  • Helices appear directly adjacent to the diagonal.
  • Parallel beta sheets are parallel to the diagonal.
  • Antiparallel beta sheets appear as strips perpendicular to the diagonal

Contacts map are 2D models that preserve important features of the 3D structures they represents. As it seems that it could be easy to predict this 2D reduced models than a whole 3D structure a lot of effort have been put (relatively recently) to predict contact maps from sequences (in general multiple sequence alignment). The idea is that if we study a lot of proteins we could detect residues that co-evolve. Of course, two residues could have co-evolved because they are functionally related even when the are not structurally related (they are no in contact). Anyway, several methods to predict contact maps have been proposed and in general it is possible to obtain very good result if a lot of sequences are available. Interestingly, it has been show that reconstructing a 3D protein from a contact map is NP-Hard. Heuristics methods to perform this task have been proposed and also a lot of research is going-on to solve this problem.

The HB-plot is a representation related to the contact map, the difference is that HB-plots are restricted to residues forming hydrogens bonds. This type of plot can also be very helpful to study proteins.


In [10]:
cm = contact_map(xyz_ca, cut_off=9.)
plt.imshow(cm, interpolation='none')
plt.grid('off')


Lets see how we can easily optimize the _contactmap function using NumPy and SciPy.


In [11]:
def contact_map_vectorized(coord, cut_off=None):
    """
    A function to compute contact maps.
    
    Parameters
    ----------
    coord : array. 
        cartesian coordinates
    cut_off: float. 
        contact distance
    
    Returns
    -------
    arrays of booleans.
    
    """
    dist = cdist(coord, coord, metric='euclidean')
    return dist <= cut_off

In [12]:
cm = contact_map_vectorized(xyz_ca, cut_off=9.)
plt.imshow(cm, interpolation='none')
plt.grid('off')


The optimized function is shorter and without any loop and even more important is about 950 times faster, as we can easily measure using the cell magic timeit.


In [13]:
%%timeit
cm = contact_map(xyz_ca, cut_off=9.)


10 loops, best of 3: 86.7 ms per loop

In [14]:
%%timeit
cm = contact_map_vectorized(xyz_ca, cut_off=9.)


10000 loops, best of 3: 91.4 µs per loop

Angles

An angle is the figure formed by two lines, sharing a common endpoint (vertex).

Consider two atoms a and c, both bonded to a third atom b.

The angle formed by the three atoms a, b and c can be calculated as:

$$\varphi_{abc}=\arccos \left(\frac{\mathbf{ba} \cdot \mathbf{cb}}{|\mathbf{ba}| |\mathbf{cb}|}\right)$$

where:
$\mathbf{ba}$ is the vector (bond) conecting the a and b atoms
$\mathbf{bc}$ is the vector (bond) conecting the c and b atoms


In [15]:
def get_angle(coord, a, b, c):
    # Compute 2 vectors conecting the three points (bonds)
    ba = coord[b]-coord[a]
    bc = coord[b]-coord[c]

    #Measure the angle between those vectors
    ba_mod = np.sqrt(np.sum((ba) ** 2))
    bc_mod = np.sqrt(np.sum((bc) ** 2))
    angle_rad = np.arccos(np.dot(ba, bc) /
                             (ba_mod * bc_mod))
    
    # convert from radians to degree
    angle_degree = angle_rad * 180 / np.pi
    return angle_degree

Lets use the previous defined function to compute the angle defined by the atoms ($N-C_{\alpha}-C$) from the first residue.


In [16]:
angle = get_angle(xyz, 0, 1, 2) 
print 'Torsional angle: %.2f' % angle


Torsional angle: 107.01

We can obtain the same result using the PyMOL API command _get_angle_


In [17]:
angle = cmd.get_angle('index 1', 'index 2', 'index 3') # pymol count from 1 
print 'Torsional angle: %.2f' % angle


Torsional angle: 107.01

Torsional angles

A torsional angle, or dihedral, is defined as the angle between two planes.

How to measure the angle between two planes? Well, as we saw in the previous section we could define an angle in terms of two lines. Hence one solution to measure the angle between planes is to define two lines that are related to those planes. The intuitive approach will be to define the lines perpendicular to the line where the planes intersect and that are lying within the planes. The problem with this idea is that the mathematics involved are somehow complicated. An easier approach is the following:

  • Compute the 3 vectors, $\mathbf{ba}$, $\mathbf{cb}$ and $\mathbf{dc}$ connecting the four atoms (these corresponds to the bonds between those atoms), the first and second bond define the $\mathbf{A}$ plane, the second and third define the $\mathbf{B}$ plane (the translucid grey rectangles in the below figure).
  • Compute the normal vector for each of plane (black arrow)
  • Measure the angle between those normal vectors
$$\varphi_{AB}=\arccos \left(\frac{\mathbf{u_A} \cdot \mathbf{u_B}}{|\mathbf{u_A}| |\mathbf{u_B}|}\right)$$

were: $\mathbf{u_A}$ is the normal vector to plane $\mathbf{A}$ $\mathbf{u_B}$ is the normal vector to plane $\mathbf{B}$

  • Convert from radians to degree
  • Compute the sign of the vector

By convention, torsionals angle are defined in the range [-180$^\circ$, 180$^\circ$]. But the result of our calculations are in the range [0, 180$^\circ$]. Then, we need to find out the correct sign

Visually we can find the sign by looking along the bond defined by the second and third atoms, the second atom should be eclipsing the third and the four atom should be to the left. If the first atom above the plane $\mathbf{B}$ then we have a positive angle, otherwise we have a negative one.

As you may guess the above algorithm does not translate easily to mathematics. Instead we can just compute a dot product between the unit vector $\mathbf{u_A}$ and the vector connecting the last two atoms $\mathbf{dc}$. If the result is positive we have a positive angle otherwise we have a negative one.


In [18]:
def get_torsional(a, b, c, d):
    """Compute the torsional angle given four points"""
    # Compute 3 vectors conecting the four points
    ba = xyz[b]-xyz[a]
    cb = xyz[c]-xyz[b]
    dc = xyz[d]-xyz[c]

    # Compute the normal vector to each plane
    u_A = np.cross(ba, cb)
    u_B = np.cross(cb, dc)

    #Measure the angle between the two normal vectors
    u_A_mod = np.sqrt(np.sum((u_A) ** 2))
    u_B_mod = np.sqrt(np.sum((u_B) ** 2))
    tor_rad = np.arccos(np.dot(u_A, u_B) /
                             (u_A_mod * u_B_mod))
    
    # convert from radians to degrees
    tor_degree = tor_rad * 180 / np.pi
    
    # compute the sign
    sign = np.dot(u_A, dc)
    if sign > 0:
        return tor_degree
    else:
        return -tor_degree

Let use the previous defined function and compute the $\psi$ torsional angle ($N_i-C_{\alpha i}-C_i-N_{i+1}$) from the first residue.


In [19]:
tor = get_torsional(0, 1, 2, 8) 
print 'Torsional angle: %.2f' % tor


Torsional angle: 149.63

We can check the above result using the PyMOL function _get_dihedral_


In [20]:
tor = cmd.get_dihedral('index 1', 'index 2', 'index 3', 'index 9')
print 'Torsional angle: %.2f' % tor


Torsional angle: 149.63

Ramachandran plot

A Ramachandran plot is scatter plot of the dihedral angles $\phi$ and $\psi$ of amino acid residues in protein structures (for other molecules we could also have ramachandran-like plots). The first ramachandran plot was computed by hand and include only steric repulsive forces. Now we have access to high-quality experimentally determined Ramachandran plots like the one compiled by Dumbrack's lab.

A Ramachandran plot can be used to show the theoretical $\phi$ and $\psi$ allowed angles (as in the original Ramachandran plot). Alternatively, it can be used to show the empirical distribution of $\phi$ and $\psi$ angles, for example during structure validation.

Combining the PyMol _getdihedral dihedral function and the PyMOL selection algebra is straightforward to compute a Ramachandran map.


In [21]:
def get_residues_num():
    stored.ResiduesNumber = []
    cmd.iterate('(name ca)', 'stored.ResiduesNumber.append(resi)')
    first = int(stored.ResiduesNumber[0])
    last = int(stored.ResiduesNumber[-1])
    return first, last

def get_phi(res_num):
    if res_num != first:
        cmd.select('A', 'resi %s and name C' % (res_num-1))
        cmd.select('B', 'resi %s and name N' % res_num)
        cmd.select('C', 'resi %s and name CA' % res_num)
        cmd.select('D', 'resi %s and name C' % res_num)
        return cmd.get_dihedral('A', 'B', 'C', 'D')
    else:
        return float('nan')


def get_psi(res_num):
    if res_num != last:
        cmd.select('A', 'resi %s and name N' % res_num)
        cmd.select('B', 'resi %s and name CA' % res_num)
        cmd.select('C', 'resi %s and name C' % res_num)
        cmd.select('D', 'resi %s and name N' % (res_num+1))
        psi = cmd.get_dihedral('A', 'B', 'C', 'D')
        return psi
    else:
        return float('nan')

In [22]:
first, last = get_residues_num()

phi = []
psi = []
for i in range(first, last+1):
    phi.append(get_phi(i))
    psi.append(get_psi(i))

plt.figure(figsize=(6, 6))
plt.scatter(phi, psi)
plt.xlabel('$\phi$', fontsize=16)
plt.ylabel('$\psi$', fontsize=16, rotation=0)
plt.xlim(-180, 180) 
plt.ylim(-180, 180);


Radius of Gyration

The radius of gyration is used to describe the dimensions of a polymeric chain and is defined as:

$$Rg = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (r_i - r_{mean})^{2}}$$

where:
$r_{mean}$ is the geometric center of the polymer. $$r_{mean} = \frac 1N \sum_{i=1}^N r_i$$

There is an empirical relationship between the radius of gyration and the number $N$ of aminoacids in a protein sequence.

$$Rg_{theo} = 0.395*N^{\frac{3}{5}} + 7.257$$

The radius of gyration is also related to a theoretical and experimental measure called hydrodynamic radius.

Sometimes the radius of gyration is used as part of a penalization term during simulation of compact molecules to avoid spending computational resources in simulating unfolded states. In general a quadratic error term is used when the theoretical and the computed radius of gyrations differ.


In [24]:
xyz_mean = np.average(xyz, axis=0)
rg = np.sqrt(np.sum((xyz - xyz_mean)**2)/len(xyz))
print 'Radius of gyration: %.2f' % rg

cmd.pseudoatom('sphere_rg', pos=tuple(xyz_mean), vdw=rg, color='blue')
cmd.set('sphere_transparency', 0.5)
cmd.set('sphere_quality', 2)
cmd.hide('everything')
cmd.show('cartoon')
cmd.show("spheres", 'sphere_rg')


Radius of gyration: 11.73

Sometimes the radius of gyration is defined as a mass-weighted average:

$$Rg = \sqrt{\frac{1}{M} \sum_{i=1}^{N} m_i (r_i - r_{com})^{2}}$$

where:
$r_{com}$ is the center of mass of the polymer. $$r_{com} = \frac 1M \sum_{i=1}^N m_i r_i$$

$m_i$ is the mass of the $i^{th}$ atom
$M$ is the total mass of the system

Since hydrogens are not always present in the pdb files, in general the radius of gyration is computed ignoring hydrogens atoms.

Before continue we are going to remove the sphere. We will do this to avoid confusions in the next sections. The sphere is a pseudo atom with coordinates and properties similar to real atoms. Hence, if we ask PyMOL for coordinates or properties of our molecule the coordinates and properties of the sphere will be included.


In [25]:
cmd.remove('sphere_rg')

In [26]:
model_atom = model.atom
mass = np.array([i.get_mass() for i in model_atom]).reshape(len(model_atom), 1)
M = np.sum(mass)

xyz_com = np.sum(xyz * mass, axis=0) / M
rg = (np.sum(mass* (xyz - xyz_com)**2) / M)**0.5

print 'Radius of gyration: %.2f' % rg


Radius of gyration: 11.79

In [27]:
cmd.remove('sphere_rg')

Hydrogens Bonds

An hydrogen bond is the atractive force (a dipole-dipole attraction) between an hydrogen attached to a highly electronegative atom (such as nitrogen or oxygen) and other electronegative atom. Hydrogen bonds interactions are stronger than van der Waals and weaker than covalent bonds.

The are several ways to compute and hydrogen bond from the cartesian coordinates of a molecule, probably the simplest one is to define it using 1 angle and 1 distance (other definitions involve more than one angle and/or distance). In this context an hydrogen bond is formed between an atom with an hydrogen (H) bonded to it (the donor, D) and another atom (the acceptor, A) if:

  • The distance D-A is less than a cut-off distance (usually 3.5 $\mathring A$).
  • The angle D-H-A is greater than a cut-off angle (usually 90$^\circ$).

Using this criteria, an optimal hydrogen bond will be one with a distance D-A of ~2.2 $\mathring A$ and with a value of 180$^\circ$ for the D-H-A angle.

Under PyMOL is possible to measure hydrogen bonds using the _findpairs command. When using this command is important to be sure to provide the appropriate atom selections, because _findpairs only checks atom orientation, not atom type. The command returns a list of tuples with the name of the object and the index of the atom for each pair of atoms that satisfied the selected criteria.


In [28]:
donor = 'res 1 and name N'
aceptor =  'res 17 and name O'
cmd.find_pairs(donor, aceptor, mode=1, cutoff=3.5, angle=40)


Out[28]:
[(('1UBQ', 1), ('1UBQ', 130))]

Dehydrons

A dehydron is a protein backbone hydrogen bond incompletely shielded from water attack. A desolvated hydrogen bond is energetically more favourable than one exposed to the solvent and hence dehydrons are sticky , since they promote the removal of surrounding water through protein associations or ligand binding.

Dehydrons are less conserved than other structural motifs, hence identification of dehydrons could help to increase specificity during the rational drug design process. Certain proteins are enriched in dehydrons such as membrane proteins, toxic proteins and proteins that have a strong tendency to aggregate.

A putative dehydron can be detected by counting the number of wrappers that surround a hydrogen bond. A wrapper is defined as a carbon atom not bonded directly to an oxygen or nitrogen atom, i.e. a non-polar carbon atom. A dehydron is defined by the number of wrappers inside two overlapping spheres centred at the C$\alpha$ carbon of the donor and acceptor residues. If the number of wrappers around an hydrogen bond is below a certain cut-off value that hydrogen bond is identified as a dehydron.

One way to define that cut-off is by analysing a large number of high quality X-ray proteins and measuring the number of wrappers per hydrogen bond. The next figure is the result of such analysis.

The frequency of wrappers in this set of proteins approximate a Gaussian distribution with a mean of ~27 and a standard deviation of ~8, hence an hydrongen bond with less than 19 wrappers is defined as a dehydron (27-8 = 19). In the same fashion and over-wrapped hydrogen bond is and hydrogen bond with more than 35 wrappers (27+8).

A convenient way to estimate dehydrons is by using the wrappy plugin.

Computing dehydrons using PyMOL is relativelly easy thanks to the Selection Algebra features available in PyMOL.

The following code is a slightly different version from the one in the plugin.


In [29]:
def dehydron(selection='all', angle_range=40, 
             max_distance=3.5, desolv=6.5, min_wrappers=19):
    
    # obtain a valid name for the selection
    DH_name = cmd.get_legal_name('DH_%s' % selection)
    # avoid appeding results
    cmd.delete(DH_name) 

    # Ensures that your selection excludes ligands and small molecules    
    selection_hb = '((%s) and polymer)' % (selection)
    # Get pairs of hydrogen bonds. 
    #"Byres" expands te selection to complete residues.
    hb = cmd.find_pairs("((byres %s) and name n)" % selection_hb, 
                        "((byres %s) and name o)" % selection_hb,
                        mode=1, cutoff=max_distance, angle=angle_range)
    #Sort the hydrogen bonds
    hb.sort()
    print "--------------------------------------------------------------------"
    print "--------------------------Dehydron Results--------------------------"
    print "--------------------------------------------------------------------"
    print "            Donor             |            Aceptor           |"
    print "     Object   Chain Residue   |     Object   Chain Residue   | # wrappers"

    cmd.select('_nonpolar', '(elem C) and not (solvent or (elem N+O) extend 1)')
    cmd.select('_selection', '%s' % selection)

    for pairs in hb:
        # Count the number of wrappers around the ca (byca) for each hb
        wrappers = cmd.count_atoms(
'((%s and _nonpolar and _selection) within %f of byca (%s`%d %s`%d))' % 
                ((pairs[0][0], desolv) + pairs[0] + pairs[1]))
        if wrappers < min_wrappers:
            space = {'nitro':[], 'oxy':[]}
            cmd.distance(DH_name, pairs[0], pairs[1]) # plot hb
            cmd.iterate(pairs[0], 'nitro.append((chain, resi, resn))', space=space)
            cmd.iterate(pairs[1], 'oxy.append((chain, resi, resn))', space=space)
            print ' %12s%4s%6s%6s | %12s%4s%6s%6s |%7s' % (pairs[0][0],
                                                        space['nitro'][0][0], 
                                                        space['nitro'][0][2], 
                                                        space['nitro'][0][1], 
                                                        pairs[1][0], 
                                                        space['oxy'][0][0], 
                                                        space['oxy'][0][2], 
                                                        space['oxy'][0][1], 
                                                        wrappers)
    cmd.delete('_nonpolar')
    cmd.delete('_selection')
    cmd.show_as('dashes', DH_name) # show hb as dashes

Let's compute the dehydrons for our molecule.


In [30]:
dehydron()


--------------------------------------------------------------------
--------------------------Dehydron Results--------------------------
--------------------------------------------------------------------
            Donor             |            Aceptor           |
     Object   Chain Residue   |     Object   Chain Residue   | # wrappers
         1UBQ   A   THR     7 |         1UBQ   A   LYS    11 |     17
         1UBQ   A   GLY    10 |         1UBQ   A   THR     7 |     15
         1UBQ   A   GLU    24 |         1UBQ   A   ASP    52 |     18
         1UBQ   A   ASP    32 |         1UBQ   A   ALA    28 |     15
         1UBQ   A   GLY    35 |         1UBQ   A   GLN    31 |     16
         1UBQ   A   ARG    54 |         1UBQ   A   GLU    51 |     18

Surface Area

The solvent-accessible surface area (SASA) is the surface area of a biomolecule that is accessible to the solvent.

A well-know analytical method to compute the SASA is the LCPO method. This method is analytical, but solves an aproximated version of the problem, providing a quick calculation of the SASA while introducing small errors (~1-3 $\mathring A$$^2$).

A numerical method to calculate the SASA is to use the Shrake & Rupley algorithm. Conceptually, this method rolls a sphere (representing the solvent), all over the surface of the molecule.

The Shrake and Rupley algorithm step by step

The algorithm to compute SASA is a little bit more complex than what we have seen so far, but is nothing to worry.

The first step in the Shrake-Rupley algorithm is to draw a mesh of (aproximatelly) evenly distributed points around each atom. Initially the points are drawn on the unit sphere and then scaled to a given radius (the solvent radius + the van der Waals radius) of each atom. Each point represents a small portion of the total area. Thus, the solvent accesible surface area can be determined by counting the points that are exposed to the solvent (i.e. not buried by the neighboring atoms). Increasing the number of points should decrease the error and increase the computational cost. The choice of the solvent radius also affects the computed area, the smaller the radius the larger the computed area. A typical value for the solvent radius is 1.4 $\mathring A$, which is aproximatelly the radius of a water molecule.

The Shrake and Rupley algorithm can be divided into 3 functions.

Generate a mesh of evenly distributed points on the surface of a sphere

There is no exact method to generate such a mesh, the Shrake-Rupley algorithm relies on an aproximation that is based on the Golden ratio.


In [31]:
def generate_sphere_points(n):
    """
    Return an array of evenly distributed points on the 
    surface of the unit sphere using the Golden-Section
    Spiral algorithm.
    
    Parameters
    ----------
    n : int. Number of points
    
    """
    
    golden_angle = np.pi * (3 - 5**0.5)

    theta = golden_angle * np.arange(n)
    z = np.linspace(1 - (1 / n), (1 / n) - 1, n)
    radius = np.sqrt(1 - z**2)
    points = np.zeros((n, 3))
    # change from polar to cartesian coordinates
    points[:,0] = radius * np.cos(theta)
    points[:,1] = radius * np.sin(theta)
    points[:,2] = z
    
    return points

In [32]:
points = generate_sphere_points(1000)
fig = plt.figure(figsize=(7, 7))
ax = fig.add_subplot(1, 1, 1, projection='3d')
ax.scatter(points[:,0], points[:,1], points[:,2]);


Create a function to compute the neighbor atoms

The next thing we need to do is to compute the neighbors for each atom. In this context two atoms are neighbors if is not possible to put a water molecule between them. In other words, two atoms are neighbors if the distance between their centers is less that the diameter of the solvent plus the van der Waals radius of both atoms.
The idea of creating this function is to reduce the computational cost of the SASA calculation if two atoms are not neighbors, then they can't bury each other.


In [33]:
# create a list with the vdW radius for each atom
stored.vdW = []
cmd.iterate('all', 'stored.vdW.append(vdw)')
# compute the distance matrix
dist = cdist(xyz, xyz, metric='euclidean')

def get_neighbors(coord, probe, k):
    """
    Return list of indices of neighbors. 
    
    Parameters
    ----------
    coord : Array (n,3). Cartesian coordinates
    probe : float. Radius of the solvent
    k: int. atom for which the neighbors will be computed
    
    """
    neighbor_indices = []
    radius = stored.vdW[k] + probe * 2
    # avoid comparing an atom with itself
    for idx in range(len(coord)):
        if idx != k:
            if dist[idx, k] < radius + stored.vdW[idx]:
                neighbor_indices.append(idx)
    return neighbor_indices

Now we are ready to combine the functions _generate_spherepoints and _getneighbors to compute the SASA.


In [57]:
def get_sasa(coord, probe=1.4, n_points=1000):
    """
    Returns the solvent-accessible-surface area
    
    Parameters
    ----------
    coord : Array (n,3). Cartesian coordinates
    probe : float. Radisu of the solvent
    n : int. Number of points
    
    """
    points = generate_sphere_points(n_points)
    # compute the area each point represents
    const = 4.0 * np.pi / n_points
    areas = []
    for i in range(len(coord)):
        # scale the points to the correct radius
        radius = stored.vdW[i] + probe
        points_scaled = coord[i] + points * radius
        # get all the neighbors of the i residue
        neighbors = get_neighbors(coord, 1.4, i)
        # compute the distance between points and neighbors
        d_matrix = cdist(points_scaled, coord[neighbors],
                     metric='euclidean')
        # create a matrix and store the vdW radii for each neighbor
        nb_matrix = np.zeros((n_points, len(neighbors)))
        for nb_i, nb in enumerate(neighbors):
            nb_matrix[:,nb_i] =  stored.vdW[nb]
        # compute the number of buried points, we have to be carefull
        # because we have counted how many times a point is buried
        # and we only need how many points are buried
        buried = np.sum(np.sum(d_matrix < nb_matrix + probe, axis=1) > 0)
        exposed = n_points  - buried
        area = const * exposed * radius**2
        areas.append(area)
    return areas

In [58]:
sasa = get_sasa(xyz, probe=1.4, n_points=1000)
print 'SASA = %.2f' % sum(sasa)


SASA = 4870.62

In [36]:
cmd.set('dot_solvent', 1)
cmd.set('dot_density', 2) 
cmd.set('solvent_radius', 2) 
cmd.get_area('all')


Out[36]:
4878.47119140625

In [34]:
sasas = []
for i in range(500, 10001, 500):
    sasas.append(sum(get_sasa(xyz, probe=1.4, n_points=i)))

In [35]:
plt.plot(range(500,10001,500), sasas)
plt.xlabel('Number of points')
plt.ylabel('SASA ($\AA^2$)');


Further reading

PconPy is a Python module for generating 2D protein maps.
asa.py An alternative Python implementation of the Shrake and Rupley algotithm. The implementation of the SASA algorithm discused in this chapter is based on asa.py
The mysterious regions of the ramachandran plot A very interesting post about the Ramachandran Plot