In [1]:
% matplotlib inline
import sys, pymol
from pymol import cmd, stored
stdout = sys.stdout
pymol.finish_launching()
sys.stdout = stdout
from __future__ import division
from math import cos, acos
import numpy as np
from numba import jit # this is a new one! we will use Numba to speed-up our computations
np.set_printoptions(linewidth=200, precision=3)
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.optimize import minimize
from scipy.misc import derivative
from scipy.stats import norm, expon
Before talking about optimization methods and molecualr mechanics we will talk about how to speed-up Python code. There are several options, most of them follows the Pareto rule.In this case can be re-stated as follows only an small part of the code is responsible for most of the execution time. Hence, is generally possible to speed-up Python just by re-writing an small part of the code, you can attain that by:
The 3 last options allows speed-ups that can range from 0 to 3 orders of magnitude depending on several factors. This means that under certain conditions it is possible to obtain similar performance to C/C++ or Fortran code.
Much of the technology behind these methods (like the jit compiler) for optimization were not available when Python was invented and thus the patchy-feel of this solutions. One language that was created from the beginning with all this optimizations in mind is Julia. Julia feels like Python but runs (almost) like C/C++/Fortran. The only drawback right now is that the Julia ecosystem is too young (at least for now). The good news is that it is possible to run Julia and Python together (see here and here).
Using Numba we just add the decorator @jit just before the function we would like to optimize. A decorator allow you to modify code in functions or classes, without having to rewrite the function or class.
In general Numba is smart enough to infer the type of the arguments and the type of the returned value. In case Numba fails to find out the right types you could end-up with code that is slower than the pure python code! In that case you can indicate to Numba the right types. For how to do this and other features on Numba please read the documentation.
In [2]:
# A pure python funcion
def sum_0(arr):
M, N = arr.shape
result = 0
for i in range(M):
for j in range(N):
result += arr[i,j]
return result
# A numpy version of the previous funcion
def sum_1(arr):
return np.sum(arr)
# The numba-optimized version of the first function
@jit
def sum_2(arr):
M, N = arr.shape
result = 0
for i in range(M):
for j in range(N):
result += arr[i,j]
return result
# The numba-optimized version of the second function
@jit
def sum_3(arr):
return np.sum(arr)
# and array to test our functions
a = np.arange(64).reshape(8, 8)
In [3]:
%timeit sum_0(a) # Python
%timeit sum_1(a) # NumPy
%timeit sum_2(a) # Numba
%timeit sum_3(a) # NumPy + Numba
When I run this benchmark on my machine I get the followng results (all comparisons are made against the pure Python function).
Not bad for a adding a decorator and removing some lines!
Now, we will write some auxiliary functions, using Numba, for later use.
In [4]:
@jit
def dist(x, y):
return ((x[0]-y[0])**2 + (x[1]-y[1])**2 + (x[2]-y[2])**2)**0.5
@jit
def _dot(p, q):
return p[0]*q[0] + p[1]*q[1] + p[2]*q[2]
@jit
def _cross(p, q):
x = p[1] * q[2] - p[2] * q[1]
y = p[2] * q[0] - p[0] * q[2]
z = p[0] * q[1] - p[1] * q[0]
return np.array([x, y, z])
@jit
def _mod(x):
return (x[0]**2 + x[1]**2 + x[2]**2)**0.5
In [5]:
# a couple of functions to compare against the Numba optimized ones.
def dist_py(x, y):
return (sum((x-y)**2))**0.5
def mod_py(x):
return (sum(x**2))**0.5
# just an array for benchmarking
coord = np.array([[ 37.71, 23.33, 15.6 ],
[ 38.32, 22.34, 15.19]])
In [6]:
%timeit dist_py(coord[0], coord[1])
%timeit dist(coord[0], coord[1])
%timeit np.dot(coord[0], coord[1])
%timeit _dot(coord[0], coord[1])
%timeit np.cross(coord[0], coord[1])
%timeit _cross(coord[0], coord[1])
%timeit mod_py(coord[0])
%timeit _mod(coord[0])
In my case the speed-ups of using Numba vs pure Python are; ~9, ~1.5, ~1.5, ~13 respectively.
A force field (FF) is a model of the potential energy of a system of particles like a molecular system, including the functional form and the parameters. Notice that while a FF do not explicitly model forces, the first derivative of the potential energy gives the force, so the force part of the name can be easily justified.
Most common FF have a similar functional form, some have less terms (like no bond or angle terms or even no electrostatic term) others have more terms (like explicitly representing hydrogens bonds or a term to penalize extended structures)
$$ U = \sum_\text{bonds} k_b (r-r_0)^2 + \sum_\text{angles} k_a (\theta - \theta_0)^2 + \sum_\text{torsions} k_t [1-\cos(n (\chi - \chi_0))] \\ + \sum_\text{non-bonded} \biggl({\frac{A_{ij}}{(r_{ij})^{12}} - \frac{B_{ij}}{(r_{ij})^6} + \frac{q_iq_j}{4\pi \epsilon_0 r_{ij}}\biggr)} $$The non-bonded interactions (the last term) are computed only for atoms three or more bonds apart. Generally, 1-4 interactions are scaled down by a factor of 0.5.
We can classify FF according to how detailed the system is modeled:
Some popular FF for biomolecules are AMBER, CHARMM, OPLS, Rosetta, MARTINI and Sirah.
As you see from the above equation there are a lot of parameters in a FF that should be determined, like equilibrium values of bond lengths, bond angles and torsionals, van der Waals radii, and partial charges. The parameters are determined for atoms in functional groups. i.e. a carbon in a methyl and a carbon in a carboxyl groups will have different set of parameters. In most common FF the partial charges are fixed. Nevertheless, polarizable models (in which the charge responds to the environment) have been also developed and how to implement them in a computationally efficient way is an intensive area of research.
Traditionally, parameters were derived from quantum mechanical computations and/or experimental measurement of small compounds. One problem with this approach is that by using small compounds is hard to correctly capture the properties of large biomolecules. One good example of this is the difficulty that FF historically had to reproduce the ramachandran plot. Alternative, parameters can be derived by doing statistical analysis of experimentally solved biomolecular structures. Briefly, the main idea is that the probability to observe any state of a system is proportional to $e^{(\frac{-E}{kT})}$ (i.e. the probability of the states follow a Boltzmann distribution), where $E$ is the energy, $T$ the absolute temperature and $k$ the Boltzmann constant. Thus, we can obtain energies by observing frequencies. FF derived using this type of approach are generally called knowledge based scoring functions or knowledge based potentials. This approach have been used for the Rosetta force field, that includes (among others) an hydrogen bond term, a ramachandran term, a rotamer term for side chains, etc were the parameters were determined directly from high quality protein structures. You can read more about the energy terms in the _score12 scoring function here. The score12 scoring function, was the default in Rosetta until the talaris2013 scoring function was created, unfortunately this new scoring function is not yet well documented.
Now we are going to implement our very own FF. We are not going to do any fancy stuff, indeed chances are high that our FF will not be adequate for any other purpose than to learn a couple of things in this course. Our FF will have 4 terms, bonds, angles, torsionals and VdW/hard_sphere.
First we will load a small molecule, a blocked alanine residue.
In [7]:
cmd.load('ala_blocked.pdb')
Then, we are going to define the parameters of our FF. These parameters are a very short and simplified version of what you will usually find in a modern FF. Notice that we have just four types of atoms (C,H,O and N) and in general the values, while inspired on values from a real FF (AMBER), we can't take them too seriously since we didn't have performed any test to check if the combination of parameters reproduce some experimentally observed or theoretically derived quantity.
In [8]:
# some values are duplicated, because for example we can have a 'NC' bond or 'CN' bond
# but both are the same!.
par_bonds = {
'CC':(303.1, 1.535),
'CN':(390.5, 1.407),
'CO':(623.0, 1.224),
'CH':(344.3, 1.087),
'NH':(451.2, 1.402),
'OH':(294.6, 1.517),
'NC':(390.5, 1.407),
'OC':(623.0, 1.224),
'HC':(344.3, 1.087),
'HN':(451.2, 1.402),
'HO':(294.6, 1.517)}
par_angles = {
'CCO':(68.0, 2.1485),
'CCN':(65.8, 1.9565),
'CCC':(63.8, 1.9198),
'CCH':(46.4, 1.9198),
'CNC':(63.9, 2.1170),
'COH':(47.1, 1.8867),
'CNH':(48.8, 2.0437),
'NCO':(75.8, 2.1293),
'NCH':(49.8, 1.9111),
'OCH':(47.1, 1.8867),
'OCO':(78.2, 2.2741),
'HCH':(39.4, 1.8901),
'OCC':(68.0, 2.1485),
'NCC':(65.8, 1.9565),
'HCC':(46.4, 1.9198),
'HOC':(47.1, 1.8867),
'HNC':(48.8, 2.0437),
'OCN':(75.8, 2.1293),
'HCN':(49.8, 1.9111),
'HCO':(47.1, 1.8867),
}
par_torsionals = {
'CN':(2, 5.0, 3.1416),
'NC':(2, 5.0, 3.1416),
'CC':(3, 0.6, 3.1416),
}
par_vdw = {
'H':(1.49, 0.015),
'C':(1.90, 0.086),
'O':(1.66, 0.210),
'N':(1.82, 0.170),
}
Now we will get the coordinates of our molecule using the PyMOL API.
In [9]:
model = cmd.get_model('all')
xyz = np.array(model.get_coord_list())
# Python start counting on 0 PyMOL on 1, thus we add a dummy value '0' to the list.
topo = {'types': [0,], 'bonds': []}
We will also need to have a record of the atoms types in our molecule. Remember that in our implementation the atom types are just the 4 elements C,H,O and N.
In [10]:
def get_types():
"""
Get the types of the atoms in a molecule
"""
cmd.iterate('all', 'types.append(elem)', space=topo)
return topo['types']
In [11]:
types = get_types()
print types
Bonds can be parametrized using a simple harmonic potential. This approximation is OK if the bond length does not deviate too much from its equilibrium value.
$$ \sum_\text{bonds} k_b (r-r_0)^2 $$Where: $K_b$ is a constant that controls how easy is for the length of a bond $r$ to deviates from its equilibrium value $r_0$.
We need to list all bonds in our molecule. Since classical force fields does not model bond rupture/creation this list will not change for a given molecule during our computations/simulations.
In [12]:
def get_bonds():
"""
List all bonds in a molecule.
"""
model = cmd.get_model('all')
for at in model.atom:
cmd.iterate('neighbor index %s' % at.index,
'bonds.append((%s, index))' % at.index, space=topo)
#remove duplicated bonds
bonds = list(set([tuple(sorted(i)) for i in topo['bonds']]))
return bonds
In [13]:
bonds = get_bonds()
print bonds
In [14]:
def nrg_bonds(bonds, xyz):
E_bonds = 0
for bond in bonds:
at_i, at_j = bond
key = types[at_i] + types[at_j]
k_bond, r0 = par_bonds[key]
r_ij = dist(xyz[at_i-1], xyz[at_j-1])
E_bonds += k_bond * (r_ij - r0)** 2
return E_bonds
In [15]:
nrg_bonds(bonds, xyz)
Out[15]:
In [16]:
def get_angles(bonds):
"""
List all angles in a molecule.
Parameters
----------
bonds: List of tuples: containing the indices for all bonds in a molecule.
"""
angles = []
lenght = len(bonds)
for i in range(lenght):
for j in range(i+1, lenght):
s = set(bonds[i])
t = set(bonds[j])
# The vertex is the element common to both bonds
vertex = list(s & t)
if vertex:
# edges are the element in one bond but not in the other
# I use pop to retrieve the only element in each set
edge0 = (s - t).pop()
edge1 = (t - s).pop()
angle = (edge0, vertex[0], edge1)
angles.append(angle)
return angles
In [17]:
angles = get_angles(bonds)
print angles
Now we can compute the energy contribution of the angle term to the total energy
In [18]:
def comp_angle(a, b, c):
ba_mod = dist(xyz[b], xyz[a])
bc_mod = dist(xyz[b], xyz[c])
angle_rad = acos(_dot(xyz[b]-xyz[a], xyz[b]-xyz[c]) /
(ba_mod * bc_mod))
return angle_rad
def nrg_angles(angles, xyz):
E_angles = 0
for angle in angles:
at_i, at_j, at_k = angle
key = types[at_i] + types[at_j] + types[at_k]
k_angle, theta0 = par_angles[key]
#angle_rad = cmd.get_angle('index %s' % at_i, 'index %s' % at_j, 'index %s' % at_k)
angle_rad = comp_angle(at_i-1, at_j-1, at_k-1)
E_angles += k_angle * (angle_rad - theta0)**2
return E_angles
In [19]:
#%%timeit
nrg_angles(angles, xyz)
Out[19]:
Since torsionals angles have more than one minimum value we could model the torsional contribution to the potential energy using a cosine function.
$$ \sum_\text{torsions} k_t [1-\cos(n (\chi- \chi_0))] $$Where:
$K_t$ is a constant that controls how easy is for the angle $\chi$ deviates from its equilibrium value $\chi_0$. Like in the case of the angles, the torsionals are measured in radians.
$n$ is a parameter that controls the number of minima.
In [20]:
def get_torsionals(angles):
"""
List all torsionals in a molecule.
Parameters
----------
angles: List of tuples: containing the indices for all angles in a molecule.
"""
torsionals = []
lenght = len(angles)
for i in range(lenght):
for j in range(i+1, lenght):
ang_s = angles[i]
ang_t = angles[j]
if ang_s[1] == ang_t[0] and ang_s[2] == ang_t[1]:
a = ang_s[0]
b, c, d = ang_t
torsional = (a, b, c, d)
torsionals.append(torsional)
elif ang_s[1] == ang_t[2] and ang_s[2] == ang_t[1]:
a = ang_s[0]
b, c, d = ang_t[::-1]
torsional = (a, b, c, d)
torsionals.append(torsional)
elif ang_s[0] == ang_t[1] and ang_s[1] == ang_t[0]:
a = ang_s[2]
b, c, d = ang_t
torsional = (a, b, c, d)
torsionals.append(torsional)
return torsionals
In [21]:
torsionals = get_torsionals(angles)
print torsionals,
In [22]:
def comp_dihedral(a, b, c, d):
# Compute 3 vectors conecting the four points
ba = xyz[b]-xyz[a]
cb = xyz[c]-xyz[b]
dc = xyz[d]-xyz[c]
# Compute the normal vector to each plane
u_A = _cross(ba, cb)
u_B = _cross(cb, dc)
#Measure the angle between the two normal vectors
u_A_mod = _mod(u_A)
u_B_mod = _mod(u_B)
tor_rad = acos(_dot(u_A, u_B) /
(u_A_mod * u_B_mod))
return tor_rad
def nrg_torsionals(torsionals, xyz):
E_tor = 0.
for tor in torsionals:
at_i, at_j, at_k, at_l = tor
key = types[at_j] + types[at_k]
period, k_t, chi0 = par_torsionals[key]
#chi = cmd.get_dihedral('index %s' % at_i,'index %s' % at_j,
# 'index %s' % at_k,'index %s' % at_l)
chi = comp_dihedral(at_i-1, at_j-1, at_k-1, at_l-1)
E_tor += k_t * (1 - (cos(period*(chi - chi0))))
return E_tor
In [23]:
#%%timeit
nrg_torsionals(torsionals, xyz)
Out[23]:
The van der Waals contribution to the potential energy describes the repulsion or attraction between atoms that are not directly bonded, besides does that involves the interaction of charges and partial charge.
$$ \sum_\text{torsions} \biggl({\frac{A_{ij}}{(r_{ij})^{12}} - \frac{B_{ij}}{(r_{ij})^6} \biggr)} $$The $r^{12}$ term, describes Pauli repulsion at short ranges due to overlapping electron orbitals and the $r^{6}$ term describes attraction at long ranges due the so-called dispersion forces.
Is important to remark that the only justification for the exponents $12$ and $6$ is one of numerical convenience. Indeed there is evidence that other exponent and/or expression for the van der Waals energy are better suited.
In [24]:
def get_nb():
"""
Computes the neighbors of each atom, that are at least 3 bonds appart.
"""
nb_vdw_list = []
model = cmd.get_model('all')
for at in model.atom:
nb = cmd.index('not (index %s extend 3)' % at.index)
nb_vdw_list.extend([tuple(sorted([at.index, i[1]])) for i in nb])
nb_vdw = list(set(tuple(nb_vdw_list)))
return nb_vdw
In [25]:
neighbors = get_nb()
print neighbors,
Now we can compute the van der Waals contribution to the total energy:
In [26]:
def nrg_vdw(neighbors, xyz, hard_sphere=True):
"""
Computes the a "hard sphere" energy or the van der Waals energy:
Parameters
----------
neighbors: List of tuples.
Containing the indices for all angles in a molecule.
xyz: array.
cartesian coordinates.
hard_sphere: bolean.
if True compute the energy using a hard sphere potential, if False
uses the Lennard-Jones potential
"""
E_vdw = 0
for i,j in neighbors:
key_i = types[i]
key_j = types[j]
sum_vdw = par_vdw[key_i][0] + par_vdw[key_j][0]
r_ij = dist(xyz[i-1], xyz[j-1])
if hard_sphere:
if r_ij > sum_vdw:
E_vdw += 0.2 # arbitrary, non-tested value.
else:
epsilon_ij = (par_vdw[key_i][1] * par_vdw[key_j][1])**0.5
A_ij = epsilon_ij * (sum_vdw)**12
B_ij = 2 * epsilon_ij * (sum_vdw)**6
E_vdw += (A_ij / (r_ij)**12) - (B_ij / (r_ij)**6)
return E_vdw
In [27]:
#%%timeit
nrg_vdw(neighbors, xyz)
Out[27]:
The total energy of out molecule, according to out FF is:
In [28]:
print nrg_bonds(bonds, xyz) + nrg_angles(angles, xyz) + nrg_torsionals(torsionals, xyz) \
+ nrg_vdw(neighbors, xyz)
Energy minimization (AKA energy optimization or geometry optimization) is the process of finding an arrangement of a collection of atoms where, according to some model (force field, scoring function, etc), the net inter-atomic force on each atom is as close as possible to zero, i.e. the system is in a local minimum of the potential energy surface (PES).
During an energy minimization the changes in the coordinates have not physical meaning. The minimization process do not produce a trajectory in the physical sense. Thus, different optimization algorithms could provide equivalent results for the minimum energy structure, but arrive at them trough different pathways.
As we have already discussed, a molecule can be described by a coordinate vector $\mathbf{r} = (r_1, r_2,...r_n)$. And the energy of a molecule can be computed as a function of its coordinates $E(r)$. And our objective is to find the value of $\mathbf{r}$ for which $E(\mathbf{r})$ is at a local minimum. This kind of problems are studied by a branch of mathematics called optimization theory. Optimization is a general term to refer to the maximization and minimization functions. Both operations are the equivalent since maximizing $f$ is the same as minimizing $-f$.
There are several strategies to find the minimum of a function. One possibility, specially useful for very complicate problems that are difficult to deal analytically, is to compute the values of our functions are different points and use the lowest found value as the minimum value of the function. We can do this in a systematic fashion or moving randomly and we can move in a continuous space or use a grid to discretize the problem. We will explore this type of solution when we see Markov Chain Monte Carlo. But know we will explore other alternatives based on using derivatives of our function.
We know we are at a local minimum of a function because the derivative of the energy with respect to the position of the atoms, $\frac{\partial E}{\partial \mathbf{r}}$, is zero and the second derivative, $\frac{\partial^2E}{\partial^2r_i}$, is positive. This should sound familiar too you. When we are dealing with multiple dimensions people it is more convenient to talk about the gradient of a function $\nabla E(r) = (\frac{\partial E}{\partial r_1}, \frac{\partial E}{\partial r_2},...\frac{\partial E}{\partial r_n})$ instead of the first derivative and about the Hessian matrix
$$ \mathbf{H} E(r) = \begin{bmatrix} \dfrac{\partial^2 E}{\partial r_1^2} & \dfrac{\partial^2 E}{\partial r_1\,\partial r_2} & \cdots & \dfrac{\partial^2 E}{\partial r_1\,\partial r_n} \\[2.2ex] \dfrac{\partial^2 E}{\partial r_2\,\partial r_1} & \dfrac{\partial^2 E}{\partial r_2^2} & \cdots & \dfrac{\partial^2 E}{\partial r_2\,\partial r_n} \\[2.2ex] \vdots & \vdots & \ddots & \vdots \\[2.2ex] \dfrac{\partial^2 E}{\partial r_n\,\partial r_1} & \dfrac{\partial^2 E}{\partial x_n\,\partial r_2} & \cdots & \dfrac{\partial^2 E}{\partial r_n^2} \end{bmatrix}. $$which describes the curvature of the PES at $\mathbf{r}$, has all positive eigenvalues we are at a minimum.
The computational model that provides a value for $E(\mathbf{r})$ could be based on quantum mechanics, classical force fields or statistical potentials. Using any of these computational models and an initial guess of the correct geometry, an iterative optimization procedure is followed, for example:
All methods described in this section are dedicated to find the local minimum and can't guaranteed to find the global minimum. Latter we will study strategies to find the global minimum.
Now we will explore some basic optimization methods, that will serve to understand the general ideas behind the subject of Optimization. You should keep in mind that this is a topic that could quickly become very complex such as with algorithms for constrained optimization (when you keep some of the variable at a fixed value, like constant bond length and bond angle). Anyway, chance are high that you will never need to implement you own minimization algorithm, and in general you shouldn't! Several trusted implementations already exists (like the routines that come with scipy).
A special case of energy minimization, that we will not discuss here, is the search for the geometry of transition states. For this particular case the goal is to find a saddle point on the PES and not a local (or global) minimum.
In the next section we will explore 3 different numerical methods to minimize functions. Just to keep things simple, we will minimize the energy of a single bond. If you have more than one variable you should apply the same procedure for each dimension.
In [29]:
def nrg(r_ij):
k_bond = 485
r0 = 1.39
return 0.5 * k_bond * (r_ij - r0)**2
Steepest Descent (or gradient descent) is a greedy optimization routine frequently used on structural bioinformatics and at the hart of many machine learning algorithms. The premise is that at each point of our function the minus derivative (or minus gradient) tells us the direction we should move to find the local minimum. This algorithm works as follows:
In [30]:
def steepest_descent_numerical(nrg, r=1.55, learning_rate=.0001, steps=500, tol=1E-4):
force = derivative(nrg, r)
for i in range(steps):
r = r - learning_rate * force
force = derivative(nrg, r)
if abs(force) < tol:
break
return i, r, nrg(r), force
Steepest Descent, if done right it will always moves toward the local minimum. It has, however, two main problems:
When optimizing a function with more that one variable/dimension every step is necessarily orthogonal to the one before it. The steepest descent algorithm therefore has a tendency to partly undone the energy minimization obtained from the previous step.
As the steepest descent approach to the minimum the rate of convergence slows down. Strictly, it will never reach the minimum. Although, in practical terms the minimum is reached.
In the context of structural bioinformatics and chemoinformatics, steepest descent it is used to quickly relax a bad conformation, followed by a minimization using a different algorithm. Alternatively, steepest descent is used when more sophisticated algorithms fails or are impractical to apply.
Before discussing other minimization methods. We will compare the time consumed by a steepest descent that relies on numerical derivatives vs one that relies on analytical ones.
In [31]:
def steepest_descent_analytical(nrg, r=1.55, learning_rate=.0001, steps=500, tol=1E-4):
force = 485 * (r - 1.39)
for i in range(steps):
r = r - learning_rate * force
force = 485 * (r - 1.39)
if abs(force) < tol:
break
return i, r, nrg(r), force
In [32]:
print steepest_descent_numerical(nrg, r=1.55)
print steepest_descent_analytical(nrg, r=1.55)
In [33]:
%timeit steepest_descent_numerical(nrg, r=1.55)
%timeit steepest_descent_analytical(nrg, r=1.55)
As you can see computing the derivative numerically can be time-consuming, hence it is always a good idea to compute derivatives analytically. And in fact thats the way force-field are minimized.
Lets see now how to use steepest descent minimizer to fix a line too a set of data. Supose that we have a data vector $\mathbf{x}$ and we want to use it to predict $\mathbf{y}$ and we are going to use a simple linear model, hence:
$$ y = \alpha + \beta x $$Where:
$\alpha$ is the y-intercept
$\beta$ is the slope
One way to find the best line that fit this data is to define an error function and minimize that function. Other options, that we will not see here, will be to solve the problem analytically and another is to treat the problem as a Bayesian inference problem.
An error function that makes sense is:
$$Error = \frac{1}{2N} \sum_{i=1}^{N} (y_{i} - \alpha + \beta x_{i})^2$$Since we have 2 parameters to minimize we need to compute two derivatives:
$$\frac{\partial}{\partial \alpha} = - \frac{1}{N} \sum_{i=1}^{N} (y_{i} - \alpha + \beta x_{i})$$$$\frac{\partial}{\partial \beta} = - \frac{1}{N} \sum_{i=1}^{N} x_{i} (y_{i} - \alpha + \beta x_{i})$$Now we will create some synthetic data, that follows a linear relationship.
In [34]:
#np.random.seed(1234)
n_points = 50
true_alpha = 5
true_beta = 2
x = np.linspace(0, 1, n_points)
# obtain y by fitting y = a + b*x
y = true_alpha + true_beta * x + np.random.normal(scale=.3, size=n_points)
plt.scatter(x, y);
Now we will write a steepest descent for our problem.
In [35]:
def sd_fit_line(x, y, alpha=0, beta=1, learning_rate=1, steps=500, tol=1E-4):
alpha_gradient = -np.average(y - (alpha + beta * x))
beta_gradient = -np.average(x* (y - (alpha + beta * x)))
for i in range(steps):
alpha = alpha - learning_rate * alpha_gradient
beta = beta - learning_rate * beta_gradient
alpha_gradient = -np.average(y - (alpha + beta * x))
beta_gradient = -np.average(x* (y - (alpha + beta * x)))
if abs(alpha_gradient) < tol and abs(beta_gradient) < tol:
break
return alpha, beta, i
In [36]:
alpha, beta , total_steps = sd_fit_line(x, y, alpha=0, beta=1)
print alpha, beta, total_steps
Now we will visually check the results. Feel free to compare how well the $\alpha$ and $\beta$ values compare against values obtained by other methods/implementations such as numpy polyfit or statsmodels
In [37]:
fitted_line = alpha + beta * x
plt.scatter(x, y)
plt.plot(x, fitted_line, 'r');
Analitically this problem can be solved using the normal equation:
$$ \theta = (X X^T)^{-1} y X^T $$in that case we have:
In [38]:
k = np.array([1]*len(x))
X = np.vstack((k,x))
print np.dot(np.linalg.inv(np.dot(X, X.T)), np.dot(y, X.T))
For low-dimensional problems linear regression it is generally done using the normal equation. This analytical solution involves inverting a matrix, an operation that is expensive, when the number of dimensions/variables is greater than ~1000 (something common in some machine learning applications) using steepest descent could become a preferred choice.
As mention before another option is to perform a linear regression as a Bayesian inference problem, that in general involves sampling the posterior distribution (in htis case the distribution of $\alpha$ and $\beta$ values) by using a Markov Chain Monte Carlo method.
Conjugate gradient (CG) is a method that tries to avoid the oscillatory behaviour of the steepest descent algorithm. The first step of both methods are the same, but from step two onward CG will move in a direction that is a mix of the previous gradient and the one indicated by the current gradient. Several recipes for mixing both directions have been proposed.
The Newton-Raphson algorithm use the first derivative (or gradient) and the second derivative matrix (know as Hessian matrix). The idea is similar to steepest descent but instead of updating the new position using only the first derivative we use:
$$r_{i+1} = r - f'/f''$$In our example of the bond, the energy is given by $0.5 k_{bond} (r_{i} - r_{eq})^2$, hence:
$$r_{i+1} = r_{i} - \frac{k_{bond} (r_{i} - r_{eq})}{k_{bond}}$$$$r_{i+1} = r_{i} - r_{i} - r_{eq}$$$$r_{i+1} = r_{eq}$$For quadratic functions, the Newton-Raphson (theoretically) locates the local minimum in just one step!
For some problems it could not be possible to provide the Hessian matrix analytically so we have to found it numerically. This could be really time-consuming for some types of functions. Additionally, for problems with a large number of variables storing and manipulating the Hessian could be too expensive (in terms or memory and/or CPU) and hence other methods should be used. One common heuristics applied is to compute an approximated Hessian.
Now we are going to minimize the energy of the blocked alanine. And we are going to use our own FF to do it. We are going to use the conjugate gradient minimizer as implemented in the minimize function provided by SciPy. The gradient vector will be computed analytically, something impractical for real purpose applications.
The first thing we need to do is to create a function that we will pass to the minimize function. In our case that function is simple the sum of the four terms derived previously. The argument of this function will be the flatten coordinate array.
In [39]:
def function(flat_xyz):
xyz = flat_xyz.reshape(-1,3)
e = nrg_bonds(bonds, xyz) + nrg_angles(angles, xyz) + nrg_torsionals(torsionals, xyz) \
+ nrg_vdw(neighbors, xyz)
return e
In [40]:
flat_xyz = xyz.flatten()
result = minimize(function, flat_xyz, method='CG')
stored.updated = list(result['x'].reshape(-1,3))
cmd.alter_state(1, 'all',"(x,y,z)=stored.updated.pop(0)")
cmd.sort()
print result['fun']
In [41]:
model = cmd.get_model('all')
xyz = np.array(model.get_coord_list())
print nrg_bonds(bonds, xyz) + nrg_angles(angles, xyz) + nrg_torsionals(torsionals, xyz) \
+ nrg_vdw(neighbors, xyz)
In [42]:
# code to randomize the cartesian coordinates
# xyz = xyz + np.random.uniform(0, 0.2, size=xyz.shape)
# stored.updated = list(xyz)
# cmd.alter_state(1, 'all',"(x,y,z)=stored.updated.pop(0)")
# cmd.sort()
# print nrg_bonds(bonds, xyz) + nrg_angles(angles, xyz) + nrg_torsionals(torsionals, xyz) \
Markov Chain Monte Carlo are a family of algorithms that are extensively used and have had a profound effect in almost all the branches of science. The main purpose of these algorithm is to get samples from a probability distribution that is difficult to directly sample. This is done by constructing a Markov chain with the desired distribution as its equilibrium distribution (i.e. the more samples we get the closer we get to the desired distribution). In this context a chain is a sequence of states and the associated transition probabilities to move from one state to another, and a Markov chain is the special case when the next state depends only on the current state and not on the previous events.
A well-know example of Markov chain is the so call random-walk or drunkard's walk. For the 1D version of this problem we have that at each step, you can move to the left of to the right with equal probability. From this example it is clear that the transition probabilities depend only on the current state, and not on the way in which we get to that state.
The Monte Carlo part of the name comes from the use of random number to perform computations. The key idea behind a Monte Carlo method is that it is possible to study a complicated system by taking samples from it. For example if you want to know the probabilities of wining a solitary game you could try to analyze the problem analytically (something complicated due to the combinatoric nature of the problem) or just plays as many as possible matches. Indeed this insight came from Stanislaw Ulam (circa 1940) while we was recovering from a surgery and playing solitary. By that time computers had began to be a reality and from that time to the present the increasing computational power of computer have made Monte Carlo methods one of the most used algorithms to solve a wide variety of problems.
Probably the most frequently used Markov Chain Monte Carlo algorithm is the Metropolis-Hastings algorithm that we will discuss in the next section.
The Metropolis–Hastings algorithm can draw samples from any probability distribution $P(x)$ provided you can compute a value that is proportional to $P$. This is very useful feature since in many cases like in Bayesian statistics and mechanical statistic calculating the necessary normalization is difficult for many practical applications. For simple distribution there are more direct methods to sample from them, but Metropolis-Hastings and other MCMC algorithms becomes increasingly useful as the numbers of dimensions increase.
The Metropolis-Hastings algorithm consists on the following steps:
A couple of things to notice:
$A(x_{i+1} | x_i) = \min\left(1,\frac{P(x_{i+1})}{P(x_{i})}\right)$
In [43]:
def metropolis(func, steps=5000):
# seed the random number generator
np.random.seed(1234)
trace = []
# start from a random point or make and educated guess
old_x = func.mean() #np.random.uniform(-10, 10)
old_likelihood = func.pdf(old_x)
for i in range(steps):
# we propose a new position from the previous position
new_x = old_x + np.random.uniform(-1, 1)
# now we compute the new probability
new_likelihood = func.pdf(new_x)
acceptance = new_likelihood/old_likelihood
if acceptance >= 1:
trace.append(new_x)
old_x = new_x
old_likelihood = new_likelihood
elif acceptance > np.random.random():
trace.append(new_x)
old_x = new_x
old_likelihood = new_likelihood
else:
trace.append(old_x)
return trace
In [44]:
# The likelihood, objetive function or "energy function" will be
# a normal distribution with mean 0 and standard deviation 2
func = norm(loc=1, scale=2)
trace = metropolis(func=func)
In [45]:
plt.hist(trace, bins=30, normed=True);
x = np.linspace(-8, 8, 100)
y = func.pdf(x)
plt.plot(x, y, 'r-');
In [46]:
func = expon()
trace = metropolis(func=func)
plt.hist(trace, bins=30, normed=True);
x = np.linspace(0, 8, 100)
y = func.pdf(x)
plt.plot(x, y, 'r-');
We can use Metropolis-Hastings algorithm to study molecules. Statistical mechanics tells us that molecules follows a Boltzmann distribution, where the probability of a particular state, like a conformation $x_i$, is:
$$P(x_i) = \frac{1}{Z} e^{-E(x_i)/(k_b T)}$$Where:
$Z$ is an entity called the partition function and for our interest is just a normalization constant.
$K_b$ is the Boltzmann constant.
$T$ is the absolute temperature.
For the Metropolis-Hastings algorithm we just need the ratio of the probabilities of state $x_i$ and $x_{i+1}$:
$$\frac{P(x_{i+1})}{P(x_{i})} = e^{-\Delta E/(k_b T)}$$where:
$\Delta E = E(x_{i+1}) - E(x_{i})$
It can be proven that a Markov process, under certain conditions, converges to a unique stationary distribution. A sufficient, but not necessary, restriction that guarantee those conditions is to respect detailed balance. Detailed balance means that all transitions must be reversible, that is:
$$P(x_i)P(x_i\rightarrow x_{i+1}) = P(x_{i+1})P(x_{i+1}\rightarrow x_{i})$$Intuitively you can see that if you move between states proportionally to their relative probabilities and you have access to all possible states of the system in the long run you should sample the correct distribution.
Methods used for example in Bayesian statistics (we care about detailed balance)
Metropolis-Hasting can be used to sample from any distribution, including the energy distribution followed by molecules (i.e. the Boltzmann distribution). Indeed it is theoretically possible to use Metropolis-Hasting to sample the whole conformational space of any molecule and obtain the global minimum and all the local minima. Even if we could have a very accurate way to compute the energy the problem will be that the conformation space we need to sample if really large. To complicate things further in practice the energy terms have a lot of inaccuracies that make the conformational space much more rugged that what experiments suggest.
One possible strategy is to make a compromise and conform ourselves with obtaining just the lowest energy conformations and loose information about the details of the conformational space. We will discuss tree of such strategies:
The first two methods are variations of Metropolis-Hastings (that do not fulfill the detailed-balance condition). The third method is an heuristic optimization algorithm inspired in biological evolution and is not part of the Markov Chain Monte Carlo family of algorithms.
This method is essentially the same as Metroplis-Hastings, the only difference is that after we propose a new state for the system we perform and energy minimization and only then we compare the energy of the new and old steps.
As you can imaging in system like globular proteins, with a lot of correlated degrees of freedom, is difficult to perform a movement without obtaining unphysical conformations with a lot of clashes (and high energy). Hence, minimization is good candidate to overcome this issue.
This method is essentially the same as Metroplis-Hastings, the only difference is that in this case the temperature is not fixed. We start with a system at a high temperature and every time we accept a conformation the the system is cooled down. Here the temperature is just a parameter that controls the probability that a conformation with higher energy than the previous step will be accepted. If the temperature is infinite, then every proposed movement will be accepted, on the contrary if the temperature is zero, only movements that decrease the energy will be accepted. The method follows the analogy of the annealing process used when forging metals, or probably more familiar to you the annealing done during a PCR protocol (polymerase chain reaction). Like in those real systems the key to success is on the cooling schedule. In general, severals cycles of cooling and heating are needed.
Molecular Modeling and Simulation: An Interdisciplinary Guide. Tamar Schlick. 2010.
Numerical Methods in Engineering with Python. Jaan Kiusalaas. 2010.
Understanding Molecular Simulation. Daan Frenkel. 2001.
The Handbook of Markov Chain Monte Carlo Editors, Steve Brooks, Andrew Gelman, Galin L. Jones and Xiao-Li Meng. 2011.
The Monte Carlo method Metropolis, N & Ulam S. 1949.
Equations of State Calculations by Fast Computing Machines Metropolis et. al. 1953
Monte Carlo Sampling Methods Using Markov Chains and Their Applications Hastings. 1970.