Sequence alignment and clustering


In [1]:
%matplotlib inline
from __future__ import division
import numpy as np
np.set_printoptions(linewidth=200)
import matplotlib.pyplot as plt
import seaborn as sns

Biological sequences encode a lot of information, and hence a large portion of bioinformatics research is devoted to the development of methods to extract and use that information. The biologically relevant information can be interpreted from at least two (related) points of view, the functional and the evolutionary one. As you may guess, processing sequential data into knowledge is not an easy task. In general, this is accomplished by inferring unknown properties of a sequence from a similar sequence whose properties we already know. In turn, this inference process involves aligning sequences, i.e. determining the degree of similarity of two or more sequences.

Sequence alignment can be used to:

  • Perform searches in databases
  • Infer biological functions
  • Infer the structure of biomolecules
  • Build phylogenetic trees
  • Detect functional regions
  • etc

Homology

Two or more sequences are homologous if they share a common ancestor. Homology is therefore a binary relation: two or more sequences either are or are not homologous.

Two or more sequences can share a common ancestor because of a speciation process (orthologs), a duplication event (paralogs) or even a horizontal transfer process (xenologs).

Homology can be inferred from the similarity of two or more sequences. Two homologous sequences are expected to have similar function, structure and dynamics, and to some degree interact with similar partners. In general this is true, but it can also happen that:

  • two sequences can be similar and non-homologous due to a convergent evolutionary process (or simply by chance).
  • two sequences may not be similar to each other and yet be homologous due to a divergent evolutionary process.

The sequence-alignment problem

Determining the degree of similarity of two or more sequences sounds like something simple to do. Nevertheless, biology tends to be messier than we like to think. The main facts that complicate things are:

  • Sequences can have different lengths
  • Some parts of a sequence can be more biologically relevant than others (think about exons and introns in DNA, or functional sites in proteins)
  • Indels (insertions or deletions) can occur in sequences, due to biological processes or even errors during sequencing experiments.
  • Not all differences are equal; for example, the codon CCU is different from CCA, but both code for the same amino acid, and a mutation of Ile to Leu will probably have a smaller functional/structural effect than a mutation of Ile to Arg.

Given two sequences ADGSYTAGPA and VESYAGPR, the result of an alignment will be something like:

    ADGSYTAGPA
    VE-SY-AGPR

In general, solving the sequence-alignment problem can be framed as developing an algorithm capable of introducing gaps (-) and scoring the matches of identical, similar and different residues.
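
To make the scoring idea concrete, here is a minimal sketch (not part of the original notebook) that scores the toy alignment above with arbitrary match, mismatch and gap values; real aligners use substitution matrices and more elaborate gap penalties, as discussed below.

def naive_score(aln1, aln2, match=1, mismatch=-1, gap=-2):
    """Score a pairwise alignment with fixed match/mismatch/gap values."""
    score = 0
    for a, b in zip(aln1, aln2):
        if a == '-' or b == '-':
            score += gap
        elif a == b:
            score += match
        else:
            score += mismatch
    return score

print naive_score('ADGSYTAGPA', 'VE-SY-AGPR')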

Global vs local alignment

A global alignment algorithm attempts to align one entire sequence to another entire sequence. By contrast, a local alignment algorithm attempts to find portions of one sequence that align with portions of the other sequence.

As you may guess, local alignments are a better idea for more distantly related sequences and for sequences of very different lengths.

Substitution Matrix

A substitution matrix describes the probability that one residue (an amino acid or a base) in a sequence mutates into another residue.

The simplest possible substitution matrix would be the identity matrix, i.e. we assume that each residue is maximally similar to itself and not able to mutate into any other residue.

$$\begin{bmatrix} 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & & 0 & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & & 1 & 0 \\ 0 & 0 & \cdots & 0 & 1 \end{bmatrix}$$

The problem with this matrix is that it only works with very similar sequences. In practice, it has been shown that for proteins empirically determined matrices work best. Such empirical matrices have different values for different pairs of residues. For nucleotide sequences, simpler substitution matrices are generally used, wherein only identical matches and mismatches are considered.
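
As a quick illustration (not part of the original notebook), the identity matrix for the 20 amino acids and a typical match/mismatch nucleotide matrix can be built with NumPy; the +5/-4 values below are just an example.

# identity substitution matrix: identical residues score 1, all other pairs 0
identity_matrix = np.eye(20, dtype=int)

# nucleotide matrix with a single value for matches and another for mismatches
nuc_matrix = np.full((4, 4), -4, dtype=int)
np.fill_diagonal(nuc_matrix, 5)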

The BLOSUM matrices

One of the most commonly used families of substitution matrices is the BLOSUM family. These matrices were derived from a database of very conserved regions of protein families (without gaps in their sequence alignments).

The probabilities of mutations are expressed as log-odds scores. Each element of the matrix is computed as:

$$S_{ij}= \left( \frac{1}{\lambda} \right)\log{\left( \frac{p(i,j)}{p(i) \times p(j)} \right) \propto \log \frac {observed\;frequency} {expected\;frequency}}$$

Where:

$p(i,j)$ is the probability of two amino acids $i$ and $j$ replacing each other in homologous sequences.
$p(i)$ and $p(j)$ are the probabilities of finding the amino acids $i$ and $j$ in any protein sequence.
The factor $\lambda$ is a scaling factor; it is just there so that the matrix contains easily computable integer values.

The base of the logarithm is not relevant (as long as you apply the same base to all the elements of the same matrix!), and the same substitution matrix is often expressed in different bases.
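
A minimal numerical sketch of the formula with made-up frequencies (these numbers are illustrative, not taken from the BLOSUM data): a pair observed twice as often as expected gets a positive score, and with $1/\lambda = 2$ and base-2 logarithms the score comes out in half-bit units, the units used by BLOSUM62.

p_ij = 0.006           # observed frequency of the pair (i, j), made up
p_i, p_j = 0.05, 0.06  # background frequencies of residues i and j, made up
lam = 0.5              # scaling factor; 1/lam = 2 gives half-bit units
s_ij = (1 / lam) * np.log2(p_ij / (p_i * p_j))
print round(s_ij)      # the matrices store the rounded integer value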

The sign of each element tells whether a pair of residues is observed more or less often than expected by chance:

$$ s_{i,j} = \begin{cases} > 0 & \text{if } p(i,j) > p(i) \times p(j) \\ = 0 & \text{if } p(i,j) = p(i) \times p(j) \\ < 0 & \text{if } p(i,j) < p(i) \times p(j) \end{cases} $$

PyMOL is distributed with a copy of the BLOSUM62 matrix (the location of the file may be different on your system).


In [2]:
!cat /usr/share/pymol/data/pymol/matrices/BLOSUM62


#  Matrix made by matblas from blosum62.iij
#  * column uses minimum score
#  BLOSUM Clustered Scoring Matrix in 1/2 Bit Units
#  Blocks Database = /data/blocks_5.0/blocks.dat
#  Cluster Percentage: >= 62
#  Entropy =   0.6979, Expected =  -0.5209
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X  *
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0 -2 -1  0 -4 
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3 -1  0 -1 -4 
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3  3  0 -1 -4 
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3  4  1 -1 -4 
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4 
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2  0  3 -1 -4 
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4 
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3 -1 -2 -1 -4 
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3  0  0 -1 -4 
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3 -3 -3 -1 -4 
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1 -4 -3 -1 -4 
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2  0  1 -1 -4 
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1 -3 -1 -1 -4 
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1 -3 -3 -1 -4 
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2 -2 -1 -2 -4 
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2  0  0  0 -4 
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0 -1 -1  0 -4 
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3 -4 -3 -2 -4 
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1 -3 -2 -1 -4 
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4 -3 -2 -1 -4 
B -2 -1  3  4 -3  0  1 -1  0 -3 -4  0 -3 -3 -2  0 -1 -4 -3 -3  4  1 -1 -4 
Z -1  0  0  1 -3  3  4 -2  0 -3 -3  1 -1 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4 
X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1 -1 -1 -4 
* -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4  1 

In the above matrix:

B = N or D   
Z = Q or E  
X = unknown amino acid   
* = gap penalty  

The average mutual information or relative entropy is a measure of the information per amino acid contained in the matrix. In general, the mutual information is the reduction in entropy/uncertainty about a variable $X$ given that we know the value of a variable $Y$. It is used to compare different substitution matrices. The entropy is 0 if the observed distribution of pair frequencies is the same as the one expected by pure chance, and greater than zero if the matrix is more informative (i.e. not random).

$$H = \sum_{i,j} p(i,j) \times s_{i,j}$$
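
A minimal sketch of this formula for a made-up two-letter alphabet (the frequencies are invented for illustration): the observed pair frequencies, the background frequencies derived from them, and the resulting log-odds scores in bits.

p_pair = np.array([[0.30, 0.20],
                   [0.20, 0.30]])        # observed pair frequencies (sum to 1)
p_single = p_pair.sum(axis=1)            # background frequencies of each letter
expected = np.outer(p_single, p_single)  # pair frequencies expected by chance
s = np.log2(p_pair / expected)           # log-odds scores in bits
H = np.sum(p_pair * s)                   # relative entropy of the matrix
print H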

One possible way to represent a substitution matrix for amino acids in Python is using a $20 \times 20$ array.


In [3]:
BLOSUM62 = np.array([
[ 4, -1, -2, -2,  0, -1, -1,  0, -2, -1, -1, -1, -1, -2, -1,  1,  0, -3, -2,  0],
[-1,  5,  0, -2, -3,  1,  0, -2,  0, -3, -2,  2, -1, -3, -2, -1, -1, -3, -2, -3],
[-2,  0,  6,  1, -3,  0,  0,  0,  1, -3, -3,  0, -2, -3, -2,  1,  0, -4, -2, -3], 
[-2, -2,  1,  6, -3,  0,  2, -1, -1, -3, -4, -1, -3, -3, -1,  0, -1, -4, -3, -3], 
[ 0, -3, -3, -3,  9, -3, -4, -3, -3, -1, -1, -3, -1, -2, -3, -1, -1, -2, -2, -1], 
[-1,  1,  0,  0, -3,  5,  2, -2,  0, -3, -2,  1,  0, -3, -1,  0, -1, -2, -1, -2], 
[-1,  0,  0,  2, -4,  2,  5, -2,  0, -3, -3,  1, -2, -3, -1,  0, -1, -3, -2, -2], 
[ 0, -2,  0, -1, -3, -2, -2,  6, -2, -4, -4, -2, -3, -3, -2,  0, -2, -2, -3, -3], 
[-2,  0,  1, -1, -3,  0,  0, -2,  8, -3, -3, -1, -2, -1, -2, -1, -2, -2,  2, -3], 
[-1, -3, -3, -3, -1, -3, -3, -4, -3,  4,  2, -3,  1,  0, -3, -2, -1, -3, -1,  3], 
[-1, -2, -3, -4, -1, -2, -3, -4, -3,  2,  4, -2,  2,  0, -3, -2, -1, -2, -1,  1], 
[-1,  2,  0, -1, -3,  1,  1, -2, -1, -3, -2,  5, -1, -3, -1,  0, -1, -3, -2, -2], 
[-1, -1, -2, -3, -1,  0, -2, -3, -2,  1,  2, -1,  5,  0, -2, -1, -1, -1, -1,  1], 
[-2, -3, -3, -3, -2, -3, -3, -3, -1,  0,  0, -3,  0,  6, -4, -2, -2,  1,  3, -1], 
[-1, -2, -2, -1, -3, -1, -1, -2, -2, -3, -3, -1, -2, -4,  7, -1, -1, -4, -3, -2], 
[ 1, -1,  1,  0, -1,  0,  0,  0, -1, -2, -2,  0, -1, -2, -1,  4,  1, -3, -2, -2], 
[ 0, -1,  0, -1, -1, -1, -1, -2, -2, -1, -1, -1, -1, -2, -1,  1,  5, -2, -2,  0], 
[-3, -3, -4, -4, -2, -2, -3, -2, -2, -3, -2, -3, -1,  1, -4, -3, -2, 11,  2, -3],
[-2, -2, -2, -3, -2, -1, -2, -3,  2, -1, -1, -2, -1,  3, -3, -2, -2,  2,  7, -1], 
[ 0, -3, -3, -3, -1, -2, -2, -3, -3,  3,  1, -2,  1, -1, -2, -2,  0, -3, -1,  4]])

mat_names = 'ARNDCQEGHILKMFPSTWYV'
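
As a quick check of this layout (an illustrative example, not in the original notebook), the score of a pair of residues is looked up through their positions in mat_names:

print BLOSUM62[mat_names.index('W'), mat_names.index('W')]  # 11
print BLOSUM62[mat_names.index('W'), mat_names.index('G')]  # -2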

Let's also define a couple of short sequences to align


In [4]:
seq1 = 'LIFAGKQLEDGRTLS'
seq2 = 'QLIFAAPKQLPGRT'

Needleman-Wunsch algorithm

This algorithm is used to perform optimal global alignments. Optimal means that given a set of parameters (a substitution matrix and a gap penalty) the result is the best possible alignment.

The Needleman-Wunsch algorithm was one of the first applications of dynamic programming to bioinformatics problems. It is still widely used for optimal global alignment, especially when the quality of the result is of major importance.

The Needleman-Wunsch algorithm starts by creating an $(m+1) \times (n+1)$ matrix, where $m$ and $n$ are the lengths of the first and second sequences to align. This matrix $\mathbf{M}$ is called the scoring matrix. The $\mathbf{M}_{00}$ element is always zero and the rest of the elements of the first row and column are multiples of the gap penalty.


In [5]:
n, m = 6, 6
gap = -8
scoring = np.zeros((n+1 ,m+1), int)
scoring[0] = np.arange(n+1) * gap
scoring[:,0] = np.arange(n+1) * gap
print scoring


[[  0  -8 -16 -24 -32 -40 -48]
 [ -8   0   0   0   0   0   0]
 [-16   0   0   0   0   0   0]
 [-24   0   0   0   0   0   0]
 [-32   0   0   0   0   0   0]
 [-40   0   0   0   0   0   0]
 [-48   0   0   0   0   0   0]]

The rest of the matrix is filled as follows.

$$M_{ij} = max \begin{cases} M_{i-1,j} + gap \\ M_{i,j-1} + gap \\ M_{i-1,j-1} + S_{a,b} \end{cases}$$

Where $S_{a,b}$ is the value of the substitution matrix for a residue $a$ and a residue $b$.

When the value $\mathbf{M}_{ij}$ is obtained from the left or from the top, we have an indel in our alignment. When we calculate the score from the diagonal, we have a match. The information indicating whether we calculated the $\mathbf{M}_{ij}$ value from the top, left or diagonal element is stored in an auxiliary matrix called the traceback matrix. If the value $\mathbf{M}_{ij}$ is calculated from the top we store a 0, from the left a 1, and from the diagonal a 2. By reading the traceback matrix we can recover the best alignment.


In [6]:
def nw_matrices(seq1, seq2, mat=BLOSUM62, aa_mat=mat_names, gap=-8):
    """
    Computes the Needleman-Wunsch's scoring and traceback matrices.
    
    Parameters
    ----------
    seq1: String. One of the sequences to align
    seq2: String. The other sequence to align
    mat : Array. A substitution Matrix
    aa_mat : List. Column and rows names of the substitution matrix
    gap  : Int. Gap penalty
    
    """ 
    l1, l2 = len(seq1), len(seq2)
    # initialize scoring matrix and fill first row and first column with
    # multiples of the gap penalty.
    scoring = np.zeros((l1+1 ,l2+1), int)
    scoring[0] = np.arange(l2+1) * gap
    scoring[:,0] = np.arange( l1+1) * gap
    # create traceback matrix, fill first row with ones
    traceback = np.zeros( (l1+1 ,l2+1), int)
    traceback[0] = np.ones(l2+1)
    # fill the scoring and traceback matrix
    f = [0, 0, 0]
    for i in range(1, l1+1):
        res_i = aa_mat.index(seq1[i-1])
        for j in range(1, l2+1):
            res_j = aa_mat.index(seq2[j-1])
            # compute the values for a gap (indel) in either sequence
            f[0] = scoring[i-1, j] + gap
            f[1] = scoring[i, j-1] + gap
            # compute the value for aligning the two residues (match or mismatch)
            f[2] = scoring[i-1,j-1] + mat[res_i, res_j]
            score = max(f)
            scoring[i,j] = score
            traceback[i,j] = f.index(score)
    return scoring, traceback

In [7]:
nw_scoring, nw_traceback = nw_matrices(seq1, seq2)
print nw_scoring
print nw_traceback


[[   0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80  -88  -96 -104 -112]
 [  -8   -2   -4  -12  -20  -28  -36  -44  -52  -60  -68  -76  -84  -92 -100]
 [ -16  -10    0    0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80  -88]
 [ -24  -18   -8    0    6   -2  -10  -18  -26  -34  -42  -50  -58  -66  -74]
 [ -32  -25  -16   -8   -2   10    2   -6  -14  -22  -30  -38  -46  -54  -62]
 [ -40  -33  -24  -16  -10    2   10    2   -6  -14  -22  -30  -32  -40  -48]
 [ -48  -39  -32  -24  -18   -6    2    9    7   -1   -9  -17  -25  -30  -38]
 [ -56  -43  -40  -32  -26  -14   -6    1   10   12    4   -4  -12  -20  -28]
 [ -64  -51  -39  -38  -32  -22  -14   -7    2    8   16    8    0   -8  -16]
 [ -72  -59  -47  -42  -40  -30  -22  -15   -6    4    8   15    7    0   -8]
 [ -80  -67  -55  -50  -45  -38  -30  -23  -14   -4    0    7   14    6   -1]
 [ -88  -75  -63  -58  -53  -45  -38  -31  -22  -12   -8   -1   13   12    4]
 [ -96  -83  -71  -66  -61  -53  -46  -39  -29  -20  -14   -9    5   18   11]
 [-104  -91  -79  -72  -68  -61  -53  -47  -37  -28  -21  -15   -3   10   23]
 [-112  -99  -87  -77  -72  -69  -61  -55  -45  -36  -24  -23  -11    2   15]
 [-120 -107  -95  -85  -79  -71  -68  -62  -53  -44  -32  -25  -19   -6    7]]
[[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
 [0 2 2 1 1 1 1 1 1 1 1 1 1 1 1]
 [0 0 2 2 1 1 1 1 1 1 1 1 1 1 1]
 [0 0 0 2 2 1 1 1 1 1 1 1 1 1 1]
 [0 2 0 0 0 2 1 1 1 1 1 1 1 1 1]
 [0 0 0 0 0 0 2 1 1 1 1 1 2 1 1]
 [0 2 0 0 0 0 0 2 2 1 1 1 1 2 1]
 [0 2 0 0 0 0 0 0 2 2 1 1 1 1 1]
 [0 0 2 2 2 0 0 0 0 2 2 1 1 1 1]
 [0 0 0 2 0 0 0 0 0 2 0 2 1 2 1]
 [0 0 0 0 2 0 0 0 0 0 0 0 2 1 2]
 [0 0 0 0 0 2 0 0 0 0 0 0 2 2 1]
 [0 0 0 0 0 0 0 0 2 0 2 0 0 2 2]
 [0 0 0 2 2 0 2 0 0 0 2 2 0 0 2]
 [0 0 0 2 2 0 0 0 0 0 2 0 0 0 0]
 [0 0 0 0 2 2 2 2 0 0 0 2 0 0 0]]

In [8]:
def nw_backtrace(seq1, seq2, scoring=nw_scoring, traceback=nw_traceback):
    """
    Perform the Needleman-Wunsch backtrace and return the aligned
    sequences and the score of the alignment.
    
    Parameters
    ----------
    seq1: String. One of the sequences to align
    seq2: String. The other sequence to align
    scoring : Array. Scoring matrix computed with nw_matrices 
    traceback : Array. Traceback matrix computed with nw_matrices 
    
    """
    st1 = '' 
    st2 = ''
    i, j = traceback.shape
    i -= 1
    j -= 1
    max_score = scoring[i, j]
    while True:
        if traceback[i,j] == 0:
            st1 += seq1[i-1]
            st2 += '-'
            i -= 1
        elif traceback[i,j] == 1:
            st1 += '-'
            st2 += seq2[j-1]
            j -= 1
        elif traceback[i,j] == 2:
            st1 += seq1[i-1]
            st2 += seq2[j-1]
            i -= 1
            j -= 1
        if i == 0 and j == 0:
            break
    # reverse the strings
    st1 = st1[::-1]
    st2 = st2[::-1]
    
    return st1, st2, max_score

In [9]:
seq1_aln, seq2_aln, max_score = nw_backtrace(seq1, seq2)

print seq1_aln
print seq2_aln
max_score


-LIFAG-KQLEDGRTLS
QLIFAAPKQLP-GRT--
Out[9]:
7
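
A small display helper (not part of the original notebook) makes the result easier to read by printing a middle line that marks identical residues:

def show_alignment(aln1, aln2):
    """Print two aligned sequences with a midline marking identities."""
    midline = ''.join('|' if a == b and a != '-' else ' '
                      for a, b in zip(aln1, aln2))
    print aln1
    print midline
    print aln2

show_alignment(seq1_aln, seq2_aln)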

Smith-Waterman algorithm

This algorithm is a variation of the Needleman-Wunsch algorithm and also guarantees that it finds the optimal alignment, with the difference that it performs a local sequence alignment. Instead of looking at the total sequence, the Smith-Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure.

The main difference with the Needleman-Wunsch algorithm is that negative values in the scoring matrix are not allowed, i.e. now:

$$M_{ij} = max \begin{cases} M_{i-1,j} + gap \\ M_{i,j-1} + gap \\ M_{i-1,j-1} + S_{a,b} \\ 0 \end{cases}$$

The other difference is that the backtracking starts at the highest-scoring matrix cell (not necessarily the lower-right one) and traces back until a cell with a zero (in the scoring matrix) is found, yielding the highest-scoring local alignment.


In [10]:
def sw_matrices(seq1, seq2, mat=BLOSUM62, aa_mat=mat_names, gap=-8):
    """
    Computes the Smith-Waterman's scoring and traceback matrices.
    
    Parameters
    ----------
    seq1: String. One of the sequences to align
    seq2: String. The other sequence to align
    mat : Array. Substitution Matrix
    aa_mat : List. Column and rows names of the substitution matrix
    gap  : Int. Gap penalty
    
    """
    l1, l2 = len(seq1), len(seq2)
    # initialize scoring matrix with zeros
    scoring = np.zeros((l1+1 ,l2+1), int)
    # create traceback matrix, fill first row with ones
    traceback = np.zeros((l1+1 ,l2+1), int)
    traceback[0] = np.ones(l2+1)
    # f holds 4 values; the last one is always zero, so negative
    # scores are never accepted
    f = [0, 0, 0, 0]
    # fill the scoring and traceback matrix
    for i in range(1, l1+1):
        res_i = aa_mat.index(seq1[i-1])
        for j in range(1, l2+1):
            res_j = aa_mat.index(seq2[j-1])
            # compute the values for a gap (indel) in either sequence
            f[0] = scoring[i-1, j] + gap
            f[1] = scoring[i, j-1] + gap
            # compute the value for aligning the two residues (match or mismatch)
            f[2] = scoring[i-1,j-1] + mat[res_i, res_j]
            score = max(f)
            scoring[i,j] = score
            traceback[i,j] = f.index(score)
    return scoring, traceback

In [11]:
sw_scoring, sw_traceback = sw_matrices(seq1, seq2)
print sw_scoring
print sw_traceback


[[ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  4  2  0  0  0  0  0  0  4  0  0  0  0]
 [ 0  0  2  8  2  0  0  0  0  0  2  1  0  0  0]
 [ 0  0  0  2 14  6  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  6 18 10  2  0  0  0  0  0  0  0]
 [ 0  0  0  0  0 10 18 10  2  0  0  0  6  0  0]
 [ 0  1  0  0  0  2 10 17 15  7  0  0  0  8  0]
 [ 0  5  0  0  0  0  2  9 18 20 12  4  0  1  7]
 [ 0  0  9  2  0  0  0  1 10 16 24 16  8  0  0]
 [ 0  2  1  6  0  0  0  0  2 12 16 23 15  8  0]
 [ 0  0  0  0  3  0  0  0  0  4  8 15 22 14  7]
 [ 0  0  0  0  0  3  0  0  0  0  0  7 21 20 12]
 [ 0  1  0  0  0  0  2  0  2  1  0  0 13 26 19]
 [ 0  0  0  0  0  0  0  1  0  1  0  0  5 18 31]
 [ 0  0  4  2  0  0  0  0  0  0  5  0  0 10 23]
 [ 0  0  0  2  0  1  1  0  0  0  0  4  0  2 15]]
[[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
 [0 3 2 2 2 3 3 3 3 3 2 3 3 3 3]
 [0 3 2 2 2 3 3 3 3 3 2 2 3 3 3]
 [0 3 2 2 2 1 3 3 3 3 2 3 3 3 3]
 [0 3 3 3 0 2 1 1 3 3 3 3 2 3 2]
 [0 3 3 3 3 0 2 1 1 3 3 3 2 3 3]
 [0 2 3 3 3 0 0 2 2 1 3 3 3 2 1]
 [0 2 3 3 3 3 0 0 2 2 1 1 3 2 2]
 [0 3 2 2 2 3 3 0 0 2 2 1 1 1 2]
 [0 2 0 2 3 3 3 3 0 2 0 2 1 2 1]
 [0 2 3 3 2 3 3 3 3 0 0 0 2 1 2]
 [0 3 3 3 3 2 2 3 3 3 0 0 2 2 1]
 [0 2 3 3 3 3 2 3 2 2 3 3 0 2 2]
 [0 3 2 3 3 2 2 2 3 2 2 3 0 0 2]
 [0 3 2 2 2 3 3 3 3 3 2 3 3 0 0]
 [0 2 3 2 2 2 2 3 2 2 3 2 2 0 0]]

In [12]:
def sw_backtrace(seq1, seq2, scoring=sw_scoring, traceback=sw_traceback):
    """
    Perform the Smith-Waterman backtrace and return the aligned
    sequences and the score of the alignment.
    
    Parameters
    ----------
    seq1: String. One of the sequences to align
    seq2: String. The other sequence to align
    scoring : Array. Scoring matrix computed with sw_matrices 
    traceback : Array. Traceback matrix computed with sw_matrices 
    
    """
    st1 = '' 
    st2 = ''
    max_score = scoring.max()
    # position of the (first) highest-scoring cell
    i, j = np.unravel_index(np.argmax(scoring), scoring.shape)
    while True:
        if traceback[i,j] == 0:
            st1 += seq1[i-1]
            st2 += '-'
            i -= 1
        elif traceback[i,j] == 1:
            st1 += '-'
            st2 += seq2[j-1]
            j -= 1
        elif traceback[i,j] == 2:
            st1 += seq1[i-1]
            st2 += seq2[j-1]
            i -= 1
            j -= 1
        if i == 0 and j == 0 or scoring[i, j] == 0:
            break
    # reverse the strings
    st1 = st1[::-1]
    st2 = st2[::-1]
    return st1, st2, max_score

In [13]:
seq1_aln, seq2_aln, max_score = sw_backtrace(seq1, seq2)

print seq1_aln
print seq2_aln
max_score


LIFAG-KQLEDGRT
LIFAAPKQLP-GRT
Out[13]:
31

BLAST

Basic Local Alignment Search Tool (BLAST) is an algorithm for searching sequences in large databases. Instead of performing a full alignment as the Smith-Waterman algorithm does (slow), BLAST uses a heuristic (and fast) method. Speed is vital to making this algorithm practical given the huge biological sequence databases currently available. Even though BLAST cannot guarantee optimal alignments, it gives relatively good accuracy.

We will not study the BLAST algorithm, but we mention it here due to its importance and because it is related to the Needleman-Wunsch and Smith-Waterman algorithms.

Strategies for multiple sequence alignment

Multiple Sequence Alignment (MSA) is the generalization of pairwise alignment to more than two sequences. As you can imagine, MSA is not as straightforward as aligning two sequences. A naïve approach to producing an MSA uses the dynamic programming technique to identify the globally optimal alignment. Hence, for $n$ individual sequences, this naïve method requires constructing the $n$-dimensional equivalent of the matrix used in standard pairwise sequence alignment. The search space thus increases exponentially with increasing $n$ and is also strongly dependent on sequence length.

Finding the global optimum for $n$ sequences this way has been shown to be an NP-complete problem.

Progressive alignment construction (greedy approach)

This is the most commonly used method to generate multiple sequence alignments. It is a heuristic search that begins by computing all possible pairwise alignments; then a guide tree is determined by an efficient clustering method such as neighbor-joining or UPGMA (see below). The guide tree is used to guide the build-up of the MSA. The final MSA is obtained by combining the most similar pair first and progressing to the less similar ones.
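
As an illustration of the first stage only, here is a minimal sketch (not part of the original notebook) that collects all pairwise Needleman-Wunsch scores, using the functions defined above, into a matrix; in a real progressive aligner these scores would be converted into distances to build the guide tree. The third sequence is a made-up example.

seqs = ['LIFAGKQLEDGRTLS', 'QLIFAAPKQLPGRT', 'LIFAGKQLEDGRT']
n_seqs = len(seqs)
pair_scores = np.zeros((n_seqs, n_seqs), int)
for a in range(n_seqs):
    for b in range(a + 1, n_seqs):
        # align each pair and keep only the alignment score
        sc, tb = nw_matrices(seqs[a], seqs[b])
        pair_scores[a, b] = pair_scores[b, a] = nw_backtrace(seqs[a], seqs[b], sc, tb)[2]
print pair_scores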

Progressive alignments are not guaranteed to be globally optimal. The main reason is that errors made at any stage are propagated through to the final result. Performance is also particularly bad when all of the sequences in the set are rather distantly related.

Iterative methods (non-greedy approach)

Iterative methods are methods that, unlike progressive methods, allow for realignment, i.e. alignments done in previous stages can be modified. These methods provide more accurate results at the expense of taking more computational resources. There are several ways to actually implement these iterative methods; one possibility is to use standard optimization techniques like genetic algorithms and simulated annealing (we will explain these techniques in the final chapter), another is to use Hidden Markov Models, which are probabilistic models that can be applied to a large set of problems. In the context of MSA, HMMs can be used to assign likelihoods to all possible combinations of gaps, matches, and mismatches to determine the most likely MSA or set of possible MSAs. We will discuss HMMs in the next chapter in the context of secondary structure prediction.

Computational Molecular Phylogenetics

A phylogenetic tree is a model showing the inferred evolutionary relationships among a set of biological species (or biomolecules, or genes, or populations). The leaves (or external nodes) of the tree represent our current data (e.g. extant species) and the internal nodes represent the inferred common ancestors.

There are different methods to build such trees.

Distance-based methods

In this type of method, a tree is derived from a matrix of pairwise distances, using a clustering algorithm like UPGMA (unweighted pair-group method using arithmetic averages) or neighbor joining. The advantage of distance methods is that they are computationally fast. On the other hand, these methods do not have the flexibility of statistical methods.

Character-based methods

Instead of using a global metric like distance to cluster the data, these methods use an optimality criterion (an objective function) to measure how well a proposed tree fits the available data.

Maximum parsimony

Maximum parsimony is a method of identifying the phylogenetic tree that requires the smallest number of evolutionary events (character changes) to explain the observed data.

The most naive way of identifying the most parsimonious tree is by brute force (enumerating each possible tree and identifying the one with the smallest score). However, this is only possible for a relatively small number of species; in fact, the problem of identifying the most parsimonious tree is known to be NP-hard, and consequently a number of heuristic search methods have been developed. Most such methods involve a steepest-descent-style minimization mechanism operating on a tree-rearrangement criterion.
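
To make the idea of counting character changes concrete, here is a minimal sketch (not from the original notebook) of Fitch's small-parsimony count for a single aligned column on a fixed, made-up tree topology; real implementations score every column and search over topologies.

def fitch(node):
    """Return (possible states, number of changes) for the subtree at node."""
    if isinstance(node, str):        # a leaf holds the observed character state
        return {node}, 0
    left, right = node
    set_l, cost_l = fitch(left)
    set_r, cost_r = fitch(right)
    common = set_l & set_r
    if common:                       # shared state: no extra change needed
        return common, cost_l + cost_r
    return set_l | set_r, cost_l + cost_r + 1   # no shared state: one change

# one aligned column observed in five species, on an arbitrary rooted tree
tree = ((('A', 'A'), 'G'), ('G', 'T'))
print fitch(tree)[1]   # minimum number of substitutions for this column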

Maximum likelihood and Bayesian statistics

The maximum likelihood and Bayesian methods use standard statistical techniques for inferring phylogenetic trees. Both use substitution models to assign probabilities to trees. Roughly, the more mutations needed to explain the data, the lower the probability a tree will have. The main advantage of statistical methods is their flexibility to build models that take into account varying rates of evolution across both lineages and sites. Additionally, Bayesian methods can be used to include more information through the definition of a proper prior.

Clustering

Clustering consists of grouping a set of objects into subsets such that all elements within a group are more similar to each other than they are to the others. From this definition it is obvious that the chosen measure of similarity will affect the results. There are many situations in which we want to cluster our data: to classify proteins by their structure, to find the evolutionary history of sequences/species, to understand why certain compounds are good antibiotics and others are not, to group people by similar interests, etc.

Neighbor joining

Neighbor joining is a clustering method specifically designed for the creation of phylogenetic trees. It is probably the most widely used distance-based algorithm. Neighbor joining uses an agglomerative (or bottom-up) approach to clustering. Like other agglomerative methods it can be summarized in four steps:

  1. Choose the pair of nodes with the highest similarity
  2. Merge the pair into a new node/cluster
  3. Compute the distances between the new node and the former existing nodes
  4. Repeat the procedure until only one node is left

Neighbor joining may be viewed as a greedy algorithm for finding a topology that minimizes the tree length: at each step it joins the pair of nodes that gives the greatest decrease in the estimated tree length. Although this procedure is not guaranteed to find the optimal topology (by this criterion), it often does, and in general the result is quite close.

Neighbor joining requires as input a distance matrix, computed according to some distance estimation method. At the beginning, all the nodes are linked together in a star tree. The first step is to choose the pair of nodes with the highest similarity; this is done by computing the Q-matrix and finding the smallest value in it.

$$Q(i,j)=(n-2)d(i,j)-\sum_{k=1}^n d(i,k) - \sum_{k=1}^n d(j,k)$$

where $n$ is the number of nodes and $d(i,j)$ is the distance between nodes $i$ and $j$.

After the two nodes are merged into a new node, it is necessary to compute the distances between each of the joined nodes and the new node using the following formulas:

$$\delta(a,u)=\frac{1}{2}d(a,b)+\frac{1}{2(n-2)} \left [ \sum_{k=1}^n d(a,k) - \sum_{k=1}^n d(b,k) \right ]$$

$$\delta(b,u)=d(a,b)-\delta(a,u)$$

Where:
$a$ and $b$ are the paired nodes and $u$ is the newly created node.

The final step (before continuing with the next iteration) is to compute the distance from the new node to the remaining unclustered nodes:

$$d(u,k)=\frac{1}{2} [d(a,k)+d(b,k)-d(a,b)]$$

Let's start by defining an arbitrary distance matrix


In [14]:
dist_matrix = np.array([
[ 0, 5, 9, 9, 8],
[ 5, 0,10,10, 9],
[ 9,10, 0, 8, 7],
[ 9,10, 8, 0, 3],
[ 8, 9, 7, 3, 0]])

In [15]:
def nb_joining(d_mat):
    """
    """
    
    n = len(d_mat)
    if n == 2:
        #print d_mat
        return d_mat
    # compute the distance from each node to the rest of the nodes
    net_div = np.array([np.sum(d_mat, axis=1)])
    # Compute the Q matrix
    q_mat = (n-2) * d_mat - net_div - net_div.T
    np.fill_diagonal(q_mat, 0)
    #print q_mat
    # compute the indices of the lowest value.
    # argmin flattens the array before computing the minimum; unravel_index
    # converts the flat index back to indices of the unflattened array
    lw = np.unravel_index(np.argmin(q_mat), np.shape(q_mat))
    #print lw
    # Distance from the pair members to the new node
    d_ij = 0.5 * d_mat[lw] \
    + 1/(2*(n-2)) * (np.sum(d_mat[lw[0]]) - np.sum(d_mat[lw[1]])) 
    d_ji = d_mat[lw] - d_ij
    print d_ij, d_ji
    # Compute a new distance-matrix using the new node
    new_node_d = 0.5 * (d_mat[lw[0]] + d_mat[lw[1]] - d_mat[lw])
    d_mat = np.delete(d_mat, lw, axis=0)
    d_mat = np.delete(d_mat, lw, axis=1)
    new_d_mat = np.zeros((n-1, n-1))
    new_d_mat[0] = new_node_d[1:]
    new_d_mat[:,0] = new_node_d[1:]
    new_d_mat[1:,1:] = d_mat
    print new_d_mat
    return nb_joining(new_d_mat)
    

nb_joining(dist_matrix)


2.0 3.0
[[ 0.  7.  7.  6.]
 [ 7.  0.  8.  7.]
 [ 7.  8.  0.  3.]
 [ 6.  7.  3.  0.]]
3.0 4.0
[[ 0.  4.  3.]
 [ 4.  0.  3.]
 [ 3.  3.  0.]]
2.0 2.0
[[ 0.  1.]
 [ 1.  0.]]

Graphically, the function nb_joining performs the following steps:

K-means clustering

k-means clustering aims to partition $n$ observations into $k$ clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.

K-means is a heuristic algorithm and hence there is no guarantee that it will converge to the global optimum. Additionally, the results may depend on the initial conditions, as with other stochastic algorithms. A common practice is to run the algorithm multiple times with different starting conditions and keep the best result (a sketch of this is shown after the example below).

K-means clustering proceeds essentially in two steps:

  1. Assign each point to a cluster
  2. Update the mean for each cluster

K-means clustering can be seen as a special case of the expectation-maximization algorithm; while k-means tends to find clusters of comparable spatial extent, expectation-maximization allows clusters to have different shapes.


In [6]:
data = np.genfromtxt('easy_clust.txt')
x = data[:,0]
y = data[:,1]
plt.plot(x, y, 's');



In [16]:
def assign(data, centroids):
    """
    Assign each point in the data set to the closest centroid.
    
    Parameters
    ----------
    data: Array. data set to cluster
    centroids : Array. Coordinates of the centroids
    
    """
    # k clusters
    k = len(centroids)
    # create a list of k lists
    clusters = [[] for _ in range(k)]
    # loop through all the data points and check their
    # distance to each centroid. Add the point to the
    # closest centroid
    for i in data:
        f = []
        for j in centroids:
            #dist = (np.sum((j - i)**2))**0.5
            dist = np.sum((j - i)**2)
            f.append(dist)
        mn = f.index(min(f))
        clusters[mn].append(i)
    return clusters


def k_means(data, k):
    """
    Perform the k_means algorithm
    
    Parameters
    ----------
    data: Array. data set to cluster
    k : Int. Number of clusters
    
    """    
    # initialize by selecting k distinct data points at random
    centroids = data[np.random.choice(len(data), size=k, replace=False)]
    # use the k data points as the initial centroids
    clusters = assign(data, centroids)    
    centroids_old = centroids
    
    # loop until the clusters stop changing
    while True:
        centroids = []
        for cluster in clusters:
            centroids.append(np.mean(cluster, axis=0))
        centroids = np.array(centroids)
        if np.array_equal(centroids, centroids_old):
            return clusters
        else:
            clusters = assign(data, centroids)
            centroids_old = centroids

In [31]:
clusters = k_means(data, 3)
for i in range(len(clusters)):
    x = np.array(clusters[i])[:,0]
    y = np.array(clusters[i])[:,1]
    plt.plot(x, y, 'd')
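
A minimal sketch of the restart strategy mentioned earlier: run k_means several times (the number of restarts, 10, is arbitrary) and keep the partition with the lowest within-cluster sum of squares.

def within_ss(clusters):
    """Total squared distance from every point to its cluster mean."""
    total = 0
    for cluster in clusters:
        cluster = np.array(cluster)
        total += np.sum((cluster - cluster.mean(axis=0))**2)
    return total

best_ss, best_clusters = np.inf, None
for _ in range(10):
    clusters = k_means(data, 3)
    ss = within_ss(clusters)
    if ss < best_ss:
        best_ss, best_clusters = ss, clusters
print best_ss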



In [32]:
def make_circles(n=500, scatter=3):
    """
    Draw two concentric circles made of scattered points
    
    Parameters
    ----------
    n: int. Number of points in each circle
    scatter : float. Controls the spread of the points.
    """

    r0 = 50 + np.random.normal(loc=0, scale=scatter, size=n)
    r1 = 25 + np.random.normal(loc=0, scale=scatter, size=n)
    r = np.concatenate((r0, r1))
    
    theta0 = np.linspace(0, np.pi*2, n)
    theta = np.concatenate((theta0, theta0))
    
    x = r * np.cos(theta)
    y = r * np.sin(theta)
    return np.vstack((x, y)).T

In [33]:
data_circ = make_circles(500, 3)
x = data_circ[:,0]
y = data_circ[:,1]
plt.plot(x, y, 's');



In [36]:
clusters = k_means(data_circ, 2)
for i in range(len(clusters)):
    x = np.array(clusters[i])[:,0]
    y = np.array(clusters[i])[:,1]
    plt.plot(x, y, 'd')


The previous example shows that, unless you want clusters that contain portions of both the inner and the outer circle, k-means may not be the best algorithm for this particular data set. Indeed, there is no ideal clustering algorithm that works for every possible data set, although some algorithms (like DBSCAN) perform well in a wider range of situations than others. As is always the case for any model/algorithm, it is important to evaluate the result of a clustering algorithm in the context of the particular application and using the domain knowledge at hand.
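
For instance, a density-based method can recover the two rings. Here is a minimal sketch assuming scikit-learn is installed; the eps and min_samples values are rough guesses and may need tuning for this data set.

from sklearn.cluster import DBSCAN

labels = DBSCAN(eps=5, min_samples=5).fit_predict(data_circ)
for label in np.unique(labels):          # -1, if present, marks noise points
    members = data_circ[labels == label]
    plt.plot(members[:, 0], members[:, 1], 'd')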

Further Reading

Henikoff & Henikoff 1992. Amino Acid Substitution Matrices from Protein Blocks. PNAS 89:10915-10919.
Styczynski et al 2008. BLOSUM62 miscalculations improve search performance.
Jason Kinser 2009. Python For Bioinformatics.
Saitou N, Nei M 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees.
Yang 2014. Molecular Evolution: A Statistical Approach.
Clustal. A family of algorithms for multiple sequence alignment.