Protein and RNA structure from sequence

Secondary structure prediction (SSP) is the task of predicting the secondary structure of proteins (and also RNA) using only a sequence as input. For proteins, SSP methods generally assign each amino acid to one of four categories: $\alpha$-helices, $\beta$-strands, turns, and coils. Amino acids in the coil class are those that do not belong to any of the other classes. Some methods subdivide the $\alpha$-helix and $\beta$-strand categories into subcategories.

The gold standard for testing an SSP method is to compare its predictions against the secondary structure assignment obtained by applying methods such as DSSP or STRIDE to the X-ray-determined structure of the protein.

SSP methods for proteins seem to have reached a plateau at almost 80% accuracy, measured using the $Q_3$ score.

$$ Q_3 = \frac{N_\alpha + N_\beta + N_c}{N} \times 100 $$

where $N$ is the total number of residues and $N_\alpha$, $N_\beta$, $N_c$ are the number of correctly predicted residues in $\alpha$-helix, $\beta$-strand, and coil, respectively.
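
For a quick sense of what $Q_3$ measures, the small helper below compares a predicted assignment string against a reference string (e.g. one derived from DSSP) over three states (H for helix, B for strand, C for coil) and returns the percentage of matching positions. The `q3` function is an illustrative sketch, not part of the original notebook or any standard package.

In [ ]:
def q3(predicted, reference):
    """Percentage of residues whose 3-state assignment matches the reference."""
    assert len(predicted) == len(reference)
    correct = sum(p == r for p, r in zip(predicted, reference))
    return 100.0 * correct / len(reference)

print(q3('HHHHCCCBBB', 'HHHCCCCBBB'))  # 90.0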

Attempts to improve this accuracy have proven difficult, and in that sense secondary structure prediction for proteins is almost a solved problem. The reason is that the criteria used to define secondary structure are not unique. For instance, DSSP and STRIDE can disagree on as much as ~14% of the residues, and STRIDE and crystallographers' assignments can differ for around 10% of the residues. The extent of the differences among methods depends on several factors, such as the quality of the protein model and the type of protein; for example, all methods tend to agree on regular helical portions and disagree more on delimiting the ends of helices or $\beta$-strands.

Since SSP methods have high accuracy, they are often used as early steps in more complicated tasks such as fold recognition, ab initio protein structure prediction, and classification of structural motifs.


In [19]:
%matplotlib inline
from __future__ import division
import numpy as np
np.set_printoptions(linewidth=200)
import matplotlib.pyplot as plt
import seaborn as sns

Chou-Fasman

The Chou-Fasman method is a classical SSP method; one of its major features is its simplicity. It is based on the estimated propensities of each of the 20 proteinogenic amino acids to be in $\alpha$-helices, $\beta$-strands, or turns ($P_\alpha$, $P_\beta$, $P_{turn}$). The method was not conceived as a computer algorithm, but as a set of rules to assist scientists in classifying residues as forming part of helices, strands, or turns. Some authors have pointed out that one problem of the Chou-Fasman method is the difficulty of interpreting regions with equal potential for $\alpha$-helices and $\beta$-strands. The lack of clear rules has prevented programmers from writing computer algorithms that provide good performance without human intervention. On the contrary, the authors of the method argue that the ambiguity that arises in determining regions of equal potential is not a drawback of the method, but rather provides valuable insight into regions that can undergo conformational change.

An abbreviated version of the Chou-Fasman rules:

  1. A cluster of four helical residues ($P_\alpha \geq 1.06$) out of six along the protein sequence will initiate a helix. The helical segment is extended in both directions until a tetrapeptide with $P_\alpha \lt 1.0$ is reached. Proline cannot occur in a helix, except within the last three residues at the N-terminal end of the helix. Any segment that is at least six residues long with $P_\alpha \gt 1.03$ and $P_\alpha \gt P_\beta$ is predicted as helical.

  2. A cluster of three $\beta$-forming residues ($P_\beta \geq 1.05$) out of five along the protein sequence will initiate a $\beta$-strand. The strand segment is extended in both directions until a tetrapeptide with $P_\beta \lt 1.0$ is reached. Any segment with $P_\beta \gt 1.05$ as well as $P_\beta \gt P_\alpha$ is predicted as a $\beta$-strand.

  3. The propensity of a turn starting at residue $i$ is calculated as $p_t = f_i \times f_{i+1} \times f_{i+2} \times f_{i+3}$, where the $f$ values are the position-specific turn frequencies. Tetrapeptides with $p_t \gt 0.75 \times 10^{-4}$, with $P_{turn} \gt 1.0$, and with $P_{turn}$ greater than both $P_\alpha$ and $P_\beta$ are predicted as $\beta$-turns.

  4. Any segment containing overlapping $\alpha$ and $\beta$ regions is predicted as helical if $P_\alpha \gt P_\beta$, or as a $\beta$-sheet if $P_\beta \gt P_\alpha$.

Following the original spirit of the method, we are going to implement a version that computes the propensity of each residue to be in an $\alpha$-helix, $\beta$-strand, or $\beta$-turn. Resolving conflicts in overlapping regions is left to the user.


In [3]:
def Chou_Fasman(seq):
    """
    An implementation of the simple/original version of the 
    Chou-Fasman method. The result is the propensity per residue,
    smoothed using a 4-residue window. The interpretation of the
    overlapping regions is left to the user.
    
    Parameters
    ----------
    
    seq: Str. Protein sequence
    
    """
    # Chou-Fasman parameters computed from a set of 29 Proteins.
    Pa = {'A':1.42, 'C':0.70, 'E':1.51, 'D':1.01, 'G':0.57, 'F':1.13,
          'I':1.08, 'H':1.00, 'K':1.16, 'M':1.45, 'L':1.21, 'N':0.67,
          'Q':1.11, 'P':0.57, 'S':0.77, 'R':0.98, 'T':0.83, 'W':1.08,
          'V':1.06, 'Y':0.69}
    Pb = {'A':0.83, 'C':1.19, 'E':0.37, 'D':0.54, 'G':0.75, 'F':1.38,
          'I':1.60, 'H':0.87, 'K':0.74, 'M':1.05, 'L':1.30, 'N':0.89,
          'Q':1.10, 'P':0.55, 'S':0.75, 'R':0.93, 'T':1.19, 'W':1.37,
          'V':1.70, 'Y':1.47}
    Pturn = {'A':0.66, 'C':1.19, 'E':0.74, 'D':1.46, 'G':1.56, 'F':0.60,
             'I':0.47, 'H':0.95, 'K':1.01, 'M':0.60, 'L':0.59, 'N':1.56,
             'Q':0.98, 'P':1.52, 'S':1.43, 'R':0.95, 'T':0.96, 'W':0.96,
             'V':0.50, 'Y':1.14}
    F0 = {'A':0.060, 'C':0.149, 'E':0.056, 'D':0.147, 'G':0.102, 'F':0.059,
          'I':0.043, 'H':0.140, 'K':0.055, 'M':0.068, 'L':0.061, 'N':0.161,
          'Q':0.074, 'P':0.102, 'S':0.120, 'R':0.070, 'T':0.086, 'W':0.077,
          'V':0.062, 'Y':0.082}
    F1 = {'A':0.076, 'C':0.050, 'E':0.060, 'D':0.110, 'G':0.085, 'F':0.041,
          'I':0.034, 'H':0.047, 'K':0.115, 'M':0.082, 'L':0.025, 'N':0.083,
          'Q':0.098, 'P':0.301, 'S':0.139, 'R':0.106, 'T':0.108, 'W':0.013,
          'V':0.048, 'Y':0.065}
    F2 = {'A':0.035, 'C':0.117, 'E':0.077, 'D':0.179, 'G':0.190, 'F':0.065,
          'I':0.013, 'H':0.093, 'K':0.072, 'M':0.014, 'L':0.036, 'N':0.191,
          'Q':0.037, 'P':0.034, 'S':0.125, 'R':0.099, 'T':0.065, 'W':0.064,
          'V':0.028, 'Y':.114}
    F3 = {'A':0.058, 'C':0.128, 'E':0.064, 'D':0.081, 'G':0.152, 'F':0.065,
          'I':0.056, 'H':0.054, 'K':0.095, 'M':0.055, 'L':0.070, 'N':0.091,
          'Q':0.098, 'P':0.068, 'S':0.106, 'R':0.085, 'T':0.079, 'W':0.167,
          'V':0.053, 'Y':0.125}
    
    seq_lenght = len(seq)
    ns = 4
    Pa_w = []
    Pb_w = []
    Pturn_w = []
    
    # Compute the turn probability p_t for each tetrapeptide starting at
    # residue i (the last 3 residues cannot start a tetrapeptide)
    pt = [(F0[seq[i]] * F1[seq[i+1]] * F2[seq[i+2]] * F3[seq[i+3]])
          for i in range(seq_lenght-3)]
    
    #Compute the propensity per residue smoothed using a 4 residue window
    for start in range(seq_lenght):
        Pa_w.append(sum([Pa[i] for i in seq[start:start+ns]]) / ns)
        Pb_w.append(sum([Pb[i] for i in seq[start:start+ns]]) / ns)
        Pturn_w.append(sum([Pturn[i] for i in seq[start:start+ns]]) / ns)
            
    plt.figure(figsize=(8,8))
    plt.subplot(2,1,1)
    plt.plot(Pa_w, label= r'$<P_\alpha>$')
    plt.plot(Pb_w, label= r'$<P_\beta>$')
    plt.plot(Pturn_w, label= r'$<P_{turn}>$')
    plt.axhline(y=1, c='black', ls='--')
    plt.xlabel('Residues')
    plt.legend(loc=0)
    plt.subplot(2,1,2)
    plt.plot(pt, c=(0.5, 0.45, 0.70), label= r'$<P_t>$')
    plt.axhline(y=7.5E-5, c='black', ls='--')
    plt.xlabel('Residues')
    plt.legend(loc=0)

In [4]:
seq = 'MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG'
Chou_Fasman(seq)


Now we are going to try to automate the Chou-Fasman method, but remember that this method was never intended to be a computer algorithm, and the rules for deciding which type of secondary structure a segment belongs to are rather complex. For example, what should we do if two regions overlap, one region is contained in the other, and the contained region has a larger average propensity? If we break the longer region, the remaining residues may no longer be long enough to form a stable structure. In short, should a small region of stable structure disrupt a longer one? Maybe yes, maybe not.

So, you have been warned: don't take the following example too seriously, and think of it instead as an exercise in the Pythonic arts.


In [5]:
def Chou_Fasman_auto(seq):
    """
    A fully automatic version of the Chou-Fasman method. 
    
    Parameters
    ----------
    
    seq: Str. Protein sequence
    
    """
    # Chou-Fasman parameters computed from a set of 29 Proteins.
    Pa = {'A':1.42, 'C':0.70, 'E':1.51, 'D':1.01, 'G':0.57, 'F':1.13,
          'I':1.08, 'H':1.00, 'K':1.16, 'M':1.45, 'L':1.21, 'N':0.67,
          'Q':1.11, 'P':0.57, 'S':0.77, 'R':0.98, 'T':0.83, 'W':1.08,
          'V':1.06, 'Y':0.69}
    Pb = {'A':0.83, 'C':1.19, 'E':0.37, 'D':0.54, 'G':0.75, 'F':1.38,
          'I':1.60, 'H':0.87, 'K':0.74, 'M':1.05, 'L':1.30, 'N':0.89,
          'Q':1.10, 'P':0.55, 'S':0.75, 'R':0.93, 'T':1.19, 'W':1.37,
          'V':1.70, 'Y':1.47}
    Pturn = {'A':0.66, 'C':1.19, 'E':0.74, 'D':1.46, 'G':1.56, 'F':0.60,
             'I':0.47, 'H':0.95, 'K':1.01, 'M':0.60, 'L':0.59, 'N':1.56,
             'Q':0.98, 'P':1.52, 'S':1.43, 'R':0.95, 'T':0.96, 'W':0.96,
             'V':0.50, 'Y':1.14}
    F0 = {'A':0.060, 'C':0.149, 'E':0.056, 'D':0.147, 'G':0.102, 'F':0.059,
          'I':0.043, 'H':0.140, 'K':0.055, 'M':0.068, 'L':0.061, 'N':0.161,
          'Q':0.074, 'P':0.102, 'S':0.120, 'R':0.070, 'T':0.086, 'W':0.077,
          'V':0.062, 'Y':0.082}
    F1 = {'A':0.076, 'C':0.050, 'E':0.060, 'D':0.110, 'G':0.085, 'F':0.041,
          'I':0.034, 'H':0.047, 'K':0.115, 'M':0.082, 'L':0.025, 'N':0.083,
          'Q':0.098, 'P':0.301, 'S':0.139, 'R':0.106, 'T':0.108, 'W':0.013,
          'V':0.048, 'Y':0.065}
    F2 = {'A':0.035, 'C':0.117, 'E':0.077, 'D':0.179, 'G':0.190, 'F':0.065,
          'I':0.013, 'H':0.093, 'K':0.072, 'M':0.014, 'L':0.036, 'N':0.191,
          'Q':0.037, 'P':0.034, 'S':0.125, 'R':0.099, 'T':0.065, 'W':0.064,
          'V':0.028, 'Y':.114}
    F3 = {'A':0.058, 'C':0.128, 'E':0.064, 'D':0.081, 'G':0.152, 'F':0.065,
          'I':0.056, 'H':0.054, 'K':0.095, 'M':0.055, 'L':0.070, 'N':0.091,
          'Q':0.098, 'P':0.068, 'S':0.106, 'R':0.085, 'T':0.079, 'W':0.167,
          'V':0.053, 'Y':0.125}

    seq_lenght = len(seq)
    a_ns = 4
    b_ns = 3
    a_min_len = 6
    b_min_len = 5
    Pa_w = []
    Pb_w = []
    Pturn_w = []


    #Compute the propensity per residue smoothed using a 4 residue window
    for start in range(seq_lenght):
        Pa_w.append(sum([Pa[i] for i in seq[start:start+4]]) / 4)
        Pb_w.append(sum([Pb[i] for i in seq[start:start+4]]) / 4)
        Pturn_w.append(sum([Pturn[i] for i in seq[start:start+4]]) / 4)
            
            
    # Compute which regions have high propensities for alpha-helix
    alpha_regions = []
    count = 0
    for res in range(seq_lenght - a_min_len):
        begin = res
        end = res + a_min_len
        primer_alpha = [i for i in range(begin, end) if Pa[seq[i]] >= 1.06]
        # if the nucleation window has at least a_ns helix-forming residues, try to extend the region in both directions
        if len(primer_alpha) >= a_ns:
            while Pa_w[begin]  > 1 and begin > 0:
                begin -= 1
            while Pa_w[end-3]  > 1 and end < seq_lenght-1:
                end += 1
            alpha_region = range(begin, end+1)
            P_alpha_region = sum([Pa[i] for i in seq[begin:end+1]]) / len(alpha_region)
            if P_alpha_region > 1.03 and len(alpha_region) > a_min_len:
                if 'P' not in [seq[i] for i in alpha_region[3:-1]]:
                    alpha_regions.extend(alpha_region)


    # Compute which regions have high propensities for beta-strand
    beta_regions = []
    for res in range(seq_lenght - b_min_len):
        begin = res
        end = res + b_min_len
        primer_betas = [i for i in range(begin, end) if Pb[seq[i]] >= 1.05]
        # if the nucleation window has at least b_ns strand-forming residues, try to extend the region in both directions
        if len(primer_betas) >= b_ns:
            while Pb_w[begin]  > 1 and begin > 0:
                begin -= 1
            while Pb_w[end-3]  > 1 and end < seq_lenght-1:
                end += 1  
            beta_region = range(begin, end+1)
            P_beta_region = sum([Pb[i] for i in seq[begin:end+1]]) / len(beta_region)
            if P_beta_region > 1.05 and len(beta_region) > b_min_len:
                beta_regions.extend(beta_region)


    # We now compute the propensities for bends
    turn_regions = []
    pt = [(F0[seq[i]] * F1[seq[i+1]] * F2[seq[i+2]] * F3[seq[i+3]])
        for i in range(len(seq)-3)]
    for i in range(len(pt)):
        if Pturn_w[i] > 1 and pt[i] > 7.5E-5:
            if Pturn_w[i] > Pa_w[i] and Pturn_w[i] > Pb_w[i]:
                turn_regions.append(i)


    # To solve for overlapping regions we are going to use Python's sets
    alphas = set(alpha_regions)
    betas = set(beta_regions)
    
    reduced_alpha = []
    reduced_beta = []
    
    # check for regions that are helix or beta and do not overlap
    reduced_alpha.extend(alphas.difference(betas))  # alphas, no doubt
    reduced_beta.extend(betas.difference(alphas))  # betas, no doubt
    

    # check for overlapping regions
    overlap = sorted(alphas.intersection(betas))
    # define regions by grouping consecutive numbers
    regions = []
    if overlap:
        tag = [i-j for i, j in enumerate(overlap)]
        for i in set(tag):
            regions.append([x for x, y in zip(overlap, tag) if y == i])
    # find out if the overlapping regions are alpha or beta
    for region in regions:
        Pa_ave = sum([Pa[i] for i in seq[region[0]:region[-1]+1]])
        Pb_ave = sum([Pb[i] for i in seq[region[0]:region[-1]+1]])
        if Pa_ave > Pb_ave:
            reduced_alpha.extend(region)
        else:
            reduced_beta.extend(region)


    # by default residues are coils
    coils = ['C'] * seq_lenght
    # replace the coils, by alpha, beta or turns portions.
    for i in reduced_alpha:
        coils[i] = 'H'
    for i in reduced_beta:
        coils[i] = 'B'
    for turn in turn_regions:
        coils[turn] = 'T'
        
    print '%s\n%s' % (seq, ''.join(coils))

In [6]:
seq = 'MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG'
Chou_Fasman_auto(seq)


MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG
BBBBBBBTBBBBBBBBBTTHHHHHHHHHHHHTHHHHTHHHHHHHHHHHHTTTHHHTTTCCCCCBBBBBBBBBTBBC
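
As a quick, informal check we can summarize the composition of the predicted string. The snippet below is just a convenience sketch using the standard-library `Counter` on the prediction printed above; it is not part of the original notebook.

In [ ]:
from collections import Counter

prediction = 'BBBBBBBTBBBBBBBBBTTHHHHHHHHHHHHTHHHHTHHHHHHHHHHHHTTTHHHTTTCCCCCBBBBBBBBBTBBC'
counts = Counter(prediction)
for ss, n in sorted(counts.items()):
    print('%s: %2d residues (%4.1f%%)' % (ss, n, 100.0 * n / len(prediction)))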

RNA Secondary structure prediction

For a long time RNA molecules were regarded as "merely" an intermediary in the flow of genetic information. Now we recognize nucleic acids as central molecules for the storage, flow, and regulation of genetic and epigenetic information. Interest in nucleic acids, and especially in RNA molecules, has increased in the last few years with the discovery of ribozymes, riboswitches, and several types of functional non-coding RNA molecules.

Knowing the 3D structure of RNAs can be of significant importance for understanding their biological role. Here we will briefly discuss one approach to the RNA secondary structure problem.

The Nussinov-Jacobson algorithm

Instead of using parameters derived from statistical analysis of known structures, the Nussinov-Jacobson algorithm uses a simpler approach. It assumes that the main driving force determining the structure of RNA is the formation of hydrogen bonds between base pairs; contributions like base stacking and pseudoknots are not considered in the model. Hence, the solution to the secondary structure problem will be the one that maximizes the number of base pairs.

The Nussinov-Jacobson algorithm is a dynamic programming algorithm similar to the Needleman-Wunsch and Smith-Waterman algorithms that we have already seen. It uses a very simple substitution matrix: basically, the element is $1$ for $G-C$ or $A-U$ pairs and $0$ otherwise.


In [7]:
nus_sub_mat = np.array([[0, 1, 0, 0],
                        [1, 0, 0, 0],
                        [0, 0, 0, 1],
                        [0, 0, 1, 0]])
mat_names ='GCUA'
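
With this matrix and `mat_names` defined, the score for a given pair is obtained by indexing the matrix with the positions of the two bases in `mat_names`. A quick check (illustrative only, not part of the original notebook):

In [ ]:
# Score of a G-C pair and of a G-A pair with the simple 0/1 matrix
print(nus_sub_mat[mat_names.index('G'), mat_names.index('C')])  # 1
print(nus_sub_mat[mat_names.index('G'), mat_names.index('A')])  # 0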

Other substitution matrices are possible, like the one below, which takes into account that $GC$ pairs are more stable than $AU$ pairs and that the non-canonical base pair $GU$ occurs fairly often in RNA. Although $AU$ pairs and $GU$ pairs both form two hydrogen bonds, the value for $GU$ pairs in the substitution matrix is 1, reflecting that $GU$ pairs are somewhat less stable than $AU$ pairs.


In [8]:
nus_sub_mat = np.array([[0, 3, 1, 0],
                        [3, 0, 0, 0],
                        [1, 0, 0, 2],
                        [0, 0, 2, 0]])

In the Nussinov-Jacobson algorithm we start by creating an $n \times n$ zero matrix, where $n$ is the length of the RNA sequence. We will call this matrix the scoring matrix ($\mathbf{M}$). To fill the matrix we perform a nested loop through the $(i,j)$ elements. An efficient way to fill $\mathbf{M}$ is to do it one diagonal at a time. Since bases do not form pairs with themselves we don't need to fill the main diagonal; we could start from the diagonal above the main one if we allowed base pairs between neighbors, but a more realistic restriction is that loops can't be shorter than 3 bases, hence we start filling from the fourth diagonal above the main one. Since the matrix is symmetric we can avoid filling the lower-triangular part. Notice that what we are doing here is first checking whether the closest allowed neighbors pair, then the next-closest neighbors, and so on. If we add the restriction that a loop has to be at least $l$ bases long, we are saying that a base can only form a pair with a base that is at least $l+1$ positions away in the sequence.
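
To make the traversal order concrete, the small sketch below (illustrative only, not part of the original notebook) prints the $(i, j)$ cells visited for a toy sequence of length 7, one diagonal at a time, starting from the fourth diagonal above the main one:

In [ ]:
length = 7  # toy sequence length
p = 4       # first diagonal to fill (minimum loop of 3 bases means j - i >= 4)
for d in range(p, length):   # d is the offset of the diagonal, i.e. j - i
    for j in range(d, length):
        i = j - d
        print('%d %d' % (i, j))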

The matrix is filled as follows:

$$M_{i,j} = \max \begin{cases} M_{i, j-1} \\ M_{i+1, j} \\ M_{i+1, j-1} + S_{a,b} \\ \displaystyle \max_{i+4\leqslant k \leqslant j-5} \left \{ M_{i, k} + M_{k+1, j}\right \} \end{cases}$$

where $S_{a,b}$ is the value of the substitution matrix when base $a$ is paired with base $b$.

Notice that the first three cases are equivalent to what we had for the Needleman-Wunsch and Smith-Waterman algorithms. The first two cases correspond to mismatches (bases that do not form a pair), the third case is a match, and the fourth case corresponds to what is called a bifurcation. A bifurcation is the combination of two optimal substructures.


In [9]:
def nus_matrices(seq, mat=nus_sub_mat, aa_mat=mat_names, loop=3):
    """
    Computes the Nussinov scoring matrix.
    
    Parameters
    ----------
    seq : String. RNA sequence
    mat : Array. Substitution matrix
    aa_mat : String or List. Column and row names of the substitution matrix
    loop : Int. Number of bases of the shortest allowed loop
    
    """
    p = loop + 1
    length = len(seq)
    # initialize scoring matrix
    scoring = np.zeros((length, length), int)
    # fill the scoring matrix
    f = [0, 0, 0, 0]
    for n in range(p, length):
        for j in range(n, length):
            i = j - n
            res_i = aa_mat.index(seq[i])
            res_j = aa_mat.index(seq[j])
            # Nucleotides paired
            f[0] = scoring[i+1, j-1] + mat[res_i, res_j]
            # Nucleotides unpaired
            f[1] = scoring[i+1, j]
            # Nucleotides unpaired
            f[2] = scoring[i, j-1]
            # Bifurcation
            if j > i + p*2:
                tmp = [scoring[i, k] + scoring[k+1, j] for k in range(i+p, j-p+1)]  
                f[3] = np.max(tmp)
            score = max(f)
            scoring[i,j] = score
    return scoring

In [18]:
#seq = 'GGGAAAAAUCC'
#seq = 'AGGGAAACCCAAAAACCCAAAGGGA'
seq = 'GGCUCAUCACAUCCUGGGGCUGGAAACGGUCCCAAGGGUAUGGCUGUUCGCCAUUUAAAGUGGUACGCGAGCUA'
scoring = nus_matrices(seq, mat=nus_sub_mat, aa_mat=mat_names, loop=3)
scoring


Out[18]:
array([[ 0,  0,  0, ..., 68, 69, 69],
       [ 0,  0,  0, ..., 68, 68, 68],
       [ 0,  0,  0, ..., 65, 65, 67],
       ..., 
       [ 0,  0,  0, ...,  0,  0,  0],
       [ 0,  0,  0, ...,  0,  0,  0],
       [ 0,  0,  0, ...,  0,  0,  0]])
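
The score of the optimal structure for the whole sequence is stored in the upper-right corner of the matrix, $M_{0, n-1}$ (here 69, visible at the end of the first row of the printed array). It can be read directly, assuming the `scoring` matrix and `seq` computed above:

In [ ]:
# The optimal score for folding the full sequence is the upper-right entry
print(scoring[0, len(seq) - 1])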

In [12]:
def traceback(seq, i, j, scoring=scoring, mat=nus_sub_mat, aa_mat=mat_names):
    if i < j:
        res_i = aa_mat.index(seq[i])
        res_j = aa_mat.index(seq[j])
        if scoring[i, j] == scoring[i+1, j]:
            traceback(seq, i+1, j)
        elif scoring[i, j] == scoring[i, j-1]:
            traceback(seq, i, j-1)        
        elif scoring[i, j] == scoring[i+1, j-1] + mat[res_i, res_j]:
            result[i] = '(' 
            result[j] = ')' 
            #'%4d%4d%3s%2s' % (i, j, seq[i], seq[j])
            traceback(seq, i+1, j-1)
        else:
            for k in range(i+1, j):
                if scoring[i, j] == scoring[i, k] + scoring[k+1, j]:
                     traceback(seq, i, k)
                     traceback(seq, k+1, j)
                     break
    return result

In [13]:
result = ['.']*len(seq)
result = traceback(seq, 0, len(seq)-1)

print '%s\n%s' % (seq, ''.join(result))


AGGGAAACCCAAAAACCCAAAGGGA
.(((...))).....(((...))).

The above representation of the secondary structure of an RNA is called the dot-bracket representation. You can convert that representation to the more familiar visualization using the mfold web server. For the sequence:

GGCUCAUCACAUCCUGGGGCUGGAAACGGUCCCAAGGGUAUGGCUGUUCGCCAUUUAAAGUGGUACGCGAGCUA

the result of our prediction looks like this.
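
Dot-bracket strings are also easy to handle programmatically. The helper below (a small illustrative function, not part of the original notebook) converts a dot-bracket string into a list of paired positions using a stack; note that it only handles nested structures, which is consistent with the Nussinov model's exclusion of pseudoknots.

In [ ]:
def dotbracket_to_pairs(db):
    """Convert a dot-bracket string into a list of (i, j) base-pair indices."""
    stack = []
    pairs = []
    for i, char in enumerate(db):
        if char == '(':
            stack.append(i)
        elif char == ')':
            pairs.append((stack.pop(), i))
    return pairs

print(dotbracket_to_pairs('.(((...))).....(((...))).'))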

Alternatives to Nussinov

As you have seen, the Nussinov algorithm makes some strong assumptions that we already know do not hold in real situations. We know that stacking interactions are important and stabilize even single strands of RNA (which are not able to form hydrogen bonds), and we know pseudoknots occur in nature. Nevertheless, this algorithm can be seen as a starting point for building more complex algorithms; in fact, many methods for RNA SSP are modifications and improvements of the basic idea of the Nussinov-Jacobson algorithm. One example is the MFOLD algorithm, which takes into account the destabilization caused by different types of loops and the stabilization provided by stacking interactions.

Machine Learning Methods

As in other areas of bioinformatics, machine learning methods have proven to be very useful for secondary structure prediction.

Machine learning algorithms work by building models from input data and using those models to make predictions. The main idea is that these methods can learn from data instead of just following explicitly programmed instructions.

Machine learning is now a discipline at the interface of computer science and statistics. Originally it was developed by the artificial intelligence community, separate from the statistical community. Currently it is more focused on solving practical problems, and there is an increasing interest in methods and models borrowed from statistics and probability theory, including Bayesian statistics.

Machine learning has been employed in a range of computing tasks such as spam filtering, optical character recognition, search engines, computer vision, expert systems that aid medical diagnosis, and systems that play games like chess and Jeopardy. It is, of course, also used in secondary structure prediction and almost all other problems related to structural bioinformatics.

Machine learning and statistics also overlap: many methods discovered in machine learning have been known to statisticians for a long time, while others share theoretical foundations. Both disciplines try to solve similar problems, and while machine learning is more algorithmic and statistics more mathematical, the line where one ends and the other begins is becoming more and more blurred, as machine learning becomes more statistical in nature and statistics increasingly relies on computational methods. Maybe the term data science is well suited to describe both disciplines as a whole, as the Venn diagram by Drew Conway implies.

Methods in machine learning can be classified in different ways. Take the following classification just as a guide to try to grasp what machine learning is about.

One way to classify machine learning algorithms is by the type of input the method uses:

  • Supervised learning: You have labeled pairs of inputs and outputs, and the goal is to learn the rule that maps inputs to outputs. Linear regression (or regression in general) is an example of this type of task; another example is a spam filter (or classification in general). A toy example is sketched at the end of this section.

  • Unsupervised learning: You have a set of unlabeled data and the goal is to find patterns in the data. Clustering, PCA, and hidden Markov models are examples of unsupervised learning.

While the two types above are the most popular, there is a third type, called semi-supervised learning, that combines both; this is useful when only part of your data is labeled. There is also a fourth type of machine learning, known as reinforcement learning; an example would be teaching a program to play a game by simply letting the software play the game many times.
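
As a minimal illustration of the supervised setting, the toy sketch below (synthetic data, NumPy only, unrelated to secondary structure and not part of the original notebook) "learns" the parameters of a linear mapping from labeled pairs of inputs and outputs:

In [ ]:
# Toy supervised learning: estimate the parameters of a linear map
# y = slope * x + intercept from labeled (x, y) pairs by least squares.
np.random.seed(0)
x = np.linspace(0, 10, 50)
y = 2.5 * x + 1.0 + np.random.normal(0, 1, size=len(x))  # noisy labels
slope, intercept = np.polyfit(x, y, 1)
print('learned slope = %.2f, intercept = %.2f' % (slope, intercept))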

Further Reading