Determining the posterior probabilities using YAHMM

by Tamas Nagy

When working with Hidden Markov Models (HMMs), we are often interested in determining the "optimal" sequence of hidden states $Q = q_1q_2...q_N$ given a set of observations $O = O_1O_2...O_N$ and a model $\lambda$. So how do we find the optimal sequence? Well, it depends on our definition of optimality.

One option is to maximize the probability of entire state sequence $Q$ given $O$ and $\lambda$. This is idea behind the Viterbi algorithm. However, we often find that the most probable path can be quite off for individual observations.

An alternative is to defined optimality as maximizing the number of states $q_k$ that are individually most likely. Thus, our focus here is to be as correct as possible for each position separately, ignoring the rest of the sequence. However, this definintion also has a major drawback: the state sequence created in this manner may not even be a valid state sequence!

For this reason it is often helpful to use both methods for interrogating the results of a model.



In [1]:

    
from yahmm import *
import seaborn as sns, numpy as np
from matplotlib.patches import Rectangle
from __future__ import division, print_function

# %install_ext https://raw.githubusercontent.com/rasbt/python_reference/master/ipython_magic/watermark.py
%load_ext watermark
%watermark -t -d -z -p yahmm,seaborn,numpy,matplotlib









    



20/07/2014 17:03:57 CEST

yahmm 1.0.0
seaborn 0.3.1
numpy 1.8.1
matplotlib 1.3.1

In this example, we will be reproducing Figure 2 from "Hidden Markov Models for Prediction of Protein Features" by Bystroff and Krogh. The model here is a simple two-state one for predicting transmembrane regions of proteins using the amino acid bias present in these regions. Transmembrane domains generally have a higher frequencies of hydrophobic residues (especially leucine and isoleucine) and lower frequencies of charged and polar residues. The model will have two states: one for transmembrane regions and one for other regions. The model will be initialized in the other state and will be infinite (i.e. no end state).

Tables 1 and 2 provide the necessary emission and transition probabilities



In [2]:

    
aas = list('IFLVMWAYGCTSPHNQDEKR')
tm_freq = np.array([1826,1370,2562,1751,616,414,1657,615,1243,
                    289,755,806,423,121,250,141,104,110,78,83])/15214
other_freq = np.array([2187,1854,4156,2935,1201,819,3382,1616,
                       3352,960,2852,3410,2640,1085,2279,2054,
                       2551,2983,2651,2933]) / 47900

# Add emission frequencies to states
_m = State(DiscreteDistribution(dict(zip(aas, tm_freq))), name='M')
_x = State(DiscreteDistribution(dict(zip(aas, other_freq))), name='X')

model = Model('TransmembraneDomainPrediction', start=_x)

# Transition frequencies from Table 2
model.add_transition(_m, _x, 0.046)
model.add_transition(_m, _m, 0.954)
model.add_transition(_x, _x, 0.982)
model.add_transition(_x, _m, 0.015)

model.bake(verbose=True)









    



TransmembraneDomainPrediction : X summed to 0.997, normalized to 1.0

For comparability, we will also use the sequence of a human Alpha-1D adrenergic receptor (Uniprot-Acc: P25100). Uniprot reports seven separate transmembrane regions for this proteins.



In [3]:

    
ada1d_seq = list('MTFRDLLSVSFEGPRPDSSAGGSSAGGGGGSAGGAAPSEGPAVGGVPGGAGGGGGVVGAG\
SGEDNRSSAGEPGSAGAGGDVNGTAAVGGLVVSAQGVGVGVFLAAFILMAVAGNLLVILS\
VACNRHLQTVTNYFIVNLAVADLLLSATVLPFSATMEVLGFWAFGRAFCDVWAAVDVLCC\
TASILSLCTISVDRYVGVRHSLKYPAIMTERKAAAILALLWVVALVVSVGPLLGWKEPVP\
PDERFCGITEEAGYAVFSSVCSFYLPMAVIVVMYCRVYVVARSTTRSLEAGVKRERGKAS\
EVVLRIHCRGAATGADGAHGMRSAKGHTFRSSLSVRLLKFSREKKAAKTLAIVVGVFVLC\
WFPFFFVLPLGSLFPQLKPSEGVFKVIFWLGYFNSCVNPLIYPCSSREFKRAFLRLLRCQ\
CRRRRRRRPLWRVYGHHWRASTSGLRQDCAPSSGDAPPGAPLALTALPDPDPEPPGTPEM\
QAPVASRRKPPSAFREWRLLGPFRRPTTQLRAKVSSLSHKIRAGGAQRAEAACAQRSEVE\
AVSLGVPHEVAEGATCQAYELADYSNLRETDI')

actual_transmems = [(96,121),(134,159),(170,192),(214,238),
                   (252,275),(349,373),(381,405)]

Let's first get the most probable path using Viterbi and find the start and end states of the viterbi predicted regions



In [4]:

    
viterbi_path = model.viterbi(ada1d_seq)[1]

viterbi_pred = np.array([state[1].name for state in viterbi_path[1:]])

viterbi_pred_mem = viterbi_pred == 'M'
starts = np.nonzero(viterbi_pred_mem & ~np.roll(viterbi_pred_mem, 1))[0]
ends = np.nonzero(viterbi_pred_mem & ~np.roll(viterbi_pred_mem, -1))[0]
transmems = zip(starts, ends)

Now let's calculate the posterior probabilities $\gamma_1\gamma_2...\gamma_k...\gamma_N$ for being in state $S_i$ for each observation $k$. So for each position $k$, we'll calculate:

$$ \gamma_k(i) = \frac{\alpha_k(i)\beta_k(i)}{\sum_{i}^M \alpha_k(i)\beta_k(i)} $$

where $M$ here is the number of states and the $\alpha_k(i)\beta_k(i)$ term are probabilities given by the forward-backward algorithm. In our simple example, $M=2$ for the two states (transmembrane and other), but this technique can be applied to more complex models as well.



In [5]:

    
trans, ems = model.forward_backward(ada1d_seq)
ems = np.exp(ems)
probs = ems/np.sum(ems, axis=1)[:, np.newaxis]

Now, let's plot everything. We'll need to plot the true transmembrane regions, the viterbi-predicted regions, and the posterior probabilities



In [6]:

    
%matplotlib inline
colors = sns.color_palette('deep', n_colors=6, desat=0.5)
sns.set_context(rc={"figure.figsize": (14, 6)})
sns.plt.axhline(y=1.1, c=colors[0], alpha=0.7)
sns.plt.axhline(y=0.5, c=colors[1], alpha=0.7)
sns.plt.xlim([1, len(ada1d_seq)+1])
sns.plt.ylim([0,1.2])
sns.plt.ylabel(r'posterior probs, $\gamma_k$')
sns.plt.xlabel(r'$k$')
axis = sns.plt.gca()

# Plot viterbi predicted TMDs
for start, end in transmems:
    axis.add_patch(Rectangle((start+1, 1.075), end-start+1, 0.05, 
                             facecolor=colors[0], alpha=0.7))
    axis.text(480,1.13, 'viterbi')

# Plot actual TMDs
for start, end in actual_transmems:
    axis.add_patch(Rectangle((start+1, .475), end-start+1, 0.05, 
                             facecolor=colors[1], alpha=0.7))
    axis.text(480, .53, 'actual')

# Get indices of states
indices = { state.name: i for i, state in enumerate( model.states ) }

sns.plt.plot(range(1, len(ada1d_seq)+1), probs[:, indices['M']], 
             c=colors[2], alpha=0.7)
sns.plt.gcf().savefig('posteriors.png')

As you can see, the posterior probabilities are much better at recapitulating the transmembrane regions from the sequence than the viterbi output. Both techniques are useful in helping us understand and analyze our data, but they answer different questions.

Found an error? Found five? Was something not clear? Let us know at https://github.com/jmschrei/yahmm/issues. Even better? Submit a pull request!

(words of encouragement are welcome too!)