By: Tamas Nagy
This is a toy implementation of a Hidden Markov Model (HMM) for recognizing 5' splice-sites. This example is based on the Sean Eddy's 2004 paper in Nature Biotechnology.
I've reproduced Figure 1 from the paper below:
In [1]:
from IPython.display import Image
Image(url='http://1.bp.blogspot.com/_9rDJWEd9qEA/TKovS1g05gI/AAAAAAAAAlA/R3_cCjLPxCk/s1600/EddyHMMfig.jpg', embed=True)
Out[1]:
Adapted by permission from Macmillan Publishers Ltd: Nature Biotechnology 22, 1315 - 1316 (2004) doi:10.1038/nbt1004-1315, copyright 2004
What we have here is a simple three state model. As you can see, all nucleotides have an equal probability of being present in state E
, but the following state 5
is highly enriched in guanosine, followed by the adenine/thymine-rich state I
. The emission probabilities of each nucleotide is given above the states. The transition probabilities are shown next to their respective transition arrow.
We want to build a model in such a way that, given a sequence, we can calculate the maximum likelihood path, i.e. the location of the 5' site, and the confidence that we have in our result.
In [2]:
from yahmm import *
import numpy as np, matplotlib.pyplot as plt
# %install_ext https://raw.githubusercontent.com/rasbt/python_reference/master/ipython_magic/watermark.py
%load_ext watermark
%watermark -t -d -z -p yahmm,numpy,matplotlib
Then create the model
In [3]:
model = Model("5prime_ss_recog")
Next, let's set up the states using the emission states specified in the article
In [4]:
state_e = State(DiscreteDistribution({'A':0.25, 'C':0.25,
'G':0.25,'T':0.25}), name='E')
state_5 = State(DiscreteDistribution({'A':0.05,'C':0,
'G':0.95,'T':0}), name='5')
state_i = State(DiscreteDistribution({'A':0.4,'C':0.1,
'G':0.1,'T':0.4}), name='I')
After that we'll need to add all the transition probabilities. Don't forget to include the start and end states.
In [5]:
model.add_transition(model.start, state_e, 1)
model.add_transition(state_e, state_e, 0.9)
model.add_transition(state_e, state_5, 0.1)
model.add_transition(state_5, state_i, 1)
model.add_transition(state_i, state_i, 0.9)
model.add_transition(state_i, model.end, 0.1)
Let's finalize our model
In [6]:
model.bake(verbose=True)
In [7]:
seq = 'CTTCATGTGAAAGCAGACGTAAGTCA'
Now, let's find the most probable path using the Viterbi algorithm
In [8]:
print(seq)
# ML path
viterbi_prob, viterbi_path = model.viterbi(list(seq))
print('%s log P = %f'%(''.join(state[1].name for state in viterbi_path[1:-1]), viterbi_prob))
As you can see, this agrees nicely with what we see in Figure 1 of the paper. The most probable 5' splice site the located at the penultimate guanosine.
But the question remains: How confident are we in our answer? Since HMMs are probabilistic models we can calculate our confidence using a process termed posterior decoding.
We'll first get the index of the 5'
state and use this to get the transition probabilities at each nucleotide position in our test sequence.
In [9]:
indices = { state.name: i for i, state in enumerate( model.states ) }
ss_state_index = indices['5']
probs = [np.exp(prob[ss_state_index]) for prob
in model.forward_backward(list(seq))[1]]
print(probs[:10])
Let's graph the result so that we can better see what is going on
In [10]:
%matplotlib inline
rects = plt.bar(range(1, len(seq)+1), probs, align='center')
plt.xticks(range(1, len(seq)+1), list(seq))
plt.ylim([0,0.6])
def label(rects):
for rect in rects:
height = rect.get_height()
if(height > .1):
plt.text(rect.get_x()+rect.get_width()/2., 1.05*height,
'%s%%'%int(height*100),ha='center', va='bottom')
label(rects)
As you can see we get a probability of 46% that the penultimate G is a 5' splice site.
Found an error? Found five? Was something not clear? Let us know at https://github.com/jmschrei/yahmm/issues. Even better? Submit a pull request!
(words of encouragement are welcome too!)