Markov Chains

author: Jacob Schreiber
contact: jmschreiber91@gmail.com

Markov Chains are a simple model based on conditional probability, where a sequence is modelled as the product of conditional probabilities. A n-th order Markov chain looks back n emissions to base its conditional probability on. For example, a 3rd order Markov chain models $P(X_{t} | X_{t-1}, X_{t-2}, X_{t-3})$.

However, a full Markov model needs to model the first observations, and the first n-1 observations. The first observation can't really be modelled well using $P(X_{t} | X_{t-1}, X_{t-2}, X_{t-3})$, but can be modelled by $P(X_{t})$. The second observation has to be modelled by $P(X_{t} | X_{t-1} )$. This means that these distributions have to be passed into the Markov chain as well.

We can initialize a Markov chain easily enough by passing in a list of the distributions.



In [1]:

    
from pomegranate import *
%pylab inline









    



Populating the interactive namespace from numpy and matplotlib






    



/home/jmschr/anaconda/lib/python2.7/site-packages/IPython/core/magics/pylab.py:161: UserWarning: pylab import has clobbered these variables: ['log', 'random']
`%matplotlib` prevents importing * from pylab and numpy
  "\n`%matplotlib` prevents importing * from pylab and numpy"



In [2]:

    
d1 = DiscreteDistribution({'A': 0.10, 'C': 0.40, 'G': 0.40, 'T': 0.10})
d2 = ConditionalProbabilityTable([['A', 'A', 0.10],
                                ['A', 'C', 0.50],
                                ['A', 'G', 0.30],
                                ['A', 'T', 0.10],
                                ['C', 'A', 0.10],
                                ['C', 'C', 0.40],
                                ['C', 'T', 0.40],
                                ['C', 'G', 0.10],
                                ['G', 'A', 0.05],
                                ['G', 'C', 0.45],
                                ['G', 'G', 0.45],
                                ['G', 'T', 0.05],
                                ['T', 'A', 0.20],
                                ['T', 'C', 0.30],
                                ['T', 'G', 0.30],
                                ['T', 'T', 0.20]], [d1])

clf = MarkovChain([d1, d2])

Markov chains have log probability, fit, summarize, and from summaries methods implemented. They do not have classification capabilities by themselves, but when combined with a Naive Bayes classifier can be used to do discrimination between multiple models (see the Naive Bayes tutorial notebook).

Lets see the log probability of some data.



In [3]:

    
clf.log_probability( list('CAGCATCAGT') )









    Out[3]:





-17.532789486599906



In [4]:

    
clf.log_probability( list('C') )









    Out[4]:





-0.916290731874155



In [5]:

    
clf.log_probability( list('CACATCACGACTAATGATAAT') )









    Out[5]:





-38.55615991599665

We can fit the model to sequences which we pass in, and as expected, get better performance on sequences which we train on.



In [6]:

    
clf.fit( map( list, ('CAGCATCAGT', 'C', 'ATATAGAGATAAGCT', 'GCGCAAGT', 'GCATTGC', 'CACATCACGACTAATGATAAT') ) )
print clf.log_probability( list('CAGCATCAGT') ) 
print clf.log_probability( list('C') )
print clf.log_probability( list('CACATCACGACTAATGATAAT') )









    



-9.49627091139
-0.69314718056
-25.2575143893



In [7]:

    
print clf.distributions[0]









    



{
    "frozen" :false,
    "class" :"Distribution",
    "parameters" :[
        {
            "A" :0.16666666666666666,
            "C" :0.5,
            "T" :0.0,
            "G" :0.33333333333333331
        }
    ],
    "name" :"DiscreteDistribution"
}



In [8]:

    
print clf.distributions[1]









    



A	A	0.181818181818
A	C	0.136363636364
A	G	0.272727272727
A	T	0.409090909091
C	A	0.666666666667
C	C	0.0
C	T	0.166666666667
C	G	0.166666666667
G	A	0.333333333333
G	C	0.5
G	G	0.0
G	T	0.166666666667
T	A	0.5
T	C	0.2
T	G	0.2
T	T	0.1