Markov Chains

author: Jacob Schreiber
contact: jmschreiber91@gmail.com

Markov Chains are a simple model based on conditional probability, where a sequence is modelled as the product of conditional probabilities. A n-th order Markov chain looks back n emissions to base its conditional probability on. For example, a 3rd order Markov chain models $P(X_{t} | X_{t-1}, X_{t-2}, X_{t-3})$.

However, a full Markov model needs to model the first observations, and the first n-1 observations. The first observation can't really be modelled well using $P(X_{t} | X_{t-1}, X_{t-2}, X_{t-3})$, but can be modelled by $P(X_{t})$. The second observation has to be modelled by $P(X_{t} | X_{t-1} )$. This means that these distributions have to be passed into the Markov chain as well.

We can initialize a Markov chain easily enough by passing in a list of the distributions.


In [1]:
%matplotlib inline
import time
import pandas
import random
import numpy
import matplotlib.pyplot as plt
import seaborn; seaborn.set_style('whitegrid')
import itertools

from pomegranate import *

random.seed(0)
numpy.random.seed(0)
numpy.set_printoptions(suppress=True)

%load_ext watermark
%watermark -m -n -p numpy,scipy,pomegranate


Fri Jan 10 2020 

numpy 1.18.1
scipy 1.4.1
pomegranate 0.12.0

compiler   : Clang 10.0.0 (clang-1000.11.45.5)
system     : Darwin
release    : 17.7.0
machine    : x86_64
processor  : i386
CPU cores  : 4
interpreter: 64bit

In [2]:
from pomegranate import *
%pylab inline


Populating the interactive namespace from numpy and matplotlib
/Users/ksachdeva/Desktop/Dev/myoss/pomegranate/env/lib/python3.7/site-packages/IPython/core/magics/pylab.py:160: UserWarning: pylab import has clobbered these variables: ['random']
`%matplotlib` prevents importing * from pylab and numpy
  "\n`%matplotlib` prevents importing * from pylab and numpy"

In [3]:
d1 = DiscreteDistribution({'A': 0.10, 'C': 0.40, 'G': 0.40, 'T': 0.10})
d2 = ConditionalProbabilityTable([['A', 'A', 0.10],
                                ['A', 'C', 0.50],
                                ['A', 'G', 0.30],
                                ['A', 'T', 0.10],
                                ['C', 'A', 0.10],
                                ['C', 'C', 0.40],
                                ['C', 'T', 0.40],
                                ['C', 'G', 0.10],
                                ['G', 'A', 0.05],
                                ['G', 'C', 0.45],
                                ['G', 'G', 0.45],
                                ['G', 'T', 0.05],
                                ['T', 'A', 0.20],
                                ['T', 'C', 0.30],
                                ['T', 'G', 0.30],
                                ['T', 'T', 0.20]], [d1])

clf = MarkovChain([d1, d2])

Markov chains have log probability, fit, summarize, and from summaries methods implemented. They do not have classification capabilities by themselves, but when combined with a Naive Bayes classifier can be used to do discrimination between multiple models (see the Naive Bayes tutorial notebook).

Lets see the log probability of some data.


In [4]:
clf.log_probability( list('CAGCATCAGT') )


Out[4]:
-17.532789486599906

In [5]:
clf.log_probability( list('C') )


Out[5]:
-0.916290731874155

In [6]:
clf.log_probability( list('CACATCACGACTAATGATAAT') )


Out[6]:
-38.55615991599665

We can fit the model to sequences which we pass in, and as expected, get better performance on sequences which we train on.


In [11]:
clf.fit(list(map(list,('CAGCATCAGT', 'C', 'ATATAGAGATAAGCT', 'GCGCAAGT', 'GCATTGC', 'CACATCACGACTAATGATAAT'))))
print(clf.log_probability( list('CAGCATCAGT') ) )
print(clf.log_probability( list('C') ))
print(clf.log_probability( list('CACATCACGACTAATGATAAT') ))


-9.496270911389157
-0.6931471805599453
-25.257514389300717

In [12]:
print(clf.distributions[0])


{
    "class" :"Distribution",
    "dtype" :"str",
    "name" :"DiscreteDistribution",
    "parameters" :[
        {
            "A" :0.16666666666666666,
            "C" :0.5,
            "G" :0.3333333333333333,
            "T" :0.0
        }
    ],
    "frozen" :false
}

In [13]:
print(clf.distributions[1])


A	A	0.18181818181818182
A	C	0.13636363636363635
A	G	0.2727272727272727
A	T	0.4090909090909091
C	A	0.6666666666666666
C	C	0.0
C	T	0.16666666666666669
C	G	0.16666666666666669
G	A	0.3333333333333333
G	C	0.5
G	G	0.0
G	T	0.16666666666666669
T	A	0.5
T	C	0.2
T	G	0.2
T	T	0.10000000000000002