author: Jacob Schreiber
contact: jmschreiber91@gmail.com
Markov chains are simple models based on conditional probability, where a sequence is modelled as a product of conditional probabilities. An n-th order Markov chain conditions each emission on the previous n emissions. For example, a 3rd order Markov chain models $P(X_{t} | X_{t-1}, X_{t-2}, X_{t-3})$.
However, a full Markov model also needs to handle the first n observations, which don't have enough history for the full conditional. The first observation can't be modelled using $P(X_{t} | X_{t-1}, X_{t-2}, X_{t-3})$ because nothing precedes it, so it is modelled by $P(X_{t})$. Likewise, the second observation has to be modelled by $P(X_{t} | X_{t-1})$, and so on up to order n-1. These lower-order distributions have to be passed into the Markov chain as well.
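Putting the pieces together, a 3rd order chain factorizes the probability of a sequence of length $T$ as
$P(X_{1}, \ldots, X_{T}) = P(X_{1}) \, P(X_{2} | X_{1}) \, P(X_{3} | X_{2}, X_{1}) \prod_{t=4}^{T} P(X_{t} | X_{t-1}, X_{t-2}, X_{t-3})$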
We can initialize a Markov chain easily enough by passing in a list of the distributions, one per order. Passing in two distributions, $P(X_{t})$ and $P(X_{t} | X_{t-1})$, gives a first order Markov chain.
In [1]:
%matplotlib inline
import time
import pandas
import random
import numpy
import matplotlib.pyplot as plt
import seaborn; seaborn.set_style('whitegrid')
import itertools
from pomegranate import *
random.seed(0)
numpy.random.seed(0)
numpy.set_printoptions(suppress=True)
%load_ext watermark
%watermark -m -n -p numpy,scipy,pomegranate
In [3]:
# Distribution over the first character of a sequence
d1 = DiscreteDistribution({'A': 0.10, 'C': 0.40, 'G': 0.40, 'T': 0.10})

# Conditional distribution over a character given the preceding character
d2 = ConditionalProbabilityTable([['A', 'A', 0.10],
                                  ['A', 'C', 0.50],
                                  ['A', 'G', 0.30],
                                  ['A', 'T', 0.10],
                                  ['C', 'A', 0.10],
                                  ['C', 'C', 0.40],
                                  ['C', 'G', 0.10],
                                  ['C', 'T', 0.40],
                                  ['G', 'A', 0.05],
                                  ['G', 'C', 0.45],
                                  ['G', 'G', 0.45],
                                  ['G', 'T', 0.05],
                                  ['T', 'A', 0.20],
                                  ['T', 'C', 0.30],
                                  ['T', 'G', 0.30],
                                  ['T', 'T', 0.20]], [d1])

clf = MarkovChain([d1, d2])
Markov chains have log_probability, fit, summarize, and from_summaries methods implemented. They do not have classification capabilities by themselves, but when combined with a Naive Bayes classifier they can be used to discriminate between multiple models (see the Naive Bayes tutorial notebook).
Let's see the log probability of some data.
In [4]:
clf.log_probability(list('CAGCATCAGT'))
Out[4]:
In [5]:
clf.log_probability(list('C'))
Out[5]:
In [6]:
clf.log_probability(list('CACATCACGACTAATGATAAT'))
Out[6]:
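These values can be checked by hand: the log probability of a sequence is the log of d1's probability for the first character plus the log of each transition probability from d2. For a single character such as 'C' it is just $\log 0.40 \approx -0.916$. A minimal sketch of that check, restating the tables above as plain dictionaries (not pomegranate API):
import math

# The same probabilities as d1 and d2 above, as plain dictionaries
p1 = {'A': 0.10, 'C': 0.40, 'G': 0.40, 'T': 0.10}
p2 = {('A', 'A'): 0.10, ('A', 'C'): 0.50, ('A', 'G'): 0.30, ('A', 'T'): 0.10,
      ('C', 'A'): 0.10, ('C', 'C'): 0.40, ('C', 'G'): 0.10, ('C', 'T'): 0.40,
      ('G', 'A'): 0.05, ('G', 'C'): 0.45, ('G', 'G'): 0.45, ('G', 'T'): 0.05,
      ('T', 'A'): 0.20, ('T', 'C'): 0.30, ('T', 'G'): 0.30, ('T', 'T'): 0.20}

def manual_log_probability(seq):
    # log P(X_1) plus the sum of log transition probabilities
    logp = math.log(p1[seq[0]])
    for prev, curr in zip(seq, seq[1:]):
        logp += math.log(p2[prev, curr])
    return logp

for seq in ('CAGCATCAGT', 'C', 'CACATCACGACTAATGATAAT'):
    print(seq, manual_log_probability(seq))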
We can fit the model to sequences which we pass in and, as expected, get a higher log probability on the sequences we trained on.
In [11]:
clf.fit(list(map(list, ('CAGCATCAGT', 'C', 'ATATAGAGATAAGCT', 'GCGCAAGT', 'GCATTGC', 'CACATCACGACTAATGATAAT'))))
print(clf.log_probability(list('CAGCATCAGT')))
print(clf.log_probability(list('C')))
print(clf.log_probability(list('CACATCACGACTAATGATAAT')))
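Fitting here amounts to maximum likelihood estimation: d1 is re-estimated from the relative frequencies of first characters, and d2 from transition counts normalized per preceding character. A sketch of that counting, under the assumption that fit uses straight maximum likelihood (pomegranate's internals may differ in detail):
from collections import Counter

train = ['CAGCATCAGT', 'C', 'ATATAGAGATAAGCT', 'GCGCAAGT', 'GCATTGC', 'CACATCACGACTAATGATAAT']

# MLE of P(X_1): relative frequency of each sequence's first character
first_counts = Counter(seq[0] for seq in train)
p_first = {c: n / len(train) for c, n in first_counts.items()}
print(p_first)

# MLE of P(X_t | X_{t-1}): transition counts normalized per preceding character
trans_counts = Counter(pair for seq in train for pair in zip(seq, seq[1:]))
totals = Counter()
for (prev, curr), n in trans_counts.items():
    totals[prev] += n
p_trans = {pair: n / totals[pair[0]] for pair, n in trans_counts.items()}
print(p_trans)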
In [12]:
print(clf.distributions[0])
In [13]:
print(clf.distributions[1])
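The printed tables now reflect the counts from the training sequences, which is why those sequences receive higher log probabilities after fitting.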