author: Jacob Schreiber
contact: jmschreiber91@gmail.com
Markov Chains are a simple model based on conditional probability, where a sequence is modelled as the product of conditional probabilities. A n-th order Markov chain looks back n emissions to base its conditional probability on. For example, a 3rd order Markov chain models $P(X_{t} | X_{t-1}, X_{t-2}, X_{t-3})$.
However, a full Markov model needs to model the first observations, and the first n-1 observations. The first observation can't really be modelled well using $P(X_{t} | X_{t-1}, X_{t-2}, X_{t-3})$, but can be modelled by $P(X_{t})$. The second observation has to be modelled by $P(X_{t} | X_{t-1} )$. This means that these distributions have to be passed into the Markov chain as well.
We can initialize a Markov chain easily enough by passing in a list of the distributions.
In [1]:
from pomegranate import *
%pylab inline
In [2]:
d1 = DiscreteDistribution({'A': 0.10, 'C': 0.40, 'G': 0.40, 'T': 0.10})
d2 = ConditionalProbabilityTable([['A', 'A', 0.10],
['A', 'C', 0.50],
['A', 'G', 0.30],
['A', 'T', 0.10],
['C', 'A', 0.10],
['C', 'C', 0.40],
['C', 'T', 0.40],
['C', 'G', 0.10],
['G', 'A', 0.05],
['G', 'C', 0.45],
['G', 'G', 0.45],
['G', 'T', 0.05],
['T', 'A', 0.20],
['T', 'C', 0.30],
['T', 'G', 0.30],
['T', 'T', 0.20]], [d1])
clf = MarkovChain([d1, d2])
Markov chains have log probability, fit, summarize, and from summaries methods implemented. They do not have classification capabilities by themselves, but when combined with a Naive Bayes classifier can be used to do discrimination between multiple models (see the Naive Bayes tutorial notebook).
Lets see the log probability of some data.
In [3]:
clf.log_probability( list('CAGCATCAGT') )
Out[3]:
In [4]:
clf.log_probability( list('C') )
Out[4]:
In [5]:
clf.log_probability( list('CACATCACGACTAATGATAAT') )
Out[5]:
We can fit the model to sequences which we pass in, and as expected, get better performance on sequences which we train on.
In [6]:
clf.fit( map( list, ('CAGCATCAGT', 'C', 'ATATAGAGATAAGCT', 'GCGCAAGT', 'GCATTGC', 'CACATCACGACTAATGATAAT') ) )
print clf.log_probability( list('CAGCATCAGT') )
print clf.log_probability( list('C') )
print clf.log_probability( list('CACATCACGACTAATGATAAT') )
In [7]:
print clf.distributions[0]
In [8]:
print clf.distributions[1]