author: Jacob Schreiber
contact: jmschreiber91@gmail.com
Hidden Markov models (HMMs) are the flagship of the pomegranate package: they have the most features of all the models, and they were the first algorithm implemented.
Hidden Markov models are a form of structured prediction method that are popular for tagging all elements in a sequence with some "hidden" state. They can be thought of as extensions of Markov chains where, instead of the probability of the next observation being dependent on the current observation, the probability of the next hidden state is dependent on the current hidden state, and the next observation is derived from that hidden state. An example of this is part-of-speech tagging, where the observations are words and the hidden states are parts of speech. Each word gets tagged with a part of speech, but dynamic programming is used to search through all potential word-tag combinations and identify the best set of tags across the entire sentence.
Another perspective on HMMs is that they are an extension of mixture models that adds a transition matrix. Conceptually, a mixture model has a set of "hidden" states---the mixture components---and one can calculate the probability that each sample belongs to each component. However, this approach treats each observation independently, whereas in the part-of-speech example we know that an adjective is typically followed by a noun, so position in the sequence matters. A HMM adds a transition matrix between the hidden states to incorporate this information across the sequence, allowing, for example, a higher probability of transitioning from the "adjective" hidden state to the "noun" or "verb" states.
pomegranate implements HMMs in a flexible manner that goes beyond what other packages allow. Let's see some examples.
In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn; seaborn.set_style('whitegrid')
import numpy
from pomegranate import *
numpy.random.seed(0)
numpy.set_printoptions(suppress=True)
%load_ext watermark
%watermark -m -n -p numpy,scipy,pomegranate
Let's take the simplified example of CG island detection on a sequence of DNA. DNA is made up of the four canonical nucleotides, abbreviated 'A', 'C', 'G', and 'T'. We can say that regions of the genome that are enriched for the nucleotides 'C' and 'G' are 'CG islands', which is a simplification of the real biological concept but sufficient for our example. The issue with identifying these regions is that they are not made up exclusively of the nucleotides 'C' and 'G', but have some 'A's and 'T's scattered amongst them. A simple model that looked only for long, uninterrupted stretches of C's and G's would not perform well, because it would miss most of the real regions.
We can start off by building the model. Because HMMs involve the transition matrix, which is often represented as a graph over the hidden states, building them requires a few more steps than a simple distribution or a mixture model. Our simple model will be composed of two distributions. One distribution will be a uniform distribution across all four characters, and the other will have a preference for the nucleotides C and G, while still allowing the nucleotides A and T to be present.
In [2]:
d1 = DiscreteDistribution({'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25})
d2 = DiscreteDistribution({'A': 0.10, 'C': 0.40, 'G': 0.40, 'T': 0.10})
For the HMM we first have to define states, each of which is a pair of a distribution and a name.
In [3]:
s1 = State(d1, name='background')
s2 = State(d2, name='CG island')
Now we define the HMM and pass in the states.
In [4]:
model = HiddenMarkovModel()
model.add_states(s1, s2)
Then we have to define the transition matrix, which gives the probability of going from one hidden state to the next. In some cases, like this one, there are high self-loop probabilities, indicating that it's likely to stay in the same hidden state from one observation to the next in the sequence. Other cases have a lower probability of staying in the same state, like the part-of-speech tagger. Part of the transition matrix is the start probabilities, which are the probabilities of starting in each of the hidden states. Because we create these transitions one at a time, they are very amenable to sparse transition matrices, where transitions between some pairs of hidden states are simply not allowed.
In [5]:
model.add_transition(model.start, s1, 0.5)
model.add_transition(model.start, s2, 0.5)
model.add_transition(s1, s1, 0.9)
model.add_transition(s1, s2, 0.1)
model.add_transition(s2, s1, 0.1)
model.add_transition(s2, s2, 0.9)
Now, finally, we need to bake the model in order to finalize the internal structure. Bake must be called when the model has been fully specified.
In [6]:
model.bake()
Now we can make predictions on some sequence. Let's create a sequence that has a CG-enriched region in the middle and see whether we can identify it.
In [7]:
seq = numpy.array(list('CGACTACTGACTACTCGCCGACGCGACTGCCGTCTATACTGCGCATACGGC'))
hmm_predictions = model.predict(seq)
print("sequence: {}".format(''.join(seq)))
print("hmm pred: {}".format(''.join(map( str, hmm_predictions))))
It looks like it successfully identified a CG island in the middle (the long stretch of 0's) and another shorter one at the end. The predicted integers don't correspond to the order in which states were added to the model, but rather, the order that they exist in the model after a topological sort. More importantly, the model wasn't tricked into thinking that every CG or even pair of CGs was an island. It required many C's and G's to be part of a longer stretch to identify that region as an island. Naturally, the balance of the transition and emission probabilities will heavily influence what regions are detected.
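As a quick sketch of recovering that mapping, assuming (as in classic pomegranate) that the model's states attribute holds the topologically sorted State objects:
# Map each predicted integer back to the name of the hidden state it refers to;
# the ordering reflects the internal topological sort, not the order of add_states.
state_names = [state.name for state in model.states]
print(state_names)
print([state_names[i] for i in hmm_predictions[:10]])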
Let's say, though, that we want to get rid of that CG island prediction at the end because we don't believe that real islands can occur at the end of the sequence. We can take care of this by adding an explicit end state that only the non-island hidden state can reach. We enforce that the model has to end in the end state, and since only the non-island state can get there, the sequence of hidden states must end in the non-island state. Here's how:
In [8]:
model = HiddenMarkovModel()
model.add_states(s1, s2)
model.add_transition(model.start, s1, 0.5)
model.add_transition(model.start, s2, 0.5)
model.add_transition(s1, s1, 0.89 )
model.add_transition(s1, s2, 0.10 )
model.add_transition(s1, model.end, 0.01)
model.add_transition(s2, s1, 0.1 )
model.add_transition(s2, s2, 0.9)
model.bake()
Note that all we did was add a transition from s1
to model.end
with some low probability. This probability doesn't have to be high if there's only a single transition there, because there's no other possible way of getting to the end state.
In [9]:
seq = numpy.array(list('CGACTACTGACTACTCGCCGACGCGACTGCCGTCTATACTGCGCATACGGC'))
hmm_predictions = model.predict(seq)
print("sequence: {}".format(''.join(seq)))
print("hmm pred: {}".format(''.join(map( str, hmm_predictions))))
This seems far more reasonable. There is a single CG island surrounded by background sequence, and something at the end. If we knew that CG islands cannot occur at the end of sequences, we need only modify the underlying structure of the HMM in order to say that the sequence must end from the background state.
Just as mixtures can provide probabilistic estimates of class assignments rather than only hard labels, so can hidden Markov models. These estimates are the posterior probabilities of belonging to each of the hidden states given not only the observation, but also the rest of the sequence.
In [10]:
print(model.predict_proba(seq)[12:19])
We can see here the transition from the first non-island region to the middle island region, with high probabilities in one column turning into high probabilities in the other column. The predict
method is just taking the most likely element, the maximum-a-posteriori estimate.
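Since the text above describes predict as taking the maximum-a-posteriori estimate, we can sanity-check that by taking the argmax over the posterior columns ourselves; a minimal sketch:
# The hard labels should match the argmax over the posterior probabilities at each position,
# assuming predict uses its default maximum-a-posteriori decoding as described above.
posteriors = model.predict_proba(seq)
map_estimate = numpy.argmax(posteriors, axis=1)
print(numpy.array_equal(map_estimate, hmm_predictions))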
In addition to using the forward-backward algorithm to just calculate posterior probabilities for each observation, we can count the number of transitions that are predicted to occur between the hidden states.
In [11]:
trans, ems = model.forward_backward(seq)
print(trans)
This is the transition table, which has the soft count of the number of transitions across each edge in the model given a single sequence. It is a square matrix of size equal to the number of states (including the start and end states), with the number of transitions from (row_id) to (column_id). This is exemplified by the 1.0 in the first row, indicating that there is one transition from the background state to the end state, as that's the only way to reach the end state. However, the third (or fourth, depending on ordering) row holds the transitions from the start state, and it only slightly favors the background state. These counts are not normalized to the length of the input sequence, but can easily be normalized by dividing by the row sums, column sums, or the sum of the entire table, depending on your application.
One reason not to normalize right away is that you may want to run several sequences through the model, add up their tables, and only normalize at the end before extracting some domain knowledge. This is extremely useful in practice. For example, we can see that there is an expectation of ~2.9 transitions from CG island to background, and ~2.4 from background to CG island. This could be used to infer that the sequence switches between the two regions roughly two to three times in each direction, which makes sense if you consider that the start and end of the sequence look like they might belong to the CG island state, were it not for the strict transition probabilities used (look at the first few rows of the emission table, ems, returned above).
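For example, here is one way to row-normalize the count table so that each row becomes a distribution over the next state (rows that sum to zero, such as the end state's, are left at zero):
# Normalize each row of the soft transition counts into transition probabilities.
row_sums = trans.sum(axis=1, keepdims=True)
normalized = numpy.divide(trans, row_sums, out=numpy.zeros_like(trans), where=row_sums > 0)
print(normalized)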
Let's move on to a more complicated structure, that of a profile HMM. A profile HMM is used to align a sequence to a reference 'profile', where the reference profile can either be a single sequence or an alignment of many sequences (such as a reference genome). In essence, this profile has a 'match' state for every position in the reference profile, an 'insert' state, and a 'delete' state. The insert state allows the external sequence to have an insertion without throwing off the entire alignment, such as the following:
ACCG : Sequence
|| |
AC-G : Reference
or a deletion, which is the opposite:
A-G : Sequence
| |
ACG : Reference
The bars in the middle refer to a perfect match, whereas the lack of a bar means either a deletion/insertion or a mismatch. A mismatch is where two positions are aligned together but do not match. This models the biological phenomenon of mutation, where one nucleotide can convert to another over time. In biological sequences it is usually more likely that this type of mutation occurs than that the nucleotide was deleted from the sequence (shifting all nucleotides over by one) and then another was inserted at the exact same location (moving all nucleotides over again). Since we are using a probabilistic model, we get to define these probabilities through the use of distributions! If we want to model mismatches, we can just give our 'match' states distributions with non-zero probabilities on the mismatched characters.
Let's now create a three-nucleotide profile HMM, which models the sequence 'ACT'. We will fuzz this a little bit in the match states, pretending to have some prior information about what mutations occur at each position. If you don't have any such information, assigning a small, uniform probability to the other characters is usually okay.
In [12]:
model = HiddenMarkovModel( "Global Alignment")
# Define the distribution for insertions
i_d = DiscreteDistribution( { 'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25 } )
# Create the insert states
i0 = State( i_d, name="I0" )
i1 = State( i_d, name="I1" )
i2 = State( i_d, name="I2" )
i3 = State( i_d, name="I3" )
# Create the match states
m1 = State( DiscreteDistribution({ "A": 0.95, 'C': 0.01, 'G': 0.01, 'T': 0.02 }) , name="M1" )
m2 = State( DiscreteDistribution({ "A": 0.003, 'C': 0.99, 'G': 0.003, 'T': 0.004 }) , name="M2" )
m3 = State( DiscreteDistribution({ "A": 0.01, 'C': 0.01, 'G': 0.01, 'T': 0.97 }) , name="M3" )
# Create the delete states
d1 = State( None, name="D1" )
d2 = State( None, name="D2" )
d3 = State( None, name="D3" )
# Add all the states to the model
model.add_states( [i0, i1, i2, i3, m1, m2, m3, d1, d2, d3 ] )
# Create transitions from match states
model.add_transition( model.start, m1, 0.9 )
model.add_transition( model.start, i0, 0.1 )
model.add_transition( m1, m2, 0.9 )
model.add_transition( m1, i1, 0.05 )
model.add_transition( m1, d2, 0.05 )
model.add_transition( m2, m3, 0.9 )
model.add_transition( m2, i2, 0.05 )
model.add_transition( m2, d3, 0.05 )
model.add_transition( m3, model.end, 0.9 )
model.add_transition( m3, i3, 0.1 )
# Create transitions from insert states
model.add_transition( i0, i0, 0.70 )
model.add_transition( i0, d1, 0.15 )
model.add_transition( i0, m1, 0.15 )
model.add_transition( i1, i1, 0.70 )
model.add_transition( i1, d2, 0.15 )
model.add_transition( i1, m2, 0.15 )
model.add_transition( i2, i2, 0.70 )
model.add_transition( i2, d3, 0.15 )
model.add_transition( i2, m3, 0.15 )
model.add_transition( i3, i3, 0.85 )
model.add_transition( i3, model.end, 0.15 )
# Create transitions from delete states
model.add_transition( d1, d2, 0.15 )
model.add_transition( d1, i1, 0.15 )
model.add_transition( d1, m2, 0.70 )
model.add_transition( d2, d3, 0.15 )
model.add_transition( d2, i2, 0.15 )
model.add_transition( d2, m3, 0.70 )
model.add_transition( d3, i3, 0.30 )
model.add_transition( d3, model.end, 0.70 )
# Call bake to finalize the structure of the model.
model.bake()
Now let's try to align some sequences to it and see what happens!
In [13]:
for sequence in map( list, ('ACT', 'GGC', 'GAT', 'ACC') ):
    logp, path = model.viterbi( sequence )
    print("Sequence: '{}' -- Log Probability: {} -- Path: {}".format(
        ''.join( sequence ), logp, " ".join( state.name for idx, state in path[1:-1] ) ))
The first and last sequences are entirely matches, meaning that the model thinks the most likely alignment between the profile ACT and ACT is A-A, C-C, and T-T, which makes sense, and that the most likely alignment between ACT and ACC is A-A, C-C, and T-C, which includes a mismatch. Essentially, it's more likely that there's a T-C mismatch at the end than that there was a deletion of the T at the end of the sequence and a separate insertion of a C.
The two middle sequences don't match very well, as expected! G's are not very likely in this profile at all. For 'GGC', the model predicts that the two G's are inserts and that the C matches the C in the profile, before hitting the delete state because it can't emit a T. For the third sequence, 'GAT', the model treats the G as an insert, as expected, and then aligns the A and T in the sequence to the A and T in the profile, skipping the middle C of the profile.
By using deletes, we can also handle sequences which are shorter than three characters. Let's look at some more sequences of different lengths.
In [14]:
for sequence in map( list, ('A', 'GA', 'AC', 'AT', 'ATCC', 'ACGTG', 'ATTT', 'TACCCTC', 'TGTCAACACT') ):
    logp, path = model.viterbi( sequence )
    print("Sequence: '{}' -- Log Probability: {} -- Path: {}".format(
        ''.join( sequence ), logp, " ".join( state.name for idx, state in path[1:-1] ) ))
Again, more of the same, as expected. You'll notice that most of the insert states used are I0, because most of the insertions are at the beginning of the sequence. It's more probable to simply stay in I0 at the beginning than to go from I0 to D1 to I1, or to hop through another insert state along the way. You'll see other insert states used when insertions occur elsewhere in the sequence, as in 'ATTT' and 'ACGTG'. Now that we have the path, we need to convert it into an alignment, which is significantly more informative to look at.
In [15]:
def path_to_alignment( x, y, path ):
    """
    This function will take in two sequences, and the ML path which is their alignment,
    and insert dashes appropriately to make them appear aligned. This consists only of
    adding a dash to the model sequence for every insert in the path appropriately, and
    a dash in the observed sequence for every delete in the path appropriately.
    """

    for i, (index, state) in enumerate( path[1:-1] ):
        name = state.name

        if name.startswith( 'D' ):
            y = y[:i] + '-' + y[i:]
        elif name.startswith( 'I' ):
            x = x[:i] + '-' + x[i:]

    return x, y

for sequence in map( list, ('A', 'GA', 'AC', 'AT', 'ATCC', 'ACGTG', 'ATTT', 'TACCCTC', 'TGTCAACACT') ):
    logp, path = model.viterbi( sequence )
    x, y = path_to_alignment( 'ACT', ''.join(sequence), path )

    print("Sequence: {}, Log Probability: {}".format( ''.join(sequence), logp ))
    print("{}\n{}".format( x, y ))
    print()
There are two main algorithms for training hidden Markov models: Baum-Welch (a structured version of Expectation-Maximization) and Viterbi training. Since we don't start off with labels on the data, these are both unsupervised training algorithms. In order to assign labels, Baum-Welch uses EM to assign soft labels (weights, in this case) for each point belonging to each state, and then uses weighted MLE estimates to update the distributions. Viterbi training assigns hard labels to each observation using the Viterbi algorithm, and then updates the distributions based on those hard labels.
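As a rough sketch of switching between the two on a toy model (the algorithm keyword on fit is assumed here, with Baum-Welch as the default and 'viterbi' selecting hard-label training; the state and variable names are purely illustrative):
# A tiny two-state model, fit with Baum-Welch (the default) and, alternatively, Viterbi training.
a = State( NormalDistribution( 3, 1 ), name="a" )
b = State( NormalDistribution( 6, 2 ), name="b" )

toy = HiddenMarkovModel()
toy.add_states( [a, b] )
toy.add_transition( toy.start, a, 0.5 )
toy.add_transition( toy.start, b, 0.5 )
toy.add_transition( a, b, 0.5 )
toy.add_transition( b, a, 0.5 )
toy.bake()

toy.fit( [[5, 2, 3, 4], [5, 7, 2, 3, 5]], max_iterations=5 )
# toy.fit( [[5, 2, 3, 4], [5, 7, 2, 3, 5]], max_iterations=5, algorithm='viterbi' )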
pomegranate is extremely well featured when it comes to regularization methods for training, supporting tied emissions and edges, edge and emission inertia, freezing nodes or edges, edge pseudocounts, and multithreaded training. Let's look at some examples.
Sometimes we want to say that multiple states model the same phenomenon, but are simply at different points in the graph because we are utilizing a complicated edge structure. An example is the global alignment HMM we just saw: all insert states represent the same phenomenon, which is nature randomly inserting a nucleotide, and this probability should be the same regardless of position. However, we can't simply have a single insert state, or we'd be allowed to transition from any match state to any other match state.
You can tie emissions together simply by passing the same distribution object to multiple states. That's it.
In [16]:
d = NormalDistribution( 5, 2 )
s1 = State( d, name="Tied1" )
s2 = State( d, name="Tied2" )
s3 = State( NormalDistribution( 5, 2 ), name="NotTied1" )
s4 = State( NormalDistribution( 5, 2 ), name="NotTied2" )
You have now indicated that these two states are tied, and when training, the weights of all points going to s2 will be added to the weights of all points going to s1 when updating d. As a side note, this is implemented in a computationally efficient manner such that d will only be updated once, not twice (but giving the same result). s3 and s4 are not tied together, because while they have the same distribution, it is not the same python object.
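A quick way to see the difference, assuming each State exposes its distribution through a .distribution attribute as in classic pomegranate:
# Tied states share the very same distribution object, so updating one updates both.
print(s1.distribution is s2.distribution)   # True: Tied1 and Tied2 are tied
print(s3.distribution is s4.distribution)   # False: equal parameters, but separate objects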
Edges can be tied together for the same reason. If you have a modular structure to your HMM, perhaps you believe this repeating structure doesn't (or shouldn't) have a position specific edge structure. You can do this simply by adding a group when you add transitions.
In [17]:
model = HiddenMarkovModel()
model.add_states( [s1, s2] )
model.add_transition( model.start, s1, 0.5, group='a' )
model.add_transition( model.start, s2, 0.5, group='b' )
model.add_transition( s1, s2, 0.5, group='a' )
model.add_transition( s2, s1, 0.5, group='b' )
model.bake()
The above model doesn't necessarily make sense, but it shows how simple it is to tie edges as well. You can go ahead and train normally from this point, without needing to change any code.
The next options are inertia on edges or on distributions. This simply means that you update your parameters as previous_parameter * inertia + new_parameter * (1 - inertia). It is a way to prevent your updates from overfitting immediately. You can specify this in the train function using either edge_inertia or distribution_inertia. These default to 0, with 1 being the maximum, meaning that you don't update based on new evidence at all, the same as freezing a distribution or the edges.
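As a tiny worked illustration of the blend (the numbers here are purely illustrative):
# With inertia = 0.3, an old mean of 5.0 and a newly estimated mean of 2.0 blend to 2.9.
inertia = 0.3
old_mean, new_mean = 5.0, 2.0
print(old_mean * inertia + new_mean * (1 - inertia))   # 2.9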
In [18]:
model.fit( [[5, 2, 3, 4], [5, 7, 2, 3, 5]], distribution_inertia=0.3, edge_inertia=0.25 )
Out[18]:
Another way of regularizing your model is to add pseudocounts to your edges (those which have non-zero probabilities). When updating your edges in the future, this pseudocount is added to the count of transitions across that edge. This gives a more Bayesian estimate of the edge probability, and is useful if you have a large model and don't expect to cross most of the edges with your training data. An example might be a complicated profile HMM, where you don't expect to see deletes or inserts at all in your training data, but don't want their probabilities to change from the default values.
In pomegranate, pseudocounts default to the initial probabilities, so that if you don't see data, the edge values simply aren't updated. You can define edge-specific pseudocounts when you define the transition. When you train, you must pass use_pseudocount=True.
In [19]:
s1 = State( NormalDistribution( 3, 1 ), name="s1" )
s2 = State( NormalDistribution( 6, 2 ), name="s2" )
model = HiddenMarkovModel()
model.add_states( [s1, s2] )
model.add_transition( model.start, s1, 0.5, pseudocount=4.2 )
model.add_transition( model.start, s2, 0.5, pseudocount=1.3 )
model.add_transition( s1, s2, 0.5, pseudocount=5.2 )
model.add_transition( s2, s1, 0.5, pseudocount=0.9 )
model.bake()
model.fit( [[5, 2, 3, 4], [5, 7, 2, 3, 5]], max_iterations=5, use_pseudocount=True )
Out[19]:
The other way is to put a blanket pseudocount on all edges.
In [20]:
s1 = State( NormalDistribution( 3, 1 ), name="s1" )
s2 = State( NormalDistribution( 6, 2 ), name="s2" )
model = HiddenMarkovModel()
model.add_states( [s1, s2] )
model.add_transition( model.start, s1, 0.5 )
model.add_transition( model.start, s2, 0.5 )
model.add_transition( s1, s2, 0.5 )
model.add_transition( s2, s1, 0.5 )
model.bake()
model.fit( [[5, 2, 3, 4], [5, 7, 2, 3, 5]], max_iterations=5, transition_pseudocount=20, use_pseudocount=True )
Out[20]:
We can see that there isn't as much of an improvement. That is the point of regularization, though: we sacrifice fitting the data exactly so that our model generalizes better to future data. The majority of the training improvement is likely coming from the emissions better fitting the data, though.
Since pomegranate is implemented in Cython, the majority of functions are written with the GIL released. A benefit of this is that we can use multithreading to make some computationally intensive tasks take less time. However, a downside is that Python doesn't play nicely with multithreading, and so there are some cases where training using multithreading can make your model training take significantly longer. I investigated this in an early multithreading pull request. Things have improved since then, but the gist is that if you have a small model (fewer than ~15 states) multithreading may be detrimental, while the larger your model is, the closer you get to a speedup proportional to the number of threads you use. You can specify multithreading using the n_jobs keyword. All structures in pomegranate are thread safe, so you don't need to worry about race conditions.
In [21]:
s1 = State( NormalDistribution( 3, 1 ), name="s1" )
s2 = State( NormalDistribution( 6, 2 ), name="s2" )
model = HiddenMarkovModel()
model.add_states( [s1, s2] )
model.add_transition( model.start, s1, 0.5 )
model.add_transition( model.start, s2, 0.5 )
model.add_transition( s1, s2, 0.5 )
model.add_transition( s2, s1, 0.5 )
model.bake()
model.fit( [[5, 2, 3, 4, 7, 3, 6, 3, 5, 2, 4], [5, 7, 2, 3, 5, 1, 3, 5, 6, 2]], max_iterations=5 )
Out[21]:
In [22]:
s1 = State( NormalDistribution( 3, 1 ), name="s1" )
s2 = State( NormalDistribution( 6, 2 ), name="s2" )
model = HiddenMarkovModel()
model.add_states( [s1, s2] )
model.add_transition( model.start, s1, 0.5 )
model.add_transition( model.start, s2, 0.5 )
model.add_transition( s1, s2, 0.5 )
model.add_transition( s2, s1, 0.5 )
model.bake()
model.fit( [[5, 2, 3, 4, 7, 3, 6, 3, 5, 2, 4], [5, 7, 2, 3, 5, 1, 3, 5, 6, 2]], max_iterations=5, n_jobs=4 )
Out[22]:
Hidden Markov models support serialization to JSON using to_json() and from_json( json ). This is useful if you want to train a model on a large amount of data, taking a significant amount of time, and then use this model in the future without having to repeat this computationally intensive step (sounds familiar by now). Let's look at the original CG island model, since it's significantly smaller.
In [23]:
seq = list('CGACTACTGACTACTCGCCGACGCGACTGCCGTCTATACTGCGCATACGGC')
d1 = DiscreteDistribution({'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25})
d2 = DiscreteDistribution({'A': 0.10, 'C': 0.40, 'G': 0.40, 'T': 0.10})
s1 = State( d1, name='background' )
s2 = State( d2, name='CG island' )
hmm = HiddenMarkovModel()
hmm.add_states(s1, s2)
hmm.add_transition( hmm.start, s1, 0.5 )
hmm.add_transition( hmm.start, s2, 0.5 )
hmm.add_transition( s1, s1, 0.5 )
hmm.add_transition( s1, s2, 0.5 )
hmm.add_transition( s2, s1, 0.5 )
hmm.add_transition( s2, s2, 0.5 )
hmm.bake()
In [24]:
print(hmm.to_json())
In [25]:
seq = list('CGACTACTGACTACTCGCCGACGCGACTGCCGTCTATACTGCGCATACGGC')
print(hmm.log_probability( seq ))
In [26]:
hmm_2 = HiddenMarkovModel.from_json( hmm.to_json() )
print(hmm_2.log_probability( seq ))
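If you want the model to persist between sessions, you can write the JSON to disk and load it back later; a minimal sketch (the filename is just an example):
# Persist the trained model to disk and reload it in a later session.
with open('cg_island_hmm.json', 'w') as outfile:
    outfile.write(hmm.to_json())

with open('cg_island_hmm.json') as infile:
    hmm_3 = HiddenMarkovModel.from_json(infile.read())

print(hmm_3.log_probability( seq ))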