HMMs are probabilistic sequence classifiers. A sequence classifier is a model whose job is to assign some label or class to each unit in a sequence. Given a sequence of units (words, letters, morphemes, sentences, whatever) their job is to compute a probability distribution over possible labels and choose the best label sequence.
For example, in a part of speech tagger, the $w_i$s are the words and the $t_i$s are the parts of speech we wish to assign to these words.
Markov Assumption: Current state $t_i$ only depends on previous k tags for a k-th order HMM. Assuming an HMM of order 1,
$$\begin{equation} P(t_i \mid t_1, t_2, \ldots t_{i-1}) = P(t_i \mid t_{i-1}) \end{equation}$$Output Independence Assumption: Current output state $w_i$ only depends on the current hidden state $t_i$. Assuming an HMM of order 1,
$$\begin{equation} P(w_1, w_2,\ldots w_i, t_1, t_2, \ldots t_i) = P(w_i \mid t_i) \end{equation}$$Next, we'll describe the different components of an HMM using the Ice Cream task.
Imagine that you are a climatologist in the year 2799 studying the history of global warming. You cannot find any records of the weather for the summer of 2007, but you do find Jason Eisner’s diary, which lists how many ice creams Jason ate every day that summer. Our goal is to use these observations to estimate the temperature every day.
We’ll simplify this weather task by assuming there are only two kinds of days: cold (C) and hot (H). Also, Jason cares about his health so he only eats 1-3 icecreams. So the Eisner task is as follows:
Given a sequence of observations, each observation an integer between 1-3, corresponding to the number of ice creams eaten on a given day, figure out the correct ‘hidden’ sequence of weather states (H or C) which caused Jason to eat the ice cream.
Hidden Markov Models are characterized by three fundamental problems:
Problem 1 (Computing Likelihood): Given an HMM $\lambda = (A, B)$ and an observation sequence O, determine the likelihood $P(O \mid \lambda)$.
Problem 2 (Decoding): Given an HMM $\lambda = (A, B)$ and an observation sequence O, find the most likely sequence Q.
Problem 3 (Learning): Given an HMM $\lambda = (A, B)$ and an observation sequence O, learn the HMM parameters A and B.
Let's calculate the probability of the observation equence {3, 1} from the ice-cream HMM. This is the marginal probability
$P(O) = \sum_Q P(O, Q) = P(O \mid Q) \times P(Q) = \sum_Q [\prod_{i=1}^T P(o_i \mid q_i) \times \prod_{i=1}^T P(q_i | q_{i-1})] $
$P(3,1) = P(3, 1, H, H) + P(3, 1, H, C) + P(3, 1, C, H) + P(3, 1, C, C)$
$P(3, 1, H, H) = P(3 \mid H) \times P(1 \mid H) \times P(H \mid H) \times P(H) = 0.4 \times 0.2 \times 0.7 \times 0.5 = 0.028$
$P(3, 1, H, C) = P(3 \mid H) \times P(1 \mid C) \times P(H \mid C) \times P(H) = 0.4 \times 0.5 \times 0.4 \times 0.5 = 0.04$
$P(3, 1, C, H) = P(3 \mid C) \times P(1 \mid H) \times P(C \mid H) \times P(C) = 0.1 \times 0.2 \times 0.3 \times 0.5 = 0.003$
$P(3, 1, C, C) = P(3 \mid C) \times P(1 \mid C) \times P(C \mid C) \times P(C) = 0.1 \times 0.5 \times 0.6 \times 0.5 = 0.015$
Finally, $P(3, 1) = 0.028 + 0.04 + 0.003 + 0.015 = 0.086$
While computing $P(O \mid \lambda)$ is this way is simple, for an HMM with T steps and N values for the hidden state, this computation take $O(T \times N^T)$ time.
The Dynamic Programming solution to this problem is called the Forward Algorithm.
What's the complexity of the forward algorithm?
Since we already know how to calculate the likelihood of a given observation sequence $P(O \mid Q)$, the corresponding most likely hidden sequence is $argmax_{Q} P(O \mid Q)$
However, since the Forward algorithm takes $O(T \times N^2)$ time to calculate the likelihood of a given observation sequence and there are $T^N$ subsequences to evaluate, the direct evaluation method again has exponential time complexity.
The solution is to look for the most likely sequence directly using Dynamic Programming.
The example and diagrams are from Speech and Language Processing: An introduction to natural language processing, computational linguistics, and speech recognition. Daniel Jurafsky & James H. Martin.
In [1]:
import numpy as np
import pandas as pd
init_probs = pd.Series(index=['HOT', 'COLD'], data=[0.5, 0.5], name='pi')
init_probs
Out[1]:
In [2]:
trans_probs = pd.DataFrame(index=['HOT', 'COLD'], data={'HOT': [0.7, 0.4], 'COLD': [0.3, 0.6]})
trans_probs
Out[2]:
In [3]:
emission_probs = pd.concat({
'HOT': pd.Series(index=[1, 2, 3], data=[0.2, 0.4, 0.4]),
'COLD': pd.Series(index=[1, 2, 3], data=[0.5, 0.4, 0.1]),
}, axis=1)
emission_probs
Out[3]:
In [4]:
class SimpleHMM:
def __init__(self, init_probs, trans_probs, emission_probs):
self.pi = init_probs
self.A = trans_probs
self.B = emission_probs
def train(self):
pass
def forward_probs(self, obs):
"""Calculate forward probability of output sequence `obs`."""
hidden_states = self.A.columns
fwd = pd.DataFrame(columns=hidden_states)
for i, o in enumerate(obs):
for s in hidden_states:
if i == 0:
prob_s = self.pi[s] * self.B[s][o]
else:
prob_s = 0
for s_prime in hidden_states:
temp = self.A[s_prime][s] * self.B[s][o]
prob_s += fwd.loc[i-1, s_prime] * temp
fwd.loc[i, s] = prob_s
return fwd
def decode(self, obs):
"""Given a sequence of obaservations `obs`, return the most likely sequence of
hidden states using the Viterbi algorithm.
"""
hidden_states = self.A.columns
n_states = len(hidden_states)
viterbi_probs = pd.DataFrame(columns=hidden_states)
state_transitions = []
for i, o in enumerate(obs):
if i == 0:
for s in hidden_states:
viterbi_probs.loc[0, s] = self.pi[s] * self.B[s][o]
else:
for s in hidden_states:
current_state_probs = []
for s_prime in hidden_states:
temp = viterbi_probs.loc[i-1, s_prime] * self.A[s_prime][s]
current_state_probs.append([(s_prime, s), temp])
current_state_probs = pd.DataFrame(current_state_probs, columns=['transition', 'probs'])
current_state_probs = current_state_probs.set_index('transition')
max_prob = current_state_probs['probs'].max()
max_prob_transition = current_state_probs['probs'].argmax()
state_transitions.append([i] + list(max_prob_transition))
viterbi_probs.loc[i, s] = max_prob * self.B[s][o]
state_transitions = pd.DataFrame(state_transitions, columns=['seq', 'start_state', 'end_state'])
state_transitions = state_transitions.set_index('seq')
return self._find_most_likely_path(obs, viterbi_probs, state_transitions)
def _find_most_likely_path(self, obs, viterbi_probs, state_transitions):
"""For observed sequence `obs`, use the Viterbi probability and the state transition matrices
to backtrace the most likely sequence path."""
final_state = viterbi_probs.tail(1).T.squeeze().argmax()
n_obs = len(obs)
end_state = final_state
states = [end_state, ]
for i in range(n_obs-1, 0, -1):
possible_trans = state_transitions.loc[i, :]
# print('For i = %s, end_state = %s' % (i, end_state))
start_state = possible_trans.loc[possible_trans.end_state == end_state, 'start_state'].values[0]
# print('For i = %s, start_state=%s' % (i, start_state))
states.append(start_state)
end_state = start_state
most_likely_seq = list(reversed(states))
return most_likely_seq
In [5]:
# Use the HMM to calculate the output probability of [3, 1]
hmm = SimpleHMM(init_probs, trans_probs, emission_probs)
fwd_probs = hmm.forward_probs([3, 1])
fwd_probs
Out[5]:
In [6]:
fwd_probs.tail(1).sum(axis=1).values[0]
Out[6]:
In [7]:
hmm.decode([3, 1, 1, 3])
Out[7]:
In [ ]: