HMMs are probabilistic sequence classifiers. A sequence classifier is a model whose job is to assign some label or class to each unit in a sequence. Given a sequence of units (words, letters, morphemes, sentences, and so on), its job is to compute a probability distribution over possible label sequences and choose the best one. Writing the observed units as $w_1, w_2, \ldots, w_n$ and their labels as $t_1, t_2, \ldots, t_n$:

  • The $t_i$s are instances of the hidden or latent variable.
  • The $w_i$s are instances of the observable 'output' variable.

For example, in a part of speech tagger, the $w_i$s are the words and the $t_i$s are the parts of speech we wish to assign to these words.

Markov Assumption: the current state $t_i$ depends only on the previous $k$ tags for a $k$-th order HMM. For an HMM of order 1,

$$\begin{equation} P(t_i \mid t_1, t_2, \ldots t_{i-1}) = P(t_i \mid t_{i-1}) \end{equation}$$

Output Independence Assumption: the current output $w_i$ depends only on the current hidden state $t_i$, and not on any other states or outputs:

$$\begin{equation} P(w_i \mid w_1, \ldots, w_{i-1}, t_1, t_2, \ldots t_i) = P(w_i \mid t_i) \end{equation}$$
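
Together, these two assumptions give the standard HMM factorization of the joint probability of a label sequence and an output sequence (with $t_0$ a dummy start symbol, so that $P(t_1 \mid t_0)$ is the initial state probability):

$$\begin{equation} P(w_1, \ldots, w_n, t_1, \ldots, t_n) = \prod_{i=1}^{n} P(t_i \mid t_{i-1}) \, P(w_i \mid t_i) \end{equation}$$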

Components of a Hidden Markov Model

Next, we'll describe the different components of an HMM using the Ice Cream task.

Imagine that you are a climatologist in the year 2799 studying the history of global warming. You cannot find any records of the weather for the summer of 2007, but you do find Jason Eisner’s diary, which lists how many ice creams Jason ate every day that summer. Our goal is to use these observations to estimate the temperature every day. 

We’ll simplify this weather task by assuming there are only two kinds of days: cold (C) and hot (H). Also, Jason cares about his health, so he only eats 1-3 ice creams a day. So the Eisner task is as follows:

Given a sequence of observations, each observation an integer from 1 to 3 corresponding to the number of ice creams eaten on a given day, figure out the correct ‘hidden’ sequence of weather states (H or C) which caused Jason to eat the ice cream.
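
In the standard notation, the HMM for this task is specified by the following components:

  • A set of $N$ hidden states; here, $\{H, C\}$.
  • An observation vocabulary; here, the ice-cream counts $\{1, 2, 3\}$.
  • An initial probability distribution $\pi$, where $\pi_i$ is the probability that the first hidden state is $i$.
  • A transition probability matrix $A$, where $a_{ij} = P(q_t = j \mid q_{t-1} = i)$.
  • An emission probability distribution $B$, where $b_j(o_t) = P(o_t \mid q_t = j)$.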

The 3 HMM Problems

Hidden Markov Models are characterized by three fundamental problems:

Problem 1 (Computing Likelihood): Given an HMM $\lambda = (A, B)$ and an observation sequence O, determine the likelihood $P(O \mid \lambda)$.

Problem 2 (Decoding): Given an HMM $\lambda = (A, B)$ and an observation sequence O, find the most likely hidden state sequence Q.

Problem 3 (Learning): Given an HMM $\lambda = (A, B)$ and an observation sequence O, learn the HMM parameters A and B.

Computing Likelihoods

Let's calculate the probability of the observation sequence {3, 1} from the ice-cream HMM. This is the marginal probability

$P(O) = \sum_Q P(O, Q) = \sum_Q P(O \mid Q) \times P(Q) = \sum_Q \left[ \prod_{i=1}^T P(o_i \mid q_i) \times \prod_{i=1}^T P(q_i \mid q_{i-1}) \right]$

where $P(q_1 \mid q_0)$ is understood to be the initial probability $\pi_{q_1}$.

$P(3,1) = P(3, 1, H, H) + P(3, 1, H, C) + P(3, 1, C, H) + P(3, 1, C, C)$

$P(3, 1, H, H) = P(3 \mid H) \times P(1 \mid H) \times P(H \mid H) \times P(H) = 0.4 \times 0.2 \times 0.7 \times 0.5 = 0.028$

$P(3, 1, H, C) = P(3 \mid H) \times P(1 \mid C) \times P(C \mid H) \times P(H) = 0.4 \times 0.5 \times 0.3 \times 0.5 = 0.03$

$P(3, 1, C, H) = P(3 \mid C) \times P(1 \mid H) \times P(H \mid C) \times P(C) = 0.1 \times 0.2 \times 0.4 \times 0.5 = 0.004$

$P(3, 1, C, C) = P(3 \mid C) \times P(1 \mid C) \times P(C \mid C) \times P(C) = 0.1 \times 0.5 \times 0.6 \times 0.5 = 0.015$

Finally, $P(3, 1) = 0.028 + 0.03 + 0.004 + 0.015 = 0.077$
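
As a quick sanity check, here is a minimal standalone snippet (hard-coding the ice-cream HMM's tables as plain dictionaries, separate from the pandas implementation further below) that enumerates the four hidden sequences and sums their joint probabilities:

pi = {'H': 0.5, 'C': 0.5}
A = {('H', 'H'): 0.7, ('H', 'C'): 0.3, ('C', 'H'): 0.4, ('C', 'C'): 0.6}   # A[(prev, curr)]
B = {'H': {1: 0.2, 2: 0.4, 3: 0.4}, 'C': {1: 0.5, 2: 0.4, 3: 0.1}}

total = 0.0
for q1 in 'HC':
    for q2 in 'HC':
        # P(o1=3, o2=1, q1, q2) = P(q1) * P(3|q1) * P(q2|q1) * P(1|q2)
        total += pi[q1] * B[q1][3] * A[(q1, q2)] * B[q2][1]

print(total)   # ≈ 0.077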

While computing $P(O \mid \lambda)$ this way is simple, for an observation sequence of length T and N possible hidden states, this computation takes $O(T \times N^T)$ time: there are $N^T$ possible hidden state sequences, and each requires $O(T)$ multiplications.

The Dynamic Programming solution to this problem is called the Forward Algorithm.
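
The forward algorithm fills in a table of forward probabilities $\alpha_t(j) = P(o_1, \ldots, o_t, q_t = j \mid \lambda)$, the probability of seeing the first $t$ observations and ending up in state $j$. Each cell is computed from the previous time step:

$$\begin{equation} \alpha_1(j) = \pi_j \, b_j(o_1), \qquad \alpha_t(j) = \sum_{i=1}^{N} \alpha_{t-1}(i) \, a_{ij} \, b_j(o_t) \end{equation}$$

and the total likelihood is $P(O \mid \lambda) = \sum_{j=1}^{N} \alpha_T(j)$.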

What's the complexity of the forward algorithm?

Decoding: The Viterbi Algorithm

Since we already know how to calculate the joint probability $P(O, Q)$ of an observation sequence and a given hidden sequence, the most likely hidden sequence is $\hat{Q} = \operatorname{argmax}_{Q} P(Q \mid O) = \operatorname{argmax}_{Q} P(O, Q)$.

However, there are $N^T$ possible hidden state sequences to evaluate, so this direct evaluation again has exponential time complexity. (The Forward algorithm computes the total likelihood in $O(T \times N^2)$ time only because it sums over all hidden sequences at once; it does not tell us which single sequence is best.)

The solution is to look for the most likely sequence directly using Dynamic Programming.
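
The Viterbi algorithm fills in a table $v_t(j)$, the probability of the best path through the first $t$ observations that ends in state $j$, and stores a backpointer to the previous state on that best path:

$$\begin{equation} v_1(j) = \pi_j \, b_j(o_1), \qquad v_t(j) = \max_{1 \le i \le N} v_{t-1}(i) \, a_{ij} \, b_j(o_t) \end{equation}$$

The most likely state sequence is recovered by taking $\operatorname{argmax}_j v_T(j)$ and then following the backpointers backwards through the table.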

Learning: The Baum-Welch Algorithm

Given only an observation sequence and the set of possible hidden states, the Baum-Welch algorithm (a special case of Expectation-Maximization, built on the forward and backward probabilities) iteratively re-estimates the transition and emission probabilities A and B. We won't implement it here; the `train` method in the code below is left as a stub.

In Code

In the section below, let's try to code a bare-bones HMM implementation for a toy problem.


In [1]:
import numpy as np
import pandas as pd

# Initial probability pi of each hidden (weather) state.
init_probs = pd.Series(index=['HOT', 'COLD'], data=[0.5, 0.5], name='pi')
init_probs


Out[1]:
HOT     0.5
COLD    0.5
Name: pi, dtype: float64

In [2]:
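# Transition probabilities: rows index the current (from) state and columns the next (to) state,
# e.g. trans_probs.loc['HOT', 'COLD'] = P(COLD | HOT) = 0.3.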
trans_probs = pd.DataFrame(index=['HOT', 'COLD'], data={'HOT': [0.7, 0.4], 'COLD': [0.3, 0.6]})
trans_probs


Out[2]:
COLD HOT
HOT 0.3 0.7
COLD 0.6 0.4

In [3]:
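# Emission probabilities P(number of ice creams | weather state); rows are the counts 1-3.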
emission_probs = pd.concat({
    'HOT': pd.Series(index=[1, 2, 3], data=[0.2, 0.4, 0.4]),
    'COLD': pd.Series(index=[1, 2, 3], data=[0.5, 0.4, 0.1]),
}, axis=1)

emission_probs


Out[3]:
COLD HOT
1 0.5 0.2
2 0.4 0.4
3 0.1 0.4

In [4]:
class SimpleHMM:
    
    def __init__(self, init_probs, trans_probs, emission_probs):
        self.pi = init_probs
        self.A = trans_probs
        self.B = emission_probs
        
    def train(self):
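        # Baum-Welch (EM) parameter re-estimation would go here; left unimplemented in this toy example.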
        pass
    
    def forward_probs(self, obs):
        """Calculate forward probability of output sequence `obs`."""
        hidden_states = self.A.columns
        fwd = pd.DataFrame(columns=hidden_states)
        
        for i, o in enumerate(obs):
            for s in hidden_states:
                if i == 0:
                    prob_s = self.pi[s] * self.B[s][o]
                else:
                    prob_s = 0
                    for s_prime in hidden_states:
                        # Transition from s_prime into s, then emit o from s.
                        temp = self.A.loc[s_prime, s] * self.B[s][o]
                        prob_s += fwd.loc[i-1, s_prime] * temp                        
                    
                fwd.loc[i, s] = prob_s
                
        return fwd
    
    def decode(self, obs):
        """Given a sequence of obaservations `obs`, return the most likely sequence of 
            hidden states using the Viterbi algorithm.
        """
        hidden_states = self.A.columns
        n_states = len(hidden_states)
        viterbi_probs = pd.DataFrame(columns=hidden_states)
        
        state_transitions = []
        
        for i, o in enumerate(obs):
            if i == 0:
                for s in hidden_states:                
                    viterbi_probs.loc[0, s] = self.pi[s] * self.B[s][o]
            else:
                for s in hidden_states:
                    current_state_probs = []
                    
                    for s_prime in hidden_states:
                        # Probability of the best path ending in s_prime, then transitioning to s.
                        temp = viterbi_probs.loc[i-1, s_prime] * self.A.loc[s_prime, s]
                        current_state_probs.append([(s_prime, s), temp])
                    
                    current_state_probs = pd.DataFrame(current_state_probs, columns=['transition', 'probs'])
                    current_state_probs = current_state_probs.set_index('transition')
                    
                    max_prob = current_state_probs['probs'].max()
                    max_prob_transition = current_state_probs['probs'].idxmax()  # the (s_prime, s) pair with the highest probability
                    state_transitions.append([i] + list(max_prob_transition))
                    viterbi_probs.loc[i, s] = max_prob * self.B[s][o]
        
        state_transitions = pd.DataFrame(state_transitions, columns=['seq', 'start_state', 'end_state'])
        state_transitions = state_transitions.set_index('seq')
        
        return self._find_most_likely_path(obs, viterbi_probs, state_transitions)
        
    def _find_most_likely_path(self, obs, viterbi_probs, state_transitions):
        """For observed sequence `obs`, use the Viterbi probability and the state transition matrices 
            to backtrace the most likely sequence path."""
        
        # The most likely final state is the one with the highest Viterbi probability at the last step.
        final_state = viterbi_probs.tail(1).T.squeeze().astype(float).idxmax()
        n_obs = len(obs)
        
        end_state = final_state
        states = [end_state, ]
        
        for i in range(n_obs-1, 0, -1):
            possible_trans = state_transitions.loc[i, :]
            # print('For i = %s, end_state = %s' % (i, end_state))
            start_state = possible_trans.loc[possible_trans.end_state == end_state, 'start_state'].values[0]
            # print('For i = %s, start_state=%s' % (i, start_state))
            states.append(start_state)
            end_state = start_state
            
        most_likely_seq = list(reversed(states))
        
        return most_likely_seq

In [5]:
# Use the HMM to calculate the output probability of [3, 1]
hmm = SimpleHMM(init_probs, trans_probs, emission_probs)
fwd_probs = hmm.forward_probs([3, 1])
fwd_probs


Out[5]:
COLD HOT
0 0.05 0.2
1 0.045 0.032

In [6]:
fwd_probs.tail(1).sum(axis=1).values[0]


Out[6]:
0.077

In [7]:
hmm.decode([3, 1, 1, 3])


Out[7]:
['HOT', 'COLD', 'COLD', 'HOT']
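
As a final sanity check, here is a small additional (unexecuted) cell that brute-forces the decoding: it enumerates all $2^4$ hidden state sequences with a small helper (`joint_prob`, defined only for this check) and keeps the one maximizing the joint probability. It should agree with the Viterbi result above.

In [ ]:
# Brute-force check of the decoder: score every hidden state sequence against the
# observations and keep the one with the highest joint probability P(O, Q).
from itertools import product

def joint_prob(obs, states):
    p = init_probs[states[0]] * emission_probs[states[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= trans_probs.loc[states[t-1], states[t]] * emission_probs[states[t]][obs[t]]
    return p

obs = [3, 1, 1, 3]
list(max(product(['HOT', 'COLD'], repeat=len(obs)), key=lambda q: joint_prob(obs, q)))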

In [ ]: