$$ \LaTeX \text{ command declarations here.} \newcommand{\R}{\mathbb{R}} \renewcommand{\vec}[1]{\mathbf{#1}} \newcommand{\X}{\mathcal{X}} \newcommand{\D}{\mathcal{D}} \newcommand{\G}{\mathcal{G}} \newcommand{\L}{\mathcal{L}} \newcommand{\Parents}{\mathrm{Parents}} \newcommand{\NonDesc}{\mathrm{NonDesc}} \newcommand{\I}{\mathcal{I}} \newcommand{\dsep}{\text{d-sep}} \newcommand{\Cat}{\mathrm{Categorical}} \newcommand{\Bin}{\mathrm{Binomial}} $$

HMMs and the Forward-Backward Algorithm

Derivations

Let's start by deriving the forward and backward algorithms from lecture. It's really important that you understand how the recursion comes into play. We covered it in lecture, so please try not to cheat during this exercise; you won't get anything out of it if you don't try!

Review

Recall that a Hidden Markov Model (HMM) is a particular factorization of a joint distribution representing noisy observations $X_k$ generated from a discrete hidden Markov chain $Z_k$: $$ P(\vec{X}, \vec{Z}) = P(Z_1) P(X_1 \mid Z_1) \prod_{k=2}^T P(Z_k \mid Z_{k-1}) P(X_k \mid Z_k) $$

As a Bayesian network, the hidden states form a chain $Z_1 \to Z_2 \to \dots \to Z_T$, with each observation $X_k$ a child of its hidden state $Z_k$.
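
For instance, with $T = 3$ the joint factorization expands to $$ P(\vec{X}, \vec{Z}) = P(Z_1)\,P(X_1 \mid Z_1)\; P(Z_2 \mid Z_1)\,P(X_2 \mid Z_2)\; P(Z_3 \mid Z_2)\,P(X_3 \mid Z_3). $$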

HMM: Parameters

For a Hidden Markov Model with $N$ hidden states and $M$ observed states, there are three row-stochastic parameters $\theta=(A,B,\pi)$,

  • Transition matrix $A \in \R^{N \times N}$ $$ A_{ij} = p(Z_t = j | Z_{t-1} = i) $$
  • Emission matrix $B \in \R^{N \times M}$ $$ B_{jk} = p(X_t = k | Z_t = j) $$
  • Initial distribution $\pi \in \R^N$, $$ \pi_j = p(Z_1 = j) $$
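
To make the shapes and the row-stochastic constraint concrete, here is a small sketch with made-up numbers for a 2-state, 2-symbol HMM (the names A_toy, B_toy, and pi_toy are ours, purely for illustration):

import numpy as np

# toy parameters: N = 2 hidden states, M = 2 observed symbols
A_toy = np.array([[0.9, 0.1],   # A_toy[i, j] = p(Z_t = j | Z_{t-1} = i)
                  [0.2, 0.8]])
B_toy = np.array([[0.7, 0.3],   # B_toy[j, k] = p(X_t = k | Z_t = j)
                  [0.4, 0.6]])
pi_toy = np.array([0.5, 0.5])   # pi_toy[j] = p(Z_1 = j)

# every row of A_toy and B_toy, and pi_toy itself, sums to one
assert np.allclose(A_toy.sum(axis=1), 1.0)
assert np.allclose(B_toy.sum(axis=1), 1.0)
assert np.isclose(pi_toy.sum(), 1.0)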

HMM: Filtering Problem

Filtering means computing the current belief state $p(z_t | x_1, \dots, x_t,\theta)$: $$ p(z_t | x_1,\dots,x_t) = \frac{p(x_1,\dots,x_t,z_t)}{p(x_1,\dots,x_t)} $$

  • Given observations $x_{1:t}$ so far, infer $z_t$.

Solved by the forward algorithm.
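
Note that the denominator is just the numerator summed over $z_t$, $$ p(x_1,\dots,x_t) = \sum_{z_t} p(x_1,\dots,x_t,z_t), $$ so filtering reduces to computing the joint $p(x_1,\dots,x_t,z_t)$ for each value of $z_t$ and then normalizing.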

How do we infer values of hidden variables?

  • One of the most challenging parts of working with HMMs is inferring the values of the hidden variables $z_t$ after having observed all of $x_1, \ldots, x_T$.
  • Computing $p(z_t \mid x_1, \ldots, x_T)$ is known as smoothing. More on this soon.
  • It turns out that this probability can be computed from two other quantities (combined as shown just after this list):
    • $p(x_1,\dots,x_t,z_t)$, which we are going to label $\alpha_t(z_t)$
    • $p(x_{t+1},\dots,x_{T} \mid z_t)$, which we are going to label $\beta_t(z_t)$
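
Concretely, because the observations up to time $t$ and the observations after time $t$ are conditionally independent given $z_t$ (check d-separation on the graph), these two quantities multiply to give the smoothed posterior: $$ p(z_t, x_1, \dots, x_T) = \alpha_t(z_t)\,\beta_t(z_t) \qquad\Rightarrow\qquad p(z_t \mid x_1, \dots, x_T) = \frac{\alpha_t(z_t)\,\beta_t(z_t)}{\sum_{z_t'} \alpha_t(z_t')\,\beta_t(z_t')} $$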

Problem: Derive the Forward Algorithm

The forward algorithm computes $\alpha_t(z_t) \equiv p(x_1,\dots,x_t,z_t)$.

The challenge is to frame this in terms of $\alpha_{t-1}(z_{t-1})$. We'll get you started by marginalizing over "one step back". You need to fill in the rest!

$$ \begin{align} \alpha_t(z_t) &= \sum_{z_{t-1}} p(x_1, \dots, x_t, z_{t-1}, z_t) \\ \dots \\ &= B_{z_t,x_t} \sum_{z_{t-1}} \alpha_{t-1}(z_{t-1}) A_{z_{t-1}, z_t} \end{align} $$

Hints:

  1. You are trying to pull out $\alpha_{t-1}(z_{t-1}) = p(x_1, \dots, x_{t-1}, z_{t-1})$. Can you factor out the terms involving $x_t$ and $z_t$ using the product rule ($P(A,B) = P(A|B)P(B)$)?
  2. Once you do, conditional independence (look for d-separation!) should help you simplify the conditional terms.
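
Once you have the recursion, note its two endpoints: the base case is $\alpha_1(z_1) = p(x_1, z_1) = \pi_{z_1} B_{z_1, x_1}$, and the full observation likelihood falls out at the end as $p(x_1,\dots,x_T) = \sum_{z_T} \alpha_T(z_T)$. These correspond to the initialization and the returned sum in the code further below.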

Problem: Derive the Backward Algorithm

The backward algorithm computes $\beta_t(z_t) \equiv p(x_{t+1},\dots,x_{T} \mid z_t)$.

$$ \begin{align} \beta_t(z_t) &= \sum_{z_{t+1}} p(x_{t+1},\dots,x_{T},z_{t+1} \mid z_t) \\ \dots \\ &= \sum_{z_{t+1}} A_{z_t, z_{t+1}} B_{z_{t+1}, x_{t+1}} \beta_{t+1}(z_{t+1}) \end{align} $$

Similar to deriving the forward algorithm, we've gotten you started by marginalizing over "one step forward". Use applications of the product rule ($P(A,B) = P(A|B)P(B)$) and simplifications from conditional independence to get the rest of the way there.
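
As before, note the endpoints: the base case is $\beta_T(z_T) = 1$ (there are no observations after time $T$ left to explain), and the observation likelihood can be recovered from the other end as $p(x_1,\dots,x_T) = \sum_{z_1} \pi_{z_1} B_{z_1,x_1} \beta_1(z_1)$. Both appear in the code further below.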

Code example: part-of-speech tagger

Now that we're comfortable with the theory behind the forward and backward algorithm, let's set up a real example and implement both procedures.

In this example, we observe a sequence of words backed by a latent part of speech variable.

$X$: discrete distribution over bag of words

$Z$: discrete distribution over parts of speech

  • $A$: the probability of a part of speech given the previous part of speech, e.g., what do we expect to see after a noun?

  • $B$: the distribution of words given a particular part of speech, e.g., what words are we likely to see if we know it is a verb?

$x_i$'s: a sequence of observed words (a sentence). Note: for both variables we have a special "end" outcome that signals the end of a sentence. This makes sense, as a part-of-speech tagger would like to have a sense of sentence boundaries.


In [1]:
import numpy as np

parts_of_speech = DETERMINER, NOUN, VERB, END = 0, 1, 2, 3
# note: the next line re-binds END to the word index 5; the observation
# sequences below use END in that sense (the end *state* keeps index 3)
words = THE, DOG, WALKED, IN, PARK, END = 0, 1, 2, 3, 4, 5

# transition probabilities
A = np.array([
        # D     N   V   E
        [0.1, 0.8, 0.1, 0.0],  # D: determiner most likely to go to noun
        [0.1, 0.1, 0.6, 0.2],  # N: noun most likely to go to verb
        [0.4, 0.3, 0.2, 0.1],  # V 
        [0.0, 0.0, 0.0, 1.0]]) # E: end always goes to end

# distribution of parts of speech for the first word of a sentence
pi = np.array([0.4, 0.3, 0.3, 0.0])

# emission probabilities (each row is a distribution over words and sums to 1)
B = np.array([
        # the   dog  walked  in   park  end
        [ 0.8,  0.1,  0.0,  0.1,  0.0,  0.0], # D
        [ 0.1,  0.8,  0.0,  0.0,  0.1,  0.0], # N
        [ 0.1,  0.1,  0.7,  0.0,  0.1,  0.0], # V
        [ 0.0,  0.0,  0.0,  0.0,  0.0,  1.0]  # E
    ])
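
As a quick sanity check of these parameters, here is a minimal sketch of ancestral sampling from the model (the function sample_sentence, the end_state argument, and the seed are ours, not part of the exercise): alternately emit a word from the current row of B and transition using the current row of A, stopping once the end state (index 3) is reached.

def sample_sentence(pi, A, B, end_state=3, max_len=20, seed=0):
    rng = np.random.default_rng(seed)
    z = rng.choice(len(pi), p=pi)                        # draw the first part of speech
    sentence = []
    for _ in range(max_len):
        sentence.append(rng.choice(B.shape[1], p=B[z]))  # emit a word given the state
        if z == end_state:                               # the end state emits END and stops
            break
        z = rng.choice(A.shape[1], p=A[z])               # transition to the next part of speech
    return sentence

# e.g. sample_sentence(pi, A, B) returns a list of word indices, typically ending in END (5)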

Problem: Implement the Forward Algorithm

Now it's time to put it all together. We create a table to hold the results and build it up from front to back.

Along with the table, we return the marginal probability of the observations, which can be compared with the value returned by the backward algorithm below.


In [2]:
import numpy as np

np.set_printoptions(suppress=True)

def forward(params, observations):
    pi, A, B = params
    N = len(observations)  # number of time steps (T in the math above)
    S = pi.shape[0]        # number of hidden states
    
    alpha = np.zeros((N, S))
    
    # base case
    alpha[0, :] = pi * B[:,observations[0]]
    
    # recursive case - YOUR CODE GOES HERE (fill in alpha[t, :] for t = 1, ..., N-1)
    
    return (alpha, np.sum(alpha[N-1,:]))

forward((pi, A, B), [THE, DOG, WALKED, IN, THE, PARK, END])


Out[2]:
(array([[ 0.32,  0.03,  0.03,  0.  ],
        [ 0.  ,  0.  ,  0.  ,  0.  ],
        [ 0.  ,  0.  ,  0.  ,  0.  ],
        [ 0.  ,  0.  ,  0.  ,  0.  ],
        [ 0.  ,  0.  ,  0.  ,  0.  ],
        [ 0.  ,  0.  ,  0.  ,  0.  ],
        [ 0.  ,  0.  ,  0.  ,  0.  ]]), 0.0)

Problem: Implement the Backward Algorithm

If you implemented both correctly, the second return value from each function should match: both equal the marginal likelihood $p(x_1, \dots, x_T)$ of the observed sentence.


In [3]:
def backward(params, observations):
    pi, A, B = params
    N = len(observations)  # number of time steps (T in the math above)
    S = pi.shape[0]        # number of hidden states
    
    beta = np.zeros((N, S))
    
    # base case
    beta[N-1, :] = 1
    
    # recursive case -- YOUR CODE GOES HERE! (fill in beta[t, :] for t = N-2, ..., 0)
    
    return (beta, np.sum(pi * B[:, observations[0]] * beta[0,:]))

backward((pi, A, B), [THE, DOG, WALKED, IN, THE, PARK, END])


Out[3]:
(array([[ 0.,  0.,  0.,  0.],
        [ 0.,  0.,  0.,  0.],
        [ 0.,  0.,  0.,  0.],
        [ 0.,  0.,  0.,  0.],
        [ 0.,  0.,  0.,  0.],
        [ 0.,  0.,  0.,  0.],
        [ 1.,  1.,  1.,  1.]]), 0.0)
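
Finally, tying back to the smoothing discussion above, here is a sketch of how forward and backward combine into the smoothed posteriors $p(z_t \mid x_1,\dots,x_T)$ (the name smoothed_posteriors is ours; it only produces meaningful numbers once both recursions above have been filled in, since with the stubs as given the normalizer is zero):

def smoothed_posteriors(params, observations):
    alpha, _ = forward(params, observations)
    beta, _ = backward(params, observations)
    gamma = alpha * beta                               # proportional to p(z_t, x_1..x_T)
    return gamma / gamma.sum(axis=1, keepdims=True)    # normalize at each time step

# smoothed_posteriors((pi, A, B), [THE, DOG, WALKED, IN, THE, PARK, END])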