Let's start by deriving the forward and backward algorithms from lecture. It's really important that you understand how the recursion comes into play. We covered it in lecture, but please try not to cheat during this exercise: you won't get anything out of it if you don't try!
Recall that a Hidden Markov Model (HMM) is a particular factorization of a joint distribution representing noisy observations $X_k$ generated from a discrete hidden Markov chain $Z_k$: $$ P(\vec{X}, \vec{Z}) = P(Z_1) P(X_1 \mid Z_1) \prod_{k=2}^T P(Z_k \mid Z_{k-1}) P(X_k \mid Z_k) $$
and as a Bayesian network: a chain $Z_1 \rightarrow Z_2 \rightarrow \cdots \rightarrow Z_T$ over the hidden states, with each observation $X_k$ a child of its own $Z_k$.
For a Hidden Markov Model with $N$ hidden states and $M$ observed symbols, there are three row-stochastic parameters $\theta=(A,B,\pi)$:

$A$: the $N \times N$ transition matrix, $A_{ij} = P(Z_k = j \mid Z_{k-1} = i)$

$B$: the $N \times M$ emission matrix, $B_{ij} = P(X_k = j \mid Z_k = i)$

$\pi$: the length-$N$ initial state distribution, $\pi_i = P(Z_1 = i)$
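Before deriving anything, it can help to see the factorization as code. Here is a minimal sketch that evaluates $P(\vec{X}, \vec{Z})$ for a fixed pair of sequences by multiplying the factors in the product above. The two-state parameters `pi_toy`, `A_toy`, `B_toy` and the helper `joint_probability` are made up purely for illustration and are unrelated to the part-of-speech model below.

```python
import numpy as np

# A tiny made-up HMM, just to show the factorization in action.
pi_toy = np.array([0.6, 0.4])            # pi[i]   = P(Z_1 = i)
A_toy = np.array([[0.7, 0.3],
                  [0.4, 0.6]])           # A[i, j] = P(Z_k = j | Z_{k-1} = i)
B_toy = np.array([[0.9, 0.1],
                  [0.2, 0.8]])           # B[i, j] = P(X_k = j | Z_k = i)

def joint_probability(pi, A, B, x, z):
    """P(x, z) computed directly from the HMM factorization."""
    p = pi[z[0]] * B[z[0], x[0]]
    for k in range(1, len(x)):
        p *= A[z[k - 1], z[k]] * B[z[k], x[k]]
    return p

print(joint_probability(pi_toy, A_toy, B_toy, x=[0, 1, 1], z=[0, 0, 1]))
```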
The forward algorithm computes $\alpha_t(z_t) \equiv p(x_1,\dots,x_t,z_t)$.
The challenge is to frame this in terms of $\alpha_{t-1}(z_{t-1})$. We'll get you started by marginalizing over "one step back". You need to fill in the rest!
$$ \begin{align} \alpha_t(z_t) &= \sum_{z_{t-1}} p(x_1, \dots, x_t, z_{t-1}, z_t) \\ \dots \\ &= B_{z_t,x_t} \sum_{z_{t-1}} \alpha_{t-1}(z_{t-1}) A_{z_{t-1}, z_t} \end{align} $$

Hints: use the product rule $P(A,B) = P(A \mid B)P(B)$ and the conditional independences implied by the graphical model to simplify the summand.
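If you get stuck, these are the two Markov-property identities the elided steps rely on (stated here without proof; putting them together is still your job):

$$ \begin{align} p(x_t \mid x_1, \dots, x_{t-1}, z_{t-1}, z_t) &= p(x_t \mid z_t) = B_{z_t, x_t} \\ p(z_t \mid x_1, \dots, x_{t-1}, z_{t-1}) &= p(z_t \mid z_{t-1}) = A_{z_{t-1}, z_t} \end{align} $$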
The backward algorithm computes $\beta_t(z_t) \equiv p(x_{t+1},\dots,x_{T} \mid z_t)$.

$$ \begin{align} \beta_t(z_t) &= \sum_{z_{t+1}} p(x_{t+1},\dots,x_{T},z_{t+1} \mid z_t) \\ \dots \\ &= \sum_{z_{t+1}} A_{z_t, z_{t+1}} B_{z_{t+1}, x_{t+1}} \beta_{t+1}(z_{t+1}) \end{align} $$

Similar to deriving the forward algorithm, we've gotten you started by marginalizing over "one step forward". Use applications of the product rule $P(A,B) = P(A \mid B)P(B)$ and simplifications from conditional independence to get the rest of the way there.
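As with the forward case, the conditional independences you will need are the following (stated without proof):

$$ \begin{align} p(x_{t+1}, \dots, x_T \mid z_t, z_{t+1}) &= p(x_{t+1}, \dots, x_T \mid z_{t+1}) \\ p(x_{t+1}, x_{t+2}, \dots, x_T \mid z_{t+1}) &= p(x_{t+1} \mid z_{t+1}) \, p(x_{t+2}, \dots, x_T \mid z_{t+1}) \end{align} $$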
As a concrete example, consider a part-of-speech tagger:

$X$: a discrete distribution over a bag of words

$Z$: a discrete distribution over parts of speech

$A$: the probability of a part of speech given the previous part of speech, e.g., what do we expect to see after a noun?

$B$: the distribution of words given a particular part of speech, e.g., what words are we likely to see if we know it is a verb?

$x_i$'s: a sequence of observed words (a sentence). Note: for both variables we have a special "end" outcome that signals the end of a sentence. This makes sense, as a part-of-speech tagger would like to have a sense of sentence boundaries.
In [1]:
import numpy as np
# hidden state indices (parts of speech)
parts_of_speech = DETERMINER, NOUN, VERB, END = 0, 1, 2, 3
# observed symbol indices (words); note this re-binds the name END to the word
# index 5, while the end-of-sentence part of speech stays at row/column 3
words = THE, DOG, WALKED, IN, PARK, END = 0, 1, 2, 3, 4, 5
# transition probabilities
A = np.array([
# D N V E
[0.1, 0.8, 0.1, 0.0], # D: determiner most likely to go to noun
[0.1, 0.1, 0.6, 0.2], # N: noun most likely to go to verb
[0.4, 0.3, 0.2, 0.1], # V
[0.0, 0.0, 0.0, 1.0]]) # E: end always goes to end
# distribution of parts of speech for the first word of a sentence
pi = np.array([0.4, 0.3, 0.3, 0.0])
# emission probabilities
B = np.array([
    #  the   dog  walked   in  park   end
    [ 0.8,  0.1,   0.0,  0.1,  0.0,  0.0],  # D: determiners are almost always "the"
    [ 0.1,  0.8,   0.0,  0.0,  0.1,  0.0],  # N: nouns are mostly "dog", sometimes "park"
    [ 0.1,  0.1,   0.7,  0.0,  0.1,  0.0],  # V: verbs are mostly "walked"
    [ 0.0,  0.0,   0.0,  0.0,  0.0,  1.0]   # E: the end state emits only the end token
    ])
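Before moving on, it is worth verifying that these arrays really are row-stochastic, as claimed above. This short check assumes the `A`, `B`, and `pi` arrays defined in the cell above:

```python
# Each row of A and B, and pi itself, is a conditional distribution and should sum to 1.
print("A rows sum to 1:", np.allclose(A.sum(axis=1), 1.0))
print("B rows sum to 1:", np.allclose(B.sum(axis=1), 1.0))
print("pi sums to 1:   ", np.isclose(pi.sum(), 1.0))
```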
In [2]:
import numpy as np
np.set_printoptions(suppress=True)
def forward(params, observations):
    pi, A, B = params
    N = len(observations)
    S = pi.shape[0]
    alpha = np.zeros((N, S))
    # base case: alpha_1(z) = pi_z * B[z, x_1]
    alpha[0, :] = pi * B[:, observations[0]]
    # recursive case - YOUR CODE GOES HERE
    return (alpha, np.sum(alpha[N-1, :]))
forward((pi, A, B), [THE, DOG, WALKED, IN, THE, PARK, END])
Out[2]:
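Once you have filled in the recursive case yourself, you may want something to compare against. The sketch below is one possible completion (the name `forward_reference` is ours, and it assumes `pi`, `A`, `B`, and the word constants from the cells above); it implements the recursion $\alpha_t(z_t) = B_{z_t,x_t} \sum_{z_{t-1}} \alpha_{t-1}(z_{t-1}) A_{z_{t-1}, z_t}$ derived earlier, vectorizing the sum over $z_{t-1}$ with a vector-matrix product.

```python
def forward_reference(params, observations):
    """One possible completion of the forward recursion, for checking your work."""
    pi, A, B = params
    N = len(observations)
    S = pi.shape[0]
    alpha = np.zeros((N, S))
    # base case: alpha_1(z) = pi_z * B[z, x_1]
    alpha[0, :] = pi * B[:, observations[0]]
    # recursive case: alpha_t(z) = B[z, x_t] * sum_{z'} alpha_{t-1}(z') * A[z', z]
    for t in range(1, N):
        alpha[t, :] = B[:, observations[t]] * (alpha[t - 1, :] @ A)
    return alpha, np.sum(alpha[N - 1, :])

alpha_ref, total_ref = forward_reference((pi, A, B), [THE, DOG, WALKED, IN, THE, PARK, END])
print(total_ref)
```

The product `alpha[t - 1, :] @ A` computes the sum over $z_{t-1}$ for every value of $z_t$ at once, which is why no explicit inner loop over states is needed.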
In [3]:
def backward(params, observations):
    pi, A, B = params
    N = len(observations)
    S = pi.shape[0]
    beta = np.zeros((N, S))
    # base case: beta_T(z) = 1
    beta[N-1, :] = 1
    # recursive case -- YOUR CODE GOES HERE!
    return (beta, np.sum(pi * B[:, observations[0]] * beta[0, :]))
backward((pi, A, B), [THE, DOG, WALKED, IN, THE, PARK, END])
Out[3]:
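Likewise, here is one possible completion of the backward recursion (again a sketch, assuming the arrays defined earlier and the `forward_reference` sketch above). A useful sanity check is that the forward and backward totals both equal the marginal likelihood $p(x_1, \dots, x_T)$, so they should agree.

```python
def backward_reference(params, observations):
    """One possible completion of the backward recursion, for checking your work."""
    pi, A, B = params
    N = len(observations)
    S = pi.shape[0]
    beta = np.zeros((N, S))
    # base case: beta_T(z) = 1
    beta[N - 1, :] = 1
    # recursive case: beta_t(z) = sum_{z'} A[z, z'] * B[z', x_{t+1}] * beta_{t+1}(z')
    for t in range(N - 2, -1, -1):
        beta[t, :] = A @ (B[:, observations[t + 1]] * beta[t + 1, :])
    return beta, np.sum(pi * B[:, observations[0]] * beta[0, :])

obs = [THE, DOG, WALKED, IN, THE, PARK, END]
# Both recursions compute p(x_1, ..., x_T), so the two totals should match.
print(forward_reference((pi, A, B), obs)[1])
print(backward_reference((pi, A, B), obs)[1])
```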