In this class, we relax the assumption that the data points are independently and identically distributed (i.i.d.) by moving to a scenario of structured prediction, where the inputs are assumed to have temporal or spatial dependencies. We start by considering sequential models, which correspond to a chain structure: for instance, the words in a sentence. In this lecture, we will use part-of-speech tagging as our example task. We start by defining the notation for this lecture in Section 2.1. Afterwards, in Section 2.2, we focus on the well-known Hidden Markov Model, and in Section 2.3 we describe how to estimate its parameters from labeled data. In Section 2.4 we explain the inference algorithms (Viterbi and Forward-Backward) for sequence models. These inference algorithms will be fundamental for the rest of this lecture, as well as for the next lecture on discriminative training of sequence models. In Section 2.6 we describe the task of part-of-speech tagging and how Hidden Markov Models are suitable for it. Finally, in Section 2.7 we address unsupervised learning of Hidden Markov Models through the Expectation-Maximization algorithm.
Hidden Markov Models (HMMs) are one of the most common probabilistic models for sequences, and have been applied to a wide variety of tasks. HMMs are particular instances of directed probabilistic graphical models (or Bayesian networks) which have a chain topology. In a Bayesian network, every random variable is represented as a node in a graph, and the edges in the graph are directed and represent probabilistic dependencies between the random variables. For an HMM, the random variables are divided into two sets: the observed variables, X = X_1, ..., X_N, and the hidden variables, Y = Y_1, ..., Y_N. In HMM terminology, the observed variables are called observations and the hidden variables are called states. The states are generated according to a first-order Markov process, in which the i-th state Y_i depends only on the previous state Y_{i-1}. Two special states are the start symbol, which starts the sequence, and the stop symbol, which ends it. In addition, states emit observation symbols, and it is assumed that all observations are independent given the states that generated them. As you may find out with today's assignment, implementing the inference routines of the HMM can be challenging. We therefore start with a small and very simple (also very unrealistic!) example, so that you can compute the desired quantities by hand and check whether your implementation yields the correct result.
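Concretely, writing P_init, P_trans, P_final and P_emiss for the four sets of parameters estimated in the exercises below (initial, transition, final and emission probabilities), the joint distribution of an observation sequence x = x_1 ... x_N and a state sequence y = y_1 ... y_N factorizes as

P(X = x, Y = y) = P_init(y_1 | start) × ∏_{i=2..N} P_trans(y_i | y_{i-1}) × P_final(stop | y_N) × ∏_{i=1..N} P_emiss(x_i | y_i).

This joint probability, in log form, is the quantity log P(X = x, Y = y) referred to throughout the exercises.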
Exercise 2.1 Load the simple sequence dataset. From the IPython command line, create a simple sequence object and look at the training and test sets.
In [9]:
import sys
sys.path.append('../../../')
import lxmls.readers.simple_sequence as ssr
simple = ssr.SimpleSequence()
print simple.train
In [2]:
print simple.test
In [3]:
for sequence in simple.train.seq_list:
    print sequence
In [4]:
for sequence in simple.train.seq_list:
    print sequence.x, sequence.y
Exercise 2.2 The provided function train_supervised from the hmm.py file implements the above parameter estimates. Run this function on the simple dataset above and look at the estimated probabilities. Are they correct? You can also check the variables ending in _counts instead of _probs to see the raw counts (for example, typing hmm.initial_counts will show you the raw counts of initial states). How are the counts related to the probabilities?
In [10]:
def update_counts(self, sequence, state_posteriors, transition_posteriors):
    """ Used in the E-step in EM."""

    ##########################
    # Solution to Exercise 2.2
    ##########################
    num_states = self.get_num_states()  # Number of states.
    length = len(sequence.x)  # Length of the sequence.

    # Take care of initial probs
    for y in xrange(num_states):
        self.initial_counts[y] += state_posteriors[0, y]
    for pos in xrange(length):
        x = sequence.x[pos]
        for y in xrange(num_states):
            self.emission_counts[x, y] += state_posteriors[pos, y]
            if pos > 0:
                for y_prev in xrange(num_states):
                    self.transition_counts[y, y_prev] += transition_posteriors[pos-1, y, y_prev]
    # Final position
    for y in xrange(num_states):
        self.final_counts[y] += state_posteriors[length-1, y]

    #################################
    # End of solution to Exercise 2.2
    #################################
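In equation form, the updates above accumulate, for a single sequence x of length N and the current model parameters, the expected counts

initial_counts[y] += P(Y_1 = y | X = x)
emission_counts[x_i, y] += P(Y_i = y | X = x), for every position i
transition_counts[y, y'] += P(Y_i = y, Y_{i-1} = y' | X = x), for every position i > 1
final_counts[y] += P(Y_N = y | X = x)

where the state posteriors P(Y_i = y | X = x) and the transition posteriors P(Y_i = y, Y_{i-1} = y' | X = x) are the quantities computed by the Forward-Backward algorithm.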
In [11]:
import lxmls.sequences.hmm as hmmc
hmm = hmmc.HMM(simple.x_dict, simple.y_dict)
hmm.train_supervised(simple.train)
print "Initial Probabilities:", hmm.initial_probs
In [12]:
print "Transition Probabilities:", hmm.transition_probs
In [13]:
print "Final Probabilities:", hmm.final_probs
In [14]:
print "Emission Probabilities", hmm.emission_probs
Exercise 2.3 Convince yourself that the score of a path in the trellis (summing over the scores above) is equivalent to the log-probability log P(X = x, Y = y), as defined in Eq. 2.2. Use the given function compute_scores on the first training sequence and confirm that the values are correct. You should get the same values as presented below.
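Since the expected output is not reproduced here, a possible way to confirm the values is to sum the scores along the gold path yourself and compare the result with the log-probability of the sequence. The sketch below assumes that compute_scores returns the four log-score arrays in the same order in which run_viterbi (Exercise 2.8) consumes them, namely initial, transition, final and emission scores:

seq = simple.train.seq_list[0]
initial_scores, transition_scores, final_scores, emission_scores = hmm.compute_scores(seq)
# Accumulate the scores along the gold label sequence seq.y.
score = initial_scores[seq.y[0]] + emission_scores[0, seq.y[0]]
for pos in xrange(1, len(seq.x)):
    score += transition_scores[pos-1, seq.y[pos], seq.y[pos-1]]
    score += emission_scores[pos, seq.y[pos]]
score += final_scores[seq.y[-1]]
print score  # Should equal log P(X = x, Y = y) for this sequence.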
Exercise 2.8 Implement a method for performing Viterbi decoding in the file sequence_classification_decoder.py.
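In log space, the recursion implemented below keeps, for each position i and state y, the score of the best partial path that ends in state y at position i:

viterbi_scores[i, y] = emission_scores[i, y] + max_{y'} ( viterbi_scores[i-1, y'] + transition_scores[i-1, y, y'] ),

initialized with viterbi_scores[0, y] = emission_scores[0, y] + initial_scores[y], and terminated by adding final_scores[y] before taking the last maximization. The array viterbi_paths stores the argmax at each step, so that the best state sequence can be recovered by backtracking from the last position.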
In [16]:
def run_viterbi(self, initial_scores, transition_scores, final_scores, emission_scores):

    ##########################
    # Solution to Exercise 2.8
    ##########################
    length = np.size(emission_scores, 0)  # Length of the sequence.
    num_states = np.size(initial_scores)  # Number of states.

    # Variables storing the Viterbi scores.
    viterbi_scores = np.zeros([length, num_states]) + logzero()
    # Variables storing the paths to backtrack.
    viterbi_paths = -np.ones([length, num_states], dtype=int)
    # Most likely sequence.
    best_path = -np.ones(length, dtype=int)

    # Initialization.
    viterbi_scores[0, :] = emission_scores[0, :] + initial_scores

    # Viterbi loop.
    for pos in xrange(1, length):
        for current_state in xrange(num_states):
            viterbi_scores[pos, current_state] = \
                np.max(viterbi_scores[pos-1, :] + transition_scores[pos-1, current_state, :])
            viterbi_scores[pos, current_state] += emission_scores[pos, current_state]
            viterbi_paths[pos, current_state] = \
                np.argmax(viterbi_scores[pos-1, :] + transition_scores[pos-1, current_state, :])

    # Termination.
    best_score = np.max(viterbi_scores[length-1, :] + final_scores)
    best_path[length-1] = np.argmax(viterbi_scores[length-1, :] + final_scores)

    # Backtrack.
    for pos in xrange(length-2, -1, -1):
        best_path[pos] = viterbi_paths[pos+1, best_path[pos+1]]

    return best_path, best_score

    #################################
    # End of solution to Exercise 2.8
    #################################
In [17]:
y_pred, score = hmm.viterbi_decode(simple.test.seq_list[0])
print "Viterbi decoding Prediction test 0 with smoothing"
print y_pred, score
print "Truth test 0"
print simple.test.seq_list[0]
y_pred, score = hmm.viterbi_decode(simple.test.seq_list[1])
print "Viterbi decoding Prediction test 1 with smoothing"
print y_pred, score
print "Truth test 1"
print simple.test.seq_list[1]
Exercise 2.9 Test the model using both posterior decoding and Viterbi decoding on both the train and test sets, using the methods in class HMM (the difference between the two decoding criteria is summarized after the code below):
In [21]:
import lxmls.readers.pos_corpus as pcc
import lxmls.sequences.confusion_matrix as cm
corpus = pcc.PostagCorpus()
train_seq = corpus.read_sequence_list_conll("../../../data/train-02-21.conll", max_sent_len=15, max_nr_sent=1000)
test_seq = corpus.read_sequence_list_conll("../../../data/test-23.conll", max_sent_len=15, max_nr_sent=1000)
dev_seq = corpus.read_sequence_list_conll("../../../data/dev-22.conll", max_sent_len=15, max_nr_sent=1000)
hmm = hmmc.HMM(corpus.word_dict, corpus.tag_dict)
hmm.train_supervised(train_seq)
hmm.print_transition_matrix()
viterbi_pred_train = hmm.viterbi_decode_corpus(train_seq)
posterior_pred_train = hmm.posterior_decode_corpus(train_seq)
eval_viterbi_train = hmm.evaluate_corpus(train_seq, viterbi_pred_train)
eval_posterior_train = hmm.evaluate_corpus(train_seq, posterior_pred_train)
print "Train Set Accuracy: Posterior Decode %.3f, Viterbi Decode: %.3f" % (eval_posterior_train, eval_viterbi_train)
viterbi_pred_test = hmm.viterbi_decode_corpus(test_seq)
posterior_pred_test = hmm.posterior_decode_corpus(test_seq)
eval_viterbi_test = hmm.evaluate_corpus(test_seq, viterbi_pred_test)
eval_posterior_test = hmm.evaluate_corpus(test_seq, posterior_pred_test)
print "Test Set Accuracy: Posterior Decode %.3f, Viterbi Decode: %.3f" % (eval_posterior_test, eval_viterbi_test)
best_smoothing = hmm.pick_best_smoothing(train_seq, dev_seq, [10, 1, 0.1, 0])
hmm.train_supervised(train_seq, smoothing=best_smoothing)
viterbi_pred_test = hmm.viterbi_decode_corpus(test_seq)
posterior_pred_test = hmm.posterior_decode_corpus(test_seq)
eval_viterbi_test = hmm.evaluate_corpus(test_seq, viterbi_pred_test)
eval_posterior_test = hmm.evaluate_corpus(test_seq, posterior_pred_test)
print "Best Smoothing %f -- Test Set Accuracy: Posterior Decode %.3f, Viterbi Decode: %.3f" % (best_smothing, eval_posterior_test, eval_viterbi_test)
confusion_matrix = cm.build_confusion_matrix(test_seq.seq_list, viterbi_pred_test,
                                             len(corpus.tag_dict), hmm.get_num_states())
cm.plot_confusion_bar_graph(confusion_matrix, corpus.tag_dict,
                            range(hmm.get_num_states()), 'Confusion matrix')
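The two decoders optimize different criteria. Viterbi decoding returns the single most likely state sequence as a whole, whereas posterior (minimum risk) decoding picks, at each position independently, the state with the highest posterior marginal computed by Forward-Backward:

y^Viterbi = argmax_y P(Y = y | X = x),    y_i^posterior = argmax_{y_i} P(Y_i = y_i | X = x) for each position i.

The two predictions can differ; in particular, the position-wise argmaxes of posterior decoding may concatenate into a sequence to which the joint model assigns very low (or even zero) probability, which is one reason to compare the accuracies of the two decoders as the exercise asks.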
Exercise 2.10 Implement the method to update the counts given the state and transition posteriors.
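Recall that, in the unsupervised setting, EM maximizes the log-likelihood of the observations alone, marginalizing over the hidden state sequences:

L(theta) = sum_m log P_theta(X = x^(m)) = sum_m log sum_y P_theta(X = x^(m), Y = y),

alternating between computing state and transition posteriors under the current parameters (the E-step, which feeds the update_counts method above) and renormalizing the accumulated expected counts into new probabilities (the M-step).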
In [22]:
# Train with EM.
hmm.train_EM(train_seq, 0.1, 20, evaluate=True)
viterbi_pred_test = hmm.viterbi_decode_corpus(test_seq)
posterior_pred_test = hmm.posterior_decode_corpus(test_seq)
eval_viterbi_test = hmm.evaluate_corpus(test_seq, viterbi_pred_test)
eval_posterior_test = hmm.evaluate_corpus(test_seq, posterior_pred_test)
confusion_matrix = cm.build_confusion_matrix(test_seq.seq_list, viterbi_pred_test,
                                             len(corpus.tag_dict), hmm.get_num_states())
cm.plot_confusion_bar_graph(confusion_matrix, corpus.tag_dict,
                            range(hmm.get_num_states()), 'Confusion matrix')