LXMLS 2017 - Day 2

Sequence Models

In this class, we relax the assumption that the data points are independently and identically distributed (i.i.d.) by moving to a scenario of structured prediction, where the inputs are assumed to have temporal or spacial dependencies. We start by considering sequential models, which correspond to a chain structure: for instance, the words in a sentence. In this lecture, we will use part-of-speech tagging as our example task. We start by defining the notation for this lecture in Section 2.1. Afterwards, in section 2.2, we focus on the well known Hidden Markov Models and in Section 2.3 we describe how to estimate its parameters from labeled data. In Section 2.4 we explain the inference algorithms (Viterbi and Forward-Backward) for sequence models. These inference algorithms will be fundamental for the rest of this lecture, as well as for the next lecture on discriminative training of sequence models. In Section 2.6 we describe the task of Part-of-Speech tagging, and how the Hidden Markov Models are suitable for this task. Finally, in Section 2.7 we address unsupervised learning of Hidden Markov Models through the Expectation Maximization algorithm

2.2 Hidden Markov Models

Hidden Markov Models (HMMs) are one of the most common sequence probabilistic models, and have been applied to a wide variety of tasks. HMMs are particular instances of directed probabilistic graphical models (or Bayesian networks) which have a chain topology. In a Bayesian network, every random variable is represented as a node in a graph, and the edges in the graph are directed and represent probabilistic dependencies between the random variables. For an HMM, the random variables are divided into two sets, the observed variables, X = X1 . . . XN, and the hidden variables Y = Y1 . . . YN. In the HMM terminology, the observed variables are called observations, and the hidden variables are called states. The states are generated according to a first order Markov process, in which the ith state Yi depends only of the previous state Yi−1. Two special states are the start symbol, which starts the sequence, and the stop symbol, which ends the sequence. In addition, states emit observation symbols. In an HMM, it is assumed that all observations are independent given the states that generated them. As you may find out with today’s assignment, implementing the inference routines of the HMM can be challenging. We start with a small and very simple (also very unrealistic!) example. The idea is that you may compute the desired quantities by hand and check if your implementation yields the correct result.

Exercise 2.1 Load the simple sequence dataset. From the ipython command line create a simple sequence object and look at the training and test set.



In [9]:

    
import sys
sys.path.append('../../../')
import lxmls.readers.simple_sequence as ssr
simple = ssr.SimpleSequence()
print simple.train









    



[walk/rainy walk/sunny shop/sunny clean/sunny , walk/rainy walk/rainy shop/rainy clean/sunny , walk/sunny shop/sunny shop/sunny clean/sunny ]



In [2]:

    
print simple.test









    



[walk/rainy walk/sunny shop/sunny clean/sunny , clean/sunny walk/sunny tennis/sunny walk/sunny ]



In [3]:

    
for sequence in simple.train.seq_list:
    print sequence









    



walk/rainy walk/sunny shop/sunny clean/sunny 
walk/rainy walk/rainy shop/rainy clean/sunny 
walk/sunny shop/sunny shop/sunny clean/sunny



In [4]:

    
for sequence in simple.train.seq_list:
    print sequence.x, sequence.y









    



[0, 0, 1, 2] [0, 1, 1, 1]
[0, 0, 1, 2] [0, 0, 0, 1]
[0, 1, 1, 2] [1, 1, 1, 1]

Exercise 2.2 The provided function train supervised from the hmm.py file implements the above parameter estimates. Run this function given the simple dataset above and look at the estimated probabilities. Are they correct? You can also check the variables ending in counts instead of probs to see the raw counts (for example, typing hmm.initial counts will show you the raw counts of initial states). How are the counts related to the probabilities?



In [10]:

    
def update_counts(self, sequence, state_posteriors, transition_posteriors):
    """ Used in the E-step in EM."""

    ##########################
    # Solution to Exercise 2.2
    ##########################

    num_states = self.get_num_states()  # Number of states.
    length = len(sequence.x)  # Length of the sequence.

    # Take care of initial probs
    for y in xrange(num_states):
        self.initial_counts[y] += state_posteriors[0, y]
    for pos in xrange(length):
        x = sequence.x[pos]
        for y in xrange(num_states):
            self.emission_counts[x, y] += state_posteriors[pos, y]
            if pos > 0:
                for y_prev in xrange(num_states):
                    self.transition_counts[y, y_prev] += transition_posteriors[pos-1, y, y_prev]

    # Final position
    for y in xrange(num_states):
        self.final_counts[y] += state_posteriors[length-1, y]

    #################################
    # End of solution to Exercise 2.2
    #################################



In [11]:

    
import lxmls.sequences.hmm as hmmc
hmm = hmmc.HMM(simple.x_dict, simple.y_dict)
hmm.train_supervised(simple.train)
print "Initial Probabilities:", hmm.initial_probs









    



Initial Probabilities: [ 0.66666667  0.33333333]



In [12]:

    
print "Transition Probabilities:", hmm.transition_probs









    



Transition Probabilities: [[ 0.5    0.   ]
 [ 0.5    0.625]]



In [13]:

    
print "Final Probabilities:", hmm.final_probs









    



Final Probabilities: [ 0.     0.375]



In [14]:

    
print "Emission Probabilities", hmm.emission_probs









    



Emission Probabilities [[ 0.75   0.25 ]
 [ 0.25   0.375]
 [ 0.     0.375]
 [ 0.     0.   ]]

Exercise 2.3 Convince yourself that the score of a path in the trellis (summing over the scores above) is equivalent to the log-probability log P(X = x,Y = y), as defined in Eq. 2.2. Use the given function compute scores on the first training sequence and confirm that the values are correct. You should get the same values as presented below

2.4.3 Viterbi Decoding

Exercise 2.8 Implement a method for performing Viterbi decoding in file sequence classification decoder.py.



In [16]:

    
def run_viterbi(self, initial_scores, transition_scores, final_scores, emission_scores):

    ##########################
    # Solution to Exercise 2.8
    ##########################
    
    length = np.size(emission_scores, 0)  # Length of the sequence.
    num_states = np.size(initial_scores)  # Number of states.

    # Variables storing the Viterbi scores.
    viterbi_scores = np.zeros([length, num_states]) + logzero()

    # Variables storing the paths to backtrack.
    viterbi_paths = -np.ones([length, num_states], dtype=int)

    # Most likely sequence.
    best_path = -np.ones(length, dtype=int)

    # Initialization.
    viterbi_scores[0, :] = emission_scores[0, :] + initial_scores

    # Viterbi loop.
    for pos in xrange(1, length):
        for current_state in xrange(num_states):
            viterbi_scores[pos, current_state] = \
                np.max(viterbi_scores[pos-1, :] + transition_scores[pos-1, current_state, :])
            viterbi_scores[pos, current_state] += emission_scores[pos, current_state]
            viterbi_paths[pos, current_state] = \
                np.argmax(viterbi_scores[pos-1, :] + transition_scores[pos-1, current_state, :])
    # Termination.
    best_score = np.max(viterbi_scores[length-1, :] + final_scores)
    best_path[length-1] = np.argmax(viterbi_scores[length-1, :] + final_scores)

    # Backtrack.
    for pos in xrange(length-2, -1, -1):
        best_path[pos] = viterbi_paths[pos+1, best_path[pos+1]]

    return best_path, best_score
    
    #################################
    # End of solution to Exercise 2.8
    #################################



In [17]:

    
y_pred, score = hmm.viterbi_decode(simple.test.seq_list[0])
print "Viterbi decoding Prediction test 0 with smoothing"
print y_pred, score
print "Truth test 0"
print simple.test.seq_list[0]

y_pred, score = hmm.viterbi_decode(simple.test.seq_list[1])
print "Viterbi decoding Prediction test 1 with smoothing"
print y_pred, score
print "Truth test 1"
print simple.test.seq_list[1]









    



Viterbi decoding Prediction test 0 with smoothing
walk/rainy walk/rainy shop/sunny clean/sunny  -5.77961500241
Truth test 0
walk/rainy walk/sunny shop/sunny clean/sunny 
Viterbi decoding Prediction test 1 with smoothing
clean/rainy walk/rainy tennis/rainy walk/rainy  -inf
Truth test 1
clean/sunny walk/sunny tennis/sunny walk/sunny 






    



../../../lxmls/sequences/hmm.py:194: RuntimeWarning: divide by zero encountered in log
  emission_scores[pos, :] = np.log(self.emission_probs[sequence.x[pos], :])
../../../lxmls/sequences/hmm.py:192: RuntimeWarning: divide by zero encountered in log
  transition_scores = np.zeros([length-1, num_states, num_states]) + logzero()
../../../lxmls/sequences/hmm.py:197: RuntimeWarning: divide by zero encountered in log

Exercise 2.9 Test the model using both posterior decoding and Viterbi decoding on both the train and test set, using the methods in class HMM:



In [21]:

    
import lxmls.readers.pos_corpus as pcc
import lxmls.sequences.confusion_matrix as cm

corpus = pcc.PostagCorpus()
train_seq = corpus.read_sequence_list_conll("../../../data/train-02-21.conll", max_sent_len=15, max_nr_sent=1000)
test_seq = corpus.read_sequence_list_conll("../../../data/test-23.conll", max_sent_len=15, max_nr_sent=1000)
dev_seq = corpus.read_sequence_list_conll("../../../data/dev-22.conll", max_sent_len=15, max_nr_sent=1000)
hmm = hmmc.HMM(corpus.word_dict, corpus.tag_dict)
hmm.train_supervised(train_seq)
hmm.print_transition_matrix()

viterbi_pred_train = hmm.viterbi_decode_corpus(train_seq)
posterior_pred_train = hmm.posterior_decode_corpus(train_seq)
eval_viterbi_train = hmm.evaluate_corpus(train_seq, viterbi_pred_train)
eval_posterior_train = hmm.evaluate_corpus(train_seq, posterior_pred_train)
print "Train Set Accuracy: Posterior Decode %.3f, Viterbi Decode: %.3f" % (eval_posterior_train, eval_viterbi_train)

viterbi_pred_test = hmm.viterbi_decode_corpus(test_seq)
posterior_pred_test = hmm.posterior_decode_corpus(test_seq)
eval_viterbi_test = hmm.evaluate_corpus(test_seq, viterbi_pred_test)
eval_posterior_test = hmm.evaluate_corpus(test_seq, posterior_pred_test)
print "Test Set Accuracy: Posterior Decode %.3f, Viterbi Decode: %.3f" % (eval_posterior_test, eval_viterbi_test)

best_smothing = hmm.pick_best_smoothing(train_seq, dev_seq, [10, 1, 0.1, 0])

hmm.train_supervised(train_seq, smoothing=best_smothing)
viterbi_pred_test = hmm.viterbi_decode_corpus(test_seq)
posterior_pred_test = hmm.posterior_decode_corpus(test_seq)
eval_viterbi_test = hmm.evaluate_corpus(test_seq, viterbi_pred_test)
eval_posterior_test = hmm.evaluate_corpus(test_seq, posterior_pred_test)
print "Best Smoothing %f --  Test Set Accuracy: Posterior Decode %.3f, Viterbi Decode: %.3f" % (best_smothing, eval_posterior_test, eval_viterbi_test)

confusion_matrix = cm.build_confusion_matrix(test_seq.seq_list, viterbi_pred_test,
                                             len(corpus.tag_dict), hmm.get_num_states())

cm.plot_confusion_bar_graph(confusion_matrix, corpus.tag_dict,
                            range(hmm.get_num_states()), 'Confusion matrix')









    












    



Train Set Accuracy: Posterior Decode 0.985, Viterbi Decode: 0.985
Test Set Accuracy: Posterior Decode 0.350, Viterbi Decode: 0.509
Smoothing 10.000000 --  Train Set Accuracy: Posterior Decode 0.731, Viterbi Decode: 0.691
Smoothing 10.000000 -- Test Set Accuracy: Posterior Decode 0.712, Viterbi Decode: 0.675
Smoothing 1.000000 --  Train Set Accuracy: Posterior Decode 0.887, Viterbi Decode: 0.865
Smoothing 1.000000 -- Test Set Accuracy: Posterior Decode 0.818, Viterbi Decode: 0.792
Smoothing 0.100000 --  Train Set Accuracy: Posterior Decode 0.968, Viterbi Decode: 0.965
Smoothing 0.100000 -- Test Set Accuracy: Posterior Decode 0.851, Viterbi Decode: 0.842
Smoothing 0.000000 --  Train Set Accuracy: Posterior Decode 0.985, Viterbi Decode: 0.985
Smoothing 0.000000 -- Test Set Accuracy: Posterior Decode 0.370, Viterbi Decode: 0.526
Best Smoothing 0.100000 --  Test Set Accuracy: Posterior Decode 0.837, Viterbi Decode: 0.827

Exercise 2.10 Implement the method to update the counts given the state and transition posteriors



In [22]:

    
# Train with EM.
hmm.train_EM(train_seq, 0.1, 20, evaluate=True)
viterbi_pred_test = hmm.viterbi_decode_corpus(test_seq)
posterior_pred_test = hmm.posterior_decode_corpus(test_seq)
eval_viterbi_test = hmm.evaluate_corpus(test_seq, viterbi_pred_test)
eval_posterior_test = hmm.evaluate_corpus(test_seq, posterior_pred_test)

confusion_matrix = cm.build_confusion_matrix(test_seq.seq_list, viterbi_pred_test,
                                             len(corpus.tag_dict), hmm.get_num_states())
cm.plot_confusion_bar_graph(confusion_matrix, corpus.tag_dict,
                            xrange(hmm.get_num_states()), 'Confusion matrix')









    



Initial accuracy: 0.308047
Iter: 1 Log Likelihood: -101711.222922
Iter: 1 Accuracy: 0.305041
Iter: 2 Log Likelihood: -78067.928544
Iter: 2 Accuracy: 0.288907
Iter: 3 Log Likelihood: -77863.834516
Iter: 3 Accuracy: 0.306844
Iter: 4 Log Likelihood: -77358.612450
Iter: 4 Accuracy: 0.285500
Iter: 5 Log Likelihood: -76521.324902
Iter: 5 Accuracy: 0.285500
Iter: 6 Log Likelihood: -75623.481301
Iter: 6 Accuracy: 0.297124
Iter: 7 Log Likelihood: -74682.105414
Iter: 7 Accuracy: 0.285500
Iter: 8 Log Likelihood: -73654.899477
Iter: 8 Accuracy: 0.290009
Iter: 9 Log Likelihood: -72725.996789
Iter: 9 Accuracy: 0.294819
Iter: 10 Log Likelihood: -71810.800760
Iter: 10 Accuracy: 0.301533
Iter: 11 Log Likelihood: -70854.130354
Iter: 11 Accuracy: 0.304640
Iter: 12 Log Likelihood: -69964.634649
Iter: 12 Accuracy: 0.304038
Iter: 13 Log Likelihood: -69244.099731
Iter: 13 Accuracy: 0.309149
Iter: 14 Log Likelihood: -68640.504492
Iter: 14 Accuracy: 0.311354
Iter: 15 Log Likelihood: -68130.478522
Iter: 15 Accuracy: 0.310352
Iter: 16 Log Likelihood: -67710.110731
Iter: 16 Accuracy: 0.312556
Iter: 17 Log Likelihood: -67375.515960
Iter: 17 Accuracy: 0.312657
Iter: 18 Log Likelihood: -67146.894045
Iter: 18 Accuracy: 0.311855
Iter: 19 Log Likelihood: -67008.535151
Iter: 19 Accuracy: 0.313458