In this class, we continue to focus on sequence classification, but instead of following a generative approach (as in the previous chapter) we move towards discriminative approaches. Recall the difference between the two: generative approaches model the joint probability distribution of the data, P(X,Y), whereas discriminative approaches model only the conditional probability of the label sequence given the observed data, P(Y|X).
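Both approaches ultimately classify by maximizing over outputs; they differ in which distribution is estimated. As a short reminder (a sketch, using Bayes' rule; P(X=x) is constant with respect to y):

\[
\hat{y} \;=\; \arg\max_{y} P(Y\!=\!y \mid X\!=\!x) \;=\; \arg\max_{y} \frac{P(X\!=\!x,\, Y\!=\!y)}{P(X\!=\!x)} \;=\; \arg\max_{y} P(X\!=\!x,\, Y\!=\!y),
\]

so a generative model can use its joint P(X,Y) to make the same prediction, while a discriminative model parameterizes P(Y|X) directly.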
Today’s Assignment
The assignment of today’s class is to implement the structured version of the perceptron algorithm.
3.1 Classification vs Sequential Classification
Table 3.1 shows how the models for classification have counterparts in sequential classification. In fact, in the last chapter we discussed the Hidden Markov Model, which can be seen as a generalization of the Naïve Bayes model for sequences. In this chapter, we will see a generalization of the perceptron algorithm for sequence problems (yielding the structured perceptron; Collins, 2002) and a generalization of the Maximum Entropy model for sequences (yielding Conditional Random Fields; Lafferty et al., 2001). Note that neither of these generalizations is specific to sequences: they can be applied to a wide range of structured models (in tomorrow's lecture we will see how these methods can be applied to parsing). Although we will not cover all the methods described in Chapter 1, bear in mind that all of them have a structured counterpart. Given the perceptron example, it should be intuitive after this lecture how those methods could be extended to structured problems.
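For quick reference, the correspondence described in this paragraph (an abridged view of Table 3.1):

Classification       Sequential classification
Naïve Bayes          Hidden Markov Model
Perceptron           Structured Perceptron
Maximum Entropy      Conditional Random Fields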
Exercise 3.1 In this exercise you will train a CRF using different feature sets for part-of-speech tagging. Start with the code below, which uses the ID feature set from Table 3.2.
In [2]:
import sys
sys.path.append('../../../')
import lxmls.sequences.crf_online as crfo
import lxmls.sequences.structured_perceptron as spc
import lxmls.readers.pos_corpus as pcc
import lxmls.sequences.id_feature as idfc
import lxmls.sequences.extended_feature as exfc

print "CRF Exercise"

# Load the part-of-speech corpus, restricting sentence length and corpus
# size to keep training fast.
corpus = pcc.PostagCorpus()
train_seq = corpus.read_sequence_list_conll("../../../data/train-02-21.conll", max_sent_len=10, max_nr_sent=1000)
test_seq = corpus.read_sequence_list_conll("../../../data/test-23.conll", max_sent_len=10, max_nr_sent=1000)
dev_seq = corpus.read_sequence_list_conll("../../../data/dev-22.conll", max_sent_len=10, max_nr_sent=1000)

# Build the ID feature set from the training sequences.
feature_mapper = idfc.IDFeatures(train_seq)
feature_mapper.build_features()
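To get a feel for what the ID feature set contains, you can inspect the dictionary that build_features() populates. The snippet below is a sketch: it assumes the feature mapper exposes a feature_dict attribute mapping feature-name strings to integer indices, as in the lxmls codebase; the attribute name is an assumption, not part of the exercise.

# Sketch: assumes feature_mapper.feature_dict maps feature names
# (strings) to integer feature indices.
print "Number of features: %d" % len(feature_mapper.feature_dict)
for name in sorted(feature_mapper.feature_dict)[:10]:
    print name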
In [4]:
# Train a CRF with the online trainer.
crf_online = crfo.CRFOnline(corpus.word_dict, corpus.tag_dict, feature_mapper)
crf_online.num_epochs = 20
crf_online.train_supervised(train_seq)

# Decode each corpus with the Viterbi algorithm and evaluate
# per-token accuracy.
pred_train = crf_online.viterbi_decode_corpus(train_seq)
pred_dev = crf_online.viterbi_decode_corpus(dev_seq)
pred_test = crf_online.viterbi_decode_corpus(test_seq)
eval_train = crf_online.evaluate_corpus(train_seq, pred_train)
eval_dev = crf_online.evaluate_corpus(dev_seq, pred_dev)
eval_test = crf_online.evaluate_corpus(test_seq, pred_test)
print "CRF - ID Features Accuracy Train: %.3f Dev: %.3f Test: %.3f" % (eval_train, eval_dev, eval_test)
Exercise 3.2 Repeat the previous exercise using the extended feature set. Compare the results.
In [7]:
# Build the extended feature set and retrain the CRF.
feature_mapper = exfc.ExtendedFeatures(train_seq)
feature_mapper.build_features()
crf_online = crfo.CRFOnline(corpus.word_dict, corpus.tag_dict, feature_mapper)
crf_online.num_epochs = 20
crf_online.train_supervised(train_seq)

# Decode and evaluate as before.
pred_train = crf_online.viterbi_decode_corpus(train_seq)
pred_dev = crf_online.viterbi_decode_corpus(dev_seq)
pred_test = crf_online.viterbi_decode_corpus(test_seq)
eval_train = crf_online.evaluate_corpus(train_seq, pred_train)
eval_dev = crf_online.evaluate_corpus(dev_seq, pred_dev)
eval_test = crf_online.evaluate_corpus(test_seq, pred_test)
print "CRF - Extended Features Accuracy Train: %.3f Dev: %.3f Test: %.3f" % (eval_train, eval_dev, eval_test)
3.5 Structured Perceptron
The structured perceptron (Collins, 2002), in particular its averaged version, is a very simple algorithm that relies on Viterbi decoding and simple additive updates. In practice, it is very easy to implement and performs remarkably well on a variety of problems. These two characteristics make the structured perceptron a natural first choice when trying out a new problem or a new feature set. Recall what you learned about the perceptron algorithm (Section 1.4.3) and compare it against the structured perceptron (Algorithm 11).
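Before implementing it, it may help to see the overall training loop. The sketch below follows the lxmls-style interfaces used in this chapter (a perceptron_update method, which you will implement in the next exercise, and a sequence list with a seq_list attribute); it is an illustration, not the exact library code.

# Sketch of the structured perceptron training loop (Algorithm 11),
# assuming lxmls-style interfaces; not the library implementation.
def train_structured_perceptron(model, train_seq, num_epochs=20):
    for epoch in xrange(num_epochs):
        num_labels = 0
        num_mistakes = 0
        for sequence in train_seq.seq_list:
            # Viterbi-decode with the current parameters and make
            # additive updates on mispredicted parts (Exercise 3.3).
            labels, mistakes = model.perceptron_update(sequence)
            num_labels += labels
            num_mistakes += mistakes
        print "Epoch: %i Accuracy: %f" % (epoch, 1.0 - float(num_mistakes) / num_labels)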
Exercise 3.3 Implement the structured perceptron algorithm.
In [5]:
def perceptron_update(self, sequence):
    ##########################
    # Solution to Exercise 3.3
    ##########################
    num_labels = 0
    num_mistakes = 0
    predicted_sequence, _ = self.viterbi_decode(sequence)
    y_hat = predicted_sequence.y

    # Update initial features.
    y_t_true = sequence.y[0]
    y_t_hat = y_hat[0]
    if y_t_true != y_t_hat:
        true_initial_features = self.feature_mapper.get_initial_features(sequence, y_t_true)
        self.parameters[true_initial_features] += self.learning_rate
        hat_initial_features = self.feature_mapper.get_initial_features(sequence, y_t_hat)
        self.parameters[hat_initial_features] -= self.learning_rate

    for pos in xrange(len(sequence.x)):
        y_t_true = sequence.y[pos]
        y_t_hat = y_hat[pos]

        # Update emission features.
        num_labels += 1
        if y_t_true != y_t_hat:
            num_mistakes += 1
            true_emission_features = self.feature_mapper.get_emission_features(sequence, pos, y_t_true)
            self.parameters[true_emission_features] += self.learning_rate
            hat_emission_features = self.feature_mapper.get_emission_features(sequence, pos, y_t_hat)
            self.parameters[hat_emission_features] -= self.learning_rate

        if pos > 0:
            # Update transition (bigram) features whenever the true
            # bigram differs from the predicted bigram.
            prev_y_t_true = sequence.y[pos-1]
            prev_y_t_hat = y_hat[pos-1]
            if y_t_true != y_t_hat or prev_y_t_true != prev_y_t_hat:
                true_transition_features = self.feature_mapper.get_transition_features(
                    sequence, pos-1, y_t_true, prev_y_t_true)
                self.parameters[true_transition_features] += self.learning_rate
                hat_transition_features = self.feature_mapper.get_transition_features(
                    sequence, pos-1, y_t_hat, prev_y_t_hat)
                self.parameters[hat_transition_features] -= self.learning_rate

    # Update final features, using the last position of the sequence.
    pos = len(sequence.x)
    y_t_true = sequence.y[pos-1]
    y_t_hat = y_hat[pos-1]
    if y_t_true != y_t_hat:
        true_final_features = self.feature_mapper.get_final_features(sequence, y_t_true)
        self.parameters[true_final_features] += self.learning_rate
        hat_final_features = self.feature_mapper.get_final_features(sequence, y_t_hat)
        self.parameters[hat_final_features] -= self.learning_rate

    return num_labels, num_mistakes

#################################
# End of solution to Exercise 3.3
#################################
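Note that perceptron_update performs the plain (unaveraged) update. The averaged version mentioned above additionally averages the parameter vectors accumulated during training and uses that average at test time, which typically improves generalization (Collins, 2002). A minimal sketch of that final step, assuming one numpy parameter snapshot is stored per epoch (the params_per_epoch name is illustrative):

import numpy as np

def average_parameters(params_per_epoch):
    # Average one parameter-vector snapshot per epoch; the result is
    # used in place of the final parameters at test time.
    total = np.zeros_like(params_per_epoch[0])
    for w in params_per_epoch:
        total += w
    return total / len(params_per_epoch)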
Exercise 3.4 Repeat Exercises 3.1–3.2 using the structured perceptron algorithm instead of a CRF. Report the results. Here is the code for both the ID and the extended feature sets:
In [6]:
print "Perceptron Exercise"
feature_mapper = idfc.IDFeatures(train_seq)
feature_mapper.build_features()
sp = spc.StructuredPerceptron(corpus.word_dict, corpus.tag_dict, feature_mapper)
sp.num_epochs = 20
sp.train_supervised(train_seq)
pred_train = sp.viterbi_decode_corpus(train_seq)
pred_dev = sp.viterbi_decode_corpus(dev_seq)
pred_test = sp.viterbi_decode_corpus(test_seq)
eval_train = sp.evaluate_corpus(train_seq, pred_train)
eval_dev = sp.evaluate_corpus(dev_seq, pred_dev)
eval_test = sp.evaluate_corpus(test_seq, pred_test)
print "Structured Perceptron - ID Features Accuracy Train: %.3f Dev: %.3f Test: %.3f" % (eval_train, eval_dev, eval_test)
feature_mapper = exfc.ExtendedFeatures(train_seq)
feature_mapper.build_features()
sp = spc.StructuredPerceptron(corpus.word_dict, corpus.tag_dict, feature_mapper)
sp.num_epochs = 20
sp.train_supervised(train_seq)
pred_train = sp.viterbi_decode_corpus(train_seq)
pred_dev = sp.viterbi_decode_corpus(dev_seq)
pred_test = sp.viterbi_decode_corpus(test_seq)
eval_train = sp.evaluate_corpus(train_seq, pred_train)
eval_dev = sp.evaluate_corpus(dev_seq, pred_dev)
eval_test = sp.evaluate_corpus(test_seq, pred_test)
print "Structured Perceptron - Extended Features Accuracy Train: %.3f Dev: %.3f Test: %.3f" % (eval_train, eval_dev, eval_test)