Understand the CRF code and the feature_mapper code.

In [ ]:
%matplotlib inline
%load_ext autoreload
%autoreload 2

In [ ]:
from lxmls import DATA_PATH
import lxmls
import lxmls.sequences.crf_online as crfo
import lxmls.readers.pos_corpus as pcc
import lxmls.sequences.id_feature as idfc
import lxmls.sequences.extended_feature as exfc
from lxmls.readers import pos_corpus

Load data from the CoNLL task

In [ ]:
corpus = lxmls.readers.pos_corpus.PostagCorpus()

train_seq = corpus.read_sequence_list_conll(DATA_PATH + "/train-02-21.conll", 
                                            max_sent_len=10, max_nr_sent=1000)

test_seq = corpus.read_sequence_list_conll(DATA_PATH + "/test-23.conll", 
                                           max_sent_len=10, max_nr_sent=1000)

dev_seq = corpus.read_sequence_list_conll(DATA_PATH + "/dev-22.conll", 
                                          max_sent_len=10, max_nr_sent=1000)

In [ ]:
print("There are", len(train_seq), "examples in train_seq")
print("First example:", train_seq[0])

Feature generation

Given a dataset, in order to build the features:

  • An instance of lxmls.sequences.id_feature.IDFeatures(train_data) must be instantiated. We will call this instantiated object feature_mapper.
  • Then feature_mapper.build_features() must be executed.
In [ ]:
## Building features
feature_mapper = idfc.IDFeatures(train_seq)
feature_mapper.build_features()

About feature_mappers

A feature_mapper contains the following attributes:

  • the dataset in .dataset
    • if we instantiate the feature mapper with a dataset X, then feature_mapper.dataset will be a copy of X
  • a dictionary of features in .feature_dict
    • this dictionary defaults to {}.
    • To build the features, the feature mapper must call the .build_features() method.
  • a list of features in .feature_list
    • this list defaults to [].
    • To build the list of features, the feature mapper must call the .build_features() method.

A feature_mapper also contains a method:

  • a method to generate features, .build_features()
    • this method creates the features using .dataset.
    • it also fills .feature_dict and .feature_list.
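To make the relationship between .feature_dict and .feature_list concrete, here is a minimal sketch (an assumption about the general pattern, not the actual lxmls implementation) of how a feature mapper can assign a unique integer id to each new feature name:

```python
# Sketch of feature-id assignment: feature_dict maps a feature name
# (a string) to a unique integer id, reusing the id for repeated names.
feature_dict = {}

def add_feature(name):
    """Return the integer id for `name`, creating a new id if unseen."""
    if name not in feature_dict:
        feature_dict[name] = len(feature_dict)
    return feature_dict[name]

ids = [add_feature(n) for n in ["init_tag:noun", "id:plays::verb", "init_tag:noun"]]
print(ids)                # repeated names reuse the same id
print(len(feature_dict))  # number of distinct features
```

With this scheme, .feature_list can store compact integer ids per example while .feature_dict remembers what each id means.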

In [ ]:
## Let's see the features for the first training example
feature_mapper.feature_list[0]

In [ ]:
## The previous features can be classified into:

print("\nInitial features:",     feature_mapper.feature_list[0][0])
print("\nTransition features:",  feature_mapper.feature_list[0][1])
print("\nFinal features:",       feature_mapper.feature_list[0][2])
print("\nEmission features:",    feature_mapper.feature_list[0][3])

An observation on the features for a given example

All features for all the training examples in train_seq are saved in feature_mapper.feature_list.

  • If feature_mapper.feature_list[m] is our feature vector for training example m... why is it not a vector?

    • Good point! To make the algorithm fast, the code is written using dicts: accessing only a few positions of a dict and computing subtractions there is much faster than subtracting two huge weight vectors. Notice that there are len(feature_mapper.feature_dict) features.
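The speed argument above can be illustrated with a small sketch (a simplified toy, not the lxmls code): a perceptron-style update only touches the few features that fire for an example, rather than every position of a dense weight vector.

```python
# Sparse weights: only store positions that have ever been updated.
# A dense vector for this model would have num_features entries.
num_features = 1_000_000
weights = {}

# Features that fired for one example, as (feature_id, value) pairs.
fired = [(3, 1.0), (42, 1.0), (999_999, 1.0)]

learning_rate = 0.1
for feat_id, value in fired:
    # Update only 3 positions instead of all 1,000,000.
    weights[feat_id] = weights.get(feat_id, 0.0) + learning_rate * value

print(weights)
```

The cost of the update is proportional to the number of fired features, not to the size of the feature space.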

In [ ]:
len(train_seq), len(feature_mapper.feature_list)

Codification of the features

Features are identified by the prefixes init_tag:, prev_tag:, final_prev_tag:, and id:

  • init_tag: for initial features
    • Example: init_tag:noun is an initial feature describing that the first word is a noun
  • prev_tag: for transition features
    • Example: prev_tag:noun::noun is a transition feature describing that the previous word was a noun and the current word is a noun.
    • Example: prev_tag:noun::. is a transition feature describing that the previous word was a noun and the current word is a . (this is usually found as the last transition feature, since most sentences end with a dot)
  • final_prev_tag: for final features
    • Example: final_prev_tag:. is a final feature stating that the last "word" in the sentence was a dot.
  • id: for emission features
    • Example: id:plays::verb is an emission feature describing that the current word is plays and the current hidden state is verb.
    • Example: id:Feb.::noun is an emission feature describing that the current word is "Feb." and the current hidden state is noun.
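The four templates above can be sketched for a toy tagged sentence (a hypothetical illustration of the naming scheme, not the lxmls feature-extraction code):

```python
# Toy tagged sentence: "Feb. plays ." tagged noun/verb/.
words = ["Feb.", "plays", "."]
tags  = ["noun", "verb", "."]

features = []
features.append("init_tag:%s" % tags[0])                       # initial feature
for i, (word, tag) in enumerate(zip(words, tags)):
    features.append("id:%s::%s" % (word, tag))                 # emission feature
    if i > 0:
        features.append("prev_tag:%s::%s" % (tags[i - 1], tag))  # transition feature
features.append("final_prev_tag:%s" % tags[-1])                # final feature

print(features)
```

Each generated name would then be looked up in feature_dict to obtain its integer id.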

In [ ]:
inv_feature_dict = {word: pos for pos, word in feature_mapper.feature_dict.items()}

In [ ]:
[inv_feature_dict[x[0]] for x in feature_mapper.feature_list[0][0]]

In [ ]:
[inv_feature_dict[x[0]] for x in feature_mapper.feature_list[0][1]]

In [ ]:
[inv_feature_dict[x[0]] for x in feature_mapper.feature_list[0][2]]

In [ ]:
[inv_feature_dict[x[0]] for x in feature_mapper.feature_list[0][3]]

Train a CRF

In [ ]:
## Train crf
crf_online = crfo.CRFOnline(corpus.word_dict, corpus.tag_dict, feature_mapper)
crf_online.num_epochs = 20

In [ ]:
## Train the model. You will receive feedback when each epoch is finished;
## note that running the 20 epochs might take a while.
crf_online.train_supervised(train_seq)

## After training is done, evaluate the learned model on the
## training, development and test sets.
pred_train = crf_online.viterbi_decode_corpus(train_seq)
pred_dev = crf_online.viterbi_decode_corpus(dev_seq)
pred_test = crf_online.viterbi_decode_corpus(test_seq)

eval_train = crf_online.evaluate_corpus(train_seq, pred_train)
eval_dev = crf_online.evaluate_corpus(dev_seq, pred_dev)
eval_test = crf_online.evaluate_corpus(test_seq, pred_test)

In [ ]:
print("CRF - ID Features Accuracy Train: %.3f Dev: %.3f Test: %.3f" \
       %(eval_train,eval_dev, eval_test))

In [ ]: