In [ ]:
%matplotlib inline
%load_ext autoreload
%autoreload 2
In [ ]:
from lxmls import DATA_PATH
import lxmls
import lxmls.sequences.crf_online as crfo
import lxmls.readers.pos_corpus as pcc
import lxmls.sequences.id_feature as idfc
import lxmls.sequences.extended_feature as exfc
from lxmls.readers import pos_corpus
Load the data from the CoNLL task.
In [ ]:
corpus = lxmls.readers.pos_corpus.PostagCorpus()
train_seq = corpus.read_sequence_list_conll(DATA_PATH + "/train-02-21.conll",
max_sent_len=10, max_nr_sent=1000)
test_seq = corpus.read_sequence_list_conll(DATA_PATH + "/test-23.conll",
max_sent_len=10, max_nr_sent=1000)
dev_seq = corpus.read_sequence_list_conll(DATA_PATH + "/dev-22.conll",
max_sent_len=10, max_nr_sent=1000)
In [ ]:
print("There are", len(train_seq), "examples in train_seq")
print("First example:", train_seq[0])
In [ ]:
## Building features
feature_mapper = idfc.IDFeatures(train_seq)
feature_mapper.build_features()
A `feature_mapper` will contain the following attributes:

- `.dataset`, a copy of the training sequences it was built from;
- `.feature_dict`, a dictionary `{}` filled in by the `.build_features()` function;
- `.feature_list`, a list `[]` filled in by the `.build_features()` function.

It will also contain the method `.build_features()`, which populates `.feature_dict` and `.feature_list` from `.dataset`.
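As a rough sketch of what `.build_features()` does (an illustrative toy class, not the lxmls implementation; the `ToyFeatureMapper` name and its simplified `(words, tags)` input format are assumptions):

```python
class ToyFeatureMapper:
    """Minimal sketch of a feature mapper: maps feature strings to integer ids."""

    def __init__(self, dataset):
        self.dataset = dataset   # list of (words, tags) training examples
        self.feature_dict = {}   # feature name -> feature id
        self.feature_list = []   # per-example lists of fired feature ids

    def add_feature(self, name):
        # Assign a fresh id the first time a feature name is seen.
        if name not in self.feature_dict:
            self.feature_dict[name] = len(self.feature_dict)
        return self.feature_dict[name]

    def build_features(self):
        for words, tags in self.dataset:
            init = [self.add_feature("init_tag:%s" % tags[0])]
            trans = [self.add_feature("prev_tag:%s::%s" % (tags[i - 1], tags[i]))
                     for i in range(1, len(tags))]
            final = [self.add_feature("final_prev_tag:%s" % tags[-1])]
            emission = [self.add_feature("id:%s::%s" % (w, t))
                        for w, t in zip(words, tags)]
            self.feature_list.append([init, trans, final, emission])

mapper = ToyFeatureMapper([(["Mr.", "Ward", "plays"], ["noun", "noun", "verb"])])
mapper.build_features()
```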
In [ ]:
len(feature_mapper.feature_list)
In [ ]:
## Let's see the features for the first training example
feature_mapper.feature_list[0]
In [ ]:
## The previous features can be classified into:
print("\nInitial features:", feature_mapper.feature_list[0][0])
print("\nTransition features:", feature_mapper.feature_list[0][1])
print("\nFinal features:", feature_mapper.feature_list[0][2])
print("\nEmission features:", feature_mapper.feature_list[0][3])
All features for all the training examples in `train_seq` are saved in `feature_mapper.feature_list`. If `feature_mapper.feature_list[m]` is our feature representation for training example m ... why is it not a vector of `len(feature_mapper.feature_dict)` features?
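The answer is that features are stored sparsely: each example keeps only the integer ids of the features that fire, rather than a dense binary vector over all `len(feature_mapper.feature_dict)` dimensions. A minimal sketch of the correspondence (the toy numbers are made up):

```python
num_features = 6       # e.g. len(feature_mapper.feature_dict)
fired_ids = [0, 2, 5]  # the feature ids stored for one example

# The equivalent dense binary vector that feature_list avoids materializing.
dense = [0.0] * num_features
for i in fired_ids:
    dense[i] = 1.0
```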
In [ ]:
len(train_seq), len(feature_mapper.feature_list)
Features are identified by the prefixes `init_tag:`, `prev_tag:`, `final_prev_tag:` and `id:`:

- `init_tag:noun` is an initial feature describing that the first word is a noun.
- `prev_tag:noun::noun` is a transition feature describing that the previous tag was noun and the current tag is noun.
- `prev_tag:noun::.` is a transition feature describing that the previous tag was noun and the current tag is `.` (this is usually found as the last transition feature, since most sentences end with a dot).
- `final_prev_tag:.` is a final feature stating that the last "word" in the sentence was a dot.
- `id:plays::verb` is an emission feature describing that the current word is "plays" and the current hidden state is verb.
- `id:Feb.::noun` is an emission feature describing that the current word is "Feb." and the current hidden state is noun.
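Assuming the `type:arg::arg` naming convention shown above, the pieces of a feature name can be recovered by splitting on the separators (a hedged helper for illustration, not part of lxmls):

```python
def parse_feature(name):
    """Split a feature name like 'prev_tag:noun::verb' into (type, args)."""
    feat_type, rest = name.split(":", 1)  # prefix before the first colon
    return feat_type, rest.split("::")    # remaining arguments

parse_feature("id:plays::verb")  # -> ("id", ["plays", "verb"])
```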
In [ ]:
inv_feature_dict = {feat_id: feat_name for feat_name, feat_id in feature_mapper.feature_dict.items()}  # invert: feature id -> feature name
In [ ]:
feature_mapper.feature_list[0][0]
In [ ]:
[inv_feature_dict[x[0]] for x in feature_mapper.feature_list[0][0]]
In [ ]:
[inv_feature_dict[x[0]] for x in feature_mapper.feature_list[0][1]]
In [ ]:
[inv_feature_dict[x[0]] for x in feature_mapper.feature_list[0][2]]
In [ ]:
len(train_seq.x_dict)
In [ ]:
## Train crf
crf_online = crfo.CRFOnline(corpus.word_dict, corpus.tag_dict, feature_mapper)
crf_online.num_epochs = 20
crf_online.train_supervised(train_seq)
In [ ]:
## You will receive feedback when each epoch is finished,
## note that running the 20 epochs might take a while. After training is done,
## evaluate the learned model on the training, development and test sets.
pred_train = crf_online.viterbi_decode_corpus(train_seq)
pred_dev = crf_online.viterbi_decode_corpus(dev_seq)
pred_test = crf_online.viterbi_decode_corpus(test_seq)
eval_train = crf_online.evaluate_corpus(train_seq, pred_train)
eval_dev = crf_online.evaluate_corpus(dev_seq, pred_dev)
eval_test = crf_online.evaluate_corpus(test_seq, pred_test)
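`viterbi_decode_corpus` applies Viterbi decoding to each sequence. As a self-contained sketch of the algorithm itself (the score-matrix interface and shapes here are illustrative assumptions, not the lxmls API):

```python
import numpy as np

def viterbi(init_scores, trans_scores, emit_scores):
    """Return the highest-scoring tag sequence and its score.

    init_scores:  (num_tags,)           score of starting in each tag
    trans_scores: (num_tags, num_tags)  trans_scores[prev, cur]
    emit_scores:  (length, num_tags)    per-position tag scores
    """
    length, num_tags = emit_scores.shape
    delta = np.zeros((length, num_tags))          # best score ending in each tag
    backptr = np.zeros((length, num_tags), dtype=int)
    delta[0] = init_scores + emit_scores[0]
    for i in range(1, length):
        # scores[prev, cur]: best path through prev, then transition, then emit.
        scores = delta[i - 1][:, None] + trans_scores + emit_scores[i]
        backptr[i] = scores.argmax(axis=0)
        delta[i] = scores.max(axis=0)
    # Follow back-pointers from the best final tag.
    path = [int(delta[-1].argmax())]
    for i in range(length - 1, 0, -1):
        path.append(int(backptr[i, path[-1]]))
    return path[::-1], float(delta[-1].max())
```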
In [ ]:
print("CRF - ID Features Accuracy Train: %.3f Dev: %.3f Test: %.3f" \
%(eval_train,eval_dev, eval_test))
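`evaluate_corpus` reports per-token accuracy. A standalone sketch of that metric (assuming gold and predicted tag sequences are plain lists; this is not the lxmls implementation):

```python
def token_accuracy(gold_seqs, pred_seqs):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = total = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        for g, p in zip(gold, pred):
            correct += g == p
            total += 1
    return correct / total
```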
In [ ]: