In [ ]:
%matplotlib inline
%load_ext autoreload
%autoreload 2
In [ ]:
from lxmls import DATA_PATH
import lxmls
import lxmls.sequences.crf_online as crfo
import lxmls.readers.pos_corpus as pcc
import lxmls.sequences.id_feature as idfc
import lxmls.sequences.extended_feature as exfc
from lxmls.readers import pos_corpus
Load the data from the CoNLL task.
In [ ]:
corpus = lxmls.readers.pos_corpus.PostagCorpus()
train_seq = corpus.read_sequence_list_conll(DATA_PATH + "/train-02-21.conll",
max_sent_len=10, max_nr_sent=1000)
test_seq = corpus.read_sequence_list_conll(DATA_PATH + "/test-23.conll",
max_sent_len=10, max_nr_sent=1000)
dev_seq = corpus.read_sequence_list_conll(DATA_PATH + "/dev-22.conll",
max_sent_len=10, max_nr_sent=1000)
In [ ]:
print("There are", len(train_seq), "examples in train_seq")
print("First example:", train_seq[0])
In [ ]:
## Building features
feature_mapper = idfc.IDFeatures(train_seq)
feature_mapper.build_features()
A `feature_mapper` will contain the following attributes:

- `.dataset`, a copy of the training sequences it was built from;
- `.feature_dict`, a dictionary `{}` filled in by the `.build_features()` function;
- `.feature_list`, a list `[]` filled in by the `.build_features()` function.

It will also contain the method `.build_features()`, which populates `.feature_dict` and `.feature_list` from `.dataset`.
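As a rough sketch of what `.build_features()` does (an illustrative toy class, not the lxmls implementation; the `ToyFeatureMapper` name and its simplified `(words, tags)` input format are assumptions):

```python
class ToyFeatureMapper:
    """Minimal sketch of a feature mapper: maps feature strings to integer ids."""

    def __init__(self, dataset):
        self.dataset = dataset   # list of (words, tags) training examples
        self.feature_dict = {}   # feature name -> feature id
        self.feature_list = []   # per-example lists of fired feature ids

    def add_feature(self, name):
        # Assign a fresh id the first time a feature name is seen.
        if name not in self.feature_dict:
            self.feature_dict[name] = len(self.feature_dict)
        return self.feature_dict[name]

    def build_features(self):
        for words, tags in self.dataset:
            init = [self.add_feature("init_tag:%s" % tags[0])]
            trans = [self.add_feature("prev_tag:%s::%s" % (tags[i - 1], tags[i]))
                     for i in range(1, len(tags))]
            final = [self.add_feature("final_prev_tag:%s" % tags[-1])]
            emission = [self.add_feature("id:%s::%s" % (w, t))
                        for w, t in zip(words, tags)]
            self.feature_list.append([init, trans, final, emission])

mapper = ToyFeatureMapper([(["Mr.", "Ward", "plays"], ["noun", "noun", "verb"])])
mapper.build_features()
```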
In [ ]:
len(feature_mapper.feature_list)
In [ ]:
## Let's see the features for the first training example
feature_mapper.feature_list[0]
In [ ]:
## The previous features can be classified into:
print("\nInitial features:", feature_mapper.feature_list[0][0])
print("\nTransition features:", feature_mapper.feature_list[0][1])
print("\nFinal features:", feature_mapper.feature_list[0][2])
print("\nEmission features:", feature_mapper.feature_list[0][3])
All features for all the training examples in `train_seq` are saved in `feature_mapper.feature_list`. If `feature_mapper.feature_list[m]` is our feature representation for training example m ... why is it not a vector of `len(feature_mapper.feature_dict)` features?
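The answer is that features are stored sparsely: each example keeps only the integer ids of the features that fire, rather than a dense binary vector over all `len(feature_mapper.feature_dict)` dimensions. A minimal sketch of the correspondence (the toy numbers are made up):

```python
num_features = 6       # e.g. len(feature_mapper.feature_dict)
fired_ids = [0, 2, 5]  # the feature ids stored for one example

# The equivalent dense binary vector that feature_list avoids materializing.
dense = [0.0] * num_features
for i in fired_ids:
    dense[i] = 1.0
```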
In [ ]:
len(train_seq), len(feature_mapper.feature_list)
Features are identified by the prefixes `init_tag:`, `prev_tag:`, `final_prev_tag:` and `id:`:

- `init_tag:noun` is an initial feature describing that the first word is a noun.
- `prev_tag:noun::noun` is a transition feature describing that the previous tag was noun and the current tag is noun.
- `prev_tag:noun::.` is a transition feature describing that the previous tag was noun and the current tag is `.` (this is usually found as the last transition feature, since most sentences end with a dot).
- `final_prev_tag:.` is a final feature stating that the last "word" in the sentence was a dot.
- `id:plays::verb` is an emission feature describing that the current word is "plays" and the current hidden state is verb.
- `id:Feb.::noun` is an emission feature describing that the current word is "Feb." and the current hidden state is noun.
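Assuming the `type:arg::arg` naming convention shown above, the pieces of a feature name can be recovered by splitting on the separators (a hedged helper for illustration, not part of lxmls):

```python
def parse_feature(name):
    """Split a feature name like 'prev_tag:noun::verb' into (type, args)."""
    feat_type, rest = name.split(":", 1)  # prefix before the first colon
    return feat_type, rest.split("::")    # remaining arguments

parse_feature("id:plays::verb")  # -> ("id", ["plays", "verb"])
```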
In [ ]:
inv_feature_dict = {feat_id: feat_name for feat_name, feat_id in feature_mapper.feature_dict.items()}  # invert: feature id -> feature name
In [ ]:
feature_mapper.feature_list[0][0]
In [ ]:
[inv_feature_dict[x[0]] for x in feature_mapper.feature_list[0][0]]
In [ ]:
[inv_feature_dict[x[0]] for x in feature_mapper.feature_list[0][1]]
In [ ]:
[inv_feature_dict[x[0]] for x in feature_mapper.feature_list[0][2]]
In [ ]:
len(train_seq.x_dict)
In [ ]:
## Train crf
crf_online = crfo.CRFOnline(corpus.word_dict, corpus.tag_dict, feature_mapper)
crf_online.num_epochs = 20
crf_online.train_supervised(train_seq)
In [ ]:
## You will receive feedback when each epoch is finished,
## note that running the 20 epochs might take a while. After training is done,
## evaluate the learned model on the training, development and test sets.
pred_train = crf_online.viterbi_decode_corpus(train_seq)
pred_dev = crf_online.viterbi_decode_corpus(dev_seq)
pred_test = crf_online.viterbi_decode_corpus(test_seq)
eval_train = crf_online.evaluate_corpus(train_seq, pred_train)
eval_dev = crf_online.evaluate_corpus(dev_seq, pred_dev)
eval_test = crf_online.evaluate_corpus(test_seq, pred_test)
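`viterbi_decode_corpus` applies Viterbi decoding to each sequence. As a self-contained sketch of the algorithm itself (the score-matrix interface and shapes here are illustrative assumptions, not the lxmls API):

```python
import numpy as np

def viterbi(init_scores, trans_scores, emit_scores):
    """Return the highest-scoring tag sequence and its score.

    init_scores:  (num_tags,)           score of starting in each tag
    trans_scores: (num_tags, num_tags)  trans_scores[prev, cur]
    emit_scores:  (length, num_tags)    per-position tag scores
    """
    length, num_tags = emit_scores.shape
    delta = np.zeros((length, num_tags))          # best score ending in each tag
    backptr = np.zeros((length, num_tags), dtype=int)
    delta[0] = init_scores + emit_scores[0]
    for i in range(1, length):
        # scores[prev, cur]: best path through prev, then transition, then emit.
        scores = delta[i - 1][:, None] + trans_scores + emit_scores[i]
        backptr[i] = scores.argmax(axis=0)
        delta[i] = scores.max(axis=0)
    # Follow back-pointers from the best final tag.
    path = [int(delta[-1].argmax())]
    for i in range(length - 1, 0, -1):
        path.append(int(backptr[i, path[-1]]))
    return path[::-1], float(delta[-1].max())
```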
In [ ]:
print("CRF - ID Features Accuracy Train: %.3f Dev: %.3f Test: %.3f" \
%(eval_train,eval_dev, eval_test))
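`evaluate_corpus` reports per-token accuracy. A standalone sketch of that metric (assuming gold and predicted tag sequences are plain lists; this is not the lxmls implementation):

```python
def token_accuracy(gold_seqs, pred_seqs):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = total = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        for g, p in zip(gold, pred):
            correct += g == p
            total += 1
    return correct / total
```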
In [ ]: