Bigram tagging with HMMs

A from-scratch implementation of a bigram part-of-speech (POS) tagger based on first-order hidden Markov models.



In [1]:

    
!pwd









    



/home/mapologo/projects/hmm_tagging



In [2]:

    
%load_ext ipycache









    



/usr/local/lib/python2.7/dist-packages/IPython/config.py:13: ShimWarning: The `IPython.config` package has been deprecated. You should import from traitlets.config instead.
  "You should import from traitlets.config instead.", ShimWarning)
/usr/local/lib/python2.7/dist-packages/IPython/utils/traitlets.py:5: UserWarning: IPython.utils.traitlets has moved to a top-level traitlets package.
  warn("IPython.utils.traitlets has moved to a top-level traitlets package.")

Let's try with some toy data

The example is taken from Borodovsky & Ekisheva (2006), pp. 80-81.



In [3]:

    
from bigram_tagging import testing_viterbi
testing_viterbi()









    



[u'H' u'H' u'H' u'L' u'L' u'L' u'L' u'L' u'L']
[[u'H' u'H' u'H' u'H' u'L' u'H' u'L' u'H' u'L']
 [u'L' u'H' u'H' u'H' u'L' u'L' u'L' u'L' u'L']]
[[ -1.89711998  -3.79423997  -5.69135995  -7.99394505  -9.70874348
  -12.01132857 -13.54380544 -15.84639053 -17.78433251]
 [ -2.30258509  -4.19970508  -6.09682506  -7.58847994  -9.70874348
  -11.4235419  -13.54380544 -15.25860387 -16.9734023 ]]

Everything seems to be OK!
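
For reference, the check above can be reproduced in a few lines of plain Python. This is only a minimal sketch, not the code in bigram_tagging: the function name `viterbi`, the observation string `GGCACTGAA`, and the start, transition and emission probabilities are assumptions (the usual two-state H/L textbook setup), chosen because they give the same best path and final log score as the output printed above.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely state path of a first-order HMM, computed in log space."""
    # score[t][s]: best log-probability of any path that ends in state s at time t
    score = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(obs)):
        score.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda r: score[t - 1][r] + math.log(trans_p[r][s]))
            score[t][s] = (score[t - 1][prev] + math.log(trans_p[prev][s])
                           + math.log(emit_p[s][obs[t]]))
            back[t][s] = prev
    # Follow the back-pointers from the best final state.
    last = max(states, key=lambda s: score[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, score[-1][last]

# Assumed toy parameters (not taken from bigram_tagging).
states = ["H", "L"]
start_p = {"H": 0.5, "L": 0.5}
trans_p = {"H": {"H": 0.5, "L": 0.5}, "L": {"H": 0.4, "L": 0.6}}
emit_p = {"H": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
          "L": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}

path, log_prob = viterbi("GGCACTGAA", states, start_p, trans_p, emit_p)
print(path)      # ['H', 'H', 'H', 'L', 'L', 'L', 'L', 'L', 'L']
print(log_prob)  # about -16.9734 (natural log)
```

Working with sums of logarithms instead of products of probabilities avoids numerical underflow on long sentences, which is presumably why the scores above are negative numbers rather than tiny probabilities.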

Model training

Now estimate the model parameters. The next step takes some time, so you can go for a coffee. The result is saved in de-model.npz, a file in NumPy zip format.



In [4]:

    
from bigram_tagging import train



In [5]:

    
%%cache model.pkl start_f, trans_f, emit_f, obs_states
start_f, trans_f, emit_f, obs_states = train("de-train.tt")









    



[Skipped the cell's code and loaded variables emit_f, obs_states, start_f, trans_f from file '/home/mapologo/projects/hmm_tagging/model.pkl'.]
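
To give an idea of what parameter estimation for a bigram HMM involves, here is a minimal sketch that counts start, transition and emission frequencies from tagged sentences. It is not the `train` function used above: the name `count_frequencies`, the in-memory input format (a list of sentences, each a list of `(word, tag)` pairs) and the toy German sentences are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def count_frequencies(tagged_sents):
    """Count the events a bigram HMM tagger needs from tagged sentences."""
    start_f = Counter()             # start_f[tag]: sentences that begin with tag
    trans_f = defaultdict(Counter)  # trans_f[prev][cur]: tag bigram counts
    emit_f = defaultdict(Counter)   # emit_f[tag][word]: word emitted by tag
    for sent in tagged_sents:
        if not sent:
            continue
        tags = [tag for _, tag in sent]
        start_f[tags[0]] += 1
        for prev, cur in zip(tags, tags[1:]):
            trans_f[prev][cur] += 1
        for word, tag in sent:
            emit_f[tag][word] += 1
    return start_f, trans_f, emit_f

# Hypothetical toy corpus, only for illustration.
toy = [[("der", "DET"), ("Hund", "NOUN"), ("bellt", "VERB")],
       [("die", "DET"), ("Katze", "NOUN")]]
toy_start, toy_trans, toy_emit = count_frequencies(toy)
print(toy_trans["DET"]["NOUN"])  # 2
```

Dividing each count by its row total gives the maximum-likelihood estimates; keeping raw counts instead leaves the choice of smoothing open until tagging time.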

Model Evaluation

Let's generate a tagged corpus for new data



In [6]:

    
from bigram_tagging import evaluate_model, add_one_smoothing



In [7]:

    
%%cache test_corpus.pkl sents, t_sents
sents, t_sents = evaluate_model("de-test.t", start_f, trans_f, emit_f, obs_states, add_one_smoothing)









    



[Skipped the cell's code and loaded variables sents, t_sents from file '/home/mapologo/projects/hmm_tagging/test_corpus.pkl'.]
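
New data contains words and tag bigrams that never occurred in training, and an unsmoothed model would give them probability zero. Add-one (Laplace) smoothing avoids this by pretending every event was seen once more than it was. The sketch below shows the idea; the helper name `add_one_log_prob` and its arguments are hypothetical and need not match the signature of `add_one_smoothing` in bigram_tagging.

```python
import math

def add_one_log_prob(counts, key, n_outcomes):
    """Log of the add-one estimate (c(key) + 1) / (total + n_outcomes).

    counts     -- mapping from outcome to its frequency in one context
    n_outcomes -- number of possible outcomes (e.g. vocabulary size)
    """
    total = sum(counts.values())
    return math.log((counts.get(key, 0) + 1.0) / (total + n_outcomes))

# Hypothetical emission counts for one tag, with a vocabulary of 5 words.
noun_counts = {"Hund": 3, "Katze": 1}
seen = math.exp(add_one_log_prob(noun_counts, "Hund", 5))    # (3 + 1) / (4 + 5)
unseen = math.exp(add_one_log_prob(noun_counts, "Maus", 5))  # (0 + 1) / (4 + 5)
print(seen, unseen)
```

The unseen word still gets a small non-zero probability, so the Viterbi search can always find some path through the sentence.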



In [8]:

    
from bigram_tagging import write_corpus



In [9]:

    
write_corpus("de-test.tt", sents, t_sents)

Now, try the evaluation script



In [10]:

    
!python eval.py de-test.tt de-eval.tt









    



Comparing gold file "de-test.tt" and system file "de-eval.tt"

Precision, recall, and F1 score:

  ADV 0.7553 0.8812 0.8134
 NOUN 0.8454 0.8933 0.8687
  ADP 0.9859 0.8360 0.9047
  PRT 0.7980 0.9459 0.8657
  DET 0.9874 0.6987 0.8184
    . 0.9807 0.9329 0.9562
 PRON 0.7788 0.8216 0.7996
 VERB 0.8571 0.8841 0.8704
  NUM 0.5556 1.0000 0.7143
 CONJ 0.8515 0.9670 0.9055
  ADJ 0.6102 0.7835 0.6861

Accuracy: 0.8601
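
The three columns above are related by F1 = 2PR / (P + R); for ADV, 2 × 0.7553 × 0.8812 / (0.7553 + 0.8812) ≈ 0.8134. The following sketch shows how such per-tag scores can be computed from two aligned tag sequences. It is not the code of eval.py: the function name `per_tag_scores` and the toy tag lists are assumptions for illustration.

```python
from collections import Counter

def per_tag_scores(gold, pred):
    """Per-tag (precision, recall, F1) and overall accuracy for aligned tag lists."""
    tp, pred_n, gold_n = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        gold_n[g] += 1
        pred_n[p] += 1
        if g == p:
            tp[g] += 1
    scores = {}
    for tag in set(gold_n) | set(pred_n):
        precision = tp[tag] / float(pred_n[tag]) if pred_n[tag] else 0.0
        recall = tp[tag] / float(gold_n[tag]) if gold_n[tag] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[tag] = (precision, recall, f1)
    accuracy = sum(tp.values()) / float(len(gold)) if gold else 0.0
    return scores, accuracy

# Hypothetical toy data: one VERB is mistagged as NOUN.
gold_tags = ["NOUN", "VERB", "NOUN", "DET"]
pred_tags = ["NOUN", "NOUN", "NOUN", "DET"]
scores, accuracy = per_tag_scores(gold_tags, pred_tags)
print(scores["NOUN"], accuracy)
```

Here NOUN has perfect recall but a precision of 2/3, the same pattern as NUM in the table above (recall 1.0000, precision 0.5556): the tag is never missed, but it is assigned too often.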


