Morphological segmentation

Judit Ács

judit@sch.bme.hu

ML Jeju Camp 2017

Mentor: Lucy Park

28 July 2017

Introduction

Budapest University of Technology and Economics

  • BSc and MSc in electrical engineering
  • PhD student in CS (2014 - ?)
  • junior lecturer from this September
    • TA for 3 years
    • jointly preparing a new course on Python and natural language processing
  • member of the Human Language Technology Group, Hungarian Academy of Sciences
  • hlt.bme.hu

  • born and raised in Budapest, Hungary
  • hobbies: martial arts (Taekwondo and Karate), handcraft, learning languages, ping-pong?

Motivation

  1. linguistic motivation
  2. machine learning

Hungarian language

  • Uralic language, not related to most European languages
    • 56% of Uralic speakers are Hungarian (~13 million)
    • related to Finnish and Estonian
  • many noun cases (18 in Hungarian)
  • non-tonal
  • no grammatical gender - one word for he and she
  • vowel harmony
  • Latin script + 9 additional vowels
  • agglutinative (large number of suffixes)

ML motivation

  • morphological analysis is necessary for downstream tasks such as machine translation and information extraction
    • word-level approach is not sufficient for Hungarian
  • rule-based analyzers
    • handcrafted rules by linguists
    • large finite-state transducers
    • not available for most languages
  • deep learning has yet to conquer morphology

Tasks

1. morphological segmentation

Segment a word into morphemes (the smallest units with meaning and grammatical function)

input              output                      English translation
autóval            autó + val                  with (a) car
államtitkárságba   állam + titkár + ság + ba   into (the) undersecretary office

Different from part-of-speech tagging

2. morphological reinflection

Lemma + CASE/TENSE/etc. --> inflected word

lemma   target inflection    inflected word   English translation
autó    INS (instrumental)   autóval          with (a) car
eszik   PAST                 evett            (he/she) ate

Datasets

Hungarian Webcorpus (Halácsy et al. 2004)

  • web crawl, 600M tokens
  • preprocessing
    • lowercasing
    • filter English words
    • filter anything outside the Hungarian alphabet
    • filter long words
  • 6M types after heavy filtering
  • run rule-based morphological analyzer

Segmentation data

  • autó + val
  • sequence tagging problem
    • binary classification for each character: is it the start of a new segment?
  • autóval → BEEEBEE
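The BE encoding above can be sketched as a small helper (a hypothetical function, not code from the talk):

```python
def to_tags(morphemes):
    """Encode a segmented word as one tag per character:
    B marks the first character of a morpheme, E every other character."""
    return "".join("B" + "E" * (len(m) - 1) for m in morphemes)

to_tags(["autó", "val"])  # "BEEEBEE"
```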

Korean segmentation data

  • Sejong corpus
  • manually annotated morphological tags
input   output
이건    이거 + ㄴ
  • the output morphemes are not a plain split of the input characters, so this is not a sequence tagging problem :(

Reinflection data - instrumental case

input   output     meaning      what happens
autó    autóval    with car
Peti    PetivEl    with Pete    vowel harmony
fej     fejJel     with head    assimilation
pálca   pálcÁval   with stick   low vowel lengthening
kulcs   kulCCSal   with key     digraph + assimilation
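As an illustration, the phenomena in this table can be captured by a toy rule sketch (hypothetical code covering only these five examples; real Hungarian morphology is considerably more complex):

```python
VOWELS = "aáeéiíoóöőuúüű"
BACK = set("aáoóuú")
DIGRAPHS = ("cs", "dz", "gy", "ly", "ny", "sz", "ty", "zs")

def instrumental(stem):
    """Toy derivation of the Hungarian instrumental case (-val / -vel),
    covering only the phenomena listed in the table above."""
    # vowel harmony: any back vowel in the stem selects -val, otherwise -vel
    suffix = "val" if any(c in BACK for c in stem) else "vel"
    # low vowel lengthening: a final a/e lengthens before the suffix
    if stem[-1] in "ae":
        stem = stem[:-1] + {"a": "á", "e": "é"}[stem[-1]]
    if stem[-1] not in VOWELS:
        # assimilation: suffix-initial v becomes the stem-final consonant,
        # written as a geminate (digraphs double as e.g. cs -> ccs)
        d = next((d for d in DIGRAPHS if stem.endswith(d)), stem[-1])
        return stem[:-len(d)] + d[0] + d + suffix[1:]
    return stem + suffix
```

A sequence-to-sequence model has to learn exactly these kinds of character-level changes from data.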

Baselines

Byte pair encoding

  1. find the most frequent pair of adjacent symbols (initially characters)
  2. replace it with a new symbol
  3. repeat
  • unsupervised
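The three steps above can be sketched as a single merge iteration, assuming words are stored as space-separated symbols with corpus frequencies (function and variable names are hypothetical):

```python
from collections import Counter

def bpe_step(vocab):
    """One byte-pair-encoding iteration: count all adjacent symbol pairs,
    then merge the most frequent pair into a single new symbol everywhere.
    vocab maps a word (space-separated symbols) to its corpus frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    if not pairs:
        return vocab, None
    best = max(pairs, key=pairs.get)
    old, new = " ".join(best), "".join(best)
    # note: plain string replace is a simplification and can over-merge
    # once multi-character symbols share suffixes
    return {word.replace(old, new): freq for word, freq in vocab.items()}, best

# toy vocabulary: two word types with counts
vocab = {"a u t ó v a l": 3, "a u t ó": 2}
vocab, merged = bpe_step(vocab)
```

Repeating the step a fixed number of times yields a subword vocabulary without any supervision.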

Entropy-based segmentation

  • segment boundaries tend to show more variation in their neighboring characters, i.e. higher entropy
  • insert a segment boundary where a character boundary looks similar to known morpheme boundaries
  • weakly supervised
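One way to compute the branching entropy this method relies on (a simplified sketch; the exact context length and thresholding used in the talk are not specified):

```python
import math
from collections import Counter, defaultdict

def right_entropy(words, context_len=2):
    """Entropy of the character following each context of length context_len;
    high values suggest a morpheme boundary after the context."""
    follow = defaultdict(Counter)
    for word in words:
        for i in range(context_len, len(word)):
            follow[word[i - context_len:i]][word[i]] += 1
    entropy = {}
    for ctx, counts in follow.items():
        total = sum(counts.values())
        entropy[ctx] = -sum(c / total * math.log2(c / total)
                            for c in counts.values())
    return entropy
```

Candidate boundaries are the positions where this entropy spikes relative to neighboring positions.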

Morfessor

  • Virpioja et al. 2013
  • unsupervised and semi-supervised morphological segmenter

Models

Sequence tagging

  1. bidirectional stacked LSTM/GRU
  2. character-CNN + LSTM layers

Reinflection and Korean segmentation

sequence-to-sequence models

  1. legacy_seq2seq with Bahdanau attention
  2. newer seq2seq

Problems

  • overdesign
  • Make it work, make it right, make it fast
  • learning rate scheduling
  • (new) seq2seq with attention doesn't converge

Results - sequence tagging

  1. character-level precision, recall
  2. morpheme-level precision, recall
  3. word accuracy
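Assuming the BE tag encoding from the data section, the boundary-level metrics and word accuracy might be computed as follows (a hypothetical helper, not the talk's evaluation code):

```python
def boundary_scores(gold, pred):
    """Character-level precision/recall/F1 on predicted morpheme starts
    (B tags), skipping position 0, which is trivially a boundary;
    word accuracy is the fraction of perfectly tagged words."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        for gc, pc in zip(g[1:], p[1:]):
            if gc == "B" and pc == "B":
                tp += 1
            elif pc == "B":
                fp += 1
            elif gc == "B":
                fn += 1
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    word_acc = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    return prec, rec, f1, word_acc
```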

Baselines

Method                 boundary prec   boundary recall   boundary F-score   word accuracy
byte-pair encoding     0.3795          0.7646            0.5073             0.0822
entropy segmentation   0.5804          0.6071            0.5934             0.5049
Morfessor              0.6410          0.5632            0.5996             0.2432

Deep models

Method   boundary F-score   morpheme F-score   word accuracy
CNN      0.911              0.859              0.756
LSTM     0.900              0.841              0.724

Best scoring parameters

CNN

  • 20-dim embedding
  • 0.2 dropout
  • 2 layers of 1D convolution (200 filters, stride 5, sigmoid activation)
  • 1 GRU layer with 128 cells
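A minimal Keras sketch of such an architecture, assuming that "stride 5" means a kernel width of 5 (a literal stride of 5 would not give one prediction per character) and that the maximum word length and alphabet size (MAXLEN, VOCAB) are placeholder values:

```python
from tensorflow.keras import layers, models

MAXLEN = 30   # assumed maximum word length (characters)
VOCAB = 40    # assumed alphabet size

model = models.Sequential([
    layers.Input(shape=(MAXLEN,)),
    layers.Embedding(VOCAB, 20),                          # 20-dim char embedding
    layers.Dropout(0.2),
    layers.Conv1D(200, 5, padding="same", activation="sigmoid"),
    layers.Conv1D(200, 5, padding="same", activation="sigmoid"),
    layers.GRU(128, return_sequences=True),               # 1 GRU layer, 128 cells
    layers.Dense(1, activation="sigmoid"),                # P(segment start) per char
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```

Training targets are the per-character B/E tags, with B mapped to 1.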

LSTM

  • 20-dim embedding
  • 3 layers of LSTM with 256 cells

Reinflection - instrumental case

Metric Result
word accuracy 0.9066

Korean segmentation

Metric                       Result
boundary detection F-score   0.958
morpheme detection F-score   0.9371
word accuracy                0.8838

Conclusion

  • implemented three supervised models for two tasks: morphological segmentation and reinflection
  • 2 datasets
    1. output of a rule-based analyzer
    2. Korean segmentation dataset
  • CNN-based architectures perform best
  • manual error analysis suggests that the training data is noisy

Future work

Dataset improvements

  • improve disambiguation
  • use context
  • train and test data should not have overlapping morphemes (thx Rishabh)

Model

  • fix seq2seq attention
  • add unidirectional LSTM on top of bidirectional encoder (thx Tommy)
  • try CBHG from the Tacotron paper (thx Ryan)
  • decent learning rate scheduling
  • better representation for assimilation

THE DREAM: unsupervised rule discovery from raw text

Many thanks

My mentor

The organizers

Camp participants

Thank you for your attention

Ács Judit (아치 유딧)

judit@sch.bme.hu

Github

艾奇尤迪特

アーチユヂット