Morphological segmentation

Judit Ács

judit@sch.bme.hu

ML Jeju Camp 2017

Mentor: Lucy Park

28 July 2017

Introduction

Budapest University of Technology and Economics

  • BSc and MSc in electrical engineering
  • PhD student in CS (2014 - ?)
  • junior lecturer from this September
    • TA for 3 years
    • jointly preparing a new course on Python and natural language processing
  • member of the Human Language Technology Group, Hungarian Academy of Sciences
  • hlt.bme.hu

  • born and raised in Budapest, Hungary
  • hobbies: martial arts (Taekwondo and Karate), handcraft, learning languages, ping-pong?

Motivation

  1. linguistic motivation
  2. machine learning

Hungarian language

  • Uralic language, not related to most European languages
    • 56% of Uralic speakers are Hungarian (~13 million)
    • related to Finnish and Estonian
  • many noun cases (18 in Hungarian)
  • non-tonal
  • no grammatical gender - one word for he and she
  • vowel harmony
  • Latin script + 9 additional vowels
  • agglutinative (large number of suffixes)

ML motivation

  • morphological analysis is necessary for downstream tasks such as machine translation and information extraction
    • word-level approach is not sufficient for Hungarian
  • rule-based analyzers
    • handcrafted rules by linguists
    • large finite-state transducers
    • not available for most languages
  • deep learning has yet to conquer morphology

Tasks

1. morphological segmentation

Segment a word into morphemes (the smallest units with meaning and grammatical function)

input              output                      English translation
autóval            autó + val                  with (a) car
államtitkárságba   állam + titkár + ság + ba   into (the) undersecretary office

Different from part-of-speech tagging

2. morphological reinflection

Lemma + CASE/TENSE/etc. --> inflected word

lemma   target inflection    inflected word   English translation
autó    INS (instrumental)   autóval          with (a) car
eszik   PAST                 evett            (he/she) ate

Datasets

Hungarian Webcorpus (Halácsy et al. 2004)

  • web crawl, 600M tokens
  • preprocessing
    • lowercasing
    • filter English words
    • filter anything outside the Hungarian alphabet
    • filter long words
  • 6M types after heavy filtering
  • run rule-based morphological analyzer

Segmentation data

  • autó + val
  • sequence tagging problem
    • binary classification for each character: is it the start of a new segment?
  • autóval → BEEEBEE
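The BE encoding above can be sketched as a small helper (a hypothetical function, not code from the talk):

```python
def to_tags(morphemes):
    """Encode a segmented word as one tag per character:
    B marks the first character of a morpheme, E every other character."""
    return "".join("B" + "E" * (len(m) - 1) for m in morphemes)

to_tags(["autó", "val"])  # "BEEEBEE"
```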

Korean segmentation data

  • Sejong corpus
  • manually annotated morphological tags
input   output
이건    이거 + ㄴ
  • the output morphemes are not a plain split of the input characters, so this is not a sequence tagging problem :(

Reinflection data - instrumental case

input   output     meaning      what happens
autó    autóval    with car
Peti    PetivEl    with Pete    vowel harmony
fej     fejJel     with head    assimilation
pálca   pálcÁval   with stick   low vowel lengthening
kulcs   kulCCSal   with key     digraph + assimilation
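As an illustration, the phenomena in this table can be captured by a toy rule sketch (hypothetical code covering only these five examples; real Hungarian morphology is considerably more complex):

```python
VOWELS = "aáeéiíoóöőuúüű"
BACK = set("aáoóuú")
DIGRAPHS = ("cs", "dz", "gy", "ly", "ny", "sz", "ty", "zs")

def instrumental(stem):
    """Toy derivation of the Hungarian instrumental case (-val / -vel),
    covering only the phenomena listed in the table above."""
    # vowel harmony: any back vowel in the stem selects -val, otherwise -vel
    suffix = "val" if any(c in BACK for c in stem) else "vel"
    # low vowel lengthening: a final a/e lengthens before the suffix
    if stem[-1] in "ae":
        stem = stem[:-1] + {"a": "á", "e": "é"}[stem[-1]]
    if stem[-1] not in VOWELS:
        # assimilation: suffix-initial v becomes the stem-final consonant,
        # written as a geminate (digraphs double as e.g. cs -> ccs)
        d = next((d for d in DIGRAPHS if stem.endswith(d)), stem[-1])
        return stem[:-len(d)] + d[0] + d + suffix[1:]
    return stem + suffix
```

A sequence-to-sequence model has to learn exactly these kinds of character-level changes from data.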

Baselines

Byte pair encoding

  1. find the most frequent pair of adjacent symbols (initially characters)
  2. replace it with a new symbol
  3. repeat
  • unsupervised
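The three steps above can be sketched as a single merge iteration, assuming words are stored as space-separated symbols with corpus frequencies (function and variable names are hypothetical):

```python
from collections import Counter

def bpe_step(vocab):
    """One byte-pair-encoding iteration: count all adjacent symbol pairs,
    then merge the most frequent pair into a single new symbol everywhere.
    vocab maps a word (space-separated symbols) to its corpus frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    if not pairs:
        return vocab, None
    best = max(pairs, key=pairs.get)
    old, new = " ".join(best), "".join(best)
    # note: plain string replace is a simplification and can over-merge
    # once multi-character symbols share suffixes
    return {word.replace(old, new): freq for word, freq in vocab.items()}, best

# toy vocabulary: two word types with counts
vocab = {"a u t ó v a l": 3, "a u t ó": 2}
vocab, merged = bpe_step(vocab)
```

Repeating the step a fixed number of times yields a subword vocabulary without any supervision.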

Entropy-based segmentation

  • segment boundaries tend to show more variation in their neighboring characters, i.e. higher entropy
  • insert a segment boundary where a character boundary looks similar to known morpheme boundaries
  • weakly supervised
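One way to compute the branching entropy this method relies on (a simplified sketch; the exact context length and thresholding used in the talk are not specified):

```python
import math
from collections import Counter, defaultdict

def right_entropy(words, context_len=2):
    """Entropy of the character following each context of length context_len;
    high values suggest a morpheme boundary after the context."""
    follow = defaultdict(Counter)
    for word in words:
        for i in range(context_len, len(word)):
            follow[word[i - context_len:i]][word[i]] += 1
    entropy = {}
    for ctx, counts in follow.items():
        total = sum(counts.values())
        entropy[ctx] = -sum(c / total * math.log2(c / total)
                            for c in counts.values())
    return entropy
```

Candidate boundaries are the positions where this entropy spikes relative to neighboring positions.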

Morfessor

  • Virpioja et al. 2013
  • unsupervised and semi-supervised morphological segmenter

Models

Sequence tagging

  1. bidirectional stacked LSTM/GRU
  2. character-CNN + LSTM layers

Reinflection and Korean segmentation

sequence-to-sequence models

  1. legacy_seq2seq with Bahdanau attention
  2. newer seq2seq

Problems

  • overdesign
  • Make it work, make it right, make it fast
  • learning rate scheduling
  • (new) seq2seq with attention doesn't converge

Results - sequence tagging

  1. character-level precision, recall
  2. morpheme-level precision, recall
  3. word accuracy
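Assuming the BE tag encoding from the data section, the boundary-level metrics and word accuracy might be computed as follows (a hypothetical helper, not the talk's evaluation code):

```python
def boundary_scores(gold, pred):
    """Character-level precision/recall/F1 on predicted morpheme starts
    (B tags), skipping position 0, which is trivially a boundary;
    word accuracy is the fraction of perfectly tagged words."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        for gc, pc in zip(g[1:], p[1:]):
            if gc == "B" and pc == "B":
                tp += 1
            elif pc == "B":
                fp += 1
            elif gc == "B":
                fn += 1
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    word_acc = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    return prec, rec, f1, word_acc
```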

Baselines

Method                 boundary prec   boundary recall   boundary F-score   word accuracy
byte-pair encoding     0.3795          0.7646            0.5073             0.0822
entropy segmentation   0.5804          0.6071            0.5934             0.5049
Morfessor              0.6410          0.5632            0.5996             0.2432

Deep models

Method   boundary F-score   morpheme F-score   word accuracy
CNN      0.911              0.859              0.756
LSTM     0.900              0.841              0.724

Best scoring parameters

CNN

  • 20-dim embedding
  • 0.2 dropout
  • 2 layers of 1D convolution (200 filters, stride 5, sigmoid activation)
  • 1 GRU layer with 128 cells
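A minimal Keras sketch of such an architecture, assuming that "stride 5" means a kernel width of 5 (a literal stride of 5 would not give one prediction per character) and that the maximum word length and alphabet size (MAXLEN, VOCAB) are placeholder values:

```python
from tensorflow.keras import layers, models

MAXLEN = 30   # assumed maximum word length (characters)
VOCAB = 40    # assumed alphabet size

model = models.Sequential([
    layers.Input(shape=(MAXLEN,)),
    layers.Embedding(VOCAB, 20),                          # 20-dim char embedding
    layers.Dropout(0.2),
    layers.Conv1D(200, 5, padding="same", activation="sigmoid"),
    layers.Conv1D(200, 5, padding="same", activation="sigmoid"),
    layers.GRU(128, return_sequences=True),               # 1 GRU layer, 128 cells
    layers.Dense(1, activation="sigmoid"),                # P(segment start) per char
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```

Training targets are the per-character B/E tags, with B mapped to 1.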

LSTM

  • 20-dim embedding
  • 3 layers of LSTM with 256 cells

Reinflection - instrumental case

Metric Result
word accuracy 0.9066

Korean segmentation

Metric                       Result
boundary detection F-score   0.958
morpheme detection F-score   0.9371
word accuracy                0.8838

Conclusion

  • implemented three supervised models for two tasks: morphological segmentation and reinflection
  • 2 datasets
    1. output of a rule-based analyzer
    2. Korean segmentation dataset
  • CNN-based architectures perform best
  • manual error analysis suggests that the training data is noisy

Future work

Dataset improvements

  • improve disambiguation
  • use context
  • train and test data should not have overlapping morphemes (thx Rishabh)

Model

  • fix seq2seq attention
  • add unidirectional LSTM on top of bidirectional encoder (thx Tommy)
  • try CBHG from the Tacotron paper (thx Ryan)
  • decent learning rate scheduling
  • better representation for assimilation

THE DREAM: unsupervised rule discovery from raw text

Many thanks

My mentor

The organizers

Camp participants

Thank you for your attention

Ács Judit (아치 유딧)

judit@sch.bme.hu

Github

艾奇尤迪特

アーチユヂット