Introduction to Python and Natural Language Technologies

Morphology

Lecture 08

25 October 2017

8.1 What is morphology?

From Ancient Greek: "the study of forms"

  • μορφή (morphé): form, shape
  • λόγος (lógos): word, study

Has a different meaning in different fields:

  • Astronomy: the shape and structure of stars and galaxies
  • Biology: form and structure of organisms
  • Geology: the formation of landscapes
  • Linguistics: composition and structure of words

What does it do?

Morphology is the study of how words are formed in terms of the minimal meaning-bearing units: morphemes.

Parse / features   space[/N]   ship[/N]   tok[Poss.2Pl]   ból[Ela]
Surface form       űr +        hajó +     tok +           ból
Translation        "from your spaceship"

Meanings of the morphemes:

  • [/N]: noun
  • [Poss.2Pl]: possessive 2nd person plural (your)
  • [Ela]: elative case (from)

The strings next to the morphemes ([/N], [Poss.2Pl] and [Ela]) are called tags or features

  • They signify the syntactic role of the morpheme
  • This is the emMorph tagset (used in the e-magyar toolchain)
  • Obsolete tagsets for Hungarian: KR, MSD

Computational morphology usually works in both directions:

  • Analysis: parses a surface form into morphological features
  • Generation: given a list of features, produces the surface form

Why do we need it?

Several tasks require it:

  • spell checking
  • lemmatization
  • information retrieval (IR): a query should match all forms of a word
    • singular and plural nouns: finite state automaton / finite state automata
    • related word forms: minimize automata / automata minimization

Morphology serves as the first level in linguistic pipelines:

  • Syntax: parsers need to know the part-of-speech (and more)
  • Semantics: a large part of the meaning is contained in the words
  • Features for tasks:
    • Named Entity Recognition (NER)
    • Anaphora resolution
    • etc.

Isn't a list enough?

One might ask whether a single list of all word forms would not be enough

  • for spell checking: a list of correct word forms
  • in the sequence tagging example, we did not use morphology
  • a usual practice in language modeling, speech recognition
  • a list of a few hundred thousand / a few million elements is cheap

Works well for English:

  • Nouns have only two forms: singular and plural
  • Verbs have at most five (-s, -ed (past), -ed (participle), -ing)

No, it isn't

The list method quickly breaks down:

  • Agglutinative languages have additive and branching inflection
    • Hungarian nouns have about 700 forms
    • Turkish verbs have 40,000 possible forms
  • Derivation is problematic even in English
    • English: -ize $\rightarrow$ -ization
    • Hungarian: kelkáposztástalaníthatatlanságotoknak (roughly "for your (pl.) inability to be rid of savoy cabbage")
  • Complex, nonconcatenative morphologies (see the next section)

8.2 Morphological phenomena

Types of morphemes

  • Free morphemes can stand alone in the sentence.
    • root / stem / lemma: the "main" morpheme (e.g. quick, brown, fox)
  • Bound morphemes only occur attached to a stem.
    • affixes:
      • megnéz (looks $\rightarrow$ takes a look)
      • smallest

However, tmesis may occur, i.e. the parts of a word can be separated:

  • megnéz $\rightarrow$ nézd meg (take a look [Subj.Def.2Sg])
  • anfangen $\rightarrow$ Wann fängt es an? (when does it start?)

Affix types

  • derivational: change part-of-speech or meaning of the word
    • English: unpreparedness; computerize
    • Hungarian: kékség (blue-ness), fellelhető (can be found), felél (live $\rightarrow$ exhaust)
  • inflectional: fill syntactic functions
    • English looked ([Pst]); seas ([Pl])
    • Hungarian: meglátta (saw (it), [Pst.Def.3Sg]), asztalon (on the table, [Supe])

Affix placement

  • prefix:
    • English: extraterrestrial, disappear
    • Hungarian: előír (prescribe), átmegy (go $\rightarrow$ cross something)
  • suffix:
    • English: listening, shortest
    • Hungarian: nagyobb (larger), sast (eagle [Acc])

Affix placement

  • circumfix:
    • Hungarian: legrövidebb (shortest)
    • German: gehabt (have $\rightarrow$ had)
  • infix:
    • Tagalog: humingi (I borrow)
    • English: abso-fucking-lutely

Other phenomena

  • Compounds: words (usually nouns) with more than one stem
    • Hungarian: űrhajó (spaceship), zöldeskék (greenish blue, turquoise)
    • German: Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz (the law concerning the delegation of duties for the supervision of cattle marking and the labelling of beef)
  • Clitics are words reduced in form that are phonologically dependent on another word
    • English: we'll, it's
    • French: c'est la vie, l'heure

Nonconcatenative morphology

Grammatical / semantic changes are reflected by modifications to the stem rather than affixation.

  • Reduplication: the stem (or part of it) is repeated
    • Japanese plural (one type): 山々, yamayama (mountains)
    • Tagalog: lulutuin (will cook)
    • Hungarian (?): husi-musi ("little meaty"), ici-pici (itsy-bitsy)
  • Templatic: a consonant template is filled with vowels.
    • Arabic: k-t-b can become
      • kitāb: book
      • kutub: books
      • kātib: writer
      • ...
  • Ablaut: a sound change in the stem that conveys grammatical information
    • English: foot $\rightarrow$ feet, run $\rightarrow$ ran
    • German: sehen $\rightarrow$ siehst (see [2Sg])

Language types

  • Isolating languages: no inflection to speak of
    • Mandarin Chinese
  • Analytic languages: some inflection, but prefer word order and adpositions
    • English, Vietnamese
  • Synthetic languages: inflected, high morpheme-per-word ratio
    • Fusional: use a single suffix that encodes several grammatical functions
      • Romance and Slavic languages
    • Agglutinative: highly concatenative, separate "slots" for different grammatical functions
      • Uralic languages, Basque, Japanese, Turkish
    • Polysynthetic: "sentence-words"
      • Indigenous languages of America, Australia; Inuit languages

Don't pay too much attention to this:

  • no clear-cut boundaries
  • most languages have phenomena from several types

8.3 Computational Morphology

The established view of morphology is that morphemes and their intra-word interactions can be formalized as rules. A morphological analyzer has the following three main components:

  • Lexicon: lists the morphemes with basic information (e.g. part of speech)
  • Morphotactics: specifies morpheme ordering and their interactions (usually on an abstract level)
    • Hungarian nouns: stem(s) + ... + (plural) + case
    • English: sheep doesn't have a plural
  • Orthographic/Phonetic rules: model how the morphemes are mapped to letters / sounds
    • English: run + ing becomes running
    • Hungarian: asztal + val (with the table) becomes asztallal
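
As an illustration, the three components can be mocked up in a few lines of Python for a handful of English nouns (a toy sketch assumed for this text, not an actual analyzer):

# 1. Lexicon: morphemes with basic information (here: part of speech)
LEXICON = {"dog": "N", "fox": "N", "sheep": "N"}

# 2. Morphotactics: which morphemes may combine
def plural_allowed(stem):
    return stem != "sheep"  # "sheep" has no overt plural

# 3. Orthographic rules: how morphemes are mapped to letters
def attach_plural(stem):
    return stem + ("es" if stem.endswith(("s", "x", "z", "ch", "sh")) else "s")

def generate(stem, plural=False):
    assert LEXICON[stem] == "N"
    if plural and plural_allowed(stem):
        return attach_plural(stem)
    return stem

print(generate("fox", plural=True))    # foxes
print(generate("sheep", plural=True))  # sheep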

These components are usually implemented with finite state methods: automata and transducers.

History

  • Pāṇini (around 500 BC): the Aṣṭādhyāyī, a grammar of 3,959 rules (sūtras), which includes the morphological description of Classical Sanskrit
  • Johnson (1972): models phonological rewrite rules as finite state transducers
  • Koskenniemi (1983): two-level rules
  • Karttunen et al. (1992): composed all components into a single transducer

Papers cited above:

  • Johnson, C. D. (1972). Formal aspects of phonological description. The Hague: Mouton.
  • Koskenniemi, K. (1983). Two-level morphology. A General Computational Model for Word-Form Recognition and Production. Department of General Linguistics. University of Helsinki, Finland.
  • Karttunen, L., Kaplan, R. M., & Zaenen, A. (1992, August). Two-level morphology with composition. In Proceedings of the 14th conference on Computational linguistics-Volume 1 (pp. 141-148). Association for Computational Linguistics.

Deep learning

Most fields of NLP have become dominated by deep learning methods. However, computational morphology still uses the methods described above.

  • Deep learning is slow. Morphology has to be fast
  • Neural networks are not good at learning hard constraints and associations
  • Words and the phenomena present in any language are arbitrary, so they have to be listed explicitly
  • Dictionaries and lexicographers have been available since antiquity

Still, machine learning methods are starting to appear:

  • Morfessor (Virpioja et al., 2013): unsupervised and semi-supervised morphological segmentation
  • SIGMORPHON shared task (Cotterell et al., 2016): supervised morphological reinflection
    • Winner: the LMU system with 96% accuracy (Kann and Schütze, 2016)

Papers cited above:

  • Cotterell, R., Kirov, C., Sylak-Glassman, J., Yarowsky, D., Eisner, J., & Hulden, M. (2016, August). The SIGMORPHON 2016 shared task—morphological reinflection. In Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology (pp. 10-22).
  • Katharina Kann and Hinrich Schütze. 2016. MED: The LMU system for the SIGMORPHON 2016 shared task on morphological reinflection. In Proceedings of the 2016 Meeting of SIGMORPHON, Berlin, Germany. Association for Computational Linguistics.
  • Sami Virpioja, Peter Smit, Stig-Arne Grönroos, and Mikko Kurimo. Morfessor 2.0: Python Implementation and Extensions for Morfessor Baseline. Aalto University publication series SCIENCE + TECHNOLOGY, 25/2013. Aalto University, Helsinki, 2013. ISBN 978-952-60-5501-5.

Hungarian morphological analysers

Historically, several systems have been created for Hungarian morphological analysis, e.g. Humor (Prószéky and Kis, 1999) and Hunmorph (Trón et al., 2005).

The later ones are backed by the lexical resource morphdb.hu (Trón et al., 2006).

These have been merged into and superseded by an HFST (foma)-based system (Novák et al., 2016), which is also integrated into the e-magyar toolchain.

Papers cited above:

  • Attila Novák; Borbála Siklósi; Csaba Oravecz (2016): A New Integrated Open-source Morphological Analyzer for Hungarian In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). Portorož, pp. 1315–1322.
  • Prószéky, G., Kis, B.: A unification-based approach to morpho-syntactic parsing of agglutinative and other (highly) inflectional languages. In: Proceedings of ACL ‘99. pp. 261–268. Association for Computational Linguistics, Stroudsburg, PA, USA (1999).
  • Trón, V., Németh, L., Halácsy, P., Kornai, A., Gyepesi, G., and Varga, D. (2005) "Hunmorph: open source word analysis". Proceedings of the ACL 2005 Workshop on Software. pp. 77--85
  • Trón, V., Halácsy, P., Rebrus, P., Rung, A., Vajda, P., Simon, E.: Morphdb.hu: Hungarian lexical database and morphological grammar. In: Proceedings of LREC 2006. pp. 1670–1673 (2006)

8.4 Finite state

Two types of finite state machines are of interest:

  • Finite state automaton (FSA):
    • internal state: one from a finite set of possible states
    • input tape
    • reads the tape one symbol at a time and may move to a new state after each
  • Finite state transducer (FST):
    • FSA with an additional output tape
    • outputs a symbol after each state transition

Example

An FSA that recognizes the names of three animals.

Note how it is very similar to a trie: an ideal format for a lexicon.

Example

An FST that also prints the sounds the animal makes.
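
Since the figures do not reproduce in text, here is a minimal Python sketch of such a transducer, with states and transitions assumed for illustration; dropping the outputs turns it back into a plain FSA:

# A toy FST as a transition table: state -> {input char: (next state, output)}
TRANSITIONS = {
    0: {"c": (1, ""), "d": (4, "")},
    1: {"a": (2, ""), "o": (6, "")},
    2: {"t": (3, "MEOW")},  # cat -> MEOW
    6: {"w": (3, "MOO")},   # cow -> MOO
    4: {"o": (5, "")},
    5: {"g": (3, "WOOF")},  # dog -> WOOF
}
ACCEPTING = {3}

def transduce(word):
    """Return the output for `word`, or None if the machine rejects it."""
    state, output = 0, []
    for char in word:
        if char not in TRANSITIONS.get(state, {}):
            return None
        state, out = TRANSITIONS[state][char]
        output.append(out)
    return "".join(output) if state in ACCEPTING else None

print(transduce("cow"))  # MOO
print(transduce("cw"))   # None (rejected)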

Mathematical description

The mathematical model of a finite state automaton is a 5-tuple $(Q, \Sigma, q_0, F, \delta)$:

  • $Q$: the set of states
  • $\Sigma$: the input alphabet
  • $q_0 \in Q$: the initial state
  • $F \subseteq Q$: the accepting states. The FSA accepts an input if it ends up in one of these states after reading the whole input.
  • $\delta \subseteq Q \times \Sigma \times Q$: the state transition relation

The model of an FST is a 6-tuple, with these changes:

  • $\Gamma$: the output alphabet
  • $\delta \subseteq Q \times \Sigma \times \Gamma \times Q$: the state transition relation
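
As a concrete (toy) instance of the 5-tuple, an FSA that accepts exactly the string cat:

$Q = \{q_0, q_1, q_2, q_3\}$, $\Sigma = \{c, a, t\}$, $F = \{q_3\}$,
$\delta = \{(q_0, c, q_1), (q_1, a, q_2), (q_2, t, q_3)\}$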

Mathematical description

The set of symbol sequences (a subset of $\Sigma^*$) that a machine accepts is its language ($\mathcal{L}$). FSTs have input and output languages ($\mathcal{L}$ and $\mathcal{L'}$).

Languages accepted by FSAs are regular languages. FSTs implement regular relations or functions.

  • regular expressions also generate the regular languages
  • the two formalisms (FSA and RE) are equivalent
  • some regular expression engines, e.g. grep, are implemented as FSAs

Regular languages are memory-less, and therefore very limited:

  • the language $a^*b^*$ is regular, but $a^nb^n$ is not
  • these restrictions make FSAs/FSTs simple and fast

We won't go into further details (determinism, minimization, closure properties, etc.). The interested reader is referred to textbooks on the subject, e.g.

  • Bach Iván: Formális nyelvek, TypoTex, 2001.
  • Rónyai Lajos, Ivanyos Gábor, Szabó Réka: Algoritmusok, TypoTex, 1999.
  • Michael Sipser: Introduction to the theory of computation, Thomson Course Techn., 2006.

or the course Languages and Automata at this faculty.

Useful operations on FSTs

Inversion

  • the direction of the transduction can be changed by swapping the input and output tapes
  • hence, we use the terms upper and lower (tape, alphabet, side, ...) instead of input and output
  • the FST can be used for both analysis and generation

Useful operations on FSTs

Composition

  • FSTs can be used in cascade:
    • $Lex$: Lexicon FSA
    • $MT$: Morphotactics FST
    • $Orto$: Orthography FST
  • Since they are relations (or functions), they can be composed into a single FST: $Morph = Lex \circ MT \circ Orto$
  • Caveat: the resulting transducer might be much larger than its parts
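
What composition means for the relations the FSTs implement can be sketched in a few lines of Python (the string pairs below are hypothetical stand-ins, not actual $MT$ / $Orto$ contents):

# Composition of relations: R1 o R2 = {(x, z) : (x, y) in R1 and (y, z) in R2}
def compose(r1, r2):
    return {(x, z) for (x, y1) in r1 for (y2, z) in r2 if y1 == y2}

MT = {("run[/V][Prog]", "run+ing")}  # features -> abstract morphemes
ORTO = {("run+ing", "running")}      # abstract morphemes -> letters
print(compose(MT, ORTO))             # {('run[/V][Prog]', 'running')}

Real FST composition operates on states and arcs rather than on enumerated pairs; the relation view above only defines what the result must be.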

Useful operations on FSTs

Projection

  • The lower or upper language (and the associated FSA) of an FST
  • In the example:
    • lower projection: names of animals (cat, cow, dog)
    • upper projection: animal sounds (MEOW, MOO, WOOF)

8.5 FSTs as morphological analysers

The three components of a morphological analyser can be implemented as FSA / FSTs:

  • $Lex$: Lexicon FSA / FST
  • $MT$: Morphotactics FST
  • $Orto$: Orthography FST

These three can be used in cascade, or composed into a single morphological FST $Morph = Lex \circ MT \circ Orto$:

  • Lower side: surface form (string of characters)
  • Upper side: morpheme structure (characters (e.g. lemma) + morphological features)

Analysis / generation

Inversion allows the lexical FST $Morph$ to work in both directions.

  • Applied "up" (lower $\rightarrow$ upper): analysis
  • Applied "down" (upper $\rightarrow$ lower): generation

Both directions can be ambiguous:

  • Up: a surface form can have multiple valid analyses
    • Hungarian: várnak $\rightarrow$
      • vár[/N]nak[Dat] (for/to the castle)
      • vár[/V]nak[Prs.NDef.3Pl] (they are waiting)
  • Down: a morpheme might have multiple realizations
    • Hungarian: ül (sit) + [2Sg] imperative $\rightarrow$ ülj / üljél
    • English: fish + [Pl] $\rightarrow$ fish / fishes

Dative in Hungarian can also play a part in a possessive construct, e.g. the tower of the castle:

  • a vár tornya
  • a várnak a tornya

FSTs enumerate all candidates via backtracking.
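
A minimal Python sketch of this backtracking (machine and symbols assumed for illustration):

# Nondeterministic FST: (state, input char) -> list of (next state, output)
TRANS = {
    (0, "a"): [(1, "X"), (1, "Y")],  # "a" has two possible analyses
    (1, "b"): [(2, "Z")],
}
ACCEPT = {2}

def analyses(word, state=0, out=""):
    """Yield every output reachable by reading `word` from `state`."""
    if not word:
        if state in ACCEPT:
            yield out
        return
    for nxt, o in TRANS.get((state, word[0]), []):
        yield from analyses(word[1:], nxt, out + o)

print(list(analyses("ab")))  # ['XZ', 'YZ']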

Defining a transducer: XFST / foma

Xerox FST (XFST) is a finite state toolkit developed by Xerox Research (1997)

  • Geared towards morphological analysis
  • Several formalisms to define (lexical) FSTs
  • Proprietary and not free

Foma is a reimplementation of XFST by Mans Hulden (2010)

  • Free and open source
  • Not as fast as XFST and might be a bit buggy

Helsinki Finite State Technology (HFST, 2009)

  • Also FOSS
  • Supports weighted automata
  • Can use foma as its backend

Defining a transducer: formalisms

XFST / foma support two formalisms to define FSTs

  • Regular expressions
    • Somewhat similar to REs in Python, but can also define transducers
    • Good for defining morphotactic / orthographic rules
    • Hard to read and not a good fit for defining a lexicon
  • Lexc
    • Specifically designed for writing lexicons
    • Aimed at lexicographers

It is usual to build the lexicon in lexc and the rest with REs. However, we shall only cover lexc in this class.

XFST vs Python regular expressions

"Does this" Python XFST
Verbatim abc [a b c] or {abc}
Matching $0-1$ (abc)? {abc}^<2
Kleene star (abc)* {abc}*
Kleene plus (abc)+ {abc}+
Any character . ?
Disjunction a|b|c [A | B | C]
Containment .*a.* $a
Splice x into ab x*ax*bx* [{ab} / x]
Conjunction [A & B & C]
Complementation ~A

Also: XFST regular expressions can define transducers, which Python regular expressions obviously cannot.
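
The Python column can be checked directly with the re module (test strings chosen here for illustration):

import re

assert re.fullmatch(r"abc", "abc")          # verbatim
assert re.fullmatch(r"(abc)?", "")          # matching 0-1
assert re.fullmatch(r"(abc)*", "abcabc")    # Kleene star
assert re.fullmatch(r"(abc)+", "abc")       # Kleene plus
assert re.fullmatch(r"a|b|c", "b")          # disjunction
assert re.fullmatch(r".*a.*", "banana")     # containment
assert re.fullmatch(r"x*ax*bx*", "xaxxbx")  # x spliced into "ab"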

Lexc primer

The basic unit of lexc is the LEXICON

  • Each LEXICON corresponds to a morpheme class / grammatical function (verbs, nouns; plural, possessive, etc.)
  • Each entry in a LEXICON defines
    • a morpheme
    • its continuation class: the LEXICON(s) that generate the next morpheme

A morpheme can be

  • a character string, e.g. look. These appear on both sides of the FST.
  • a transduction that converts characters of the surface form into morphological tags, e.g. [3Sg]:s

Animal sounds example

Multichar_Symbols MEOW MOO WOOF

LEXICON Root
cat   CatSound ;
kitty CatSound ;
cowMOO:cow   # ;  ! In-place transduction; # = end-of-word
dog   DogSound ;
puppy DogSound ;

LEXICON CatSound
MEOW:0       # ;  ! MEOW on the upper tape; nothing on the lower

LEXICON DogSound
WOOF:0       # ;  ! WOOF on the upper tape; nothing on the lower
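
Assuming the above is saved as animals.lexc, it can be compiled and tested interactively in foma: read lexc animals.lexc loads it, after which up cow should print cowMOO (analysis) and down catMEOW should print cat (generation).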

Below are a few examples of tasks related to morphological analysis, and ways of solving them with lexical FSTs.

Morphological disambiguation

A morphological analyzer will assign multiple possible analyses to certain surface forms. Morphological disambiguation aims to select the correct analysis.

  • Disambiguation needs context (preceding words, the whole sentence, ...)
  • This is a sequence classification task

Spell checking

A spell checker finds words that are not spelled correctly.

  • The lower projection of a lexical transducer enumerates the existing word forms.
  • Hunspell for Hungarian was built from the same lexical resource (morphdb.hu) as the morphological analyzer Hunmorph

Lemmatization and stemming

Lemmatization is the task of finding the lemma ("dictionary form") of a word:

  • Same as morphological analysis without the morphological tags on the upper side
    • Hungarian: kutyáknak $\rightarrow$ kutya (for dogs $\rightarrow$ dog)
    • German: gegessen $\rightarrow$ essen (ate $\rightarrow$ eat)
  • With FSTs: $Morph \circ DelTags$, where $DelTags$ is an FST that deletes tags
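
In place of a $DelTags$ transducer, the tag-deletion step can be sketched with an ordinary Python regular expression over the analyzer's output (assuming the upper side carries the lemma plus bracketed tags):

import re

def lemma(analysis):
    """Delete the [Tag]s of an analysis, keeping the lemma characters."""
    return re.sub(r"\[[^\]]+\]", "", analysis)

print(lemma("kutya[/N][Pl][Dat]"))  # kutya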

Stemming is the process of reducing a word form to a "stem" using solely orthographic transformations.

  • Faster, as it does not require morphological analysis
  • Collapses inflected forms of the same word to a common string
  • Stems are not necessarily meaningful
    • English (Porter stemmer):
      • exactly $\rightarrow$ exactli
      • origins $\rightarrow$ origi (lemma would be origin)

Word and sentence segmentation

Word segmentation (tokenization) is the task of dividing a string into its component tokens (words).

  • Sounds simple, but isn't
    • English: I saw it, did you too Mr. Jones? $\rightarrow$ ['I', 'saw', 'it', ',', 'did', 'you', 'too', 'Mr.', 'Jones', '?']
    • Japanese: 進み続けてさえいれば、遅くとも関係ない。 (roughly: "as long as you keep going, it doesn't matter if you're slow") $\rightarrow$ ['進み続けて', 'さえ', 'いれば', '、', '遅く', 'とも', '関係', 'ない', '。']
  • Usually done with sequence tagging, but rule-based (FST) solutions also exist
  • A simple solution: make the (lower projection of the) analyzer circular: $Morph^+$
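
A naive regex tokenizer shows where the difficulty lies; the abbreviation list below is a toy assumption, real tokenizers need far more:

import re

ABBREVIATIONS = {"Mr.", "Mrs.", "Dr."}  # toy list

def tokenize(text):
    # Grab word-plus-optional-period chunks and single punctuation marks,
    # then split off the period unless the chunk is a known abbreviation.
    tokens = []
    for chunk in re.findall(r"\w+\.?|[^\w\s]", text):
        if chunk.endswith(".") and chunk not in ABBREVIATIONS:
            tokens.extend([chunk[:-1], "."])
        else:
            tokens.append(chunk)
    return tokens

print(tokenize("I saw it, did you too Mr. Jones?"))
# ['I', 'saw', 'it', ',', 'did', 'you', 'too', 'Mr.', 'Jones', '?']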

Sentence segmentation: same for sentences.