Introduction to Python and Natural Language Technologies

Morphology

Lecture 08

25 October 2017

8.1 What is morphology?

From Ancient Greek: "the study of forms"

  • μορφή (morphé): form, shape
  • λόγος (lógos): word, study

Has a different meaning in different fields:

  • Astronomy: the shape and structure of stars and galaxies
  • Biology: form and structure of organisms
  • Geology: the formation of landscapes
  • Linguistics: composition and structure of words

What does it do?

Morphology is the study of how words are formed in terms of the minimal meaning-bearing units: morphemes.

Parse / features   space[/N]   ship[/N]   tok[Poss.2Pl]   ból[Ela]
Surface form       űr +        hajó +     tok +           ból
Translation        "from your spaceship"

Meanings of the morphemes:

  • [/N]: noun
  • [Poss.2Pl]: possessive 2nd person plural (your)
  • [Ela]: elative case (from)

The strings next to the morphemes ([/N], [Poss.2Pl] and [Ela]) are called tags or features

  • They signify the syntactic role of the morpheme
  • This is the emMorph tagset (used in the e-magyar toolchain)
  • Obsolete tagsets for Hungarian: KR, MSD

Computational morphology usually works in both directions:

  • Analysis: parses a surface form into morphological features
  • Generation: given a list of features, produces the surface form

Why do we need it?

Several tasks require it:

  • spell checking
  • lemmatization
  • information retrieval (IR): a query should match all forms of a word
    • singular and plural nouns: finite state automaton / finite state automata
    • related word forms: minimize automata / automata minimization

Morphology serves as the first level in linguistic pipelines:

  • Syntax: parsers need to know the part-of-speech (and more)
  • Semantics: a large part of the meaning is contained in the words
  • Features for tasks:
    • Named Entity Recognition (NER)
    • Anaphora resolution
    • etc.

Isn't a list enough?

One might ask whether a single list of all word forms would not be enough

  • for spell checking: a list of correct word forms
  • in the sequence tagging example, we did not use morphology
  • a usual practice in language modeling, speech recognition
  • a list of a few hundred thousand / a few million elements is cheap

Works well for English:

  • Nouns have only two forms: singular and plural
  • Verbs have at most five (-s, -ed (past), -ed (participle), -ing)

No, it isn't

The list method quickly breaks down:

  • Agglutinative languages have additive and branching inflection
    • Hungarian nouns have about 700 forms
    • Turkish verbs have 40,000 possible forms
  • Derivation is problematic even in English
    • English: -ize $\rightarrow$ -ization
    • Hungarian: kelkáposztástalaníthatatlanságotoknak (roughly "for your (pl.) inability to be rid of savoy cabbage")
  • Complex, nonconcatenative morphologies (see the next section)

8.2 Morphological phenomena

Types of morphemes

  • Free morphemes can stand alone in the sentence.
    • root / stem / lemma: the "main" morpheme (e.g. quick, brown, fox)
  • Bound morphemes only occur attached to a stem.
    • affixes:
      • megnéz (looks $\rightarrow$ takes a look)
      • smallest

However, tmesis may occur, i.e. the parts of a word can be separated:

  • megnéz $\rightarrow$ nézd meg (take a look [Subj.Def.2Sg])
  • anfangen $\rightarrow$ Wann fängt es an? (when does it start?)

Affix types

  • derivational: change part-of-speech or meaning of the word
    • English: unpreparedness; computerize
    • Hungarian: kékség (blue-ness), fellelhető (can be found), felél (live $\rightarrow$ exhaust)
  • inflectional: fill syntactic functions
    • English looked ([Pst]); seas ([Pl])
    • Hungarian: meglátta (saw (it), [Pst.Def.3Sg]), asztalon (on the table, [Supe])

Affix placement

  • prefix:
    • English: extraterrestrial, disappear
    • Hungarian: előír (prescribe), átmegy (go $\rightarrow$ cross something)
  • suffix:
    • English: listening, shortest
    • Hungarian: nagyobb (larger), sast (eagle [Acc])

Affix placement

  • circumfix:
    • Hungarian: legrövidebb (shortest)
    • German: gehabt (have $\rightarrow$ had)
  • infix:
    • Tagalog: humingi (I borrow)
    • English: abso-fucking-lutely

Other phenomena

  • Compounds: words (usually nouns) with more than one stem
    • Hungarian: űrhajó (spaceship), zöldeskék (greenish blue, turquoise)
    • German: Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz (the law concerning the delegation of duties for the supervision of cattle marking and the labelling of beef)
  • Clitics are words reduced in form that are phonologically dependent on another word
    • English: we'll, it's
    • French: c'est la vie, l'heure

Nonconcatenative morphology

Grammatical / semantic changes are reflected by modifications to the stem rather than affixation.

  • Reduplication: the stem (or part of it) is repeated
    • Japanese plural (one type): 山々, yamayama (mountains)
    • Tagalog: lulutuin (will cook)
    • Hungarian (?): husi-musi ("little meaty"), ici-pici (itsy-bitsy)
  • Templatic: a consonant template is filled with vowels.
    • Arabic: k-t-b can become
      • kitāb: book
      • kutub: books
      • kātib: writer
      • ...
  • Ablaut: a sound change in the stem that conveys grammatical information
    • English: foot $\rightarrow$ feet, run $\rightarrow$ ran
    • German: sehen $\rightarrow$ siehst (see [2Sg])

Language types

  • Isolating languages: no inflection to speak of
    • Mandarin Chinese
  • Analytic languages: some inflection, but prefer word order and adpositions
    • English, Vietnamese
  • Synthetic languages: inflected, high morpheme-per-word ratio
    • Fusional: use a single suffix that encodes several grammatical functions
      • Romance and Slavic languages
    • Agglutinative: highly concatenative, separate "slots" for different grammatical functions
      • Uralic languages, Basque, Japanese, Turkish
    • Polysynthetic: "sentence-words"
      • Indigenous languages of America, Australia; Inuit languages

Don't pay too much attention to this:

  • no clear-cut boundaries
  • most languages have phenomena from several types

8.3 Computational Morphology

The established view of morphology is that morphemes and their intra-word interactions can be formalized as rules. A morphological analyzer has the following three main components:

  • Lexicon: lists the morphemes with basic information (e.g. part of speech)
  • Morphotactics: specifies morpheme ordering and their interactions (usually on an abstract level)
    • Hungarian nouns: stem(s) + ... + (plural) + case
    • English: sheep doesn't have a plural
  • Orthographic/Phonetic rules: model how the morphemes are mapped to letters / sounds
    • English: run + ing becomes running
    • Hungarian: asztal + val (with the table) becomes asztallal
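
As an illustration, the three components can be mocked up in a few lines of Python for a handful of English nouns (a toy sketch assumed for this text, not an actual analyzer):

# 1. Lexicon: morphemes with basic information (here: part of speech)
LEXICON = {"dog": "N", "fox": "N", "sheep": "N"}

# 2. Morphotactics: which morphemes may combine
def plural_allowed(stem):
    return stem != "sheep"  # "sheep" has no overt plural

# 3. Orthographic rules: how morphemes are mapped to letters
def attach_plural(stem):
    return stem + ("es" if stem.endswith(("s", "x", "z", "ch", "sh")) else "s")

def generate(stem, plural=False):
    assert LEXICON[stem] == "N"
    if plural and plural_allowed(stem):
        return attach_plural(stem)
    return stem

print(generate("fox", plural=True))    # foxes
print(generate("sheep", plural=True))  # sheep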

These components are usually implemented with finite state methods: automata and transducers.

History

  • Pāṇini (around 500 BC): the Aṣṭādhyāyī, a grammar of 3,959 rules (sūtras), which includes the morphological description of Classical Sanskrit
  • Johnson (1972): models phonological rewrite rules as finite state transducers
  • Koskenniemi (1983): two-level rules
  • Karttunen et al. (1992): composed all components into a single transducer

Papers cited above:

  • Johnson, C. D. (1972). Formal aspects of phonological description. The Hague: Mouton.
  • Koskenniemi, K. (1983). Two-level morphology. A General Computational Model for Word-Form Recognition and Production. Department of General Linguistics. University of Helsinki, Finland.
  • Karttunen, L., Kaplan, R. M., & Zaenen, A. (1992, August). Two-level morphology with composition. In Proceedings of the 14th conference on Computational linguistics-Volume 1 (pp. 141-148). Association for Computational Linguistics.

Deep learning

Most fields of NLP have become dominated by deep learning methods. However, computational morphology still uses the methods described above.

  • Deep learning is slow. Morphology has to be fast
  • Neural networks are not good at learning hard constraints and associations
  • Words and the phenomena present in any language are arbitrary, so they have to be listed explicitly
  • Dictionaries and lexicographers have been available since antiquity

Still, machine learning methods are starting to appear:

  • Morfessor (Virpioja et al., 2013): unsupervised and semi-supervised morphological segmentation
  • SIGMORPHON shared task (Cotterell et al., 2016): supervised morphological reinflection
    • Winner: the LMU system with 96% accuracy (Kann and Schütze, 2016)

Papers cited above:

  • Cotterell, R., Kirov, C., Sylak-Glassman, J., Yarowsky, D., Eisner, J., & Hulden, M. (2016, August). The SIGMORPHON 2016 shared task—morphological reinflection. In Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology (pp. 10-22).
  • Katharina Kann and Hinrich Schütze. 2016. MED: The LMU system for the SIGMORPHON 2016 shared task on morphological reinflection. In Proceedings of the 2016 Meeting of SIGMORPHON, Berlin, Germany. Association for Computational Linguistics.
  • Sami Virpioja, Peter Smit, Stig-Arne Grönroos, and Mikko Kurimo. Morfessor 2.0: Python Implementation and Extensions for Morfessor Baseline. Aalto University publication series SCIENCE + TECHNOLOGY, 25/2013. Aalto University, Helsinki, 2013. ISBN 978-952-60-5501-5.

Hungarian morphological analysers

Historically, several systems have been created for Hungarian morphological analysis, e.g. Humor (Prószéky and Kis, 1999) and Hunmorph (Trón et al., 2005).

The later ones are backed by the lexical resource morphdb.hu (Trón et al., 2006).

These have been merged into and superseded by an HFST (foma)-based system (Novák et al., 2016), which is also integrated into the e-magyar toolchain.

Papers cited above:

  • Attila Novák; Borbála Siklósi; Csaba Oravecz (2016): A New Integrated Open-source Morphological Analyzer for Hungarian In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). Portorož, pp. 1315–1322.
  • Prószéky, G., Kis, B.: A unification-based approach to morpho-syntactic parsing of agglutinative and other (highly) inflectional languages. In: Proceedings of ACL ‘99. pp. 261–268. Association for Computational Linguistics, Stroudsburg, PA, USA (1999).
  • Trón, V., Németh, L., Halácsy, P., Kornai, A., Gyepesi, G., and Varga, D. (2005) "Hunmorph: open source word analysis". Proceedings of the ACL 2005 Workshop on Software. pp. 77--85
  • Trón, V., Halácsy, P., Rebrus, P., Rung, A., Vajda, P., Simon, E.: Morphdb.hu: Hungarian lexical database and morphological grammar. In: Proceedings of LREC 2006. pp. 1670–1673 (2006)

8.4 Finite state

Two types of finite state machines are of interest:

  • Finite state automaton (FSA):
    • internal state: one from a finite set of possible states
    • input tape
    • reads the tape one symbol at a time and may move to a new state after each
  • Finite state transducer (FST):
    • FSA with an additional output tape
    • outputs a symbol after each state transition

Example

An FSA that recognizes the names of three animals.

Note how it is very similar to a trie: an ideal format for a lexicon.

Example

An FST that also prints the sounds the animal makes.
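
Since the figures do not reproduce in text, here is a minimal Python sketch of such a transducer, with states and transitions assumed for illustration; dropping the outputs turns it back into a plain FSA:

# A toy FST as a transition table: state -> {input char: (next state, output)}
TRANSITIONS = {
    0: {"c": (1, ""), "d": (4, "")},
    1: {"a": (2, ""), "o": (6, "")},
    2: {"t": (3, "MEOW")},  # cat -> MEOW
    6: {"w": (3, "MOO")},   # cow -> MOO
    4: {"o": (5, "")},
    5: {"g": (3, "WOOF")},  # dog -> WOOF
}
ACCEPTING = {3}

def transduce(word):
    """Return the output for `word`, or None if the machine rejects it."""
    state, output = 0, []
    for char in word:
        if char not in TRANSITIONS.get(state, {}):
            return None
        state, out = TRANSITIONS[state][char]
        output.append(out)
    return "".join(output) if state in ACCEPTING else None

print(transduce("cow"))  # MOO
print(transduce("cw"))   # None (rejected)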

Mathematical description

The mathematical model of a finite state automaton is a 5-tuple $(Q, \Sigma, q_0, F, \delta)$:

  • $Q$: the set of states
  • $\Sigma$: the input alphabet
  • $q_0 \in Q$: the initial state
  • $F \subseteq Q$: the accepting states. The FSA accepts an input if it ends up in one of these states after reading the whole input.
  • $\delta \subseteq Q \times \Sigma \times Q$: the state transition relation

The model of an FST is a 6-tuple, with these changes:

  • $\Gamma$: the output alphabet
  • $\delta \subseteq Q \times \Sigma \times \Gamma \times Q$: the state transition relation
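
As a concrete (toy) instance of the 5-tuple, an FSA that accepts exactly the string cat:

$Q = \{q_0, q_1, q_2, q_3\}$, $\Sigma = \{c, a, t\}$, $F = \{q_3\}$,
$\delta = \{(q_0, c, q_1), (q_1, a, q_2), (q_2, t, q_3)\}$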

Mathematical description

The set of symbol sequences (a subset of $\Sigma^*$) that a machine accepts is its language ($\mathcal{L}$). FSTs have input and output languages ($\mathcal{L}$ and $\mathcal{L'}$).

Languages accepted by FSAs are regular languages. FSTs implement regular relations or functions.

  • regular expressions also generate the regular languages
  • the two formalisms (FSA and RE) are equivalent
  • some regular expression engines, e.g. grep, are implemented as FSAs

Regular languages are memory-less, and therefore very limited:

  • the language $a^*b^*$ is regular, but $a^nb^n$ is not
  • these restrictions make FSAs/FSTs simple and fast

We won't go into further details (determinism, minimization, closure properties, etc.). The interested reader is referred to textbooks on the subject, e.g.

  • Bach Iván: Formális nyelvek, TypoTex, 2001.
  • Rónyai Lajos, Ivanyos Gábor, Szabó Réka: Algoritmusok, TypoTex, 1999.
  • Michael Sipser: Introduction to the theory of computation, Thomson Course Techn., 2006.

or the course Languages and Automata at this faculty.

Useful operations on FSTs

Inversion

  • the direction of the transduction can be changed by swapping the input and output tapes
  • hence, we use the terms upper and lower (tape, alphabet, side, ...) instead of input and output
  • the FST can be used for both analysis and generation

Useful operations on FSTs

Composition

  • FSTs can be used in cascade:
    • $Lex$: Lexicon FSA
    • $MT$: Morphotactics FST
    • $Orto$: Orthography FST
  • Since they are relations (or functions), they can be composed into a single FST: $Morph = Lex \circ MT \circ Orto$
  • Caveat: the resulting transducer might be much larger than its parts
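
What composition means for the relations the FSTs implement can be sketched in a few lines of Python (the string pairs below are hypothetical stand-ins, not actual $MT$ / $Orto$ contents):

# Composition of relations: R1 o R2 = {(x, z) : (x, y) in R1 and (y, z) in R2}
def compose(r1, r2):
    return {(x, z) for (x, y1) in r1 for (y2, z) in r2 if y1 == y2}

MT = {("run[/V][Prog]", "run+ing")}  # features -> abstract morphemes
ORTO = {("run+ing", "running")}      # abstract morphemes -> letters
print(compose(MT, ORTO))             # {('run[/V][Prog]', 'running')}

Real FST composition operates on states and arcs rather than on enumerated pairs; the relation view above only defines what the result must be.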

Useful operations on FSTs

Projection

  • The lower or upper language (and the associated FSA) of an FST
  • In the example:
    • lower projection: names of animals (cat, cow, dog)
    • upper projection: animal sounds (MEOW, MOO, WOOF)

8.5 FSTs as morphological analysers

The three components of a morphological analyser can be implemented as FSA / FSTs:

  • $Lex$: Lexicon FSA / FST
  • $MT$: Morphotactics FST
  • $Orto$: Orthography FST

These three can be used in cascade, or composed into a single morphological FST $Morph = Lex \circ MT \circ Orto$:

  • Lower side: surface form (string of characters)
  • Upper side: morpheme structure (characters (e.g. lemma) + morphological features)

Analysis / generation

Inversion allows the lexical FST $Morph$ to work in both directions.

  • Applied "up" (lower $\rightarrow$ upper): analysis
  • Applied "down" (upper $\rightarrow$ lower): generation

Both directions can be ambiguous:

  • Up: a surface form can have multiple valid analyses
    • Hungarian: várnak $\rightarrow$
      • vár[/N]nak[Dat] (for/to the castle)
      • vár[/V]nak[Prs.NDef.3Pl] (they are waiting)
  • Down: a morpheme might have multiple realizations
    • Hungarian: ül (sit) + [2Sg] imperative $\rightarrow$ ülj / üljél
    • English: fish + [Pl] $\rightarrow$ fish / fishes

Dative in Hungarian can also play a part in a possessive construct, e.g. the tower of the castle:

  • a vár tornya
  • a várnak a tornya

FSTs enumerate all candidates via backtracking.
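
A minimal Python sketch of this backtracking (machine and symbols assumed for illustration):

# Nondeterministic FST: (state, input char) -> list of (next state, output)
TRANS = {
    (0, "a"): [(1, "X"), (1, "Y")],  # "a" has two possible analyses
    (1, "b"): [(2, "Z")],
}
ACCEPT = {2}

def analyses(word, state=0, out=""):
    """Yield every output reachable by reading `word` from `state`."""
    if not word:
        if state in ACCEPT:
            yield out
        return
    for nxt, o in TRANS.get((state, word[0]), []):
        yield from analyses(word[1:], nxt, out + o)

print(list(analyses("ab")))  # ['XZ', 'YZ']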

Defining a transducer: XFST / foma

Xerox FST (XFST) is a finite state toolkit developed by Xerox Research (1997)

  • Geared towards morphological analysis
  • Several formalisms to define (lexical) FSTs
  • Proprietary and not free

Foma is a reimplementation of XFST by Mans Hulden (2010)

  • Free and open source
  • Not as fast as XFST and might be a bit buggy

Helsinki Finite State Technology (HFST, 2009)

  • Also FOSS
  • Supports weighted automata
  • Can use foma as its backend

Defining a transducer: formalisms

XFST / foma support two formalisms to define FSTs

  • Regular expressions
    • Somewhat similar to REs in Python, but can also define transducers
    • Good for defining morphotactic / orthographic rules
    • Hard to read and not a good fit for defining a lexicon
  • Lexc
    • Specifically designed for writing lexicons
    • Aimed at lexicographers

It is usual to build the lexicon in lexc and the rest with REs. However, we shall only cover lexc in this class.

XFST vs Python regular expressions

"Does this" Python XFST
Verbatim abc [a b c] or {abc}
Matching $0-1$ (abc)? {abc}^<2
Kleene star (abc)* {abc}*
Kleene plus (abc)+ {abc}+
Any character . ?
Disjunction a|b|c [A | B | C]
Containment .*a.* $a
Splice x into ab x*ax*bx* [{ab} / x]
Conjunction [A & B & C]
Complementation ~A

Also: XFST regular expressions can define transducers, which Python regular expressions obviously cannot.
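
The Python column can be checked directly with the re module (test strings chosen here for illustration):

import re

assert re.fullmatch(r"abc", "abc")          # verbatim
assert re.fullmatch(r"(abc)?", "")          # matching 0-1
assert re.fullmatch(r"(abc)*", "abcabc")    # Kleene star
assert re.fullmatch(r"(abc)+", "abc")       # Kleene plus
assert re.fullmatch(r"a|b|c", "b")          # disjunction
assert re.fullmatch(r".*a.*", "banana")     # containment
assert re.fullmatch(r"x*ax*bx*", "xaxxbx")  # x spliced into "ab"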

Lexc primer

The basic unit of lexc is the LEXICON

  • Each LEXICON corresponds to a morpheme class / grammatical function (verbs, nouns; plural, possessive, etc.)
  • Each entry in a LEXICON defines
    • a morpheme
    • its continuation class: the LEXICON(s) that generate the next morpheme

A morpheme can be

  • a character string, e.g. look. These appear on both sides of the FST.
  • a transduction that converts characters of the surface form into morphological tags, e.g. [3Sg]:s

Animal sounds example

Multichar_Symbols MEOW MOO WOOF

LEXICON Root
cat   CatSound ;
kitty CatSound ;
cowMOO:cow   # ;  ! In-place transduction; # = end-of-word
dog   DogSound ;
puppy DogSound ;

LEXICON CatSound
MEOW:0       # ;  ! MEOW on the upper tape; nothing on the lower

LEXICON DogSound
WOOF:0       # ;  ! WOOF on the upper tape; nothing on the lower
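
Assuming the above is saved as animals.lexc, it can be compiled and tested interactively in foma: read lexc animals.lexc loads it, after which up cow should print cowMOO (analysis) and down catMEOW should print cat (generation).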

Below are a few examples of tasks related to morphological analysis, and ways of solving them with lexical FSTs.

Morphological disambiguation

A morphological analyzer will assign multiple possible analyses to certain surface forms. Morphological disambiguation aims to select the correct analysis.

  • Disambiguation needs context (preceding words, the whole sentence, ...)
  • This is a sequence classification task

Spell checking

A spell checker finds words that are not spelled correctly.

  • The lower projection of a lexical transducer enumerates the existing word forms.
  • Hunspell for Hungarian was built from the same lexical resource (morphdb.hu) as the morphological analyzer Hunmorph

Lemmatization and stemming

Lemmatization is the task of finding the lemma ("dictionary form") of a word:

  • Same as morphological analysis without the morphological tags on the upper side
    • Hungarian: kutyáknak $\rightarrow$ kutya (for dogs $\rightarrow$ dog)
    • German: gegessen $\rightarrow$ essen (ate $\rightarrow$ eat)
  • With FSTs: $Morph \circ DelTags$, where $DelTags$ is an FST that deletes tags
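
In place of a $DelTags$ transducer, the tag-deletion step can be sketched with an ordinary Python regular expression over the analyzer's output (assuming the upper side carries the lemma plus bracketed tags):

import re

def lemma(analysis):
    """Delete the [Tag]s of an analysis, keeping the lemma characters."""
    return re.sub(r"\[[^\]]+\]", "", analysis)

print(lemma("kutya[/N][Pl][Dat]"))  # kutya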

Stemming is the process of reducing a word form to a "stem" using solely orthographic transformations.

  • Faster, as it does not require morphological analysis
  • Collapses inflected forms of the same word to a common string
  • Stems are not necessarily meaningful
    • English (Porter stemmer):
      • exactly $\rightarrow$ exactli
      • origins $\rightarrow$ origi (lemma would be origin)

Word and sentence segmentation

Word segmentation (tokenization) is the task of dividing a string into its component tokens (words).

  • Sounds simple, but isn't
    • English: I saw it, did you too Mr. Jones? $\rightarrow$ ['I', 'saw', 'it', ',', 'did', 'you', 'too', 'Mr.', 'Jones', '?']
    • Japanese: 進み続けてさえいれば、遅くとも関係ない。 (roughly: "as long as you keep going, it doesn't matter if you're slow") $\rightarrow$ ['進み続けて', 'さえ', 'いれば', '、', '遅く', 'とも', '関係', 'ない', '。']
  • Usually done with sequence tagging, but rule-based (FST) solutions also exist
  • A simple solution: make the (lower projection of the) analyzer circular: $Morph^+$
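
A naive regex tokenizer shows where the difficulty lies; the abbreviation list below is a toy assumption, real tokenizers need far more:

import re

ABBREVIATIONS = {"Mr.", "Mrs.", "Dr."}  # toy list

def tokenize(text):
    # Grab word-plus-optional-period chunks and single punctuation marks,
    # then split off the period unless the chunk is a known abbreviation.
    tokens = []
    for chunk in re.findall(r"\w+\.?|[^\w\s]", text):
        if chunk.endswith(".") and chunk not in ABBREVIATIONS:
            tokens.extend([chunk[:-1], "."])
        else:
            tokens.append(chunk)
    return tokens

print(tokenize("I saw it, did you too Mr. Jones?"))
# ['I', 'saw', 'it', ',', 'did', 'you', 'too', 'Mr.', 'Jones', '?']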

Sentence segmentation: same for sentences.