From Ancient Greek: "the study of forms"
Has a different meaning in different fields:
Morphology is the study of how words are formed in terms of the minimal meaning-bearing unit: morphemes.
Parse / features | space[/N] | ship[/N] | tok[Poss.2Pl] | ból[Ela] |
Surface form | űr | +hajó | +tok | +ból |
Translation | "from your spaceship" |
Meanings of the morphemes:
[/N]: noun
[Poss.2Pl]: possessive 2nd person plural (your)
[Ela]: elative case (from)

The strings next to the morphemes ([/N], [Poss.2Pl] and [Ela]) are called tags or features.
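The morpheme-plus-tag notation can be taken apart mechanically. The sketch below assumes the analysis is written as one string, `űr[/N]hajó[/N]tok[Poss.2Pl]ból[Ela]`, assembled here from the table rows above; the function names are hypothetical and not part of any analyzer's API.

```python
import re

# One morpheme = surface characters followed by a bracketed tag, e.g. "tok[Poss.2Pl]".
MORPH_RE = re.compile(r"([^\[\]]*)\[([^\[\]]+)\]")


def split_analysis(analysis):
    """Split an analysis string into (morpheme, tag) pairs."""
    return [(m.group(1), m.group(2)) for m in MORPH_RE.finditer(analysis)]


def surface_form(analysis):
    """Recover the surface form by deleting every bracketed tag."""
    return re.sub(r"\[[^\[\]]+\]", "", analysis)


# Assumed string form of the example analysis (not quoted from a real analyzer).
ANALYSIS = "űr[/N]hajó[/N]tok[Poss.2Pl]ból[Ela]"
PAIRS = split_analysis(ANALYSIS)
SURFACE = surface_form(ANALYSIS)
```

`PAIRS` holds one `(morpheme, tag)` tuple per morpheme, and `SURFACE` is the word with the tags removed.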
Computational morphology usually works in both ways:
Several tasks require it:
Morphology serves as the first level in linguistic pipelines:
One might ask whether a single list of all word forms would be enough.
Works well for English:
The list method quickly breaks down:
However, tmesis might occur:
[Subj.Def.2Sg])
[Pst]); seas ([Pl])
[Pst.Def.3Sg]), asztalon (on the table, [Supe])
[Acc])

Grammatical / semantic changes are reflected by modifications to the stem rather than affixation.
k-t-b can become
[2Sg])

Don't pay too much attention to this:
The established view of morphology is that morphemes and their intra-word interactions can be formalized as rules. A morphological analyzer has the following three main components:
These components are usually implemented with finite state methods: automata and transducers.
Papers cited above:
Still, machine learning methods are starting to appear:
The last two (or three) are backed by the lexical resource.
Two types of finite state machines are of interest:
Note how it is very similar to a trie: an ideal format for a lexicon.
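To make the trie comparison concrete, here is a minimal sketch of a lexicon stored as a trie and used as an acceptor. The word list and the `END` marker are hypothetical; a real lexicon FSA would also be minimized so that shared suffixes are merged, which a plain trie does not do.

```python
END = "<end>"  # hypothetical marker for an accepting state; never a single character


def build_trie(words):
    """Build a trie as nested dicts; each dict is a state, each key an arc label."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True  # the state reached after the last character accepts
    return root


def trie_accepts(trie, word):
    """Follow one arc per character; accept only if we stop in an accepting state."""
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return END in node


LEXICON = build_trie(["walk", "walks", "walked", "wall"])
```

All four words share the states for the prefix `wal`, which is why the structure is compact for lexicons with many related forms.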
The mathematical model of a finite state automaton is a 5-tuple:
The model of an FST is a 6-tuple, with these changes:
The set of symbol sequences (a subset of $\Sigma^*$) a machine accepts is its language ($\mathcal{L}$). FSTs have input and output languages ($\mathcal{L}$ and $\mathcal{L'}$).
Languages accepted by FSA are regular languages. FSTs implement regular relations or functions.
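The tuple definitions translate almost directly into code. The sketch below assumes the usual textbook components (states, alphabet, start state, final states, transition function, plus an output alphabet for the FST); the notation on the slides may differ. Both machines are deterministic here for brevity, and all names are hypothetical.

```python
from typing import NamedTuple


class FSA(NamedTuple):
    states: frozenset
    alphabet: frozenset
    start: str
    finals: frozenset
    delta: dict  # (state, symbol) -> next state


class FST(NamedTuple):
    states: frozenset
    in_alphabet: frozenset
    out_alphabet: frozenset  # the extra component compared to the FSA
    start: str
    finals: frozenset
    delta: dict  # (state, input symbol) -> (next state, output symbol)


def fsa_accepts(fsa, symbols):
    """Return True if the symbol sequence is in the machine's language."""
    state = fsa.start
    for sym in symbols:
        if (state, sym) not in fsa.delta:
            return False
        state = fsa.delta[(state, sym)]
    return state in fsa.finals


def fst_transduce(fst, symbols):
    """Return the output string, or None if the input is rejected."""
    state, out = fst.start, []
    for sym in symbols:
        if (state, sym) not in fst.delta:
            return None
        state, out_sym = fst.delta[(state, sym)]
        out.append(out_sym)
    return "".join(out) if state in fst.finals else None


STATES = frozenset({"q0", "q1", "q2"})
# Accepts (ab)+ : ab, abab, ababab, ...
AB_PLUS = FSA(STATES, frozenset("ab"), "q0", frozenset({"q2"}),
              {("q0", "a"): "q1", ("q1", "b"): "q2", ("q2", "a"): "q1"})
# Same topology, but every arc also writes an output symbol (a -> x, b -> y).
AB_TO_XY = FST(STATES, frozenset("ab"), frozenset("xy"), "q0", frozenset({"q2"}),
               {("q0", "a"): ("q1", "x"), ("q1", "b"): ("q2", "y"),
                ("q2", "a"): ("q1", "x")})
```

The input language of `AB_TO_XY` is the language of `AB_PLUS`; its output language is the same set with `a` and `b` replaced by `x` and `y`.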
re) are equivalent
grep are implemented as FSA

Regular languages are memory-less, and therefore very limited:
We won't go into further details (determinism, minimization, closure properties, etc.). The interested reader is referred to textbooks on the subject, e.g.
or the course Languages and Automata in this faculty.
Inversion
Composition
Projection
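The three operations are easiest to see on a finite relation written out as a set of (upper, lower) string pairs. This is only a sketch: real toolkits apply the same operations to the automata themselves, which can encode infinite relations. The toy relations `LEX` and `ORTO` and their intermediate `^` notation are hypothetical.

```python
def invert(relation):
    """Inversion: swap the upper and lower side of every pair."""
    return {(lower, upper) for upper, lower in relation}


def compose(first, second):
    """Composition: feed the lower side of `first` into the upper side of `second`."""
    return {(a, c) for a, b in first for b2, c in second if b == b2}


def project(relation, side="upper"):
    """Projection: keep only one side, turning the relation into a language."""
    index = 0 if side == "upper" else 1
    return {pair[index] for pair in relation}


# Hypothetical toy relations: analysis -> intermediate form -> surface form.
LEX = {("cat+PL", "cat^s"), ("fox+PL", "fox^s")}
ORTO = {("cat^s", "cats"), ("fox^s", "foxes")}

MORPH = compose(LEX, ORTO)  # analysis -> surface (generation)
ANALYZE = invert(MORPH)     # surface -> analysis
SURFACE_WORDS = project(MORPH, "lower")
```

Composing first and inverting afterwards gives a single mapping that can be read in either direction.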
The three components of a morphological analyser can be implemented as FSA / FSTs:
These three can be used in cascade, or composed into a single morphological FST $Morph = Lex \circ MT \circ Orto$:
Inversion allows the lexical FST $Morph$ to work in both directions.
Both directions can be ambiguous:
vár[/N]nak[Dat] (for/to the castle)
vár[/V]nak[Prs.NDef.3Pl] (they are waiting)
[2Sg] imperative $\rightarrow$ ülj / üljél
[Pl] $\rightarrow$ fish / fishes

Dative in Hungarian can also play a part in a possessive construct, e.g. the tower of the castle:
FSTs enumerate all candidates via backtracking.
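A minimal sketch of such a backtracking lookup, using the two analyses of várnak as data. The arc table is hypothetical and uses multi-character arc labels for brevity; a real FST has one symbol per arc and may also contain epsilon arcs, which this sketch does not handle.

```python
# state -> list of (input string, output string, next state); hypothetical toy FST
ARCS = {
    0: [("vár", "vár[/N]", 1), ("vár", "vár[/V]", 2)],
    1: [("nak", "nak[Dat]", 3)],
    2: [("nak", "nak[Prs.NDef.3Pl]", 3)],
}
FINALS = {3}


def analyses(surface, state=0, pos=0, out=""):
    """Depth-first search over all paths whose input side spells `surface`."""
    results = []
    if pos == len(surface) and state in FINALS:
        results.append(out)
    for inp, outp, nxt in ARCS.get(state, []):
        # Try every arc that matches the remaining input. Returning from the
        # recursive call and moving on to the next arc is the backtracking step.
        if surface.startswith(inp, pos):
            results.extend(analyses(surface, nxt, pos + len(inp), out + outp))
    return results


CANDIDATES = analyses("várnak")
```

Every arc consumes at least one character, so the search always terminates; `CANDIDATES` contains one string per successful path.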
Xerox FST (XFST) is a finite state toolkit developed by Xerox Research, 1997
Foma is a reimplementation of XFST by Mans Hulden, 2010
Helsinki Finite State Technology (HFST), 2009
XFST / foma support two formalisms to define FSTs
It is usual to build the lexicon in lexc and the rest with REs. However, we shall only cover lexc in this class.
"Does this" | Python | XFST |
---|---|---|
Verbatim | abc | [a b c] or {abc} |
Matching $0-1$ | (abc)? | {abc}^<2 |
Kleene star | (abc)* | {abc}* |
Kleene plus | (abc)+ | {abc}+ |
Any character | . | ? |
Disjunction | a\|b\|c | [A \| B \| C] |
Containment | .*a.* | $a |
Splice x into ab | x*ax*bx* | [{ab} / x] |
Conjunction | | [A & B & C] |
Complementation | | ~A |
Also: regular expressions for transducers, which are obviously missing in Python.
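The Python column of the table can be checked with the standard `re` module. The sketch below uses `re.fullmatch`, so each pattern has to describe the whole string, which is how an FSA accepts or rejects its input. The dictionary keys and the `matches` helper are hypothetical names.

```python
import re

# Patterns copied from the Python column; conjunction and complementation
# have no direct counterpart in Python's syntax and are left out.
PATTERNS = {
    "verbatim": r"abc",
    "zero_or_one": r"(abc)?",
    "kleene_star": r"(abc)*",
    "kleene_plus": r"(abc)+",
    "any_character": r".",
    "disjunction": r"a|b|c",
    "containment": r".*a.*",
    "splice_x_into_ab": r"x*ax*bx*",
}


def matches(name, text):
    """Return True if the whole of `text` is in the language of the named pattern."""
    return re.fullmatch(PATTERNS[name], text) is not None
```

For example, `matches("splice_x_into_ab", "xaxxbx")` holds because deleting every `x` leaves `ab`, while `matches("kleene_plus", "")` does not.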
The basic unit of lexc is the LEXICON:
A LEXICON corresponds to a morpheme class / grammatical function (verbs, nouns; plural, possessive, etc.)
A LEXICON defines the continuation LEXICON(s) that generate the next morpheme

A morpheme can be
[3Sg]:s
! MEOW, MOO and WOOF are each treated as a single symbol
Multichar_Symbols MEOW MOO WOOF

LEXICON Root
cat         CatSound ;
kitty       CatSound ;
cowMOO:cow  # ;         ! In-place transduction; # = end-of-word
dog         DogSound ;
puppy       DogSound ;

LEXICON CatSound
MEOW:0      # ;         ! MEOW on the upper tape; nothing on the lower

LEXICON DogSound
WOOF:0      # ;         ! WOOF on the upper tape; nothing on the lower
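To see what this lexicon encodes, the sketch below re-creates it as a Python dictionary and enumerates every (upper, lower) pair by following the continuation classes. It is a simulation for illustration only, not how foma or XFST compile lexc; multi-character symbols such as MEOW are plain substrings here.

```python
# LEXICON name -> list of (upper, lower, continuation); "#" ends the word.
LEXICONS = {
    "Root": [
        ("cat", "cat", "CatSound"),
        ("kitty", "kitty", "CatSound"),
        ("cowMOO", "cow", "#"),
        ("dog", "dog", "DogSound"),
        ("puppy", "puppy", "DogSound"),
    ],
    "CatSound": [("MEOW", "", "#")],  # MEOW:0 -> nothing on the lower tape
    "DogSound": [("WOOF", "", "#")],  # WOOF:0 -> nothing on the lower tape
}


def generate(lexicon="Root", upper="", lower=""):
    """Yield every (upper, lower) string pair reachable from `lexicon`."""
    if lexicon == "#":
        yield upper, lower
        return
    for up, lo, continuation in LEXICONS[lexicon]:
        yield from generate(continuation, upper + up, lower + lo)


PAIRS = list(generate())
# Lookup in the analysis direction: lower (surface) string -> upper string.
ANALYSES = {lower: upper for upper, lower in PAIRS}
```

`PAIRS` lists the five paths through the lexicon, and `ANALYSES` maps each surface word to its upper-side string.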
A few examples for tasks related to morphological analysis, and ways of solving them with lexical FSTs.
A morphological analyzer will assign multiple possible analyses to certain surface forms. Morphological disambiguation aims to select the correct analysis.
A spell checker finds words that are not spelled correctly.
Lemmatization is the task of finding the lemma ("dictionary form") of a word:
Stemming is the process of reducing a word form to a "stem" using solely orthographic transformations.
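A minimal sketch of a suffix-stripping stemmer, to contrast with lemmatization. The suffix list and the minimum stem length are hypothetical toy choices, far simpler than an established stemmer such as Porter's.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical toy list, tried in this order
MIN_STEM_LENGTH = 3                  # assumption: never strip below three characters


def stem(word):
    """Strip the first matching suffix; no lexicon or grammar is consulted."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word
```

`stem("walked")` and `stem("walking")` both give `walk`, but `stem("making")` gives `mak`, which is not a word: a stem only has to group related forms together, whereas a lemma has to be the dictionary form (`make`).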
Word segmentation (tokenization) is the task of dividing a string into its component tokens (words).
['I', 'saw', 'it', ',', 'did', 'you', 'too', 'Mr.', 'Jones', '?']
['進み続けて', 'さえ', 'いれば', '、', '遅く', 'とも', '関係', 'ない', '。']
Japanese quote from https://www.linguajunkie.com/japanese/motivational-quotes-inspirational.
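For English, a regular expression already gets close to the token list above. The sketch below is one possible rule set, not the tokenizer that produced that output: the abbreviation list is hypothetical, and the input sentence in the usage note is reconstructed from the tokens.

```python
import re

ABBREVIATIONS = ("Mr.", "Mrs.", "Dr.")  # hypothetical list; keeps the period attached

# Order matters: abbreviations first, then runs of word characters,
# then any single character that is neither a word character nor whitespace.
TOKEN_RE = re.compile(
    "|".join(re.escape(abbr) for abbr in ABBREVIATIONS) + r"|\w+|[^\w\s]"
)


def tokenize(text):
    """Split text into word, abbreviation and punctuation tokens."""
    return TOKEN_RE.findall(text)
```

Calling `tokenize("I saw it, did you too Mr. Jones?")` yields the ten English tokens listed above. The same approach does not carry over to Japanese, where words are not separated by spaces and segmentation needs a lexicon or a trained model.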
Sentence segmentation is the same task at the level of sentences: dividing a text into its component sentences.