8. Morphology — Lexc Intuition

The best analogue to the task of modeling a language with agglutinative morphology might be the Lego blocks we all played with as children. One builds the ground floor (the root lexicon) first, then adds the storeys (inflection slots) one by one. However, it is not always clear how to do this; after all, even Lego comes with illustrated assembly instructions. I added something similar below.

How many lexicons do I need?

While the lexc format is very simple, it is easy to lose track of what's going on when the number of lexicons grows (as it will in the Hungarian adjective tasks). Luckily, with a concatenative morphology, one can sketch the interconnections of the lexicons on paper. What follows is the flowchart for task H1 (as in regular expressions, ? here marks optionality).

Just by looking at the chart, it becomes clear we will need three lexicons:

  • Adjectives
  • Plural
  • Accusative

We didn't put the lemmas under Root, because later we might want to extend our grammar with nouns, numerals, etc. as well.

Observe how different this is from the English example, where the verb endings occupy the same (only) slot: verb forms such as talkeds do not exist.
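
For comparison, a single-slot grammar of the English kind might look like the sketch below. The lexicon names Verbs and VerbEnding and the exact list of endings are assumptions made here for illustration; they are not copied from the English task.

LEXICON Root
            Verbs ;

LEXICON Verbs
talk        VerbEnding ;

LEXICON VerbEnding
0           # ;  ! talk
s           # ;  ! talks
ed          # ;  ! talked
ing         # ;  ! talking

Since every entry in VerbEnding leads straight to #, at most one ending can be attached, so talkeds cannot be generated.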

Where should I put the # (end-of-word)?

The question of how to "cut the word short", i.e. skip lexicons, comes up very early in the Hungarian exercise. For example, when we analyze the word mély, we don't need the contributions from the Plural and Accusative lexicons. To the novice lexc user, it might not be readily apparent where to put the # (end-of-word marker) in such cases. A few possible solutions are listed below:

  • add the # to all lexicons (e.g. mély # ; and mély Plural ; in Adjectives, ek # ; and ek Accusative ; in Plural, etc.)
  • jump to the last lexicon (e.g. mély Accusative ; and mély Plural ; in Adjectives, etc.)
  • only end (#) the word in the last lexicon (Case), and add "fall-through" entries (e.g. 0 Case ; in Plural) to the preceding lexicons.
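
To make the first option concrete, here is a minimal lexc sketch of it for the single stem mély. It is only one possible reading of the bullet: the lexicon names follow the flowchart, and the mély Accusative ; entry is borrowed from the second option so that the singular accusative can be built as well.

LEXICON Root
            Adjectives ;

LEXICON Adjectives
mély        # ;           ! mély: stop right after the stem
mély        Plural ;      ! continue to mélyek, mélyeket
mély        Accusative ;  ! skip Plural: mélyet

LEXICON Plural
ek          # ;           ! mélyek: stop after the plural
ek          Accusative ;  ! continue to mélyeket

LEXICON Accusative
et          # ;           ! the word always ends after the last slot

Note that in this layout every stem has to be listed three times, and every suffix lexicon but the last needs a duplicate entry that ends the word.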

Any of these solutions is acceptable, and all of them generate equivalent FSTs. However, the third method is probably the simplest and the most linguistically motivated. To see why, refer to the figure below, which presents a linguistically fully valid alternative to the previous Hungarian flowchart:

It turns out that we do not simply add the plural or accusative markers after all: we choose between the singular and plural numbers, and between the nominative, accusative, etc. cases. Every number and every case has its own marker(s); it just so happens that for the singular number and the nominative case, the marker is the empty string. Going back to the Lego example: when we are dealing with a [Sg][Nom] word, we don't just build the ground floor and leave the building like that. We build the next two storeys as well, only they are invisible. In linguistics, such "invisible storeys" are called zero morphemes, which is why lexc writes the empty string as 0.

According to the argument above, the best solution would probably be:

LEXICON Root
            Adjectives ;

LEXICON Adjectives
csendes     Number ;
egészséges  Number ;
...

LEXICON Number
0           Case ;  ! Singular
ek          Case ;  ! Plural

LEXICON Case
0           # ;  ! Nominative
et          # ;  ! Accusative

Separate lexicons for a single tag?

In task H2, you were presented with two ways of organizing transductions: putting the whole upper-lower string pair into a single lexicon entry, or keeping the common part in one lexicon and just printing the tag (in this case) to the upper tape in the next. You might wonder if one solution is better than the other. Well, the answer is: it depends. After all, if you need a 4x2 Lego block, you can either use a single one or combine two 2x2 blocks. It is a matter of taste and convenience.
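
For the plural marker, the two arrangements could look roughly as follows. This is only a sketch: the lexicon name PluralTag is hypothetical, the [Pl] tag is assumed to be declared under Multichar_Symbols, and the entry for the singular is left out for brevity. The task's own naming may differ.

! Two alternative versions of the same lexicon; a real file would contain only one.

! (a) the whole upper:lower pair in a single entry
LEXICON Number
ek[Pl]:ek   Case ;

! (b) the common part first, then the tag on the upper side only
LEXICON Number
ek          PluralTag ;

LEXICON PluralTag
[Pl]:0      Case ;

Both versions pair the upper string ek[Pl] with the lower string ek; version (b) simply spends an extra lexicon on the tag.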