The maximum score of this homework is 100+20 points. Grading is listed in this table:
| Grade | Score range |
|-------|-------------|
| 5 | 85+ |
| 4 | 70-84 |
| 3 | 55-69 |
| 2 | 40-54 |
| 1 | 0-39 |
Most exercises include tests which should pass if your solution is correct. However, successful tests do not guarantee that your solution is correct. You are free to add more tests.
In [ ]:
This exercise assumes that you have finished the lexc lab exercises. The grammar you hand in has to handle the adjectives listed in the lab exercises, as well as number (singular / plural), the nominative and accusative cases, comparative and superlative forms, and vowel harmony.
There are two sub-tasks to this exercise. You can download them from
(Note: all letters in the Neptun code must be capitalized.)
Please don't solve the exercises in the downloaded notebooks, but copy the descriptions and starter code snippets to the cells below.
In [ ]:
In [ ]:
Implement morphological analysis with a CFG grammar. Requirements:

- `nltk.CFG.fromstring()` cannot parse tags such as `[Acc]` as nonterminals. You are free to call them whatever you wish, but descriptive names are encouraged.
- A `CFGMorphParser` class with:
  - an `__init__()` that should accept no parameters (or at least provide defaults)
  - a `parse_tree()` method, which accepts a word and returns the parse tree
  - a `parse()` method, which accepts a word and returns the morphological parse in the same form as HFST. Refer to `hfst_lookup` and the tests for the format. `nltk.tree.Tree.pos()` is a good starting point.

Note that having the same format as HFST doesn't mean you have to return the exact same output: for instance, we defined _terhes_ as a genuine adjective, even though it is derived from the noun _teher_. So HFST would analyze it as `teher[/N]es[_Adjz:s/Adj][Nom]`, but you only need to return `terhes[/Adj]`. You also don't have to cover `[Nom]`, because of this bug.
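As a starting point, here is a hedged sketch of the idea, not the required solution: a toy grammar covering just two adjectives, with plain nonterminal names mapped back to the morphological tags afterwards. The nonterminal names, the `NT_TO_TAG` mapping, the `ToyMorphParser` class, and the naive segmentation helper are all my own illustrative choices.

```python
import nltk

# Tag names like [Acc] are not valid nltk nonterminals, so the grammar
# uses plain names (ADJ, COMP, ACC) and maps them back afterwards.
TOY_GRAMMAR = nltk.CFG.fromstring("""
    WORD -> STEM | STEM ACC
    STEM -> ADJ | ADJ COMP
    ADJ -> 'nagy' | 'csendes'
    COMP -> 'obb' | 'ebb'
    ACC -> 'at' | 'et'
""")
NT_TO_TAG = {'ADJ': '[/Adj]', 'COMP': '[_Comp/Adj]', 'ACC': '[Acc]'}

class ToyMorphParser:
    def __init__(self, grammar=TOY_GRAMMAR):
        self.parser = nltk.ChartParser(grammar)

    def parse(self, word):
        # A CFG parses a sequence of terminals, so try every way of
        # splitting the word into known morphs (naive segmentation).
        for segmentation in self._segment(word):
            for tree in self.parser.parse(segmentation):
                return ''.join(leaf + NT_TO_TAG[label]
                               for leaf, label in tree.pos())
        return None

    def _segment(self, word,
                 morphs=('nagy', 'csendes', 'obb', 'ebb', 'at', 'et')):
        if not word:
            yield []
            return
        for m in morphs:
            if word.startswith(m):
                for rest in self._segment(word[len(m):]):
                    yield [m] + rest
```

The real `CFGMorphParser` has to cover all the lab adjectives and all the suffixes, but the shape (grammar, tag mapping, segmentation, `Tree.pos()`) stays the same.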
In [1]:
import nltk
from nltk.tree import Tree
tree = Tree('__NOM_1', [
Tree('__NOM_2', [
Tree('__SUPL_ADJ_2', [
Tree('[Supl]', ['leg']),
Tree('[/Adj]', ['nagy']),
Tree('[_Comp/Adj]', ['obb']),
]),
Tree('[Pl]', ['ak']),
]),
Tree('[Acc]', ['at']),
])
display(tree)
# If display() doesn't work, try this
# tree.pretty_print()
In [ ]:
# Your implementation here
# Tests
parser = CFGMorphParser()
assert parser.parse('unknown_word') is None
assert parser.parse('terhes') == 'terhes[/Adj]'
assert parser.parse('csendesebbet') == 'csendes[/Adj]ebb[_Comp/Adj]et[Acc]'
assert parser.parse('legfinomabbak') == 'leg[/Supl]finom[/Adj]abb[_Comp/Adj]ak[Pl]'
# It's OK here: vowel harmony is not handled yet
assert parser.parse('legfinomabbek') == 'leg[/Supl]finom[/Adj]abb[_Comp/Adj]ek[Pl]'
Also handle vowel harmony. Write a function that traverses the tree manually (similarly to exercise 2.4 in the lab) and returns `True` or `False`, depending on whether the tree conforms to the vowel harmony rules. Use this function in `parse_tree()` (and `parse()`) to filter out invalid trees.
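One possible shape for such a checker, as a hedged sketch only: it uses `Tree.pos()` rather than a fully manual traversal, and the vowel classes and the "stem decides, suffixes follow" rule are deliberate simplifications of Hungarian vowel harmony (i/í are treated as neutral, and the invariant prefix _leg-_ is exempt).

```python
from nltk.tree import Tree

# Simplified vowel classes; treating i/í as neutral is an assumption,
# not the full Hungarian rule.
BACK, FRONT = set('aáoóuú'), set('eéöőüű')

def vowel_class(morph):
    """Harmony class of a morph, decided by its last non-neutral vowel."""
    cls = None
    for c in morph:
        if c in BACK:
            cls = 'back'
        elif c in FRONT:
            cls = 'front'
    return cls

def is_harmonic_sketch(tree):
    # Walk the (leaf, tag) pairs left to right: the adjective stem sets
    # the harmony class, and every later morph must agree with it.
    # The superlative prefix 'leg-' is invariant, hence exempt.
    current = None
    for leaf, label in tree.pos():
        cls = vowel_class(leaf)
        if label == '[Supl]' or cls is None:
            continue
        if label == '[/Adj]':
            current = cls
        elif current is not None and cls != current:
            return False
    return True
```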
In [ ]:
# Tests
assert parser.parse('legfinomabbak') == 'leg[/Supl]finom[/Adj]abb[_Comp/Adj]ak[Pl]'
assert parser.parse('legfinomabbek') is None
Parse the treebank file `en_lines-ud-train.s` in the notebook's directory. Write a generator function that reads the file and yields `nltk.tree.Tree` objects. In particular:

- the `Tree.fromstring()` function converts an s-expression into a tree
- open the file in an editor to see the formatting

Note that the file was created by parsing the LinES dependency corpus with Stanford CoreNLP, so it is not a gold standard by any means, but it will suffice for now.
In [ ]:
from nltk.tree import Tree
def parse_treebank(treebank_file):
pass
# Tests
assert sum(1 for _ in parse_treebank('en_lines-ud-train.s')) == 2613
assert isinstance(next(parse_treebank('en_lines-ud-train.s')), Tree)
In order to avoid problems further down the line, we shall only handle a subset of the trees in the treebank. We call a tree _valid_ if its root label is `'S'`. Write a function that returns `True` for valid trees and `False` for invalid ones, and filter your generator with it.
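A minimal sketch, assuming "valid" here means nothing more than the root label being `'S'`:

```python
from nltk.tree import Tree

def is_tree_valid_sketch(tree):
    # Assumption: a tree counts as valid iff its root label is 'S'.
    return isinstance(tree, Tree) and tree.label() == 'S'
```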
In [ ]:
def is_tree_valid(tree):
pass
# Tests
assert sum(map(is_tree_valid, parse_treebank('en_lines-ud-train.s'))) == 2311
Now that you have the trees, it is time to induce (train) a PCFG grammar from them! Luckily, `nltk` has a function for just that: `nltk.grammar.induce_pcfg`. Use it to acquire your PCFG grammar. You can find hints on how to use it in the grammar module.

Note: since we want to parse sentences with the PCKY algorithm, we need our grammar to be in CNF. Unfortunately, `nltk` cannot convert a grammar to CNF, so you have to ensure that the trees are in CNF before feeding them to the PCFG induction function. That way, we can be sure that our grammar will be, too. There are two `Tree` methods that ensure a tree is in CNF:

- `collapse_unary()`. Make sure you call it with `collapsePOS=True`!
- `chomsky_normal_form()`.

Do not use any smoothing.
In [ ]:
def train_grammar(trees):
pass
def is_grammar_cnf(grammar):
    for prod in grammar.productions():
        rhs = prod.rhs()
        # CNF: the RHS is either a single terminal or exactly two nonterminals
        if len(rhs) == 1 and not isinstance(rhs[0], nltk.Nonterminal):
            continue
        if len(rhs) == 2 and all(isinstance(s, nltk.Nonterminal) for s in rhs):
            continue
        return False
    return True
# Tests
grammar = train_grammar(filter(is_tree_valid, parse_treebank('en_lines-ud-train.s')))
assert len(grammar.productions()) == 15000
assert is_grammar_cnf(grammar)
Implement the PCKY algorithm. Encapsulate it in a class called PCKYParser
. Extend your CKYParser
solution from the lab so that it creates trees with probabilities (ProbabilisticTree
). The parse()
method should also accept a parameter n
, and only return the most probable n
trees (as a generator).
Some pointers:
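One such pointer, as a hedged illustration of how `ProbabilisticTree` carries probabilities (the rule probability here is made up):

```python
from nltk.tree import ProbabilisticTree

# Combining two completed subtrees under a (made-up) rule NP -> DT NN
# with probability 0.3: the parent's probability is the product of the
# rule probability and the children's probabilities.
left = ProbabilisticTree('DT', ['the'], prob=1.0)
right = ProbabilisticTree('NN', ['cat'], prob=0.5)
parent = ProbabilisticTree('NP', [left, right],
                           prob=0.3 * left.prob() * right.prob())
print(parent.prob())  # 0.15
```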
Evaluate your grammar on the test split of the treebank (en_lines-ud-dev.s
). Implement the unlabelled PARSEVAL metric. See the first answer for an example.
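A hedged sketch of unlabelled PARSEVAL: collect the `(start, end)` token spans of every constituent in both trees and score the overlap. Using a set discards duplicate spans from unary chains, which is harmless once the trees are in CNF with collapsed unaries, but is a simplification.

```python
from nltk.tree import Tree

def tree_spans(tree):
    # Unlabelled constituent spans as (start, end) token indices.
    spans = set()
    def walk(t, start):
        if not isinstance(t, Tree):
            return start + 1            # a leaf covers one token
        end = start
        for child in t:
            end = walk(child, end)
        spans.add((start, end))
        return end
    walk(tree, 0)
    return spans

def unlabelled_f1(gold_tree, pred_tree):
    gold, pred = tree_spans(gold_tree), tree_spans(pred_tree)
    if not gold or not pred:
        return 0.0
    correct = len(gold & pred)
    precision = correct / len(pred)
    recall = correct / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```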
The functionality should be encapsulated in a class called `FomaRegex`. The public API specification is as follows:

- a `pattern` member field
- a `convert` static method that does the pattern conversion. You can use pure Python or, better yet, a CFG grammar
- an `fsa_file` field. The `regex <regex> ;` command can be used to compile a regex in foma; for the rest, refer to the `compile_lexc()` function
- after the `with` block, the `fsa_file` member is set to `None`

You only need to account for the first six rows in the table comparing the two syntaxes. Additionally, you only need to cover the characters a-zA-Z0-9 (i.e. no punctuation). Note that there are two options for verbatim text in XFST: `[a b c]` or `{abc}`. You are encouraged to use the latter; should you choose to use the former, update the assert statements accordingly.

You don't have to worry about applying the regex at this point.
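A minimal, hedged sketch of the `convert` step in pure Python (no CFG): the operator mapping (`?` → `^<2`, `.` → `?`, `(`/`)` → `[`/`]`) is inferred from the assert statements below, and splitting the last character off a literal run before a postfix operator, so the operator binds to that character only, is my own design choice.

```python
def convert_sketch(pattern):
    # Postfix operators that bind to the preceding single character.
    postfix = {'?': '^<2', '*': '*', '+': '+'}
    other = {'.': '?', '(': '[', ')': ']', '|': '|'}
    parts, i = [], 0
    while i < len(pattern):
        if pattern[i].isalnum():
            j = i
            while j < len(pattern) and pattern[j].isalnum():
                j += 1
            run = pattern[i:j]
            # A postfix operator applies only to the last character of a
            # literal run, so that character gets its own {...} group.
            if j < len(pattern) and pattern[j] in postfix and len(run) > 1:
                parts.append('{%s}{%s}' % (run[:-1], run[-1]))
            else:
                parts.append('{%s}' % run)
            i = j
        else:
            parts.append(postfix.get(pattern[i]) or other.get(pattern[i], pattern[i]))
            i += 1
    return ''.join(parts)
```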
In [ ]:
import os
class FomaRegex:
pass
# Tests
assert FomaRegex.convert('ab?c*d+') == '{a}{b}^<2{c}*{d}+'
assert FomaRegex.convert('a.b') == '{a}?{b}'
assert FomaRegex.convert('a+(bc|de).*') == '{a}+[{bc}|{de}]?*'
with FomaRegex('a.b') as fr:
assert fr.pattern == '{a}?{b}', 'Invalid pattern'
assert fr.fsa_file is not None, 'FSA file is None in with'
fsa_file = fr.fsa_file
assert fr.fsa_file is None, 'FSA file is not None after with'
assert not os.path.isfile(fsa_file), 'FSA file still exists after with'
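For the context-manager part, here is a minimal sketch of the `fsa_file` lifecycle only; the actual foma compilation step is deliberately left as a stub, and the class name is my own:

```python
import os
import tempfile

class FomaRegexSketch:
    """Sketch of the fsa_file lifecycle; the foma compilation itself
    (piping 'regex <pattern> ;' into foma) is deliberately left out."""

    def __init__(self, pattern):
        self.pattern = pattern      # the real class would store convert(pattern)
        self.fsa_file = None

    def __enter__(self):
        fd, self.fsa_file = tempfile.mkstemp(suffix='.fsa')
        os.close(fd)
        # ... compile self.pattern with foma into self.fsa_file here ...
        return self

    def __exit__(self, *exc_info):
        if self.fsa_file is not None:
            os.remove(self.fsa_file)
            self.fsa_file = None
```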
In [ ]:
# Tests
with FomaRegex('a*(bc|de).+') as fr:
assert fr.match('aabcd') is True
assert fr.match('ade') is False
In [ ]:
# Tests
with FomaRegex('a') as a, FomaRegex('b') as b:
assert a.fsa_file != b.fsa_file