The maximum score of this homework is 100 points. Grading is listed in this table:
| Grade | Score range |
|---|---|
| 5 | 85+ |
| 4 | 70-84 |
| 3 | 55-69 |
| 2 | 40-54 |
| 1 | 0-39 |
Most exercises include tests which should pass if your solution is correct. However, passing tests do not guarantee that your solution is correct. You are free to add more tests.
Deadline: Monday, November 20th, 2017, 23:59
Implement the Viterbi algorithm for $k=3$.
The input can be found here. You can use the 1, 10 or 100 million word corpus; it is advised to use the 1 million word corpus while you are developing, and you can try the larger ones afterwards.
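The training cells below assume the corpus is a plain-text file with one word<TAB>POS-tag pair per line (this format is inferred from the `split("\t")` in the training code). A quick way to check:
In [ ]:
# Peek at the first few lines of the corpus to confirm the
# word<TAB>POS-tag-per-line format assumed by the training code below.
with open("umbc.casesensitive.word_pos.1M.txt", "r", encoding="utf8", errors="ignore") as f:
    for _ in range(5):
        print(repr(next(f)))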
Write a class called Viterbi
with the following methods:
__init__
: has no arguments, except self
train
: one argument (aside self
), an iterable object which generates 2-tuples of strings: [(word1, pos1), (word2, pos2), ...]; returns None
predict
: one argument (aside self
), an iterable object: a list of words; returns the list of predicted POS tags (as in the tests below). Don't use global variables!
In [ ]:
import numpy
In [ ]:
class Viterbi(object):
    # TODO: implement __init__, train and predict here
    pass
with open("umbc.casesensitive.word_pos.1M.txt", "r", encoding="utf8", errors="ignore") as f:
generator1 = (line.strip().split("\t") for line in f)
generator2 = (line for line in generator1 if len(line) == 2)
viterbi = Viterbi()
viterbi.train(generator2)
In [ ]:
assert(viterbi.predict("You talk the talk .".split()) == ['PRP', 'VB', 'DT', 'NN', '.'])
print(viterbi.predict("The dog .".split()))
print(viterbi.predict("The dog runs .".split()))
print(viterbi.predict("The dog runs slowly .".split()))
print(viterbi.predict("The dog 's run was slow .".split()))
Modify the Viterbi class: use a logarithmic scale for probabilities.
In the Viterbi table, instead of $$ \pi(k,u,v) = \max_{w\in L} \pi(k-1, w, u)\cdot \mathbb{P}(v \ | \ w,u)\cdot \mathbb{P}(w_k \ | \ v) $$ use $$ \hat\pi(k,u,v) = \max_{w\in L} \hat\pi(k-1,w,u) + \log\mathbb{P}(v \ | \ w,u) + \log\mathbb{P}(w_k \ | \ v) $$
Note that the minimum probability is $0$, but the minimum logarithm is $-\infty$. Both numpy and Python floats can handle minus infinity.
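For example:
In [ ]:
import numpy as np

print(float("-inf") + 1.0)       # -inf: arithmetic with -inf just works
print(max(float("-inf"), -3.5))  # -3.5: max() treats -inf as the smallest value
with np.errstate(divide="ignore"):
    print(np.log(0.0))           # -inf (numpy warns about log(0) unless silenced)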
Precalculate the log-probabilities in the initializer, not during the dynamic programming.
This should not affect the result, just the numbers in the Viterbi table.
Name the log-scaled implementation ViterbiLog
; it can inherit from Viterbi
or it can be a whole new class.
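A minimal sketch of the inheritance approach; transition and emission are hypothetical attribute names for the probability tables that Viterbi.train is assumed to build as numpy arrays, and the logs are taken once at the end of train rather than during the dynamic programming:
In [ ]:
import numpy as np

class ViterbiLog(Viterbi):
    # A sketch only: `transition` and `emission` are hypothetical names
    # for the probability tables that Viterbi.train is assumed to build.
    def train(self, corpus):
        super().train(corpus)
        # Precalculate the log-probabilities once; log(0) becomes -inf,
        # which the max in the recursion handles correctly.
        with np.errstate(divide="ignore"):
            self.log_transition = np.log(self.transition)
            self.log_emission = np.log(self.emission)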
In [ ]:
with open("umbc.casesensitive.word_pos.1M.txt", "r", encoding="utf8", errors="ignore") as f:
generator1 = (line.strip().split("\t") for line in f)
generator2 = (line for line in generator1 if len(line) == 2)
viterbi_log = ViterbiLog()
viterbi_log.train(generator2)
In [ ]:
assert(viterbi.predict("The dog runs slowly .".split()) == viterbi_log.predict("The dog runs slowly .".split()))
Modify the Viterbi class: use sparse storage for the transition probabilities, not a 3-dimensional array.
Use a dict to store the frequencies of the 2- and 3-tuples of labels.
For example, if you saw "adjective noun" 10 times and "adjective noun determiner" 5 times, then store the following:
In [ ]:
{('JJ', 'NN'): 10, ('JJ', 'NN', 'DT'): 5}
In this example $\mathbb{P}(DT \ | \ JJ, NN) = \#(JJ, NN, DT) \, / \, \#(JJ, NN) = 5/10 = 0.5$.
Note that whenever $\#(JJ, NN) = 0$ or $\#(JJ, NN, DT) = 0$, then $\mathbb{P}(DT \ | \ JJ, NN) = 0$.
Implement this in a new class ViterbiSparse
; it can inherit from the original one or it can be a new class.
Try to find a sparse representation (with dicts) which makes the inner for loop shorter. Note that you don't have to take the maximum over all $w\in L$ if you already know that some of the transition probabilities are zero; see the sketch below.
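A minimal sketch of the counting scheme and of the shortcut; counts, predecessors, count_tags and transition_prob are hypothetical names for illustration:
In [ ]:
from collections import defaultdict

counts = defaultdict(int)        # (u, v) and (w, u, v) tag tuples -> frequency
predecessors = defaultdict(set)  # (u, v) -> {w : #(w, u, v) > 0}

def count_tags(tags):
    """Update the sparse counts with one sentence's POS-tag sequence."""
    for i in range(len(tags) - 1):
        counts[(tags[i], tags[i + 1])] += 1
    for i in range(len(tags) - 2):
        counts[(tags[i], tags[i + 1], tags[i + 2])] += 1
        predecessors[(tags[i + 1], tags[i + 2])].add(tags[i])

def transition_prob(w, u, v):
    """P(v | w, u) = #(w, u, v) / #(w, u); zero if either count is zero."""
    bigram = counts.get((w, u), 0)
    return counts.get((w, u, v), 0) / bigram if bigram else 0.0

In the recursion, iterating over predecessors[(u, v)] instead of the whole label set $L$ skips exactly those $w$ for which $\mathbb{P}(v \ | \ w, u) = 0$.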
In [ ]: