In [1]:
from cfg import WCFG, read_grammar_rules
from parser import cky
from symbol import make_symbol, is_nonterminal
In [2]:
# let's use our ambiguous grammar this time
G = WCFG(read_grammar_rules(open('examples/ambiguous', 'r')))
print G
In [3]:
sentence = 'a + a * a'.split()
sentence
Out[3]:
In [4]:
forest = cky(G, sentence)
In [6]:
print forest
Reminder: the goal symbol after parsing is the original start symbol annotated from 0 to n (the length of the sentence).
In [7]:
goal = make_symbol('[E]', 0, len(sentence))
goal
Out[7]:
The inside recursion accumulates the weight of all subtrees under a certain node.
I(v) =
1 if v is terminal
0 if v is nonterminal and BS(v) is empty
\sum_{e \in BS(v)} w(e) \prod_{u \in tail(e)} I(u)
otherwise
Here we are going to compute inside weights for acyclic forests, for a more general treatment see Goodman's "Semiring Parsing" paper (1999).
Inside weights can be used, for instance, to answer the question:
It can also be used to find the best derivation and to sample derivations, as we will show below.
In [14]:
def inside(forest, start): # acyclic hypergraph
"""
The inside recursion for acyclic hypergraphs.
:param forest: an acyclic WCFG
:param start: the start symbol (str)
:returns: a dictionary mapping a symbol (terminal or noterminal) to its inside weight
"""
I = dict()
def get_inside(symbol):
"""computes inside recursively"""
w = I.get(symbol, None)
if w is not None: # already computed
return w
incoming = forest.get(symbol, set())
if len(incoming) == 0: # terminals have already been handled, this must be a nonterminal dead end
# store it to avoid repeating computation in the future
I[symbol] = 0.0
return 0.0
# accumulate the inside contribution of each incoming edge
w = 0.0
for rule in incoming:
k = rule.prob
for child in rule.rhs:
k *= get_inside(child)
w += k
# store it to avoid repeating computation in the future
I[symbol] = w
return w
# handles terminals
for sym in forest.terminals:
I[sym] = 1.0
# recursively solves the inside formula from the start symbol
get_inside(start)
return I
In [15]:
I = inside(forest, goal)
The inside at the root represents the probability of the sentence
p(x) = \sum_d p(x, d)
In [16]:
I[goal]
Out[16]:
Another interesting question is
This question is very simple to answer for acyclic hypergraphs and it turns out to be a special case of the inside recursion.
N(v) =
1 if v is terminal
0 if v is nonterminal and BS(v) is empty
\sum_{e \in BS(v)} 1 * \prod_{u \in tail(e)} N(u)
otherwise
Compare the definition above with the inside recursion presented earlier. Also compare the program below with the inside computation and comment on the differences.
Can you explain this recursion?
In [18]:
def counting(forest, start): # acyclic hypergraph
N = dict()
def get_count(symbol):
w = N.get(symbol, None)
if w is not None:
return w
incoming = forest.get(symbol, set())
if len(incoming) == 0: # terminals have already been handled, this must be a nonterminal dead end
N[symbol] = w
return 0
w = 0
for rule in incoming:
k = 1
for child in rule.rhs:
k *= get_count(child)
w += k
N[symbol] = w
return w
# handles terminals
for sym in forest.terminals:
N[sym] = 1
# handles nonterminals
#for sym in forest.nonterminals:
# get_inside(sym)
get_count(start)
return N
In [19]:
N = counting(forest, goal)
The total number of derivations is associated with the value of N at the root.
In [20]:
N[goal]
Out[20]:
We might want to know which analysis score highest. That is,
d* = argmax_d p(x, d)
where d ranges over all possible derivations.
Once we have computed inside weights, this is extremely simple to solve. However, we can also define a recursion which is specific for the computation of the Viterbi derivation. Do you think you can come up with its formula? Can you implement it?
Below, an implementation based on inside weights.
In [21]:
from collections import deque
def viterbi(forest, I, start):
Q = deque([start])
d = []
while Q:
parent = Q.popleft()
incoming = forest.get(parent)
# here we will find the distribution over edges
weights = [0.0] * len(incoming)
for i, rule in enumerate(incoming):
weights[i] = rule.prob
for child in rule.rhs:
weights[i] *= I[child]
# here we select the edge that is the maximum of this distribution
weight, selected = max(zip(weights, incoming))
# we also need to queue the nonterminals in the tail of the edge
for sym in selected.rhs:
if is_nonterminal(sym):
Q.append(sym)
# and finally, add the selected edge to the derivation
d.append(selected)
return d
In [23]:
d = viterbi(forest, I, goal)
d
Out[23]:
The joint probability p(d, x) of the derivation is given by the product over its rules
In [25]:
def joint_probability(d):
prob = 1.0
for r in d:
prob *= r.prob
return prob
In [27]:
joint_probability(d)
Out[27]:
The conditional probability p(d|x) is given by
p(d|x) = p(d,x)/p(x)
and we know that
p(x) = \sum_d p(d,x)
is given by the inside at the root.
Thus, the following is the conditional probability of the best derivation:
In [28]:
joint_probability(d)/I[goal]
Out[28]:
In [30]:
# We can use this utilitary method to draw trees
from util import make_nltk_tree
In [31]:
t = make_nltk_tree(d)
print t
the following will open a pop-up window and you need to find it ;)
In [32]:
t.draw()
In [ ]:
Often we are interested in drawing random samples from the distribution p(d|x).
We can do that by sampling from the inverted CDF associated with p(d|x). The conditional independence assumption central to PCFGs make them convenient for sampling by ancestral sampling.
The code below is very similar to the Viterbi code above, however, instead of maximising at each step, we draw a random edge from the distribution defined by their inside weights.
In [33]:
from collections import deque
import random
def sample(forest, I, start):
Q = deque([start])
d = []
while Q:
parent = Q.popleft()
incoming = forest.get(parent)
# here we compute the distribution over edges
weights = [0.0] * len(incoming)
for i, rule in enumerate(incoming):
weights[i] = rule.prob
for child in rule.rhs:
weights[i] *= I[child]
# here we draw a random threshold (think of it as sampling from the inverted CDF)
th = random.uniform(0, I[parent])
# here we compute the CDF step by step and check
# for which edge e whether cdf(e) > th
total = 0.0
selected = None
back = None
for w, rule in zip(weights, incoming):
total += w
if total > th:
selected = rule
break
else:
back = rule
if selected is None: # this is to deal with corner cases due to rounding problems
selected = back
# every nonterminal child of the selected edge must be added to the queue
for sym in selected.rhs:
if is_nonterminal(sym):
Q.append(sym)
d.append(selected)
return d
Here we draw a random sample
In [36]:
sample(forest, I, goal)
Out[36]:
A nice thing to do whe we can sample is to obtain an empirical distribution. That is, we draw a number of independent samples from the underlying distribution and approximate their probabilities by their relative frequency in the sample. The law of large numbers says that our estimates converge to true probabilities as we sample more. That also holds for expectations that we might want to compute based on the underlying distribution.
Note how our estimates are pretty close to the true probabilities. This is because we are drawing indepent samples from the exact conditional distribution p(d|x).
In [40]:
# an example of how to estimate an empirical distribution out of 100 samples
from collections import defaultdict
counts = defaultdict(int)
n_samples = 1000
# here we sample a number of derivations
for i in range(n_samples):
d = tuple(sample(forest, I, goal))
counts[d] += 1 # counting how often we they get sampled
# here we sort them by frequency from most frequent to least frequent
for d, n in sorted(counts.iteritems(), key=lambda (d, n): n, reverse=True):
# here we compute the exact probability (for comparison)
prob = joint_probability(d)/I[goal]
# here we compute an empirical estimate
estimate = float(n)/n_samples
t = make_nltk_tree(d)
print '# n=%d prob=%s estimate=%s\n%s' % (n, prob, estimate, str(t))
print
Note that our grammar, even though ambiguous, expresses a strong preference (a little over 65%) for the analysis that solves the product before the sum.
In [ ]:
In [ ]: