Inference

In this lab we will discuss some inference problems over the parse forest produced by CKY: computing inside weights, counting derivations, finding the best (Viterbi) derivation, and sampling derivations.


In [1]:
from cfg import WCFG, read_grammar_rules
from parser import cky
from symbol import make_symbol, is_nonterminal

In [2]:
# let's use our ambiguous grammar this time
G = WCFG(read_grammar_rules(open('examples/ambiguous', 'r')))
print G


[T] -> [P] (0.5)
[T] -> [T] * [P] (0.4)
[T] -> [T] + [P] (0.1)
[E] -> [T] (0.5)
[E] -> [E] + [T] (0.45)
[E] -> [E] * [T] (0.05)
[P] -> a (1.0)

In [3]:
sentence = 'a + a * a'.split()
sentence


Out[3]:
['a', '+', 'a', '*', 'a']

In [4]:
forest = cky(G, sentence)

In [6]:
print forest


[E:2-3] -> [T:2-3] (0.5)
[E:4-5] -> [T:4-5] (0.5)
[T:0-3] -> [T:0-1] + [P:2-3] (0.1)
[T:0-5] -> [T:0-3] * [P:4-5] (0.4)
[E:0-5] -> [T:0-5] (0.5)
[E:0-5] -> [E:0-3] * [T:4-5] (0.05)
[E:0-5] -> [E:0-1] + [T:2-5] (0.45)
[T:0-1] -> [P:0-1] (0.5)
[E:0-1] -> [T:0-1] (0.5)
[E:0-3] -> [E:0-1] + [T:2-3] (0.45)
[E:0-3] -> [T:0-3] (0.5)
[E:2-5] -> [E:2-3] * [T:4-5] (0.05)
[E:2-5] -> [T:2-5] (0.5)
[P:0-1] -> a (1.0)
[T:4-5] -> [P:4-5] (0.5)
[P:2-3] -> a (1.0)
[T:2-5] -> [T:2-3] * [P:4-5] (0.4)
[T:2-3] -> [P:2-3] (0.5)
[P:4-5] -> a (1.0)

Reminder: after parsing, the goal symbol is the original start symbol annotated with the span from 0 to n, where n is the length of the sentence.


In [7]:
goal = make_symbol('[E]', 0, len(sentence))
goal


Out[7]:
'[E:0-5]'

Inside weights

The inside recursion accumulates the weight of all subtrees under a certain node.

    I(v) = 
        1                           if v is terminal
        0                           if v is nonterminal and BS(v) is empty
        \sum_{e \in BS(v)} w(e) \prod_{u \in tail(e)} I(u)   
                                    otherwise

Here we are going to compute inside weights for acyclic forests; for a more general treatment see Goodman's "Semiring Parsing" paper (1999).

Inside weights can be used, for instance, to answer the question:

  • what is the probability of sentence x?

They can also be used to find the best derivation and to sample derivations, as we will show below.


In [14]:
def inside(forest, start):  # acyclic hypergraph
    """
    The inside recursion for acyclic hypergraphs.
    
    :param forest: an acyclic WCFG
    :param start: the start symbol (str)
    :returns: a dictionary mapping a symbol (terminal or nonterminal) to its inside weight
    """
    I = dict()
    
    def get_inside(symbol):
        """computes inside recursively"""
        w = I.get(symbol, None)
        if w is not None:  # already computed
            return w
        incoming = forest.get(symbol, set())
        if len(incoming) == 0:  # terminals have already been handled, this must be a nonterminal dead end
            # store it to avoid repeating computation in the future
            I[symbol] = 0.0  
            return 0.0
        # accumulate the inside contribution of each incoming edge
        w = 0.0
        for rule in incoming:
            k = rule.prob
            for child in rule.rhs:
                k *= get_inside(child)
            w += k
        # store it to avoid repeating computation in the future
        I[symbol] = w
        return w
    
    # handles terminals
    for sym in forest.terminals:
        I[sym] = 1.0
    # recursively solves the inside formula from the start symbol
    get_inside(start)
        
    return I

In [15]:
I = inside(forest, goal)

The inside weight at the root is the probability of the sentence:

p(x) = \sum_d p(x, d)

In [16]:
I[goal]


Out[16]:
0.034531250000000006
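
To sanity-check this number, we can enumerate all derivations of this (small, acyclic) forest by brute force and sum their joint probabilities; the total should match the inside weight at the root. The sketch below is not part of the lab code: enumerate_derivations is a hypothetical helper that assumes the same forest interface used above (forest.get returning the incoming rules of a symbol, and rules exposing rhs and prob).


In [ ]:
def enumerate_derivations(forest, symbol):
    """
    Brute-force enumeration of all derivations rooted at `symbol`
    (hypothetical helper; only feasible for small acyclic forests).
    Returns a list of derivations, each a tuple of rules.
    """
    if not is_nonterminal(symbol):  # terminals do not contribute rules to a derivation
        return [()]
    derivations = []
    for rule in forest.get(symbol, set()):
        # start with the rule itself and extend it with every way of expanding its children
        partial = [(rule,)]
        for child in rule.rhs:
            partial = [d + sub for d in partial for sub in enumerate_derivations(forest, child)]
        derivations.extend(partial)
    return derivations

# summing the joint probabilities of all derivations should recover I[goal]
total = 0.0
for d in enumerate_derivations(forest, goal):
    p = 1.0
    for r in d:
        p *= r.prob
    total += p
print total, I[goal]  # the two numbers should agree (up to rounding)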

Counting derivations

Another interesting question is

  • how many analyses of a given sentence do we have?

This question is very simple to answer for acyclic hypergraphs, and it turns out to be a special case of the inside recursion.

    N(v) = 
        1                           if v is terminal
        0                           if v is nonterminal and BS(v) is empty
        \sum_{e \in BS(v)} 1 * \prod_{u \in tail(e)} N(u)   
                                    otherwise

Compare the definition above with the inside recursion presented earlier. Also compare the program below with the inside computation and comment on the differences.

Can you explain this recursion?


In [18]:
def counting(forest, start):  # acyclic hypergraph
    """
    The counting recursion for acyclic hypergraphs: this is the inside
    recursion with every rule weight replaced by 1.
    
    :param forest: an acyclic WCFG
    :param start: the start symbol (str)
    :returns: a dictionary mapping a symbol to the number of derivations under it
    """
    N = dict()
    
    def get_count(symbol):
        """computes the count recursively"""
        w = N.get(symbol, None)
        if w is not None:  # already computed
            return w
        incoming = forest.get(symbol, set())
        if len(incoming) == 0:  # terminals have already been handled, this must be a nonterminal dead end
            # store it to avoid repeating computation in the future
            N[symbol] = 0
            return 0
        # accumulate the contribution of each incoming edge
        w = 0
        for rule in incoming:
            k = 1
            for child in rule.rhs:
                k *= get_count(child)
            w += k
        # store it to avoid repeating computation in the future
        N[symbol] = w
        return w
    
    # handles terminals
    for sym in forest.terminals:
        N[sym] = 1
    # recursively solves the counting formula from the start symbol
    get_count(start)
        
    return N

In [19]:
N = counting(forest, goal)

The total number of derivations is the value of N at the root.


In [20]:
N[goal]


Out[20]:
4
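
If you defined the hypothetical enumerate_derivations helper from the sketch in the inside section, its result gives the same answer:


In [ ]:
# assumes the enumerate_derivations sketch from the inside section was run
len(enumerate_derivations(forest, goal))  # expected to equal N[goal], i.e. 4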

Viterbi (best derivation)

We might want to know which analysis scores highest. That is,

    d* = argmax_d p(x, d)

where d ranges over all possible derivations.

Once we have computed inside weights, this is extremely simple to solve. However, we can also define a recursion that is specific to the computation of the Viterbi derivation. Do you think you can come up with its formula? Can you implement it?

Below is an implementation based on inside weights.


In [21]:
from collections import deque
def viterbi(forest, I, start):
    Q = deque([start])
    d = []
    while Q:
        parent = Q.popleft()
        incoming = forest.get(parent)
        # here we compute the (unnormalised) weight of each incoming edge
        weights = [0.0] * len(incoming)
        for i, rule in enumerate(incoming):
            weights[i] = rule.prob
            for child in rule.rhs:
                weights[i] *= I[child]
        # here we select the edge with the largest weight
        weight, selected = max(zip(weights, incoming))
        # we also need to queue the nonterminals in the tail of the edge
        for sym in selected.rhs:
            if is_nonterminal(sym):
                Q.append(sym)
        # and finally, add the selected edge to the derivation
        d.append(selected)
    return d

In [23]:
d = viterbi(forest, I, goal)
d


Out[23]:
[[E:0-5] -> [E:0-1] + [T:2-5] (0.45),
 [E:0-1] -> [T:0-1] (0.5),
 [T:2-5] -> [T:2-3] * [P:4-5] (0.4),
 [T:0-1] -> [P:0-1] (0.5),
 [T:2-3] -> [P:2-3] (0.5),
 [P:4-5] -> a (1.0),
 [P:0-1] -> a (1.0),
 [P:2-3] -> a (1.0)]

The joint probability p(d, x) of a derivation is given by the product of the probabilities of its rules:


In [25]:
def joint_probability(d):
    prob = 1.0
    for r in d:
        prob *= r.prob
    return prob

In [27]:
joint_probability(d)


Out[27]:
0.022500000000000003
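
As an aside, the Viterbi-specific recursion asked about earlier simply replaces the sum in the inside recursion with a max (the max-times, or Viterbi, semiring). The sketch below is one possible answer, not part of the lab code; it reuses the structure of inside above, and its value at the root should equal the joint probability of the best derivation computed in the previous cell.


In [ ]:
def viterbi_inside(forest, start):
    """
    Max-times variant of the inside recursion (a sketch): V(v) is the weight
    of the best derivation under v, rather than the sum over all of them.
    """
    V = dict()
    
    def get_viterbi(symbol):
        w = V.get(symbol, None)
        if w is not None:  # already computed
            return w
        incoming = forest.get(symbol, set())
        if len(incoming) == 0:  # nonterminal dead end
            V[symbol] = 0.0
            return 0.0
        # maximise (rather than sum) over the incoming edges
        w = 0.0
        for rule in incoming:
            k = rule.prob
            for child in rule.rhs:
                k *= get_viterbi(child)
            if k > w:
                w = k
        V[symbol] = w
        return w
    
    # terminals have weight 1, then solve recursively from the start symbol
    for sym in forest.terminals:
        V[sym] = 1.0
    get_viterbi(start)
    return V

V = viterbi_inside(forest, goal)
print V[goal]  # should match joint_probability(d) for the Viterbi derivation above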

The conditional probability p(d|x) is given by

p(d|x) = p(d,x)/p(x)

and we know that

p(x) = \sum_d p(d,x)

is given by the inside at the root.

Thus, the following is the conditional probability of the best derivation:


In [28]:
joint_probability(d)/I[goal]


Out[28]:
0.6515837104072397
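
As another check, the conditional probabilities p(d|x) of all derivations should sum to 1. Using the hypothetical enumerate_derivations sketch from earlier:


In [ ]:
# assumes the enumerate_derivations sketch from the inside section was run
sum(joint_probability(dd) for dd in enumerate_derivations(forest, goal)) / I[goal]  # expected: 1.0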

In [30]:
# We can use this utility function to draw trees
from util import make_nltk_tree

In [31]:
t = make_nltk_tree(d)
print t


(E:0-5
  (E:0-1 (T:0-1 (P:0-1 a)))
  +
  (T:2-5 (T:2-3 (P:2-3 a)) * (P:4-5 a)))

The following will open a pop-up window (you may need to go looking for it ;))


In [32]:
t.draw()

Sampling

Often we are interested in drawing random samples from the distribution p(d|x).

We can do that by sampling from the inverted CDF associated with p(d|x). The conditional independence assumptions central to PCFGs make ancestral sampling particularly convenient.

The code below is very similar to the Viterbi code above; however, instead of maximising at each step, we draw a random edge from the distribution defined by the edges' inside weights.


In [33]:
from collections import deque
import random
def sample(forest, I, start):
    Q = deque([start])
    d = []
    while Q:
        parent = Q.popleft()
        incoming = forest.get(parent)
        # here we compute the (unnormalised) weight of each incoming edge
        weights = [0.0] * len(incoming)
        for i, rule in enumerate(incoming):
            weights[i] = rule.prob
            for child in rule.rhs:
                weights[i] *= I[child]
        # here we draw a random threshold (think of it as sampling from the inverted CDF)
        th = random.uniform(0, I[parent])
        # here we accumulate the (unnormalised) CDF step by step and
        # select the first edge e for which cdf(e) > th
        total = 0.0
        selected = None
        back = None
        for w, rule in zip(weights, incoming):
            total += w
            if total > th:
                selected = rule
                break
            else:
                back = rule
        if selected is None:  # this is to deal with corner cases due to rounding problems
            selected = back
        # every nonterminal child of the selected edge must be added to the queue
        for sym in selected.rhs:
            if is_nonterminal(sym):
                Q.append(sym)
        d.append(selected)
    return d

Here we draw a random sample


In [36]:
sample(forest, I, goal)


Out[36]:
[[E:0-5] -> [T:0-5] (0.5),
 [T:0-5] -> [T:0-3] * [P:4-5] (0.4),
 [T:0-3] -> [T:0-1] + [P:2-3] (0.1),
 [P:4-5] -> a (1.0),
 [T:0-1] -> [P:0-1] (0.5),
 [P:2-3] -> a (1.0),
 [P:0-1] -> a (1.0)]

Empirical distribution

A nice thing to do when we can sample is to obtain an empirical distribution. That is, we draw a number of independent samples from the underlying distribution and approximate the probability of each derivation by its relative frequency in the sample. The law of large numbers tells us that these estimates converge to the true probabilities as we draw more samples. The same holds for expectations that we might want to compute under the underlying distribution.

Note how our estimates are pretty close to the true probabilities. This is because we are drawing independent samples from the exact conditional distribution p(d|x).


In [40]:
# an example of how to estimate an empirical distribution out of 1000 samples
from collections import defaultdict
counts = defaultdict(int)
n_samples = 1000
# here we sample a number of derivations
for i in range(n_samples):
    d = tuple(sample(forest, I, goal))
    counts[d] += 1  # counting how often each derivation gets sampled

# here we sort them by frequency from most frequent to least frequent
for d, n in sorted(counts.iteritems(), key=lambda (d, n): n, reverse=True):
    # here we compute the exact probability (for comparison)
    prob = joint_probability(d)/I[goal]
    # here we compute an empirical estimate
    estimate = float(n)/n_samples
    t = make_nltk_tree(d)
    print '# n=%d prob=%s estimate=%s\n%s' % (n, prob, estimate, str(t))
    print


# n=661 prob=0.651583710407 estimate=0.661
(E:0-5
  (E:0-1 (T:0-1 (P:0-1 a)))
  +
  (T:2-5 (T:2-3 (P:2-3 a)) * (P:4-5 a)))

# n=297 prob=0.289592760181 estimate=0.297
(E:0-5 (T:0-5 (T:0-3 (T:0-1 (P:0-1 a)) + (P:2-3 a)) * (P:4-5 a)))

# n=30 prob=0.0407239819005 estimate=0.03
(E:0-5
  (E:0-3 (E:0-1 (T:0-1 (P:0-1 a))) + (T:2-3 (P:2-3 a)))
  *
  (T:4-5 (P:4-5 a)))

# n=12 prob=0.0180995475113 estimate=0.012
(E:0-5
  (E:0-3 (T:0-3 (T:0-1 (P:0-1 a)) + (P:2-3 a)))
  *
  (T:4-5 (P:4-5 a)))

Note that our grammar, even though ambiguous, expresses a strong preference (a little over 65%) for the analysis that performs the multiplication before the addition.

