Input

Our working example is a grammar that analyses arithmetic operations such as sum and multiplication.



In [1]:

    
from cfg import read_grammar_rules
from cfg import WCFG
from rule import Rule
from symbol import is_terminal, is_nonterminal, make_symbol
# for convenience we will use `defaultdict` from the package collections
# which allows us to define a default constructor for values 
from collections import defaultdict



In [39]:

    
G1 = WCFG(read_grammar_rules(open('examples/arithmetic', 'r')))



In [40]:

    
print G1









    



[T] -> [P] (0.5)
[T] -> [T] * [P] (0.5)
[E] -> [T] (0.5)
[E] -> [E] + [T] (0.5)
[P] -> a (1.0)

CKY parser

We will implement a generalised CKY which can deal with an arbitrary epsilon-free PCFG.

Item

We are using dotted rules to represent items in CKY. A dotted rule is basically a context-free production agumented with a list of positions of the input sentence.



In [41]:

    
class Item(object):  # this class is also available in `item.py`
    """A dotted rule used in CKY/Earley."""
    
    def __init__(self, rule, dots):
        assert len(dots) > 0, 'I do not accept an empty list of dots'
        self.rule_ = rule
        self.dots_ = tuple(dots)
        
    def __eq__(self, other):
        return self.rule_ == other.rule_ and self.dots_ == other.dots_
    
    def __ne__(self, other):
        return not(self == other)
    
    def __hash__(self):
        return hash((self.rule_, self.dots_))
    
    def __repr__(self):
        return '{0} ||| {1}'.format(self.rule_, self.dots_)
    
    def __str__(self):
        return '{0} ||| {1}'.format(self.rule_, self.dots_)
    
    @property
    def lhs(self):
        return self.rule_.lhs
    
    @property
    def rule(self):
        return self.rule_
    
    @property
    def dot(self):
        return self.dots_[-1]
    
    @property
    def start(self):
        return self.dots_[0]
    
    @property
    def next(self):
        """return the symbol to the right of the dot (or None, if the item is complete)"""
        if self.is_complete():
            return None
        return self.rule_.rhs[len(self.dots_) - 1]
    
    def state(self, i):
        return self.dots_[i]

    def advance(self, dot):
        """return a new item with an extended sequence of dots"""
        return Item(self.rule_, self.dots_ + (dot,))
    
    def is_complete(self):
        """complete items are those whose dot reached the end of the RHS sequence"""
        return len(self.rule_.rhs) + 1 == len(self.dots_)



In [42]:

    
r = Rule('[S]', ['[X]'], 0.0)
i1 = Item(r, [0])
i2 = i1.advance(1)



In [43]:

    
print i1
print i2









    



[S] -> [X] (0.0) ||| (0,)
[S] -> [X] (0.0) ||| (0, 1)



In [44]:

    
i1 != i2









    Out[44]:





True



In [45]:

    
i1.is_complete()









    Out[45]:





False



In [46]:

    
i2.is_complete()









    Out[46]:





True



In [47]:

    
i1.next









    Out[47]:





'[X]'



In [48]:

    
i2.next

Agenda

We need an agenda of items. Some items are active, others are passive.

The active agenda is nothing but a stack or queue, whereas the passive agenda is simply a set (all items that have already been processed). However, we can further organise the passive items for easy/quick access within inference rules.



In [49]:

    
class Agenda(object):  # this class is also available in `agenda.py`
    
    def __init__(self):
        # we are organising active items in a stack (last in first out)
        self._active = []
        # an item should never queue twice, thus we will manage a set of items which we have already seen
        self._seen = set()
        # we organise incomplete items by the symbols they wait for at a certain position
        # that is, if the key is a pair (Y, i)
        # the value is a set of items of the form
        # [X -> alpha * Y beta, [...i]]
        self._incomplete = defaultdict(set)  
        # we organise complete items by their LHS symbol spanning from a certain position
        # if the key is a pair (X, i)
        # then the value is a set of items of the form
        # [X -> gamma *, [i ... j]]
        self._complete = defaultdict(set)
        
    def __len__(self):
        """return the number of active items"""
        return len(self._active)
        
    def push(self, item):
        """push an item into the queue of active items"""
        if item not in self._seen:  # if an item has been seen before, we simply ignore it
            self._active.append(item)
            self._seen.add(item)
            return True
        return False
    
    def pop(self):
        """pop an active item"""
        assert len(self._active) > 0, 'I have no items left.'
        return self._active.pop()
    
    def make_passive(self, item):
        if item.is_complete():  # complete items offer a way to rewrite a certain LHS from a certain position
            self._complete[(item.lhs, item.start)].add(item)
        else:  # incomplete items are waiting for the completion of the symbol to the right of the dot
            self._incomplete[(item.next, item.dot)].add(item)
            
    def waiting(self, symbol, dot):
        return self._incomplete.get((symbol, dot), set())
    
    def complete(self, lhs, start):
        return self._complete.get((lhs, start), set())
    
    def itercomplete(self):
        """an iterator over complete items in arbitrary order"""
        for items in self._complete.itervalues():
            for item in items:
                yield item

Let's see how it works



In [50]:

    
A = Agenda()



In [51]:

    
r1 = Rule('[S]', ['[S]', '[X]'], 1.0)
r1









    Out[51]:





[S] -> [S] [X] (1.0)

we can push items into the agenda



In [52]:

    
A.push(Item(r1, [0]))  # S -> S X, [0]  (earley axiom)









    Out[52]:





True

and the agenda will make sure there are no duplicates



In [53]:

    
A.push(Item(r1, [0]))









    Out[53]:





False



In [54]:

    
len(A)









    Out[54]:





1



In [55]:

    
i1 = Item(r1, [0])
i1









    Out[55]:





[S] -> [S] [X] (1.0) ||| (0,)



In [56]:

    
A.make_passive(i1)



In [57]:

    
A._incomplete









    Out[57]:





defaultdict(<type 'set'>, {('[S]', 0): set([[S] -> [S] [X] (1.0) ||| (0,)])})



In [58]:

    
A.push(Item(Rule('[S]', ['[X]'], 1.0), [0]))









    Out[58]:





True



In [59]:

    
A.make_passive(Item(Rule('[S]', ['[X]'], 1.0), [0]))



In [60]:

    
A._incomplete









    Out[60]:





defaultdict(<type 'set'>, {('[S]', 0): set([[S] -> [S] [X] (1.0) ||| (0,)]), ('[X]', 0): set([[S] -> [X] (1.0) ||| (0,)])})



In [61]:

    
A.push(Item(Rule('[S]', ['[X]'], 1.0), [0, 1]))









    Out[61]:





True



In [62]:

    
A.make_passive(Item(Rule('[S]', ['[X]'], 1.0), [0, 1]))



In [63]:

    
A._complete









    Out[63]:





defaultdict(<type 'set'>, {('[S]', 0): set([[S] -> [X] (1.0) ||| (0, 1)])})

Axioms (CKY)

Basic version

For every rule X -> alpha, and every input position (i) between 0 and n-1, we have an item of the kind:

[X -> * alpha, i, i]

which gets represented as

X -> alpha, [i]



In [64]:

    
def axioms(cfg, sentence):
    """
    :params cfg: a context-free grammar (an instance of WCFG)
    :params sentence: the input sentence (as a list or tuple)
    :returns: a list of items
    """
    items = []
    for rule in cfg:
        for i in range(len(sentence)):
            items.append(Item(rule, [i]))
    return items



In [67]:

    
sentence = 'a * a'.split()
axioms(G1, sentence)









    Out[67]:





[[E] -> [T] (0.5) ||| (0,),
 [E] -> [T] (0.5) ||| (1,),
 [E] -> [T] (0.5) ||| (2,),
 [E] -> [E] + [T] (0.5) ||| (0,),
 [E] -> [E] + [T] (0.5) ||| (1,),
 [E] -> [E] + [T] (0.5) ||| (2,),
 [T] -> [P] (0.5) ||| (0,),
 [T] -> [P] (0.5) ||| (1,),
 [T] -> [P] (0.5) ||| (2,),
 [T] -> [T] * [P] (0.5) ||| (0,),
 [T] -> [T] * [P] (0.5) ||| (1,),
 [T] -> [T] * [P] (0.5) ||| (2,),
 [P] -> a (1.0) ||| (0,),
 [P] -> a (1.0) ||| (1,),
 [P] -> a (1.0) ||| (2,)]

Optimised version

For every rule X -> w_i alpha such that w_i is an input word at position i, we have an item of the kind:

[X -> * w_i alpha, i - 1, i - 1]

which gets represented as

X -> w_i alpha, [i-1]

Don't forget that these optimised axioms require an additional inference rule (some sort of delayed axioms which resembles a combination of prediction with completion).



In [86]:

    
def axioms2(cfg, sentence):
    """
    :params cfg: a context-free grammar (an instance of WCFG)
    :params sentence: the input sentence (as a list or tuple)
    :returns: a list of items
    """
    by_rhs0 = defaultdict(list)
    for rule in cfg:
        by_rhs0[rule.rhs[0]].append(rule)        
    items = []
    for i, word in enumerate(sentence):
        for rule in by_rhs0.get(word, []):
            items.append(Item(rule, [i]))
    return items



In [87]:

    
axioms2(G1, sentence)









    Out[87]:





[[P] -> a (1.0) ||| (0,), [P] -> a (1.0) ||| (2,)]

Scan

If [A -> alpha * w_j beta, i, j - 1] is an item such that w_j is the jth input word, then we can prove

    [A -> alpha w_j * beta, i, j]



In [70]:

    
def scan(item, sentence):
    assert is_terminal(item.next), 'Only terminal symbols can be scanned, got %s' % item.next
    if item.dot < len(sentence) and sentence[item.dot] == item.next:
        new = item.advance(item.dot + 1)
        return new
    else:
        return None



In [71]:

    
S = []
for item in axioms2(G1, sentence):
    new = scan(item, sentence)
    if new is not None:
        S.append(new)
S









    Out[71]:





[[P] -> a (1.0) ||| (0, 1), [P] -> a (1.0) ||| (2, 3)]



In [ ]:

Complete

Here we let an active item interact with passive items:

either an active item is complete, then we try to advance incomplete passive items
or an active item is incomplete, in which case we try to advance the item itself by looking back to complete passive items

Either way, this is the inference rule:

     [X -> alpha * Y beta, [i ... k]] [Y -> gamma *, [k ... j]]
     ----------------------------------------------------------
             [X -> alpha Y * beta, [i ... j]]



In [82]:

    
def complete(item, agenda):
    items = []
    if item.is_complete():
        # advance the dot for incomplete items waiting for item.lhs spanning from item.start
        for incomplete in agenda.waiting(item.lhs, item.start):
            items.append(incomplete.advance(item.dot))
    else:
        # look for completions of item.next spanning from item.dot
        ends = set()
        for complete in agenda.complete(item.next, item.dot):
            ends.add(complete.dot)
        # advance the dot of the input item for each position that complete a span
        for end in ends:
            items.append(item.advance(end))
    return items

Forest from complete items

Each complete item in the (passive) agenda can be mapped to a new CFG rule (annotated with input spans). For example, an item such as

    [X -> A x B *, [0,1,2,3]]

results in the rule

    X:0-3 -> A:0-1 x B:2-3

observe how only nonterminal nodes get annotated.



In [88]:

    
def make_forest(complete_items):
    forest = WCFG()
    for item in complete_items:        
        lhs = make_symbol(item.lhs, item.start, item.dot)
        rhs = []
        for i, sym in enumerate(item.rule.rhs):
            rhs.append(make_symbol(sym, item.state(i), item.state(i + 1)))
        forest.add(Rule(lhs, rhs, item.rule.prob))
    return forest

Complete program

Start with axioms and exhaustively apply inference rules



In [79]:

    
def cky(cfg, sentence):
    A = Agenda()
    for item in axioms(cfg, sentence):
        A.push(item)
    while A:
        item = A.pop()
        if item.is_complete() or is_nonterminal(item.next):
            for new in complete(item, A):
                A.push(new)
        else:
            new = scan(item, sentence)
            if new is not None:
                A.push(new)
        A.make_passive(item)
    return make_forest(A.itercomplete())



In [80]:

    
forest = cky(G1, sentence)



In [81]:

    
print forest









    



[E,0-3] -> [T,0-3] (0.5)
[P,0-1] -> a (1.0)
[E,2-3] -> [T,2-3] (0.5)
[T,2-3] -> [P,2-3] (0.5)
[T,0-1] -> [P,0-1] (0.5)
[E,0-1] -> [T,0-1] (0.5)
[T,0-3] -> [T,0-1] * [P,2-3] (0.5)
[P,2-3] -> a (1.0)

Goal

We know that the goal symbol is basically the original start symbol spanning from 0 to n (the length of the sentence).



In [85]:

    
goal = make_symbol('[E]', 0, len(sentence))
goal









    Out[85]:





'[E,0-3]'



In [ ]: