CFG

This is how we represent a CFG

Symbols

Its symbols are plain python strings.

A nonterminal is a brackted string such as

    [S]

In [24]:
from symbol import is_terminal, is_nonterminal

In [25]:
is_terminal('a')


Out[25]:
True

In [26]:
is_nonterminal('[X]')


Out[26]:
True

Rules

Rules are objects made of a LHS symbol, a RHS sequence (of symbols) and a probability.


In [27]:
from rule import Rule

In [28]:
r = Rule('[S]', ['[X]', 'a'], 1.0)

You can print a rule, you can access its attributes, and you can hash rules with containers such as dict and set.


In [29]:
print r


[S] -> [X] a (1.0)

In [30]:
print r.prob


1.0

In [31]:
r in set([r])


Out[31]:
True

In [32]:
D = {r: 1}
D


Out[32]:
{[S] -> [X] a (1.0): 1}

Grammar

A PCFG is organised pretty much as a dictionary mapping from LHS symbols to their rewrite rules.


In [33]:
from cfg import WCFG

In [34]:
G = WCFG()

We can add rules


In [35]:
G.add(Rule('[S]', ['[X]'], 0.0))

In [36]:
G.add(Rule('[S]', ['[S]', '[X]'], 0.0))
G.add(Rule('[X]', ['a'], 0.0))

We can print the grammar


In [37]:
print G


[S] -> [X] (0.0)
[S] -> [S] [X] (0.0)
[X] -> a (0.0)

we can test whether there are rewrite rules for a certain LHS symbol


In [38]:
G.can_rewrite('[S]')


Out[38]:
True

In [39]:
G.can_rewrite('a')


Out[39]:
False

we can get the set of rewrite rules for a certain LHS symbol


In [40]:
G.get('[S]')


Out[40]:
[[S] -> [X] (0.0), [S] -> [S] [X] (0.0)]

In [41]:
G.get('[X]')


Out[41]:
[[X] -> a (0.0)]

and when a symbol cannot be rewritten, the grammar will return an empty set


In [42]:
G.get('a')


Out[42]:
frozenset()

We can also iterate through rules in the grammar.

Note that the followin is basically counting how many rules we have in the grammar.


In [43]:
sum(1 for r in G)


Out[43]:
3

which can also be done in a more efficient way


In [44]:
len(G)


Out[44]:
3

Finally we can have access to the set of terminals and nonterminals of the grammar


In [45]:
'[S]' in G.nonterminals


Out[45]:
True

In [46]:
'a' in G.terminals


Out[46]:
True

Read from file

Grammar files contain one rule per line. Each line is a triple with fields separated by '|||'.

The first field is the rule's LHS symbol, the second symbol is the rule's RHS sequence, and the last field is the rule's probability.

Example:

    [S] ||| [S] [X] ||| 0.5

In [51]:
from cfg import read_grammar_rules

We hava a simple grammar for arithmetic operations. This grammar knows that products have precedence over summation.


In [52]:
# first we open a file
istream = open('examples/arithmetic')

In [53]:
# then we read rules from this file initialising a WCFG object
G1 = WCFG(read_grammar_rules(istream))

In [54]:
print G1


[T] -> [P] (0.5)
[T] -> [T] * [P] (0.5)
[E] -> [T] (0.5)
[E] -> [E] + [T] (0.5)
[P] -> a (1.0)

We also have an ambiguous grammar which discourage solving sums before solving products, but still allows it.


In [55]:
G2 = WCFG(read_grammar_rules(open('examples/ambiguous')))
print G2


[T] -> [P] (0.5)
[T] -> [T] * [P] (0.4)
[T] -> [T] + [P] (0.1)
[E] -> [T] (0.5)
[E] -> [E] + [T] (0.45)
[E] -> [E] * [T] (0.05)
[P] -> a (1.0)

In [ ]: