Context-free grammars


In [1]:
from tock import *

Creating CFGs

You can create a CFG either by reading from a file (using Grammar.from_file) or a list of strings (using Grammar.from_lines). The first rule's left-hand side is assumed to be the start symbol.


In [2]:
g = Grammar.from_lines(["S -> a T b",
                        "S -> b",
                        "T -> T a",
                        "T -> &"])
g


Out[2]:
nonterminals: {S,T}
start: S
S → a T b
S → b
T → T a
T → ε

In [3]:
g.nonterminals


Out[3]:
{'S', 'T'}

A Grammar can be any unrestricted (type-0) grammar, but currently the grammar-reading functions can only read CFGs.


In [4]:
g.is_contextfree()


Out[4]:
True

Like RegularExpression objects, there isn't a lot you can do with Grammar objects other than convert them to automata.

From CFGs to PDAs

To convert to a PDA using a top-down construction, use the from_grammar function:


In [5]:
m = from_grammar(g)
to_graph(m)


%3 _START 0 start _START->0 2 loop 0->2 ε,ε → [S] $ 1 accept 2->1 ε,$ → ε 2->2 ε,S → [a] T b ε,S → b ε,T → [T] a ε,T → ε b,b → ε a,a → ε

In [6]:
run(m, "a a b", show_stack=4)


%3 _START 0 start,ε _START->0 23 loop,[S] $ 0->23 1 2 1->2 a 3 2->3 a 4 3->4 b 5 loop,[a] T b $ 7 loop,[T] b $ 5->7 6 loop,[b] $ 8 loop,[T] a b $ 7->8 9 loop,[b] $ 7->9 10 loop,[a] b $ 8->10 11 loop,[T] a a b … 8->11 12 loop,[b] $ 10->12 14 loop,[T] a a a … 11->14 15 loop,[a] a b $ 11->15 13 loop,$ 12->13 16 accept,ε 13->16 14->14 18 loop,[a] a a b … 14->18 19 loop,[a] a a a … 14->19 17 loop,[a] b $ 15->17 20 loop,[a] a b $ 18->20 21 loop,[a] a a a … 19->21 22 loop,[a] a a b … 19->22 23->5 23->6

This diagram is a little hard to read, but one thing to note is the cycle between loop and 3.1. This is caused by the left-recursive rule T -> T a, which the automaton applies an unbounded number of times.

Shift-reduce parsing

There's also a bottom-up version:


In [7]:
m = from_grammar(g, mode="bottomup")
to_graph(m)


%3 _START 0 start _START->0 2 loop 0->2 ε,ε → $ 1 accept 2->1 ε,[S] $ → ε 2->2 ε,[b] T a → S ε,b → S ε,[a] T → T ε,ε → T b,ε → b a,ε → a

In [8]:
run(m, "a a b")


%3 _START 14 start,ε _START->14 0 loop,[T] T S … 19 loop,[T] T T … 0->19 1 loop,[S] a $ 5 loop,[T] S a … 1->5 2 loop,[S] T T … 3 loop,[T] S T … 2->3 3->0 4 loop,[S] T a … 4->3 5->0 6 loop,[T] T b … 6->19 7 loop,[T] b T … 7->6 8 loop,[T] T T … 8->8 9 loop,[b] T T … 8->9 9->2 9->7 10 13 10->13 b 11 11->10 a 12 12->11 a 15 loop,$ 14->15 16 loop,[T] $ 15->16 21 loop,[a] $ 15->21 22 loop,[T] T $ 16->22 23 loop,[a] T $ 16->23 17 loop,[S] $ 18 accept,ε 17->18 20 loop,[T] S $ 17->20 19->19 20->0 24 loop,[T] a $ 21->24 25 loop,[a] a $ 21->25 27 loop,[T] T T … 22->27 28 loop,[a] T T … 22->28 26 loop,[T] $ 23->26 31 loop,[T] a T … 23->31 32 loop,[a] a T … 23->32 33 loop,[T] T a … 24->33 34 loop,[a] T a … 24->34 35 loop,[T] a a … 25->35 36 loop,[b] a a … 25->36 29 loop,[T] T $ 26->29 30 loop,[a] T $ 26->30 27->27 27->28 28->29 28->31 28->32 39 loop,[T] T T … 28->39 29->39 40 loop,[a] T T … 29->40 37 loop,[T] $ 30->37 44 loop,[T] a T … 30->44 45 loop,[b] a T … 30->45 31->33 31->34 32->35 32->36 33->39 33->40 41 loop,[T] a $ 34->41 34->44 34->45 46 loop,[T] T a … 35->46 47 loop,[b] T a … 35->47 38 loop,[S] a a … 36->38 48 loop,[T] b a … 36->48 42 loop,[T] T $ 37->42 43 loop,[b] T $ 37->43 38->5 39->39 39->40 40->8 40->42 40->44 40->45 40->46 41->46 41->47 42->8 42->9 43->7 49 loop,[S] T $ 43->49 44->2 44->4 44->46 44->47 44->49 45->48 50 loop,[S] a T … 45->50 46->8 46->9 47->1 47->2 47->4 47->7 47->17 47->49 47->50 48->6 49->3 50->5

LL(1) parsing


In [9]:
g = Grammar.from_lines(["S -> a S c",
                        "S -> T",
                        "T -> b T",
                        "T -> &"])

m = from_grammar(g, mode='ll1')
m


%3 _START 0 start _START->0 2 loop 0->2 ε,ε → [S] $ 1 accept 3 a 2->3 a,ε → ε 4 b 2->4 b,ε → ε 5 c 2->5 c,ε → ε 6 2->6 ␣,ε → ε 3->2 ε,a → ε 3->3 ε,S → [a] S c 4->2 ε,b → ε 4->4 ε,S → T ε,T → [b] T 5->2 ε,c → ε 5->5 ε,S → T ε,T → ε 6->1 ε,$ → ε 6->6 ε,S → T ε,T → ε

In [10]:
m.is_deterministic()


Out[10]:
True

In [11]:
run(m, "a a b c c", show_stack=5).only_path()


Out[11]:
start[a] a b c cε
loop[a] a b c c[S] $
a[a] b c c[S] $
a[a] b c c[a] S c $
loop[a] b c c[S] c $
a[b] c c[S] c $
a[b] c c[a] S c c $
loop[b] c c[S] c c $
b[c] c[S] c c $
b[c] c[T] c c $
b[c] c[b] T c c $
loop[c] c[T] c c $
cc[T] c c $
cc[c] c $
loopc[c] $
cε[c] $
loopε$
ε$
acceptεε

LR parsing

Construction of an LR parser involves building the LR automaton:


In [17]:
g = Grammar.from_lines([
    'S -> T -|',
    'T -> T l T r',
    'T -> &'
])

m = from_grammar(g, mode='lr0')
m


%3 _START 2 start' _START->2 0 start 3 loop 0->3 ε,2 → [3] $ 2 1 accept 2->0 ε,ε → 2 3->1 ε,[8] S 3 $ 2 → 2 3->3 ε,[6] ⊣ 7 T 3 → [8] S 3 ε,[5] r 4 T 1 l 4 T 1 → [4] T 1 ε,[5] r 4 T 1 l 7 T 3 → [7] T 3 ε,1 → [4] T 1 ε,3 → [7] T 3 r,4 → [5] r 4 l,4 → [1] l 4 l,7 → [1] l 7 ⊣,7 → [6] ⊣ 7

In [14]:
m.is_deterministic()


Out[14]:
True

In [15]:
run(m, 'l l r l r r -|', show_stack=15).only_path()


Out[15]:
start'[l] l r l r r ⊣ε
start[l] l r l r r ⊣2
loop[l] l r l r r ⊣[3] $ 2
loop[l] l r l r r ⊣[7] T 3 $ 2
loop[l] r l r r ⊣[1] l 7 T 3 $ 2
loop[l] r l r r ⊣[4] T 1 l 7 T 3 $ 2
loop[r] l r r ⊣[1] l 4 T 1 l 7 T 3 $ 2
loop[r] l r r ⊣[4] T 1 l 4 T 1 l 7 T 3 $ 2
loop[l] r r ⊣[5] r 4 T 1 l 4 T 1 l 7 T 3 $ 2
loop[l] r r ⊣[4] T 1 l 7 T 3 $ 2
loop[r] r ⊣[1] l 4 T 1 l 7 T 3 $ 2
loop[r] r ⊣[4] T 1 l 4 T 1 l 7 T 3 $ 2
loop[r] ⊣[5] r 4 T 1 l 4 T 1 l 7 T 3 $ 2
loop[r] ⊣[4] T 1 l 7 T 3 $ 2
loop[5] r 4 T 1 l 7 T 3 $ 2
loop[7] T 3 $ 2
loopε[6] ⊣ 7 T 3 $ 2
loopε[8] S 3 $ 2
acceptε2

You can visualize the LR automaton using grammars.lr_automaton, and you can also build an LR(1) parser using mode='lr1'.

From PDAs to CFGs

The conversion in the reverse direction, from PDA to CFG, is actually related to the algorithm that Tock uses internally to simulate PDAs. We use the remove_useless method to reduce the size of the converted CFG.


In [16]:
m = read_csv('examples/sipser-2-14.csv')
g = to_grammar(m).remove_useless()
g


Out[16]:
nonterminals: {(q1,q1),(q1,q4),(q2,q2),(q2,q3),(q3,q3),(q4,q4),(start,accept)}
start: (start,accept)
(q2,q3) → 0 (q2,q2) 1
(q2,q3) → 0 (q2,q3) 1
(q1,q4) → (q2,q3)
(start,accept) → (q1,q4)
(q1,q1) → (q1,q1) (q1,q1)
(q1,q4) → (q1,q1) (q1,q4)
(q1,q4) → (q1,q4) (q4,q4)
(q4,q4) → (q4,q4) (q4,q4)
(q3,q3) → (q3,q3) (q3,q3)
(q2,q3) → (q2,q3) (q3,q3)
(q2,q3) → (q2,q2) (q2,q3)
(q2,q2) → (q2,q2) (q2,q2)
(q1,q1) → ε
(q4,q4) → ε
(q3,q3) → ε
(q2,q2) → ε