Regular expressions


In [1]:
from tock import *

Regular expressions in Tock use the following operators:

  • | or for union
  • concatenation for concatenation
  • * for Kleene star

This is very similar to Unix regular expressions, but because a symbol can have more than one character, consecutive symbols must be separated by a space. Also, for the empty string, you must write ε (or &). The empty set is written as .

To create a regular expression from a string (Sipser, Example 1.56):


In [2]:
r = RegularExpression.from_str('(a b|a)*')
r


Out[2]:
(a b ∪ a)*

However, there isn't much you can do with a RegularExpression object other than to convert it to an NFA.

From regular expressions to NFAs


In [3]:
m = from_regexp(r)          # from RegularExpression object
m = from_regexp('(a b|a)*') # a str is automatically parsed into a RegularExpression

The regular expression is converted into a finite automaton, which you can view, as usual, as either a graph or a table.


In [4]:
to_graph(m)


%3 _START 2 q8 _START->2 0 q1 1 q2 0->1 a 4 q3 1->4 ε 5 q7 2->5 ε 3 q5 6 q6 3->6 a 7 q4 4->7 b 5->0 ε 5->3 ε 6->5 ε 7->5 ε

In [5]:
to_table(m)


Out[5]:
ε a b
q1 q2
q2 q3
q3 q4
@q4 q7
q5 q6
@q6 q7
q7 {q1,q5}
>@q8 q7

The states are numbered according to the position in the regular expression they came from (so that listing them in alphabetical order is natural). The letter suffixes are explained below.

We can also pass the display_steps=True option to show the automata created for all the subexpressions.


In [6]:
m = from_regexp('(a b|a)*', display_steps=True)


subexpression: a
%3 _START 0 q1 _START->0 1 q2 0->1 a
subexpression: b
%3 _START 0 q3 _START->0 1 q4 0->1 b
subexpression: a b
%3 _START 2 q1 _START->2 0 q2 1 q3 0->1 ε 3 q4 1->3 b 2->0 a
subexpression: a
%3 _START 1 q5 _START->1 0 q6 1->0 a
subexpression: a b ∪ a
%3 _START 4 q7 _START->4 0 q1 1 q2 0->1 a 2 q3 1->2 ε 5 q4 2->5 b 3 q5 6 q6 3->6 a 4->0 ε 4->3 ε
subexpression: (a b ∪ a)*
%3 _START 3 q8 _START->3 0 q1 1 q2 0->1 a 2 q3 1->2 ε 7 q4 2->7 b 5 q7 3->5 ε 4 q5 6 q6 4->6 a 5->0 ε 5->4 ε 6->5 ε 7->5 ε

From NFAs to regular expressions

The to_regexp function converts in the opposite direction:


In [7]:
e = to_regexp(m)
e


Out[7]:
ε ∪ a* a ∪ (ε ∪ a* a) (a b (ε ∪ a* a))* a b (ε ∪ a* a)

The resulting regular expression depends a lot on the order in which states are eliminated; Tock eliminates states in reverse alphabetical order.

Again, the display_steps option causes all the intermediate steps of the conversion to be displayed.


In [8]:
e = to_regexp(m, display_steps=True)


%3 _START 0 start _START->0 2 q8 0->2 ε 1 accept 2->1 ε 9 q7 2->9 ε 3 q1 4 q2 3->4 a 5 q3 4->5 ε 7 q4 5->7 b 6 q5 8 q6 6->8 a 7->1 ε 7->9 ε 8->1 ε 8->9 ε 9->3 ε 9->6 ε
eliminate q8
%3 _START 0 start _START->0 1 accept 0->1 ε 8 q7 0->8 ε 2 q1 3 q2 2->3 a 4 q3 3->4 ε 6 q4 4->6 b 5 q5 7 q6 5->7 a 6->1 ε 6->8 ε 7->1 ε 7->8 ε 8->2 ε 8->5 ε
eliminate q7
%3 _START 0 start _START->0 1 accept 0->1 ε 2 q1 0->2 ε 5 q5 0->5 ε 3 q2 2->3 a 4 q3 3->4 ε 6 q4 4->6 b 7 q6 5->7 a 6->1 ε 6->2 ε 6->5 ε 7->1 ε 7->2 ε 7->5 ε
eliminate q6
%3 _START 0 start _START->0 1 accept 0->1 ε 2 q1 0->2 ε 5 q5 0->5 ε 3 q2 2->3 a 4 q3 3->4 ε 6 q4 4->6 b 5->1 a 5->2 a 5->5 a 6->1 ε 6->2 ε 6->5 ε
eliminate q5
%3 _START 0 start _START->0 1 accept 0->1 ε ∪ a* a 2 q1 0->2 ε ∪ a* a 3 q2 2->3 a 4 q3 3->4 ε 5 q4 4->5 b 5->1 ε ∪ a* a 5->2 ε ∪ a* a
eliminate q4
%3 _START 0 start _START->0 1 accept 0->1 ε ∪ a* a 2 q1 0->2 ε ∪ a* a 3 q2 2->3 a 4 q3 3->4 ε 4->1 b (ε ∪ a* a) 4->2 b (ε ∪ a* a)
eliminate q3
%3 _START 0 start _START->0 1 accept 0->1 ε ∪ a* a 2 q1 0->2 ε ∪ a* a 3 q2 2->3 a 3->1 b (ε ∪ a* a) 3->2 b (ε ∪ a* a)
eliminate q2
%3 _START 0 start _START->0 1 accept 0->1 ε ∪ a* a 2 q1 0->2 ε ∪ a* a 2->1 a b (ε ∪ a* a) 2->2 a b (ε ∪ a* a)
eliminate q1
%3 _START 0 start _START->0 1 accept 0->1 ε ∪ a* a ∪ (ε ∪ a* a) (a b (ε ∪ a* a))* a b (ε ∪ a* a)