In [ ]:

    
%%HTML
<style>
.container { width:100% } 
</style>

The Shunting Yard Algorithm (Operator Precedence Parsing)

The function $\texttt{toInt}(s)$ tries to convert the string $s$ to an integer. If this works out, the integer is returned. Otherwise, the string $s$ is returned unchanged.



In [ ]:

    
def toInt(s):
    try:
        return int(s)   
    except ValueError:
        return s



In [ ]:

    
toInt('123')



In [ ]:

    
toInt('**')

The module re provides support for regular expressions. These are needed for tokenizing a string.



In [ ]:

    
import re

The function $\texttt{tokenize}(s)$ takes a string $s$ representing an arithmetic expression and splits this string into a list of tokens. The string regExp in the implementation below is interpreted as follows:

The r in front of the apostrophe ' specifies that the regular expression is defined as a raw string. In a raw string the backslash does not have to be escaped because it is treated as a literal character.</br>
The regular expression is divided into three parts. These parts are separated by the character |.
1. [0-9]+ matches a natural number. For example, it matches 0 or 123. It would also match a string like 007. The + at the end of the substring [0-9]* specifies that there are any positive number of the characters in the range [0-9].</br>
2. \*\* matches the operator **.</br>
3. [()+*/%-] matches a parenthesis or an arithmetical operator. Note that we have to put the symbol - last in this group as otherwise this symbol would be interpreted as a range operator.



In [ ]:

    
def tokenize(s):
    regExp = r'[0-9]+|\*\*|[()+*%/-]'
    L = [ toInt(t) for t in re.findall(regExp, s) ]
    return list(reversed(L))



In [ ]:

    
re.findall(r'[0-9]+|\*\*|[()+*%/-]', '1 * 2 * 3**4')



In [ ]:

    
tokenize('1 * 2 * 3**4')

Given an operator $o$, the expression $\texttt{precedence}(o)$ returns the precedence of the operator $o$. If $o_1$ and $o_2$ are different operators and the precedence of $\texttt{o}_1$ is at least as high than the precedence of $\texttt{o}_2$, then the expression $$ a \;\texttt{o}_1\; b \;\texttt{o}_2\; c $$ should be evaluated as $$ (a \;\texttt{o}_1\; b) \;\texttt{o}_2\; c. $$ Otherwise, the expression $a \;\texttt{o}_1\; b \;\texttt{o}_2\; c$ should be evaluated as $$ a \;\texttt{o}_1\; (b \;\texttt{o}_2\; c). $$



In [ ]:

    
def precedence(o):
    Precedence = { '+': 1, '-': 1, '*': 2, '/': 2, '%': 2, '**' : 3 }
    return Precedence[o]

The expression $\texttt{isLeftAssociative}(o)$ is True iff the operator $o$ associates to the left. If $o$ associates to the right, this expression is False. If the operator $o$ is unknown, evaluation of the expression results in an error.



In [ ]:

    
def isLeftAssociative(o):
    if o in { '+', '-', '*', '/', '%' }:
        return True
    if o in { '**' }:
        return False
    assert False, f'unknown operator {o}'

The function $\texttt{evalBefore}(o_1, o_2)$ receives to strings representing arithmetical operators. It returns True if the operator $o_1$ should be evaluated before the operator $o_2$ in an arithmetical expression of the form $a \;\texttt{o}_1\; b \;\texttt{o}_2\; c$. In order to determine whether $o_1$ should be evaluated before $o_2$ it uses the precedence and the associativity of the operators.
Its behavior is specified by the following rules:

$\texttt{precedence}(o_1) > \texttt{precedence}(o_2) \rightarrow \texttt{evalBefore}(\texttt{o}_1, \texttt{o}_2) = \texttt{True}$,
$o_1 = o_2 \rightarrow \texttt{evalBefore}(\texttt{o}_1, \texttt{o}_2) = \texttt{isLeftAssociative}(o_1)$,
$\texttt{precedence}(o_1) = \texttt{precedence}(o_2) \wedge o_1 \not= o_2 \rightarrow \texttt{evalBefore}(\texttt{o}_1, \texttt{o}_2) = \texttt{True}$,
$\texttt{precedence}(o_1) < \texttt{precedence}(o_2) \rightarrow \texttt{evalBefore}(\texttt{o}_1, \texttt{o}_2) = \texttt{False}$.



In [ ]:

    
def evalBefore(stackOp, nextOp):
    if precedence(stackOp) > precedence(nextOp):
        return True
    if stackOp == nextOp:
        return isLeftAssociative(stackOp)
    if precedence(stackOp) == precedence(nextOp) and stackOp != nextOp:
        return True
    if precedence(stackOp) < precedence(nextOp):
        return False
    assert False, f'incomplete case distinction in evalBefore({stackOp}, {nextOp})'



In [ ]:

    
import stack

The class Calculator supports three member variables:

the token stack mTokenStack
the operator stack mOperators
the argument stack mArguments

The constructor takes a string that is tokenized and pushes the tokens onto the token stack such that the first token is on top of the token stack.



In [ ]:

    
class Calculator:
    def __init__(self, s):
        self.mTokens    = stack.createStack(tokenize(s))
        self.mOperators = stack.Stack()
        self.mArguments = stack.Stack()

The method __str__ is used to convert an object of class Calculator to a string.



In [ ]:

    
def toString(self):
    return '\n'.join(['_'*50, 
                      'Tokens:    ', str(self.mTokens), 
                      'Arguments: ', str(self.mArguments), 
                      'Operators: ', str(self.mOperators), 
                      '_'*50])

Calculator.__str__ = toString
del toString

The function $\texttt{evaluate}(\texttt{self})$ evaluates the expression that is given by the tokens on the mTokenStack.
There are two phases:

The first phase is the reading phase. In this phase the tokens are removed from the token stack mTokens.
The second phase is the evaluation phase. In this phase, the remaining operators on the operator stack mOperators are evaluated. Note that some operators are already evaluated in the reading phase.

We can describe what happens in the reading phase using rewrite rules that describe how the three stacks mTokens, mArguments and mOperators are changed in each step. Here, a step is one iteration of the first while-loop of the function evaluate. The following rewrite rules are executed until the token stack mTokens is empty.

If the token on top of the token stack is an integer, it is removed from the token stack and pushed onto the argument stack. The operator stack remains unchanged in this case.
$$\begin{array}{lc} \texttt{mTokens} = \texttt{mTokensRest} + [\texttt{token} ] & \wedge \\ \texttt{isInteger}(\texttt{token}) & \Rightarrow \\[0.2cm] \texttt{mArguments}' = \texttt{mArguments} + [\texttt{token}] & \wedge \\ \texttt{mTokens}' = \texttt{mTokensRest} & \wedge \\ \texttt{mOperators}' = \texttt{mOperators} \end{array} $$ Here, the primed variable $\texttt{mArguments}'$ refers to the argument stack after $\texttt{token}$ has been pushed onto it.

In the following rules we implicitly assume that the token on top of the token stack is not an integer but rather a parenthesis or a proper operator. In order to be more concise, we suppress this precondition from the following rewrite rules.
If the operator stack is empty, the next token is pushed onto the operator stack. $$\begin{array}{lc} \texttt{mTokens} = \texttt{mTokensRest} + [\texttt{op} ] & \wedge \\ \texttt{mOperators} = [] & \Rightarrow \\[0.2cm] \texttt{mOperators}' = \texttt{mOperators} + [\texttt{op}] & \wedge \\ \texttt{mTokens}' = \texttt{mTokensRest} & \wedge \\ \texttt{mArguments}' = \texttt{mArguments} \end{array} $$
If the next token is an opening parenthesis, this parenthesis token is pushed onto the operator stack. $$\begin{array}{lc} \texttt{mTokens} = \texttt{mTokensRest} + [\texttt{'('} ] & \Rightarrow \\[0.2cm] \texttt{mOperators}' = \texttt{mOperators} + [\texttt{'('}] & \wedge \\ \texttt{mTokens}' = \texttt{mTokensRest} & \wedge \\ \texttt{mArguments}' = \texttt{mArguments} \end{array} $$
If the next token is a closing parenthesis and the operator on top of the operator stack is an opening parenthesis, then both parentheses are removed. $$\begin{array}{lc} \texttt{mTokens} = \texttt{mTokensRest} + [\texttt{')'} ] & \wedge \\ \texttt{mOperators} =\texttt{mOperatorsRest} + [\texttt{'('}] & \Rightarrow \\[0.2cm] \texttt{mOperators}' = \texttt{mOperatorsRest} & \wedge \\ \texttt{mTokens}' = \texttt{mTokensRest} & \wedge \\ \texttt{mArguments}' = \texttt{mArguments} \end{array} $$
If the next token is a closing parenthesis but the operator on top of the operator stack is not an opening parenthesis, the operator on top of the operator stack is evaluated. Note that the token stack is not changed in this case. $$\begin{array}{lc} \texttt{mTokens} = \texttt{mTokensRest} + [\texttt{')'} ] & \wedge \ \texttt{mOperatorsRest} + [\texttt{op}] & \wedge \ \texttt{op} \not= \texttt{'('} & \wedge \ \texttt{mArguments} = \texttt{mArgumentsRest} + [\texttt{lhs}, \texttt{rhs}] & \Rightarrow \[0.2cm]
```
 \texttt{mOperators}' = \texttt{mOperatorsRest} & \wedge \\
  \texttt{mTokens}' = \texttt{mTokens} & \wedge \\
  \texttt{mArguments}' = \texttt{mArgumentsRest} + [\texttt{lhs} \;\texttt{op}\; \texttt{rhs}]
```
\end{array} $$ Here, the expression $\texttt{lhs} \;\texttt{op}\; \texttt{rhs}$ denotes evaluating the operator $\texttt{op}$ with the arguments $\texttt{lhs}$ and $\texttt{rhs}$.
If the token on top of the operator stack is an opening parenthesis, then the operator on top of the token stack is pushed onto the operator stack. $$\begin{array}{lc} \texttt{mTokens} = \texttt{mTokensRest} + [\texttt{op}] & \wedge \\ \texttt{op} \not= \texttt{')'} & \wedge \\ \texttt{mOperators} = \texttt{mOperatorsRest} + [\texttt{'('}] & \Rightarrow \\[0.2cm] \texttt{mOperator}' = \texttt{mOperator} + [\texttt{op}] & \wedge \\ \texttt{mTokens}' = \texttt{mTokensRest} & \wedge \\ \texttt{mArguments}' = \texttt{mArguments} \end{array} $$

In the remaining cases neither the token on top of the token stack nor the operator on top of the operator stack can be a a parenthesis. The following rules will implicitly assume that this is the case.

If the operator on top of the operator stack needs to be evaluated before the operator on top of the token stack, the operator on top of the operator stack is evaluated. $$\begin{array}{lc}

 \texttt{mTokens} = \texttt{mTokensRest} + [o_2]                                        & \wedge \\
 \texttt{mOperatorsRest} + [o_1]                                                        & \wedge \\
 \texttt{evalBefore}(o_1, o_2)                                                          & \wedge \\ 
 \texttt{mArguments} = \texttt{mArgumentsRest} + [\texttt{lhs}, \texttt{rhs}]           & \Rightarrow \\[0.2cm]
 \texttt{mOperators}' = \texttt{mOperatorRest}                                          & \wedge \\
 \texttt{mTokens}' = \texttt{mTokens}                                                   & \wedge \\
 \texttt{mArguments}' = \texttt{mArgumentsRest} + [\texttt{lhs} \;o_1\; \texttt{rhs}]
 \end{array}

Otherwise, the operator on top of the token stack is pushed onto the operator stack. $$\begin{array}{lc}

  \texttt{mTokens} = \texttt{mTokensRest} + [o_2]           & \wedge \\
  \texttt{mOperators} = \texttt{mOperatorsRest} + [o_1]     & \wedge \\
  \neg \texttt{evalBefore}(o_1, o_2)                        & \Rightarrow \\[0.2cm]
 \texttt{mOperators}' = \texttt{mOperators} + [o_2]         & \wedge \\
 \texttt{mTokens}' = \texttt{mTokensRest}                   & \wedge \\
 \texttt{mArguments}' = \texttt{mArguments}

\end{array} $$

In every step of the evaluation phase we

remove one operator from the operator stack,
remove its arguments from the argument stack,
evaluate the operator, and
push the result back on the argument stack.



In [ ]:

    
def evaluate(self):
    while not self.mTokens.isEmpty():
        print(self) # only for debugging
        nextOp = self.mTokens.top(); self.mTokens.pop()
        if isinstance(nextOp, int):
            self.mArguments.push(nextOp)
            continue
        if self.mOperators.isEmpty():
            self.mOperators.push(nextOp)
            continue
        if nextOp == "(":
            self.mOperators.push(nextOp)
            continue
        stackOp = self.mOperators.top()
        if stackOp == "(" and nextOp == ")":
            self.mOperators.pop()
            continue
        if nextOp == ")":
            self.popAndEvaluate()
            self.mTokens.push(nextOp)
            continue
        if stackOp == '(':
            self.mOperators.push(nextOp)
            continue
        if evalBefore(stackOp, nextOp):
            self.popAndEvaluate()
            self.mTokens.push(nextOp)
        else:
            self.mOperators.push(nextOp)
    while not self.mOperators.isEmpty():
        print(self) # only for debugging
        self.popAndEvaluate()
    print(self)
    return self.mArguments.top()
    
Calculator.evaluate = evaluate
del evaluate

The method $\texttt{popAndEvaluate}(\texttt{self})$ removes the two topmost numbers $\texttt{rhs}$ and $\texttt{lhs}$ from the argument stack and removes the topmost operator $\texttt{op}$ from the operator stack. It applies the operator $\texttt{op}$ to these numbers by computing $\texttt{lhs} \;\texttt{op}\; \texttt{rhs}$ and then pushes this value back on the argument stack.



In [ ]:

    
def popAndEvaluate(self):
    rhs = self.mArguments.top(); self.mArguments.pop()
    lhs = self.mArguments.top(); self.mArguments.pop()
    op  = self.mOperators.top(); self.mOperators.pop()
    result = None
    if op == '+':
        result = lhs + rhs
    if op == '-':
        result = lhs - rhs
    if op == '*':
        result = lhs * rhs
    if op == '/':
        result = lhs // rhs
    if op == '%':
        result = lhs % rhs
    if op == '**':
        result = lhs ** rhs
    assert result != None, f'ERROR: *** Unknown Operator *** "{op}"'
    self.mArguments.push(result)
    
Calculator.popAndEvaluate = popAndEvaluate
del popAndEvaluate



In [ ]:

    
C = Calculator('2*(3+4)**2')



In [ ]:

    
C.evaluate()



In [ ]:



In [ ]: