Building a Scheme Interpreter

Consider using Python to turn the string "(+ 1 2)" into the actual number 3. How does that happen? This question is really: what does a programming language do? How does it work? In this chapter, we will answer these questions by building a Scheme interpreter in Python.

Two Steps: Parse and Interpret

In designing a programming language, we break down the process into two steps:

STEP 1: The first step in implementing a programming language is to take a plain string and turn it into what is commonly called an Abstract Syntax Tree, or AST for short. This process is called parsing. ASTs are data structures.

STEP 2: The second step is to build an evaluator that takes ASTs and interprets them. This is called interpreting.

We will now go through the steps of designing a language. As this language will start off as a simple calculator with the syntax looking like Scheme, we'll call this language S-Calc, short for Scheme Calculator.

Parsing

To build a function that will take a string and produce AST, we further break down the parsing into three stages. We could do this all in one function, but it is common to break the process up into smaller chunks to make processing (and understanding/debugging) easier. The three components are:

  1. tokenize - turns a string into tokens
  2. reader - take the tokens and group them
  3. parser - turn the segmented parts into AST

The idea is that we can then take our string "(+ 1 2)" and end up with AST, like so:

parser(reader(tokenize("(+ 1 1)")))

The form of the AST can really be any data structure that we decide. For these experiments, we will use very simple Scheme expressions (called s-exp). Thus, the above string might look like this in Scheme:

(apply-exp
 (var-exp +)
 ((lit-exp 1) (lit-exp 2)))

That is, it is an application-expression composed of the operator (a variable-expression '+') and two literal-expressions 1 and 2 as operands.

We call the syntax of the string the Concrete Syntax as compared to the Abstract Syntax of the AST.

As we have seen, Scheme is a simple language composed of lists, symbols, strings, and numbers. Everything in the language can be parsed into those components, so writing a Scheme parser is pretty easy compared to languages like Python.

Tokenize

To parse S-Calc we first define the lowest level of the process, the tokenizer:


In [1]:
def tokenizer(string):
    """
    Takes a string and segments it into parts.
    We break strings up by brackets, and whitespace.
    Returns a Python list of strings.
    """
    retval = []
    current = ""
    for i in range(len(string)):
        if string[i] in ["(", "[", ")", "]"]:
            if current:
                retval.append(current)
            current = ""
            retval.append(string[i])
        elif string[i] in [" ", "\t", "\n"]:
            if current:
                retval.append(current)
            current = ""
        else:
            current += string[i]
    if current:
        retval.append(current)
    return retval

In [2]:
tokenizer("""(this    is a
3.14 
(test))""")


Out[2]:
['(', 'this', 'is', 'a', '3.14', '(', 'test', ')', ')']

Exercise 1: Try the tokenizer on many different strings. Describe what it does in simple terms based on its input and output.

Reader

The reader will take the tokenized expression (texp) and produced grouped results.


In [3]:
def reader(texp):
    """
    Takes the output of the tokenizer, and creates
    lists of lists of items. Numbers are represented
    as numbers.
    """
    current = None
    stack = []
    for item in texp:
        if item.isdigit():
            if current is not None:
                current.append(eval(item))
            else:
                current = eval(item)
        elif item in ["[", "("]:
            if current is not None:
                stack.append(current)
            current = []
        elif item in ["]", ")"]:
            if stack:
                stack[-1].append(current)
                current = stack.pop(-1)
            else:
                pass
        else:
            if current is not None:
                current.append(item)
            else:
                current = item
    return current

In [4]:
tokenizer("(this is (a) ((list))")


Out[4]:
['(', 'this', 'is', '(', 'a', ')', '(', '(', 'list', ')', ')']

In [5]:
reader(tokenizer("(this is (a) ((list))"))


Out[5]:
['this', 'is', ['a'], [['list']]]

Exercise 2: Try the reader on many different tokenized strings. Describe what it does in simple terms. How does this differ from the lexer?

Parser

The final process of Step 1 is to take the output of the reader and parse it into an AST. For our first S-Calc expression, we just need to handle "(+ 1 2)". That is, we need to handle three things:

  • numbers - any kind of number
  • variables, like "+" - anything not a number
  • application - starts with a parenthesis

In [6]:
## Version 1:

def parser(rexp):
    """
    Reads in a Python list of things, and returns an AST.
    """
    if isinstance(rexp, int):
        return lit_exp(rexp)
    elif isinstance(rexp, str):
        return var_exp(rexp)
    else:
        return app_exp(parser(rexp[0]), List(*map(parser, rexp[1:])))

That's it?! Yes, but we need to define some things before that will run. We need to define:

  • list_exp
  • var_exp
  • app_exp

To think like a Little Schemer, we define some utility functions in Python so that we can write code as if we were in Scheme. Specifically, let's replicate the linked-list of cons/car/cdr:


In [68]:
EmptyList = "()"

def cons(item1, item2):
    return [item1, item2]

def car(exp):
    return exp[0]

def cdr(exp):
    return exp[1]

def cadr(exp):
    return exp[1][0]

def cddr(exp):
    return exp[1][1]

def caddr(exp):
    return exp[1][1][0]

def List(*args):
    "Create a linked-list of items"
    retval = EmptyList
    for arg in reversed(args):
        retval = cons(arg, retval)
    return retval

An improper list, the dotted pair:


In [69]:
cons(1, 2)


Out[69]:
[1, 2]

Make sure that car and cdr can deconstruct a cons cell:


In [24]:
car(cons(1, 2))


Out[24]:
1

In [25]:
cdr(cons(1, 2))


Out[25]:
2

And, a convenience method for constructing Scheme-like lists of multiple items:


In [26]:
List(1, 2, 3, 4)


Out[26]:
[1, [2, [3, [4, '()']]]]

Exercise 3: Why does the list above look like this? Is this similar to how Scheme lists exist? Explain.

Now, we can compose our AST constructor functions:


In [39]:
def lit_exp(value):
    return List("lit-exp", value)

def var_exp(symbol):
    return List("var-exp", symbol)

def app_exp(f, args):
    return List("apply-exp", f, args)

In [40]:
parser(reader(tokenizer("1")))


Out[40]:
['lit-exp', [1, '()']]

In [41]:
parser(reader(tokenizer("+")))


Out[41]:
['var-exp', ['+', '()']]

In [42]:
parser(reader(tokenizer("(+ 1 2)")))


Out[42]:
['apply-exp',
 [['var-exp', ['+', '()']],
  [[['lit-exp', [1, '()']], [['lit-exp', [2, '()']], '()']], '()']]]

We'll be doing those three functions together quite often, so let's make a useful function:


In [43]:
def scalc_parse(string):
    return parser(reader(tokenizer(string)))

In [44]:
scalc_parse("652362")


Out[44]:
['lit-exp', [652362, '()']]

Exercise 4: Try out the scalc_parser. Can it handle nested mathematical expressions? Why? How does the parser handle recursive expressions?

Interpreter

Now we are ready for Step 2: the interpreter. This function takes an AST expression and interprets it (i.e., gives a result). We will call our interpreter evaluator.

Again, as we only have numbers, symbols, and applications, it only needs to handle those three items. To help with debugging, we will also now add a print application.


In [93]:
## Version 1:

def evaluator(expr):
    if car(expr) == "lit-exp":
        return cadr(expr)
    elif car(expr) == "var-exp":
        return cadr(expr) ## for now, return symbol
    elif car(expr) == "apply-exp":
        return evaluator_apply(evaluator(cadr(expr)), 
                               Map(evaluator, caddr(expr)))
    else:
        raise Exception("invalid ast: %s" % expr)

def evaluator_apply(op, operands):
    if op == "print":
        Print(operands)
    elif op == "+":
        return car(operands) + cadr(operands)
    else:
        raise Exception("unknown apply operator: %s" % op)
        
def Map(f, slist):
    if slist == EmptyList:
        return EmptyList
    else:
        return cons( f(car(slist)), Map(f, cdr(slist))) ## recursive!
    
def Print(slist):
    if slist == EmptyList:
        return
    else:
        print(car(slist))
        Print(cdr(slist))

In [94]:
Map(lambda v: v, List(1, 2, 3))


Out[94]:
[1, [2, [3, '()']]]

In [95]:
expr = scalc_parse("3")

In [96]:
evaluator(expr)


Out[96]:
3

In [97]:
def scalc(string):
    return evaluator(scalc_parse(string))

In [98]:
scalc("34")


Out[98]:
34

In [99]:
scalc("(+ 1 1)")


Out[99]:
2

In [100]:
scalc("(print 1 2 3)")


1
2
3

In [101]:
scalc("(+ 1 (+ 100 10))")


Out[101]:
111

Exercise 5: Add the following operators:

  • subtract
  • multiply
  • divide

You can redefine the Python functions here.

Test out these operations thoroughly.

What should you do with divide by zero (and other) errors? What are the choices?

Exercise 6: A quoted item is a literal item. However, if it is a Python list, it should be converted to a proper Scheme list in the parser. You can use the following for a quoted item. The evaluator need not change. Change the parser to use the following sexp function.

def sexp(item):
    """
    Takes an Python list of items and returns Scheme s-exp.
    """
    if isinstance(item, list):
        return List(*map(sexp, item)) # recursion!
    else:
        return item

Exercise 7: Add the literals #t and #f that evaluate to 1 and 0.

In[ ]: scalc("#t")
Out[]: 1
In[ ]: scalc("#f")
Out[]: 0

How will you add these to the language? There is no one right answer, but you should justify your choice among possible options.

Exercise 8: Add an if-expression that works as follows:

(if 1 2 3)

should return 2. Note that 3 should never be evaluted. For example:

(if 0 (/ 2 0) 3)

should return 3, but should not evaluate the divide-by-zero expression.