In Chapter 5, we saw how to build an interpreter by breaking the process into two steps: parsing the concrete syntax into an Abstract Syntax Tree (AST), and then interpreting that AST.
For many real-world problems, this can be slower than what is possible. Many modern languages therefore add an intermediate stage: rather than interpreting the AST directly, the AST is compiled into bytecode, and the final interpreter works on the bytecode rather than on the AST.
Both Java and Python (and many other languages) work this way. Before we build a Python interpreter, let's examine the way that Python actually works.
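To make the pipeline concrete, here is the whole round trip in one cell. This is a minimal sketch using the same compile and python-eval calls we walk through step by step below, and assuming compile accepts a source string directly, as Python's built-in compile does:
(python-eval (compile "1 + 2" "<string>" "eval"))   ;; parse + compile + interpret; should give 3
The sections below take the scenic route: building the AST by hand, compiling it, and finally interpreting the bytecode ourselves.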
First, let's take a look at manually constructing a Python AST representing "1 + 2". We will examine parsing shortly, but for now let's start with the AST. Python comes with a library containing the necessary AST data structures:
In [1]:
(import "ast")
Out[1]:
To use the built-in Python tools for interpreting an expression, we must build the AST in a precise manner. So, we take the idea of "1 + 2" and we express this as:
In [2]:
(define expr
(ast.Expression
(ast.BinOp (ast.Num 1)
(ast.Add)
(ast.Num 2))))
We build the AST precisely as needed by Python, so we can actually let Python compile and run the code. Coming up with the proper Python AST is non-trivial; we'll see the details in the next section. For now, just note that 1 + 2 is represented as a "Binary Operation" (BinOp) which is itself an Expression. The BinOp is really just a special version of our apply-exp
in Scheme AST.
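You can poke at the pieces of this AST with the same dot notation used elsewhere in this chapter. A quick sketch (the Expression node keeps the BinOp in its body field, and a Num keeps its value in n):
(ast.dump expr.body)   ;; just the BinOp node
expr.body.left.n       ;; the left operand's value: 1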
To see a string representation of the AST we can use the ast.dump
function:
In [3]:
(ast.dump expr)
Out[3]:
Because our expr didn't actually come from source code (e.g., from a file or a string) we need to poke in dummy values for line numbers so that the compiler won't complain. There is a function called ast.fix_missing_locations
for doing exactly that:
In [4]:
(set! expr (ast.fix_missing_locations expr))
Then, in Python, the AST is compiled into bytecode using Python's compile
function:
In [5]:
(define bytecode (compile expr "<string>" "eval"))
Finally, Python can interpret the bytecode and give you the result:
In [6]:
(python-eval bytecode)
Out[6]:
In [7]:
bytecode.co_code
Out[7]:
Well then! Not surprisingly, the bytecode is composed of bytes. How long is the entire program?
In [8]:
(len bytecode.co_code)
Out[8]:
Four bytes long. Exactly what is the value of each of those 4 bytes?
In [9]:
(map (lambda (i) i) bytecode.co_code)
Out[9]:
We can use the Python library dis
to disassemble the bytecode. To disassemble means to take the bytes in the bytecode and turn them into a human-readable format.
In [17]:
(import "dis")
Out[17]:
In [18]:
(dis.disassemble bytecode)
That's it: two lines. How do you read this? The number in the first column shows the line number in our Python source code (i.e., the concrete syntax). We didn't actually have source code in this example as we started with AST. But the ast.fix_missing_locations
function stuck in a "1" for us.
The second column is an index into the bytecode, and the third column is that instruction in human-readable form. Let's look up those two indexes in the bytecode:
In [19]:
(getitem bytecode.co_code 0)
Out[19]:
In [20]:
(getitem bytecode.co_code 3)
Out[20]:
Therefore, LOAD_CONST
is represented by 100, and RETURN_VALUE
is represented by 83. If you have taken an assembly language course, this type of output might look familiar. It turns out that bytecode is an assembly language. Usually, assembly languages are designed for a particular computer, a particular CPU in hardware. This assembly language is designed for a Virtual Machine. A Virtual Machine (or VM) doesn't actually exist as hardware, but is interpreted by another program. Java works this way too, although it has its own VM called the Java Virtual Machine, or JVM.
The fourth column of the disassembled bytecode (when present) is the argument to that instruction. The fifth column (when present) is a hint about what the argument means. In this example, it means that the argument is the literal number 3.
Wait a minute... what happened to "1 + 2"? Well, during the compile step, Python can do some optimizations. It recognized that we were adding two constants (1 and 2) and realized that it could do that at compile time rather than waiting until the code was interpreted.
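To convince yourself that this is constant folding at work, compile an expression that Python cannot fold, such as one that mentions a variable. A sketch, assuming compile also accepts a source string directly (as Python's built-in does), using the dis module imported above:
;; x is only known at run time, so the addition cannot be folded away; the
;; disassembly should show LOAD_CONST, LOAD_NAME, BINARY_ADD, and RETURN_VALUE
(dis.dis (compile "1 + x" "<string>" "eval"))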
You can try compiling and running other Python AST expressions. However, putting together Python AST is not as easy as our Scheme AST, so in the next section we will start with Python concrete syntax and parse it automatically into Python AST.
Python is a much more complex language in terms of concrete syntax. It is not just simple lists, like Scheme. It has brackets, newlines, colons, semicolons, dots, commas, and much more. Luckily for us, Python comes with a library to parse itself. To parse Python we use the ast
module again:
In [21]:
(import "ast")
Out[21]:
ast
has a function called parse
for taking valid Python expressions and statements and turning them into AST.
In [22]:
(define pyexpr (ast.parse "1 + 2"))
Again, we can use the ast.dump
to see the AST in a more human-readable format:
In [23]:
(ast.dump pyexpr)
Out[23]:
Notice that Python added a couple of items: the whole expression is wrapped in a Module, and our BinOp is wrapped in an Expr statement.
Also note, again, that +
is parsed as a BinOp. Now we can automatically generate AST from Python concrete syntax.
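Parsing something slightly bigger shows how nested BinOps and operator precedence come out. A quick sketch:
;; 2 * 3 binds tighter than +, so the right operand of the Add is itself a BinOp
(ast.dump (ast.parse "1 + 2 * 3"))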
As noted above, Python generates an assembly language (called bytecode) for a virtual machine (VM). We could build a VM interpreter and interpret the bytes. Let's explore that for just a moment.
If you have taken assembly language, you may have heard the term "instructions". Assembly languages are composed of instructions. In bytecode, these instructions are called opcodes.
How many opcodes/instructions does Python have? The Python dis
module has a dictionary containing the instruction names mapped to instruction opcodes.
In [25]:
(len dis.opmap)
Out[25]:
The Python VM has 114 opcodes. What are the 114 instructions?
In [51]:
(sort < (map (lambda (i) i) (dis.opmap.keys)))
Out[51]:
We saw above that 100 means LOAD_CONST and 83 means RETURN_VALUE. Let's check that in the opmap:
In [26]:
(getitem dis.opmap "LOAD_CONST")
Out[26]:
In [27]:
(getitem dis.opmap "RETURN_VALUE")
Out[27]:
We can list out all of the instructions and opcodes with this:
In [61]:
(for-each (lambda (name) (printf "~a ~a~%" name (getitem dis.opmap name)))
(sort < (map (lambda (i) i) (dis.opmap.keys))))
You can also go the other way round: given an opcode, you can get the instruction name using dis.opname:
In [28]:
(getitem dis.opname 83)
Out[28]:
In [46]:
(getitem dis.opname 100)
Out[46]:
Here are the bytes of our compiled expression once more, indexed by position, which is exactly how a virtual machine will walk through them:
In [116]:
(map (lambda (i)
(getitem bytecode.co_code i))
(range (len bytecode.co_code)))
Out[116]:
Now let's write a tiny virtual machine of our own. It walks through the bytes, keeping a stack of values: opcode 100 (LOAD_CONST) pushes a constant onto the stack, and opcode 83 (RETURN_VALUE) pops the top of the stack as the result:
In [114]:
(define virtual-machine
  (lambda (bytecode)
    (let ((instructions bytecode.co_code)   ;; the raw bytes of the compiled code
          (constants bytecode.co_consts)    ;; the constant table
          (stack '())                       ;; the value stack
          (retval (void)))
      (let loop ((i 0))
        (if (< i (len instructions))
            (let ((byte (getitem instructions i)))
              (cond
                ;; 100 = LOAD_CONST: push the constant named by the next byte
                ((= byte 100)
                 (begin
                   (printf "LOAD_CONST ~a ~%" i)
                   (set! stack (cons (getitem constants (getitem instructions (+ i 1)))
                                     stack))
                   (loop (+ i 2))))
                ;; 83 = RETURN_VALUE: pop the top of the stack as the result
                ((= byte 83)
                 (begin
                   (set! retval (car stack))
                   (set! stack (cdr stack))
                   (printf "RETURN_VALUE ~a ~%" i)
                   (loop (+ i 1))))
                ;; 0 = the unused high argument byte of LOAD_CONST; skip it like a NOP
                ((= byte 0)
                 (begin
                   (printf "NOP ~a ~%" i)
                   (loop (+ i 1))))
                (else (error "virtual-machine" "unknown opcode at ~a" i))))
            retval)))))
In [115]:
(virtual-machine bytecode)
Out[115]:
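Handling more of the instruction set is mostly a matter of adding cond clauses to virtual-machine. For example, here is a sketch (not wired in above) of a clause for BINARY_ADD, which pops the top two values and pushes their sum; we look its opcode up in dis.opmap rather than hard-coding it:
;; sketch: one more clause for the cond in virtual-machine
((= byte (getitem dis.opmap "BINARY_ADD"))
 (begin
   (printf "BINARY_ADD ~a ~%" i)
   (set! stack (cons (+ (car stack) (cadr stack))   ;; pop two values, push their sum
                     (cddr stack)))
   (loop (+ i 1))))                                 ;; no argument bytes for this opcode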
We could continue to fill out the Python virtual machine. However, we are also interested in possibly changing the semantics of how Python works. Rather than work at the bytecode level, let's explore writing an interpreter at the abstract syntax level.
Summary:
In [117]:
(import "ast")
Out[117]:
In [123]:
(define expr (ast.parse "1 + 2"))
In [124]:
(ast.dump expr)
Out[124]:
In order to interpret this AST, we have to handle the following items: Module, Expr, BinOp (with its Add operator), and Num.
Note that Module has a body that is a list of Expr. That means that it acts like a Scheme begin expression. We will handle the Add inside the code evaluating the BinOp.
In [125]:
(define evaluator
(lambda (expr)
(cond
((isinstance expr ast.Module)
(evaluator-begin (vector->list expr.body)))
((isinstance expr ast.Expr)
(evaluator expr.value))
((isinstance expr ast.BinOp)
(evaluator-binop expr.op
(evaluator expr.left)
(evaluator expr.right)))
((isinstance expr ast.Num)
expr.n)
(else (error "evaluator" "invalid ast: ~s" expr)))))
In [126]:
(define evaluator-begin
(lambda (exprs)
(cond
((null? exprs)
(void))
((= 1 (length exprs))
(evaluator (car exprs)))
(else (begin
(evaluator (car exprs))
(evaluator-begin (cdr exprs)))))))
(define evaluator-binop
(lambda (op left right)
(cond
((isinstance op ast.Add)
(+ left right))
(else (error "apply-binop" "unknown operator: ~s" op)))))
In [127]:
(evaluator expr)
Out[127]:
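The evaluator grows the same way the virtual machine does: each new node type or operator is one more cond clause. For instance, a sketch of evaluator-binop extended to handle subtraction (ast.Sub):
(define evaluator-binop
  (lambda (op left right)
    (cond
      ((isinstance op ast.Add)
       (+ left right))
      ((isinstance op ast.Sub)          ;; new: handle subtraction
       (- left right))
      (else (error "evaluator-binop" "unknown operator: ~s" op)))))

(evaluator (ast.parse "10 - 4"))        ;; should give 6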
In [138]:
(define pcalc
(lambda (string)
(let ((pyast (ast.parse string)))
(evaluator pyast))))
Try it out!
In [139]:
(define pyexpr (ast.parse "1; 2; 3; 4"))
In [140]:
(evaluator pyexpr)
Out[140]:
In [141]:
(define bytecode (compile pyexpr "<string>" "exec"))
In [142]:
(python-eval bytecode)
It didn't return anything, whereas our evaluator returned 4.
Exercise: How did the Python interpreter differ from our interpretation of the same expression? Why do you think this is?
In [143]:
(pcalc "1 x 5")
This fails: x is not a symbol that Python's parser recognizes as an operator. That is, we can only use the symbols that Python has predetermined as valid BinOps. If we want to define our own BinOp (not one of +, -, *, /, etc.), then we must parse it as an application:
In [144]:
(pcalc "x(1, 5)")
This parses fine, but our evaluator does not yet know what to do with a function call. Of course, we must handle ast.Call in the evaluator.
In [ ]:
;; adding Call
(define evaluator
(lambda (expr)
(cond
((isinstance expr ast.Module)
(evaluator-begin (vector->list expr.body)))
((isinstance expr ast.Expr)
(evaluator expr.value))
((isinstance expr ast.BinOp)
(evaluator-binop expr.op
(evaluator expr.left)
(evaluator expr.right)))
((isinstance expr ast.Num)
expr.n)
((isinstance expr ast.Call)
(evaluator-apply expr.func.id (map evaluator (vector->list expr.args))))
(else (error "evaluator" "invalid ast: ~s" expr)))))
(define evaluator-apply
(lambda (op operands)
(cond
((string=? op "print")
(apply print operands))
(else (error "evaluator-appy" "unknown apply operator: ~s" op)))))