In Chapter 5, we saw how to build an interpreter by breaking the process into two steps: parsing the concrete syntax into an Abstract Syntax Tree (AST), and then interpreting that AST.
For many real-world problems, this can be slower than what is possible. Many modern languages therefore add an intermediate stage: rather than interpreting the AST directly, the AST is compiled into bytecode, and the final interpreter works on the bytecode rather than on the AST.
Both Java and Python (and many other languages) work this way. Before we build a Python interpreter, let's examine the way that Python actually works.
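To make the pipeline concrete, here is the whole round trip in one cell. This is a minimal sketch using the same compile and python-eval calls we walk through step by step below, and assuming compile accepts a source string directly, as Python's built-in compile does:
(python-eval (compile "1 + 2" "<string>" "eval"))   ;; parse + compile + interpret; should give 3
The sections below take the scenic route: building the AST by hand, compiling it, and finally interpreting the bytecode ourselves.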
First, let's take a look at manually constructing a Python AST representing "1 + 2". We will examine parsing shortly, but for now let's start with the AST. Python comes with a library containing the necessary AST data structures:
In [1]:
(import "ast")
Out[1]:
To use the built-in Python tools for interpreting an expression, we must build the AST in a precise manner. So, we take the idea of "1 + 2" and we express this as:
In [2]:
(define expr
(ast.Expression
(ast.BinOp (ast.Num 1)
(ast.Add)
(ast.Num 2))))
We build the AST precisely as needed by Python, so we can actually let Python compile and run the code. Coming up with the proper Python AST is non-trivial; we'll see the details in the next section. For now, just note that 1 + 2 is represented as a "Binary Operation" (BinOp) which is itself an Expression. The BinOp is really just a special version of our apply-exp
in Scheme AST.
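You can poke at the pieces of this AST with the same dot notation used elsewhere in this chapter. A quick sketch (the Expression node keeps the BinOp in its body field, and a Num keeps its value in n):
(ast.dump expr.body)   ;; just the BinOp node
expr.body.left.n       ;; the left operand's value: 1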
To see a string representation of the AST we can use the ast.dump
function:
In [3]:
(ast.dump expr)
Out[3]:
Because our expr didn't actually come from source code (e.g., from a file or a string) we need to poke in dummy values for line numbers so that the compiler won't complain. There is a function called ast.fix_missing_locations
for doing exactly that:
In [4]:
(set! expr (ast.fix_missing_locations expr))
Then, in Python, the AST is compiled into bytecode using Python's compile
function:
In [5]:
(define bytecode (compile expr "<string>" "eval"))
Finally, Python can interpret the bytecode and give you the result:
In [6]:
(python-eval bytecode)
Out[6]:
In [7]:
bytecode.co_code
Out[7]:
Well then! Not surprisingly, the bytecode is composed of bytes. How long is the entire program?
In [8]:
(len bytecode.co_code)
Out[8]:
Four bytes long. Exactly what is the value of each of those 4 bytes?
In [9]:
(map (lambda (i) i) bytecode.co_code)
Out[9]:
We can use the Python library dis
to disassemble the bytecode. To disassemble means to take the bytes in the bytecode and turn them into a human-readable format.
In [17]:
(import "dis")
Out[17]:
In [18]:
(dis.disassemble bytecode)
That's it: two lines. How do you read this? The number in the first column shows the line number in our Python source code (i.e., the concrete syntax). We didn't actually have source code in this example as we started with AST. But the ast.fix_missing_locations
function stuck in a "1" for us.
The second column is an index into the bytecode, and the third column is that instruction in human-readable form. Let's look up those two indexes in the bytecode:
In [19]:
(getitem bytecode.co_code 0)
Out[19]:
In [20]:
(getitem bytecode.co_code 3)
Out[20]:
Therefore, LOAD_CONST
is represented by 100, and RETURN_VALUE
is represented by 83. If you have taken an assembly language course, this type of output might look familiar. It turns out that bytecode is an assembly language. Usually, assembly languages are designed for a particular computer, a particular CPU in hardware. This assembly language is designed for a Virtual Machine. A Virtual Machine (or VM) doesn't actually exist as hardware, but is interpreted by another program. Java works this way too, although it has its own VM called the Java Virtual Machine, or JVM.
The fourth column of the disassembled bytecode (when present) is the argument to that instruction. The fifth column (when present) is a hint about what the argument means. In this example, it means that the argument is the literal number 3.
Wait a minute... what happened to "1 + 2"? Well, during the compile step, Python can do some optimizations. It recognized that we were adding two constants (1 and 2) and realized that it could do that at compile time rather than waiting until the code was interpreted.
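To convince yourself that this is constant folding at work, compile an expression that Python cannot fold, such as one that mentions a variable. A sketch, assuming compile also accepts a source string directly (as Python's built-in does), using the dis module imported above:
;; x is only known at run time, so the addition cannot be folded away; the
;; disassembly should show LOAD_CONST, LOAD_NAME, BINARY_ADD, and RETURN_VALUE
(dis.dis (compile "1 + x" "<string>" "eval"))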
You can try compiling and running other Python AST expressions. However, putting together Python AST is not as easy as our Scheme AST, so in the next section we will start with Python concrete syntax and parse it automatically into Python AST.
Python is a much more complex language in terms of concrete syntax. It is not just simple lists, like Scheme. It has brackets, newlines, colons, semicolons, dots, commas, and much more. Luckily for us, Python comes with a library to parse itself. To parse Python we use the ast
module again:
In [21]:
(import "ast")
Out[21]:
ast
has a function called parse
for taking valid Python expressions and statements and turning them into AST.
In [22]:
(define pyexpr (ast.parse "1 + 2"))
Again, we can use the ast.dump
to see the AST in a more human-readable format:
In [23]:
(ast.dump pyexpr)
Out[23]:
Notice that Python added a couple of items: the whole expression is wrapped in a Module, and our BinOp is wrapped in an Expr statement.
Also note, again, that +
is parsed as a BinOp. Now we can automatically generate AST from Python concrete syntax.
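Parsing something slightly bigger shows how nested BinOps and operator precedence come out. A quick sketch:
;; 2 * 3 binds tighter than +, so the right operand of the Add is itself a BinOp
(ast.dump (ast.parse "1 + 2 * 3"))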
As noted above, Python generates an assembly language (called bytecode) for a virtual machine (VM). We could build a VM interpreter and interpret the bytes. Let's explore that for just a moment.
If you have taken assembly language, you may have heard the term "instructions". Assembly languages are composed of instructions. In bytecode, these instructions are called opcodes.
How many opcodes/instructions does Python have? The Python dis
module has a dictionary containing the instruction names mapped to instruction opcodes.
In [25]:
(len dis.opmap)
Out[25]:
The Python VM has 114 opcodes. What are the 114 instructions?
In [51]:
(sort < (map (lambda (i) i) (dis.opmap.keys)))
Out[51]:
We saw above that 100 means LOAD_CONST and 83 means RETURN_VALUE. Let's check that in the opmap:
In [26]:
(getitem dis.opmap "LOAD_CONST")
Out[26]:
In [27]:
(getitem dis.opmap "RETURN_VALUE")
Out[27]:
We can list out all of the instructions and opcodes with this:
In [61]:
(for-each (lambda (name) (printf "~a ~a~%" name (getitem dis.opmap name)))
(sort < (map (lambda (i) i) (dis.opmap.keys))))
You can also go the other way round: given an opcode, you can get the instruction name using dis.opname:
In [28]:
(getitem dis.opname 83)
Out[28]:
In [46]:
(getitem dis.opname 100)
Out[46]:
Here are the bytes of our compiled expression once more, indexed by position, which is exactly how a virtual machine will walk through them:
In [116]:
(map (lambda (i)
(getitem bytecode.co_code i))
(range (len bytecode.co_code)))
Out[116]:
Now let's write a tiny virtual machine of our own. It walks through the bytes, keeping a stack of values: opcode 100 (LOAD_CONST) pushes a constant onto the stack, and opcode 83 (RETURN_VALUE) pops the top of the stack as the result:
In [114]:
(define virtual-machine
  (lambda (bytecode)
    (let ((instructions bytecode.co_code)   ;; the raw bytes of the compiled code
          (constants bytecode.co_consts)    ;; the constant table
          (stack '())                       ;; the value stack
          (retval (void)))
      (let loop ((i 0))
        (if (< i (len instructions))
            (let ((byte (getitem instructions i)))
              (cond
                ;; 100 = LOAD_CONST: push the constant named by the next byte
                ((= byte 100)
                 (begin
                   (printf "LOAD_CONST ~a ~%" i)
                   (set! stack (cons (getitem constants (getitem instructions (+ i 1)))
                                     stack))
                   (loop (+ i 2))))
                ;; 83 = RETURN_VALUE: pop the top of the stack as the result
                ((= byte 83)
                 (begin
                   (set! retval (car stack))
                   (set! stack (cdr stack))
                   (printf "RETURN_VALUE ~a ~%" i)
                   (loop (+ i 1))))
                ;; 0 = the unused high argument byte of LOAD_CONST; skip it like a NOP
                ((= byte 0)
                 (begin
                   (printf "NOP ~a ~%" i)
                   (loop (+ i 1))))
                (else (error "virtual-machine" "unknown opcode at ~a" i))))
            retval)))))
In [115]:
(virtual-machine bytecode)
Out[115]:
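Handling more of the instruction set is mostly a matter of adding cond clauses to virtual-machine. For example, here is a sketch (not wired in above) of a clause for BINARY_ADD, which pops the top two values and pushes their sum; we look its opcode up in dis.opmap rather than hard-coding it:
;; sketch: one more clause for the cond in virtual-machine
((= byte (getitem dis.opmap "BINARY_ADD"))
 (begin
   (printf "BINARY_ADD ~a ~%" i)
   (set! stack (cons (+ (car stack) (cadr stack))   ;; pop two values, push their sum
                     (cddr stack)))
   (loop (+ i 1))))                                 ;; no argument bytes for this opcode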
We could continue to fill out the Python virtual machine. However, we are also interested in possibly changing the semantics of how Python works. Rather than work at the bytecode level, let's explore writing an interpreter at the abstract syntax level.
Summary:
In [117]:
(import "ast")
Out[117]:
In [123]:
(define expr (ast.parse "1 + 2"))
In [124]:
(ast.dump expr)
Out[124]:
In order to interpret this AST, we have to handle the following items: Module, Expr, BinOp (with its Add operator), and Num.
Note that Module has a body that is a list of Expr. That means that it acts like a Scheme begin expression. We will handle the Add inside the code evaluating the BinOp.
In [125]:
(define evaluator
(lambda (expr)
(cond
((isinstance expr ast.Module)
(evaluator-begin (vector->list expr.body)))
((isinstance expr ast.Expr)
(evaluator expr.value))
((isinstance expr ast.BinOp)
(evaluator-binop expr.op
(evaluator expr.left)
(evaluator expr.right)))
((isinstance expr ast.Num)
expr.n)
(else (error "evaluator" "invalid ast: ~s" expr)))))
In [126]:
(define evaluator-begin
(lambda (exprs)
(cond
((null? exprs)
(void))
((= 1 (length exprs))
(evaluator (car exprs)))
(else (begin
(evaluator (car exprs))
(evaluator-begin (cdr exprs)))))))
(define evaluator-binop
(lambda (op left right)
(cond
((isinstance op ast.Add)
(+ left right))
(else (error "apply-binop" "unknown operator: ~s" op)))))
In [127]:
(evaluator expr)
Out[127]:
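The evaluator grows the same way the virtual machine does: each new node type or operator is one more cond clause. For instance, a sketch of evaluator-binop extended to handle subtraction (ast.Sub):
(define evaluator-binop
  (lambda (op left right)
    (cond
      ((isinstance op ast.Add)
       (+ left right))
      ((isinstance op ast.Sub)          ;; new: handle subtraction
       (- left right))
      (else (error "evaluator-binop" "unknown operator: ~s" op)))))

(evaluator (ast.parse "10 - 4"))        ;; should give 6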
In [138]:
(define pcalc
(lambda (string)
(let ((pyast (ast.parse string)))
(evaluator pyast))))
Try it out!
In [139]:
(define pyexpr (ast.parse "1; 2; 3; 4"))
In [140]:
(evaluator pyexpr)
Out[140]:
In [141]:
(define bytecode (compile pyexpr "<string>" "exec"))
In [142]:
(python-eval bytecode)
It didn't return anything, whereas our evaluator returned 4.
Exercise: How did the Python interpreter differ from our interpretation of the same expression? Why do you think this is?
In [143]:
(pcalc "1 x 5")
This fails: x is not a symbol that Python's parser recognizes as an operator. That is, we can only use the symbols that Python has predetermined as valid BinOps. If we want to define our own BinOp (not one of +, -, *, /, etc.), then we must parse it as an application:
In [144]:
(pcalc "x(1, 5)")
This parses fine, but our evaluator does not yet know what to do with a function call. Of course, we must handle ast.Call in the evaluator.
In [ ]:
;; adding Call
(define evaluator
(lambda (expr)
(cond
((isinstance expr ast.Module)
(evaluator-begin (vector->list expr.body)))
((isinstance expr ast.Expr)
(evaluator expr.value))
((isinstance expr ast.BinOp)
(evaluator-binop expr.op
(evaluator expr.left)
(evaluator expr.right)))
((isinstance expr ast.Num)
expr.n)
((isinstance expr ast.Call)
(evaluator-apply expr.func.id (map evaluator (vector->list expr.args))))
(else (error "evaluator" "invalid ast: ~s" expr)))))
(define evaluator-apply
(lambda (op operands)
(cond
((string=? op "print")
(apply print operands))
(else (error "evaluator-appy" "unknown apply operator: ~s" op)))))