Playing with Python Bytecode



https://github.com/ssanderson/pycon-2016



In [1]:
# (Open to Scott standing at podium, on title slide. Joe is hiding in the audience.)
#
# Scott: Hey everyone, thank you all for coming out today. We've just received word 
#        from Joe and Scott via text message that they're not going to be able to 
#        make it today due to a freak scheduling accident.  But we know that you were
#        all super excited to learn about Python bytecode, so I'm gonna try to do my
#        best to give the talk in their place.
#
# Scott: I took a look at their abstract a minute ago...it looks like they 
#        were going to explain CPython's internal code representation, they were
#        going to show how to create a function "from scratch" (whatever that means),
#        and they were going to show how to change the bytecode of Python functions
#        at runtime.
#        
#
# Scott: Okay, so it sounds like this talk is supposed to be about mucking 
#        with compiled Python functions. I guess that means we need a function 
#        to muck with.

def add(a, b):
    return a + b
add(1, 2)


Out[1]:
3

In [2]:
# Scott: We've got our add() function. There must be some way for us to access
#        its bytecode if we're going to play with it.
#        All the magic stuff in Python starts and ends with double
#        underscores.  Maybe we can find the bytecode.

# add.__<TAB>
# add.__code__
# add.__code__.co_<TAB>
# add.__code__.co_argcount
# add.__code__.co_consts
add.__code__.co_code

# Scott: These don't really look like printable characters.
# Scott: Maybe we can understand these better if we look at the byte values as integers.


Out[2]:
b'|\x00\x00|\x01\x00\x17S'

In [3]:
print(list(add.__code__.co_code))

# Scott: Hrm. There's definitely some structure here.  
#        I see (124, 0, 0) followed by (124, 1, 0). Maybe one of
#        these corresponds to 'a' and the other corresponds to 'b'?
#        (confused noises)
#
# Scott: I think I'm a little out of my depth here.  Is there anyone in the audience that
#        knows how bytecode works?
#
# Joe stands in audience, wearing a 'BYTECODE EXPERT' shirt.  
# He turns to face the audience, looking around to see if anyone 
# else will answer the call to action.
#    
# After a moment, he shrugs and says:
#
# Joe: Well, **I'm** actually a Certified Bytecode Expert.
#
# Scott: Great, why don't you come on up here. 
#        Can we get a mic for him? (Joe is already mic'ed)
#
# Joe: It's okay, I brought my own.

# (Joe scrambles up onto the stage.)

# Scott: Wait, you brought you're own micr--

# Joe: Let's get back on track here. You were going in the right direction.
#      The integers in that list are the instructions
#      that the interpreter executes when you call add().
#      There's actually a built-in module for displaying them 
#      in human-readable form.
#      Why don't you import 'dis'?

# import this

# Scott: <starts reading>
# Joe: No, not 'this': dis.  It's the 'disassembly' module.
# Scott: Ah.

import dis

# Joe: You can do `dis.dis` of a function to show the disassembly.

dis.dis(add)

# Scott: What does all this mean?

# Joe: The dis function prints a human-readable representation 
#      of the instructions in the code object.
# 
#      1. There are 8 bytes in that code object, but there are only 4 instructions:
#         LOAD_FAST, LOAD_FAST, BINARY_ADD, RETURN_VALUE.

#      2. The first three bytes in the list correspond to the first LOAD_FAST. 
#         The second three bytes
#         correspond to the second LOAD_FAST, the last two bytes correspond to the BINARY_ADD
#         and RETURN_VALUE, respectively.

# Scott: Why does LOAD_FAST get three bytes, but BINARY_ADD and RETURN_VALUE only get one?

# Joe: 3. The two bytes after 124 are the **arguments** to the LOAD_FAST instruction. They
#         tell the interpreter which local variable to load.  
#         (124, 0, 0) says to load local variable zero.
#         (124, 1, 0) says to load local variable one.
#
#      4. In general, whenever an instruction takes an argument, 
#         the next two bytes contain the argument to the instruction.

# Scott: Why are those one and zero? Don't we want to load a and b?

# Joe:   Argument instructions are usually represented as little-endian 16-bit integers.
#        For instructions with an argument, the last two dis columns show the value of the
#        argument as an integer, followed by the actual meaning of the argument.
#
#        In this case, dis says the first local variable is 'a' and the second local 
#        variable is 'b'.

# Scott: Okay. So LOAD_FAST of 0 tells Python to load 'a'. 
#        Where exactly are we loading it **to**?

# Joe:   LOAD instructions push values onto a stack where they can be 
#        manipulated by other instructions. The BINARY_ADD instruction 
#        doesn't have an argument in the bytecode because it always pops 
#        the top two values off the stack, adds them, and pushes the result 
#        back onto the stack.

# Scott: I think I understand:
# 
#        At the start of the function the stack is empty, we load a and b, 
#        and then we get to the BINARY_ADD. When we execute the add, we
#        pop 'a' and 'b' off the stack, add them together, 
#        and push the result back onto the stack. 
#        Finally, we hit the RETURN_VALUE instruction, which pops the top 
#        value off the stack and returns it to the caller.

# Joe: Perf.

# Scott: What are those numbers next to the instruction names?

# Joe: Those tell you where in the bytecode the instruction is located.
#      The first instruction starts at index 0, and the second instruction starts at index 3
#      because indices 1 and 2 were used for arguments.

# Scott: And 6 and 7 are next to each other because BINARY_ADD is one-byte instruction?

# Joe: Yep.

# Scott: What's the number on the top left

# Joe: That's the line number for the instructions on or below that line.
#      It's a little easier to see on a function with multiple lines.


[124, 0, 0, 124, 1, 0, 23, 83]
 21           0 LOAD_FAST                0 (a)
              3 LOAD_FAST                1 (b)
              6 BINARY_ADD
              7 RETURN_VALUE

In [6]:
# Scott: Ok...how about this?

def add_with_assign(a, b):
    x = a + b
    return x
dis.dis(add_with_assign)

# Scott: So this says that the first four instructions correspond to the fourth Python line,
#        and the next two instructions correspond to the 5th Python line.

# Joe: Yep. Why don't you try something a little more complex.


  4           0 LOAD_FAST                0 (a)
              3 LOAD_FAST                1 (b)
              6 BINARY_ADD
              7 STORE_FAST               2 (x)

  5          10 LOAD_FAST                2 (x)
             13 RETURN_VALUE

In [7]:
def abs(x):
    if x > 0:
        return x
    else:
        return -x
dis.dis(abs)

# Scott: <talk about LOAD_CONST> Argument is 1, but the value is zero.
#        <talk about COMPARE_OP> How does Python know that 4 means "greater than"?

# Joe: Not all instruction arguments are indices. 
#      COMPARE_OP takes an enum value as its argument.  
#      There are entries for less than, equal, not equal, etc.
#      That next instruction:
#
#      Luckily, POP_JUMP_IF_FALSE does exactly what it says: it pops the top value off the stack and
#      jumps to the instruction at the index of its argument if that value is falsey.
#      Otherwise it continues onto the next instruction as normal.
#
# Scott: Okay. So when we get to the POP_JUMP_IF_FALSE, if the result of COMPARE_OP 
#        is truthy, we continue on to the next instruction. Otherwise, 
#        we jump to LOAD_FAST at index 16?

# Joe: Yep. Those arrows are dis' way of noting that byte 16 is a jump target.

# Scott: Okay. I think I understand the first branch, and I understand the second branch.
#        What about the final LOAD_CONST/RETURN_VALUE? 
#        I don't think we can ever hit those instructions.

# Joe: Yep, that's dead code.
#      CPython uses a fairly simple code generation algorithm.  
#      One of the rules is that if the body of a function doesn't end in a return statement, 
#      an implicit return of None is always inserted. In this case, even though it looks like
#      our function ends in a return value, CPython considers the if-statement to be
#      the last statement, so we get the dummy return even though it'll never be hit.

# Scott: That seems kinda wasteful...

# Joe: In most programs, an extra four unused bytes at the end of a function isn't a big deal.
#      That's only half the size of a pointer.
#      The CPython team decided that eliminating those four bytes isn't worth
#      the additional complexity it would add to the compiler.

# Scott: Okay, I guess that makes sense. 
#        Is there any way we could remove that code if we wanted to?

# Joe: Well...you don't **have** to use the CPython compiler to make a code object. You can
#      just construct one yourself like any other object.


  2           0 LOAD_FAST                0 (x)
              3 LOAD_CONST               1 (0)
              6 COMPARE_OP               4 (>)
              9 POP_JUMP_IF_FALSE       16

  3          12 LOAD_FAST                0 (x)
             15 RETURN_VALUE

  5     >>   16 LOAD_FAST                0 (x)
             19 UNARY_NEGATIVE
             20 RETURN_VALUE
             21 LOAD_CONST               0 (None)
             24 RETURN_VALUE

In [8]:
# Scott: Alright, well, let's write our own abs then!
# Joe:   Hold on there killer. Why don't I get you started with something a little 
#        simpler. How about addone?
# Scott: Alright, fine. Let's write it out in Python first so we know what we're going for:

def addone(x):
    return x + 1

# Scott: Okay, so that's our target. You said I need to construct a code object.
#        Where do I find a code constructor?

# Joe:   The types module provides a CodeType.

# from types import CodeType
# print(CodeType.__doc__)

# Scott: Not for the faint of heart...great.

# my_code = CodeType(1, # argcount (Scott)
#                    0, # kwonlyargcount (Scott)
#                    1, # nlocals (Joe)
#                    2, # stacksize (Joe)
                   
# Joe: stacksize tells Python how much space to allocate for storing values on the stack.
#      We need enough slots to hold the maximum number of objects that ever 
#      will appear on the stack simultaneously.
#
# Scott: Okay, so, the biggest the stack is ever going to be is right before the add when
#        both 'x' and 1 are on the stack. So the stacksize should be two.
                   
# Scott: The next entry is `flags`.  What's the deal with those?

# Joe: `flags` is a bitmask of options for the code object.  
#      I've prepared some material to illustrate the flags in depth.
#      If you could be so kind as to press the down arrow on your keyboard.

# Scott: (presses down, then looks up, a beat)
# Scott: (confused sputtering)
# Joe: (brushing off Scott's confusion) Let's get back on track here...if 
#      take a look at the slide, you can see that there's a lot of different options.

In [9]:
# Used to determine if certain extra optimizations can be made.
# In practice, means "is this code for a function (as opposed to a module or class body)".
CO_OPTIMIZED          = 0x0001

# Should a new locals dict be allocated every time this code is executed?
CO_NEWLOCALS          = 0x0002

CO_VARARGS            = 0x0004
CO_VARKEYWORDS        = 0x0008

# Are we defined inside another function?
CO_NESTED             = 0x0010

# Are we a generator?
CO_GENERATOR          = 0x0020

# Do we share any variable cells with another function.
# We could infer this from other info on the code object,
# but this used for optimization purposes by the interpreter.
CO_NOFREE             = 0x0040

# Are we an async-def'd coroutine or a types.coroutine?
CO_COROUTINE          = 0x0080
CO_ITERABLE_COROUTINE = 0x0100

In [10]:
# __future__ flags
CO_FUTURE_DIVISION         = 0x2000
CO_FUTURE_ABSOLUTE_IMPORT  = 0x4000
CO_FUTURE_WITH_STATEMENT   = 0x8000
CO_FUTURE_PRINT_FUNCTION   = 0x10000
CO_FUTURE_UNICODE_LITERALS = 0x20000

# Compiled with enhanced inequality syntax.
CO_FUTURE_BARRY_AS_BDFL    = 0x40000

# Python 3.5 backwards-compat flag.
CO_FUTURE_GENERATOR_STOP   = 0x80000

In [11]:
# code(argcount, kwonlyargcount, nlocals, stacksize, flags, codestring,
#      constants, names, varnames, filename, name, firstlineno,
#      lnotab[, freevars[, cellvars]])

my_code = CodeType(1,             # argcount
                   0,             # kwonlyargcount
                   1,             # nlocals
                   2,             # stacksize
                   (CO_OPTIMIZED | CO_NEWLOCALS | CO_NOFREE),
# (Resume Typing)
                   bytes([124, 0, 0, 100, 0, 0, 23, 83]),
                   (1,),          # constants
                   (),            # names
                   ('x',),        # varnames
                   '<string>',    # filename
                   'addone',      # name
                   42,            # firstlineno
                   b"",           # lnotab
                   (),            # freevars
                   ())            # cellvars

# Scott: Okay, codestring. (types, 'bytes([])') What are our bytes?

# Joe: (quickly, but not memorized) 124, 0, 0, 100, 0, 0, 23, 83

# Scott: Could you elaborate a little bit on that?

# Joe: Well, looking at our function, we need to load 'x', load 1, do a binary add, 
#      and then return the result. To load x, we need a LOAD_FAST instruction.
#      The opcode for LOAD_FAST is 124. 'x' is the only local variable, so the argument is 0.
#      Next we want to load 1 with a LOAD_CONST instruction. The opcode for LOAD_CONST is 100.
#      We're only using one constant, so we can put the value 1 at index 0.
#      We saw earlier that the opcodes for BINARY_ADD and RETURN_VALUE were 23 and 83.

# Scott: It's impressive that you can just rattle that off.

# Joe: Well, you should expect nothing less from a certified bytecode expert.

# Scott: Okay. So if we're doing a LOAD_CONST of zero, I guess we want a tuple containing just one?

# Joe: (thumbs up)

# Scott: Okay, names?

# Joe: Names should be a tuple containing the names of global variables and 
#      attributes that appear in the function.
#      We don't have any, so you just want an empty tuple.

# Scott: One empty tuple, coming right up. Okay...varnames. How is that different from names?

# Joe: Varnames is a tuple containing the names of the local variables for the function, 
#      in the order that they're indexed by LOAD_FAST and STORE_FAST.

# Scott: So that's just 'x'? (types `('x',))

# Joe: (affirmative noises)

# Joe: The next few entries don't really matter for a hand-written code object:
#      filename is normally the name of the file where the code was defined.
#      We don't have a file, but the convention for exec is the word 'string' 
#      in angle brackets. (Scott types)
#      Next is the name of the code object. (Scott types 'addone')
#      Then we have the first line number where the instructions appear. 
#      Pick your favorite number.
#      After that is the line number table. It's a bytes object representing a mapping 
#      from instructions to their line number. We don't really care about line numbers here,
#      so you can just put an empty bytes object.

# Joe: Last but not least, we have the freevars and the cellvars. These are the names of 
#      variables that are shared with other functions. 
#      Since we set CO_NOFREE, those should be empty tuples. (Scott types)

# Scott: Alright, moment of truth: (Scott executes the cell.  Hopefully it works.)

In [12]:
# Scott: Let's fire this baby up!
my_code(1)

# Scott: What gives?
# Joe: A code object isn't executable by itself. What you really need is a **function**.
# Scott: Alright, let me guess: I need FunctionType from the types module.


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-12-24917692c19d> in <module>()
      1 # Scott: Let's fire this baby up!
----> 2 my_code(1)
      3 
      4 # Scott: What gives?
      5 # Joe: A code object isn't executable by itself. What you really need is a **function**.

TypeError: 'code' object is not callable

In [13]:
# from types import FunctionType
# print(FunctionType.__doc__)

# Scott: This doesn't look so bad.

my_addone = FunctionType(my_code, {})

In [14]:
my_addone(1)

# Scott: We did it! I wonder how different is our implementation from Python's...

dis.dis(addone)
print("--------")
dis.dis(my_addone)

# Scott: Other than the made up line numbers, these look pretty identical.
# Joe: Well, not quite. If you look over at the raw argument column, CPython is loading
#      a constant at index 1, but we're loading a constant at index 0.
# Scott: Oh, weird.  I wonder what's at index 0.


  7           0 LOAD_FAST                0 (x)
              3 LOAD_CONST               1 (1)
              6 BINARY_ADD
              7 RETURN_VALUE
--------
 42           0 LOAD_FAST                0 (x)
              3 LOAD_CONST               0 (1)
              6 BINARY_ADD
              7 RETURN_VALUE

In [15]:
print(addone.__code__.co_consts)
print(my_addone.__code__.co_consts)

# Scott: Why is None in CPython's co_consts. Nothing ever touches it.
# Joe: Every code object created by the CPython compiler has None at index 0.
# Scott: So our hand-crafted code object is more compact than Python's.
# Joe: Ever so slightly, yes.


(None, 1)
(1,)

In [16]:
# Scott: Hold up. If CPython is just loading values at certain indices from the 
#        co_consts, what happens if I change them?

my_addone.__code__.co_consts = (2,)

# Joe: Code objects are read-only to guard against those kinds of shenanigans.
#      If that worked, anyone who depended on your addone function would suddenly
#      find themselves with an addtwo function. Probably not what they wanted.

# Scott: Fair enough. So there's no way to modify existing code objects?

# Joe: You can't modify them in place, but it's easy enough to copy all the attributes
#      from one code object to another, and you can swap out any values you want to 
#      change when you're doing your copying.

# Scott: Okay. So maybe we can write some sort of *function*... that 
#        performs a *functional* update... on a function?


---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-16-5026455a5841> in <module>()
      2 #        what happens if I change them?
      3 
----> 4 my_addone.__code__.co_consts = (2,)
      5 
      6 # Joe: Code objects are read-only to guard against that kind of shenanigans.

AttributeError: readonly attribute

In [17]:
def update(f, **kwargs):
    "A function that performs a functional update on a function."
    code = f.__code__
    attrs = [
        'co_argcount', 'co_kwonlyargcount', 'co_nlocals', 'co_stacksize',
        'co_flags', 'co_code', 'co_consts', 'co_names', 'co_varnames',
        'co_filename', 'co_name', 'co_firstlineno', 'co_lnotab',
        'co_freevars', 'co_cellvars',
    ]
    newcode = CodeType(*(kwargs.get(a, getattr(code, a)) for a in attrs))
    return FunctionType(
        newcode, f.__globals__, f.__name__, f.__defaults__, f.__closure__,
    )

# Joe: I took the liberty of preparing this one ahead of time.
# Scott: I'm changing my password when this is over.

In [3]:
# Scott: Okay, so if I do update(my_addone, co_consts=(2,)), 
#        I can *transform* my_addone into my_addtwo.

my_addtwo = update(my_addone, co_consts=(2,))
my_addtwo(2)


Out[3]:
4

In [8]:
# Joe: That's cute, but, you know there's only so much you can do with the 
#      code metadata.  The **real** power comes from manipulating the bytecode itself.

# Scott: (a little bit indignant) Okay, well, co_code is just another attribute.
#        If we can swap out one constants tuple for another, we can just as well swap
#        out the code string.

# Joe: Now you're cooking with gas.  How about we write a function that
#      replaces 23s with 20s in a function's instructions.

# Scott: 23 and 20?

# Joe: Sorry, BINARY_ADD and BINARY_MULTIPLY. Sometimes I forget that not everyone is
#      a certified bytecode expert.

def add_to_mul(f):
    old = f.__code__.co_code
    new = old.replace(bytes([23]), bytes([20]))
    return update(f, co_code=new)

multwo = add_to_mul(addtwo)
multwo(5)

# Joe: See, bytecode hacking isn't so hard once you know how everything works.
#
# (a beat)
#
# Scott: (Scott peers long and hard at the code on the screen, 
#        his brow furrowed with concentration.  Finally, he speaks:)
#        I think there's a bug here.
#
# Joe: (immediately dismissive) How could there be a bug? 
#      **All** we did was replace all the BINARY_ADD opcodes
#      with BINARY_MULTIPLY opcodes.
# 
# Scott: But remember, you told me that not all the bytes are opcodes.
#        Some of them are **arguments**.
# 
# Joe: Okay...so? 23 and 20 only ever mean BINARY_ADD and BINARY_MULTIPLY.
#      (increasingly agitated).  Those would **NEVER** appear as arguments...
#      (suddenly unsure of himself) ...right?
# 
# Scott: Well, what if we wrote a function that needed 23 local variables?
# 
# Joe: Who would write a function with **23** local variab--
#
# Scott: I actually happen to have a function like that lying around.


Out[8]:
10

In [11]:
from string import ascii_lowercase

def get_x(a, b, c, d, e, f, g, h, i, j, k, l, m, 
          n, o, p, q, r, s, t, u, v, w, x, y, z):
    "A function with **26** local variables."
    return x

# Scott: This function has 26 local variables.  
#        If we call it normally, it returns the
#        23rd local variable, 'x'.

# get_x(*ascii_lowercase)

# Scott: But if we apply add_to_mul to get_x...
add_to_mul(get_x)(*ascii_lowercase)

# Joe: (ahh shit...) It turns into `get_u`.


Out[11]:
'u'

In [ ]:
# Scott: I think this could have been a lot worse too.
#        At least there was still a local variable at index
#        20 for us to use.  What would have happened if the
#        index was out of bounds entirely?

baddone = update(addone, co_consts=())
baddone(3)

# Joe: Yeah, that looks pretty bad. I think you crashed the interpreter.
#      Now that I think about it, there are other cases that are
#      just as bad or worse.  Remember those JUMP instructions we talked about
#      earlier?  The arguments there were just indices into the bytecode.
#      If those ever got changed, who knows what would happen?

# Scott: Does that mean we can't ever insert or delete instructions from a
#        code object?

# Joe: Yeah, we'd have to recompute all those offsets.

# Scott: Oof...

# (pause as Joe and Scott reflect on the surprising complexity they've seen)

# Joe: Didn't you say that Joe and Scott wrote a library for dealing with
#      some of these issues?

# Scott: Yeah, codetransformer.  I actually downloaded it right before I came on
#        stage.  Maybe they've got some ideas for making bytecode hacking
#        a little easier.

In [3]:
from codetransformer.transformers import ordereddict_literals

@ordereddict_literals
def make_dict(a, b, c):
    return {a: 1, b: 2, c: 3}

make_dict('a', 'b', 'c')


Out[3]:
OrderedDict([('a', 1), ('b', 2), ('c', 3)])