In [1]:
name = '2018-02-19-debugging-profiling'
title = 'Debugging and profiling'
tags = 'optimisation, debugging, profiling, basics'
author = 'Denis Sergeev'

In [2]:
from nb_tools import connect_notebook_to_post
from IPython.core.display import HTML

html = connect_notebook_to_post(name, title, tags, author)

Today we went through some basic tools to inspect Python scripts for errors and performance bottlenecks.

Debugging

Python DeBugger (PDB)

The standard Python tool for interactive debugging is pdb, the Python debugger.

  • This debugger lets the user step through the code line by line to see what might be causing a tricky error.
  • The IPython-enhanced version of it is ipdb, the IPython debugger.

Using pdb from the command line

  • An even more convenient version is pdbpp, a drop-in replacement for pdb with extra features (see the example below).
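
For example, to run a whole script under the debugger from a terminal (my_script.py is just a placeholder name), or to drop into the debugger at a specific line of the script:

# Run the script under pdb from the very first line
python -m pdb my_script.py

# Or set a breakpoint inside the script itself
import pdb; pdb.set_trace()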

Using PDB in IPython or Jupyter

First, we import a function to insert breakpoints:


In [3]:
from IPython.core.debugger import set_trace

Define a simple function and insert a breakpoint:


In [4]:
def fun(foo, bar):
    set_trace()
    return foo + bar

In [5]:
# fun(1, 2)

(Uncomment the call to run it under the debugger.)
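
Once the ipdb prompt appears, the standard pdb commands are available; the most useful ones are:

n(ext)      execute the next line of the current function
s(tep)      step into a function call
c(ontinue)  resume execution until the next breakpoint
p <expr>    print the value of an expression
l(ist)      show the source code around the current line
q(uit)      abort the debugging session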

In IPython, perhaps the most convenient debugging interface is the %debug magic command. If you call it after hitting an exception, it automatically opens an interactive debugging prompt at the point of the exception. The ipdb prompt lets you inspect the current state of the stack, explore the available variables, and even run Python commands!

Example: two functions, one of which calls the other

In [6]:
def func1(a, b):
    return a / b

def func2(x):
    a = x
    b = x - 1
    return func1(a, b)

In [7]:
func2(1)


---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
<ipython-input-7-7cb498ea7ed1> in <module>()
----> 1 func2(1)

<ipython-input-6-586ccabd0db3> in func2(x)
      5     a = x
      6     b = x - 1
----> 7     return func1(a, b)

<ipython-input-6-586ccabd0db3> in func1(a, b)
      1 def func1(a, b):
----> 2     return a / b
      3 
      4 def func2(x):
      5     a = x

ZeroDivisionError: division by zero

In [8]:
# %debug
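
Running %debug after the exception above opens an ipdb prompt inside func1, where the error occurred; p prints an expression, up moves one frame up the stack (into func2), and q quits. An (abridged, illustrative) session might look like this:

ipdb> p a, b
(1, 0)
ipdb> up
ipdb> p x
1
ipdb> q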

If you'd like the debugger to launch automatically whenever an exception is raised, you can turn this behavior on and off with the %pdb magic function:

%pdb on

...some code to debug...

%pdb off

Timing and Profiling

Simple timing

In IPython and Jupyter you can use special magics to measure how much time a snippet of code takes to run.


In [9]:
%timeit sum(range(100))


1.59 µs ± 229 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Note that because this operation is so fast, %timeit automatically does a large number of repetitions. For slower commands, %timeit will automatically adjust and perform fewer repetitions.
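
The same measurement can be made outside IPython with the standard-library timeit module; a minimal sketch:

import timeit

# Total time in seconds for 1000000 executions of the statement
total = timeit.timeit('sum(range(100))', number=1000000)
print(total / 1000000, 'seconds per loop')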

Profiling

Once you have your code working, it can be useful to dig into its efficiency a bit. Sometimes it's useful to check the execution time of a given command or set of commands; other times it's useful to dig into a multiline process and determine where the bottleneck lies in some complicated series of operations.

You can use the built-in cProfile module directly:

import cProfile
cp = cProfile.Profile()
cp.enable()

# some code to profile

cp.disable()
cp.print_stats()
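
In IPython and Jupyter, the %prun magic wraps cProfile in a one-liner (some_function is just a placeholder for whatever you want to profile):

# Profile a single statement and sort the report by cumulative time
%prun -s cumulative some_function()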

Line-by-line profiling

Let's define a function with a large number of loop iterations.


In [10]:
def sum_of_lists(N):
    total = 0
    for i in range(5):
        L = [j ^ (j >> i) for j in range(N)]
        total += sum(L)
    return total

We can now load the line_profiler extension (after installing the line_profiler package with pip or conda).


In [11]:
%load_ext line_profiler

And profile the function using the %lprun magic:


In [12]:
%lprun -f sum_of_lists sum_of_lists(5000)
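
The same line-by-line report can be produced outside the notebook with the kernprof script that ships with line_profiler: decorate the function of interest with @profile and run (my_script.py is a hypothetical file name):

kernprof -l -v my_script.py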

Memory profiling

In a similar manner, we can even do memory profiling (once the memory_profiler package is installed).


In [13]:
%load_ext memory_profiler

In [14]:
%memit sum_of_lists(100000)


peak memory: 50.20 MiB, increment: 7.16 MiB
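
memory_profiler also provides a line-by-line %mprun magic, which only works on functions defined in a module rather than in the notebook itself. A minimal sketch, first writing the function out with the %%file magic (mprun_demo.py is an arbitrary name):

%%file mprun_demo.py
def sum_of_lists(N):
    total = 0
    for i in range(5):
        L = [j ^ (j >> i) for j in range(N)]
        total += sum(L)
    return total

and then, in a new cell:

from mprun_demo import sum_of_lists
%mprun -f sum_of_lists sum_of_lists(100000)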

Disassembling and inspecting Python code

Tracing CPython code execution


In [15]:
def foo(a, b):
    return a + b

Let's disassemble it with the dis module:


In [16]:
import dis
dis.dis(foo)


  2           0 LOAD_FAST                0 (a)
              2 LOAD_FAST                1 (b)
              4 BINARY_ADD
              6 RETURN_VALUE

What's BINARY_ADD?

We crack open CPython's source code and take a look inside Python/ceval.c:

/* Python/ceval.c */
TARGET(BINARY_ADD) {
    PyObject *right = POP();
    PyObject *left = TOP();
    PyObject *sum;
    /* NOTE(haypo): Please don't try to micro-optimize int+int on
       CPython using bytecode, it is simply worthless.
       See http://bugs.python.org/issue21955 and
       http://bugs.python.org/issue10044 for the discussion. In short,
       no patch shown any impact on a realistic benchmark, only a minor
       speedup on microbenchmarks. */
    if (PyUnicode_CheckExact(left) &&
            PyUnicode_CheckExact(right)) {
        sum = unicode_concatenate(left, right, f, next_instr);
        /* unicode_concatenate consumed the ref to left */
    }
    else {
        sum = PyNumber_Add(left, right); // <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
        Py_DECREF(left);
    }
    Py_DECREF(right);
    SET_TOP(sum);
    if (sum == NULL)
        goto error;
    DISPATCH();
}

What's PyNumber_Add(left, right)?

/* Objects/abstract.c */
PyObject *
PyNumber_Add(PyObject *v, PyObject *w)
{
    PyObject *result = binary_op1(v, w, NB_SLOT(nb_add)); // <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
    if (result == Py_NotImplemented) {
        PySequenceMethods *m = v->ob_type->tp_as_sequence;
        Py_DECREF(result);
        if (m && m->sq_concat) {
            return (*m->sq_concat)(v, w);
        }
        result = binop_type_error(v, w, "+");
    }
    return result;
}

What's binary_op1()?

static PyObject *
binary_op1(PyObject *v, PyObject *w, const int op_slot)
{
    PyObject *x;
    binaryfunc slotv = NULL;
    binaryfunc slotw = NULL;

    if (v->ob_type->tp_as_number != NULL)
        slotv = NB_BINOP(v->ob_type->tp_as_number, op_slot);
    if (w->ob_type != v->ob_type &&
        w->ob_type->tp_as_number != NULL) {
        slotw = NB_BINOP(w->ob_type->tp_as_number, op_slot); // <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
        if (slotw == slotv)
            slotw = NULL;
    }
    if (slotv) {
        if (slotw && PyType_IsSubtype(w->ob_type, v->ob_type)) {
            x = slotw(v, w);
            if (x != Py_NotImplemented)
                return x;
            Py_DECREF(x); /* can't do it */
            slotw = NULL;
        }
        x = slotv(v, w);
        if (x != Py_NotImplemented)
            return x;
        Py_DECREF(x); /* can't do it */
    }
    if (slotw) {
        x = slotw(v, w);
        if (x != Py_NotImplemented)
            return x;
        Py_DECREF(x); /* can't do it */
    }
    Py_RETURN_NOTIMPLEMENTED;
}

What's NB_BINOP()?

#define NB_BINOP(nb_methods, slot) \
        (*(binaryfunc*)(& ((char*)nb_methods)[slot]))

Cut to the chase: where's the addition function for two ints (longs)?

/* Objects/longobject.c */
static PyObject *
long_add(PyLongObject *a, PyLongObject *b)
{
    PyLongObject *z;

    CHECK_BINOP(a, b);

    if (Py_ABS(Py_SIZE(a)) <= 1 && Py_ABS(Py_SIZE(b)) <= 1) {
        return PyLong_FromLong(MEDIUM_VALUE(a) + MEDIUM_VALUE(b)); // <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
    }
    if (Py_SIZE(a) < 0) {
        if (Py_SIZE(b) < 0) {
            z = x_add(a, b);
            if (z != NULL) {
                /* x_add received at least one multiple-digit int,
                   and thus z must be a multiple-digit int.
                   That also means z is not an element of
                   small_ints, so negating it in-place is safe. */
                assert(Py_REFCNT(z) == 1);
                Py_SIZE(z) = -(Py_SIZE(z));
            }
        }
        else
            z = x_sub(b, a);
    }
    else {
        if (Py_SIZE(b) < 0)
            z = x_sub(a, b);
        else
            z = x_add(a, b);
    }
    return (PyObject *)z;
}

What's MEDIUM_VALUE()?

/* Objects/longobject.c */
#define MEDIUM_VALUE(x) (assert(-1 <= Py_SIZE(x) && Py_SIZE(x) <= 1),  \
     Py_SIZE(x) < 0 ? -(sdigit)(x)->ob_digit[0] :  \
         (Py_SIZE(x) == 0 ? (sdigit)0 :  \
          (sdigit)(x)->ob_digit[0]))

Why so much code for such a simple operation?

  • The C level is interpreting bytecode instructions (BINARY_ADD in this case).
  • Polymorphism -- the code can handle foo('a', 'b') or any other types that support +.
  • It works for user-defined types too, via the __add__ or __radd__ magic methods (see the sketch below).
  • Error checking everywhere...
  • For adding ints, it does overflow checking, conversions, etc.

All these features mean a lot of code at the C level.
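
A short sketch of the polymorphism point above (the Vec class is made up for illustration): any object implementing __add__ can be passed to foo, and the very same BINARY_ADD bytecode ends up dispatching to Vec.__add__ at run time.

class Vec:
    """A toy 2D vector that supports +."""
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __add__(self, other):
        return Vec(self.x + other.x, self.y + other.y)

# Same bytecode as foo(1, 2), but resolved to Vec.__add__ at run time
v = foo(Vec(1, 2), Vec(3, 4))
print(v.x, v.y)  # 4 6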

What is the performance?


In [17]:
%timeit foo(1, 2)


104 ns ± 1.22 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [18]:
%timeit foo('a', 'b')


150 ns ± 14.2 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

Summary

CPython slowdown 1: interpreted bytecode execution

  • Fetching CPython bytecode ops, managing the stack machine, all ops in ceval.c.
  • Extensive error checking and handling.

CPython slowdown 2: dynamic type resolution

  • Type introspection, dynamic dispatch on every operation / method call, supporting generic operations.
  • And extensive error checking and handling, up and down the call stack.

In [19]:
HTML(html)


Out[19]:

This post was written as an IPython (Jupyter) notebook. You can view or download it using nbviewer.