Michaël Defferrard, PhD student, EPFL LTS2
That is one of the most heard complain about Python. Because CPython, the Python reference implementation, interprets the language (i.e. it compiles Python code to intermediate bytecode which is then interpreted by a virtual machine), it is inherentably slower than compiled languages, especially for computation heavy tasks such as number crunching.
There are three ways around it that we'll explore in this exercise:
In this exercise we'll compare many possible implementations of a function. Our goal is to compare the execution time of our implementations and get a sense of the many ways to write efficient Python code. We test seven implementations:
Along the exercise we'll use the function $$accuracy(\hat{y}, y) = \frac1n \sum_{i=0}^{i=n-1} 1(\hat{y}_i = y_i),$$ where $1(x)$ is the indicator function and $n$ is the number of samples. This function computes the accuracy, i.e. the percentage of correct predictions, of a classifier. A pure Python implementation is given below.
In [ ]:
def accuracy_python(y_pred, y_true):
"""Plain Python implementation."""
num_correct = 0
for y_pred_i, y_true_i in zip(y_pred, y_true):
if y_pred_i == y_true_i:
num_correct += 1
return num_correct / len(y_true)
Below we test and measure the execution time of the above implementation. The %timeit function provided by IPython is a useful helper to measure the execution time of a line of Python code. As we'll see, the above implementation is very inefficient compared to what we can achieve.
In [ ]:
import numpy as np
c = 10 # Number of classes.
n = int(1e6) # Number of samples.
y_true = np.random.randint(0, c, size=n)
y_pred = np.random.randint(0, c, size=n)
print('Expected accuracy: {}'.format(1/c))
print('Empirical accuracy: {}'.format(accuracy_python(y_pred, y_true)))
%timeit accuracy_python(y_pred, y_true)
Specialized libraries, which provide efficient compiled implementations of the heavy computations, is an easy way to solve the performance problem. That is for example NumPy, which uses efficient BLAS and LAPACK implementations as a backend. SciPy and scikit-learn fall in the same category.
Implement below the accuracy function using:
Then test that it provides the correct result and measure it's execution time. How much faster are they compared to the pure Python implementation ?
In [ ]:
def accuracy_numpy(y_pred, y_true):
"""Numpy implementation."""
return np.sum(y_pred == y_true) / y_true.size
def accuracy_sklearn(y_pred, y_true):
"""Scikit-learn implementation."""
from sklearn import metrics
return metrics.accuracy_score(y_pred, y_true)
assert np.allclose(accuracy_numpy(y_pred, y_true), accuracy_python(y_pred, y_true))
assert np.allclose(accuracy_sklearn(y_pred, y_true), accuracy_python(y_pred, y_true))
%timeit accuracy_numpy(y_pred, y_true)
%timeit accuracy_sklearn(y_pred, y_true)
The second option of choice, when the algorirthm does not exist in our favorite libraries and we have to implement it, it to implement in Python and compile it to machine code.
Below you'll compile Python with two frameworks.
While these two approaches offer maximal compatibility with the CPython and NumPy ecosystems, another approach is to use another Python implementation such as PyPy, which features a just-in-time compiler and supports multiple back-ends (C, CLI, JVM). Alternatives are Jython, which runs Python on the Java platform, and IronPython / PythonNet for the .NET platform.
In [ ]:
from numba import jit
# Decorator, same as accuracy_numba = jit(accuracy_python)
@jit
def accuracy_numba(y_pred, y_true):
"""Plain Python implementation, compiled by LLVM through Numba."""
num_correct = 0
for y_pred_i, y_true_i in zip(y_pred, y_true):
if y_pred_i == y_true_i:
num_correct += 1
return num_correct / len(y_true)
In [ ]:
%load_ext Cython
In [ ]:
%%cython
cimport numpy as np
cimport cython
@cython.boundscheck(False) # Turn off bounds-checking for entire function.
@cython.wraparound(False) # Turn off negative index wrapping for entire function.
def accuracy_cython(np.ndarray[long, ndim=1] y_pred, np.ndarray[long, ndim=1] y_true):
"""Python implementation with type information, transpiled to C by Cython."""
cdef int num_total = len(y_true)
cdef int num_correct = 0
cdef int n = y_pred.size
for i in range(n):
if y_pred[i] == y_true[i]:
num_correct += 1
return num_correct / num_total
Evaluate below the performance of those two implementations, while testing their correctness. How do they compare with plain Python and specialized libraries ?
In [ ]:
assert np.allclose(accuracy_numba(y_pred, y_true), accuracy_python(y_pred, y_true))
assert np.allclose(accuracy_cython(y_pred, y_true), accuracy_python(y_pred, y_true))
%timeit accuracy_numba(y_pred, y_true)
%timeit accuracy_cython(y_pred, y_true)
Here we'll explore our third option to make Python faster: implement in another language ! Below you'll
In [ ]:
%%file function.c
double accuracy(long* y_pred, long* y_true, int n) {
int num_correct = 0;
for(int i = 0; i < n; i++) {
if(y_pred[i] == y_true[i])
num_correct++;
}
return (double) num_correct / n;
}
The below cell describe a shell script, which will be executed by IPython as if you typed the commands in your terminal. Those commands are compiling the above C program into a dynamic library with GCC. You can use any other compiler, text editor or IDE to produce the C library. Windows users, you may want to use Microsoft toolchain.
In [ ]:
%%script sh
FILE=function
gcc -c -O3 -Wall -std=c11 -pedantic -fPIC -o $FILE.o $FILE.c
gcc -o lib$FILE.so -shared $FILE.o
file lib$FILE.so
The below cell finally create a wrapper around our C library so that we can easily use it from Python.
In [ ]:
import ctypes
libfunction = np.ctypeslib.load_library('libfunction', './')
libfunction.accuracy.restype = ctypes.c_double
libfunction.accuracy.argtypes = [
np.ctypeslib.ndpointer(dtype=np.int),
np.ctypeslib.ndpointer(dtype=np.int),
ctypes.c_int
]
def accuracy_c(y_pred, y_true):
n = y_pred.size
return libfunction.accuracy(y_pred, y_true, n)
Evaluate below the performance of your C implementation, and test its correctness. How does it compare with the others ?
In [ ]:
assert np.allclose(accuracy_c(y_pred, y_true), accuracy_python(y_pred, y_true))
%timeit accuracy_c(y_pred, y_true)
Same idea as before, with Fortran ! Fortran is an imperative programming language developed in the 50s, especially suited to numeric computation and scientific computing. While you probably won't write new code in Fortran, you may have to interface with legacy code (especially in large and old corporations). Here we'll resort to the f2py utility provided by the Numpy project for the (almost automatic) generation of a wrapper.
In [ ]:
%%file function.f
! The content of this cell is written to the "function.f" file in the current directory.
SUBROUTINE DACCURACY(YPRED, YTRUE, ACC, N)
CF2PY INTENT(OUT) :: ACC
CF2PY INTENT(HIDE) :: N
INTEGER*4 YPRED(N)
INTEGER*4 YTRUE(N)
DOUBLE PRECISION ACC
INTEGER N, NCORRECT
NCORRECT = 0
DO 10 J = 1, N
IF (YPRED(J) == YTRUE(J)) THEN
NCORRECT = NCORRECT + 1
END IF
10 CONTINUE
ACC = REAL(NCORRECT) / N
END
The below command compile the Fortran code and generate a Python wrapper.
In [ ]:
!f2py -c -m function function.f >> /dev/null
In [ ]:
import function
def accuracy_fortran(y_pred, y_true):
return function.daccuracy(y_pred, y_true)
Evaluate below the performance of your Fortran implementation, and test its correctness. How does it compare with the others ?
In [ ]:
assert np.allclose(accuracy_fortran(y_pred, y_true), accuracy_python(y_pred, y_true))
%timeit accuracy_fortran(y_pred, y_true)
In [ ]:
%%capture
# Suppress outputs.
N = 8 # Number of samples: from 1 to 10^N.
methods = ['accuracy_python', 'accuracy_sklearn', 'accuracy_fortran', 'accuracy_numpy',
'accuracy_numba', 'accuracy_cython', 'accuracy_c']
times = np.empty((len(methods), N+1))
for n in range(0, N+1):
y_true = np.random.randint(2, size=10**n)
y_pred = np.random.randint(2, size=len(y_true))
for i, method in enumerate(methods):
res = %timeit -o eval(method)(y_pred, y_true)
times[i, n] = res.best
In [ ]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.figure(figsize=(15, 5))
for i, method in enumerate(methods):
plt.semilogy(range(N+1), times[i,:], '.-', label=method)
plt.legend(loc='upper left')
plt.ylabel('execution time [s]')
plt.xlabel('log_10 number of samples')
plt.xticks(range(N+1))
plt.show();
To reflect on what we've done, answer the following questions:
It turns out the compiled versions by Numba and Cython are almost as fast as C ! Although much easier to write. In this case, they are even faster than Fortran, NumPy and scikit-learn. This gives us the best of both worlds: an interpreted language for rapid development, which can then be compiled for efficient execution in production.