Notebook created by Martín Villanueva - martin.villanueva@usm.cl
- DI UTFSM - April 2017.
In [102]:
%matplotlib inline
%load_ext memory_profiler
import numpy as np
import numexpr as ne
import numba
import math
import random
import matplotlib.pyplot as plt
import scipy as sp
import sys
A JIT compiler runs after the program has started and compiles the code (usually bytecode) on the fly (or just-in-time, as it's called) into a form that's usually faster, typically the host CPU's native instruction set. A JIT compiler has access to dynamic runtime information that a standard compiler doesn't, so it can make better optimizations, such as inlining functions that are used frequently.
This is in contrast to a traditional compiler (AOT, Ahead Of Time compilation) that compiles all the code to machine language before the program is first run.
Numba applies this idea to Python: it translates a subset of Python and NumPy code into fast machine code at runtime.
When is NumPy not sufficient?
(Some kind of) rule of thumb: when a loop in an algorithm needs the result from one iteration to perform the next iteration, it is not possible to implement it in a vectorized way.
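For instance (a minimal illustrative sketch, not one of the original examples), a first-order recurrence such as $x_i = \alpha \, x_{i-1} + b_i$ cannot be written as a single vectorized NumPy expression, because each element depends on the previous one:
In [ ]:
def recurrence(b, alpha=0.5):
    # x[i] = alpha * x[i-1] + b[i]: every step needs the previous result,
    # so the loop cannot be replaced by a single vectorized expression.
    x = np.empty_like(b)
    x[0] = b[0]
    for i in range(1, len(b)):
        x[i] = alpha * x[i-1] + b[i]
    return x

print(recurrence(np.random.random(5)))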
Let's introduce its usage with a naive example:
The function below is a naive sum function that sums all the elements of a given array.
In [5]:
def sum_array(inp):
    I, J = inp.shape
    mysum = 0
    for i in range(I):
        for j in range(J):
            mysum += inp[i, j]
    return mysum
In [16]:
arr = np.random.random((500,500))
In [17]:
sum_array(arr)
Out[17]:
In [24]:
naive = %timeit -o sum_array(arr)
In [25]:
# lazy compilation
@numba.jit
def sum_array_numba(inp):
    I, J = inp.shape
    mysum = 0
    for i in range(I):
        for j in range(J):
            mysum += inp[i, j]
    return mysum
In [26]:
sum_array_numba(arr)
Out[26]:
In [27]:
jitted = %timeit -o sum_array_numba(arr)
In [40]:
print("Improvement: {0} times faster".format(naive.best/jitted.best))
In [15]:
%timeit np.sum(arr)
Some important notes:
If no signature is passed to the numba.jit decorator, Numba will try to automatically detect the types of the input, output and intermediate variables at the first call (lazy compilation). Alternatively, one or more explicit signatures can be passed (eager compilation), as in the cells below.
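As a quick illustration of lazy compilation (a minimal sketch, not part of the original notebook): the first call with a given argument type triggers compilation, and subsequent calls reuse the cached machine code.
In [ ]:
import time

@numba.jit
def total(arr):
    s = 0.0
    for i in range(arr.shape[0]):
        s += arr[i]
    return s

arr_demo = np.random.random(100000)

t0 = time.time(); total(arr_demo); t1 = time.time()
t2 = time.time(); total(arr_demo); t3 = time.time()
print("first call (includes compilation): {:.4f} s".format(t1 - t0))
print("second call (already compiled):    {:.4f} s".format(t3 - t2))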
In [82]:
# single signature
@numba.jit('float64[:] (float64[:], float64[:])')
def sum1(a, b):
    return a + b

a = np.arange(10, dtype=np.float64)
b = np.arange(10, dtype=np.float64)
print(sum1(a, b))
In [59]:
# multiple signatures (polymorphism)
signatures = ['int32[:] (int32[:], int32[:])', 'int64[:] (int64[:], int64[:])',
              'float32[:] (float32[:], float32[:])', 'float64[:] (float64[:], float64[:])']

@numba.jit(signatures)
def sum2(a, b):
    return a + b

a = np.arange(10, dtype=np.int64)
b = np.arange(10, dtype=np.int64)
# print(sum1(a, b))
print(sum2(a, b))
For a full reference of the signature types supported by Numba see the documentation.
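A couple of additional signature forms, just as illustration (the function names here are made up for this example):
In [ ]:
# scalar arguments and a scalar return value
@numba.jit('float64(float64, float64)')
def add_scalars(x, y):
    return x + y

# a 2D array argument, a scalar argument, and a 2D array return value
@numba.jit('float64[:,:](float64[:,:], float64)')
def scale_matrix(M, s):
    return M * s

print(add_scalars(1.0, 2.0))
print(scale_matrix(np.eye(3), 2.0))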
Now that we've run sum1 and sum2 once, they are compiled and we can check what happened behind the scenes. Use the inspect_types method to see how Numba translated the functions.
In [85]:
sum1.inspect_types()
nopython mode
Numba can compile a Python function in two modes: nopython mode, which generates machine code that does not rely on the Python C API, and object mode, which falls back to Python objects and the interpreter's dynamic machinery when type inference fails.
Citing the official documentation:
Numba is aware of NumPy arrays as typed memory regions and so can speed up code using NumPy arrays. Other, less well-typed code will be translated to Python C-API calls, effectively removing the "interpreter" but not removing the dynamic indirection.
Rule of thumb: in order to successfully use nopython mode, we need to ensure that no unsupported native Python features are used inside the function to be jitted.
In [61]:
@numba.jit('float64[:] (float64[:,:], float64[:])', nopython=True)
def dot_numba1(A, b):
    m, n = A.shape
    c = np.empty(m)
    for i in range(m):
        c[i] = np.dot(A[i], b)
    return c
In [69]:
A = np.random.random((1000,1000))
b = np.random.random(1000)
In [70]:
%timeit dot_numba1(A,b)
In [73]:
@numba.jit('float64[:] (float64[:,:], float64[:])', nopython=True)
def dot_numba2(A, b):
    m, n = A.shape
    c = []
    for i in range(m):
        c.append(np.dot(A[i], b))
    return np.array(c)
# Not long ago, this wouldn't have worked: lists were not supported in nopython mode!
In [72]:
%timeit dot_numba2(A,b)
Let's create a silly example where Numba fails in nopython mode. Prepare for a long error...
In [74]:
@numba.jit('float64[:] (float64[:,:], float64[:])', nopython=True)
def dot_numba2(A, b):
    m, n = A.shape
    c = dict()
    for i in range(m):
        c[i] = np.dot(A[i], b)
    return np.array(c.values)
For a full list of the supported native Python features, see here.
Numba and NumPy
To see all the features of NumPy actually supported by Numba, please see the documentation.
In [79]:
# row mean example
@numba.jit('float64[:] (float64[:,:])', nopython=True)
def row_mean(A):
    m, n = A.shape
    mean = np.empty(m)
    for i in range(m):
        mean[i] = np.sum(A[i]) / n
    return mean

A = np.random.random((10, 10))
print(row_mean(A))
Numba also provides a facility for ahead-of-time compilation (AOT), which has the following benefits: there is no compilation overhead at runtime, and the compiled extension module can be used without Numba installed. However, it is much more restrictive than the JIT functionality. For more info see the documentation.
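A minimal sketch of the AOT interface (numba.pycc), with a made-up module and function name, just to show the shape of the API:
In [ ]:
from numba.pycc import CC

cc = CC('aot_module')  # name of the generated extension module (illustrative)

@cc.export('row_sum', 'f8[:](f8[:,:])')  # exported name and explicit signature
def row_sum(A):
    m, n = A.shape
    out = np.empty(m)
    for i in range(m):
        s = 0.0
        for j in range(n):
            s += A[i, j]
        out[i] = s
    return out

# cc.compile() generates the extension module on disk; afterwards it can be
# imported (import aot_module) and used without Numba installed.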
We will simulate a random walk with jumps: at each step the walker moves by ±dx, and it jumps back to zero whenever it leaves a small interval around the origin.
This type of stochastic model is notably used in neuroscience.
We first create a function that returns a -1 or +1 value:
In [126]:
def step():
    if random.random() > .5:
        return 1.
    else:
        return -1.
and create the simulation in pure Python, where the function walk() takes a number of steps as input:
In [127]:
def walk(n):
    x = np.zeros(n)
    dx = 1./n
    for i in range(n-1):
        x_new = x[i] + dx * step()
        if abs(x_new) > 5e-3:
            x[i+1] = 0.
        else:
            x[i+1] = x_new
    return x
Now we create a random walk, plot it and %timeit its execution:
In [128]:
n = 100000
x = walk(n)
plt.figure(figsize=(8,8))
plt.plot(x)
plt.show()
In [133]:
python_t = %timeit -o walk(n)
Now, let's JIT-compile this function with Numba:
In [131]:
@numba.jit(nopython=True)
def step_numba():
    if random.random() > .5:
        return 1.
    else:
        return -1.

@numba.jit(nopython=True)
def walk_numba(n):
    x = np.zeros(n)
    dx = 1./n
    for i in range(n-1):
        x_new = x[i] + dx * step_numba()
        if abs(x_new) > 5e-3:
            x[i+1] = 0.
        else:
            x[i+1] = x_new
    return x
In [134]:
numba_t = %timeit -o walk_numba(n)
In [138]:
print("Improvement: {0} times faster".format(python_t.best/numba_t.best))
Now we will create a Mandelbrot fractal (a task that cannot easily be vectorized) using native Python and Numba.
It basically consists of iterating, with starting point $z_0 = 0$: $$ z_{i+1} = z_{i}^2 + c $$ and finding the values of $c$ for which the sequence remains bounded. In practice we iterate a fixed number of times and record the iteration at which $|z|$ escapes a threshold.
In [86]:
size = 200
iterations = 100
In [90]:
def mandelbrot_python(m, size, iterations):
    for i in range(size):
        for j in range(size):
            c = -2 + 3./size*j + 1j*(1.5 - 3./size*i)
            z = 0
            for n in range(iterations):
                if np.abs(z) <= 10:
                    z = z*z + c
                    m[i, j] = n
                else:
                    break
In [91]:
m = np.zeros((size,size))
mandelbrot_python(m, size, iterations)
In [95]:
plt.figure(figsize=(7,7))
plt.imshow(np.log(m), cmap=plt.cm.hot)
plt.axis('off')
Out[95]:
Now we evaluate the time taken by this function:
In [96]:
%%timeit m = np.zeros((size,size))
mandelbrot_python(m,size,iterations)
Next, we add the numba.jit decorator and let Numba infer the types of all the variables (lazy compilation):
In [99]:
@numba.jit
def mandelbrot_numba(m, size, iterations):
    for i in range(size):
        for j in range(size):
            c = -2 + 3./size*j + 1j*(1.5 - 3./size*i)
            z = 0
            for n in range(iterations):
                if np.abs(z) <= 10:
                    z = z*z + c
                    m[i, j] = n
                else:
                    break
In [100]:
%%timeit m = np.zeros((size,size))
mandelbrot_numba(m, size, iterations)
In [146]:
def test_func(a, b, c):
    """
    Consider that a, b and c are 1D ndarrays.
    """
    return np.sin(a**2 + np.exp(b)) + np.cos(b**2 + np.exp(c)) + np.tan(a**2 + b**2 + c**2)
In [166]:
n = 1000000
a = np.random.random(n)
b = np.random.random(n)
c = np.random.random(n)
In [167]:
%timeit test_func(a,b,c)
Now let's create a Numba function that performs the same operations, but iteratively:
In [168]:
@numba.jit('float64[:] (float64[:], float64[:], float64[:])', nopython=True)
def test_func_numba(a, b, c):
    n = len(a)
    res = np.empty(n)
    for i in range(n):
        res[i] = np.sin(a[i]**2 + np.exp(b[i])) + np.cos(b[i]**2 + np.exp(c[i])) + np.tan(a[i]**2 + b[i]**2 + c[i]**2)
    return res
In [169]:
%timeit test_func_numba(a,b,c)
Then, what is the problem with NumPy?
NumExpr
What is the problem with NumPy? Suppose we want to perform the operation 2*a + 3*b. In NumPy you will need 3 temporary arrays, and it doesn't make efficient use of the cache memory: the results of 2*a and 3*b won't be in cache when you do the add.
A possible solution is to iterate over a and b, computing the operation element by element (we tested this above with Numba). But here is the approach that NumExpr follows:
Arrays are handled in chunks (of 256 elements at a time), using a register machine. As Python code, it looks something like this:
for i in range(0, len(a), 256):
    r0 = a[i:i+256]
    r1 = b[i:i+256]
    multiply(r0, 2, r2)
    multiply(r1, 3, r3)
    add(r2, r3, r2)
    c[i:i+256] = r2
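A small sketch of that same 2*a + 3*b case evaluated with NumExpr (the array names x and y and their size are just illustrative):
In [ ]:
x = np.random.random(1000000)
y = np.random.random(1000000)

# NumPy builds the temporaries 2*x and 3*y before the final addition.
res_np = 2*x + 3*y

# NumExpr evaluates the whole expression chunk by chunk, in a single pass.
res_ne = ne.evaluate('2*x + 3*y')

print(np.allclose(res_np, res_ne))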
Let's use it!
In [185]:
# Change the size of the arrays to see the differences
m = 10000
n = 5000
A = np.random.random((m,n))
B = np.random.random((m,n))
C = np.random.random((m,n))
In [188]:
np_t = %timeit -o test_func(A,B,C)
In [190]:
ne_t = %timeit -o ne.evaluate('sin(A**2 + exp(B)) + cos(B**2 + exp(C)) + tan(A**2 + B**2 + C**2)')
In [191]:
print("Improvement: {0} times".format(np_t.best/ne_t.best))
Additionally, we can explicitly specify the number of threads that NumExpr can use to evaluate the expression with the set_num_threads() function:
In [194]:
n_threads = 4
for i in range(1, n_threads+1):
    ne.set_num_threads(i)
    %timeit ne.evaluate('sin(A**2 + exp(B)) + cos(B**2 + exp(C)) + tan(A**2 + B**2 + C**2)')
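We can also ask NumExpr how many cores it detected on the machine, which is the default number of threads it uses (a quick check):
In [ ]:
# Number of cores detected by NumExpr at import time.
print(ne.detect_number_of_cores())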