This is a follow-up to day 7, where I compared NumPy with Numexpr for various operators. Now, I want to focus on just one expression that is listed in the Numexpr documentation (and supposedly shows off its strengths), and compare the performance of Numexpr to two other just-in-time (JIT) compilers, Numba and parakeet.
For this benchmark, we will evaluate the expression A*B-4.1*A > 2.5*B on NumPy arrays and compare the regular NumPy implementation to its Numexpr, Numba, parakeet, and Cython equivalents.
In [1]:
import numpy as np
In [2]:
def numpy_regular(A, B):
    return A*B-4.1*A > 2.5*B
Numexpr is a "fast numerical expression evaluator for NumPy". It uses an internal virtual machine to evaluate expressions more efficiently, breaking the operations down into smaller chunks that fit into the CPU cache. Since compilation happens just-in-time (JIT) at runtime, no separate compilation step is necessary.
The usage is quite simple: we just have to provide the numerical expression that we want to evaluate as a string to the evaluate function, e.g.,
import numexpr
import numpy
a = numpy.array([[1, 2]])
numexpr.evaluate('a ** 2')
>> array([[1, 4]], dtype=int64)
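Numexpr can also evaluate an expression on multiple CPU cores in parallel. As a small aside (not part of the original benchmark), the thread count can be adjusted via numexpr's set_num_threads function, e.g.,

import numexpr
# restrict Numexpr to a single worker thread ...
numexpr.set_num_threads(1)
# ... or use one thread per detected CPU core (the default behavior)
numexpr.set_num_threads(numexpr.detect_number_of_cores())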
In [3]:
import numexpr as ne
In [4]:
def numexpr_evaluate(A, B):
    return ne.evaluate('A*B-4.1*A > 2.5*B')
Numba uses the LLVM compiler infrastructure to compile Python code to machine code. Its strength lies in speeding up code that operates on NumPy arrays. If you want to read more about Numba, please refer to the original website and documentation.
Continuum Analytics write on their website that "the NumPy array ufunc$^1$ support in nopython mode is incomplete at this time. Most if not all ufuncs should work in object mode though."
$^1$ "A universal function (or ufunc for short) is a function that operates on ndarrays in an element-by-element fashion
(http://docs.scipy.org/doc/numpy/reference/ufuncs.html)
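To illustrate what "element-by-element" means, here is a quick ufunc example (just for illustration, not part of the benchmark):

import numpy
a = numpy.array([1.0, 2.0, 3.0])
b = numpy.array([4.0, 5.0, 6.0])
numpy.multiply(a, b)  # ufunc equivalent of a * b
>> array([  4.,  10.,  18.])
numpy.greater(a, b)  # ufunc equivalent of a > b
>> array([False, False, False], dtype=bool)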
In [5]:
from numba import jit
In [6]:
@jit
def numba_jit(A, B):
    return A*B-4.1*A > 2.5*B
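By default, @jit lets Numba fall back to the slower object mode if it cannot compile the function to machine code. As a hedged sketch (whether this particular expression compiles this way depends on the installed Numba version, cf. the quote above), nopython mode can also be requested explicitly, in which case Numba raises an error instead of silently falling back:

from numba import jit

# hypothetical variant that insists on machine-code compilation
@jit(nopython=True)
def numba_jit_nopython(A, B):
    return A*B-4.1*A > 2.5*B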
In [7]:
from numba import vectorize
In [8]:
@vectorize(['u1(float64, float64)'])
def numba_vectorized(A, B):
    return A*B-4.1*A > 2.5*B
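The 'u1(float64, float64)' string is the type signature of the compiled ufunc: it takes two double-precision floats and returns an unsigned 1-byte integer (u1), which stands in for the boolean result here. The resulting ufunc is then applied element-by-element, e.g.,

A = np.random.rand(3, 3)
B = np.random.rand(3, 3)
numba_vectorized(A, B)  # uint8 array of 0s and 1s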
Similar to Numba, parakeet is a Python compiler that optimizes the runtime of numerical computations operating on NumPy data types, such as NumPy arrays.
The usage is also similar to Numba: we just have to put the jit decorator on top of the function that we want to optimize.
In [9]:
from parakeet import jit as para_jit
In [10]:
@para_jit
def parakeet_jit(A, B):
    return A*B-4.1*A > 2.5*B
Maybe we can speed things up a little bit via Cython's C-extensions for Python. Cython is basically a hybrid between C and Python and can be pictured as compiled Python code with type declarations.
Since we are working in an IPython notebook here, we can make use of the very convenient %%cython magic: it will take care of the conversion to C code, the compilation, and finally the loading of the compiled function.
Note that the static type declarations that we add via cdef are not required for Cython to work, but they will speed things up tremendously.
In [11]:
%load_ext cythonmagic
In [12]:
%%cython
import numpy as np
cimport numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
cpdef cython_unrolled(double[:, :] A, double[:, :] B):
    cdef long i, j, rows, cols
    cdef double[:, :] C
    rows = A.shape[0]
    cols = A.shape[1]
    C = np.empty((rows, cols))
    for i in xrange(rows):
        for j in xrange(cols):
            C[i, j] = A[i, j]*B[i, j] - 4.1*A[i, j] > 2.5*B[i, j]
    return np.asarray(C)
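Before timing anything, it doesn't hurt to confirm that all implementations actually agree. The following sanity check is not part of the original benchmark, just a suggested safeguard; note that numba_vectorized returns a uint8 array and cython_unrolled a double array, so both are cast to bool for the comparison:

A = np.random.rand(100, 100)
B = np.random.rand(100, 100)
reference = numpy_regular(A, B)

# every implementation should produce the same boolean mask
for f in [numexpr_evaluate, numba_jit, numba_vectorized,
          parakeet_jit, cython_unrolled]:
    assert np.array_equal(reference, np.asarray(f(A, B)).astype(bool))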
In [19]:
import timeit

orders_n = [10**n for n in range(1, 5)]
funcs = ['numpy_regular', 'numba_jit', 'numba_vectorized',
         'parakeet_jit', 'cython_unrolled', 'numexpr_evaluate']

timings = {f: [] for f in funcs}

for n in orders_n:
    for f in funcs:
        A = np.random.rand(n, n)
        B = np.random.rand(n, n)
        timings[f].append(min(timeit.Timer('%s(A, B)' % f,
                                           'from __main__ import A, B, %s' % f)
                              .repeat(repeat=3, number=10)))
In [20]:
import platform
from llvm import __version__ as llvm__version__
from parakeet import __version__ as parakeet__version__
from numba import __version__ as numba__version__

def print_sysinfo():
    print '\nsystem     :', platform.system()
    print 'release    :', platform.release()
    print 'machine    :', platform.machine()
    print 'processor  :', platform.processor()
    print 'interpreter:', platform.architecture()[0]
    print '\nPython version', platform.python_version()
    print 'compiler', platform.python_compiler()
    print 'NumPy version', np.__version__
    print 'Numexpr version', ne.__version__
    print 'parakeet version', parakeet__version__
    print 'Numba version', numba__version__
    print 'llvm version', llvm__version__
    print '\n\n'
In [21]:
import prettytable

def summary_table(funcs):
    fit_table = prettytable.PrettyTable(['n=%s' % orders_n[-1],
                                         'function',
                                         'rel. performance gain'])
    fit_table.align['function'] = 'l'
    for f in funcs:
        fit_table.add_row(['', f, '{:.2f}x'.format(
            timings['numpy_regular'][-1] / timings[f][-1])])
    print(fit_table)
In [22]:
%matplotlib inline
In [25]:
import matplotlib.pyplot as plt

def plot_figures(funcs):
    fig = plt.figure(figsize=(8, 8))
    for f in funcs:
        plt.plot([i**2 for i in orders_n], [i*100 for i in timings[f]],
                 alpha=0.5, label='%s' % f, marker='o', lw=2)
    plt.legend(loc='upper left')
    plt.xscale('log')
    plt.yscale('log')
    plt.grid()
    plt.xlabel('number of elements per matrix')
    plt.ylabel('time in milliseconds')
    plt.title('Approaches to evaluate the NumPy array expression "A*B-4.1*A > 2.5*B"')
    plt.show()
In [26]:
print_sysinfo()
summary_table(funcs)
plot_figures(funcs)
In [1]:
import multiprocessing
print('This benchmark was done on a machine with %s CPUs' % multiprocessing.cpu_count())