NB: This is based on a longer workshop about Python optimization that I gave for the other students earlier this year. There I also covered other Python optimization tools (e.g. numba) and some simple approaches to parallel programming in Python. The Jupyter notebooks from this workshop are available here: http://www.ast.cam.ac.uk/~bs538/programming.html.
After testing various options for making numerical Python code fast, Cython is my clear favourite. For my current main project, all numerically expensive parts are written as Cython extensions.
Here is the procedure I would recommend for optimizing a piece of numerical code.
For the sake of this tutorial we focus exclusively on optimizing a given piece of code, but profiling and choosing a good algorithm are equally important!
NB: There are built-in functions in scipy and astropy to do exactly this computation, but that's not the point here.
In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
In [2]:
#helper function to generate points
def gen_points_sphere(n):
    """
    generate random points on sphere
    """
    #angles
    costheta = 2*np.random.rand(n)-1
    theta = np.arccos(costheta)
    phi = 2*np.pi*np.random.rand(n)
    #unit vectors on the sphere
    x = np.sin(theta)*np.cos(phi)
    y = np.sin(theta)*np.sin(phi)
    z = np.cos(theta)
    vec = np.array([x,y,z]).T
    return vec,theta,phi
n = 100000
vec,theta,phi = gen_points_sphere(n)
The next cell is just for visualization and can be commented out if you do not have healpy installed.
In [3]:
import healpy as hp
def skymap(theta,phi,nside):
    """
    creates healpix map of density on the sky
    """
    ipix = hp.ang2pix(nside,theta,phi)
    testmap = np.bincount(ipix,minlength=hp.nside2npix(nside))
    return testmap
testmap = skymap(theta,phi,64)
hp.mollview(testmap)
Generate the populations for the example. We begin with a small number of objects.
In [4]:
# objects (obj)
vec_obj,theta_obj,phi_obj = gen_points_sphere(5000) # ~500,000 in my real use case
# contaminants (ps)
vec_ps,theta_ps,phi_ps = gen_points_sphere(500) # ~50,000 in my real use case
print(vec_obj.shape,vec_ps.shape)
# maximum separation:
maxsep_deg = 1.
cos_maxsep = np.cos(np.deg2rad(maxsep_deg))
If we represent our points by 3D unit vectors, then $\cos(\Delta\sigma) = \mathbf{n}_1 \cdot \mathbf{n}_2$. We therefore only need to compute one dot product per pair, which is faster than computing the great-circle distance.
NB: For large N a k-d tree is more efficient (this is what the astropy version uses internally). But for the sake of having an example that is easy to implement, let's stick with the dot product.
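As an aside (not part of the original example, and not necessarily what astropy does internally), here is a rough sketch of the k-d tree idea using scipy.spatial.cKDTree; the chord length 2 sin(maxsep/2) on the unit sphere corresponds to the maximum angular separation, and the function name is purely illustrative.
import numpy as np
from scipy.spatial import cKDTree

def angdistcut_kdtree(vec_obj, vec_ps, maxsep_deg):
    # chord length on the unit sphere corresponding to the maximum separation
    r_chord = 2*np.sin(np.deg2rad(maxsep_deg)/2.)
    # build the tree on the (smaller) contaminant sample
    tree = cKDTree(vec_ps)
    # for every object, list all contaminants within the chord radius
    neighbours = tree.query_ball_point(vec_obj, r_chord)
    # flag objects with at least one contaminant nearby
    return np.flatnonzero([len(nb) > 0 for nb in neighbours])

#result_kdtree = angdistcut_kdtree(vec_obj, vec_ps, maxsep_deg)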
In [5]:
def angdistcut_python_naive(vec_obj,vec_ps,cos_maxsep):
    nobj = vec_obj.shape[0]
    nps = vec_ps.shape[0]
    dim = vec_obj.shape[1]
    #objects to be deleted
    out = []
    for i in range(nobj):
        for j in range(nps):
            cos = 0.
            #compute dot product
            for k in range(dim):
                cos += vec_obj[i,k] * vec_ps[j,k]
            #stop once we have found one contaminant
            if cos > cos_maxsep:
                out.append(i)
                break
    return np.array(out)
In [6]:
%%time
result_python_naive = angdistcut_python_naive(vec_obj,vec_ps,cos_maxsep)
This is a more sensible solution, making use of numpy built-ins.
In [7]:
def angdistcut_numpy(vec_obj,vec_ps,cos_maxsep):
    nobj = vec_obj.shape[0]
    nps = vec_ps.shape[0]
    # fast vectorized way to compute dot product
    # (loses the advantage of stopping once one contaminant is found)
    # (but the numpy speed gains easily outweigh that)
    cos_arr = np.einsum('ik,jk->ij',vec_obj,vec_ps)
    # number of contaminants per object
    temp = (cos_arr > cos_maxsep).sum(axis=1)
    #indices of objects that should be removed
    out = np.flatnonzero(temp)
    return out
In [8]:
%%time
result_numpy = angdistcut_numpy(vec_obj,vec_ps,cos_maxsep)
In [9]:
# check that results agree
assert np.array_equal(result_python_naive,result_numpy)
Using numpy has given us a speed-up of a factor of ~200. This is quite typical for computations that can be vectorized efficiently.
In [10]:
# expected runtime for full sample (100 times as many objects and contaminants)
20e-3*100*100/3600. #hours
Out[10]:
0.05555555555555555
In terms of runtime, our problem is solved. However, we have not considered memory usage so far. For its vectorized computation to be efficient, numpy has to store everything in large arrays that are contiguous in memory. Simply to store the intermediate array cos_arr we need at least:
In [11]:
# double precision (64 bit = 8 byte per element)
# Nobj*Nps*8
5e5*5e4*8/1e9 #in GB
Out[11]:
200.0
Our work desktops have 16 GB of RAM, so suddenly memory becomes the bottleneck. As soon as data needs to be swapped from memory to the hard drive, everything becomes really slow.
On top of that, some numpy functions allocate additional temporary arrays "behind the scenes", making this problem even worse.
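A common workaround (sketched here as an aside; the chunk size of 10000 is an arbitrary choice, not tuned) is to process the objects in blocks, so that only a small slice of cos_arr has to live in memory at any one time:
import numpy as np

def angdistcut_numpy_chunked(vec_obj, vec_ps, cos_maxsep, chunk=10000):
    found = np.zeros(vec_obj.shape[0], dtype=bool)
    for start in range(0, vec_obj.shape[0], chunk):
        stop = start + chunk
        # cos_arr now only has shape (chunk, nps) instead of (nobj, nps)
        cos_arr = np.einsum('ik,jk->ij', vec_obj[start:stop], vec_ps)
        found[start:stop] = (cos_arr > cos_maxsep).any(axis=1)
    return np.flatnonzero(found)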
In my experience, the two above are the most mature and useful ones, but there are certainly other options, such as:
NB: The latter three require you to write C/C++ or Fortran code. If you are proficient in one of these, you might not run into the issue of having to optimize numerical Python code so frequently. Cython might still be worth learning, though, as Python (and also Cython) is much nicer to write than C/C++ or Fortran.
Cython is a superset of Python. It allows static type definitions and translates our Python code into C code, which we can then compile and import back into Python as an external library (a .so file on Linux). In Jupyter notebooks, all of this can be done automatically behind the scenes.
Outside of Jupyter notebooks, we need to write a very short setup script (essentially a Python Makefile, usually setup.py) that handles the compilation for us, see e.g. here: http://cython.readthedocs.io/en/latest/src/tutorial/cython_tutorial.html
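For reference, a minimal setup.py in the spirit of the linked tutorial might look like the sketch below (the module name angdist.pyx is just a placeholder; the numpy include directory is needed because we cimport numpy later on):
# setup.py -- compile with: python setup.py build_ext --inplace
import numpy
from setuptools import setup, Extension
from Cython.Build import cythonize

ext = Extension("angdist", ["angdist.pyx"],
                include_dirs=[numpy.get_include()])
setup(ext_modules=cythonize([ext]))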
In [12]:
%load_ext Cython
In [13]:
%%cython -a
# -a provides annotation
# yellow annotations: calls to Python --> slow
# want to remove all yellow from our loops
#Python functions,attributes from numpy
import numpy as np
# C functions,attributes
cimport numpy as np
# Cython handles the namespace ambiguity internally in this case
def angdistcut_cython_naive(vec_obj, vec_ps, cos_maxsep):
    nobj = vec_obj.shape[0]
    nps = vec_ps.shape[0]
    dim = vec_obj.shape[1]
    #objects to be deleted
    out = []
    for i in range(nobj):
        for j in range(nps):
            cos = 0.
            #compute dot product
            for k in range(dim):
                cos += vec_obj[i,k] * vec_ps[j,k]
            #stop once we have found one contaminant
            if cos > cos_maxsep:
                out.append(i)
                break
    return np.array(out)
Out[13]:
In [14]:
%%time
result_cython_naive = angdistcut_cython_naive(vec_obj,vec_ps,cos_maxsep)
In [15]:
assert np.array_equal(result_cython_naive,result_numpy)
Essentially all Python code is valid Cython code, but without adding type definitions we only get a very small speed-up.
NB: I have also made one small change to the algorithm: instead of appending to a list, I use an array to store which objects are flagged as contaminated, and at the end return the nonzero elements of that array (similar to the numpy case above).
In [16]:
%%cython -a
import numpy as np
cimport numpy as np
def angdistcut_cython_typed(double[:,::1] vec_obj, double[:,::1] vec_ps, double cos_maxsep):
    """
    NB: double[:] --> "typed memoryview": gives fast access to numpy-like arrays in Cython
    double[:,::1] --> 2D typed memoryview, with the last index varying fastest
    for this it needs to be C-contiguous
    """
    cdef:
        int nps = vec_ps.shape[0]
        int nobj = vec_obj.shape[0]
        int dim = vec_obj.shape[1]
        #use int array instead of list
        int[:] found = np.zeros(nobj, np.int32)
        int i,j
        double cos
    for i in range(nobj):
        for j in range(nps):
            cos = (vec_obj[i,0]*vec_ps[j,0] +
                   vec_obj[i,1]*vec_ps[j,1] +
                   vec_obj[i,2]*vec_ps[j,2])
            if cos > cos_maxsep:
                found[i] = 1
                break
    #convert result from typed memoryview into an array before returning
    return np.flatnonzero(np.asarray(found))
Out[16]:
Because of the way we have created the input arrays, they are not C-contiguous (row-major ordering) in memory, but Fortran-contiguous (column-major ordering). If we simply passed them into the Cython function, we would get an error. Of course we could have anticipated this when creating them, but it is instructive to handle it here. (For pure Python/numpy code this is never an issue.)
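(Just as a quick illustration: the memory layout of a numpy array can be checked via its flags attribute.)
print(vec_obj.flags['C_CONTIGUOUS'], vec_obj.flags['F_CONTIGUOUS'])
# expect False True here, since vec_obj is the transpose of a C-contiguous array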
In [17]:
#change memory ordering
vec_obj_c = np.ascontiguousarray(vec_obj)
vec_ps_c = np.ascontiguousarray(vec_ps)
In [18]:
%%time
result_cython_typed = angdistcut_cython_typed(vec_obj_c,vec_ps_c,cos_maxsep)
In [19]:
assert np.array_equal(result_cython_typed,result_numpy)
With a little typing effort, we are already at roughly the same speed as the (internally highly optimized!) numpy version.
NB: again, this statement is somewhat platform-dependent.
A full overview of the available compiler directives: http://cython.readthedocs.io/en/latest/src/reference/compilation.html#compiler-directives
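As an aside (not used in this notebook), the same directives can also be set globally for a whole .pyx file or %%cython cell with a special comment at the very top, instead of decorating every function individually:
# cython: boundscheck=False, wraparound=False, nonecheck=False, cdivision=True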
In [20]:
%%cython -a
import numpy as np
cimport numpy as np
cimport cython
@cython.boundscheck(False)
@cython.wraparound(False)
@cython.nonecheck(False)
@cython.cdivision(True)
def angdistcut_cython_opt(double[:,::1] vec_obj, double[:,::1] vec_ps, double cos_maxsep):
    """
    disabled some of the run-time checks, and also enabled C division
    the latter is pointless here, but can make large differences when divisions are involved
    """
    cdef:
        int nps = vec_ps.shape[0]
        int nobj = vec_obj.shape[0]
        int dim = vec_obj.shape[1]
        #use int array instead of list
        int[:] found = np.zeros(nobj, np.int32)
        int i,j
        double cos
    for i in range(nobj):
        for j in range(nps):
            cos = (vec_obj[i,0]*vec_ps[j,0] +
                   vec_obj[i,1]*vec_ps[j,1] +
                   vec_obj[i,2]*vec_ps[j,2])
            if cos > cos_maxsep:
                found[i] = 1
                break
    return np.flatnonzero(np.asarray(found))
Out[20]:
In [21]:
%%time
result_cython_opt = angdistcut_cython_opt(vec_obj_c,vec_ps_c,cos_maxsep)
In [22]:
assert np.array_equal(result_cython_opt,result_numpy)
By adding these decorators, the Cython version now runs significantly faster than numpy. (NB: Part of this is because our Cython implementation does not calculate any further dot products once one contaminant has been found for a given object.)
Cython offers easy integration of OpenMP-parallel loops (very similar to C, C++ and Fortran). This circumvents the Global Interpreter Lock (GIL), which normally prohibits the execution of parallel code in Python (for reasons related to Python's internal memory management). More info here: https://wiki.python.org/moin/GlobalInterpreterLock
NB: The different threads share access to the same memory. Therefore we need to make sure that our implementation is thread-safe, i.e. that different threads do not write to the same memory location.
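As a side note (a minimal sketch, not part of the original example): if we did need to accumulate into a single shared variable, Cython recognizes in-place operations on it inside prange and turns them into an OpenMP reduction, which keeps the pattern thread-safe. Something along these lines (again compiled with the OpenMP flags):
from cython.parallel import prange

def parallel_sum(double[::1] x):
    cdef:
        Py_ssize_t i
        double total = 0.
    #the in-place += on total inside prange becomes an OpenMP reduction
    for i in prange(x.shape[0], nogil=True):
        total += x[i]
    return total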
In [23]:
%%cython --compile-args=-fopenmp --link-args=-fopenmp
import numpy as np
cimport numpy as np
cimport cython
from cython.parallel import prange, parallel
@cython.boundscheck(False)
@cython.wraparound(False)
@cython.nonecheck(False)
@cython.cdivision(True)
def angdistcut_cython_par(double[:,::1] vec_obj, double[:,::1] vec_ps, double cos_maxsep, int num_threads):
    """
    added extra compiler flags
    also added a new argument for the number of threads
    """
    cdef:
        int nps = vec_ps.shape[0]
        int nobj = vec_obj.shape[0]
        int dim = vec_obj.shape[1]
        #use int array instead of list
        int[:] found = np.zeros(nobj, np.int32)
        int i,j
        double cos
    #every object is independent, so can parallelize the outer loop easily
    # need to release the GIL for that (see above)
    with nogil, parallel(num_threads=num_threads):
        #extra keyword arguments of prange control OMP settings (can fine-tune this based on our problem)
        for i in prange(nobj): #,schedule='static',chunksize=1):
            for j in range(nps):
                cos = (vec_obj[i,0]*vec_ps[j,0] +
                       vec_obj[i,1]*vec_ps[j,1] +
                       vec_obj[i,2]*vec_ps[j,2])
                if cos > cos_maxsep:
                    found[i] = 1
                    break
    return np.flatnonzero(np.asarray(found))
In [24]:
%%time
result_cython_par = angdistcut_cython_par(vec_obj_c,vec_ps_c,cos_maxsep, num_threads=2)
In [25]:
assert np.array_equal(result_cython_par,result_numpy)
For such a small computation, the parallelization overhead becomes significant, so there is no further speed-up (NB: this looks slightly different on my work desktop). Let's repeat the benchmark with larger arrays.
In [26]:
# similar to real use case
vec_obj_large,theta_obj_large,phi_obj_large = gen_points_sphere(500000)
vec_ps_large,theta_ps_large,phi_ps_large = gen_points_sphere(50000)
#make them C contiguous again
vec_obj_large_c = np.ascontiguousarray(vec_obj_large)
vec_ps_large_c = np.ascontiguousarray(vec_ps_large)
In [27]:
%%time
result_cython_typed = angdistcut_cython_typed(vec_obj_large_c,vec_ps_large_c,cos_maxsep)
In [28]:
%%time
result_cython_opt = angdistcut_cython_opt(vec_obj_large_c,vec_ps_large_c,cos_maxsep)
In [31]:
%%time
result_cython_par_large = angdistcut_cython_par(vec_obj_large_c,vec_ps_large_c,cos_maxsep,num_threads=2)
In [32]:
%%time
result_cython_par_large = angdistcut_cython_par(vec_obj_large_c,vec_ps_large_c,cos_maxsep,num_threads=4)
How much speed-up OMP parallelization brings depends both on the system and the problem. In purely CPU-bound problems (with almost no memory I/O), one can get quasi-linear speed-up up to a large number of cores.
NB: Sometimes we can squeeze out some extra speed-up by fine-tuning the 'schedule' and 'chunksize' arguments of prange. See e.g. http://cython.readthedocs.io/en/latest/src/userguide/parallelism.html
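For illustration only (a sketch; the schedule and chunksize values are arbitrary guesses, not tuned), a prange loop with explicit scheduling could look like this. Dynamic scheduling can help in our case because the early break makes the per-object cost uneven; this would again need the OpenMP compile/link flags:
from cython.parallel import prange

def flag_above(double[::1] x, double threshold, int[::1] flag):
    cdef Py_ssize_t i
    #'dynamic': idle threads grab new chunks of 64 iterations on demand,
    #which helps when the iterations have very uneven cost
    for i in prange(x.shape[0], nogil=True, schedule='dynamic', chunksize=64):
        if x[i] > threshold:
            flag[i] = 1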