This notebook aims to take the reader through a real-world example of speeding up an algorithm. The example given is computing the normals of a triangle mesh (a set of points and a trilist). Computing the per-vertex normals of a mesh is an intensive operation that yields poor performance in pure Python, because we need to loop over every triangle and, for each vertex, sum the per-triangle normals of every triangle that vertex is a member of. This notebook is not designed to describe Cython syntax or basics; it is assumed that the reader has some understanding of, and preferably experience with, writing Cythonised functions. In pseudocode, the algorithm is:
foreach face in faces:
    face_normal = crossproduct(vertices[face[1]] - vertices[face[0]],
                               vertices[face[2]] - vertices[face[0]])
    normalise(face_normal)
    foreach v in face:
        vertices[v].in_faces.append(face_normal)

foreach vertex in vertices:
    normal = (0, 0, 0)
    for face_normal in vertex.in_faces:
        normal += face_normal
    normalise(normal)

crossproduct(v0, v1):
    return (v0.y * v1.z - v0.z * v1.y,
            v0.z * v1.x - v0.x * v1.z,
            v0.x * v1.y - v0.y * v1.x)
We begin by loading an appropriate mesh.
In [ ]:
import menpo.io as pio
import numpy as np
shapeimage = pio.import_image('/vol/atlas/databases/frgc/spring2003/04201d302.abs')
tris = shapeimage.mesh.trilist
points = shapeimage.mesh.points
print shapeimage
In [ ]:
def normalise(vec):
    # Avoid divisions by almost 0 numbers
    # np.spacing(1) is equivalent to Matlab's eps
    d = np.sqrt(np.sum(vec ** 2, axis=1))
    d[d < np.spacing(1)] = 1.0
    return vec / d[..., None]

def py_compute_normal(vertex, face):
    nface = face.shape[0]
    nvert = vertex.shape[0]
    # Calculate the cross product (per-face normal)
    normalf = np.cross(vertex[face[:, 1], :] - vertex[face[:, 0], :],
                       vertex[face[:, 2], :] - vertex[face[:, 0], :])
    normalf = normalise(normalf)
    # Calculate per-vertex normal
    normal = np.zeros([nvert, 3])
    for i in xrange(nface):
        f = face[i, :]
        for j in xrange(3):
            normal[f[j], :] += normalf[i, :]
    # Normalize
    normal = normalise(normal)
    # Enforce that the normals are outward
    v = vertex - np.mean(vertex)[..., None]
    s = np.sum(v * normal, axis=1)
    if np.sum(np.greater(s, 0)) < np.sum(np.less(s, 0)):
        # flip
        normal = -normal
        normalf = -normalf
    return normal, normalf
If we then time this function, we can see that it takes about 3 seconds (on my Intel(R) Xeon(R) CPU E5-1650 @ 3.20GHz with 32GB of RAM) for 123160 triangles and 61599 points.
In [ ]:
%timeit py_compute_normal(points, tris)
This is obviously far too slow to be of any use. Therefore, we naively port this method to Cython. Cython is useful for code where tight looping is unavoidable, as is the case in computing the per-vertex normal, because it pre-compiles as much of the code as possible down to C, which is very efficient at tight looping. To compile Cython code, we have to load the Cython magic extension:
In [ ]:
%load_ext cythonmagic
The Cython extension gives us the %%cython cell magic, where we can put raw Cython code that will be compiled on execution. To get started with Cython, we note that the majority of Cython's speedup comes from the fact that we statically type variables. Therefore, we always have to import some C code via the cimport statement. For example, to use numpy, we could use:
In [ ]:
%%cython
import numpy as np
cimport cython
cimport numpy as np
Note that we have to Python import numpy AND cimport it. Therefore, a simple Cython function using numpy would look like:
In [ ]:
%%cython
import numpy as np
cimport cython
cimport numpy as np
def my_pow(double x):
    return np.power(x, 2)

print my_pow(2.0)
It's important to note that there are 3 kinds of function definitions in Cython:

- def: a regular Python function, callable from Python
- cdef: a C-level function, callable only from other Cython/C code
- cpdef: generates both a C-level function and a Python wrapper for it, so Python calls the wrapper and C calls the cdef function.
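As a minimal sketch of the three (the square helpers here are hypothetical, purely for illustration):
In [ ]:
%%cython
cdef double c_square(double x):
    # cdef: callable only from Cython/C code
    return x * x

cpdef double square(double x):
    # cpdef: C function plus an automatically generated Python wrapper
    return c_square(x)

def py_square(x):
    # def: an ordinary Python function
    return square(x)

print square(3.0), py_square(4.0)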
So, to create a naive implementation of our Python function in Cython, we define a cpdef function as follows:
In [ ]:
%%cython
import numpy as np
cimport numpy as np
cimport cython
cdef np.ndarray[np.float64_t, ndim=2] cy_normalise_naive(np.ndarray[np.float64_t, ndim=2] vec):
    # Avoid divisions by almost 0 numbers
    cdef np.ndarray[np.float64_t, ndim=1] d = np.sqrt(np.sum(vec ** 2, axis=1))
    d[d < np.spacing(1)] = 1.0
    return vec / d[..., None]

cpdef cy_compute_normal_naive(np.ndarray[np.float64_t, ndim=2] vertex,
                              np.ndarray[int, ndim=2] face):
    cdef int nface = face.shape[0]
    cdef int nvert = vertex.shape[0]
    # Calculate the cross product (per-face normal)
    cdef np.ndarray[np.float64_t, ndim=2] normalf = np.cross(
        vertex[face[:, 1], :] - vertex[face[:, 0], :],
        vertex[face[:, 2], :] - vertex[face[:, 0], :])
    normalf = cy_normalise_naive(normalf)
    # Calculate per-vertex normal
    cdef np.ndarray[np.float64_t, ndim=2] normal = np.zeros([nvert, 3])
    cdef np.ndarray[int, ndim=1] f
    for i in xrange(nface):
        f = face[i, :]
        for j in xrange(3):
            normal[f[j], :] += normalf[i, :]
    # Normalize
    normal = cy_normalise_naive(normal)
    # Enforce that the normals are outward
    cdef np.ndarray[np.float64_t, ndim=2] v = vertex - np.mean(vertex)[..., None]
    cdef np.ndarray[np.float64_t, ndim=1] s = np.sum(v * normal, axis=1)
    if np.sum(np.greater(s, 0)) < np.sum(np.less(s, 0)):
        # flip
        normal = -normal
        normalf = -normalf
    return normal, normalf
If we then time this function, we can see that it takes about 1.8 seconds (on my Intel(R) Xeon(R) CPU E5-1650 @ 3.20GHz with 32GB of RAM) for 123160 triangles and 61599 points. This represents an approximately 1.6x speedup just by naively moving the code into a Cython function. Other than the static typing, the code is almost identical.
Note: there are decorators such as @cython.boundscheck(False) and @cython.wraparound(False) that can provide speedups by telling Cython that you guarantee the kinds of accesses arrays will have inside the function; see the Cython documentation for more information.
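For instance, one might write (a sketch with a hypothetical sum-of-squares function; the decorators assume the caller passes only in-bounds, non-negative indices):
In [ ]:
%%cython
cimport cython

@cython.boundscheck(False)  # we promise all indices are in bounds
@cython.wraparound(False)   # we promise there is no negative indexing
cpdef double sum_squared(double[:] arr):
    cdef double total = 0.0
    cdef int i
    for i in range(arr.shape[0]):
        total += arr[i] * arr[i]
    return total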
In [ ]:
%timeit cy_compute_normal_naive(points, tris)
However, we can do better than this! In order to give us a better indication of where the time is spent, Cython provides the ability to pass flags in for execution. These can be compile-time flags, or special running flags. The flag we are interested in is -a. This produces an output that colour codes the typing that is going on within the Cython function: yellow backgrounds indicate calls back into Python (which is slow), and white/clear backgrounds represent pure C calls. If we run this on our naive implementation, we get the following:
In [ ]:
%%cython -a
import numpy as np
cimport numpy as np
cimport cython
cdef np.ndarray[np.float64_t, ndim=2] cy_normalise_naive(np.ndarray[np.float64_t, ndim=2] vec):
    # Avoid divisions by almost 0 numbers
    cdef np.ndarray[np.float64_t, ndim=1] d = np.sqrt(np.sum(vec ** 2, axis=1))
    d[d < np.spacing(1)] = 1.0
    return vec / d[..., None]

cpdef cy_compute_normal_naive(np.ndarray[np.float64_t, ndim=2] vertex,
                              np.ndarray[int, ndim=2] face):
    cdef int nface = face.shape[0]
    cdef int nvert = vertex.shape[0]
    # unit normals to the faces
    cdef np.ndarray[np.float64_t, ndim=2] normalf = np.cross(
        vertex[face[:, 1], :] - vertex[face[:, 0], :],
        vertex[face[:, 2], :] - vertex[face[:, 0], :])
    normalf = cy_normalise_naive(normalf)
    # unit normal to the vertex
    cdef np.ndarray[np.float64_t, ndim=2] normal = np.zeros([nvert, 3])
    cdef np.ndarray[int, ndim=1] f
    for i in xrange(nface):
        f = face[i, :]
        for j in xrange(3):
            normal[f[j], :] += normalf[i, :]
    # normalize
    normal = cy_normalise_naive(normal)
    # enforce that the normals are outward
    cdef np.ndarray[np.float64_t, ndim=2] v = vertex - np.mean(vertex)[..., None]
    cdef np.ndarray[np.float64_t, ndim=1] s = np.sum(v * normal, axis=1)
    if np.sum(np.greater(s, 0)) < np.sum(np.less(s, 0)):
        # flip
        normal = -normal
        normalf = -normalf
    return normal, normalf
Looking above, we see that the majority of the code is still making calls back into Python. In particular, the slow vertex loop is making a Python call every iteration. Therefore, we want to try and remove this.
In [ ]:
%%cython -a
import numpy as np
cimport numpy as np
cimport cython
cdef np.ndarray[np.float64_t, ndim=2] cy_normalise_naive(np.ndarray[np.float64_t, ndim=2] vec):
    # Avoid divisions by almost 0 numbers
    cdef np.ndarray[np.float64_t, ndim=1] d = np.sqrt(np.sum(vec ** 2, axis=1))
    d[d < np.spacing(1)] = 1.0
    return vec / d[..., None]

cpdef cy_compute_normal_better(np.ndarray[np.float64_t, ndim=2] vertex,
                               np.ndarray[int, ndim=2] face):
    cdef int nface = face.shape[0]
    cdef int nvert = vertex.shape[0]
    # unit normals to the faces
    cdef np.ndarray[np.float64_t, ndim=2] normalf = np.cross(
        vertex[face[:, 1], :] - vertex[face[:, 0], :],
        vertex[face[:, 2], :] - vertex[face[:, 0], :])
    normalf = cy_normalise_naive(normalf)
    # unit normal to the vertex
    cdef np.ndarray[np.float64_t, ndim=2] normal = np.zeros([nvert, 3])
    cdef int f0, f1, f2
    for i in range(nface):
        f0 = face[i, 0]
        f1 = face[i, 1]
        f2 = face[i, 2]
        for j in range(3):
            normal[f0, j] += normalf[i, j]
            normal[f1, j] += normalf[i, j]
            normal[f2, j] += normalf[i, j]
    # normalize
    normal = cy_normalise_naive(normal)
    # enforce that the normals are outward
    cdef np.ndarray[np.float64_t, ndim=2] v = vertex - np.mean(vertex)[..., None]
    cdef np.ndarray[np.float64_t, ndim=1] s = np.sum(v * normal, axis=1)
    if np.sum(np.greater(s, 0)) < np.sum(np.less(s, 0)):
        # flip
        normal = -normal
        normalf = -normalf
    return normal, normalf
Eureka! By turning lines 24-36 into pure C, just by guaranteeing their accesses as C types, we have sped up our function to approximately 80 ms. This represents an approximately 38x speedup from the original! And all we did was partially unroll a single loop. This is the key when trying to optimise Cython code: you need to ensure that all loops make as few calls into Python code as possible.
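As an aside, the pattern change can be seen in isolation in the toy sketch below (a hypothetical accumulate function, not part of the normal computation): slicing with f = face[i, :] constructs a Python array object on every iteration, whereas extracting typed scalars such as f0 keeps the inner loop in pure C.
In [ ]:
%%cython
import numpy as np
cimport numpy as np

cpdef accumulate(np.ndarray[np.float64_t, ndim=2] src,
                 np.ndarray[int, ndim=2] idx,
                 np.ndarray[np.float64_t, ndim=2] dst):
    cdef int i, j, f0
    for i in range(idx.shape[0]):
        f0 = idx[i, 0]  # typed scalar index: compiles to a C array access
        for j in range(3):
            dst[f0, j] += src[i, j]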
In [ ]:
%timeit cy_compute_normal_better(points, tris)
So now the game has become trying to turn as much of that yellow code into white code. Note that there is certainly a law of diminishing returns at work here: our previous optimisation was almost certainly the largest jump in performance we will be able to achieve. Given that we don't have any other loops, we are unlikely to get large 100%+ jumps in performance, and numpy calls are already vectorized, so manually unrolling them into loops will not yield a very big performance boost. If we run the magic function %prun, it will give us profiling information about the functions that are called. We use the -r flag in order to return the profiler object so that we can print it into the cell. Normally, this need just be called as:

%prun cy_compute_normal_better(points, tris)

which opens up a separate window in the notebook.
In [ ]:
p = %prun -r cy_compute_normal_better(points, tris)
p.print_stats()
The profiling output tells us that the majority of the time is spent inside the Cython function. However, almost half the time is spent inside numpy's cross product function. Looking at the source code of numpy's cross product shows us that it does a number of checks to try and ensure that the shapes of the vectors match. However, we know that we are guaranteed standard Nx3 vectors. So, what happens if we roll our own cross product method?
In [ ]:
%%cython -a
import numpy as np
cimport numpy as np
cimport cython
cdef np.ndarray[np.float64_t, ndim=2] normalise(np.ndarray[np.float64_t, ndim=2] vec):
    # Avoid divisions by almost 0 numbers
    cdef np.ndarray[np.float64_t, ndim=1] d = np.sqrt(np.sum(vec ** 2, axis=1))
    d[d < np.spacing(1)] = 1.0
    return vec / d[..., None]

cdef inline np.ndarray[np.float64_t, ndim=2] cross(double[:, :] x, double[:, :] y):
    cdef np.ndarray[np.float64_t, ndim=2] z = np.empty_like(x)
    cdef int n = x.shape[0]
    for i in range(n):
        z[i, 0] = x[i, 1] * y[i, 2] - x[i, 2] * y[i, 1]
        z[i, 1] = x[i, 2] * y[i, 0] - x[i, 0] * y[i, 2]
        z[i, 2] = x[i, 0] * y[i, 1] - x[i, 1] * y[i, 0]
    return z

cpdef cy_compute_normal(np.ndarray[np.float64_t, ndim=2] vertex,
                        np.ndarray[int, ndim=2] face):
    cdef int nface = face.shape[0]
    cdef int nvert = vertex.shape[0]
    # unit normals to the faces
    cdef np.ndarray[np.float64_t, ndim=2] normalf = cross(
        vertex[face[:, 1], :] - vertex[face[:, 0], :],
        vertex[face[:, 2], :] - vertex[face[:, 0], :])
    normalf = normalise(normalf)
    # unit normal to the vertex
    cdef np.ndarray[np.float64_t, ndim=2] normal = np.zeros([nvert, 3])
    cdef int f0, f1, f2
    for i in range(nface):
        f0 = face[i, 0]
        f1 = face[i, 1]
        f2 = face[i, 2]
        for j in range(3):
            normal[f0, j] += normalf[i, j]
            normal[f1, j] += normalf[i, j]
            normal[f2, j] += normalf[i, j]
    # normalize
    normal = normalise(normal)
    # enforce that the normals are outward
    cdef np.ndarray[np.float64_t, ndim=2] v = vertex - np.mean(vertex)[..., None]
    cdef np.ndarray[np.float64_t, ndim=1] s = np.sum(v * normal, axis=1)
    if np.sum(np.greater(s, 0)) < np.sum(np.less(s, 0)):
        # flip
        normal = -normal
        normalf = -normalf
    return normal, normalf
In [ ]:
%timeit cy_compute_normal(points, tris)
print ""
p = %prun -r cy_compute_normal(points, tris)
p.print_stats()
We've now reduced the execution time by almost a third again. Looking at the profiler output, we see that all of the time is simply spent inside the Cython function. Since all the operations are vectorized, we are unlikely to see anything but very incremental improvements. However, we've gone from nearly 3s down to around 50ms. Looking at the code, we've changed very little from the original Python version. Easily the most difficult part was rolling our own cross product, and even that was not really a necessary optimisation.
Now, just for our own peace of mind, we'll check that our optimised Cython version produces the same output as the original Python implementation.
In [ ]:
cy_normal, cy_normalf = cy_compute_normal(points, tris)
py_normal, py_normalf = py_compute_normal(points, tris)
print np.allclose(cy_normal, py_normal)
print np.allclose(cy_normalf, py_normalf)
By request I have implemented the algorithm as best I could in Numba. I should note that:

- I have left out the print statements because the output is so useless.
- I've tried to comment why I did certain unrollings, though I don't justify them because I don't really understand them. I assume that specifying the expected type will help the optimisation - but I honestly have no idea.
- Presumably the np.greater and np.less problem is a bug in Numba?
In [ ]:
import numba
from numba.decorators import autojit, jit
from math import sqrt
# Had to unroll this otherwise it complained about some strange python object
# coercion error
@autojit
def numba_normalise(vec):
    # Avoid divisions by almost 0 numbers
    # np.spacing(1) is equivalent to Matlab's eps
    n = vec.shape[0]
    for i in range(n):
        d = sqrt(vec[i, 0] * vec[i, 0] +
                 vec[i, 1] * vec[i, 1] +
                 vec[i, 2] * vec[i, 2])
        if d < np.spacing(1):
            d = 1.0
        vec[i, 0] /= d
        vec[i, 1] /= d
        vec[i, 2] /= d

# If I didn't roll my own cross product then computing
# the normals actually takes LONGER than the pure Python implementation
@jit(argtypes=(numba.double[:, :], numba.double[:, :]))
def cross_numba(x, y):
    output = np.empty_like(x)
    n = x.shape[0]
    for i in range(n):
        output[i, 0] = x[i, 1] * y[i, 2] - x[i, 2] * y[i, 1]
        output[i, 1] = x[i, 2] * y[i, 0] - x[i, 0] * y[i, 2]
        output[i, 2] = x[i, 0] * y[i, 1] - x[i, 1] * y[i, 0]
    return output

@jit(argtypes=(numba.double[:, :], numba.int32[:, :]))
def numba_compute_normal(vertex, face):
    nface = face.shape[0]
    # Calculate the cross product (per-face normal)
    normalf = cross_numba(vertex[face[:, 1], :] - vertex[face[:, 0], :],
                          vertex[face[:, 2], :] - vertex[face[:, 0], :])
    numba_normalise(normalf)
    # Calculate per-vertex normal
    normal = np.zeros_like(vertex)
    for i in range(nface):
        f = face[i, :]
        for j in range(3):
            normal[f[j], :] += normalf[i, :]
    # Normalize
    numba_normalise(normal)
    # Enforce that the normals are outward
    v = vertex - np.mean(vertex)[..., None]
    s = np.sum(v * normal, axis=1)
    s_gt_sum = 0
    s_lt_sum = 0
    # Had to expand this loop otherwise numba complained:
    # 'only length-1 arrays can be converted to Python scalars'
    # on the calls to np.greater and np.less
    for i in range(s.shape[0]):
        if s[i] > 0:
            s_gt_sum += 1
        elif s[i] < 0:
            s_lt_sum += 1
    if s_gt_sum < s_lt_sum:
        # flip
        normal = -normal
        normalf = -normalf
    return normal, normalf
As for the results, we've gone down to around 800ms, which is definitely an improvement. Interestingly, this is on a par with a Matlab implementation that I have (which is known to be JIT-compiled). As a sanity check, we will also verify that the output is correct.
In [ ]:
%timeit numba_compute_normal(points, tris)
In [ ]:
numba_normal, numba_normalf = numba_compute_normal(points, tris)
py_normal, py_normalf = py_compute_normal(points, tris)
print np.allclose(numba_normal, py_normal)
print np.allclose(numba_normalf, py_normalf)