1. Write a Python function using for loops to calculate the pairwise distance matrix given two sets of vectors. The function signature should look like this:

np.ndarray <- cdist(xs, ys, dist)

where xs and ys are collections of vectors, and dist is some function for a distance metric. Write a function for the Euclidean distance metric with the signature:

float <- euclidean_distance(x, y)

Example of usage:

coords = [(35.0456, -85.2672),
          (35.1174, -89.9711),
          (35.9728, -83.9422),
          (36.1667, -86.7833)]
cdist(coords, coords, euclidean_distance)

where coords is interpreted as 4 row vectors of dimension 2 each. The result should be

Euclidean metric

array([[ 0.        ,  4.70444794,  1.6171966 ,  1.88558331],
       [ 4.70444794,  0.        ,  6.0892811 ,  3.35605413],
       [ 1.6171966 ,  6.0892811 ,  0.        ,  2.84770898],
       [ 1.88558331,  3.35605413,  2.84770898,  0.        ]])

Time your function with the Euclidean distance metric using %timeit on the data sets XA and XB defined below.

Use the library function scipy.spatial.distance.cdist to see how much speed-up is achievable.
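One possible shape for the loop-based version is sketched below. It is a minimal sketch rather than a reference solution: it assumes xs and ys are sequences of equal-length numeric vectors, and the variable name D is introduced only for illustration.

```python
import math

import numpy as np


def euclidean_distance(x, y):
    """Euclidean distance between two equal-length vectors."""
    total = 0.0
    for xi, yi in zip(x, y):
        total += (xi - yi) ** 2
    return math.sqrt(total)


def cdist(xs, ys, dist):
    """Pairwise distance matrix: entry (i, j) is dist(xs[i], ys[j])."""
    out = np.empty((len(xs), len(ys)))
    for i, x in enumerate(xs):
        for j, y in enumerate(ys):
            out[i, j] = dist(x, y)
    return out


coords = [(35.0456, -85.2672),
          (35.1174, -89.9711),
          (35.9728, -83.9422),
          (36.1667, -86.7833)]

# D is an illustrative name for the 4 x 4 result
D = cdist(coords, coords, euclidean_distance)
print(D)
```

In a notebook, the timing would then be something like `%timeit cdist(XA, XB, euclidean_distance)`, once XA and XB have been created.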


In [11]:
import numpy as np

np.random.seed(123)

n1 = 50
n2 = 100
p = 10
XA = np.random.normal(0, 1, (n1, p))
XB = np.random.normal(0, 1, (n2, p))

In [ ]:

2. Write a version cdist_numpy to speed up calculations using vectorization and broadcasting. Check that it gives the correct answers on coords and compare timings.
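A broadcasting-based version might look like the sketch below. It assumes the Euclidean metric is hard-coded (so there is no dist argument, unlike the loop version) and that both inputs convert to 2-D float arrays with the same number of columns.

```python
import numpy as np


def cdist_numpy(xs, ys):
    """Euclidean pairwise distances via broadcasting (metric is hard-coded)."""
    xs = np.asarray(xs, dtype=float)  # shape (m, p)
    ys = np.asarray(ys, dtype=float)  # shape (n, p)
    # (m, 1, p) - (1, n, p) broadcasts to an (m, n, p) array of differences
    diff = xs[:, None, :] - ys[None, :, :]
    # sum the squared differences over the last axis, giving shape (m, n)
    return np.sqrt((diff ** 2).sum(axis=-1))


coords = [(35.0456, -85.2672),
          (35.1174, -89.9711),
          (35.9728, -83.9422),
          (36.1667, -86.7833)]
print(cdist_numpy(coords, coords))
```

The intermediate array holds m * n * p values, so this approach trades memory for speed; for the XA and XB sizes used here that is only 50 * 100 * 10 floats.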


In [ ]:

3. Write a version cdist_numba using numba JIT. Check that it gives the correct answers on coords and compare timings.
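A numba version could follow the pattern sketched below: explicit loops over NumPy arrays, compiled with numba's njit decorator. Two assumptions are made for illustration. The Euclidean metric is written inline rather than passed in as dist, and a do-nothing fallback decorator is defined so the sketch still runs (uncompiled) where numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback for illustration only: without numba the function runs as plain Python
    def njit(func):
        return func


@njit
def cdist_numba(xs, ys):
    """Euclidean pairwise distances with explicit loops over 2-D float arrays."""
    m, p = xs.shape
    n = ys.shape[0]
    out = np.empty((m, n))
    for i in range(m):
        for j in range(n):
            s = 0.0
            for k in range(p):
                d = xs[i, k] - ys[j, k]
                s += d * d
            out[i, j] = np.sqrt(s)
    return out


# Convert the list of tuples to an array before calling the compiled function
coords = np.array([(35.0456, -85.2672),
                   (35.1174, -89.9711),
                   (35.9728, -83.9422),
                   (36.1667, -86.7833)])
print(cdist_numba(coords, coords))
```

When timing, note that the first call includes compilation, so it is worth calling the function once before running %timeit.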


In [ ]:

4. Write a version cdist_cython using Cython AOT (ahead-of-time) compilation. Check that it gives the correct answers on coords and compare timings.


In [ ]: