Scientific Programming in Python

Topic 4: Just in Time Compilation: Numba and NumExpr

Notebook created by Martín Villanueva - martin.villanueva@usm.cl - DI UTFSM - April 2017.


In [1]:
import numba
import numpy as np
import numexpr as ne
import matplotlib.pyplot as plt

En esta actividad implementaremos una conocida métrica para medir disimilitud entre conjuntos: La métrica de Hausdorff. Esta corresponde a un métrica o distancia ocupada para medir cuán disímiles son dos subconjuntos dados.

Esta tiene muchas aplicaciones, en particular para comparar el parecido entre imágenes. En el caso en donde los conjuntos son arreglos bidimensionales, la definición es la siguiente:

Sean $X \in \mathbb{R}^{m \times 3}$ e $Y \in \mathbb{R}^{n \times 3}$ dos matrices, la métrica/distancia de Hausdorff sobre sobre estas como:

$$ d_H(X,Y) = \max \left(\ \max_{i\leq m} \min_{j \leq n} d(X[i],Y[j]), \ \max_{j\leq n} \min_{i \leq m} d(Y[j],X[i]) \ \right) $$

donde $d$ es la distancia Euclideana clásica. ($X[i]$ indíca la i-ésima fila de X).

Ilustración unidimensional: Distancia entre funciones.

  1. Implemente la métrica de Hausdorff en Python clásico.
  2. Implemente la métrica de Hausdorff usando Numba (Forzando el modo nopython y definiendo explícitamente las signatures de las funciones).
  3. Cree 10 arreglos $X,Y$ aleatorios, con cantidad creciente de filas, y realice análsis de tiempos de ejecuciones de las funciones anteriores sobre estos arreglos.
  4. Concluya.

Paso 1.


In [2]:
def metric_python(x, y):
    """
    standard Euclidean distance
    """
    ret = x-y
    ret *= ret
    return np.sqrt(ret).sum()


def inf_dist_python(x, Y):
    """
    inf distance between row x and array Y
    """
    m = Y.shape[0]
    inf = np.inf
    
    for i in range(m):
        dist = metric_python(x, Y[i])
        if dist < inf:
            inf = dist
    return inf

def hausdorff_python(X, Y):
    """
    Hausdorff distance between arrays X and Y
    """
    m = X.shape[0]
    n = Y.shape[0]
    sup1 = -1.
    sup2 = -1.
    
    for i in range(m):
        inf1 = inf_dist_python(X[i], Y)
        if inf1 > sup1:
            sup1 = inf1
    for i in range(n):
        inf2 = inf_dist_python(Y[i], X)
        if inf2 > sup2:
            sup2 = inf2
            
    return max(sup1, sup2)

Paso 2.

Notar que los cambios son mínimos: Decoradores + nombres de funciones.


In [3]:
@numba.jit('float64 (float64[:], float64[:])')
def metric_numba(x, y):
    """
    standard Euclidean distance
    """
    ret = x-y
    ret *= ret
    return np.sqrt(ret).sum()


@numba.jit('float64 (float64[:], float64[:,:])', nopython=True)
def inf_dist_numba(x, Y):
    """
    inf distance between row x and array Y
    """
    m = Y.shape[0]
    inf = np.inf
    
    for i in range(m):
        dist = metric_numba(x, Y[i])
        if dist < inf:
            inf = dist
    return inf

@numba.jit('float64 (float64[:,:], float64[:,:])', nopython=True)
def hausdorff_numba(X, Y):
    """
    Hausdorff distance between arrays X and Y
    """
    m = X.shape[0]
    n = Y.shape[0]
    sup1 = -1.
    sup2 = -1.
    
    for i in range(m):
        inf1 = inf_dist_numba(X[i], Y)
        if inf1 > sup1:
            sup1 = inf1
    for i in range(n):
        inf2 = inf_dist_numba(Y[i], X)
        if inf2 > sup2:
            sup2 = inf2
            
    return max(sup1, sup2)

Paso 3.


In [4]:
#nrows = [10**n for n in range(10)]
nrows = np.linspace(100,5000,10).astype(int)

for nrow in nrows:
    X = np.random.random((nrow,3))
    Y = np.random.random((nrow,3))
    tp = %timeit -o hausdorff_python(X,Y)
    tn = %timeit -o hausdorff_numba(X,Y)
    print("Number of rows: {0}".format(nrow))
    print("Best time in Python: {0}".format(tp.best))
    print("Best time in Numba: {0} \n".format(tn.best))
    del X,Y


10 loops, best of 3: 96.9 ms per loop
100 loops, best of 3: 10.3 ms per loop
Number of rows: 100
Best time in Python: 0.09689636600087397
Best time in Numba: 0.010275031779892743 

1 loop, best of 3: 4.03 s per loop
1 loop, best of 3: 420 ms per loop
Number of rows: 644
Best time in Python: 4.0261467099771835
Best time in Numba: 0.4199122780119069 

1 loop, best of 3: 13.6 s per loop
1 loop, best of 3: 1.44 s per loop
Number of rows: 1188
Best time in Python: 13.602671164029744
Best time in Numba: 1.4442056000116281 

1 loop, best of 3: 28.8 s per loop
1 loop, best of 3: 3.07 s per loop
Number of rows: 1733
Best time in Python: 28.812231293006334
Best time in Numba: 3.074851111974567 

1 loop, best of 3: 50 s per loop
1 loop, best of 3: 5.38 s per loop
Number of rows: 2277
Best time in Python: 49.96447215799708
Best time in Numba: 5.381338081031572 

1 loop, best of 3: 1min 20s per loop
1 loop, best of 3: 8.18 s per loop
Number of rows: 2822
Best time in Python: 80.22970647504553
Best time in Numba: 8.180889984010719 

1 loop, best of 3: 1min 54s per loop
1 loop, best of 3: 11.5 s per loop
Number of rows: 3366
Best time in Python: 114.21166667999933
Best time in Numba: 11.538127077976242 

1 loop, best of 3: 2min 36s per loop
1 loop, best of 3: 15.5 s per loop
Number of rows: 3911
Best time in Python: 156.53557300998364
Best time in Numba: 15.49025035899831 

1 loop, best of 3: 3min 13s per loop
1 loop, best of 3: 20.9 s per loop
Number of rows: 4455
Best time in Python: 193.120016134053
Best time in Numba: 20.861760444997344 

1 loop, best of 3: 4min per loop
1 loop, best of 3: 25.7 s per loop
Number of rows: 5000
Best time in Python: 240.2516234299983
Best time in Numba: 25.672869634989183 

Paso 4.

Gracias a la compilación en tiempo real, los resultados muestran que Numba es ordenes de magnitud más rápido que una implemetación nativa con Python+NumPy.