Preprocessing

TL;DR: This notebook benchmarks a numba JIT-compiled standard-scaling function against the scikit-learn equivalent (StandardScaler) on a small sample dataset, verifying that the two produce the same output and showing a roughly 20x speedup for the compiled version.


In [36]:
from numba import jit
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

import os
import sys

# Add the notebook's parent directory to the path so the local
# `utilities` package can be imported below.
nb_dir = os.path.split(os.getcwd())[0]
if nb_dir not in sys.path:
    sys.path.append(nb_dir)

In [37]:
from utilities.preprocessing import standard_scale  
# this is the function we're going to test versus sklearn
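
The `standard_scale` function lives in the local `utilities` package, so its source isn't shown in this notebook. As a rough sketch of what a numba JIT-compiled standard scaler could look like (an illustrative assumption, not necessarily the actual `utilities.preprocessing` implementation):

# Hypothetical sketch; the real utilities.preprocessing.standard_scale may differ.
from numba import jit
import numpy as np

@jit(nopython=True)
def standard_scale_sketch(x):
    # Scale each column to zero mean and unit variance, matching
    # sklearn's StandardScaler (population std, i.e. ddof=0).
    n_rows, n_cols = x.shape
    out = np.empty_like(x)
    for j in range(n_cols):
        mean = x[:, j].mean()
        std = x[:, j].std()
        for i in range(n_rows):
            out[i, j] = (x[i, j] - mean) / std
    return out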

In [38]:
x = load_iris().data
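
The iris feature matrix is small (150 rows × 4 columns), so the timings below mainly reflect per-call overhead rather than throughput on large arrays.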

In [39]:
output = StandardScaler().fit_transform(x)
fast_output = standard_scale(x)
np.allclose(output, fast_output)


Out[39]:
True
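
The two implementations agree to within floating-point tolerance. This first call to `standard_scale` also triggers numba's JIT compilation, so the %timeit runs below measure the compiled code rather than compilation overhead.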

In [40]:
%timeit StandardScaler().fit_transform(x)


The slowest run took 4.81 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 252 µs per loop

In [41]:
%timeit standard_scale(x)


100000 loops, best of 3: 12.1 µs per loop
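
On this data the numba version is roughly 20x faster (about 12 µs versus about 252 µs per call). Part of that gap comes from StandardScaler's estimator machinery and input validation, so the relative speedup can be expected to shrink as the arrays grow and the actual scaling work dominates.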