This notebook contains an excerpt from the book Machine Learning for OpenCV by Michael Beyeler. The code is released under the MIT license, and is available on GitHub.

Note that this excerpt contains only the raw code - the book is rich with additional explanations and illustrations. If you find this content useful, please consider supporting the work by buying the book!

Preprocessing Data

The more disciplined we are in handling our data, the better results we are likely to achieve in the end. The first step in this procedure is known as data preprocessing.

Standardizing features

Standardization refers to the process of scaling the data to have zero mean and unit variance. This is a common requirement for a wide range of machine learning algorithms, which might behave badly if individual features do not fulfill this requirement. We could manually standardize our data by subtracting from every data point the mean value ($\mu$) of all the data, and dividing by the standard deviation ($\sigma$) of the data; that is, for every feature $x$, we would compute $(x - \mu) / \sigma$.

Alternatively, scikit-learn offers a straightforward implementation of this process in its preprocessing module. Let's consider a 3 x 3 data matrix X, standing for three data points (rows) with three arbitrarily chosen feature values each (columns):


In [1]:
from sklearn import preprocessing
import numpy as np
X = np.array([[ 1., -2.,  2.],
              [ 3.,  0.,  0.],
              [ 0.,  1., -1.]])

Then, standardizing the data matrix X can be achieved with the function scale:


In [2]:
X_scaled = preprocessing.scale(X)
X_scaled


Out[2]:
array([[-0.26726124, -1.33630621,  1.33630621],
       [ 1.33630621,  0.26726124, -0.26726124],
       [-1.06904497,  1.06904497, -1.06904497]])

Let's make sure X_scaled is indeed standardized, with zero mean and unit variance:


In [3]:
X_scaled.mean(axis=0)


Out[3]:
array([  7.40148683e-17,   0.00000000e+00,   0.00000000e+00])

In addition, every column of the standardized feature matrix should have a variance of 1 (which is the same as checking for a standard deviation of 1 using std):


In [4]:
X_scaled.std(axis=0)


Out[4]:
array([ 1.,  1.,  1.])
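As a sanity check, the manual formula from above gives the same result as preprocessing.scale; a minimal sketch:

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[1., -2., 2.],
              [3., 0., 0.],
              [0., 1., -1.]])

# Apply (x - mu) / sigma to every feature (column) by hand
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# The result matches the output of preprocessing.scale
print(np.allclose(X_manual, preprocessing.scale(X)))  # True
```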

Normalizing features

Similar to standardization, normalization is the process of scaling individual samples to have unit norm. I'm sure you know that the norm stands for the length of a vector, and can be defined in different ways. We discussed two of them in the previous chapter: the L1 norm (or Manhattan distance) and the L2 norm (or Euclidean distance).

X can be normalized using the normalize function, and the L1 norm is specified by the norm keyword:


In [5]:
X_normalized_l1 = preprocessing.normalize(X, norm='l1')
X_normalized_l1


Out[5]:
array([[ 0.2, -0.4,  0.4],
       [ 1. ,  0. ,  0. ],
       [ 0. ,  0.5, -0.5]])

Similarly, the L2 norm can be computed by specifying norm='l2':


In [6]:
X_normalized_l2 = preprocessing.normalize(X, norm='l2')
X_normalized_l2


Out[6]:
array([[ 0.33333333, -0.66666667,  0.66666667],
       [ 1.        ,  0.        ,  0.        ],
       [ 0.        ,  0.70710678, -0.70710678]])
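One way to double-check both results is to confirm that every sample (row) now has unit norm, using the corresponding definition of "norm" in each case; a quick sketch:

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[1., -2., 2.],
              [3., 0., 0.],
              [0., 1., -1.]])

# After L1 normalization, the absolute values in each row sum to 1
X_l1 = preprocessing.normalize(X, norm='l1')
print(np.abs(X_l1).sum(axis=1))  # [1. 1. 1.]

# After L2 normalization, each row has Euclidean length 1
X_l2 = preprocessing.normalize(X, norm='l2')
print(np.linalg.norm(X_l2, axis=1))  # [1. 1. 1.]
```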

Scaling features to a range

An alternative to scaling features to zero mean and unit variance is to make features lie between a given minimum and maximum value. Often these values are zero and one, so that each feature's minimum maps to 0 and its maximum maps to 1. In scikit-learn, this can be achieved using MinMaxScaler:


In [7]:
min_max_scaler = preprocessing.MinMaxScaler()
X_min_max = min_max_scaler.fit_transform(X)
X_min_max


Out[7]:
array([[ 0.33333333,  0.        ,  1.        ],
       [ 1.        ,  0.66666667,  0.33333333],
       [ 0.        ,  1.        ,  0.        ]])
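The same result can be reproduced by hand, which makes the transform explicit: for each feature (column), compute $(x - x_{min}) / (x_{max} - x_{min})$. A minimal sketch:

```python
import numpy as np

X = np.array([[1., -2., 2.],
              [3., 0., 0.],
              [0., 1., -1.]])

# For each feature (column): (x - min) / (max - min)
X_min = X.min(axis=0)
X_minmax_manual = (X - X_min) / (X.max(axis=0) - X_min)
print(X_minmax_manual)
```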

By default, the data will be scaled to fall between 0 and 1. We can specify different ranges by passing the keyword argument feature_range to the MinMaxScaler constructor:


In [8]:
min_max_scaler = preprocessing.MinMaxScaler(feature_range=(-10, 10))
X_min_max2 = min_max_scaler.fit_transform(X)
X_min_max2


Out[8]:
array([[ -3.33333333, -10.        ,  10.        ],
       [ 10.        ,   3.33333333,  -3.33333333],
       [-10.        ,  10.        , -10.        ]])
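Under the hood, MinMaxScaler first rescales each feature to [0, 1] and then stretches and shifts it into the requested range; a hand-rolled sketch of that two-step transform:

```python
import numpy as np

X = np.array([[1., -2., 2.],
              [3., 0., 0.],
              [0., 1., -1.]])

lo, hi = -10, 10  # the requested feature_range

# Step 1: rescale each feature (column) to [0, 1]
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Step 2: stretch and shift into [lo, hi]
X_range = X_std * (hi - lo) + lo
print(X_range[0])  # matches the first row of the MinMaxScaler output
```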

Binarizing features

Finally, we might find ourselves not caring too much about the exact feature values of the data. Instead, we might just want to know if a feature is present or absent. Binarizing the data can be achieved by thresholding the feature values. Let's quickly remind ourselves of our feature matrix, X:


In [9]:
X


Out[9]:
array([[ 1., -2.,  2.],
       [ 3.,  0.,  0.],
       [ 0.,  1., -1.]])

Let's assume that these numbers represent thousands of dollars in our bank accounts. If an account holds more than 0.5 thousand dollars, we consider the person rich, which we represent with a 1; otherwise we put a 0. This is akin to thresholding the data with threshold=0.5:


In [10]:
binarizer = preprocessing.Binarizer(threshold=0.5)
X_binarized = binarizer.transform(X)
X_binarized


Out[10]:
array([[ 1.,  0.,  1.],
       [ 1.,  0.,  0.],
       [ 0.,  1.,  0.]])

The result is a matrix made entirely of ones and zeros.
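Since Binarizer simply maps values strictly greater than the threshold to 1 and everything else to 0, a boolean comparison in plain NumPy gives the same result; a minimal sketch:

```python
import numpy as np

X = np.array([[1., -2., 2.],
              [3., 0., 0.],
              [0., 1., -1.]])

# Values strictly greater than 0.5 become 1., all others 0.
X_bin = (X > 0.5).astype(float)
print(X_bin)
```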

Handling missing data

Another common need in feature engineering is the handling of missing data. For example, we might have a dataset that looks like this:


In [11]:
from numpy import nan
X = np.array([[ nan, 0,   3  ],
              [ 2,   9,  -8  ],
              [ 1,   nan, 1  ],
              [ 5,   2,   4  ],
              [ 7,   6,  -3  ]])

Most machine learning algorithms cannot handle Not a Number (NaN) values (nan in NumPy). Instead, we first have to replace all the nan values with some appropriate fill values. This is known as imputation of missing values.

Three different strategies to impute missing values are offered by scikit-learn:

  • 'mean': Replaces all nan values with the mean value along a specified axis of the matrix (default: axis=0).
  • 'median': Replaces all nan values with median value along a specified axis of the matrix (default: axis=0).
  • 'most_frequent': Replaces all nan values with the most frequent value along a specified axis of the matrix (default: axis=0).

For example, the 'mean' imputer can be called as follows:


In [12]:
from sklearn.preprocessing import Imputer
imp = Imputer(strategy='mean')
X2 = imp.fit_transform(X)
X2


Out[12]:
array([[ 3.75,  0.  ,  3.  ],
       [ 2.  ,  9.  , -8.  ],
       [ 1.  ,  4.25,  1.  ],
       [ 5.  ,  2.  ,  4.  ],
       [ 7.  ,  6.  , -3.  ]])

Let's verify the math by calculating the mean of the first column by hand; it should evaluate to 3.75 (the same as X2[0, 0]):


In [13]:
np.mean(X[1:, 0]), X2[0, 0]


Out[13]:
(3.75, 3.75)

Similarly, the 'median' strategy relies on the same code, just with a different strategy argument:


In [14]:
imp = Imputer(strategy='median')
X3 = imp.fit_transform(X)
X3


Out[14]:
array([[ 3.5,  0. ,  3. ],
       [ 2. ,  9. , -8. ],
       [ 1. ,  4. ,  1. ],
       [ 5. ,  2. ,  4. ],
       [ 7. ,  6. , -3. ]])

Let's make sure the median of the first column evaluates to 3.5 (the same as X3[0, 0]):


In [15]:
np.median(X[1:, 0]), X3[0, 0]


Out[15]:
(3.5, 3.5)
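Note that in recent scikit-learn versions (0.22 and later), Imputer has been removed; the equivalent class is SimpleImputer in the sklearn.impute module. A minimal sketch of the same workflow with the newer API:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[np.nan, 0, 3],
              [2, 9, -8],
              [1, np.nan, 1],
              [5, 2, 4],
              [7, 6, -3]])

# Mean imputation: each nan is replaced by its column's mean
imp_mean = SimpleImputer(strategy='mean')
X2 = imp_mean.fit_transform(X)
print(X2[0, 0])  # 3.75

# Median imputation: each nan is replaced by its column's median
imp_median = SimpleImputer(strategy='median')
X3 = imp_median.fit_transform(X)
print(X3[0, 0])  # 3.5
```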