Chapter 4: Data Preprocessing

Introduction

A good machine learning model is nothing without good data. In this chapter, Raschka goes over some common techniques for getting your data into shape for the training/testing process.

We cover three general elements of preprocessing:

  1. Removing and imputing missing data
  2. Dealing with categorical variables
  3. Feature selection
    1. Partitioning datasets
    2. Feature scaling
    3. Regularization
    4. Selecting good features

In the process, we make liberal use of the sklearn.preprocessing module, which has a lot of handy preprocessing functions built-in.

Using scikit-learn's estimator API

Most of scikit-learn's preprocessing functions make use of two essential class methods:

  • fit(): learns parameters based on sample data
  • transform(): uses learned parameters to change the values of the inputs

These two methods should look familiar: all of the classification models we've used so far have made use of them!

Scikit-learn also often includes a method that combines these two functions into one operation:

  • fit_transform(): learns parameters based on sample data, then uses those parameters to transform and return the sample data

We'll most often use fit_transform() on training data, since we usually want a transformed version of it right away. When we also need to transform test data, however, we want the finer-grained control of the separate fit() and transform() methods: fit the transformer on the training data, then apply those same learned parameters to the test data.
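
As a quick illustration, here's a minimal sketch of the pattern using StandardScaler (a feature-scaling transformer we'll meet later in this chapter); the toy values are made up for the example:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
X_test = np.array([[1.5, 250.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit() and transform() in one call
X_test_scaled = scaler.transform(X_test)        # reuse the parameters learned from the training data
```

Calling transform() (rather than fit_transform()) on the test data is what keeps the training and test sets on the same scale.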

Removing and imputing missing data

What's up with missing data?

If it's not handled properly, missing data (i.e. missing features - row values that are empty) can be a huge source of error in machine learning models.

Some common reasons that row values can be missing from a dataset:

  1. Errors in the collection process
  2. Conscious decisions by the schema designers (e.g. NULL indicates that the respondent refused to answer)
  3. The feature does not apply to the sample (e.g. conditional features)

It's important to be aware of any intended meanings behind missing values before making a decision about how to interpret them. In 2 and 3 above, for example, missing values have semantic meaning that must be interpreted in the way that the producers of the data intended.

Common tactics

In general, there are two ways of dealing with missing data:

  1. Eliminate samples (or features) that contain erroneously missing values
  2. Impute (guess) missing values based on context.

The advantage of eliminating samples or features is that it's easy and principled; the downside is that it can remove valuable information from your data, leading to a biased estimator.

Since training data is often precious and difficult to come by, we frequently prefer to impute rather than eliminate. There are two main methods for imputing missing data:

  1. Mean imputation: Substitute the mean of the feature for the row value (most common for numerical variables)
  2. Mode imputation: Substitute the mode of the feature for the row value (most common for categorical variables)

Different imputation strategies are available via scikit-learn's imputer class (sklearn.preprocessing.Imputer at the time of the book's writing; newer releases provide sklearn.impute.SimpleImputer instead).
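
Here's a minimal sketch of mean imputation (using the newer sklearn.impute.SimpleImputer name; the toy array is made up for the example):

```python
import numpy as np
from sklearn.impute import SimpleImputer  # older releases: sklearn.preprocessing.Imputer

X = np.array([[1.0, 2.0, np.nan],
              [4.0, np.nan, 6.0],
              [7.0, 8.0, 9.0]])

# Mean imputation: each NaN is replaced by the mean of its feature column
# (strategy="most_frequent" would give mode imputation instead)
imr = SimpleImputer(strategy="mean")
X_imputed = imr.fit_transform(X)
print(X_imputed)  # the NaNs become 7.5 and 5.0, the means of their columns
```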

Dealing with categorical variables

Mathematical machine learning models can only learn based on numerical features of a sample space. Yet when humans record data, we often use categorical variables - that is, variables that are defined as strings of text, like "red" or "setosa". In order to learn a model on categorical variables, we must first define some principled way of transforming those variables into numerical types.

Categorical variables come in two fundamental classes:

  • ordinal variables have a hierarchical structure (e.g. risk level - "low", "medium", and "high")
  • nominal variables have no logical ordering (e.g. ethnicity - "Cuban", "Italian", "Pacific Islander")

Mapping functions

Which class a categorical variable falls under can have important implications for how to transform it into a numerical type. In both cases, we'll define a mapping from our categorical type to a numeric type.

In the risk level example, we might define our mapping like so:

| categorical variable | numerical variable |
| --- | --- |
| low risk | 1 |
| medium risk | 2 |
| high risk | 3 |

Scikit-learn can store a mapping function in an object using the sklearn.preprocessing.LabelEncoder API. The two primary functions are:

  • LabelEncoder.fit_transform() – transforms categorical variables into integers and stores the mapping
  • LabelEncoder.inverse_transform() – inverses the mapping (turns integers back into categorical values based on the stored mapping)

For a more detailed rundown on the API, see the scikit-learn docs.
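
Here's a minimal sketch of the round trip. Note that LabelEncoder assigns integers by sorted label order, so for a truly ordinal feature like risk level you may prefer a hand-written mapping dictionary that respects the hierarchy:

```python
from sklearn.preprocessing import LabelEncoder

risk = ["low", "medium", "high", "medium", "low"]

le = LabelEncoder()
encoded = le.fit_transform(risk)      # integers by sorted label order: high=0, low=1, medium=2
print(encoded)                        # [1 2 0 2 1]
print(le.inverse_transform(encoded))  # ['low' 'medium' 'high' 'medium' 'low']
```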

The special case of nominal variables: one-hot encoding

The hierarchical nature of ordinal variables means that mapping functions often work well out of the box. But what about nominal variables, where there is no logical structure to the data? In this case, machine learning models may return undesirable results, since they will interpret meaning behind the magnitude of numeric class labels - effectively responding to information that doesn't actually exist in the source data.

To properly map nominal variables to numeric types, we can make use of a technique called one-hot encoding. One-hot encoding prevents magnitude-related errors in mapped categorical variables by using boolean types instead of integers.

In order to use a boolean type for a categorical variable, we'll have to add features onto our sample space, so-called dummy variables, that encode the presence or absence of the given variable. To see how this works, let's look at an example set of color samples:

| sample number | color |
| --- | --- |
| 1 | green |
| 2 | red |
| 3 | blue |
| 4 | red |

Here, we have four color samples, but our feature space only includes three possible values for color: green, red, and blue. (In database lingo, this is called an enumerated type.) Using one-hot encoding, we'll add three new features, with each one corresponding to the presence or absence of a given possible value for color:

| sample number | color | green | red | blue |
| --- | --- | --- | --- | --- |
| 1 | green | 1 | 0 | 0 |
| 2 | red | 0 | 1 | 0 |
| 3 | blue | 0 | 0 | 1 |
| 4 | red | 0 | 1 | 0 |

Now, our model can learn the presence or absence of a categorical variable, without erroneously interpreting a magnitude to that variable. Nice!

One concern that immediately arises out of one-hot encoding is that when the universe of possible values for a given categorical variable gets large (or, alternatively, when the number of categorical variables in our dataset gets large) we'll have to expand the dimensions of our dataset, potentially by a huge factor. However, since the overwhelming majority of values in our dummy features will be 0, we can make use of sparse matrices, a data structure specialized to store matrices with very few nonzero values.

In a sparse matrix, we define sparsity as the proportion of values that are 0. That is:

$$ sparsity = \frac{z}{m \cdot n} $$

Where $m \times n$ represents the dimensions of the matrix, and $z$ represents the number of values equal to 0.

In addition, we can define density as the proportion of nonzero values:

$$ density = 1 - sparsity $$

To make use of sparse matrices, we can use the sklearn.preprocessing.OneHotEncoder class to record our mapping function.
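
Here's a minimal sketch of one-hot encoding the color example above (assuming a scikit-learn version recent enough to accept string categories directly; older releases required integer-encoded input):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

colors = np.array([["green"], ["red"], ["blue"], ["red"]])

ohe = OneHotEncoder()                 # returns a SciPy sparse matrix by default
X_onehot = ohe.fit_transform(colors)
print(ohe.categories_)                # [array(['blue', 'green', 'red'], ...)]
print(X_onehot.toarray())             # dense view of the dummy columns
```

For data already living in a pandas DataFrame, pandas.get_dummies offers a convenient shortcut for the same transformation.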

Feature selection

For the last section of this chapter, Raschka introduces a whole set of techniques for selecting good features to feed into a model.

Partitioning datasets

When partitioning a sample space into training and test datasets, it's important to consider the proportion of samples that goes to each set. The split involves a bias/variance tradeoff: too little test data gives a noisy (high-variance) estimate of generalization error, whereas too little training data can starve the model and lead to underfitting (high bias).

Once a model reaches satisfying performance, it's usually a good idea to retrain it on the full sample space (training + test data). Training data is precious!
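
Here's a minimal sketch using scikit-learn's train_test_split on the Wine dataset bundled with the library (the 70/30 split and random seed are arbitrary choices for the example):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)

# Hold out 30% of the samples for testing; stratify keeps class proportions equal across the split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)
print(X_train.shape, X_test.shape)
```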

Feature scaling

All of the models we've covered so far, with the exception of decision trees/random forests, benefit from feature scaling, an umbrella term for techniques that alter the distribution of training/test data. Since non-tree models learn parameters by minimizing a cost function, features with wide value ranges will dominate that cost function; feature scaling counteracts this by putting all features on a comparable scale.

Different application domains use different names for feature scaling techniques. Raschka uses the terms normalization and standardization for the two techniques covered in this chapter, although he warns that these terms are overloaded.

Normalization (min-max scaling)

Also known as min-max scaling, normalization seeks to transform features such that they are all bounded by the same interval - in this case, [0, 1]. By transforming features onto a uniform interval, normalization can help avoid the case where wide-ranged features dominate a model's cost function.

For a given sample $i$ in the feature column $x$, we can define the normalized value $x^{(i)}_{norm}$ as the value's proportionate position in the range of the feature column:

$$ x^{(i)}_{norm} = \frac{x^{(i)} - x_{min}}{x_{max} - x_{min}} $$

Where $x_{max} - x_{min}$ represents the range.
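
A minimal sketch with MinMaxScaler (the toy column is made up; note that the min and max are learned from the training data only):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1.0], [5.0], [10.0]])
X_test = np.array([[3.0], [12.0]])

mms = MinMaxScaler()
X_train_norm = mms.fit_transform(X_train)  # learns x_min = 1 and x_max = 10 from the training column
X_test_norm = mms.transform(X_test)        # applies the same range to the test data
print(X_train_norm.ravel())  # [0.    0.444 1.   ]
print(X_test_norm.ravel())   # [0.222 1.222] -- unseen values can fall outside [0, 1]
```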

A major disadvantage of min-max scaling is that it is vulnerable to outliers: it doesn't preserve the distribution of the feature column, so if one value of a feature is very far from the majority of the data, it can over-compress the rest of the feature column, losing critical information in the process.

To preserve the distribution of input data, we turn to standardization.

Standardization

Rather than project features onto the interval [0, 1], standardization converts values to z-scores using the distribution represented by each feature. In this way, standardization transfers some important properties of the standard normal distribution to the data - hence the name "standardization". (Why wouldn't "normalization" be equally appropriate in this case, you might ask? Because the properties of the standard normal distribution that are attractive to us in this case come from its standard nature, not its normal nature.)

The properties of the standard normal distribution that are relevant here are that the mean of the feature column is set to 0, and each value is measured in units of standard deviations. Hence, we define the standardized value $x^{(i)}_{std}$ as the value's z-score with respect to the feature column:

$$ x^{(i)}_{std} = \frac{x^{(i)} - \mu_{x}}{\sigma_{x}} $$

Where $\mu_{x}$ represents the mean of the feature column, and $\sigma_{x}$ represents the standard deviation of the feature column.
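
A minimal sketch with StandardScaler (same toy training column as before; mean_ and scale_ hold the learned $\mu_{x}$ and $\sigma_{x}$):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [5.0], [10.0]])

sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)  # z-scores: (x - mean) / standard deviation
print(sc.mean_, sc.scale_)               # learned mean and standard deviation of the column
print(X_train_std.ravel())               # resulting column has mean 0 and unit variance
```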

One additional nice property of standardization is that it works well with the learning algorithms we've covered so far: since we initialize the weight vector $w$ with values equal (or extremely close) to 0, and standardized feature columns are centered at 0 with unit variance, the initial weights already sit close to the optimal solution. Because of this head start, we can often learn the optimal parameters with fewer iterations over the training data.

Selecting good features

For the remaining portion of the chapter, Raschka covers a few common tactics to address overfitting (high-variance models). Ideally, we would address overfitting by collecting more training data, but this is often impractical.

Raschka focuses on two tactics:

  1. Penalizing complexity through regularization
  2. Reducing dimensionality through feature selection

For now, we'll only touch on the basics of 2. The next chapter will focus on it exclusively.

Penalizing complexity with regularization

Let's recall from Chapter 3 that regularization is a technique that reduces variance by deliberately introducing additional bias, in the form of a penalty on large weights. When a model is high variance, regularization can be a powerful way of penalizing complexity.

In Chapter 3 we covered a particular form of regularization known as L2 Regularization. Here, Raschka focuses on L1 Regularization, which has the nice property of reducing dimensionality.

L1 Regularization is a linear form of regularization: the penalty term is proportional to the sum of the absolute values of the weights (as opposed to the sum of their squares, as in L2):

$$ \mid\mid w \mid\mid_{1} = \sum_{j=1}^{m} \mid w_{j} \mid $$

The linear property of L1 Regularization means that the model is more likely to fully eliminate features that contribute very little to the overall information gain of the model. Hence, L1 Regularization produces sparse weight vectors, and can be seen as a type of dimensionality reduction.
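
A minimal sketch of L1-regularized logistic regression on the Wine dataset (the C value and solver are arbitrary, illustrative choices; smaller C means a stronger penalty):

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

# The L1 penalty needs a compatible solver such as 'liblinear' or 'saga'
lr = LogisticRegression(penalty="l1", C=0.1, solver="liblinear")
lr.fit(X_std, y)
print(lr.coef_)  # many coefficients are driven exactly to zero
```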

Reducing dimensionality through feature selection

In addition to L1 Regularization, we can also reduce the dimensionality of a dataset by algorithmically removing features and measuring the performance lost with each removal. In set terms, we'll select a high-information subset of the original feature set, using a greedy search algorithm called Sequential Backward Selection (SBS) to find a local minimum in information loss.

A crucial component of SBS is the criterion function $J$, which measures how good a given feature subset is. There are many ways to define $J$: a simple one might be, "how accurate is the model when trained on this subset of features?" Raschka doesn't go into the particulars of common criterion functions in the text.

Once we've defined a criterion function $J$, we can find the $k$ best-performing features of a dataset with $d$ total features like so:

  1. Initialize the algorithm with $k = d$
  2. Find the feature $x^{-}$ such that: $$ x^{-} = argmax \; J(X_{k} - x), \; x \in X_{k} $$
  3. Remove $x^{-}$ from the feature set: $$ X_{k-1} = X_{k} - x^{-}; \; k := k - 1 $$
  4. If $k$ = the desired number of features, exit; otherwise, goto 2

In this way, SBS finds the local maximum performance at each iteration of the algorithm; this is why it's a greedy search and not an exhaustive search. The tradeoff is that SBS won't always find the global maximum performance, but it will find a good compromise between high performance and low search time.
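
scikit-learn doesn't ship an SBS implementation, so here's a compact sketch under simplifying assumptions: it uses test-set accuracy as the criterion $J$ (a separate validation split would be more rigorous) and KNN as an arbitrary estimator.

```python
from itertools import combinations

import numpy as np
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler


def sbs(estimator, X_train, y_train, X_test, y_test, k_features):
    """Greedy sequential backward selection down to k_features columns."""
    indices = tuple(range(X_train.shape[1]))
    while len(indices) > k_features:
        scores, subsets = [], []
        # Score every subset obtained by dropping exactly one remaining feature
        for subset in combinations(indices, len(indices) - 1):
            cols = list(subset)
            estimator.fit(X_train[:, cols], y_train)
            y_pred = estimator.predict(X_test[:, cols])
            scores.append(accuracy_score(y_test, y_pred))
            subsets.append(subset)
        # Keep the subset with the highest criterion value J
        indices = subsets[int(np.argmax(scores))]
    return indices


X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)
sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)

best = sbs(KNeighborsClassifier(n_neighbors=5),
           X_train_std, y_train, X_test_std, y_test, k_features=3)
print("Selected feature indices:", best)
```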

Feature selection with random forests

The last topic Raschka covers is a cool perk of random forests that can help us make informed decisions about removing features.

Since each individual tree in a random forest decides how to partition the sample space based on the information gain (impurity decrease) of each split, we can average the impurity decrease attributable to each feature across all the trees to obtain its feature importance. In scikit-learn, the feature importances of a dataset can be accessed through the handy class attribute sklearn.ensemble.RandomForestClassifier.feature_importances_.
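
A minimal sketch on the Wine dataset (the number of trees is an arbitrary choice for the example):

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

data = load_wine()
forest = RandomForestClassifier(n_estimators=500, random_state=1)
forest.fit(data.data, data.target)

# Rank features by their impurity-based importance, averaged over all trees
for idx in np.argsort(forest.feature_importances_)[::-1]:
    print(f"{data.feature_names[idx]:<30} {forest.feature_importances_[idx]:.3f}")
```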

One word of caution: when features are highly correlated, their importances as measured by a random forest may not capture actual importance. Hence, we should take care to use measured feature importance only as a tool in improving our model performance, not for causal inference.