Feature Engineering, Scaling, and Cross Validation

Feature Engineering

Dealing with Categorical Features

Every machine learning process starts with a data set from which you wish to extract information. Feature engineering lies at the beginning of this process. It deals with cleaning data and the extraction or creation of features from the data set in order to facilitate the prediction of a response. The tools for data cleaning we discussed in the first three chapters. Thankfully, the data sets we use in this course are mostly already cleansed but you should be aware that in practice this is what consumes a significant amount of time. As for the feature extraction part, we saw an example of it in the chapter on KNN where we downloaded share prices and "generated" features (i.e. lagged returns) out of it. In that setup all of the feature values were of numerical kind. In the chapters on logistic regression, LDA and QDA we worked with the 'Default' data set. This set had a binary categorical feature (e.g. student: yes/no). In order to work with these string values we simply used Pandas' factorize() function. Here's the code we applied:

# Factorize 'No' and 'Yes' in columns 'default' and 'student' df['defaultFac'] = df.default.factorize()[0] df['studentFac'] = df.student.factorize()[0]

For the case of binary categories this works perfectly fine. But what if the number of categories is greater than two? How should we deal with it? To illustrate this, imagine the following sample data set:


In [1]:
import pandas as pd
data = [{'price': 1390000, 'rooms': 3.5, 'dist': 5, 'fab': 'bricks'},
        {'price': 1300000, 'rooms': 4.5, 'dist': 2, 'fab': 'concrete'},
        {'price':  840000, 'rooms': 3.5, 'dist': 12, 'fab': 'wood'},
        {'price': 1400000, 'rooms': 5.5, 'dist': 9, 'fab': 'concrete'}]
data = pd.DataFrame(data)
data


Out[1]:
price rooms dist fab
0 1390000 3.5 5 bricks
1 1300000 4.5 2 concrete
2 840000 3.5 12 wood
3 1400000 5.5 9 concrete

How do we deal with the 'fab' column? Easy, you might think: we just use a numerical mapping, for example 0 = bricks, 1 = concrete, 2 = wood. This is precisely what the output of pd.factorize() would be.


In [2]:
pd.factorize(data['fab'])


Out[2]:
(array([0, 1, 2, 1], dtype=int64),
 Index(['bricks', 'concrete', 'wood'], dtype='object'))

Well, if we would follow through with this approach and feed these values into a Scikit-learn ML function, the model would make the fundamental assumption that bricks > concrete > wood. Furthermore, we have numerical values for dist (the districts of Zurich) where we know precisely that these represent categorical values (2 = Wollishofen, 5 = Industrie, 9 = Altstetten, 12 = Schwamendingen). And this - geographic or demographic jokes aside - would not make much sense (VanderPlas (2016)). Houston we've got a problem!

NOTE:

  • In the binary case with $X_i \in \{0, 1\}$ ordering is not an issue in scikit-learn.
  • Ordering is relevant for features. Class labels for response $y$ are not ordinal for the ML algorithm we discuss in this course. Therefore it doesn't matter what number we assign them, and factorization is still valid (e.g. $y \in \{\text{Product X}=0, \text{Product Y}=1, \text{Product Z}=2\}$.

As soon as we have more than 2 categories, factorizing is in most cases no longer the appropriate solution. Instead, we should make use of a method called one-hot encoding, which effectively creates extra columns with dummy variables (booleans) indicating the presence of a category with a value of 1 or 0, respectively. There are numerous ways of encoding categorical values. In this chapter we will briefly touch upon Pandas' and Scikit-Learn's tools. For those looking for a thorough tutorial please refer to Chris Moffit's (2017) Guide to Encoding Categorical Values in Python .

get_dummies()

We start with Pandas' pd.get_dummies() function, which works of course seamlessly with DataFrames and is truly easy to work with. The two caveats it has is that (a) if a column contains numbers, it will not transform them into categorical values and (b) this approach does not work in scikit-learn piepelines (Pipelines are a ML workflow management tool that we will discuss in more detail in later chapters.). Regarding (a) here's what is meant:


In [3]:
pd.get_dummies(data)


Out[3]:
price rooms dist fab_bricks fab_concrete fab_wood
0 1390000 3.5 5 1 0 0
1 1300000 4.5 2 0 1 0
2 840000 3.5 12 0 0 1
3 1400000 5.5 9 0 1 0

Only the column 'fab', which contained strings, was converted. To make get_dummies() bend to our will, we have to convert the district values into strings first. Then we can apply the conversion.


In [4]:
data['dist'] = data['dist'].astype(str)
pd.get_dummies(data)


Out[4]:
price rooms dist_12 dist_2 dist_5 dist_9 fab_bricks fab_concrete fab_wood
0 1390000 3.5 0 0 1 0 1 0 0
1 1300000 4.5 0 1 0 0 0 1 0
2 840000 3.5 1 0 0 0 0 0 1
3 1400000 5.5 0 0 0 1 0 1 0

LabelEncoder()

The alternatives in Scikit-learn are called LabelEncoder (equivalent to pd.facorize()) and OneHotEncoder() (similar to pd.get_dummies()). Their setup is a bit more abstract and cumbersome, but it also has advantages compared to Pandas' solution. For example the OneHotEncoder() can be used in Scikit-learn pipelines and it returns a sparse matrix which is highly efficient computationally. But for the sake of brefity we will not go into too much detail here but simply show for reference how these Scikit-learn functions are applied. Ultimately it is at this stage a question of preference what functions you want to use for your task. Important is just that this step is done at the very beginning of the process - even before a train-test split is applied. That way you make sure the mapping is the same for both train and test split.


In [5]:
from sklearn import preprocessing as pp

# Factorize 'fab' column (similar to pd.factorize())
le = pp.LabelEncoder()
data_le = le.fit_transform(data['fab'])
data_le


Out[5]:
array([0, 1, 2, 1])

To convert the integer class labels back into their original representation we can simply use the .inverse_transform() method.


In [6]:
le.inverse_transform(data_le)


Out[6]:
array(['bricks', 'concrete', 'wood', 'concrete'], dtype=object)

OneHotEncoder

The corresponding Scikit-learn function to Panda's get_dummies() is called OneHotEncoder from the preprocessing sublibrary.


In [7]:
# Select categorical columns
X_cat = data[['dist', 'fab']]

# One-hot-encoding of said columns
ohe = pp.OneHotEncoder(sparse=True)
ohe.fit_transform(X_cat)


Out[7]:
<4x7 sparse matrix of type '<class 'numpy.float64'>'
	with 8 stored elements in Compressed Sparse Row format>

In [8]:
print(pd.DataFrame(ohe.fit_transform(X_cat).toarray()))


     0    1    2    3    4    5    6
0  0.0  0.0  1.0  0.0  1.0  0.0  0.0
1  0.0  1.0  0.0  0.0  0.0  1.0  0.0
2  1.0  0.0  0.0  0.0  0.0  0.0  1.0
3  0.0  0.0  0.0  1.0  0.0  1.0  0.0

In [9]:
ohe.categories_


Out[9]:
[array(['12', '2', '5', '9'], dtype=object),
 array(['bricks', 'concrete', 'wood'], dtype=object)]

The output here is a sparse matrix. Computationally and from a data storage perspective this is highly efficient because the matrix contains only integers as entries. However, the print-out of this sparse matrix is less indicative. All column labels are empty. To make sense of it we have to understand the setup of this sparse matrix. Each column is a binary representation of a value. The first column indicates which of the rows had a dist=12 (only row index 2), second dist=2 (only row index 1), third dist=5, fourth dist=9, fifth fab=bricks etc.

Beyond the convoluted output a further problem of the above approach is that this still leaves us with the task to combine numeric ('price', 'rooms') and categorical output (the aforementioned sparse matrix) into one feature matrix X. To simplify this task, there is a wrapper function ColumnTransformer that facilitates this in a simple way.


In [10]:
from sklearn.compose import ColumnTransformer

# Intatiate OneHotEncoder
ohe = pp.OneHotEncoder()

# create list of columns that should be transformed,
# i.e. columns 2 and 3 are categorical and should be
# one-hot-encoded
trnsfrms_list = [('cat', ohe, [2, 3])]

# Make use of the ColumnTransformer function,
# fit/transpose df and display output
trnsfrms = ColumnTransformer(transformers=trnsfrms_list, remainder='passthrough')
X_ohe = trnsfrms.fit_transform(data)
X_ohe


Out[10]:
array([[0.00e+00, 0.00e+00, 1.00e+00, 0.00e+00, 1.00e+00, 0.00e+00,
        0.00e+00, 1.39e+06, 3.50e+00],
       [0.00e+00, 1.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 1.00e+00,
        0.00e+00, 1.30e+06, 4.50e+00],
       [1.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00,
        1.00e+00, 8.40e+05, 3.50e+00],
       [0.00e+00, 0.00e+00, 0.00e+00, 1.00e+00, 0.00e+00, 1.00e+00,
        0.00e+00, 1.40e+06, 5.50e+00]])

Here's a short overview of the discussed functions and how they relate to each other:

No. of Categories Applicable Functions
2 pd.factorize(), sklearn.preprocessing.LabelEncoder()
> 2 pd.get_dummies(), sklearn.preprocessing.OneHotEncoder()

Incorrect class labeling is one of the more common mistake made in machine learning. The crux of the matter is that even if we forget to One-Hot-Encode our features, ML algorithms might still yield good results. Yet, as we know now, such results would be flawed. Therefore it is important to correctly preprocess the data set before applying any ML algorithms.

Partitioning a Data Set Into Train, Test Sets

In section 4.5 of the script we have briefly touched upon the importance of randomly splitting our data into a training set (to train/calibrate our ML algorithms) and a test set (to evaluate the accuracy of our model). As Pedregosa et al. (2011) put it in the scikit-learn documentation: "Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: a model that would just repeat the labels of the samples that it has just seen would have a perfect score but would fail to predict anything useful on yet-unseen data. This situation is called overfitting. To avoid it, it is common practice when performing a (supervised) machine learning experiment to hold out part of the available data as a test set X_test, y_test." The split ratio is somewhat arbitrary but literature suggests a range between 60:40 (train:test) and 80:20 for smaller samples and 90:10 to 99:1 for large sets (several thousands of observations) (Raschka (2015)).

In its model_selection submodule scikit-learn offers a convenient function called train_test_split that randomly splits a data set into separate train and holdout sets. To show how this function is applied we first load the publicly available 'adult' data set. It includes 14 features from the 1994 US census that measure an individual's characteristics. The response vector is income. The prediction task is to determine whether a person earns over 50K a year. For more information see the data description. This is of course a fairly simple data set but will do for our purposes. Nevertheless, in reality related tasks such as this are fairly common for credit card companies, leasing firms or banks that need to verify a customer's application for a credit (card), leasing, loan etc. (One could design this either as regression task or classification task with an adequate number of classes).

Before we apply the train_test_split function we convert categorical columns in a first step - as learned above.


In [11]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [12]:
df = pd.read_csv('Data/adult.csv', sep=',')
df.head(3)


Out[12]:
age fnlwgt education-num capital-gain capital-loss hours-per-week workclass education marital-status occupation relationship race native-country sex income
0 25 226802 7 0 0 40 Private 11th Never-married Machine-op-inspct Own-child Black United-States Male <=50K
1 38 89814 9 0 0 50 Private HS-grad Married-civ-spouse Farming-fishing Husband White United-States Male <=50K
2 28 336951 12 0 0 40 Local-gov Assoc-acdm Married-civ-spouse Protective-serv Husband White United-States Male >50K

The first six columns are numeric values. The remaining columns are categorical information and thus need to be converted.


In [13]:
# Get column names
cols = df.columns.values[6:]

# Factorize 'sex' and 'income' column (both binary)
df[cols[-2:]] = df[cols[-2:]].apply(lambda x: pd.factorize(x)[0])
df.head(3)


Out[13]:
age fnlwgt education-num capital-gain capital-loss hours-per-week workclass education marital-status occupation relationship race native-country sex income
0 25 226802 7 0 0 40 Private 11th Never-married Machine-op-inspct Own-child Black United-States 0 0
1 38 89814 9 0 0 50 Private HS-grad Married-civ-spouse Farming-fishing Husband White United-States 0 0
2 28 336951 12 0 0 40 Local-gov Assoc-acdm Married-civ-spouse Protective-serv Husband White United-States 0 1

In [14]:
# Assign response to y
y = df[cols[-1]]

# Factorize categorical values, assign output to X
X = pd.get_dummies(df.iloc[:, :-1])
X.head()


Out[14]:
age fnlwgt education-num capital-gain capital-loss hours-per-week sex workclass_ Federal-gov workclass_ Local-gov workclass_ Never-worked ... native-country_ Puerto-Rico native-country_ Scotland native-country_ South native-country_ Taiwan native-country_ Thailand native-country_ Trinadad&Tobago native-country_ United-States native-country_ Vietnam native-country_ Yugoslavia native-country_unknown
0 25 226802 7 0 0 40 0 0 0 0 ... 0 0 0 0 0 0 1 0 0 0
1 38 89814 9 0 0 50 0 0 0 0 ... 0 0 0 0 0 0 1 0 0 0
2 28 336951 12 0 0 40 0 0 1 0 ... 0 0 0 0 0 0 1 0 0 0
3 44 160323 10 7688 0 40 0 0 0 0 ... 0 0 0 0 0 0 1 0 0 0
4 18 103497 10 0 0 30 1 0 0 0 ... 0 0 0 0 0 0 1 0 0 0

5 rows × 107 columns


In [15]:
X.shape


Out[15]:
(48842, 107)

Now we are ready to perform the train-test split. For this example we use a test set size of 30%. The parameter random_state=0 fixes the random split in a way such that results are reproducible.


In [16]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)

A stratified sample is one that maintains the proportion of values as in the original data set. If, for example, the response vector $y$ is a binary categorical variable with 25% zeros and 75% ones, stratify=y ensures that the random splits have 25% zeros and 75% ones too. Note that stratify=y does not mean stratify=yes but rather tells the function to take the categorical proportions from response vector y.

Feature Scaling

Feature scaling is a crucial step in preparing data for ML applications. While some of the algorithms are invariant to the feature's scale (e.g. decision trees and random forests, which we will discuss in the next chapter), the majority of machine learning and optimization algorithms perform much better if features are scaled. The reasons are two-fold: for one, most ML algorithm have optimization functions to find the optimal coefficients/hyperparameter and these functions work more efficient on scaled values; for another, algorithms such as KNN, which use a distance measure, will put much more weight on features that have a larger scale than others (Raschka (2015), Müller and Guido (2017)). A good example for the latter is a data set that contains 'age' as well as 'income' variables. For a KNN model a difference in $1'000 salary is enormous compared to a difference of 10 years in age. Thus such a model would be driven by 'income' - and this would be contrary to our intuition (James et al. (2013)).

There are two common approaches to bringing different features onto the same scale: normalization and standardization. Unfortunately these terms are often used quite loosely in different fields so that the meaning has to be derived from the context they are mentioned (Raschka (2015)).

Normalization

In general, normalization refers to the process of rescaling the features to a range of $[0, 1]$. This can be viewed as a special case of min-max scaling. In our context, we normalize a feature column $X_i$ by the following convention:

$$\begin{equation} X_{i}^{\text{norm}} = \frac{X_i - \min(X_i)}{\max(X_i) - \min(X_i)} \end{equation}$$

With Scikit-learn this is applied as follows:


In [17]:
from sklearn.preprocessing import MinMaxScaler 

# Get cols to scale
cols_scl = X.columns.values[:6]

# Apply MinMaxScaler on continuous columns only
mms = MinMaxScaler()
X_train_norm = mms.fit_transform(X_train[cols_scl])  # fit & transform
X_test_norm  = mms.transform(X_test[cols_scl])  # ONLY transform

It is important to point out that we fit the MinMaxScaler only once on the training data, not the test data. On the test data we only run the .transform() method. The parameters we get out of the training set (i.e. mms.data_min_ and mms.data_max_) are then used to transform any new data (such as the holdout set).

Why aren't we applying the scaling on the full data set, before we split the data into train and test set? This would not be a good idea because of one particular reason: we would "contaminate" our model with information about the test set. If we were to do it we run the risk of having a very good model for our train AND test data but this model would perform poorly on new data that previously was not part of any train or test set. The effect of such an approach would be poor generalization. By the way, the same is true for feature selection. This too should not be done on the full set but only on a training set.

Standardization

When we need values in a bounded interval, normalization via min-max scaling is a useful technique. Yet for many machine learning algorithms standardization is a more practical approach. The rationale behind it is that most linear models (such as logistic regression, LDA or support vector machines) initialize coefficients/parameters/weights to small random values close to 0. By standardizing our features, we center the values at mean 0 with standard deviation of 1. This makes the computations easier to learn the coefficients/parameters/weights. Beyond that, standardization does not distort useful statistical information about outliers and with that makes the algorithm less sensitive to them in contrast to min-max scaling (Raschka (2015)). We express the process of standarizing a feature column by the following equation:

$$\begin{equation} X_{i}^{\text{std}} = \frac{X_i - \bar{X}_i}{\sigma_{X_i}} \end{equation}$$

Here we follow the common notation that $\bar{X}_i$ is the mean of vector $X_i$ and $\sigma_{X_i}$ the vector's standard deviation.


In [18]:
from sklearn.preprocessing import StandardScaler 

# Apply StandardScaler on continuous columns only
stdsc = StandardScaler()
X_train_std = stdsc.fit_transform(X_train[cols_scl])  # fit & transform
X_test_std  = stdsc.transform(X_test[cols_scl])  # ONLY transform

Again we fit the StandardScaler() only on the train set. The parameters of this first step (stdsc.mean_, stdsc.var_) are then applied to the test set with the code stdsc.transform(X_test). Note that the variance in StandardScaler() is based on the population variance (division by $n$; degree of freedom of 0, biased estimator). This explains differences if you compare stdsc.var_ with X_train.var() for small data sets. Use X_train.var(ddof=1) instead to get the same result as with the StandardScaler.

Other Scaling Methods

For our purposes the above introduced two scaling methods are sufficient. However, this is not to say others are not helpful and unnecessary. Scikit-learn of course offers many more scaling procedures. For an overview see Scikit-learn's guide on the topic. For those who like to see the effects of the different scaling methods in 2D plots, see Scikit-learn's tutorial "Compare the effect of different scalers on data with outliers". Finally, Sebastian Raschka published a text "About Feature Scaling and Normalization" that not only discusses the different scaling methods but also shows the positive effect scaling can have on ML predictions/scores. Please give it a read to understand the importance of scaling.

Cross Validation

Model Evaluation

To evaluate our models, we have discussed the necessity to split our data set into training and test set. For this we introduced the train_test_split() function. A ML algorithm is calibrated on the training set and then, to see how well our model generalizes to new, previously unseen data, applied to the test data. In this section we expand on aspects of evaluating a model by discussing cross validation (CV). This is a statistical method of performance evaluation that is more stable and thorough than a simple split into training and test set (Müller and Guido (2017)).

$k$-Fold Cross Validation

In $k$-fold CV we randomly split the training data set into $k$ folds without replacement, where $k − 1$ folds are used for the model training and the remaining fold is used for testing. This procedure is repeated $k$ times so that we obtain $k$ performance estimates. The performance of a ML algorithm is then simply the average of the performance for the $k$ different, individual folds.

  • What is the benefit of $k$-Fold Cross Validation? Performance estimates are less sensitive to the subpartitioning of the training data compared to a simple train/test split (Raschka (2015)). By applying it, we receive a more stable indication of a model's generalization performance.
  • How many folds $k$ should we use? The number of folds $k$ is usually 5 (large data sets) or 10 (small data sets) as suggested by Breiman and Spector (1992) or Kohavi (1995).

In general, CV is best used in combination with hyperparameter tuning. Raschka (2015, p. 175) writes:

"Typically, we use k-fold cross-validation for model tuning, that is, finding the optimal hyperparameter values that yield a satisfying generalization performance. Once we have found satisfactory hyperparameter values, we can retrain the model on the complete training set and obtain a final performance estimate using the independent test set."

The following figure displays the concept of $k$-fold CV with $k=5$. The training data set is divided into 5 folds. For each iteration 4 folds are used for training and 1 fold for model evaluation. The estimated performance $E$, this could be classification accuracy, an error rate etc., is calculated on the basis of the five performances $E_i$ (one for each iteration).

Despite the fact that many text books are applying CV on the full data set, students should be aware that in order to be truly consistent, $k$-fold CV should essentially be applied to the training data only. Similar to our discussion on feature scaling, the argument is again that if we use the full available data set, our results are somewhat contaminated or distorted.

Below we show how to use Scikit-learn to perform $k$-fold CV. The data is our 'adult' set from before and the algorithm we apply is logistic regression. Note here that we will use Scikit-learn's implementation of logistic regression (see here for the max_iter parameter). We start with the StratifiedKFold() function. This is a slight variation of the standard CV introduced above. Stratifying in this context means that proportions between classes are the same in each fold. To evaluate classifiers this StratifiedKFold() function is usually preferred (over the simple KFold() function) as it preserves the class proportions and thus results in more reliable estimates of generalization performance.


In [27]:
# Import necessary functions
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression

# Create k-Fold CV and LogReg object
kFold = StratifiedKFold(n_splits=5)
logReg = LogisticRegression(max_iter=1000)

# Run CV and print results
scores = cross_val_score(logReg, X_train_std, y_train, cv=kFold)
print(scores)
print('CV accuracy on train set: {0: .3f} +/- {1: .3f}'.format(np.mean(scores), np.std(scores)))


[0.81281076 0.8157356  0.81544311 0.81705177 0.81439228]
CV accuracy on train set:  0.815 +/-  0.001

As a comparison we can also run a test on the unscaled data set X_train. The results are indeed worse as literature suggests.


In [28]:
# Run CV on unscaled values and print results
scores = cross_val_score(logReg, X_train, y_train, cv=kFold)
print('CV accuracy on train set: {0: .3f} +/- {1: .3f}'.format(np.mean(scores), np.std(scores)))


CV accuracy on train set:  0.803 +/-  0.011

By default, cross_val_score returns the accuracy (or score) of the model. Recall that this is simply the number of correctly classified samples. If we instead wish to have another output, we do this by providing a scoring parameter. Below is an example to call for scoring='roc_auc', which will provide the ROC's area under the curve. The full list of available scoring parameter can be found here.


In [21]:
scores = cross_val_score(logReg, X_train_std, y_train, cv=kFold, scoring='roc_auc')
print('CV AUC on train set: {0: .3f} +/- {1: .3f}'.format(np.mean(scores), np.std(scores)))


CV AUC on train set:  0.829 +/-  0.005

Noteworthy is one last function: cross_validate. This allows us to define a list of measures that we can pass on to the function as scoring parameter. The output is a dictionary with fit_time and score_time (time elapsed to fit model/calculate scores), test_accuracy and train_accuracy (accuracy for the test set and training set, the output on the latter though only if return_train_score is manually set to True), etc.


In [22]:
from sklearn.model_selection import cross_validate

# Calculate return
measures = ['accuracy', 'recall', 'roc_auc']
scores = cross_validate(logReg, X_train_std, y_train, cv=kFold, 
                        scoring=measures, return_train_score=True, n_jobs=2)
scores


Out[22]:
{'fit_time': array([0.04886413, 0.04886413, 0.05485201, 0.05327678, 0.05185699]),
 'score_time': array([0.01197195, 0.01197195, 0.01238942, 0.01197052, 0.0129683 ]),
 'test_accuracy': array([0.81281076, 0.8157356 , 0.81544311, 0.81705177, 0.81439228]),
 'train_accuracy': array([0.81514387, 0.81488794, 0.81488794, 0.81503419, 0.81562591]),
 'test_recall': array([0.39401344, 0.400978  , 0.38508557, 0.38264059, 0.38569682]),
 'train_recall': array([0.38722494, 0.3854851 , 0.39174943, 0.3908327 , 0.39251337]),
 'test_roc_auc': array([0.83266176, 0.83483989, 0.82360967, 0.832222  , 0.82166198]),
 'train_roc_auc': array([0.82857395, 0.8277628 , 0.83060254, 0.82834432, 0.83085211])}

In [23]:
print('Train set accuracy (CV=5):    ', scores['train_accuracy'].mean())
print('Validation set scores (CV=5): ', scores['test_accuracy'].mean())
print('Test set accuracy:            ', logReg.fit(X_test_std, y_test).score(X_test_std, y_test))


Train set accuracy (CV=5):     0.8151159692448978
Validation set scores (CV=5):  0.8150867034886609
Test set accuracy:             0.8136900293455265

A great feature available in all discussed CV functions is the possibility to set the number of CPUs to use to do the computation on. For this define n_jobs=n, where n is the number of CPU's you want to use. n_jobs=-1 will use all available cores. This parallelization is especially helpful if you work on large data sets and/or computationally expensive tasks such as CV. If you want to know how many CPUs your computer runs on you can type the following code:


In [24]:
import multiprocessing
multiprocessing.cpu_count()


Out[24]:
4

Leave-One-Out Cross Validation

If we increase the number of folds to $n-1$ ($n$ = number of observations), that is, we train on all points but one in each trial, we call this Leave-One-Out CV or LOOCV. This can be seen as an extreme case of $k$-fold CV. On small data sets this might provide better estimates, but it can be very time consuming, particularly on larger data sets.


In [25]:
from sklearn.model_selection import LeaveOneOut

# Create objects
loocv = LeaveOneOut()
#logReg = LogisticRegression()

# Calculate & print scores
scores = cross_val_score(logReg, X_train_std, y_train)
print('LOOCV accuracy on train set:', np.mean(scores))


LOOCV accuracy on train set: 0.8150867034886609

Other Cross Validation Approaches

While we limit ourselves to the two most known and used splitting strategies, Scikit-learn offers more than just the two presented here. A good overview is provided in the packages' tutorial on cross validation. However, as an introduction to the topic and given the use cases in this seminar the above two serve the purpose well.

Further Ressources

In writing this notebook, many ressources were consulted. For internet ressources the links are provided within the textflow above and will therefore not be listed again. Beyond these links, the following ressources were consulted and are recommended as further reading on the discussed topics:

  • Breiman, Leo, and and Philip Spector, 1992, Submodel selection and Evaluation in Regression: the X-Random Case, International Statistical Review 60: 291–319.
  • Friedman, Jerome, Trevor Hastie, and Robert Tibshirani, 2001, The Elements of Statistical Learning (Springer, New York, NY).
  • James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani, 2013, An Introduction to Statistical Learning: With Applications in R (Springer Science & Business Media, New York, NY).
  • Kohavi, Ron, 1995, A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection, in International Joint Conference on Artificial Intelligence (IJCAI), 1137-145, Stanford, CA
  • Müller, Andreas C., and Sarah Guido, 2017, Introduction to Machine Learning with Python (O’Reilly Media, Sebastopol, CA).
  • Raschka, Sebastian, 2015, Python Machine Learning (Packt Publishing Ltd., Birmingham, UK)