Advanced Validation

Analyzing the data


In [1]:
import numpy as np
import pandas as pd
%matplotlib inline
from matplotlib import pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn.svm import SVR
from sklearn.ensemble import GradientBoostingRegressor

We'll use one of the toy datasets that scikit-learn has available, since the focus of this example is to show how to use the validation tools and not how to deal with a "raw" dataset. We'll use the Boston House Prices dataset, which has the median value of owner-occupied homes as target and 13 attributes, ranging from per capita crime rate to pupil-teacher ratio. The first thing we will do is load the dataset and check its shape and the values for one of the samples.


In [2]:
boston = datasets.load_boston()
boston.data.shape, boston.target.shape


Out[2]:
((506, 13), (506,))

In [3]:
print(boston.data[0])

print(boston.target[0])


[  6.32000000e-03   1.80000000e+01   2.31000000e+00   0.00000000e+00
   5.38000000e-01   6.57500000e+00   6.52000000e+01   4.09000000e+00
   1.00000000e+00   2.96000000e+02   1.53000000e+01   3.96900000e+02
   4.98000000e+00]
24.0

As we can see, we have 506 samples, with a part named data, which holds the 13 attributes, and a part named target, which holds the target price for each of those sets of 13 attributes.
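
For reference, the names of the 13 attributes are bundled with the dataset (the full descriptions live in boston.DESCR); a quick way to inspect them is the following sketch:


In [ ]:
# Print the names of the 13 attributes (per capita crime rate, rooms per
# dwelling, pupil-teacher ratio, etc.)
print(boston.feature_names)
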

We will now proceed to train some regression models in order to predict the price of a house given the 13 attributes, and introduce methods of validating the results in order to obtain a model able to generalize to unseen data.


Simple cross-validation on a test set

The logic of simple cross-validation is to train several models (different algorithms as well as different parameter settings) and in the end choose the one that yields the best score on a held-out test set.

We will start to tackle this problem by splitting the dataset into four different arrays. X_train and y_train hold the attributes and target prices for the subset of samples used to train the model, and X_test and y_test hold the attributes and target prices for the remaining samples, used to verify the accuracy of the model. This can be done with scikit-learn's train_test_split function. In this example we will use 30% of the dataset as a test set.


In [4]:
X_train, X_test, y_train, y_test = train_test_split(
    boston.data, boston.target, test_size=0.3, random_state=0)

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)


(354, 13) (354,)
(152, 13) (152,)

Next we will train models to solve this problem. There are several models that can be used, and at the beginning it is hard to have a feel for which ones suit each task best. A good way to start is by following the scikit-learn algorithm cheat-sheet. From what we know about our dataset (predicting a quantity, fewer than 100k samples and 13 features that intuitively seem important), it's reasonable to start with a Support Vector Regressor.

Support Vector Regressor

Once again, since the focus of this notebook is on validation, we will train this model in a very straightforward way, without the attention to parameter tuning it would normally deserve, and validate it only on the test set, to later highlight the importance of proper cross-validation.


In [5]:
clf = SVR(kernel='linear')

clf.fit(X=X_train, y=y_train)


Out[5]:
SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto',
  kernel='linear', max_iter=-1, shrinking=True, tol=0.001, verbose=False)

In [6]:
print("SVRegressor score is: {}".format(clf.score(X_test, y_test)))


SVRegressor score is: 0.6168014926864142

So we obtained an R² score of roughly 0.62 with this model (for regressors, .score() returns the coefficient of determination R², not a percentage of correct predictions). It is not an awful number, but we can do better. Let's then try an ensemble regressor, as the cheat-sheet suggests. In this case, we're going to use a Gradient Boosting Regressor.
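
As a side note, the R² returned by .score() can be reproduced by hand from the test-set predictions; a minimal sketch (using sklearn.metrics.r2_score for comparison):


In [ ]:
from sklearn.metrics import r2_score

# R^2 = 1 - SS_res / SS_tot: one minus the ratio between the residual sum of
# squares and the total sum of squares around the mean of y_test
y_pred = clf.predict(X_test)
ss_res = ((y_test - y_pred) ** 2).sum()
ss_tot = ((y_test - y_test.mean()) ** 2).sum()

print(1 - ss_res / ss_tot)        # manual computation
print(r2_score(y_test, y_pred))   # same value from scikit-learn
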

Gradient Boosting Regressor


In [7]:
clg = GradientBoostingRegressor(random_state=0)

clg.fit(X=X_train, y=y_train)


Out[7]:
GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.1, loss='ls', max_depth=3, max_features=None,
             max_leaf_nodes=None, min_impurity_split=1e-07,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, n_estimators=100,
             presort='auto', random_state=0, subsample=1.0, verbose=0,
             warm_start=False)

In [8]:
print("Gradient Boosting Regressor score is: {}".format(clg.score(X_test, y_test)))


Gradient Boosting Regressor score is: 0.8465809732594964

The base model was able to achieve an R² score of roughly 0.85, which we can consider a good result. Let's now modify some parameters to see whether the results improve.


In [9]:
clg = GradientBoostingRegressor(max_depth=2, random_state=0)

clg.fit(X=X_train, y=y_train)

print("Gradient Boosting Regressor score is: {}".format(clg.score(X_test, y_test)))


Gradient Boosting Regressor score is: 0.7624149573168991

In [10]:
clg = GradientBoostingRegressor(learning_rate=0.2, random_state=0)

clg.fit(X=X_train, y=y_train)

print("Gradient Boosting Regressor score is: {}".format(clg.score(X_test, y_test)))


Gradient Boosting Regressor score is: 0.8474008501264455

So in the end we trained four models and validated them on a test set, with the following test-set scores:

  1. Support Vector Regressor: 0.6168
  2. Gradient Boosting Regressor: 0.8465
  3. Gradient Boosting Regressor(max_depth=2): 0.7624
  4. Gradient Boosting Regressor(learning_rate=0.2): 0.8474

Based on this simple cross-validation method, the model we would choose is the Gradient Boosting Regressor with learning_rate set to 0.2.
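
The same selection procedure can be written compactly as a loop over candidate models; a minimal sketch (repeating the fits above and keeping the best test-set score):


In [ ]:
# Fit each candidate on the training split and keep the best test-set score.
candidates = {
    "SVR (linear)": SVR(kernel='linear'),
    "GBR (default)": GradientBoostingRegressor(random_state=0),
    "GBR (max_depth=2)": GradientBoostingRegressor(max_depth=2, random_state=0),
    "GBR (learning_rate=0.2)": GradientBoostingRegressor(learning_rate=0.2, random_state=0),
}

best_name, best_score = None, -np.inf
for name, model in candidates.items():
    score = model.fit(X_train, y_train).score(X_test, y_test)
    print("{}: {:.4f}".format(name, score))
    if score > best_score:
        best_name, best_score = name, score

print("Best model: {}".format(best_name))
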

Notice that to validate our models we had to set aside 30% of the available data. In a world in which we can never have enough data to train models, this kind of method can be costly.


K-fold Cross Validation

Using a technique called k-fold cross-validation, we can get the same end result without permanently sacrificing samples from the training split. The procedure is to split the dataset into k folds, and iteratively use k-1 folds to train a model and the remaining fold to validate it. The average of those k scores is then the performance measure of the model. This is done in scikit-learn with the cross_val_score function. In this case we will use 5-fold cross-validation on the same models we trained in the simple cross-validation section.

An auxiliary visualization of the method is the following:

Image Source: https://static.oschina.net/uploads/img/201609/26155106_OfXx.png

After this procedure we can average the scores and also obtain a confidence interval for them. The model that yields the highest average score will be the chosen one.
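
Before using the convenience function, it may help to see roughly what it does internally; a minimal sketch with KFold (for regressors, cross_val_score with cv=5 effectively uses an unshuffled 5-fold split like this one):


In [ ]:
from sklearn.model_selection import KFold

# Split the data into 5 folds; train on 4 folds and score on the held-out
# fold each time, then average the 5 scores.
kf = KFold(n_splits=5)
fold_scores = []
for train_idx, test_idx in kf.split(boston.data):
    model = GradientBoostingRegressor(random_state=0)
    model.fit(boston.data[train_idx], boston.target[train_idx])
    fold_scores.append(model.score(boston.data[test_idx], boston.target[test_idx]))

print(fold_scores)
print(np.mean(fold_scores))
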


In [11]:
from sklearn.model_selection import cross_val_score

Support Vector Regressor


In [12]:
clf = SVR(kernel='linear')

scores = cross_val_score(clf, boston.data, boston.target, cv=5)

We can then print the individual scores, the mean score, and an approximate 95% confidence interval for it (mean ± two standard deviations).


In [13]:
print("Array of scores is :{}\n".format(scores))

print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))


Array of scores is :[ 0.77328953  0.72833447  0.53795481  0.15209389  0.07729196]

Accuracy: 0.45 (+/- 0.58)

Gradient Boosting Regressor


In [14]:
clg = GradientBoostingRegressor(random_state=0)

scores = cross_val_score(clg, boston.data, boston.target, cv=5)

print("Array of scores is :{}\n".format(scores))

print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))


Array of scores is :[ 0.78572257  0.85705946  0.74913553  0.56263681  0.39369023]

Accuracy: 0.67 (+/- 0.34)

In [15]:
clg = GradientBoostingRegressor(max_depth=2, random_state=0)

scores = cross_val_score(clg, boston.data, boston.target, cv=5)

print("Array of scores is :{}\n".format(scores))

print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))


Array of scores is :[ 0.8083634   0.88500085  0.74621128  0.53904834  0.51805081]

Accuracy: 0.70 (+/- 0.29)

In [16]:
clg = GradientBoostingRegressor(learning_rate=0.2, random_state=0)

scores = cross_val_score(clg, boston.data, boston.target, cv=5)

print("Array of scores is :{}\n".format(scores))

print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))


Array of scores is :[ 0.78911916  0.79977633  0.72215222  0.60049527  0.41089731]

Accuracy: 0.66 (+/- 0.29)

As we can see, the scenario is now very different. Looking at the individual scores, it is easy to verify that the particular test set used can strongly influence the final score. A method such as k-fold cross-validation is much more robust to this kind of variability.

Let's list all the results and check which one did best:

  1. Support Vector Regressor: 0.45 (+/- 0.58)
  2. Gradient Boosting Regressor: 0.67 (+/- 0.34)
  3. Gradient Boosting Regressor(max_depth=2): 0.70 (+/- 0.29)
  4. Gradient Boosting Regressor(learning_rate=0.2): 0.66 (+/- 0.29)

According to k-fold cross-validation, the Gradient Boosting Regressor with max_depth set to 2 is the best model (highest score with lowest uncertainty) of the ones we tested.

Shuffle Split

Another cross-validation strategy is to use random splits instead of fixed splits, with the ShuffleSplit class. With random splits (in which the test samples are drawn at random from the dataset every time a split is generated) it is not guaranteed that the splits will all be different, although this can be assumed if the dataset is big enough.
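
To get a feel for what ShuffleSplit generates, here is a small sketch printing the first few test indices of two random splits (the exact indices depend on random_state):


In [ ]:
from sklearn.model_selection import ShuffleSplit

# Each split draws a fresh random 30% test subset, so different splits
# may overlap and the same sample can appear in several test sets.
demo_cv = ShuffleSplit(n_splits=2, test_size=0.3, random_state=1)
for train_idx, test_idx in demo_cv.split(boston.data):
    print(test_idx[:10])
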

Let's see how this influences the results for the models we have been trying.

Support Vector Regressor


In [17]:
from sklearn.model_selection import ShuffleSplit

In [18]:
clf = SVR(kernel='linear')

cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=1)

scores = cross_val_score(clf, boston.data, boston.target, cv=cv)

print("Array of scores is :{}\n".format(scores))

print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))


Array of scores is :[ 0.78569463  0.59163902  0.74280836  0.69801539  0.64550047]

Accuracy: 0.69 (+/- 0.14)

Gradient Boosting Regressor


In [19]:
clg = GradientBoostingRegressor(random_state=0)

cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=1)

scores = cross_val_score(clg, boston.data, boston.target, cv=cv)

print("Array of scores is :{}\n".format(scores))

print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))


Array of scores is :[ 0.92216968  0.8710234   0.91212971  0.89990485  0.81722107]

Accuracy: 0.88 (+/- 0.08)

In [20]:
clg = GradientBoostingRegressor(max_depth=2, random_state=0)

cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=1)

scores = cross_val_score(clg, boston.data, boston.target, cv=cv)

print("Array of scores is :{}\n".format(scores))

print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))


Array of scores is :[ 0.90274258  0.84053833  0.90234705  0.8669969   0.84641735]

Accuracy: 0.87 (+/- 0.05)

In [21]:
clg = GradientBoostingRegressor(learning_rate=0.2, random_state=0)

cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=1)

scores = cross_val_score(clg, boston.data, boston.target, cv=cv)

print("Array of scores is :{}\n".format(scores))

print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))


Array of scores is :[ 0.92269094  0.88621864  0.91655039  0.89897041  0.82202122]

Accuracy: 0.89 (+/- 0.07)

The scores obtained for the Gradient Boosting Regressor are much higher than with the fixed splits. This may mean that the dataset has some contiguous portions on which the model does not work well, and with fixed splits those portions keep lowering the scores. Shuffling the data gives a more varied distribution of samples in each fold and lets each fold resemble the dataset as a whole more closely.

In this case the results are similar for the three Gradient Boosting models tested, and any of them would be a good choice according to these values.
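
One way to probe this intuition without ShuffleSplit is to keep five disjoint folds but shuffle the row order first, using KFold with shuffle=True; a minimal sketch (if fixed, contiguous folds were the problem, these scores should land closer to the ShuffleSplit ones):


In [ ]:
from sklearn.model_selection import KFold

# Same five disjoint folds as plain cv=5, but the rows are shuffled first,
# so each fold is a more representative sample of the whole dataset.
shuffled_cv = KFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(GradientBoostingRegressor(random_state=0),
                         boston.data, boston.target, cv=shuffled_cv)

print("Array of scores is :{}\n".format(scores))
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
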

Pipeline

When training and validating a model it is important to make sure that the same preprocessing and data transformations are applied to both the training and the test/validation data, with the transformation parameters learned from the training data only. In scikit-learn, pipelines are an easy way to take care of this step.


In [22]:
from sklearn import preprocessing

X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size=0.3, random_state=0)

scaler = preprocessing.StandardScaler().fit(X_train)

X_train_transformed = scaler.transform(X_train)

clg = GradientBoostingRegressor(random_state=0).fit(X_train_transformed, y_train)

X_test_transformed = scaler.transform(X_test)

print("Accuracy: {}".format(clg.score(X_test_transformed, y_test)))


Accuracy: 0.846708961938753

Now, if we want to extend this to k-fold cross-validation, instead of doing it "manually" for each fold, we can use make_pipeline, which applies the preprocessing consistently inside every fold.


In [23]:
from sklearn.pipeline import make_pipeline

clg = make_pipeline(preprocessing.StandardScaler(), GradientBoostingRegressor(random_state=0))
scores = cross_val_score(clg, boston.data, boston.target, cv=5)

print("Array of scores is :{}\n".format(scores))
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

