Here we consider a class of methods that estimate the test error (the average error that results from using a statistical learning method to predict the response on a new observation) by holding out a subset of the training observations from the fitting process, and then applying the statistical learning method to those held out observations.
We use the Auto dataset from the UCI Machine Learning Repository.
Specifically, we will use the horsepower feature (the car's engine horsepower) to predict its fuel consumption (the mpg feature).
In [1]:
import pandas as pd
In [2]:
autoData = pd.read_csv('../datasets/Auto.csv', na_values='?')
In [3]:
autoData.shape
Out[3]:
(397, 9): 397 examples, each with 9 features.
In [4]:
autoData.columns
Out[4]:
We are interested in the mpg and the horsepower features.
In [5]:
autoData.head(5)
Out[5]:
In [6]:
autoData.info()
In [7]:
autoData.isnull().values.any()
Out[7]:
There are NaNs. We remove those observations before proceeding.
In [8]:
autoData[autoData.isnull().any(axis=1)]
Out[8]:
In [9]:
autoData.dropna(axis=0, inplace=True) # Drop any row that has NaNs within it
In [10]:
autoData[autoData.isnull().any(axis=1)]
Out[10]:
In [11]:
# horsepower feature is our input X
X = autoData[['horsepower']]
X.shape
Out[11]:
In [12]:
X.describe()
Out[12]:
In [13]:
m = X.shape[0] # number of examples
m
Out[13]:
In [14]:
# mpg is our output Y, what we want to predict
Y = autoData.mpg.copy() # copy the mpg values out as our target Y
len(Y)
Out[14]:
In [15]:
import matplotlib.pyplot as plt
%matplotlib inline
In [16]:
fig = plt.scatter(X, Y)
fig = plt.grid(True)
fig = plt.xlabel('Horsepower')
fig = plt.ylabel('Miles per Gallon (mpg)')
In [17]:
from sklearn import linear_model
In [18]:
# Create the linear regression model
model = linear_model.LinearRegression()
In [19]:
model.fit(X, Y)
Out[19]:
These are the model's coefficients:
In [20]:
model.intercept_
Out[20]:
In [21]:
model.coef_
Out[21]:
The model looks like $\text{mpg} \approx 39.9 - 0.15 \cdot \text{horsepower}$.
Let's display the linear regression line in the previous chart:
In [22]:
y_plot = model.predict([[40],[230]])
y_plot
Out[22]:
In [23]:
fig = plt.scatter(X, Y)
fig = plt.grid(True)
fig = plt.xlabel('Horsepower')
fig = plt.ylabel('Miles per Gallon (mpg)')
fig = plt.plot([40,230], y_plot, '-', color='darkred', linewidth=2)
We can already see that a simple linear regression model will not be very accurate: a curve would probably approximate the real fuel consumption better than a straight line.
Let's measure the accuracy using the hold-out (validation set) approach.
Here we randomly divide the available set of observations into two parts: a training set and a test (hold-out) set containing 20% of the entire dataset.
In scikit-learn a random split into training and test sets can be quickly computed with the train_test_split helper function.
In [24]:
from sklearn.model_selection import train_test_split
In [25]:
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=1)
We fit the model on the train dataset:
In [26]:
model.fit(x_train, y_train)
Out[26]:
And measure the accuracy on the test dataset.
Note: the score() function is a shortcut that runs predict() and then computes the model's score; for a regression model like this one it returns the coefficient of determination $R^2$, which this notebook loosely refers to as accuracy.
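For reference, $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$, where $\hat{y}_i$ are the model's predictions and $\bar{y}$ is the mean of the observed values; a value of 1 means a perfect fit.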
In [27]:
print('Test Accuracy: {:.3f}'.format(model.score(x_test, y_test)))
The validation estimate of the test error can be highly variable, depending on precisely which observations are included in the training set and which observations are included in the test set. Every time you split the data into different train/test datasets you get different models and scores. You can try it yourself by removing the fixed random seed above.
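As a quick illustration (this cell is not part of the original notebook, and the variable names are arbitrary), repeating the split a few times without a fixed random_state shows how much the hold-out estimate moves around:

# Illustrative sketch: each unseeded split yields a different hold-out R^2
for i in range(5):
    x_tr, x_te, y_tr, y_te = train_test_split(X, Y, test_size=0.2)
    score_i = linear_model.LinearRegression().fit(x_tr, y_tr).score(x_te, y_te)
    print('Split {}: R^2 = {:.3f}'.format(i + 1, score_i))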
A solution to this problem is a procedure called cross-validation (CV for short).
In the basic approach, called K-fold CV, the training set is split into K smaller sets.
We leave out part k, fit the model to the other K-1 parts (combined), and then obtain predictions for the left-out k-th part.
This is done in turn for each part k = 1,2,...K, and then the results are combined.
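"Combined" usually means averaged: if $\mathrm{MSE}_k$ (or any other score) denotes the error measured on the $k$-th held-out part, the K-fold CV estimate is $\mathrm{CV}_{(K)} = \frac{1}{K}\sum_{k=1}^{K} \mathrm{MSE}_k$.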
In [28]:
from sklearn.model_selection import KFold
import numpy as np
KFold divides all the samples into K groups of samples, called folds, of equal size (if possible).
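A tiny illustration (not one of the original cells) of how KFold partitions the row positions when shuffling is off:

# With 6 samples and 3 splits, the test folds are [0 1], [2 3] and [4 5]
for train_idx, test_idx in KFold(n_splits=3).split(np.arange(6)):
    print('train:', train_idx, 'test:', test_idx)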
In [29]:
kf = KFold(n_splits=10).split(X, Y)
Note: I use .iloc to index the pandas DataFrame because KFold returns positional indices (0 to n-1), while after dropna() the DataFrame's integer labels are no longer contiguous; indexing by label with those positions would introduce NaNs and raise this error:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
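For example (purely illustrative), positional indexing selects rows by their current position, regardless of the labels left over after dropna():

X.iloc[[0, 1, 2]]  # the first three remaining rows, selected by position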
In [30]:
scores = []
for k, (train_index, test_index) in enumerate(kf):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    Y_train, Y_test = Y.iloc[train_index], Y.iloc[test_index]
    model.fit(X_train, Y_train)
    scoreK = model.score(X_test, Y_test)  # evaluate on the held-out fold
    scores.append(scoreK)
    print('Fold: {:2d}, Acc: {:.3f}'.format(k+1, scoreK))

print('\nCV accuracy: {:.3f} +/- {:.3f}'.format(np.mean(scores),
                                                np.std(scores)))
Leave-One-Out (LOO) is a special case of the K-fold splitting approach: with K = n, i.e. one observation per fold, we obtain the Leave-One-Out strategy.
In [31]:
from sklearn.model_selection import LeaveOneOut
In [32]:
LOO = LeaveOneOut().split(X, Y)
Note: we don't print each fold here, since LOO produces one fold per observation and is computationally more expensive.
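As an aside (not used in the code below): for least-squares linear or polynomial regression, the LOO error can actually be obtained from a single fit via $\mathrm{CV}_{(n)} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat{y}_i}{1 - h_i}\right)^2$, where $\hat{y}_i$ is the $i$-th fitted value from the full fit and $h_i$ is its leverage, so LOO need not always be expensive.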
In [33]:
from sklearn.metrics import r2_score

predictions = np.zeros(len(Y))
for train_index, test_index in LOO:
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    Y_train, Y_test = Y.iloc[train_index], Y.iloc[test_index]
    model.fit(X_train, Y_train)
    # R^2 cannot be computed on a single held-out observation, so we store
    # each prediction and compute one overall score at the end
    predictions[test_index] = model.predict(X_test)

print('CV accuracy: {:.3f}'.format(r2_score(Y, predictions)))
The resulting estimate (about 0.60) is a much more reliable measure than the one from the hold-out approach, which could change depending on which test set has been randomly selected.
The sklearn helper cross_val_score can quickly estimate the score of a model under a given CV splitting strategy, replacing the code above with a more compact form:
In [34]:
from sklearn.model_selection import cross_val_score
In [35]:
# We are passing the entire X and Y, not x_train or y_train
scores = cross_val_score(model, X, Y,
                         cv=KFold(n_splits=5, shuffle=True, random_state=1))
print("Cross-validated scores:", scores)
print('CV accuracy: {:.3f} +/- {:.3f}'.format(np.mean(scores), np.std(scores)))
In the first fold the score is 58%, in the second fold it is 65%, and so on.
Generally, 10 folds is a good compromise between speed and accuracy.
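For example, passing an integer to the cv parameter uses K-fold with that many splits (an illustrative one-liner, not one of the original cells; scores10 is just an arbitrary name):

scores10 = cross_val_score(model, X, Y, cv=10)  # 10-fold cross-validation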
Now we try to improve our simple model's accuracy by adding polynomial terms, i.e. powers of the horsepower feature.
In [36]:
from sklearn.preprocessing import PolynomialFeatures
In [37]:
poly = PolynomialFeatures(degree=2) # quadratic function
In [38]:
# we first transform our data
x_train2 = poly.fit_transform(x_train)
x_test2 = poly.transform(x_test)  # reuse the transformer fitted on the training data
In [39]:
x_train2[0:5]
Out[39]:
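With degree=2 each transformed row is $[1,\ x,\ x^2]$: a bias column of ones, the original horsepower value and its square (PolynomialFeatures includes the bias column by default).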
Then we fit a new model to the transformed data:
In [40]:
model2 = linear_model.LinearRegression()
model2.fit(x_train2, y_train)
Out[40]:
In [41]:
model2.score(x_test2, y_test)
Out[41]:
It looks better, so we can try with a cubic function:
In [42]:
poly = PolynomialFeatures(degree=3) # cubic function
x_train3 = poly.fit_transform(x_train) # again, transform the data
x_test3 = poly.transform(x_test)
In [43]:
model3 = linear_model.LinearRegression()
model3.fit(x_train3, y_train) # fit a new model
Out[43]:
In [44]:
model3.score(x_test3, y_test)
Out[44]:
These results show that a model which predicts mpg using a quadratic function of horsepower performs better than a model that involves only a linear function of horsepower, and there is little evidence in favour of a model that uses a cubic function of horsepower.
We can repeat this procedure for increasingly complex polynomial fits. This is called model selection.
To automate the process, we use a for loop which iteratively fits polynomial regressions for degrees 1 to 14, computes the associated cross-validation score and stores it.
In [45]:
# search for the optimal polynomial degree for linear regression
# range of degrees we want to try (range(1, 15) covers 1 to 14)
n_range = range(1, 15)
# empty list to store scores
n_scores = []
model_n = linear_model.LinearRegression()
# 1. we will loop through values of n
for n in n_range:
    # 2. transform features with degree n
    poly = PolynomialFeatures(degree=n)
    x_val_n = poly.fit_transform(X)
    # 3. obtain cross_val_score for model_n
    scores = cross_val_score(model_n, x_val_n, Y,
                             cv=KFold(n_splits=5, shuffle=True, random_state=1))
    # 4. append mean of scores to n_scores list
    n_scores.append(scores.mean())
    #print("Cross-validated scores:", scores)

print('CV accuracy: {:.3f} +/- {:.3f}'.format(np.mean(n_scores), np.std(n_scores)))
In [46]:
# in essence, this runs the k-fold cross-validation method once for each candidate degree:
print('Length of list', len(n_scores))
print('Max of list', max(n_scores))
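To recover which degree achieved that maximum (a small addition, not one of the original cells; best_degree is an arbitrary name):

best_degree = n_range[int(np.argmax(n_scores))]  # degree with the highest mean CV score
print('Best degree:', best_degree)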
In [47]:
# plot how accuracy changes as we vary the polynomial degree
plt.plot(n_range, n_scores)
plt.xlabel('Degree for horsepower')
plt.ylabel('Cross-validated accuracy')
Out[47]:
This is the typical shape of the curve when examining model complexity against accuracy: the CV score first increases, stays close to its maximum (here roughly from degree 2 to degree 12), and then drops quickly as the model starts to overfit.
This is an example of the bias-variance trade-off.
We see a sharp improvement in the estimated accuracy between the linear and quadratic fits, but then no clear improvement from using higher-order polynomials.
Therefore, the best model is the one with the quadratic function.
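As a possible closing step (a sketch, not part of the original notebook; poly2 and final_model are just illustrative names), the selected quadratic model can be refit on all of the available data before being used for new predictions:

# Refit the chosen degree-2 model on the full dataset
poly2 = PolynomialFeatures(degree=2)
final_model = linear_model.LinearRegression().fit(poly2.fit_transform(X), Y)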