Here we consider a class of methods that estimate the test error (the average error that results from using a statistical learning method to predict the response on a new observation) by holding out a subset of the training observations from the fitting process, and then applying the statistical learning method to those held out observations.
We use the Auto dataset from the UCI Machine Learning Repository.
Specifically, we will use the horsepower feature (the car's engine horsepower) to predict its fuel consumption (the mpg feature).
In [1]:
import pandas as pd
In [2]:
autoData = pd.read_csv('../datasets/Auto.csv', na_values='?')
In [3]:
autoData.shape
Out[3]:
(397, 9): 397 examples, each with 9 features.
In [4]:
autoData.columns
Out[4]:
We are interested in the mpg and the horsepower features.
In [5]:
autoData.head(5)
Out[5]:
In [6]:
autoData.info()
In [7]:
autoData.isnull().values.any()
Out[7]:
There are NaNs. We remove those observations before proceeding.
In [8]:
autoData[autoData.isnull().any(axis=1)]
Out[8]:
In [9]:
autoData.dropna(axis=0, inplace=True) # Drop any row that has NaNs within it
In [10]:
autoData[autoData.isnull().any(axis=1)]
Out[10]:
In [11]:
# horsepower feature is our input X
X = autoData[['horsepower']]
X.shape
Out[11]:
In [12]:
X.describe()
Out[12]:
In [13]:
m = X.shape[0] # number of examples
m
Out[13]:
In [14]:
# mpg is our output Y, what we want to predict
Y = autoData.mpg.copy() # copy the mpg values out as our target Y
len(Y)
Out[14]:
In [15]:
import matplotlib.pyplot as plt
%matplotlib inline
In [16]:
fig = plt.scatter(X, Y)
fig = plt.grid(True)
fig = plt.xlabel('Horsepower')
fig = plt.ylabel('Miles per Gallon (mpg)')
In [17]:
from sklearn import linear_model
In [18]:
# Create the linear regression model
model = linear_model.LinearRegression()
In [19]:
model.fit(X, Y)
Out[19]:
These are the model's coefficients:
In [20]:
model.intercept_
Out[20]:
In [21]:
model.coef_
Out[21]:
The model looks like $\text{mpg} \approx 39.9 - 0.15 \cdot \text{horsepower}$.
Let's display the linear regression line in the previous chart:
In [22]:
y_plot = model.predict([[40],[230]])
y_plot
Out[22]:
In [23]:
fig = plt.scatter(X, Y)
fig = plt.grid(True)
fig = plt.xlabel('Horsepower')
fig = plt.ylabel('Miles per Gallon (mpg)')
fig = plt.plot([40,230], y_plot, '-', color='darkred', linewidth=2)
We can already see that a simple linear regression model will not be very accurate: a curve would probably approximate the real fuel consumption better than a straight line.
Let's measure the accuracy using the hold-out (validation set) approach.
Here we randomly divide the available set of observations into two parts: a training set and a test (hold-out) set containing 20% of the entire dataset.
In scikit-learn a random split into training and test sets can be quickly computed with the train_test_split helper function.
In [24]:
from sklearn.model_selection import train_test_split
In [25]:
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=1)
We fit the model on the train dataset:
In [26]:
model.fit(x_train, y_train)
Out[26]:
And measure the accuracy on the test dataset.
Note: the score() function is a shortcut that runs predict() and then computes the model's score; for a regression model like this one it returns the coefficient of determination $R^2$, which this notebook loosely refers to as accuracy.
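For reference, $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$, where $\hat{y}_i$ are the model's predictions and $\bar{y}$ is the mean of the observed values; a value of 1 means a perfect fit.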
In [27]:
print('Test Accuracy: {:.3f}'.format(model.score(x_test, y_test)))
The validation estimate of the test error can be highly variable, depending on precisely which observations are included in the training set and which observations are included in the test set. Every time you split the data into different train/test datasets you get different models and scores. You can try it yourself by removing the fixed random seed above.
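As a quick illustration (this cell is not part of the original notebook, and the variable names are arbitrary), repeating the split a few times without a fixed random_state shows how much the hold-out estimate moves around:

# Illustrative sketch: each unseeded split yields a different hold-out R^2
for i in range(5):
    x_tr, x_te, y_tr, y_te = train_test_split(X, Y, test_size=0.2)
    score_i = linear_model.LinearRegression().fit(x_tr, y_tr).score(x_te, y_te)
    print('Split {}: R^2 = {:.3f}'.format(i + 1, score_i))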
A solution to this problem is a procedure called cross-validation (CV for short).
In the basic approach, called K-fold CV, the training set is split into K smaller sets.
We leave out part k, fit the model to the other K-1 parts (combined), and then obtain predictions for the left-out k-th part.
This is done in turn for each part k = 1,2,...K, and then the results are combined.
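"Combined" usually means averaged: if $\mathrm{MSE}_k$ (or any other score) denotes the error measured on the $k$-th held-out part, the K-fold CV estimate is $\mathrm{CV}_{(K)} = \frac{1}{K}\sum_{k=1}^{K} \mathrm{MSE}_k$.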
In [28]:
from sklearn.model_selection import KFold
import numpy as np
KFold divides all the samples into K groups of samples, called folds, of equal size (if possible).
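A tiny illustration (not one of the original cells) of how KFold partitions the row positions when shuffling is off:

# With 6 samples and 3 splits, the test folds are [0 1], [2 3] and [4 5]
for train_idx, test_idx in KFold(n_splits=3).split(np.arange(6)):
    print('train:', train_idx, 'test:', test_idx)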
In [29]:
kf = KFold(n_splits=10).split(X, Y)
Note: I use .iloc to index the pandas DataFrame because KFold returns positional indices (0 to n-1), while after dropna() the DataFrame's integer labels are no longer contiguous; indexing by label with those positions would introduce NaNs and raise this error:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
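For example (purely illustrative), positional indexing selects rows by their current position, regardless of the labels left over after dropna():

X.iloc[[0, 1, 2]]  # the first three remaining rows, selected by position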
In [30]:
scores = []
for k, (train_index, test_index) in enumerate(kf):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    Y_train, Y_test = Y.iloc[train_index], Y.iloc[test_index]
    model.fit(X_train, Y_train)
    scoreK = model.score(X_test, Y_test)  # evaluate on the held-out fold
    scores.append(scoreK)
    print('Fold: {:2d}, Acc: {:.3f}'.format(k+1, scoreK))

print('\nCV accuracy: {:.3f} +/- {:.3f}'.format(np.mean(scores),
                                                np.std(scores)))
Leave-One-Out (LOO) is a special case of the K-fold splitting approach: with K = n, i.e. one observation per fold, we obtain the Leave-One-Out strategy.
In [31]:
from sklearn.model_selection import LeaveOneOut
In [32]:
LOO = LeaveOneOut().split(X, Y)
Note: we don't print each fold here, since LOO produces one fold per observation and is computationally more expensive.
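As an aside (not used in the code below): for least-squares linear or polynomial regression, the LOO error can actually be obtained from a single fit via $\mathrm{CV}_{(n)} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat{y}_i}{1 - h_i}\right)^2$, where $\hat{y}_i$ is the $i$-th fitted value from the full fit and $h_i$ is its leverage, so LOO need not always be expensive.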
In [33]:
from sklearn.metrics import r2_score

predictions = np.zeros(len(Y))
for train_index, test_index in LOO:
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    Y_train, Y_test = Y.iloc[train_index], Y.iloc[test_index]
    model.fit(X_train, Y_train)
    # R^2 cannot be computed on a single held-out observation, so we store
    # each prediction and compute one overall score at the end
    predictions[test_index] = model.predict(X_test)

print('CV accuracy: {:.3f}'.format(r2_score(Y, predictions)))
The resulting estimate (about 0.60) is a much more reliable measure than the one from the hold-out approach, which could change depending on which test set has been randomly selected.
The sklearn helper cross_val_score can quickly estimate the score of a model under a given CV splitting strategy, replacing the code above with a more compact form:
In [34]:
from sklearn.model_selection import cross_val_score
In [35]:
# We are passing the entire X and Y, not x_train or y_train
scores = cross_val_score(model, X, Y,
                         cv=KFold(n_splits=5, shuffle=True, random_state=1))
print("Cross-validated scores:", scores)
print('CV accuracy: {:.3f} +/- {:.3f}'.format(np.mean(scores), np.std(scores)))
In the first fold the score is 58%, in the second fold it is 65%, and so on.
Generally, 10 folds is a good compromise between speed and accuracy.
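For example, passing an integer to the cv parameter uses K-fold with that many splits (an illustrative one-liner, not one of the original cells; scores10 is just an arbitrary name):

scores10 = cross_val_score(model, X, Y, cv=10)  # 10-fold cross-validation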
Now we try to improve our simple model's accuracy by adding polynomial terms, i.e. powers of the horsepower feature.
In [36]:
from sklearn.preprocessing import PolynomialFeatures
In [37]:
poly = PolynomialFeatures(degree=2) # quadratic function
In [38]:
# we first transform our data
x_train2 = poly.fit_transform(x_train)
x_test2 = poly.transform(x_test)  # reuse the transformer fitted on the training data
In [39]:
x_train2[0:5]
Out[39]:
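With degree=2 each transformed row is $[1,\ x,\ x^2]$: a bias column of ones, the original horsepower value and its square (PolynomialFeatures includes the bias column by default).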
Then we fit a new model to the transformed data:
In [40]:
model2 = linear_model.LinearRegression()
model2.fit(x_train2, y_train)
Out[40]:
In [41]:
model2.score(x_test2, y_test)
Out[41]:
It looks better, so we can try with a cubic function:
In [42]:
poly = PolynomialFeatures(degree=3) # cubic function
x_train3 = poly.fit_transform(x_train) # again, transform the data
x_test3 = poly.transform(x_test)
In [43]:
model3 = linear_model.LinearRegression()
model3.fit(x_train3, y_train) # fit a new model
Out[43]:
In [44]:
model3.score(x_test3, y_test)
Out[44]:
These results show that a model which predicts mpg using a quadratic function of horsepower performs better than a model that involves only a linear function of horsepower, and there is little evidence in favour of a model that uses a cubic function of horsepower.
We can repeat this procedure for increasingly complex polynomial fits. This is called model selection.
To automate the process, we use a for loop which iteratively fits polynomial regressions for degrees 1 to 14, computes the associated cross-validation score and stores it.
In [45]:
# search for the optimal polynomial degree for linear regression
# range of degrees we want to try (range(1, 15) covers 1 to 14)
n_range = range(1, 15)
# empty list to store scores
n_scores = []
model_n = linear_model.LinearRegression()
# 1. we will loop through values of n
for n in n_range:
    # 2. transform features with degree n
    poly = PolynomialFeatures(degree=n)
    x_val_n = poly.fit_transform(X)
    # 3. obtain cross_val_score for model_n
    scores = cross_val_score(model_n, x_val_n, Y,
                             cv=KFold(n_splits=5, shuffle=True, random_state=1))
    # 4. append mean of scores to n_scores list
    n_scores.append(scores.mean())
    #print("Cross-validated scores:", scores)

print('CV accuracy: {:.3f} +/- {:.3f}'.format(np.mean(n_scores), np.std(n_scores)))
In [46]:
# in essence, this runs the k-fold cross-validation method once for each candidate degree:
print('Length of list', len(n_scores))
print('Max of list', max(n_scores))
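To recover which degree achieved that maximum (a small addition, not one of the original cells; best_degree is an arbitrary name):

best_degree = n_range[int(np.argmax(n_scores))]  # degree with the highest mean CV score
print('Best degree:', best_degree)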
In [47]:
# plot how accuracy changes as we vary the polynomial degree
plt.plot(n_range, n_scores)
plt.xlabel('Degree for horsepower')
plt.ylabel('Cross-validated accuracy')
Out[47]:
This is the typical shape of the curve when examining model complexity against accuracy: the CV score first increases, stays close to its maximum (here roughly from degree 2 to degree 12), and then drops quickly as the model starts to overfit.
This is an example of the bias-variance trade-off.
We see a sharp improvement in the estimated accuracy between the linear and quadratic fits, but then no clear improvement from using higher-order polynomials.
Therefore, the best model is the one with the quadratic function.
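As a possible closing step (a sketch, not part of the original notebook; poly2 and final_model are just illustrative names), the selected quadratic model can be refit on all of the available data before being used for new predictions:

# Refit the chosen degree-2 model on the full dataset
poly2 = PolynomialFeatures(degree=2)
final_model = linear_model.LinearRegression().fit(poly2.fit_transform(X), Y)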