*From the video series: Introduction to machine learning with scikit-learn*

- What is the drawback of using the **train/test split** procedure for model evaluation?
- How does **K-fold cross-validation** overcome this limitation?
- How can cross-validation be used for selecting **tuning parameters** and choosing between **models**?
- What are some possible **improvements** to cross-validation?
- How can we make sure that cross-validation is done correctly?

**Motivation:** Need a way to choose between machine learning models

- Goal is to estimate likely performance of a model on **out-of-sample data**

**Initial idea:** Train and test on the same data

- But, maximizing **training accuracy** rewards overly complex models which **overfit** the training data

**Alternative idea:** Train/test split

- Split the dataset into two pieces, so that the model can be trained and tested on **different data**
- **Testing accuracy** is a better estimate of out-of-sample performance than training accuracy
- But, it provides a **high-variance** estimate, since changing which observations happen to be in the testing set can significantly change the testing accuracy

In [1]:

```python
# note: sklearn.cross_validation was renamed to sklearn.model_selection in scikit-learn 0.18
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
```
In [2]:

```python
# read in the iris data
iris = load_iris()

# create X (features) and y (response)
X = iris.data
y = iris.target
```
In [3]:

```python
# use train/test split with different random_state values
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=4)

# check classification accuracy of KNN with K=5
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred))
```
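To see the high variance for yourself, you can evaluate the same KNN model on splits created with several different `random_state` values (a minimal self-contained sketch; the particular range of states is an arbitrary choice for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

iris = load_iris()
X, y = iris.data, iris.target

# same model, different random splits: testing accuracy varies noticeably
accuracies = []
for state in range(5):
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=state)
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train, y_train)
    accuracies.append(metrics.accuracy_score(y_test, knn.predict(X_test)))
print(accuracies)
```

The spread of these accuracies is exactly the variance that averaging over many splits is meant to reduce.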

**Question:** What if we created a bunch of train/test splits, calculated the testing accuracy for each, and averaged the results together?

**Answer:** That's the essence of K-fold cross-validation!

1. Split the dataset into K **equal** partitions (or "folds").
2. Use fold 1 as the **testing set** and the union of the other folds as the **training set**.
3. Calculate **testing accuracy**.
4. Repeat steps 2 and 3 K times, using a **different fold** as the testing set each time.
5. Use the **average testing accuracy** as the estimate of out-of-sample accuracy.
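The steps above can be sketched directly with scikit-learn's `KFold` splitter (a minimal illustration, assuming the iris data and KNN with K=5; shuffling is used because the iris observations are ordered by class):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
import numpy as np

iris = load_iris()
X, y = iris.data, iris.target

# step 1: split into 5 folds
kf = KFold(n_splits=5, shuffle=True, random_state=1)

accuracies = []
for train_index, test_index in kf.split(X):
    # steps 2-4: train on the union of the other folds, test on the held-out fold
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X[train_index], y[train_index])
    accuracies.append(metrics.accuracy_score(y[test_index], knn.predict(X[test_index])))

# step 5: the average testing accuracy estimates out-of-sample accuracy
print(np.mean(accuracies))
```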

Diagram of **5-fold cross-validation:**

In [4]:

```python
# simulate splitting a dataset of 25 observations into 5 folds
from sklearn.model_selection import KFold
kf = KFold(n_splits=5, shuffle=False)

# print the contents of each training and testing set
print('{} {:^61} {}'.format('Iteration', 'Training set observations', 'Testing set observations'))
for iteration, data in enumerate(kf.split(range(25)), start=1):
    print('{:^9} {} {:^25}'.format(iteration, str(data[0]), str(data[1])))
```

- Dataset contains **25 observations** (numbered 0 through 24)
- 5-fold cross-validation, thus it runs for **5 iterations**
- For each iteration, every observation is either in the training set or the testing set, **but not both**
- Every observation is in the testing set **exactly once**

Advantages of **cross-validation:**

- More accurate estimate of out-of-sample accuracy
- More "efficient" use of data (every observation is used for both training and testing)

Advantages of **train/test split:**

- Runs K times faster than K-fold cross-validation
- Simpler to examine the detailed results of the testing process

- K can be any number, but **K=10** is generally recommended
- For classification problems, **stratified sampling** is recommended for creating the folds
    - Each response class should be represented with equal proportions in each of the K folds
    - scikit-learn's `cross_val_score` function does this by default
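You can verify the stratification property with `StratifiedKFold` directly (a quick check, not from the video): the iris dataset has 50 observations of each of 3 classes, so each of 5 testing folds should contain exactly 10 observations per class.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
import numpy as np

iris = load_iris()
X, y = iris.data, iris.target

skf = StratifiedKFold(n_splits=5)
fold_counts = []
for train_index, test_index in skf.split(X, y):
    # count how many observations of each class land in this testing fold
    fold_counts.append(np.bincount(y[test_index]))
print(fold_counts)  # each fold: 10 observations of each of the 3 classes
```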

**Goal:** Select the best tuning parameters (aka "hyperparameters") for KNN on the iris dataset

In [5]:

```python
from sklearn.model_selection import cross_val_score
```
In [6]:

```python
# 10-fold cross-validation with K=5 for KNN (the n_neighbors parameter)
knn = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
print(scores)
```

In [7]:

```python
# use average accuracy as an estimate of out-of-sample accuracy
print(scores.mean())
```

In [8]:

```python
# search for an optimal value of K for KNN
k_range = range(1, 31)
k_scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    k_scores.append(cross_val_score(knn, X, y, cv=10, scoring='accuracy').mean())
print(k_scores)
```
In [9]:

```python
import matplotlib.pyplot as plt
%matplotlib inline

# plot the value of K for KNN (x-axis) versus the cross-validated accuracy (y-axis)
plt.plot(k_range, k_scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Cross-Validated Accuracy')
```

**Goal:** Compare the best KNN model with logistic regression on the iris dataset

In [10]:

```python
# 10-fold cross-validation with the best KNN model
knn = KNeighborsClassifier(n_neighbors=20)
print(cross_val_score(knn, X, y, cv=10, scoring='accuracy').mean())
```

In [11]:

```python
# 10-fold cross-validation with logistic regression
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
print(cross_val_score(logreg, X, y, cv=10, scoring='accuracy').mean())
```

**Repeated cross-validation**

- Repeat cross-validation multiple times (with **different random splits** of the data) and average the results
- More reliable estimate of out-of-sample performance by **reducing the variance** associated with a single trial of cross-validation
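scikit-learn supports this pattern via `RepeatedStratifiedKFold`; a minimal sketch reusing the K=20 KNN model from above (the choice of 5 repeats is arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

# 10-fold cross-validation, repeated 5 times with different random splits
rskf = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=1)
knn = KNeighborsClassifier(n_neighbors=20)
scores = cross_val_score(knn, X, y, cv=rskf, scoring='accuracy')

# 50 scores in total; their mean has lower variance than a single CV trial
print(scores.mean())
```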

**Creating a hold-out set**

- "Hold out" a portion of the data **before** beginning the model building process
- Locate the best model using cross-validation on the remaining data, and test it **using the hold-out set**
- More reliable estimate of out-of-sample performance, since the hold-out set is **truly out-of-sample**
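A minimal sketch of the hold-out procedure on iris (the 25% hold-out size and the K search range are arbitrary illustrative choices, not from the video):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

# hold out 25% of the data before any model building
X_rest, X_holdout, y_rest, y_holdout = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y)

# locate the best K via 10-fold cross-validation on the remaining data only
best_k = max(range(1, 31),
             key=lambda k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                           X_rest, y_rest, cv=10,
                                           scoring='accuracy').mean())

# refit on all remaining data, then evaluate once on the truly out-of-sample hold-out set
knn = KNeighborsClassifier(n_neighbors=best_k).fit(X_rest, y_rest)
holdout_accuracy = knn.score(X_holdout, y_holdout)
print(best_k, holdout_accuracy)
```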


**Resources:**

- scikit-learn documentation: Cross-validation, Model evaluation
- scikit-learn issue on GitHub: MSE is negative when returned by cross_val_score
- Section 5.1 of An Introduction to Statistical Learning (11 pages) and related videos: K-fold and leave-one-out cross-validation (14 minutes), Cross-validation the right and wrong ways (10 minutes)
- Scott Fortmann-Roe: Accurately Measuring Model Prediction Error
- Machine Learning Mastery: An Introduction to Feature Selection
- Harvard CS109: Cross-Validation: The Right and Wrong Way
- Journal of Cheminformatics: Cross-validation pitfalls when selecting and assessing regression and classification models