*From the video series: Introduction to machine learning with scikit-learn*

- What is the drawback of using the **train/test split** procedure for model evaluation?
- How does **K-fold cross-validation** overcome this limitation?
- How can cross-validation be used for selecting **tuning parameters** and choosing between **models**?
- What are some possible **improvements** to cross-validation?
- How can we make sure that cross-validation is done correctly?

**Motivation:** Need a way to choose between machine learning models

- Goal is to estimate likely performance of a model on **out-of-sample data**

**Initial idea:** Train and test on the same data

- But, maximizing **training accuracy** rewards overly complex models which **overfit** the training data

**Alternative idea:** Train/test split

- Split the dataset into two pieces, so that the model can be trained and tested on **different data**
- **Testing accuracy** is a better estimate of out-of-sample performance than training accuracy
- But, it provides a **high-variance** estimate, since changing which observations happen to be in the testing set can significantly change the testing accuracy

In [1]:

```python
# note: sklearn.cross_validation was renamed to sklearn.model_selection in scikit-learn 0.18
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
```
In [2]:

```python
# read in the iris data
iris = load_iris()

# create X (features) and y (response)
X = iris.data
y = iris.target
```
In [3]:

```python
# use train/test split with different random_state values
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=4)

# check classification accuracy of KNN with K=5
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred))
```
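To see the high variance for yourself, you can evaluate the same KNN model on splits created with several different `random_state` values (a minimal self-contained sketch; the particular range of states is an arbitrary choice for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

iris = load_iris()
X, y = iris.data, iris.target

# same model, different random splits: testing accuracy varies noticeably
accuracies = []
for state in range(5):
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=state)
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train, y_train)
    accuracies.append(metrics.accuracy_score(y_test, knn.predict(X_test)))
print(accuracies)
```

The spread of these accuracies is exactly the variance that averaging over many splits is meant to reduce.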

**Question:** What if we created a bunch of train/test splits, calculated the testing accuracy for each, and averaged the results together?

**Answer:** That's the essence of K-fold cross-validation!

1. Split the dataset into K **equal** partitions (or "folds").
2. Use fold 1 as the **testing set** and the union of the other folds as the **training set**.
3. Calculate **testing accuracy**.
4. Repeat steps 2 and 3 K times, using a **different fold** as the testing set each time.
5. Use the **average testing accuracy** as the estimate of out-of-sample accuracy.
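The steps above can be sketched directly with scikit-learn's `KFold` splitter (a minimal illustration, assuming the iris data and KNN with K=5; shuffling is used because the iris observations are ordered by class):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
import numpy as np

iris = load_iris()
X, y = iris.data, iris.target

# step 1: split into 5 folds
kf = KFold(n_splits=5, shuffle=True, random_state=1)

accuracies = []
for train_index, test_index in kf.split(X):
    # steps 2-4: train on the union of the other folds, test on the held-out fold
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X[train_index], y[train_index])
    accuracies.append(metrics.accuracy_score(y[test_index], knn.predict(X[test_index])))

# step 5: the average testing accuracy estimates out-of-sample accuracy
print(np.mean(accuracies))
```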

Diagram of **5-fold cross-validation:**

In [4]:

```python
# simulate splitting a dataset of 25 observations into 5 folds
from sklearn.model_selection import KFold
kf = KFold(n_splits=5, shuffle=False)

# print the contents of each training and testing set
print('{} {:^61} {}'.format('Iteration', 'Training set observations', 'Testing set observations'))
for iteration, data in enumerate(kf.split(range(25)), start=1):
    print('{:^9} {} {:^25}'.format(iteration, str(data[0]), str(data[1])))
```

- Dataset contains **25 observations** (numbered 0 through 24)
- 5-fold cross-validation, thus it runs for **5 iterations**
- For each iteration, every observation is either in the training set or the testing set, **but not both**
- Every observation is in the testing set **exactly once**

Advantages of **cross-validation:**

- More accurate estimate of out-of-sample accuracy
- More "efficient" use of data (every observation is used for both training and testing)

Advantages of **train/test split:**

- Runs K times faster than K-fold cross-validation
- Simpler to examine the detailed results of the testing process

- K can be any number, but **K=10** is generally recommended
- For classification problems, **stratified sampling** is recommended for creating the folds
    - Each response class should be represented with equal proportions in each of the K folds
    - scikit-learn's `cross_val_score` function does this by default
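You can verify the stratification property with `StratifiedKFold` directly (a quick check, not from the video): the iris dataset has 50 observations of each of 3 classes, so each of 5 testing folds should contain exactly 10 observations per class.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
import numpy as np

iris = load_iris()
X, y = iris.data, iris.target

skf = StratifiedKFold(n_splits=5)
fold_counts = []
for train_index, test_index in skf.split(X, y):
    # count how many observations of each class land in this testing fold
    fold_counts.append(np.bincount(y[test_index]))
print(fold_counts)  # each fold: 10 observations of each of the 3 classes
```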

**Goal:** Select the best tuning parameters (aka "hyperparameters") for KNN on the iris dataset

In [5]:

```python
from sklearn.model_selection import cross_val_score
```
In [6]:

```python
# 10-fold cross-validation with K=5 for KNN (the n_neighbors parameter)
knn = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
print(scores)
```

In [7]:

```python
# use average accuracy as an estimate of out-of-sample accuracy
print(scores.mean())
```

In [8]:

```python
# search for an optimal value of K for KNN
k_range = range(1, 31)
k_scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    k_scores.append(cross_val_score(knn, X, y, cv=10, scoring='accuracy').mean())
print(k_scores)
```
In [9]:

```python
import matplotlib.pyplot as plt
%matplotlib inline

# plot the value of K for KNN (x-axis) versus the cross-validated accuracy (y-axis)
plt.plot(k_range, k_scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Cross-Validated Accuracy')
```

**Goal:** Compare the best KNN model with logistic regression on the iris dataset

In [10]:

```python
# 10-fold cross-validation with the best KNN model
knn = KNeighborsClassifier(n_neighbors=20)
print(cross_val_score(knn, X, y, cv=10, scoring='accuracy').mean())
```

In [11]:

```python
# 10-fold cross-validation with logistic regression
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
print(cross_val_score(logreg, X, y, cv=10, scoring='accuracy').mean())
```

**Repeated cross-validation**

- Repeat cross-validation multiple times (with **different random splits** of the data) and average the results
- More reliable estimate of out-of-sample performance by **reducing the variance** associated with a single trial of cross-validation
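scikit-learn supports this pattern via `RepeatedStratifiedKFold`; a minimal sketch reusing the K=20 KNN model from above (the choice of 5 repeats is arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

# 10-fold cross-validation, repeated 5 times with different random splits
rskf = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=1)
knn = KNeighborsClassifier(n_neighbors=20)
scores = cross_val_score(knn, X, y, cv=rskf, scoring='accuracy')

# 50 scores in total; their mean has lower variance than a single CV trial
print(scores.mean())
```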

**Creating a hold-out set**

- "Hold out" a portion of the data **before** beginning the model building process
- Locate the best model using cross-validation on the remaining data, and test it **using the hold-out set**
- More reliable estimate of out-of-sample performance, since the hold-out set is **truly out-of-sample**
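A minimal sketch of the hold-out procedure on iris (the 25% hold-out size and the K search range are arbitrary illustrative choices, not from the video):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

# hold out 25% of the data before any model building
X_rest, X_holdout, y_rest, y_holdout = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y)

# locate the best K via 10-fold cross-validation on the remaining data only
best_k = max(range(1, 31),
             key=lambda k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                           X_rest, y_rest, cv=10,
                                           scoring='accuracy').mean())

# refit on all remaining data, then evaluate once on the truly out-of-sample hold-out set
knn = KNeighborsClassifier(n_neighbors=best_k).fit(X_rest, y_rest)
holdout_accuracy = knn.score(X_holdout, y_holdout)
print(best_k, holdout_accuracy)
```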


**Resources:**

- scikit-learn documentation: Cross-validation, Model evaluation
- scikit-learn issue on GitHub: MSE is negative when returned by cross_val_score
- Section 5.1 of An Introduction to Statistical Learning (11 pages) and related videos: K-fold and leave-one-out cross-validation (14 minutes), Cross-validation the right and wrong ways (10 minutes)
- Scott Fortmann-Roe: Accurately Measuring Model Prediction Error
- Machine Learning Mastery: An Introduction to Feature Selection
- Harvard CS109: Cross-Validation: The Right and Wrong Way
- Journal of Cheminformatics: Cross-validation pitfalls when selecting and assessing regression and classification models