Training Accuracy

Training accuracy is the prediction accuracy measured on the same data the model was trained on.

Problems with training and testing on the same data

  • Goal is to estimate likely performance of a model on out-of-sample data
  • But, maximizing training accuracy rewards overly complex models that won't necessarily generalize
  • Unnecessarily complex models overfit the training data
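To make the overfitting risk concrete, here is a minimal sketch that trains and tests on the same data. It assumes a 1-nearest-neighbor classifier as a stand-in for an "overly complex" model (that choice is an illustration, not something this section prescribes). Because every training point is its own nearest neighbor, the score should come out at or very near 100%, which says little about out-of-sample performance.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y = iris.data, iris.target

# K=1 is the most complex KNN model: each prediction copies the label
# of the single closest training observation (assumed example model).
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X, y)

# Predict on the very same observations that were used for fitting.
training_accuracy = accuracy_score(y, knn.predict(X))
print(training_accuracy)
```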

Image Credit: Overfitting by Chabacano. Licensed under GFDL via Wikimedia Commons.

How Can We Avoid Overfitting?

Evaluation procedure #2: Train/test split

  1. Split the dataset into two pieces: a training set and a testing set.
  2. Train the model on the training set.
  3. Test the model on the testing set, and evaluate how well we did.
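The three steps above can be sketched as one small function. The helper name `split_fit_score` is hypothetical, and the default `test_size` and `random_state` values simply mirror the ones used in this notebook; any scikit-learn classifier with `fit` and `predict` should work in its place.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score


def split_fit_score(model, X, y, test_size=0.4, random_state=4):
    """Hypothetical helper: split, train, then return testing accuracy."""
    # Step 1: split the data into a training set and a testing set.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    # Step 2: train the model on the training set only.
    model.fit(X_train, y_train)
    # Step 3: predict on the testing set and compare with the known labels.
    y_pred = model.predict(X_test)
    return accuracy_score(y_test, y_pred)


iris = load_iris()
testing_accuracy = split_fit_score(
    KNeighborsClassifier(n_neighbors=5), iris.data, iris.target
)
print(testing_accuracy)
```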

In [ ]:
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target

# print the shapes of X and y
print(X.shape)
print(y.shape)

In [ ]:
# STEP 1: split X and y into training and testing sets
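# train_test_split lives in sklearn.model_selection (the old sklearn.cross_validation module was removed)
from sklearn.model_selection import train_test_split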
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=4)

What did this accomplish?

  • Model can be trained and tested on different data
  • Response values are known for the testing set, and thus predictions can be evaluated
  • Testing accuracy is a better estimate than training accuracy of out-of-sample performance
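A short sketch of what the two keyword arguments do, assuming the same iris data and the same values as in this notebook: `test_size=0.4` holds out 40% of the rows for testing, and a fixed `random_state` makes the random split repeatable from run to run.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target

# Hold out 40% of the 150 observations for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=4
)
print(X_train.shape, X_test.shape)

# Splitting again with the same random_state reproduces the same testing set.
X_train_again, X_test_again, _, _ = train_test_split(
    X, y, test_size=0.4, random_state=4
)
same_split = np.array_equal(X_test, X_test_again)
print(same_split)
```

Without a fixed `random_state`, each run would draw a different split, so the testing accuracy would vary from run to run.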

In [ ]:
# print the shapes of the new X objects
print(X_train.shape)
print(X_test.shape)

In [ ]:
# print the shapes of the new y objects
print(y_train.shape)
print(y_test.shape)

In [ ]:
from sklearn.linear_model import LogisticRegression
# STEP 2: train the model on the training set
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

In [ ]:
# STEP 3: make predictions on the testing set
y_pred = logreg.predict(X_test)

from sklearn import metrics
# compare actual response values (y_test) with predicted response values (y_pred)
print(metrics.accuracy_score(y_test, y_pred))
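For reference, classification accuracy is just the fraction of predictions that match the true labels. The sketch below uses small made-up label arrays (`y_true` and `y_hat` are illustrative values, not output from the model above) to show that `accuracy_score` agrees with a manual calculation.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels for illustration: 4 of the 5 predictions are correct.
y_true = np.array([0, 1, 2, 2, 1])
y_hat = np.array([0, 1, 1, 2, 1])

# Manual accuracy: the mean of the element-wise "prediction is correct" flags.
manual_accuracy = np.mean(y_true == y_hat)

# The same quantity computed by scikit-learn.
library_accuracy = accuracy_score(y_true, y_hat)

print(manual_accuracy, library_accuracy)
```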

Repeat for KNN with K=5:


In [ ]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred))

Repeat for KNN with K=1:


In [ ]:
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred))

Can you find an even better value for K?


In [ ]:
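# try K=1 through K=25 and record testing accuracy
k_range = range(1, 26)
scores = []
for k in k_range:
    # fit a fresh KNN model for this value of K and score it on the testing set
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    scores.append(metrics.accuracy_score(y_test, y_pred))

# now plot testing accuracy against K
import matplotlib.pyplot as plt
# allow plots to appear within the notebook
%matplotlib inline

plt.plot(k_range, scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Testing Accuracy')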