Outline of the session:
In [1]:
%pylab inline
import pylab as pl
import numpy as np
# Some nice default configuration for plots
pl.rcParams['figure.figsize'] = 10, 7.5
pl.rcParams['axes.grid'] = True
pl.gray()
Let's load a simple dataset of 8x8 gray level images of handwritten digits (bundled in the sklearn source code):
In [2]:
from sklearn.datasets import load_digits
digits = load_digits()
print(digits.DESCR)
In [3]:
X, y = digits.data, digits.target
print("data shape: %r, target shape: %r" % (X.shape, y.shape))
print("classes: %r" % list(np.unique(y)))
In [4]:
n_samples, n_features = X.shape
print("n_samples=%d" % n_samples)
print("n_features=%d" % n_features)
In [9]:
for i, j in enumerate(np.random.permutation(X.shape[0])[:5]):
    pl.subplot(1, 5, (i + 1))
    pl.imshow(X[j].reshape((8, 8)), interpolation='nearest')
    pl.title("true class: %d" % y[j])
    pl.xticks(()), pl.yticks(())
Let's visualize the dataset on a 2D plane using a projection on the first 2 axes extracted by Principal Component Analysis:
In [10]:
from sklearn.decomposition import RandomizedPCA
%time X_pca = RandomizedPCA(n_components=2).fit_transform(X)
X_pca.shape
Out[10]:
In [12]:
from itertools import cycle
colors = ['b', 'g', 'r', 'c', 'm', 'y', 'k']
for i, c in zip(np.unique(y), cycle(colors)):
    pl.scatter(X_pca[y == i, 0], X_pca[y == i, 1],
               c=c, label=i, alpha=0.5)
#_ = pl.legend(loc='best')
We can observe that even in 2D, the groups of digits are quite well separated, especially the digit "0", which is very different from any other (the closest being "6", as it often shares most of the left-hand side pixels). We can also observe that, at least in 2D, there is quite a bit of overlap between the "1", "2" and "7" digits.
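As a quick aside (not part of the original session), we can check how much of the pixel variance those two components actually capture; this assumes RandomizedPCA exposes the explained_variance_ratio_ attribute, as it does in the scikit-learn versions of that era:
# Illustrative aside: fraction of variance captured by the 2 retained components.
pca = RandomizedPCA(n_components=2).fit(X)
print("explained variance ratios: %r" % list(pca.explained_variance_ratio_))
print("total: %.3f" % pca.explained_variance_ratio_.sum())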
Overfitting is the problem of learning the training data by heart and being unable to generalize by making correct predictions on data samples that were not seen during training.
To illustrate this, let's train a Support Vector Machine naively on the digits dataset:
In [17]:
from sklearn.svm import SVC
SVC().fit(X, y).score(X, y)
Out[17]:
In [18]:
SVC()
Out[18]:
Did we really learn a perfect model that can recognize the correct digit class 100% of the time? Without new data it's impossible to tell.
Let's start again and split the dataset into two random, non-overlapping subsets:
In [23]:
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)  # fixing random_state makes the split reproducible
print("train data shape: %r, train target shape: %r"
% (X_train.shape, y_train.shape))
print("test data shape: %r, test target shape: %r"
% (X_test.shape, y_test.shape))
Let's retrain a new model on the first subset, called the training set:
In [15]:
svc = SVC(kernel='rbf').fit(X_train, y_train)
train_score = svc.score(X_train, y_train)
train_score
Out[15]:
We can now compute the performance of the model on new, held out data from the test set:
In [16]:
test_score = svc.score(X_test, y_test)
test_score
Out[16]:
This score is clearly not as good as expected! The model cannot generalize so well to new, unseen data.
Whenever the test score is not as good as the train score, the model is overfitting.
Whenever the train score is not close to 100% accuracy, the model is underfitting.
Ideally we want to neither overfit nor underfit: test_score ~= train_score ~= 1.0.
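To make these two rules of thumb concrete, here is a small illustrative helper (not from the original session; the tolerance values are arbitrary assumptions):
# Illustrative diagnostic helper: the 0.95 and 0.05 thresholds are arbitrary assumptions.
def diagnose(train_score, test_score, low_tol=0.95, gap_tol=0.05):
    if train_score < low_tol:
        print("Possible underfitting: train score is only %.3f" % train_score)
    if train_score - test_score > gap_tol:
        print("Possible overfitting: train/test gap is %.3f" % (train_score - test_score))

diagnose(train_score, test_score)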
The previous example failed to generalize well to test data because we naively used the default parameters of the SVC class:
In [19]:
svc
Out[19]:
Let's try again with another parameterization:
In [20]:
svc_2 = SVC(kernel='rbf', C=100, gamma=0.001).fit(X_train, y_train)
svc_2
Out[20]:
In [21]:
svc_2.score(X_train, y_train)
Out[21]:
In [22]:
svc_2.score(X_test, y_test)
Out[22]:
In this case the model is almost perfectly able to generalize, at least according to our random train, test split.
Cross Validation is a procedure that repeats the train / test split several times, so as to get a more accurate estimate of the real test score by averaging the values found on the individual runs.
The sklearn.cross_validation package provides many strategies to compute such splits, using classes that implement the Python iterator API:
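As an aside (not shown in the original session), ShuffleSplit below is only one of these strategies; KFold is another class implementing the same iterator API, sketched here assuming n_samples is defined as above:
# Illustrative aside: KFold implements the same "iterate over (train, test) indices" API.
from sklearn.cross_validation import KFold

kfold_cv = KFold(n_samples, n_folds=3, shuffle=True, random_state=0)
for fold, (train, test) in enumerate(kfold_cv):
    print("fold %d: %d train / %d test samples" % (fold, len(train), len(test)))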
In [26]:
from sklearn.cross_validation import ShuffleSplit
cv = ShuffleSplit(n_samples, n_iter=3, test_size=0.1,
                  random_state=0)

for cv_index, (train, test) in enumerate(cv):
    print("# Cross Validation Iteration #%d" % cv_index)
    print("train indices: {0}...".format(train[:10]))
    print("test indices: {0}...".format(test[:10]))

    svc = SVC(kernel="rbf", C=1, gamma=0.001).fit(X[train], y[train])
    print("train score: {0:.3f}, test score: {1:.3f}\n".format(
        svc.score(X[train], y[train]), svc.score(X[test], y[test])))
Instead of doing the above manually, sklearn.cross_validation
provides a little utility function to compute the cross validated test scores automatically:
In [27]:
from sklearn.cross_validation import cross_val_score
svc = SVC(kernel="rbf", C=1, gamma=0.001)
cv = ShuffleSplit(n_samples, n_iter=10, test_size=0.1,
                  random_state=0)
test_scores = cross_val_score(svc, X, y, cv=cv, n_jobs=2)
test_scores
Out[27]:
In [31]:
from scipy.stats import sem  # standard error of the mean

def mean_score(scores):
    """Format the empirical mean score and the standard error of the mean."""
    return ("Mean score: {0:.3f} (+/-{1:.3f})").format(
        np.mean(scores), sem(scores))
In [29]:
print(mean_score(test_scores))
Exercise:
- Compute cross-validated test scores on X and y (use sklearn.cross_validation.ShuffleSplit) for SVC(C=1, gamma=0.01).
- Retry with SVC(C=10, gamma=0.005), then SVC(C=10, gamma=0.001) with 500 samples, and look at the distribution of the test scores.
- Optional: use a smoothed kernel density estimation scipy.stats.kde.gaussian_kde instead of a histogram to visualize the test error distribution.
Hints, type:
from sklearn.cross_validation import ShuffleSplit
ShuffleSplit? # to read the docstring of the shuffle split
pl.hist? # to read the docstring of the histogram plot
In [33]:
from sklearn.cross_validation import cross_val_score
svc = SVC(kernel="rbf", C=1, gamma=0.001)
cv = ShuffleSplit(n_samples, n_iter=50, test_size=0.1,
                  random_state=0)
test_scores = cross_val_score(svc, X, y, cv=cv, n_jobs=2)
test_scores
Out[33]:
In [54]:
from sklearn.cross_validation import ShuffleSplit
cv = ShuffleSplit(n_samples, n_iter=50, train_size=500, test_size=500,
                  random_state=0)

for cv_index, (train, test) in enumerate(cv):
    print("# Cross Validation Iteration #%d" % cv_index)
    print("train indices: {0}...".format(train[:10]))
    print("test indices: {0}...".format(test[:10]))

    svc = SVC(kernel="rbf", C=1, gamma=0.001).fit(X[train], y[train])
    print("train score: {0:.3f}, test score: {1:.3f}\n".format(
        svc.score(X[train], y[train]), svc.score(X[test], y[test])))
Cross Validation makes it possible to evaluate the performance of a model class and its hyperparameters on the task at hand.
A natural extension is thus to run CV several times for various values of the parameters so as to find the best ones. For instance, let's fix the SVC parameter C=10 and compute the cross-validated test score for various values of gamma:
In [56]:
n_gammas = 10
n_iter = 5
cv = ShuffleSplit(n_samples, n_iter=n_iter, train_size=500, test_size=500,
                  random_state=0)

train_scores = np.zeros((n_gammas, n_iter))
test_scores = np.zeros((n_gammas, n_iter))
gammas = np.logspace(-7, -1, n_gammas)

for i, gamma in enumerate(gammas):
    for j, (train, test) in enumerate(cv):
        clf = SVC(C=10, gamma=gamma).fit(X[train], y[train])
        train_scores[i, j] = clf.score(X[train], y[train])
        test_scores[i, j] = clf.score(X[test], y[test])
In [57]:
for i in range(n_iter):
    pl.semilogx(gammas, train_scores[:, i], alpha=0.4, lw=2, c='b')
    pl.semilogx(gammas, test_scores[:, i], alpha=0.4, lw=2, c='g')
pl.ylabel("score for SVC(C=10, gamma=gamma)")
pl.xlabel("gamma")
#pl.text(1e-6, 0.5, "Underfitting", fontsize=16, ha='center', va='bottom')
#pl.text(1e-4, 0.5, "Good", fontsize=16, ha='center', va='bottom')
#pl.text(1e-2, 0.5, "Overfitting", fontsize=16, ha='center', va='bottom')
Out[57]:
We can see that, for this model class on this unscaled dataset, when C=10 there is a sweet spot region for gamma around $10^{-4}$ to $10^{-3}$: both the train and test scores are high (low errors).
If gamma is too low, the train score is low (and thus the test score too, as it generally cannot be better than the train score): the model is not expressive enough to represent the data; it is in an underfitting regime.
If gamma is too high, the train score is fine but there is a large discrepancy between the test and train scores: the model is learning the training data and its noise by heart and fails to generalize to new, unseen data; it is in an overfitting regime.
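To make the sweet spot explicit, we can simply look at the mean test score for each value of gamma (an illustrative addition reusing the gammas and test_scores arrays computed above):
# Locate the gamma value with the best mean cross-validated test score.
mean_test_scores = test_scores.mean(axis=1)
best_idx = np.argmax(mean_test_scores)
print("best gamma: %g (mean test score: %.3f)"
      % (gammas[best_idx], mean_test_scores[best_idx]))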
We can do the same kind of analysis to identify good values for C when gamma is fixed to $10^{-3}$:
In [38]:
n_Cs = 10
n_iter = 5
cv = ShuffleSplit(n_samples, n_iter=n_iter, train_size=500, test_size=500,
                  random_state=0)

train_scores = np.zeros((n_Cs, n_iter))
test_scores = np.zeros((n_Cs, n_iter))
Cs = np.logspace(-5, 5, n_Cs)

for i, C in enumerate(Cs):
    for j, (train, test) in enumerate(cv):
        clf = SVC(C=C, gamma=1e-3).fit(X[train], y[train])
        train_scores[i, j] = clf.score(X[train], y[train])
        test_scores[i, j] = clf.score(X[test], y[test])
In [43]:
for i in range(n_iter):
    pl.semilogx(Cs, train_scores[:, i], alpha=0.4, lw=2, c='b')
    pl.semilogx(Cs, test_scores[:, i], alpha=0.4, lw=2, c='g')
pl.ylabel("score for SVC(C=C, gamma=1e-3)")
pl.xlabel("C")
#pl.text(1e-3, 0.5, "Underfitting", fontsize=16, ha='center', va='bottom')
#pl.text(1e3, 0.5, "A little overfitting", fontsize=16, ha='center', va='bottom')
Out[43]:
Doing this procedure manually for each parameter combination is tedious; we can automate it by computing the test score for all possible combinations of parameters with the GridSearchCV helper.
In [45]:
from sklearn.grid_search import GridSearchCV
#help(GridSearchCV)
In [46]:
from pprint import pprint
svc_params = {
    'C': np.logspace(-1, 2, 4),
    'gamma': np.logspace(-4, 0, 5),
}
pprint(svc_params)
In [47]:
n_subsamples = 500
X_small_train, y_small_train = X_train[:n_subsamples], y_train[:n_subsamples]
In [48]:
gs_svc = GridSearchCV(SVC(), svc_params, cv=3, n_jobs=-1)
%time _ = gs_svc.fit(X_small_train, y_small_train)
In [49]:
gs_svc.best_params_, gs_svc.best_score_
Out[49]:
Let's define a couple of helper functions to introspect the details of the grid search outcome:
In [50]:
def display_scores(params, scores, append_star=False):
    """Format the mean score +/- std error for params."""
    params = ", ".join("{0}={1}".format(k, v)
                       for k, v in params.items())
    line = "{0}:\t{1:.3f} (+/-{2:.3f})".format(
        params, np.mean(scores), sem(scores))
    if append_star:
        line += " *"
    return line

def display_grid_scores(grid_scores, top=None):
    """Helper function to format a report on a grid of scores."""
    grid_scores = sorted(grid_scores, key=lambda x: x[1], reverse=True)
    if top is not None:
        grid_scores = grid_scores[:top]

    # Compute a threshold for starring models whose standard error
    # interval overlaps with that of the best model:
    _, best_mean, best_scores = grid_scores[0]
    threshold = best_mean - 2 * sem(best_scores)

    for params, mean_score, scores in grid_scores:
        append_star = mean_score + 2 * sem(scores) > threshold
        print(display_scores(params, scores, append_star=append_star))
In [51]:
display_grid_scores(gs_svc.grid_scores_, top=20)
One can see that Support Vector Machines with an RBF kernel are very sensitive to the gamma parameter (the bandwidth of the kernel), and to a lesser extent to the C parameter as well. If those parameters are not grid searched, the predictive accuracy of the support vector machine is almost no better than random guessing!
By default, the GridSearchCV class refits a final model on the complete training set with the best parameters found during the grid search:
In [52]:
gs_svc.score(X_test, y_test)
Out[52]:
Evaluating this final model on the real test set will often yield a better score because of the larger training set, especially when the training set is small and the number of cross validation folds is small (cv=3
here).
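As an aside, that refit model is available as the best_estimator_ attribute of the fitted GridSearchCV object, so it can be inspected or used for predictions directly (reusing the gs_svc and X_test objects from above):
# Inspect the refit model and use it to predict on held-out data.
best_svc = gs_svc.best_estimator_
print(best_svc)
print("first 10 predictions: %r" % list(best_svc.predict(X_test[:10])))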
Exercise:
- Find a set of parameters for an sklearn.tree.DecisionTreeClassifier on the X_small_train / y_small_train digits dataset that reaches at least 75% accuracy on the sample dataset (500 training samples). In particular, grid search good values for criterion, min_samples_split and max_depth.
- Then try sklearn.ensemble.ExtraTreesClassifier(n_estimators=30), which is a randomized ensemble of decision trees. Do the parameters that make the single trees work best also make the ensemble model work best?

Hints:
- The number of cross validation folds is set with the cv constructor parameter. The default value is cv=3. Increasing it to cv=5 or cv=10 often yields more stable results, but at the price of longer evaluation times.
- Start with a small grid, e.g. 2 values for criterion and 3 for min_samples_split, to avoid having to wait for too long at first.

Type:

from sklearn.tree import DecisionTreeClassifier
DecisionTreeClassifier?  # to read the docstring and know the list of important parameters
print(DecisionTreeClassifier())  # to show the list of default values

from sklearn.ensemble import ExtraTreesClassifier
ExtraTreesClassifier?
print(ExtraTreesClassifier())
In [ ]:
In order to better understand the behavior of a model (model class + constructor parameters), it is possible to run several cross validation steps for various random sub-samples of the training set and then plot the mean training and test errors.
These plots are called learning curves.
sklearn does not yet provide turn-key utilities to plot such learning curves, but it is not very complicated to compute them by leveraging the ShuffleSplit class. First let's define a range of dataset sizes for subsampling the training set:
In [58]:
train_sizes = np.logspace(2, 3, 5).astype(np.int)
train_sizes
Out[58]:
For each training set size we will compute n_iter cross validation iterations. Let's pre-allocate the arrays to store the results:
In [59]:
n_iter = 5
train_scores = np.zeros((train_sizes.shape[0], n_iter), dtype=np.float)
test_scores = np.zeros((train_sizes.shape[0], n_iter), dtype=np.float)
We can now loop over training set sizes and CV iterations:
In [60]:
svc = SVC(C=1, gamma=0.0005)
for i, train_size in enumerate(train_sizes):
    cv = ShuffleSplit(n_samples, n_iter=n_iter, train_size=train_size)
    for j, (train, test) in enumerate(cv):
        svc.fit(X[train], y[train])
        train_scores[i, j] = svc.score(X[train], y[train])
        test_scores[i, j] = svc.score(X[test], y[test])
We can now plot the mean scores with error bars that reflect the standard errors of the means:
In [61]:
mean_train = np.mean(train_scores, axis=1)
confidence = sem(train_scores, axis=1) * 2

pl.fill_between(train_sizes, mean_train - confidence, mean_train + confidence,
                color='b', alpha=.2)
pl.plot(train_sizes, mean_train, 'o-k', c='b', label='Train score')

mean_test = np.mean(test_scores, axis=1)
confidence = sem(test_scores, axis=1) * 2

pl.fill_between(train_sizes, mean_test - confidence, mean_test + confidence,
                color='g', alpha=.2)
pl.plot(train_sizes, mean_test, 'o-k', c='g', label='Test score')

pl.xlabel('Training set size')
pl.ylabel('Score')
pl.xlim(0, X_train.shape[0])
pl.ylim((None, 1.01))  # The best possible score is 1.0
pl.legend(loc='best')
pl.title('Mean train and test scores +/- 2 standard errors')

pl.text(250, 0.9, "Overfitting a lot", fontsize=16, ha='center', va='bottom')
pl.text(800, 0.9, "Overfitting a little", fontsize=16, ha='center', va='bottom')
Out[61]:
If the training set error is high (e.g. more than 5% misclassification) at the end of the learning curve, the model suffers from high bias and is said to underfit the training set.
If the testing set error is significantly larger than the training set error, the model suffers from high variance and is said to overfit the training set.
Another possible source of high training and testing error is label noise: the data is too noisy and there is little signal to learn from it.
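As an illustrative check (reusing the mean_train and mean_test arrays computed for the learning curve above), these two diagnostics can be read off the last point of the curves:
# Rough bias / variance reading at the largest training set size.
final_train_error = 1 - mean_train[-1]
final_gap = mean_train[-1] - mean_test[-1]
print("final training error: %.3f (high => high bias / underfitting)" % final_train_error)
print("final train/test gap: %.3f (high => high variance / overfitting)" % final_gap)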
In both regimes, the main knob to adjust is the regularization of the model, for instance C for SVMs or alpha for penalized linear models.
Grid search parameter tuning can itself be considered a (meta-)learning algorithm. Hence there is a risk of not taking into account the overfitting of the grid search procedure itself.
To quantify and mitigate this risk we can nest the train / test split concept one level up:
Make a top level "Development / Evaluation" split of the dataset: model selection (cross validation, grid search) is performed on the development set only, and the evaluation set is used once at the very end to check the selected model (see the sketch below).
For datasets sampled over time, it is highly recommended to use a temporal split for the Development / Evaluation split: for instance, if you have collected data over the 2008-2013 period, you can use the earlier years for the development set and keep the most recent years for the final evaluation set.
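A minimal sketch of such a nested split on the digits data (the variable names, grid values and split size below are illustrative assumptions, not taken from the original session):
# Nested split sketch: the evaluation set is never seen by the grid search.
from sklearn.cross_validation import train_test_split
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVC

# Top level split: development set for model selection, evaluation set kept aside.
X_dev, X_eval, y_dev, y_eval = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Grid search (with its own internal CV splits) runs on the development set only.
gs = GridSearchCV(SVC(), {'C': [1, 10, 100], 'gamma': [1e-4, 1e-3, 1e-2]}, cv=3)
gs.fit(X_dev, y_dev)

# The held-out evaluation set gives a final, less optimistic estimate.
print("evaluation score: {0:.3f}".format(gs.score(X_eval, y_eval)))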
In this session we applied the SVC model with an RBF kernel to unnormalized features: this is bad! If we had used a normalizer, the default parameters for C and gamma of SVC would directly have led to close to optimal performance:
In [ ]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
clf = SVC().fit(X_train_scaled, y_train) # Look Ma'! Default params!
print("Train score: {0:.3f}".format(clf.score(X_train_scaled, y_train)))
print("Test score: {0:.3f}".format(clf.score(X_test_scaled, y_test)))
This is because, once normalized, the digits dataset is very regular and fits the assumptions of the default parameters of the SVC class very well. This is rarely the case though, and it is usually still necessary to grid search the parameters.
Nonetheless, scaling should be a mandatory preprocessing step when using SVC, especially with an RBF kernel.
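One convenient way to make this scaling systematic (an aside not covered in the original session) is to chain the scaler and the classifier with sklearn.pipeline.Pipeline, so that cross validation and grid search always apply the same scaling that was fit on the training data:
# Chain the scaling and the classifier so they are always applied consistently.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svc', SVC()),
])
pipeline.fit(X_train, y_train)
print("Test score: {0:.3f}".format(pipeline.score(X_test, y_test)))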