Support Vector Machines

Instructor: Alessandro Gagliardi
TA: Kevin Perko

Last Time:

  • SciKit-Learn
    • Training a predictive model on numerical features
    • Model evaluation and interpretation
    • Cross-validation
    • More feature engineering and richer models
  • Decision Trees
    • Building Decision Trees
    • Optimization Functions
    • Preventing Overfit
    • Implementing Decision Trees with Scikit-Learn

Questions?

Agenda

  1. Theory
    1. Intro to Support Vector Machines (SVMs)
    2. Maximum Margin Hyperplanes
    3. Slack Variables
    4. Nonlinear Classification with Kernels
  2. Practice
    1. Cross-Validation
    2. Model Selection with Grid Search
    3. Learning Curves and the Bias-Variance trade-off
    4. Final Model Assessment

Intro to Support Vector Machines (SVMs)

What is a support vector machine?

A support vector machine is an algorithm that constructs a linear binary classifier using geometric reasoning

Goal: explicitly constructed to minimize generalization error

Binary: solves a two-class problem

Linear: creates a linear decision boundary

How is the decision boundary derived?

SVMs use geometric reasoning to determine classification

More importantly, they use the geometric concept of margin

Geometry break: margins

Here, the margin means the region along the decision boundary that is free of data points.

The margin drives the SVM solution: moving even a single point on the margin can completely change the decision boundary!

Margin: Example

source: [Data Analysis with Open Source Tools](http://shop.oreilly.com/product/9780596802363.do), by Philipp K. Janert. O’Reilly Media, 2011.

The goal of an SVM is to create the linear decision boundary with the largest margin. This is commonly called the maximum margin hyperplane.

A hyperplane is just a high-dimensional generalization of a line.

If SVM is a linear classifier, how can you use it for nonlinear classification?

Using a clever maneuver called the kernel trick.
(more on that later)

Nonlinear applications of SVM rely on an implicit (nonlinear) mapping $\Phi$ that sends vectors from the original feature space $K$ into a higher-dimensional feature space $K'$.

Nonlinear classification in $K$ is then obtained by creating a linear decision boundary in $K'$.

In practice, this involves no computations in the higher dimensional space!

Maximum Margin Hyperplanes

What's our goal for optimization?

An SVM always solves for the decision boundary that minimizes generalization error

This is the same as finding a boundary that has the greatest margin

The wider the margin, the clearer the distinction between the two classes!

Margin: Example

source: [Data Analysis with Open Source Tools](http://shop.oreilly.com/product/9780596802363.do), by Philipp K. Janert. O’Reilly Media, 2011.

Hyperplanes and Support Vectors

Optimizing a support vector machine amounts to finding the maximum margin hyperplane (mmh).

Notice that the margin depends only on a subset of the training data; namely, those points that are nearest to the decision boundary.

These points are called the support vectors.

The other points (far from the decision boundary) don’t affect the construction of the mmh at all!
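As a concrete sketch (not part of the original slides; it assumes scikit-learn and a toy dataset from make_blobs), the fitted SVC object exposes the support vectors directly:

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# A small, linearly separable two-class toy problem
X, y = make_blobs(n_samples=50, centers=2, cluster_std=0.6, random_state=0)

# Large C: a (nearly) hard margin, very few margin violations allowed
clf = SVC(kernel='linear', C=1000).fit(X, y)

print(clf.support_vectors_)   # coordinates of the points that define the mmh
print(clf.support_)           # their indices in the training set
print(clf.n_support_)         # number of support vectors per class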

All of the decision boundaries we’ve seen so far have split the data perfectly; i.e. the data are linearly separable, and therefore the training error is 0.

This means that...

We are always solving for a single global optimum (rather than local optima), because the SVM objective is convex.

If our data (like in the previous examples) is linearly separable, the training error is 0.

However... when is data ever really like that?

We need an approach that handles data where this is not the case

Optimizing Large Margin: A Review

SVMs will always optimize using Support Vectors and hyperplanes

Support vectors are the data points on the margin, where classification is least clear

The best hyperplane is the one that yields the largest margin.

Slack Variables

Recall that in building the hard margin classifier, we assumed that our data was linearly separable (i.e., that we could perfectly classify each record with a linear decision boundary).

Suppose that this was not true, or suppose that we wanted to use a larger margin at the expense of incurring some training error.

This can be done by introducing slack variables.

Slack variables $\xi$ generalize the optimization problem to permit some misclassified training records (which come at a cost $C$).

Result: Soft Margin Classifier. Attempts to split data as cleanly as possible, maximizing the margin of correctly classified data

This is an example of the bias-variance tradeoff.

Example of soft margins


source: [pyml.sourceforge.net/doc/howto.pdf](http://pyml.sourceforge.net/doc/howto.pdf)

Selecting slack variables

In practice, the cost $C$ is tuned over an exponentially growing sequence of candidate values.

Higher values of $C =$ fewer margin violations and lower training error (narrower margin, risk of overfitting)

Lower values of $C =$ more training error tolerated, but a wider margin and better generalization
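As a hedged sketch (my own illustration, not from the original deck; the dataset and split are arbitrary), this is what scanning an exponentially growing sequence of $C$ values looks like with scikit-learn:

from sklearn.datasets import make_classification
from sklearn.cross_validation import train_test_split
from sklearn.svm import SVC
import numpy as np

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Exponentially growing grid of candidate costs: 2^-5, 2^-1, ..., 2^15
for C in 2.0 ** np.arange(-5, 16, 4):
    clf = SVC(kernel='linear', C=C).fit(X_train, y_train)
    print("C=%g  train=%.3f  test=%.3f"
          % (C, clf.score(X_train, y_train), clf.score(X_test, y_test)))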

Review: Slack Variables

Slack variables allow some misclassification in exchange for a more general, more accurate model

Soft margins are created as a result of the cost (generalization) parameter C

We trade off training accuracy for generalization by picking a lower value of C to prevent overfitting

Nonlinear Classification with Kernels

Nonlinear Classification

Suppose we need a more complex classifier than a linear decision boundary allows.

One possibility is to add nonlinear combinations of features to the data, and then to create a linear decision boundary in the enhanced (higher-dimensional) feature space.

This linear decision boundary will be mapped to a nonlinear decision boundary in the original feature space.

Example: working with non-linear classification

Nonlinear Classification

The logic of this approach is sound, but there are a few problems with this version.

In particular, this will not scale well, since it requires many high-dimensional calculations.

It will likely lead to more complexity (both modeling complexity and computational complexity) than we want.

Nonlinear Classification

Let’s hang on to the logic of the previous example, namely:

  • remap the feature vectors $x_i$ into a higher-dimensional space $K'$
  • create a linear decision boundary in $K'$
  • back out the nonlinear decision boundary in $K$ from the result

But we want to save ourselves the trouble of doing a lot of additional high-dimensional calculations. How can we do this?

Kernel Trick

Instead, we use a kernel function, which computes the inner product of two vectors as if they had been mapped into the higher-dimensional feature space $K'$, without ever computing the mapping explicitly.

The upshot is that we can use a kernel function to implicitly train our model in a higher-dimensional feature space, without incurring additional computational complexity!

As long as the kernel function satisfies certain conditions, our conclusions above regarding the mmh continue to hold.

These conditions are contained in a result called Mercer’s theorem.

In other words, no algorithmic changes are necessary, and all the benefits of a linear SVM are maintained.

Linear Kernel: $$ k(x,x') = \langle x,x' \rangle $$

Polynomial Kernel: $$ k(x,x') = (x^\top x'+1)^d $$

Gaussian Kernel: $$ k(x,x') = \exp(-\gamma\|x-x'\|^2) $$

Radial Basis Kernel: $$ k(x,x') = \exp\left(-\frac{\|x-x'\|^2_2}{2\sigma^2}\right) $$

The hyperparameters $d, \gamma$ affect the flexibility of the decision boundary.
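As a quick numeric sanity check (my own sketch, not from the original deck), the degree-2 polynomial kernel evaluated in the original space $K$ matches the plain dot product after an explicit degree-2 feature expansion into $K'$, which is exactly what lets us skip the explicit mapping:

import numpy as np

def phi(x):
    # Explicit degree-2 polynomial feature map for a 2-D vector
    x1, x2 = x
    return np.array([x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2, 1.0])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = np.dot(phi(x), phi(z))     # inner product in the 6-D space K'
kernel = (np.dot(x, z) + 1) ** 2      # polynomial kernel evaluated in K
print(explicit, kernel)               # both equal 4.0

In scikit-learn these kernels are selected with SVC(kernel='linear'), SVC(kernel='poly', degree=d) and SVC(kernel='rbf', gamma=gamma).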

Example: Linear & Polynomial Kernels

Kernels: Review

Kernel methods are powerful and popular techniques that can produce accurate results

They are computationally efficient ways to deal with nonlinear classification problems

Beware: entering dangerous territory here: tread lightly!

SVMs (and kernel methods in general) are versatile, powerful, and popular techniques that can produce accurate results for a wide array of classification problems.

The main disadvantage of SVMs is the lack of intuition they produce. These models are truly black boxes!

Model Selection and Assessment

cd parallel_ml_tutorial/notebooks  
ipython notebook  
02 - Model Selection and Assessment.ipynb

Outline of the session:

  • Model performance evaluation and detection of overfitting with Cross-Validation
  • Hyper parameter tuning and model selection with Grid Search
  • Error analysis with learning curves and the Bias-Variance trade-off
  • Overfitting via Model Selection and the Development / Evaluation set split

In [1]:
#%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

# Some nice default configuration for plots
plt.rcParams['figure.figsize'] = 10, 7.5
plt.rcParams['axes.grid'] = True
plt.gray()


<matplotlib.figure.Figure at 0x1059b78d0>

The Hand Written Digits Dataset

Let's load a simple dataset of 8x8 gray level images of handwritten digits (bundled in the sklearn source code):


In [2]:
from sklearn.datasets import load_digits
digits = load_digits()
print(digits.DESCR)


 Optical Recognition of Handwritten Digits Data Set

Notes
-----
Data Set Characteristics:
    :Number of Instances: 5620
    :Number of Attributes: 64
    :Attribute Information: 8x8 image of integer pixels in the range 0..16.
    :Missing Attribute Values: None
    :Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)
    :Date: July; 1998

This is a copy of the test set of the UCI ML hand-written digits datasets
http://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits

The data set contains images of hand-written digits: 10 classes where
each class refers to a digit.

Preprocessing programs made available by NIST were used to extract
normalized bitmaps of handwritten digits from a preprinted form. From a
total of 43 people, 30 contributed to the training set and different 13
to the test set. 32x32 bitmaps are divided into nonoverlapping blocks of
4x4 and the number of on pixels are counted in each block. This generates
an input matrix of 8x8 where each element is an integer in the range
0..16. This reduces dimensionality and gives invariance to small
distortions.

For info on NIST preprocessing routines, see M. D. Garris, J. L. Blue, G.
T. Candela, D. L. Dimmick, J. Geist, P. J. Grother, S. A. Janet, and C.
L. Wilson, NIST Form-Based Handprint Recognition System, NISTIR 5469,
1994.

References
----------
  - C. Kaynak (1995) Methods of Combining Multiple Classifiers and Their
    Applications to Handwritten Digit Recognition, MSc Thesis, Institute of
    Graduate Studies in Science and Engineering, Bogazici University.
  - E. Alpaydin, C. Kaynak (1998) Cascading Classifiers, Kybernetika.
  - Ken Tang and Ponnuthurai N. Suganthan and Xi Yao and A. Kai Qin.
    Linear dimensionalityreduction using relevance weighted LDA. School of
    Electrical and Electronic Engineering Nanyang Technological University.
    2005.
  - Claudio Gentile. A New Approximate Maximal Margin Classification
    Algorithm. NIPS. 2000.


In [3]:
X, y = digits.data, digits.target

print("data shape: %r, target shape: %r" % (X.shape, y.shape))
print("classes: %r" % list(np.unique(y)))


data shape: (1797, 64), target shape: (1797,)
classes: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [4]:
n_samples, n_features = X.shape
print("n_samples=%d" % n_samples)
print("n_features=%d" % n_features)


n_samples=1797
n_features=64

In [5]:
def plot_gallery(data, labels, shape, interpolation='nearest'):
    for i in range(data.shape[0]):
        plt.subplot(1, data.shape[0], (i + 1))
        plt.imshow(data[i].reshape(shape), interpolation=interpolation)
        plt.title(labels[i])
        plt.xticks(()), plt.yticks(())

In [6]:
subsample = np.random.permutation(X.shape[0])[:5]
images = X[subsample]
labels = ['True class: %d' % l for l in y[subsample]]
plot_gallery(images, labels, shape=(8, 8))


Let's visualize the dataset on a 2D plane using a projection on the first 2 axes extracted by Principal Component Analysis:


In [7]:
from sklearn.decomposition import RandomizedPCA

pca = RandomizedPCA(n_components=2)
%time X_pca = pca.fit_transform(X)

X_pca.shape


CPU times: user 11.6 ms, sys: 3.81 ms, total: 15.4 ms
Wall time: 10.9 ms
Out[7]:
(1797, 2)

In [8]:
from itertools import cycle

colors = ['b', 'g', 'r', 'c', 'm', 'y', 'k']
markers = ['+', 'o', '^', 'v', '<', '>', 'D', 'h', 's']
for i, c, m in zip(np.unique(y), cycle(colors), cycle(markers)):
    plt.scatter(X_pca[y == i, 0], X_pca[y == i, 1],
        c=c, marker=m, label=i, alpha=0.5)
    
_ = plt.legend(loc='best')


We can observe that even in 2D, the groups of digits are quite well separated, especially the digit "0", which is very different from any other (the closest being "6", as it often shares most of the left-hand side pixels). We can also observe that, at least in 2D, there is quite a bit of overlap between the "1", "2" and "7" digits.

To better understand the meaning of the "x" and "y" axes of this plot, it is also useful to visualize the values of the first two principal components that are used to compute this projection:


In [9]:
labels = ['Component #%d' % i for i in range(len(pca.components_))]
plot_gallery(pca.components_, labels, shape=(8, 8))


Overfitting

Overfitting is the problem of learning the training data by heart and being unable to generalize, i.e. to make correct predictions on data samples unseen during training.

To illustrate this, let's train a Support Vector Machine naively on the digits dataset:


In [10]:
from sklearn.svm import SVC
SVC().fit(X, y).score(X, y)


Out[10]:
1.0

Did we really learn a perfect model that can recognize the correct digit class 100% of the time? Without new data it's impossible to tell.

Let's start again and split the dataset into two random, non overlapping subsets:


In [11]:
from sklearn.cross_validation import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

print("train data shape: %r, train target shape: %r"
      % (X_train.shape, y_train.shape))
print("test data shape: %r, test target shape: %r"
      % (X_test.shape, y_test.shape))


train data shape: (1347, 64), train target shape: (1347,)
test data shape: (450, 64), test target shape: (450,)

Let's retrain a new model on the first subset, called the training set:


In [12]:
svc = SVC(kernel='rbf').fit(X_train, y_train)
train_score = svc.score(X_train, y_train) 
train_score


Out[12]:
1.0

We can now compute the performance of the model on new, held out data from the test set:


In [13]:
test_score = svc.score(X_test, y_test)
test_score


Out[13]:
0.48666666666666669

This score is clearly not as good as expected! The model does not generalize well to new, unseen data.

  • Whenever the test data score is not as good as the train score the model is overfitting

  • Whenever the train score is not close to 100% accuracy the model is underfitting

Ideally we want to neither overfit nor underfit: test_score ~= train_score ~= 1.0.

The previous example failed to generalize well to the test data because we naively used the default parameters of the SVC class:


In [14]:
svc


Out[14]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

Let's try again with another parameterization:


In [15]:
svc_2 = SVC(kernel='rbf', C=100, gamma=0.001).fit(X_train, y_train)
svc_2


Out[15]:
SVC(C=100, cache_size=200, class_weight=None, coef0=0.0, degree=3,
  gamma=0.001, kernel='rbf', max_iter=-1, probability=False,
  random_state=None, shrinking=True, tol=0.001, verbose=False)

In [16]:
svc_2.score(X_train, y_train)


Out[16]:
1.0

In [17]:
svc_2.score(X_test, y_test)


Out[17]:
0.99333333333333329

In this case the model is almost perfectly able to generalize, at least according to our random train, test split.

Cross Validation

Cross Validation is a procedure to repeat the train / test split several times so as to get a more accurate estimate of the real test score by averaging the values found in the individual runs.

The sklearn.cross_validation package provides many strategies to compute such splits using classes that implement the Python iterator API:


In [18]:
from sklearn.cross_validation import ShuffleSplit

cv = ShuffleSplit(n_samples, n_iter=3, test_size=0.1,
    random_state=0)

for cv_index, (train, test) in enumerate(cv):
    print("# Cross Validation Iteration #%d" % cv_index)
    print("train indices: {0}...".format(train[:10]))
    print("test indices: {0}...".format(test[:10]))
    
    svc = SVC(kernel="rbf", C=1, gamma=0.001).fit(X[train], y[train])
    print("train score: {0:.3f}, test score: {1:.3f}\n".format(
        svc.score(X[train], y[train]), svc.score(X[test], y[test])))


# Cross Validation Iteration #0
train indices: [ 353    5   58 1349 1025  575 1074 1110 1745  689]...
test indices: [1081 1707  927  713  262  182  303  895  933 1266]...
train score: 0.999, test score: 0.989

# Cross Validation Iteration #1
train indices: [1336  608  977   22  526 1587 1130  569 1481  962]...
test indices: [1014  755 1633  117  181  501  948 1076   45  659]...
train score: 0.998, test score: 0.994

# Cross Validation Iteration #2
train indices: [ 451  409  911 1551  133  691 1306  111  852  825]...
test indices: [ 795  697  655  573  412  743  635  851 1466 1383]...
train score: 0.999, test score: 0.994

Instead of doing the above manually, sklearn.cross_validation provides a little utility function to compute the cross validated test scores automatically:


In [19]:
from sklearn.cross_validation import cross_val_score

svc = SVC(kernel="rbf", C=1, gamma=0.001)
cv = ShuffleSplit(n_samples, n_iter=10, test_size=0.1,
    random_state=0)

test_scores = cross_val_score(svc, X, y, cv=cv, n_jobs=2)
test_scores


Out[19]:
array([ 0.98888889,  0.99444444,  0.99444444,  0.99444444,  0.99444444,
        0.99444444,  0.98888889,  0.99444444,  0.98888889,  1.        ])

In [20]:
from scipy.stats import sem

def mean_score(scores):
    """Print the empirical mean score and standard error of the mean."""
    return ("Mean score: {0:.3f} (+/-{1:.3f})").format(
        np.mean(scores), sem(scores))

In [21]:
print(mean_score(test_scores))


Mean score: 0.993 (+/-0.001)

Exercise:

  • Perform 50 iterations of cross validation with folds of 500 training samples and 500 test samples randomly sampled from X and y (use sklearn.cross_validation.ShuffleSplit).
  • Try with SVC(C=1, gamma=0.01)
  • Plot the distribution of the test error using a histogram with 50 bins (a starter sketch for these first steps follows this list).
  • Try to increase the training size
  • Retry with SVC(C=10, gamma=0.005), then SVC(C=10, gamma=0.001) with 500 samples.

  • Optional: use a smoothed kernel density estimation scipy.stats.kde.gaussian_kde instead of a histogram to visualize the test error distribution.
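One possible starting point for the first bullets (a sketch only; it reuses n_samples, X, y and plt defined earlier in the notebook):

from sklearn.cross_validation import ShuffleSplit, cross_val_score
from sklearn.svm import SVC

# 50 iterations, 500 train / 500 test samples per split, as requested above
cv = ShuffleSplit(n_samples, n_iter=50, train_size=500, test_size=500,
                  random_state=0)
scores = cross_val_score(SVC(C=1, gamma=0.01), X, y, cv=cv, n_jobs=2)

# Distribution of the test error (1 - score) over the 50 splits
_ = plt.hist(1 - scores, bins=50)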

Hints, type:

from sklearn.cross_validation import ShuffleSplit
ShuffleSplit?  # to read the docstring of the shuffle split
plt.hist?  # to read the docstring of the histogram plot

In [ ]:

Cross Validation makes it possible to evaluate the performance of a model class and its hyper parameters on the task at hand.

A natural extension is thus to run CV several times for various values of the parameters so as to find the best ones. For instance, let's fix the SVC parameter C to 10 and compute the cross validated test score for various values of gamma:


In [22]:
n_gammas = 10
n_iter = 5
cv = ShuffleSplit(n_samples, n_iter=n_iter, train_size=500, test_size=500,
    random_state=0)

train_scores = np.zeros((n_gammas, n_iter))
test_scores = np.zeros((n_gammas, n_iter))
gammas = np.logspace(-7, -1, n_gammas)

for i, gamma in enumerate(gammas):
    for j, (train, test) in enumerate(cv):
        clf = SVC(C=10, gamma=gamma).fit(X[train], y[train])
        train_scores[i, j] = clf.score(X[train], y[train])
        test_scores[i, j] = clf.score(X[test], y[test])

In [23]:
for i in range(n_iter):
    plt.semilogx(gammas, train_scores[:, i], alpha=0.4, lw=2, c='b')
    plt.semilogx(gammas, test_scores[:, i], alpha=0.4, lw=2, c='g')
plt.ylabel("score for SVC(C=10, gamma=gamma)")
plt.xlabel("gamma")
plt.text(1e-6, 0.5, "Underfitting", fontsize=16, ha='center', va='bottom')
plt.text(1e-4, 0.5, "Good", fontsize=16, ha='center', va='bottom')
plt.text(1e-2, 0.5, "Overfitting", fontsize=16, ha='center', va='bottom')


Out[23]:
<matplotlib.text.Text at 0x1075a6050>

We can see that, for this model class, on this unscaled dataset: when C=10, there is a sweet spot region for gamma around $10^{-4}$ to $10^{-3}$. Both the train and test scores are high (low errors).

  • If gamma is too low, the train score is low (and thus the test score too, as it generally cannot be better than the train score): the model is not expressive enough to represent the data. The model is in an underfitting regime.

  • If gamma is too high, the train score is fine but there is a large discrepancy between the test and train scores: the model is learning the training data and its noise by heart and fails to generalize to new unseen data. The model is in an overfitting regime.

We can do the same kind of analysis to identify good values for C when gamma is fixed to $10^{-3}$:


In [24]:
n_Cs = 10
n_iter = 5
cv = ShuffleSplit(n_samples, n_iter=n_iter, train_size=500, test_size=500,
    random_state=0)

train_scores = np.zeros((n_Cs, n_iter))
test_scores = np.zeros((n_Cs, n_iter))
Cs = np.logspace(-5, 5, n_Cs)

for i, C in enumerate(Cs):
    for j, (train, test) in enumerate(cv):
        clf = SVC(C=C, gamma=1e-3).fit(X[train], y[train])
        train_scores[i, j] = clf.score(X[train], y[train])
        test_scores[i, j] = clf.score(X[test], y[test])

In [25]:
for i in range(n_iter):
    plt.semilogx(Cs, train_scores[:, i], alpha=0.4, lw=2, c='b')
    plt.semilogx(Cs, test_scores[:, i], alpha=0.4, lw=2, c='g')
plt.ylabel("score for SVC(C=C, gamma=1e-3)")
plt.xlabel("C")
plt.text(1e-3, 0.5, "Underfitting", fontsize=16, ha='center', va='bottom')
plt.text(1e3, 0.5, "Few Overfitting", fontsize=16, ha='center', va='bottom')


Out[25]:
<matplotlib.text.Text at 0x10759ef90>

Doing this procedure several times for each parameter combination is tedious, hence it's possible to automate it by computing the test score for all possible combinations of parameters using the GridSearchCV helper.


In [26]:
from sklearn.grid_search import GridSearchCV
#help(GridSearchCV)

In [27]:
from pprint import pprint
svc_params = {
    'C': np.logspace(-1, 2, 4),
    'gamma': np.logspace(-4, 0, 5),
}
pprint(svc_params)


{'C': array([   0.1,    1. ,   10. ,  100. ]),
 'gamma': array([  1.00000000e-04,   1.00000000e-03,   1.00000000e-02,
         1.00000000e-01,   1.00000000e+00])}

As Grid Search is a costly procedure, let's do the following experiments with a smaller dataset:


In [28]:
n_subsamples = 500
X_small_train, y_small_train = X_train[:n_subsamples], y_train[:n_subsamples]

In [29]:
gs_svc = GridSearchCV(SVC(), svc_params, cv=3, n_jobs=-1)

%time _ = gs_svc.fit(X_small_train, y_small_train)


CPU times: user 110 ms, sys: 51.3 ms, total: 161 ms
Wall time: 1.12 s

In [30]:
gs_svc.best_params_, gs_svc.best_score_


Out[30]:
({'C': 1.0, 'gamma': 0.001}, 0.96599999999999997)

Let's define a couple of helper functions to help us introspect the details of the grid search outcome:


In [31]:
def display_scores(params, scores, append_star=False):
    """Format the mean score +/- std error for params"""
    params = ", ".join("{0}={1}".format(k, v)
                      for k, v in params.items())
    line = "{0}:\t{1:.3f} (+/-{2:.3f})".format(
        params, np.mean(scores), sem(scores))
    if append_star:
        line += " *"
    return line

def display_grid_scores(grid_scores, top=None):
    """Helper function to format a report on a grid of scores"""
    
    grid_scores = sorted(grid_scores, key=lambda x: x[1], reverse=True)
    if top is not None:
        grid_scores = grid_scores[:top]
        
    # Compute a threshold for starring models with overlapping
    # stderr:
    _, best_mean, best_scores = grid_scores[0]
    threshold = best_mean - 2 * sem(best_scores)
    
    for params, mean_score, scores in grid_scores:
        append_star = mean_score + 2 * sem(scores) > threshold
        print(display_scores(params, scores, append_star=append_star))

In [32]:
display_grid_scores(gs_svc.grid_scores_, top=20)


C=1.0, gamma=0.001:	0.966 (+/-0.009) *
C=10.0, gamma=0.0001:	0.966 (+/-0.008) *
C=10.0, gamma=0.001:	0.966 (+/-0.002) *
C=100.0, gamma=0.001:	0.966 (+/-0.002) *
C=100.0, gamma=0.0001:	0.962 (+/-0.007) *
C=1.0, gamma=0.0001:	0.922 (+/-0.003)
C=0.1, gamma=0.001:	0.722 (+/-0.008)
C=10.0, gamma=0.01:	0.314 (+/-0.018)
C=100.0, gamma=0.01:	0.314 (+/-0.018)
C=1.0, gamma=0.01:	0.266 (+/-0.012)
C=0.1, gamma=0.0001:	0.168 (+/-0.003)
C=0.1, gamma=0.01:	0.128 (+/-0.002)
C=0.1, gamma=0.1:	0.128 (+/-0.002)
C=0.1, gamma=1.0:	0.128 (+/-0.002)
C=1.0, gamma=0.1:	0.128 (+/-0.002)
C=1.0, gamma=1.0:	0.128 (+/-0.002)
C=10.0, gamma=0.1:	0.128 (+/-0.002)
C=10.0, gamma=1.0:	0.128 (+/-0.002)
C=100.0, gamma=0.1:	0.128 (+/-0.002)
C=100.0, gamma=1.0:	0.128 (+/-0.002)

One can see that Support Vector Machines with an RBF kernel are very sensitive to the gamma parameter (the bandwidth of the kernel) and, to a lesser extent, to the C parameter as well. If those parameters are not grid searched, the predictive accuracy of the support vector machine is almost no better than random guessing!

By default, the GridSearchCV class refits a final model on the complete training set with the best parameters found during the grid search:


In [33]:
gs_svc.score(X_test, y_test)


Out[33]:
0.98444444444444446

Evaluating this final model on the real test set will often yield a better score because of the larger training set, especially when the training set is small and the number of cross validation folds is small (cv=3 here).

Exercise:

  1. Find a set of parameters for an sklearn.tree.DecisionTreeClassifier on the X_small_train / y_small_train digits dataset to reach at least 75% accuracy on the sample dataset (500 training samples)
  2. In particular you can grid search good values for criterion, min_samples_split and max_depth
  3. Which parameter(s) seems to be the most important to tune?
  4. Retry with sklearn.ensemble.ExtraTreesClassifier(n_estimators=30) which is a randomized ensemble of decision trees. Do the parameters that make the single trees work best also make the ensemble model work best?

Hints:

  • If the outcome of the grid search is too unstable (overlapping std errors), increase the number of CV folds with the cv constructor parameter. The default value is cv=3. Increasing it to cv=5 or cv=10 often yields more stable results but at the price of longer evaluation times.
  • Start with a small grid, e.g. 2 values for criterion and 3 for min_samples_split only, to avoid having to wait too long at first.

Type:

from sklearn.tree import DecisionTreeClassifier
DecisionTreeClassifier?  # to read the docstring and know the list of important parameters
print(DecisionTreeClassifier())  # to show the list of default values

from sklearn.ensemble import ExtraTreesClassifier
ExtraTreesClassifier? 
print(ExtraTreesClassifier())
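A minimal sketch of how the grid search for the tree might be set up (the parameter grid is an illustrative assumption, not a worked solution; GridSearchCV, display_grid_scores and X_small_train / y_small_train come from the cells above):

from sklearn.tree import DecisionTreeClassifier

tree_params = {
    'criterion': ['gini', 'entropy'],
    'min_samples_split': [2, 10, 20],
    'max_depth': [5, 10, None],
}
gs_tree = GridSearchCV(DecisionTreeClassifier(), tree_params, cv=5, n_jobs=-1)
gs_tree.fit(X_small_train, y_small_train)

print(gs_tree.best_params_, gs_tree.best_score_)
display_grid_scores(gs_tree.grid_scores_, top=10)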

In [ ]:

Plotting Learning Curves for Bias-Variance analysis

In order to better understand the behavior of a model (model class + constructor parameters), it is possible to run several cross validation steps for various random sub-samples of the training set and then plot the mean training and test errors.

These plots are called the learning curves.

sklearn does not yet provide turn-key utilities to plot such learning curves, but it is not very complicated to compute them by leveraging the ShuffleSplit class. First let's define a range of dataset sizes for subsampling the training set:


In [34]:
train_sizes = np.logspace(2, 3, 5).astype(np.int)
train_sizes


Out[34]:
array([ 100,  177,  316,  562, 1000])

For each training set size we will compute n_iter cross validation iterations. Let's pre-allocate the arrays to store the results:


In [35]:
n_iter = 20
train_scores = np.zeros((train_sizes.shape[0], n_iter), dtype=np.float)
test_scores = np.zeros((train_sizes.shape[0], n_iter), dtype=np.float)

We can now loop over training set sizes and CV iterations:


In [36]:
svc = SVC(C=1, gamma=0.0005)

for i, train_size in enumerate(train_sizes):
    cv = ShuffleSplit(n_samples, n_iter=n_iter, train_size=train_size)
    for j, (train, test) in enumerate(cv):
        svc.fit(X[train], y[train])
        train_scores[i, j] = svc.score(X[train], y[train])
        test_scores[i, j] = svc.score(X[test], y[test])

We can now plot the mean scores with error bars that reflect the standard errors of the means:


In [37]:
mean_train = np.mean(train_scores, axis=1)
confidence = sem(train_scores, axis=1) * 2

plt.fill_between(train_sizes, mean_train - confidence, mean_train + confidence,
                color = 'b', alpha = .2)
plt.plot(train_sizes, mean_train, 'o-k', c='b', label='Train score')

mean_test = np.mean(test_scores, axis=1)
confidence = sem(test_scores, axis=1) * 2

plt.fill_between(train_sizes, mean_test - confidence, mean_test + confidence,
                color = 'g', alpha = .2)
plt.plot(train_sizes, mean_test, 'o-k', c='g', label='Test score')

plt.xlabel('Training set size')
plt.ylabel('Score')
plt.xlim(0, X_train.shape[0])
plt.ylim((None, 1.01))  # The best possible score is 1.0
plt.legend(loc='best')
plt.title('Mean train and test scores +/- 2 standard errors')

plt.text(250, 0.9, "Overfitting a lot", fontsize=16, ha='center', va='bottom')
plt.text(800, 0.9, "Overfitting a little", fontsize=16, ha='center', va='bottom')


Out[37]:
<matplotlib.text.Text at 0x108c269d0>

Interpreting Learning Curves

  • If the training set error is high (e.g. more than 5% misclassification) at the end of the learning curve, the model suffers from high bias and is said to underfit the training set.

  • If the testing set error is significantly larger than the training set error, the model suffers from high variance and is said to overfit the training set.

Another possible source of high training and testing error is label noise: the data is too noisy and there is too little signal to learn from it.

What to do against overfitting?

  • Try to get rid of noisy features using feature selection methods (or better let the model do it if the regularization is able to do so: for instance l1 penalized linear models)
  • Try to tune parameters to add more regularization:
    • Smaller values of C for SVM
    • Larger values of alpha for penalized linear models
    • Restrict to shallower trees (decision stumps) and require larger minimum numbers of samples per leaf for tree-based models
  • Try simpler model families such as penalized linear models (e.g. Linear SVM, Logistic Regression, Naive Bayes)
  • Try the ensemble strategies that average several independently trained models (e.g. bagging or blending ensembles): average the predictions of independently trained models
  • Collect more labeled samples if the learning curve of the test score has a non-zero slope on the right hand side.

What to do against underfitting?

  • Give more freedom to the model by relaxing some parameters that act as regularizers:
    • Larger values of C for SVM
    • Smaller values of alpha for penalized linear models
    • Allow deeper trees and lower minimum numbers of samples per leaf for tree-based models
  • Try more complex / expressive model families:
    • Non linear kernel SVMs,
    • Ensemble of Decision Trees...
  • Construct new features:
    • bi-gram frequencies for text classifications
    • feature cross-products (possibly using the hashing trick)
    • unsupervised features extraction (e.g. triangle k-means, auto-encoders...)
    • non-linear kernel approximations + linear SVM instead of a simple linear SVM (see the sketch below)
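As a sketch of the last bullet (my own illustration; the gamma and n_components values are arbitrary guesses), an approximate RBF kernel expansion followed by a plain linear SVM:

from sklearn.kernel_approximation import RBFSampler
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline

# Random Fourier features approximate the RBF kernel map explicitly,
# then a linear SVM is trained in that expanded feature space
model = Pipeline([
    ('rbf_features', RBFSampler(gamma=0.001, n_components=300, random_state=0)),
    ('linear_svm', LinearSVC(C=1)),
])
model.fit(X_train, y_train)
print("Test score: {0:.3f}".format(model.score(X_test, y_test)))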

Final Model Assessment

Grid search parameter tuning can itself be considered a (meta-)learning algorithm. Hence there is a risk of not taking into account the overfitting of the grid search procedure itself.

To quantify and mitigate this risk we can nest the train / test split concept one level up:

Make a top-level "Development / Evaluation" split (a minimal code sketch follows this outline):

  • Development set used for Grid Search and training of the model with optimal parameter set
  • Hold out evaluation set used only for estimating the predictive performance of the resulting model

For datasets sampled over time, it is highly recommended to use a temporal split for the Development / Evaluation split: for instance, if you have collected data over the 2008-2013 period, you can:

  • use 2008-2011 for development (grid search optimal parameters and model class),
  • 2012-2013 for evaluation (compute the test score of the best model parameters).
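Returning to the digits data, a minimal sketch of this Development / Evaluation split (a plain random split, since the digits are not time-dependent; train_test_split, SVC, svc_params and GridSearchCV are all defined above):

# Top-level split: the development set feeds the grid search,
# the evaluation set is held out until the very end
X_dev, X_eval, y_dev, y_eval = train_test_split(X, y, test_size=0.25,
                                                random_state=1)

gs = GridSearchCV(SVC(), svc_params, cv=5, n_jobs=-1)
gs.fit(X_dev, y_dev)

# The evaluation set is used exactly once, after model selection is done
print("Selected parameters: %r" % gs.best_params_)
print("Evaluation score: %.3f" % gs.score(X_eval, y_eval))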

One Final Note About kernel SVM Parameters Tuning

In this session we applied the SVC model with an RBF kernel on unnormalized features: this is bad! If we had used a normalizer, the default parameters for C and gamma of SVC would directly have led to close to optimal performance:


In [38]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

clf = SVC().fit(X_train_scaled, y_train)  # Look Ma'! Default params!
print("Train score: {0:.3f}".format(clf.score(X_train_scaled, y_train)))
print("Test score: {0:.3f}".format(clf.score(X_test_scaled, y_test)))


Train score: 0.996
Test score: 0.984

This is because, once normalized, the digits dataset is very regular and fits the assumptions of the default parameters of the SVC class very well. This is rarely the case though, and it is usually necessary to grid search the parameters.

Nonetheless, scaling should be a mandatory preprocessing step when using SVC, especially with an RBF kernel.

Lab/Homework

03 - Distributed Model Selection and Assessment

How to use IPython.parallel (e.g. StarCluster) to distribute model selection and assessment across many engines.