In [ ]:
%load_ext load_style
%load_style talk.css

Classification with scikit-learn

What we're going to do during this session is give an example of supervised learning, and more specifically we're going to see how to solve a classification problem in scikit-learn, with a focus on how one evaluates the performance of a model.

We're going to use a dataset that comes with scikit-learn, which consists of representations of hand-written digits (normalized 8 x 8 pixel images), each with its associated label (the correct digit).

This example is treated in a more comprehensive manner by Olivier Grisel (see his notebooks here)

In [ ]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from IPython.display import Image, HTML
%matplotlib inline

In [ ]:
from sklearn.datasets import load_digits
digits = load_digits()

In [ ]:
X, y = digits.data, digits.target

print("data shape: %r, target shape: %r" % (X.shape, y.shape))
print("labels: %r" % list(np.unique(y)))

In [ ]:
def plot_gallery(data, labels, shape, interpolation='nearest'):
    # one row of panels, one panel per image
    f, ax = plt.subplots(1, data.shape[0], figsize=(16, 5))
    for i in range(data.shape[0]):
        ax[i].imshow(data[i].reshape(shape), interpolation=interpolation,
                     cmap=plt.cm.gray)
        ax[i].set_title(labels[i])
        ax[i].set_xticks(()), ax[i].set_yticks(())

In [ ]:
subsample = np.random.permutation(X.shape[0])[:5]
images = X[subsample]
labels = ['True label: %d' % l for l in y[subsample]]
plot_gallery(images, labels, shape=(8, 8))

Example of hand-written digit classification with Support Vector Machines (SVM)

We are importing the svm.SVC (Support Vector Classifier class) from scikit-learn

In [ ]:
from sklearn.svm import SVC


In [ ]:
svc = SVC()


In [ ]:
svc.fit(X, y)


In [ ]:


In [ ]:
y_hat = svc.predict(X)

In [ ]:
np.all(y_hat == y)  # np.alltrue was removed in NumPy 2.0

Have we got a perfect model?

Here we are making an important methodological mistake: we are using all the available instances to train the model, and then using the same instances to evaluate its accuracy. This tells us (almost) nothing about the actual performance of the model in production, just how well it can reproduce the data it has been exposed to.

A way to work around that is to train the model on a subset of the available instances (the training set), calculate the train score, and test the model (i.e. calculate the test score) on the remaining instances (the test set).

Cross-validation consists of repeating this operation several times using successive splits of the original dataset into training and test sets, and calculating a summary statistic (usually the average) of the train and test scores over the iterations.

Several splits can be used:

  • Random split: a given percentage of the data is selected at random (without replacement within a split) as the test set; test sets from different iterations may overlap
  • K-folds: the dataset is divided into K non-overlapping folds; each fold is used in turn as the test set, while the K-1 remaining folds are used as the training set
  • Stratified K-folds: for classification mainly. The folds are constructed so that the class distribution is approximately the same in each fold (i.e. the relative frequency of each class is preserved)
  • Leave One Out: like K-fold with K = N, the number of instances. One instance is left out, the model is built on the N-1 remaining instances, and this procedure is repeated until every instance has been used once as the test set.
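The four splitting strategies listed above can be sketched with the splitter classes from sklearn.model_selection. This is a minimal illustration on a made-up toy array (X_toy and y_toy are not part of the digits data), and the numbers of splits are chosen arbitrarily:

```python
import numpy as np
from sklearn.model_selection import (KFold, LeaveOneOut, ShuffleSplit,
                                     StratifiedKFold)

# Toy data (illustrative only): 12 instances, 2 balanced classes
X_toy = np.arange(24).reshape(12, 2)
y_toy = np.array([0] * 6 + [1] * 6)

splitters = {
    "random split": ShuffleSplit(n_splits=3, test_size=0.25, random_state=0),
    "K-fold": KFold(n_splits=4),
    "stratified K-fold": StratifiedKFold(n_splits=3),
    "leave one out": LeaveOneOut(),
}

n_splits = {}
for name, splitter in splitters.items():
    # every splitter yields (train indices, test indices) pairs
    splits = list(splitter.split(X_toy, y_toy))
    n_splits[name] = len(splits)
    train, test = splits[0]
    print("%-18s %2d splits, first test set: %s" % (name, len(splits), test))

# With stratification, each test fold keeps the 50/50 class balance
strat_counts = [np.bincount(y_toy[test], minlength=2).tolist()
                for _, test in splitters["stratified K-fold"].split(X_toy, y_toy)]
```

Note that Leave One Out produces as many splits as there are instances, which is why it becomes expensive on large datasets.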

Cross-validation in scikit-learn

In [ ]:
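# sklearn.cross_validation was removed in scikit-learn 0.20;
# train_test_split now lives in sklearn.model_selection
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)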

print("train data shape: %r, train target shape: %r"
      % (X_train.shape, y_train.shape))
print("test data shape: %r, test target shape: %r"
      % (X_test.shape, y_test.shape))

In [ ]:
svc = SVC().fit(X_train, y_train)
train_score = svc.score(X_train, y_train)
train_score

In [ ]:
test_score = svc.score(X_test, y_test)
test_score

OK, that looks more like a 'normal' result:

  • if the test score is markedly lower than the train score, the model is overfitting

  • if the train score itself is far from 100% accuracy, the model is underfitting

Ideally we want to neither overfit nor underfit: test_score ~= train_score ~= 1.0.
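To make the overfitting diagnosis concrete, here is a small sketch that compares train and test accuracy for two SVC settings on the digits data. The helper train_test_scores and the value gamma=1.0 are illustrative choices, not part of the original notebook; a very large gamma is used only because it makes the RBF kernel so local that the model essentially memorises the training set.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

digits = load_digits()
X_tr, X_te, y_tr, y_te = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=1)


def train_test_scores(model):
    """Fit on the training set and return (train accuracy, test accuracy)."""
    model.fit(X_tr, y_tr)
    return model.score(X_tr, y_tr), model.score(X_te, y_te)


# gamma=1.0 is deliberately far too large for this data (illustrative value)
overfit_train, overfit_test = train_test_scores(SVC(C=1, gamma=1.0))
# the setting used later in this notebook
good_train, good_test = train_test_scores(SVC(C=100, gamma=0.001))

print("gamma=1.0   train: %.3f  test: %.3f" % (overfit_train, overfit_test))
print("gamma=0.001 train: %.3f  test: %.3f" % (good_train, good_test))
```

A large gap between the two numbers on the first line is the signature of overfitting; on the second line both scores should be close to each other.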

When setting up a Support Vector Machine classifier, one needs to choose 2 parameters (hyper-parameters) which are NOT tuned at the fitting stage (they are NOT learned). These are C and $\gamma$ (see the relevant section in the Wikipedia article). What we did before was to instantiate the SVC class without specifying these parameters, which means that the defaults were used. Let's try something else.
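As a quick check of what "the defaults" are, the parameters of an estimator can be listed with get_params(). The values printed depend on the installed scikit-learn version (in recent releases the default for gamma is the string 'scale' rather than a number), so treat the output as informative rather than fixed:

```python
from sklearn.svm import SVC

# get_params() returns the constructor arguments of the estimator as a dict
params = SVC().get_params()
for name in ("kernel", "C", "gamma"):
    print("%s = %r" % (name, params[name]))
```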

In [ ]:
svc_2 = SVC(C=100, gamma=0.001).fit(X_train, y_train)

In [ ]:
svc_2.score(X_train, y_train)

In [ ]:
svc_2.score(X_test, y_test)

In [ ]:
sum(svc_2.predict(X_test) == y_test) / float(len(y_test))

Could be luck (we only used one train / test split here). Now we're going to use cross-validation to repeat the train / test split several times, so as to get a more accurate estimate of the real test score by averaging the values found over the individual runs.

scikit-learn provides a very convenient interface to do that: sklearn.model_selection (called sklearn.cross_validation before version 0.18, and removed under that name in 0.20)

In [ ]:
from sklearn import model_selection

In [ ]:

In [ ]:

In [ ]:
from sklearn.model_selection import ShuffleSplit

# random_state is fixed only to make the splits reproducible (arbitrary value)
cv = ShuffleSplit(n_splits=3, test_size=0.2, random_state=0)

# cv.split(X) yields one (train indices, test indices) pair per iteration
for cv_index, (train, test) in enumerate(cv.split(X)):
    print("# Cross Validation Iteration #%d" % cv_index)
    print("train indices: {0}...".format(train[:10]))
    print("test indices: {0}...".format(test[:10]))
    svc = SVC(C=100, gamma=0.001).fit(X[train], y[train])
    print("train score: {0:.3f}, test score: {1:.3f}\n".format(
        svc.score(X[train], y[train]), svc.score(X[test], y[test])))

There's a wrapper for estimating cross-validated scores directly: you just have to pass it a cross-validation splitter like the one instantiated before

In [ ]:
from sklearn.model_selection import ShuffleSplit, cross_val_score

svc = SVC(C=100, gamma=0.001)

# random_state is fixed only to make the splits reproducible (arbitrary value)
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)

test_scores = cross_val_score(svc, X, y, cv=cv, n_jobs=4) # n_jobs = 4 if you have a quad-core machine ...
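The array returned by cross_val_score holds one test score per split, so the usual summary is its mean and spread. A self-contained sketch, reloading the digits under the hypothetical names X_d and y_d so that it runs on its own:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.svm import SVC

X_d, y_d = load_digits(return_X_y=True)

# 10 random 80/20 splits; random_state fixed for reproducibility
cv_d = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
scores = cross_val_score(SVC(C=100, gamma=0.001), X_d, y_d, cv=cv_d)

mean_score, std_score = scores.mean(), scores.std()
print("test accuracy: %.3f +/- %.3f over %d splits"
      % (mean_score, std_score, len(scores)))
```

The standard deviation gives a rough idea of how much a single train / test split can mislead.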

Cross validation can be used to estimate the best hyperparameters for a model

Let's see what happens when we fix C but vary $\gamma$

In [ ]:
from sklearn.model_selection import ShuffleSplit

n_iter = 5 # the number of iterations should be more than that ...

gammas = np.logspace(-7, -1, 10) # should be more fine grained ...

C = 1 # C is held fixed while gamma varies

cv = ShuffleSplit(n_splits=n_iter, test_size=0.2)

train_scores = np.zeros((len(gammas), n_iter))
test_scores = np.zeros((len(gammas), n_iter))

for i, gamma in enumerate(gammas):
    # new random splits are drawn for each gamma (no random_state is set)
    for j, (train, test) in enumerate(cv.split(X)):
        clf = SVC(C=C, gamma=gamma).fit(X[train], y[train])
        train_scores[i, j] = clf.score(X[train], y[train])
        test_scores[i, j] = clf.score(X[test], y[test])

In [ ]:
f, ax = plt.subplots(figsize=(12,8))
#for i in range(n_iter):
#    ax.semilogx(gammas, train_scores[:, i], alpha=0.2, lw=2, c='b')
#    ax.semilogx(gammas, test_scores[:, i], alpha=0.2, lw=2, c='g')
ax.semilogx(gammas, test_scores.mean(1), lw=4, c='g', label='test score')
ax.semilogx(gammas, train_scores.mean(1), lw=4, c='b', label='train score')

ax.fill_between(gammas, train_scores.min(1), train_scores.max(1), color = 'b', alpha=0.2)
ax.fill_between(gammas, test_scores.min(1), test_scores.max(1), color = 'g', alpha=0.2)

# raw strings keep the LaTeX backslashes from being read as escape sequences
ax.set_ylabel(r"score for SVC(C=%4.2f, $\gamma=\gamma$)" % C, fontsize=16)
best_gamma = gammas[np.argmax(test_scores.mean(1))]
best_score = test_scores.mean(1).max()
ax.text(best_gamma, best_score+0.05, r"$\gamma$ = %6.4f | score=%6.4f" % (best_gamma, best_score),
        fontsize=15, bbox=dict(facecolor='w',alpha=0.5))
ax.tick_params(axis='both', labelsize=16)
ax.legend(fontsize=16,  loc=0)
ax.set_ylim(0, 1.1)
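The manual double loop over $\gamma$ values and splits can also be written with validation_curve from sklearn.model_selection, which returns the train and test scores as arrays of shape (number of parameter values, number of splits). A minimal sketch with a coarser grid and fewer splits than above, chosen only to keep it quick (gamma_range, X_d and y_d are names introduced for this example):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import ShuffleSplit, validation_curve
from sklearn.svm import SVC

X_d, y_d = load_digits(return_X_y=True)

gamma_range = np.logspace(-7, -1, 7)  # coarse grid, illustrative only
cv_d = ShuffleSplit(n_splits=3, test_size=0.2, random_state=0)

# one row per gamma value, one column per split
train_sc, test_sc = validation_curve(
    SVC(C=1), X_d, y_d,
    param_name="gamma", param_range=gamma_range, cv=cv_d)

mean_test = test_sc.mean(axis=1)
best = gamma_range[np.argmax(mean_test)]
print("best gamma: %g (mean test score %.3f)" % (best, mean_test.max()))
```

Unlike the loop above, the same splits are reused for every value of gamma here, which makes the scores directly comparable across the grid.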

You can search the (hyper) parameter space and find the best hyperparameters using grid search in scikit-learn

In [ ]:
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in 0.20

In [ ]:
svc_params = {
    'C': np.logspace(-1, 2, 4),
    'gamma': np.logspace(-4, 0, 5),
}

In [ ]:
gs_svc = GridSearchCV(SVC(), svc_params, cv=3, n_jobs=4)

In [ ]:
gs_svc.fit(X, y)

In [ ]:
gs_svc.best_params_, gs_svc.best_score_
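Beyond the single best combination, the fitted search object keeps the mean cross-validated score of every point on the grid in its cv_results_ dictionary. A short sketch on a deliberately small grid (the values in grid are illustrative, not a recommendation):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X_d, y_d = load_digits(return_X_y=True)

grid = {"C": [1, 100], "gamma": [1e-4, 1e-3, 1e-2]}  # small illustrative grid
search = GridSearchCV(SVC(), grid, cv=3).fit(X_d, y_d)

# cv_results_ has one entry per parameter combination
all_params = search.cv_results_["params"]
mean_scores = search.cv_results_["mean_test_score"]
for params, score in zip(all_params, mean_scores):
    print("C=%-5g gamma=%-6g mean test score: %.3f"
          % (params["C"], params["gamma"], score))

print("best:", search.best_params_, "score: %.3f" % search.best_score_)
```

Looking at the whole table shows whether the best score sits on the edge of the grid, in which case the grid probably needs to be extended.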

Exercise: predicting the quality of a wine given a set of physicochemical measurements

Two datasets were created, using red and white wine samples. The inputs include objective tests (e.g. pH values) and the output is based on sensory data (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent).

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

This dataset is available from the UC Irvine Machine Learning Repository

You can try several classification approaches for the quality (treating each quality grade as a discrete class), or you can try regression approaches (using either statsmodels or sklearn): e.g. predicting the alcohol content given the other measurements (or a subset thereof).

In [ ]:
wine  = pd.read_csv('./data/winequality-red.csv', sep=';')

In [ ]:

Below is an example of classification (using the same SVC classifier).

You need to add the cross-validation step yourself.
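One possible shape for that cross-validation step is sketched here, as a hint rather than a solution. It runs on synthetic stand-in data from make_classification (NOT the wine dataset; only the 11 input columns are mimicked), and it puts the scaler and the classifier in a pipeline so that the scaling is re-estimated on each training fold instead of on the full dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: 11 features, 3 classes)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=11, n_informative=6, n_classes=3,
    random_state=0)

# scaling + classifier are fitted together on each training fold
model = make_pipeline(StandardScaler(), SVC())

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
demo_scores = cross_val_score(model, X_demo, y_demo, cv=cv_demo)
print("test accuracy per fold:", demo_scores.round(3))
```

With the real data, X_demo and y_demo would be replaced by the wine measurements and the quality grades.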

In [ ]:
quality = wine.pop('quality')

In [ ]:
y = quality.values

In [ ]:
X = wine.values

In [ ]:
from sklearn.preprocessing import StandardScaler

In [ ]:
scaler = StandardScaler()

In [ ]:
scaler.fit(X)  # learn the per-feature mean and standard deviation

In [ ]:
Xscaled = scaler.transform(X)

In [ ]:
from sklearn.svm import SVC

In [ ]:
svc = SVC()

In [ ]:
svc.fit(Xscaled, y)

In [ ]:
y_hat = svc.predict(Xscaled)

In [ ]:

In [ ]:

In [ ]:
svc.score(Xscaled, y)  # score on the same scaled features the model was trained on

In [ ]:
from sklearn.metrics import confusion_matrix

In [ ]:
confusion_matrix(y, y_hat)
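To read a confusion matrix: rows correspond to the true classes and columns to the predicted classes, so the diagonal counts the correct predictions. A tiny worked example on made-up labels (y_true and y_pred below are invented for illustration and are not wine results):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels, three quality grades
y_true = np.array([5, 5, 5, 6, 6, 6, 6, 7, 7, 7])
y_pred = np.array([5, 6, 5, 6, 6, 5, 6, 6, 7, 7])

cm = confusion_matrix(y_true, y_pred)
print(cm)

# fraction of each true class that was predicted correctly (per-class recall)
recall_per_class = cm.diagonal() / cm.sum(axis=1)
# overall accuracy: correct predictions over all predictions
accuracy = cm.trace() / cm.sum()
print("recall per class:", recall_per_class.round(2))
print("accuracy: %.2f" % accuracy)
```

Here 7 of the 10 labels sit on the diagonal, giving an accuracy of 0.70, while the off-diagonal entries show which grades get confused with which.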