Outline of the session:
In [1]:
%pylab inline
import pylab as pl
import numpy as np
# Some nice default configuration for plots
pl.rcParams['figure.figsize'] = 10, 7.5
pl.rcParams['axes.grid'] = True
pl.gray()
Let's load a simple dataset of 8x8 gray level images of handwritten digits (bundled in the sklearn source code):
In [2]:
from sklearn.datasets import load_digits
digits = load_digits()
print(digits.DESCR)
In [3]:
X, y = digits.data, digits.target
print("data shape: %r, target shape: %r" % (X.shape, y.shape))
print("classes: %r" % list(np.unique(y)))
In [4]:
n_samples, n_features = X.shape
print("n_samples=%d" % n_samples)
print("n_features=%d" % n_features)
In [9]:
for i, j in enumerate(np.random.permutation(X.shape[0])[:5]):
    pl.subplot(1, 5, (i + 1))
    pl.imshow(X[j].reshape((8, 8)), interpolation='nearest')
    pl.title("true class: %d" % y[j])
    pl.xticks(()), pl.yticks(())
Let's visualize the dataset on a 2D plane using a projection on the first 2 axes extracted by Principal Component Analysis:
In [10]:
from sklearn.decomposition import RandomizedPCA
%time X_pca = RandomizedPCA(n_components=2).fit_transform(X)
X_pca.shape
Out[10]:
In [12]:
from itertools import cycle
colors = ['b', 'g', 'r', 'c', 'm', 'y', 'k']
for i, c in zip(np.unique(y), cycle(colors)):
    pl.scatter(X_pca[y == i, 0], X_pca[y == i, 1],
               c=c, label=i, alpha=0.5)
#_ = pl.legend(loc='best')
We can observe that even in 2D, the groups of digits are quite well separated, especially the digit "0", which is very different from any other (the closest being "6", as it often shares most of the left-hand side pixels). We can also observe that, at least in 2D, there is quite a bit of overlap between the "1", "2" and "7" digits.
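As a quick aside (not part of the original session), we can check how much of the pixel variance those two components actually capture; this assumes RandomizedPCA exposes the explained_variance_ratio_ attribute, as it does in the scikit-learn versions of that era:
# Illustrative aside: fraction of variance captured by the 2 retained components.
pca = RandomizedPCA(n_components=2).fit(X)
print("explained variance ratios: %r" % list(pca.explained_variance_ratio_))
print("total: %.3f" % pca.explained_variance_ratio_.sum())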
Overfitting is the problem of learning the training data by heart and being unable to generalize by making correct predictions on data samples that were not seen during training.
To illustrate this, let's train a Support Vector Machine naively on the digits dataset:
In [17]:
from sklearn.svm import SVC
SVC().fit(X, y).score(X, y)
Out[17]:
In [18]:
SVC()
Out[18]:
Did we really learn a perfect model that can recognize the correct digit class 100% of the time? Without new data it's impossible to tell.
Let's start again and split the dataset into two random, non-overlapping subsets:
In [23]:
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)  # fixing random_state makes the split reproducible
print("train data shape: %r, train target shape: %r"
% (X_train.shape, y_train.shape))
print("test data shape: %r, test target shape: %r"
% (X_test.shape, y_test.shape))
Let's retrain a new model on the first subset, called the training set:
In [15]:
svc = SVC(kernel='rbf').fit(X_train, y_train)
train_score = svc.score(X_train, y_train)
train_score
Out[15]:
We can now compute the performance of the model on new, held out data from the test set:
In [16]:
test_score = svc.score(X_test, y_test)
test_score
Out[16]:
This score is clearly not as good as expected! The model cannot generalize so well to new, unseen data.
Whenever the test score is not as good as the train score, the model is overfitting.
Whenever the train score is not close to 100% accuracy, the model is underfitting.
Ideally we want to neither overfit nor underfit: test_score ~= train_score ~= 1.0.
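To make these two rules of thumb concrete, here is a small illustrative helper (not from the original session; the tolerance values are arbitrary assumptions):
# Illustrative diagnostic helper: the 0.95 and 0.05 thresholds are arbitrary assumptions.
def diagnose(train_score, test_score, low_tol=0.95, gap_tol=0.05):
    if train_score < low_tol:
        print("Possible underfitting: train score is only %.3f" % train_score)
    if train_score - test_score > gap_tol:
        print("Possible overfitting: train/test gap is %.3f" % (train_score - test_score))

diagnose(train_score, test_score)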
The previous example failed to generalize well to test data because we naively used the default parameters of the SVC class:
In [19]:
svc
Out[19]:
Let's try again with another parameterization:
In [20]:
svc_2 = SVC(kernel='rbf', C=100, gamma=0.001).fit(X_train, y_train)
svc_2
Out[20]:
In [21]:
svc_2.score(X_train, y_train)
Out[21]:
In [22]:
svc_2.score(X_test, y_test)
Out[22]:
In this case the model is almost perfectly able to generalize, at least according to our random train, test split.
Cross Validation is a procedure that repeats the train / test split several times, so as to get a more accurate estimate of the real test score by averaging the values found on the individual runs.
The sklearn.cross_validation package provides many strategies to compute such splits, using classes that implement the Python iterator API:
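As an aside (not shown in the original session), ShuffleSplit below is only one of these strategies; KFold is another class implementing the same iterator API, sketched here assuming n_samples is defined as above:
# Illustrative aside: KFold implements the same "iterate over (train, test) indices" API.
from sklearn.cross_validation import KFold

kfold_cv = KFold(n_samples, n_folds=3, shuffle=True, random_state=0)
for fold, (train, test) in enumerate(kfold_cv):
    print("fold %d: %d train / %d test samples" % (fold, len(train), len(test)))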
In [26]:
from sklearn.cross_validation import ShuffleSplit
cv = ShuffleSplit(n_samples, n_iter=3, test_size=0.1,
                  random_state=0)

for cv_index, (train, test) in enumerate(cv):
    print("# Cross Validation Iteration #%d" % cv_index)
    print("train indices: {0}...".format(train[:10]))
    print("test indices: {0}...".format(test[:10]))

    svc = SVC(kernel="rbf", C=1, gamma=0.001).fit(X[train], y[train])
    print("train score: {0:.3f}, test score: {1:.3f}\n".format(
        svc.score(X[train], y[train]), svc.score(X[test], y[test])))
Instead of doing the above manually, sklearn.cross_validation
provides a little utility function to compute the cross validated test scores automatically:
In [27]:
from sklearn.cross_validation import cross_val_score
svc = SVC(kernel="rbf", C=1, gamma=0.001)
cv = ShuffleSplit(n_samples, n_iter=10, test_size=0.1,
                  random_state=0)
test_scores = cross_val_score(svc, X, y, cv=cv, n_jobs=2)
test_scores
Out[27]:
In [31]:
from scipy.stats import sem  # standard error of the mean

def mean_score(scores):
    """Format the empirical mean score and the standard error of the mean."""
    return ("Mean score: {0:.3f} (+/-{1:.3f})").format(
        np.mean(scores), sem(scores))
In [29]:
print(mean_score(test_scores))
Exercise:
- Compute cross-validated test scores on X and y (use sklearn.cross_validation.ShuffleSplit) for SVC(C=1, gamma=0.01).
- Retry with SVC(C=10, gamma=0.005), then SVC(C=10, gamma=0.001) with 500 samples, and look at the distribution of the test scores.
- Optional: use a smoothed kernel density estimation scipy.stats.kde.gaussian_kde instead of a histogram to visualize the test error distribution.
Hints, type:
from sklearn.cross_validation import ShuffleSplit
ShuffleSplit? # to read the docstring of the shuffle split
pl.hist? # to read the docstring of the histogram plot
In [33]:
from sklearn.cross_validation import cross_val_score
svc = SVC(kernel="rbf", C=1, gamma=0.001)
cv = ShuffleSplit(n_samples, n_iter=50, test_size=0.1,
                  random_state=0)
test_scores = cross_val_score(svc, X, y, cv=cv, n_jobs=2)
test_scores
Out[33]:
In [54]:
from sklearn.cross_validation import ShuffleSplit
cv = ShuffleSplit(n_samples, n_iter=50, train_size=500, test_size=500,
                  random_state=0)

for cv_index, (train, test) in enumerate(cv):
    print("# Cross Validation Iteration #%d" % cv_index)
    print("train indices: {0}...".format(train[:10]))
    print("test indices: {0}...".format(test[:10]))

    svc = SVC(kernel="rbf", C=1, gamma=0.001).fit(X[train], y[train])
    print("train score: {0:.3f}, test score: {1:.3f}\n".format(
        svc.score(X[train], y[train]), svc.score(X[test], y[test])))
Cross Validation makes it possible to evaluate the performance of a model class and its hyperparameters on the task at hand.
A natural extension is thus to run CV several times for various values of the parameters so as to find the best ones. For instance, let's fix the SVC parameter C=10 and compute the cross-validated test score for various values of gamma:
In [56]:
n_gammas = 10
n_iter = 5
cv = ShuffleSplit(n_samples, n_iter=n_iter, train_size=500, test_size=500,
                  random_state=0)

train_scores = np.zeros((n_gammas, n_iter))
test_scores = np.zeros((n_gammas, n_iter))
gammas = np.logspace(-7, -1, n_gammas)

for i, gamma in enumerate(gammas):
    for j, (train, test) in enumerate(cv):
        clf = SVC(C=10, gamma=gamma).fit(X[train], y[train])
        train_scores[i, j] = clf.score(X[train], y[train])
        test_scores[i, j] = clf.score(X[test], y[test])
In [57]:
for i in range(n_iter):
    pl.semilogx(gammas, train_scores[:, i], alpha=0.4, lw=2, c='b')
    pl.semilogx(gammas, test_scores[:, i], alpha=0.4, lw=2, c='g')
pl.ylabel("score for SVC(C=10, gamma=gamma)")
pl.xlabel("gamma")
#pl.text(1e-6, 0.5, "Underfitting", fontsize=16, ha='center', va='bottom')
#pl.text(1e-4, 0.5, "Good", fontsize=16, ha='center', va='bottom')
#pl.text(1e-2, 0.5, "Overfitting", fontsize=16, ha='center', va='bottom')
Out[57]:
We can see that, for this model class on this unscaled dataset, when C=10 there is a sweet spot region for gamma around $10^{-4}$ to $10^{-3}$: both the train and test scores are high (low errors).
If gamma is too low, the train score is low (and thus the test score too, as it generally cannot be better than the train score): the model is not expressive enough to represent the data; it is in an underfitting regime.
If gamma is too high, the train score is fine but there is a large discrepancy between the test and train scores: the model is learning the training data and its noise by heart and fails to generalize to new, unseen data; it is in an overfitting regime.
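To make the sweet spot explicit, we can simply look at the mean test score for each value of gamma (an illustrative addition reusing the gammas and test_scores arrays computed above):
# Locate the gamma value with the best mean cross-validated test score.
mean_test_scores = test_scores.mean(axis=1)
best_idx = np.argmax(mean_test_scores)
print("best gamma: %g (mean test score: %.3f)"
      % (gammas[best_idx], mean_test_scores[best_idx]))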
We can do the same kind of analysis to identify good values for C when gamma is fixed to $10^{-3}$:
In [38]:
n_Cs = 10
n_iter = 5
cv = ShuffleSplit(n_samples, n_iter=n_iter, train_size=500, test_size=500,
                  random_state=0)

train_scores = np.zeros((n_Cs, n_iter))
test_scores = np.zeros((n_Cs, n_iter))
Cs = np.logspace(-5, 5, n_Cs)

for i, C in enumerate(Cs):
    for j, (train, test) in enumerate(cv):
        clf = SVC(C=C, gamma=1e-3).fit(X[train], y[train])
        train_scores[i, j] = clf.score(X[train], y[train])
        test_scores[i, j] = clf.score(X[test], y[test])
In [43]:
for i in range(n_iter):
    pl.semilogx(Cs, train_scores[:, i], alpha=0.4, lw=2, c='b')
    pl.semilogx(Cs, test_scores[:, i], alpha=0.4, lw=2, c='g')
pl.ylabel("score for SVC(C=C, gamma=1e-3)")
pl.xlabel("C")
#pl.text(1e-3, 0.5, "Underfitting", fontsize=16, ha='center', va='bottom')
#pl.text(1e3, 0.5, "A little overfitting", fontsize=16, ha='center', va='bottom')
Out[43]:
Doing this procedure manually for each parameter combination is tedious; we can automate it by computing the test score for all possible combinations of parameters with the GridSearchCV helper.
In [45]:
from sklearn.grid_search import GridSearchCV
#help(GridSearchCV)
In [46]:
from pprint import pprint
svc_params = {
    'C': np.logspace(-1, 2, 4),
    'gamma': np.logspace(-4, 0, 5),
}
pprint(svc_params)
In [47]:
n_subsamples = 500
X_small_train, y_small_train = X_train[:n_subsamples], y_train[:n_subsamples]
In [48]:
gs_svc = GridSearchCV(SVC(), svc_params, cv=3, n_jobs=-1)
%time _ = gs_svc.fit(X_small_train, y_small_train)
In [49]:
gs_svc.best_params_, gs_svc.best_score_
Out[49]:
Let's define a couple of helper functions to introspect the details of the grid search outcome:
In [50]:
def display_scores(params, scores, append_star=False):
    """Format the mean score +/- std error for params."""
    params = ", ".join("{0}={1}".format(k, v)
                       for k, v in params.items())
    line = "{0}:\t{1:.3f} (+/-{2:.3f})".format(
        params, np.mean(scores), sem(scores))
    if append_star:
        line += " *"
    return line

def display_grid_scores(grid_scores, top=None):
    """Helper function to format a report on a grid of scores."""
    grid_scores = sorted(grid_scores, key=lambda x: x[1], reverse=True)
    if top is not None:
        grid_scores = grid_scores[:top]

    # Compute a threshold for starring models whose standard error
    # interval overlaps with that of the best model:
    _, best_mean, best_scores = grid_scores[0]
    threshold = best_mean - 2 * sem(best_scores)

    for params, mean_score, scores in grid_scores:
        append_star = mean_score + 2 * sem(scores) > threshold
        print(display_scores(params, scores, append_star=append_star))
In [51]:
display_grid_scores(gs_svc.grid_scores_, top=20)
One can see that Support Vector Machines with an RBF kernel are very sensitive to the gamma parameter (the bandwidth of the kernel), and to a lesser extent to the C parameter as well. If those parameters are not grid searched, the predictive accuracy of the support vector machine is almost no better than random guessing!
By default, the GridSearchCV class refits a final model on the complete training set with the best parameters found during the grid search:
In [52]:
gs_svc.score(X_test, y_test)
Out[52]:
Evaluating this final model on the real test set will often yield a better score because of the larger training set, especially when the training set is small and the number of cross validation folds is small (cv=3
here).
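As an aside, that refit model is available as the best_estimator_ attribute of the fitted GridSearchCV object, so it can be inspected or used for predictions directly (reusing the gs_svc and X_test objects from above):
# Inspect the refit model and use it to predict on held-out data.
best_svc = gs_svc.best_estimator_
print(best_svc)
print("first 10 predictions: %r" % list(best_svc.predict(X_test[:10])))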
Exercise:
- Find a set of parameters for an sklearn.tree.DecisionTreeClassifier on the X_small_train / y_small_train digits dataset that reaches at least 75% accuracy on the sample dataset (500 training samples). In particular, grid search good values for criterion, min_samples_split and max_depth.
- Then try sklearn.ensemble.ExtraTreesClassifier(n_estimators=30), which is a randomized ensemble of decision trees. Do the parameters that make the single trees work best also make the ensemble model work best?

Hints:
- The number of cross validation folds is set with the cv constructor parameter. The default value is cv=3. Increasing it to cv=5 or cv=10 often yields more stable results, but at the price of longer evaluation times.
- Start with a small grid, e.g. 2 values for criterion and 3 for min_samples_split, to avoid having to wait for too long at first.

Type:

from sklearn.tree import DecisionTreeClassifier
DecisionTreeClassifier?  # to read the docstring and know the list of important parameters
print(DecisionTreeClassifier())  # to show the list of default values

from sklearn.ensemble import ExtraTreesClassifier
ExtraTreesClassifier?
print(ExtraTreesClassifier())
In [ ]:
In order to better understand the behavior of a model (model class + constructor parameters), it is possible to run several cross validation steps for various random sub-samples of the training set and then plot the mean training and test errors.
These plots are called learning curves.
sklearn does not yet provide turn-key utilities to plot such learning curves, but it is not very complicated to compute them by leveraging the ShuffleSplit class. First let's define a range of dataset sizes for subsampling the training set:
In [58]:
train_sizes = np.logspace(2, 3, 5).astype(np.int)
train_sizes
Out[58]:
For each training set size we will compute n_iter cross validation iterations. Let's pre-allocate the arrays to store the results:
In [59]:
n_iter = 5
train_scores = np.zeros((train_sizes.shape[0], n_iter), dtype=np.float)
test_scores = np.zeros((train_sizes.shape[0], n_iter), dtype=np.float)
We can now loop over training set sizes and CV iterations:
In [60]:
svc = SVC(C=1, gamma=0.0005)
for i, train_size in enumerate(train_sizes):
    cv = ShuffleSplit(n_samples, n_iter=n_iter, train_size=train_size)
    for j, (train, test) in enumerate(cv):
        svc.fit(X[train], y[train])
        train_scores[i, j] = svc.score(X[train], y[train])
        test_scores[i, j] = svc.score(X[test], y[test])
We can now plot the mean scores with error bars that reflect the standard errors of the means:
In [61]:
mean_train = np.mean(train_scores, axis=1)
confidence = sem(train_scores, axis=1) * 2

pl.fill_between(train_sizes, mean_train - confidence, mean_train + confidence,
                color='b', alpha=.2)
pl.plot(train_sizes, mean_train, 'o-k', c='b', label='Train score')

mean_test = np.mean(test_scores, axis=1)
confidence = sem(test_scores, axis=1) * 2

pl.fill_between(train_sizes, mean_test - confidence, mean_test + confidence,
                color='g', alpha=.2)
pl.plot(train_sizes, mean_test, 'o-k', c='g', label='Test score')

pl.xlabel('Training set size')
pl.ylabel('Score')
pl.xlim(0, X_train.shape[0])
pl.ylim((None, 1.01))  # The best possible score is 1.0
pl.legend(loc='best')
pl.title('Mean train and test scores +/- 2 standard errors')

pl.text(250, 0.9, "Overfitting a lot", fontsize=16, ha='center', va='bottom')
pl.text(800, 0.9, "Overfitting a little", fontsize=16, ha='center', va='bottom')
Out[61]:
If the training set error is high (e.g. more than 5% misclassification) at the end of the learning curve, the model suffers from high bias and is said to underfit the training set.
If the testing set error is significantly larger than the training set error, the model suffers from high variance and is said to overfit the training set.
Another possible source of high training and testing error is label noise: the data is too noisy and there is little signal to learn from it.
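As an illustrative check (reusing the mean_train and mean_test arrays computed for the learning curve above), these two diagnostics can be read off the last point of the curves:
# Rough bias / variance reading at the largest training set size.
final_train_error = 1 - mean_train[-1]
final_gap = mean_train[-1] - mean_test[-1]
print("final training error: %.3f (high => high bias / underfitting)" % final_train_error)
print("final train/test gap: %.3f (high => high variance / overfitting)" % final_gap)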
In both regimes, the main knob to adjust is the regularization of the model, for instance C for SVMs or alpha for penalized linear models.
Grid search parameter tuning can itself be considered a (meta-)learning algorithm. Hence there is a risk of not taking into account the overfitting of the grid search procedure itself.
To quantify and mitigate this risk we can nest the train / test split concept one level up:
Make a top level "Development / Evaluation" split of the dataset: model selection (cross validation, grid search) is performed on the development set only, and the evaluation set is used once at the very end to check the selected model (see the sketch below).
For datasets sampled over time, it is highly recommended to use a temporal split for the Development / Evaluation split: for instance, if you have collected data over the 2008-2013 period, you can use the earlier years for the development set and keep the most recent years for the final evaluation set.
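A minimal sketch of such a nested split on the digits data (the variable names, grid values and split size below are illustrative assumptions, not taken from the original session):
# Nested split sketch: the evaluation set is never seen by the grid search.
from sklearn.cross_validation import train_test_split
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVC

# Top level split: development set for model selection, evaluation set kept aside.
X_dev, X_eval, y_dev, y_eval = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Grid search (with its own internal CV splits) runs on the development set only.
gs = GridSearchCV(SVC(), {'C': [1, 10, 100], 'gamma': [1e-4, 1e-3, 1e-2]}, cv=3)
gs.fit(X_dev, y_dev)

# The held-out evaluation set gives a final, less optimistic estimate.
print("evaluation score: {0:.3f}".format(gs.score(X_eval, y_eval)))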
In this session we applied the SVC model with an RBF kernel to unnormalized features: this is bad! If we had used a normalizer, the default parameters for C and gamma of SVC would directly have led to close to optimal performance:
In [ ]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
clf = SVC().fit(X_train_scaled, y_train) # Look Ma'! Default params!
print("Train score: {0:.3f}".format(clf.score(X_train_scaled, y_train)))
print("Test score: {0:.3f}".format(clf.score(X_test_scaled, y_test)))
This is because, once normalized, the digits dataset is very regular and fits the assumptions of the default parameters of the SVC class very well. This is rarely the case though, and it is usually still necessary to grid search the parameters.
Nonetheless, scaling should be a mandatory preprocessing step when using SVC, especially with an RBF kernel.
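One convenient way to make this scaling systematic (an aside not covered in the original session) is to chain the scaler and the classifier with sklearn.pipeline.Pipeline, so that cross validation and grid search always apply the same scaling that was fit on the training data:
# Chain the scaling and the classifier so they are always applied consistently.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svc', SVC()),
])
pipeline.fit(X_train, y_train)
print("Test score: {0:.3f}".format(pipeline.score(X_test, y_test)))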