Confidence Intervals In The Digits Dataset

This notebook illustrates estimating confidence intervals for classification scores on the digits dataset. It is a version of the scikit-learn example Pipelining: chaining a PCA and a logistic regression.

The main points it shows are the use of pandas structures throughout the code, and the ease of creating pipelines using the | operator.

Loading The Data

First we load the dataset into a pandas.DataFrame.


In [1]:
import multiprocessing

import pandas as pd
import numpy as np
from sklearn import datasets
import seaborn as sns
sns.set_style('whitegrid')

# ibex wraps scikit-learn estimators so that they work with pandas structures.
from ibex.sklearn import decomposition as pd_decomposition
from ibex.sklearn import linear_model as pd_linear_model
from ibex.sklearn import model_selection as pd_model_selection
from ibex.sklearn.model_selection import GridSearchCV as PdGridSearchCV


%pylab inline


Populating the interactive namespace from numpy and matplotlib

In [2]:
digits = datasets.load_digits()
features = ['f%d' % i for i in range(digits['data'].shape[1])]
digits = pd.DataFrame(
    np.c_[digits['data'], digits['target']], 
    columns=features+['digit'])
digits.head()


Out[2]:
f0 f1 f2 f3 f4 f5 f6 f7 f8 f9 ... f55 f56 f57 f58 f59 f60 f61 f62 f63 digit
0 0 0 5 13 9 1 0 0 0 0 ... 0 0 0 6 13 10 0 0 0 0
1 0 0 0 12 13 5 0 0 0 0 ... 0 0 0 0 11 16 10 0 0 1
2 0 0 0 4 15 12 0 0 0 0 ... 0 0 0 0 3 11 16 9 0 2
3 0 0 7 15 13 1 0 0 0 8 ... 0 0 0 7 13 13 9 0 0 3
4 0 0 0 1 11 0 0 0 0 0 ... 0 0 0 0 2 16 4 0 0 4

5 rows × 65 columns

Repeating The Scikit-Learn Grid-Search CV Example

Following the scikit-learn example, we now pipe the PCA step into a logistic regressor.


In [3]:
clf = pd_decomposition.PCA() | pd_linear_model.LogisticRegression()
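
The | operator composes the pandas-aware adapters into a pipeline. For comparison, a plain scikit-learn version of the same pipeline (using the standard estimators rather than the ibex wrappers, and therefore working with numpy arrays rather than pandas structures) would look roughly like this sketch:

from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Plain scikit-learn equivalent of the `|` composition above.
plain_clf = make_pipeline(PCA(), LogisticRegression())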

We now find the optimal fit parameters using grid-search CV.


In [4]:
estimator = PdGridSearchCV(                                                 
    clf,                                                                    
    {'pca__n_components': [20, 40, 64], 'logisticregression__C': np.logspace(-4, 4, 3)},
    n_jobs=multiprocessing.cpu_count())
estimator.fit(digits[features], digits.digit)


Out[4]:
Adapter[GridSearchCV](cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('pca', Adapter[PCA](copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)

It is interesting to look at the best parameters and the best score:


In [5]:
params = estimator.best_estimator_.get_params()
params['pca__n_components'], params['logisticregression__C']


Out[5]:
(40, 1.0)

In [6]:
estimator.best_score_


Out[6]:
0.92264885920979411
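
Beyond the single best score, the full grid of cross-validated results can be inspected as a DataFrame. A minimal sketch, assuming the ibex adapter exposes scikit-learn's standard cv_results_ attribute unchanged:

# cv_results_ holds one row per parameter combination tried by the grid search.
results = pd.DataFrame(estimator.cv_results_)
results[['param_pca__n_components', 'param_logisticregression__C', 'mean_test_score']]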

Finding The Scores' Confidence Intervals

How significant is the improvement in the score over plain logistic regression?

Using the parameters found by the grid-search CV, we run jackknife-style shuffle-split (leave 15% out) iterations: 100 for the (slower) PCA pipeline, and 1000 for plain logistic regression below.


In [7]:
all_scores = pd_model_selection.cross_val_score(
    estimator.best_estimator_,
    digits[features],
    digits.digit,
    cv=pd_model_selection.ShuffleSplit(
        n_splits=100, 
        test_size=0.15),
    n_jobs=-1)
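
The resulting scores can be turned into an explicit confidence interval; a minimal sketch using a simple percentile interval (the 95% level is an assumption, any level could be used):

# 95% percentile interval over the shuffle-split scores.
lower, upper = np.percentile(all_scores, [2.5, 97.5])
print('mean score: %.3f, 95%% interval: [%.3f, %.3f]' % (np.mean(all_scores), lower, upper))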

In [8]:
sns.boxplot(x=all_scores, color='grey', orient='v');
ylabel('classification score (accuracy)')
figtext(
    0, 
    -0.1, 
    'Classification scores for optimized-parameter PCA followed by logistic-regression.');


Using just logistic regression (which is much faster), we do the same.

In [9]:
all_scores = pd_model_selection.cross_val_score(
    pd_linear_model.LogisticRegression(),
    digits[features],
    digits.digit,
    cv=pd_model_selection.ShuffleSplit(
        n_splits=1000, 
        test_size=0.15),
    n_jobs=-1)

In [10]:
sns.boxplot(x=all_scores, color='grey', orient='v');
ylabel('classification score (accuracy)')
figtext(
    0, 
    -0.1, 
    'Classification scores for logistic-regression. The results do not seem significantly ' +
        'worse than those of the optimized-parameter PCA followed by logistic regression.');
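
To make the visual comparison quantitative, the two score arrays could be kept under separate names and their percentile intervals compared directly. A minimal sketch, assuming the first run's scores were stored as pca_lr_scores and the second as lr_scores (hypothetical names, not used above):

# pca_lr_scores and lr_scores are hypothetical names for the two
# cross_val_score results above (here `all_scores` was reused for both).
for name, scores in [('PCA + logistic regression', pca_lr_scores),
                     ('logistic regression', lr_scores)]:
    lower, upper = np.percentile(scores, [2.5, 97.5])
    print('%s: mean %.3f, 95%% interval [%.3f, %.3f]' % (name, np.mean(scores), lower, upper))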