Confidence Intervals In The Digits Dataset

This notebook illustrates estimating confidence intervals for classification scores on the digits dataset. It is a version of the scikit-learn example Pipelining: chaining a PCA and a logistic regression.

The main points it shows are the use of pandas structures throughout the code, and the ease of creating pipelines using the | operator.

Loading The Data

First we load the dataset into a pandas.DataFrame.


In [1]:
import multiprocessing

import pandas as pd
import numpy as np
from sklearn import datasets
import seaborn as sns
sns.set_style('whitegrid')

# ibex wraps scikit-learn estimators so that they work with pandas structures.
from ibex.sklearn import decomposition as pd_decomposition
from ibex.sklearn import linear_model as pd_linear_model
from ibex.sklearn import model_selection as pd_model_selection
from ibex.sklearn.model_selection import GridSearchCV as PdGridSearchCV


%pylab inline


Populating the interactive namespace from numpy and matplotlib

In [2]:
digits = datasets.load_digits()
features = ['f%d' % i for i in range(digits['data'].shape[1])]
digits = pd.DataFrame(
    np.c_[digits['data'], digits['target']], 
    columns=features+['digit'])
digits.head()


Out[2]:
f0 f1 f2 f3 f4 f5 f6 f7 f8 f9 ... f55 f56 f57 f58 f59 f60 f61 f62 f63 digit
0 0 0 5 13 9 1 0 0 0 0 ... 0 0 0 6 13 10 0 0 0 0
1 0 0 0 12 13 5 0 0 0 0 ... 0 0 0 0 11 16 10 0 0 1
2 0 0 0 4 15 12 0 0 0 0 ... 0 0 0 0 3 11 16 9 0 2
3 0 0 7 15 13 1 0 0 0 8 ... 0 0 0 7 13 13 9 0 0 3
4 0 0 0 1 11 0 0 0 0 0 ... 0 0 0 0 2 16 4 0 0 4

5 rows × 65 columns

Repeating The Scikit-Learn Grid-Search CV Example

Following the scikit-learn example, we now pipe the PCA step into a logistic regressor.


In [3]:
clf = pd_decomposition.PCA() | pd_linear_model.LogisticRegression()
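
The | operator composes the pandas-aware adapters into a pipeline. For comparison, a plain scikit-learn version of the same pipeline (using the standard estimators rather than the ibex wrappers, and therefore working with numpy arrays rather than pandas structures) would look roughly like this sketch:

from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Plain scikit-learn equivalent of the `|` composition above.
plain_clf = make_pipeline(PCA(), LogisticRegression())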

We now find the optimal fit parameters using grid-search CV.


In [4]:
estimator = PdGridSearchCV(                                                 
    clf,                                                                    
    {'pca__n_components': [20, 40, 64], 'logisticregression__C': np.logspace(-4, 4, 3)},
    n_jobs=multiprocessing.cpu_count())
estimator.fit(digits[features], digits.digit)


Out[4]:
Adapter[GridSearchCV](cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('pca', Adapter[PCA](copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)

It is interesting to look at the best parameters and the best score:


In [5]:
params = estimator.best_estimator_.get_params()
params['pca__n_components'], params['logisticregression__C']


Out[5]:
(40, 1.0)

In [6]:
estimator.best_score_


Out[6]:
0.92264885920979411
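
Beyond the single best score, the full grid of cross-validated results can be inspected as a DataFrame. A minimal sketch, assuming the ibex adapter exposes scikit-learn's standard cv_results_ attribute unchanged:

# cv_results_ holds one row per parameter combination tried by the grid search.
results = pd.DataFrame(estimator.cv_results_)
results[['param_pca__n_components', 'param_logisticregression__C', 'mean_test_score']]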

Finding The Scores' Confidence Intervals

How significant is the improvement in the score over plain logistic regression?

Using the parameters found by the grid-search CV, we run jackknife-style shuffle-split (leave 15% out) iterations: 100 for the (slower) PCA pipeline, and 1000 for plain logistic regression below.


In [7]:
all_scores = pd_model_selection.cross_val_score(
    estimator.best_estimator_,
    digits[features],
    digits.digit,
    cv=pd_model_selection.ShuffleSplit(
        n_splits=100, 
        test_size=0.15),
    n_jobs=-1)
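
The resulting scores can be turned into an explicit confidence interval; a minimal sketch using a simple percentile interval (the 95% level is an assumption, any level could be used):

# 95% percentile interval over the shuffle-split scores.
lower, upper = np.percentile(all_scores, [2.5, 97.5])
print('mean score: %.3f, 95%% interval: [%.3f, %.3f]' % (np.mean(all_scores), lower, upper))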

In [8]:
sns.boxplot(x=all_scores, color='grey', orient='v');
ylabel('classification score (accuracy)')
figtext(
    0, 
    -0.1, 
    'Classification scores for optimized-parameter PCA followed by logistic-regression.');


Using just logistic regression (which is much faster), we do the same.

In [9]:
all_scores = pd_model_selection.cross_val_score(
    pd_linear_model.LogisticRegression(),
    digits[features],
    digits.digit,
    cv=pd_model_selection.ShuffleSplit(
        n_splits=1000, 
        test_size=0.15),
    n_jobs=-1)

In [10]:
sns.boxplot(x=all_scores, color='grey', orient='v');
ylabel('classification score (accuracy)')
figtext(
    0, 
    -0.1, 
    'Classification scores for logistic-regression. The results do not seem significantly ' +
        'worse than those of the optimized-parameter PCA followed by logistic regression.');
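
To make the visual comparison quantitative, the two score arrays could be kept under separate names and their percentile intervals compared directly. A minimal sketch, assuming the first run's scores were stored as pca_lr_scores and the second as lr_scores (hypothetical names, not used above):

# pca_lr_scores and lr_scores are hypothetical names for the two
# cross_val_score results above (here `all_scores` was reused for both).
for name, scores in [('PCA + logistic regression', pca_lr_scores),
                     ('logistic regression', lr_scores)]:
    lower, upper = np.percentile(scores, [2.5, 97.5])
    print('%s: mean %.3f, 95%% interval [%.3f, %.3f]' % (name, np.mean(scores), lower, upper))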