This notebook illustrates finding confidence intervals for classification scores on the Digits dataset. It is a version of the scikit-learn example Pipelining: chaining a PCA and a logistic regression.
The main points it illustrates are the use of pandas structures throughout the code, and the ease of creating pipelines with the | operator.
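As a standalone illustration (not part of the original example), here is a minimal sketch of the first point: an ibex-wrapped transformer takes a pandas.DataFrame and, by design, returns a pandas object rather than a bare numpy array. The tiny two-column DataFrame below is made up purely for the illustration.

# Minimal sketch: ibex transformers keep pandas structures.
import pandas as pd
from ibex.sklearn import decomposition as pd_decomposition

# A made-up DataFrame, just to show the behaviour.
df = pd.DataFrame({'a': [1., 2., 3., 4.], 'b': [2., 1., 4., 3.]})
reduced = pd_decomposition.PCA(n_components=1).fit_transform(df)
print(type(reduced))  # expected: a pandas DataFrame, per ibex's design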
First we load the dataset into a pandas.DataFrame.
In [1]:
import multiprocessing
import pandas as pd
import numpy as np
from sklearn import datasets
import seaborn as sns
sns.set_style('whitegrid')
from sklearn.externals import joblib
from ibex.sklearn import decomposition as pd_decomposition
from ibex.sklearn import linear_model as pd_linear_model
from ibex.sklearn import model_selection as pd_model_selection
from ibex.sklearn.model_selection import GridSearchCV as PdGridSearchCV
%pylab inline
In [2]:
digits = datasets.load_digits()
features = ['f%d' % i for i in range(digits['data'].shape[1])]
digits = pd.DataFrame(
    np.c_[digits['data'], digits['target']],
    columns=features + ['digit'])
digits.head()
Out[2]:
Following the scikit-learn example, we now pipe the PCA step into a logistic regression.
In [3]:
clf = pd_decomposition.PCA() | pd_linear_model.LogisticRegression()
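The composed estimator behaves like a regular scikit-learn pipeline, so its hyper-parameters are addressed with the usual step__param naming; the step names used in the grid below (pca and logisticregression) can be checked, if desired, by inspecting the estimator's parameters:

# Optional sanity check: list the tunable parameter names of the pipeline.
sorted(clf.get_params().keys())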
We now find the optimal fit parameters using grid-search CV.
In [4]:
estimator = PdGridSearchCV(
    clf,
    {'pca__n_components': [20, 40, 64], 'logisticregression__C': np.logspace(-4, 4, 3)},
    n_jobs=multiprocessing.cpu_count())
estimator.fit(digits[features], digits.digit)
Out[4]:
It is interesting to look at the best parameters and the best score:
In [5]:
params = estimator.best_estimator_.get_params()
params['pca__n_components'], params['logisticregression__C']
Out[5]:
In [6]:
estimator.best_score_
Out[6]:
How significant is the improvement in the score?
Using the parameters found by the grid-search CV, we perform 100 jackknife-like (leave 15% out) shuffle-split iterations.
In [7]:
all_scores = pd_model_selection.cross_val_score(
    estimator.best_estimator_,
    digits[features],
    digits.digit,
    cv=pd_model_selection.ShuffleSplit(
        n_splits=100,
        test_size=0.15),
    n_jobs=-1)
In [8]:
sns.boxplot(x=all_scores, color='grey', orient='v');
ylabel('classification score (accuracy)')
figtext(
    0,
    -0.1,
    'Classification scores for optimized-parameter PCA followed by logistic regression.');
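The box plot gives a visual sense of the spread; to report an actual confidence interval we can also take percentiles of the scores directly. This is a simple sketch using the all_scores array from the previous cell; the 95% level is an arbitrary choice here.

# Approximate 95% confidence interval for the classification score,
# taken as the 2.5th and 97.5th percentiles of the cross-validation scores.
lo, hi = np.percentile(all_scores, [2.5, 97.5])
print('mean %.3f, 95%% interval [%.3f, %.3f]' % (all_scores.mean(), lo, hi))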
Using just logistic regression (which is much faster), we do the same, this time with 1000 iterations.
In [9]:
all_scores = pd_model_selection.cross_val_score(
    pd_linear_model.LogisticRegression(),
    digits[features],
    digits.digit,
    cv=pd_model_selection.ShuffleSplit(
        n_splits=1000,
        test_size=0.15),
    n_jobs=-1)
In [10]:
sns.boxplot(x=all_scores, color='grey', orient='v');
ylabel('classification score (accuracy)')
figtext(
    0,
    -0.1,
    'Classification scores for logistic regression alone. The results do not seem significantly worse than those of the ' +
    'optimized-parameter PCA followed by logistic regression.');
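To back up the visual comparison with numbers, one could keep the two score arrays in separate variables and compare their summary statistics. The names pca_lr_scores and lr_scores below are hypothetical; the cells above reuse all_scores for both runs.

# Hypothetical sketch: assumes the scores from cells 7 and 9 were kept
# in pca_lr_scores and lr_scores instead of both overwriting all_scores.
summary = pd.DataFrame({
    'pca + logistic regression': pd.Series(pca_lr_scores).describe(),
    'logistic regression': pd.Series(lr_scores).describe(),
})
summary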