Python Machine Learning - Code Examples

Bonus Material - A Basic Pipeline and Grid Search Setup

Note that the optional watermark extension is a small IPython notebook plugin that I developed to document the package versions used, which helps make the code examples reproducible. You can safely skip the following line(s).


In [1]:
%load_ext watermark
%watermark -a '' -u -d -v -p numpy,pandas,matplotlib,sklearn


Sebastian Raschka 
Last updated: 01/20/2016 

CPython 3.5.1
IPython 4.0.1

numpy 1.10.1
pandas 0.17.1
matplotlib 1.5.0
scikit-learn 0.17
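
As a side note, this notebook was written against scikit-learn 0.17. In release 0.18 and later, the grid_search and cross_validation modules used below were merged into model_selection, so the imports would change along the following lines (a sketch, assuming scikit-learn >= 0.18; note that the StratifiedKFold API also changed there, taking n_splits and generating folds via a .split(X, y) method):

# equivalent imports for scikit-learn >= 0.18 (not needed for 0.17)
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.model_selection import StratifiedKFold, cross_val_score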

In [2]:
from sklearn.grid_search import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.cross_validation import train_test_split


# load and split data
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

# pipeline setup
cls = SVC(C=10.0, 
          kernel='rbf', 
          gamma=0.1, 
          decision_function_shape='ovr')

kernel_svm = Pipeline([('std', StandardScaler()), 
                       ('svc', cls)])

# gridsearch setup
param_grid = [
  {'svc__C': [1, 10, 100, 1000], 
   'svc__gamma': [0.001, 0.0001], 
   'svc__kernel': ['rbf']},
 ]

gs = GridSearchCV(estimator=kernel_svm, 
                  param_grid=param_grid, 
                  scoring='accuracy', 
                  n_jobs=-1, 
                  cv=5, 
                  verbose=1, 
                  refit=True,
                  pre_dispatch='2*n_jobs')

# run grid search
gs.fit(X_train, y_train)


Fitting 5 folds for each of 8 candidates, totalling 40 fits
[Parallel(n_jobs=-1)]: Done  40 out of  40 | elapsed:    0.2s finished
Out[2]:
GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(steps=[('std', StandardScaler(copy=True, with_mean=True, with_std=True)), ('svc', SVC(C=10.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.1, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))]),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid=[{'svc__kernel': ['rbf'], 'svc__C': [1, 10, 100, 1000], 'svc__gamma': [0.001, 0.0001]}],
       pre_dispatch='2*n_jobs', refit=True, scoring='accuracy', verbose=1)
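
Since we set refit=True above, the GridSearchCV object automatically refits the best model on the whole training set once the search is finished; this refitted pipeline is what gs.predict (used in the next cell) delegates to. It is also directly accessible (a small sketch; the name best_model is just for illustration):

# with refit=True, the best pipeline is refit on the full training set
best_model = gs.best_estimator_  # the refitted StandardScaler + SVC pipeline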

In [3]:
print('Best GS Score %.2f' % gs.best_score_)
print('Best GS Params %s' % gs.best_params_)


# prediction on the training set
y_pred = gs.predict(X_train)
train_acc = (y_train == y_pred).sum()/len(y_train)
print('\nTrain Accuracy: %.2f' % (train_acc))

# evaluation on the test set
y_pred = gs.predict(X_test)
test_acc = (y_test == y_pred).sum()/len(y_test)
print('\nTest Accuracy: %.2f' % (test_acc))


Best GS Score 0.96
Best GS Params {'svc__kernel': 'rbf', 'svc__C': 100, 'svc__gamma': 0.001}

Train Accuracy: 0.97

Test Accuracy: 0.97
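
If you want to inspect the scores of all candidate parameter combinations rather than just the best one, they are available via the grid_scores_ attribute in scikit-learn 0.17 (replaced by the cv_results_ dictionary in 0.18 and later). A minimal sketch for the 0.17 API:

# each entry holds the parameter dict, the mean CV score, and the per-fold scores
for params, mean_score, cv_scores in gs.grid_scores_:
    print('%0.3f (+/-%0.3f) for %r' % (mean_score, cv_scores.std(), params))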

A Note about GridSearchCV's best_score_ attribute

Please note that gs.best_score_ is the average k-fold cross-validation score. That is, if we have a GridSearchCV object with 5-fold cross-validation (like the one above), the best_score_ attribute returns the average score over the 5 folds of the best model. To illustrate this with an example:


In [4]:
from sklearn.cross_validation import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
import numpy as np

np.random.seed(0)
np.set_printoptions(precision=6)
y = [np.random.randint(3) for i in range(25)]
X = (y + np.random.randn(25)).reshape(-1, 1)

cv5_idx = list(StratifiedKFold(y, n_folds=5, shuffle=False, random_state=0))
cross_val_score(LogisticRegression(random_state=123), X, y, cv=cv5_idx)


Out[4]:
array([ 0.6,  0.4,  0.6,  0.2,  0.6])

By executing the code above, we created a simple dataset of random integers to represent our class labels. Next, we fed the indices of the 5 cross-validation folds (cv5_idx) to the cross_val_score function, which returned 5 accuracy scores -- these are the 5 accuracy values for the 5 test folds.

Next, let us use the GridSearchCV object and feed it the same 5 cross-validation sets (via the pre-generated cv5_idx indices):


In [5]:
from sklearn.grid_search import GridSearchCV
gs = GridSearchCV(LogisticRegression(), {}, cv=cv5_idx, verbose=3).fit(X, y)


Fitting 5 folds for each of 1 candidates, totalling 5 fits
[CV]  ................................................................
[CV] ....................................... , score=0.600000 -   0.0s
[CV]  ................................................................
[CV] ....................................... , score=0.400000 -   0.0s
[CV]  ................................................................
[CV] ....................................... , score=0.600000 -   0.0s
[CV]  ................................................................
[CV] ....................................... , score=0.200000 -   0.0s
[CV]  ................................................................
[CV] ....................................... , score=0.600000 -   0.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.0s finished

As we can see, the scores for the 5 folds are exactly the same as the ones from cross_val_score earlier.
Now, the best_score_ attribute of the GridSearchCV object, which becomes available after fitting, returns the average accuracy score of the best model:


In [6]:
gs.best_score_


Out[6]:
0.47999999999999998

As we can see, the result above is consistent with the average score computed by cross_val_score:


In [7]:
cross_val_score(LogisticRegression(), X, y, cv=cv5_idx).mean()


Out[7]:
0.47999999999999998
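
This is simply the arithmetic mean of the 5 fold scores we saw earlier: (0.6 + 0.4 + 0.6 + 0.2 + 0.6) / 5 = 2.4 / 5 = 0.48.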