This notebook contains an excerpt from the book Machine Learning for OpenCV by Michael Beyeler. The code is released under the MIT license, and is available on GitHub.

Note that this excerpt contains only the raw code - the book is rich with additional explanations and illustrations. If you find this content useful, please consider supporting the work by buying the book!

Chaining Algorithms Together to Form a Pipeline

Most machine learning problems we have discussed so far consist of at least a preprocessing step and a classification step. The more complicated the problem, the longer this processing chain might get. One convenient way to glue multiple processing steps together, and even to use them in a grid search, is scikit-learn's Pipeline class.

Implementing pipelines in scikit-learn

The Pipeline class itself has a fit, a predict, and a score method, which behave just like those of any other estimator in scikit-learn. The most common use case of the Pipeline class is to chain different preprocessing steps together with a supervised model, such as a classifier.

Let's return to the breast cancer dataset from Chapter 5, Using Decision Trees to Make a Medical Diagnosis. Using scikit-learn, we import the dataset and split it into training and test sets:


In [1]:
from sklearn.datasets import load_breast_cancer
import numpy as np
cancer = load_breast_cancer()
X = cancer.data.astype(np.float32)
y = cancer.target

In [2]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=37
)

Instead of the $k$-NN algorithm, we could fit a support vector machine (SVM) to the data:


In [3]:
from sklearn.svm import SVC
svm = SVC()
svm.fit(X_train, y_train)


Out[3]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

Without straining our brains too hard, this algorithm achieves an accuracy score of 65%:


In [4]:
svm.score(X_test, y_test)


Out[4]:
0.65034965034965031

Now, if we wanted to run the algorithm again with some preprocessing step (for example, by scaling the data first with MinMaxScaler), we would have to do the preprocessing by hand and then feed the preprocessed data into the classifier's fit method.
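
For illustration, the manual version might look something like this (a sketch; the intermediate variable names are our own):

from sklearn.preprocessing import MinMaxScaler
# Fit the scaler on the training data only, then apply it to both sets
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Feed the preprocessed data into the classifier's fit method by hand
svm = SVC()
svm.fit(X_train_scaled, y_train)
svm.score(X_test_scaled, y_test)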

An alternative is to use a pipeline object. Here, we want to specify a list of processing steps, where each step is a tuple containing a name (any string of our choosing) and an instance of an estimator:


In [5]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
pipe = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC())])

Here, we created two steps: the first, called "scaler", is an instance of MinMaxScaler, and the second, called "svm", is an instance of SVC. Now we can fit the pipeline like any other scikit-learn estimator:


In [6]:
pipe.fit(X_train, y_train)


Out[6]:
Pipeline(steps=[('scaler', MinMaxScaler(copy=True, feature_range=(0, 1))), ('svm', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])

Here, the fit method first calls fit on the first step (the scaler), then it transforms the training data using the scaler, and finally it fits the SVM with the scaled data.
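
Once the pipeline is fitted, the individual steps remain accessible through its named_steps attribute, under the names we assigned. A quick illustration (the attributes shown are standard fitted attributes of MinMaxScaler and SVC):

# Look up fitted steps by the names we gave them
pipe.named_steps['scaler'].data_min_   # per-feature minima learned by the scaler
pipe.named_steps['svm'].support_.size  # number of support vectors of the fitted SVM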

And voilà! When we score the classifier on the test data, we see a drastic improvement in performance:


In [7]:
pipe.score(X_test, y_test)


Out[7]:
0.95104895104895104

Calling the score method on the pipeline first transforms the test data using the scaler and then calls the score method on the SVM using the scaled test data. And scikit-learn did all this with only four lines of code!
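
The same chaining applies to predict: the test data is scaled first and only then handed to the SVM. A quick sketch to confirm the equivalence:

from sklearn.metrics import accuracy_score
# predict routes the test data through the scaler before the SVM
y_pred = pipe.predict(X_test)
accuracy_score(y_test, y_pred)  # matches pipe.score(X_test, y_test)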

The main benefit of using the pipeline, however, is that we can now use this single estimator in cross_val_score or GridSearchCV.
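
For example, cross-validating the entire pipeline takes a single call, and within each fold both steps are fit on the training portion only (a sketch; the number of folds is our choice):

from sklearn.model_selection import cross_val_score
# The scaler and the SVM are refit from scratch in every fold
scores = cross_val_score(pipe, X_train, y_train, cv=5)
scores.mean()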

Using pipelines in grid searches

Using a pipeline in a grid search works the same way as using any other estimator.

We define a parameter grid to search over and construct a GridSearchCV from the pipeline and the parameter grid. When specifying the parameter grid, there is, however, a slight change. We need to specify for each parameter which step of the pipeline it belongs to. Both parameters that we want to adjust, C and gamma, are parameters of SVC, the second step. In the preceding section, we gave this step the name "svm". The syntax to define a parameter grid for a pipeline is to specify for each parameter the step name, followed by __ (a double underscore), followed by the parameter name.

Hence, we would construct the parameter grid as follows:


In [8]:
param_grid = {'svm__C': [0.001, 0.01, 0.1, 1, 10, 100],
              'svm__gamma': [0.001, 0.01, 0.1, 1, 10, 100]}

With this parameter grid, we can use GridSearchCV as usual:


In [9]:
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(pipe, param_grid=param_grid, cv=10)
grid.fit(X_train, y_train)


Out[9]:
GridSearchCV(cv=10, error_score='raise',
       estimator=Pipeline(steps=[('scaler', MinMaxScaler(copy=True, feature_range=(0, 1))), ('svm', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'svm__C': [0.001, 0.01, 0.1, 1, 10, 100], 'svm__gamma': [0.001, 0.01, 0.1, 1, 10, 100]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

The best score in the grid is stored in best_score_:


In [10]:
grid.best_score_


Out[10]:
0.97652582159624413

Similarly, the best parameters are stored in best_params_:


In [11]:
grid.best_params_


Out[11]:
{'svm__C': 1, 'svm__gamma': 1}
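
Since GridSearchCV refits the estimator on the full training set by default (refit=True), the winning pipeline is also available as best_estimator_. An illustration:

# The pipeline refit with the best parameters, ready for prediction
best_pipe = grid.best_estimator_
best_pipe.named_steps['svm'].C  # 1, the best value found above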

But recall that the cross-validation score might be overly optimistic. In order to know the true performance of the classifier, we need to score it on the test set:


In [12]:
grid.score(X_test, y_test)


Out[12]:
0.965034965034965

In contrast to the grid search we did before, now for each split in the cross-validation, MinMaxScaler is refit with only the training splits, and no information is leaked from the test split into the parameter search.
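
To make the contrast concrete, the leaky variant would scale the data once, up front, so the scaler has already seen every fold's validation data before the search starts (a sketch of what not to do; the grid values are abbreviated):

# Leaky: the scaler sees all of X_train, including future validation folds
X_leaky = MinMaxScaler().fit_transform(X_train)
grid_leaky = GridSearchCV(SVC(), param_grid={'C': [0.1, 1, 10]}, cv=10)
grid_leaky.fit(X_leaky, y_train)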

This makes it easy to build a pipeline to chain together a whole variety of steps!

How would you mix and match different estimators in a single pipeline? Turn to page 330 to find the answer.