Most machine learning problems we have discussed so far consist of at least a preprocessing step and a classification step. The more complicated the problem, the longer this processing chain might get. One convenient way to glue multiple processing steps together and even use them in grid search is by using the Pipeline class from scikit-learn.
The Pipeline class itself has a fit, a predict, and a score method, which behave just like those of any other estimator in scikit-learn. The most common use case of the Pipeline class is to chain different preprocessing steps together with a supervised model such as a classifier.
Let's return to the breast cancer dataset from Chapter 5, Using Decision Trees to Make a Medical Diagnosis. Using scikit-learn, we import the dataset and split it into training and test sets:
In [1]:
from sklearn.datasets import load_breast_cancer
import numpy as np
cancer = load_breast_cancer()
X = cancer.data.astype(np.float32)
y = cancer.target
In [2]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=37
)
Instead of the $k$-NN algorithm, we could fit a support vector machine (SVM) to the data:
In [3]:
from sklearn.svm import SVC
svm = SVC()
svm.fit(X_train, y_train)
Out[3]:
Without straining our brains too hard, this algorithm achieves an accuracy score of 65%:
In [4]:
svm.score(X_test, y_test)
Out[4]:
Now, if we wanted to run the algorithm again using some preprocessing step (for example, by scaling the data first with MinMaxScaler), we would do the preprocessing step by hand and then feed the preprocessed data into the classifier's fit method.
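A minimal sketch of this by-hand approach, assuming the same dataset and split as above. The names scaler, X_train_scaled, X_test_scaled, svm_scaled, and manual_score are illustrative and do not appear in the original text:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Same data and split as in the cells above
cancer = load_breast_cancer()
X = cancer.data.astype(np.float32)
y = cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=37)

# Learn the per-feature minimum and maximum from the training data only
scaler = MinMaxScaler()
scaler.fit(X_train)

# Apply the same scaling to both the training and the test data
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Fit and score the classifier on the scaled data
svm_scaled = SVC()
svm_scaled.fit(X_train_scaled, y_train)
manual_score = svm_scaled.score(X_test_scaled, y_test)
print(manual_score)
```

Every extra preprocessing step adds another fit/transform pair to keep in sync between training and test data, which is the bookkeeping a pipeline is meant to take over.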
An alternative is to use a pipeline object. Here, we want to specify a list of processing steps, where each step is a tuple containing a name (any string of our choosing) and an instance of an estimator:
In [5]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
pipe = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC())])
Here, we created two steps: the first, called "scaler", is an instance of MinMaxScaler, and the second, called "svm", is an instance of SVC. Now we can fit the pipeline like any other scikit-learn estimator:
In [6]:
pipe.fit(X_train, y_train)
Out[6]:
Here, the fit method first calls fit on the first step (the scaler), then it transforms the training data using the scaler, and finally it fits the SVM with the scaled data.
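To make this sequence concrete, the following sketch (an illustration, not code from the original) fits the pipeline and then repeats the same three steps by hand. The names manual_scaler, manual_svm, manual_pred, and same_predictions are made up for the example. Under these assumptions, both routes should produce the same predictions on the test set:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

cancer = load_breast_cancer()
X = cancer.data.astype(np.float32)
y = cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=37)

# Route 1: let the pipeline handle scaling and fitting
pipe = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC())])
pipe.fit(X_train, y_train)

# Route 2: the same steps, spelled out by hand
manual_scaler = MinMaxScaler().fit(X_train)          # 1. fit the scaler
X_train_scaled = manual_scaler.transform(X_train)    # 2. transform the training data
manual_svm = SVC().fit(X_train_scaled, y_train)      # 3. fit the SVM on scaled data
manual_pred = manual_svm.predict(manual_scaler.transform(X_test))

# Compare the two routes
same_predictions = np.array_equal(pipe.predict(X_test), manual_pred)
print(same_predictions)
```

The fitted steps remain accessible through pipe.named_steps, for example pipe.named_steps["scaler"], which is handy for inspecting what each step learned.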
And voila! When we score the classifier on the test data, we see a drastic improvement in performance:
In [7]:
pipe.score(X_test, y_test)
Out[7]:
Calling the score method on the pipeline first transforms the test data using the scaler and then calls the score method on the SVM using the scaled test data. And scikit-learn did all this with only four lines of code!
The main benefit of using the pipeline, however, is that we can now use this single estimator in cross_val_score or GridSearchCV.
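As a small illustration of the first option, the pipeline can be passed to cross_val_score like any other estimator. This is a sketch rather than code from the original. The choice of five folds and the names scores and mean_score are assumptions made for the example:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

cancer = load_breast_cancer()
X = cancer.data.astype(np.float32)
y = cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=37)

pipe = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC())])

# The whole pipeline (scaler + SVM) is cloned and refit on each training fold
scores = cross_val_score(pipe, X_train, y_train, cv=5)
mean_score = scores.mean()
print(scores, mean_score)
```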
Using a pipeline in a grid search works the same way as using any other estimator. We define a parameter grid to search over and construct a GridSearchCV from the pipeline and the parameter grid. When specifying the parameter grid, there is, however, a slight change. We need to specify for each parameter which step of the pipeline it belongs to. Both parameters that we want to adjust, C and gamma, are parameters of SVC, the second step. In the preceding section, we gave this step the name "svm". The syntax to define a parameter grid for a pipeline is to specify for each parameter the step name, followed by __ (a double underscore), followed by the parameter name.
Hence, we would construct the parameter grid as follows:
In [8]:
param_grid = {'svm__C': [0.001, 0.01, 0.1, 1, 10, 100],
              'svm__gamma': [0.001, 0.01, 0.1, 1, 10, 100]}
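If the double-underscore names are ever in doubt, one way to check them is to list the parameters the pipeline exposes through get_params. The sketch below is an illustration only. The variable names param_names and svm_param_names are assumptions, and the values passed to set_params are arbitrary:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

pipe = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC())])

# All tunable parameters, including the nested "<step>__<parameter>" names
param_names = sorted(pipe.get_params().keys())

# Keep only the parameters that belong to the step named "svm"
svm_param_names = [name for name in param_names if name.startswith("svm__")]
print(svm_param_names)

# The same naming scheme works for setting parameters directly
pipe.set_params(svm__C=10, svm__gamma=0.1)
```

Any key used in a parameter grid for this pipeline should appear in that list.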
With this parameter grid, we can use GridSearchCV as usual:
In [9]:
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(pipe, param_grid=param_grid, cv=10)
grid.fit(X_train, y_train)
Out[9]:
The best score in the grid is stored in best_score_:
In [10]:
grid.best_score_
Out[10]:
Similarly, the best parameters are stored in best_params_:
In [11]:
grid.best_params_
Out[11]:
But recall that the cross-validation score might be overly optimistic. In order to know the true performance of the classifier, we need to score it on the test set:
In [12]:
grid.score(X_test, y_test)
Out[12]:
In contrast to the grid search we did before, now for each split in the cross-validation, MinMaxScaler is refit with only the training splits, and no information is leaked from the test split into the parameter search.
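The per-split refitting can be sketched by writing the cross-validation loop out by hand. This is an illustration under stated assumptions, not the internals of GridSearchCV. It uses five unshuffled stratified folds and default SVC parameters, and the names cv, fold_pipe, manual_scores, and cv_scores are invented for the example:

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import (StratifiedKFold, cross_val_score,
                                     train_test_split)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

cancer = load_breast_cancer()
X = cancer.data.astype(np.float32)
y = cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=37)

pipe = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC())])
cv = StratifiedKFold(n_splits=5)

manual_scores = []
for train_idx, val_idx in cv.split(X_train, y_train):
    # A fresh, unfitted copy of the pipeline for this split
    fold_pipe = clone(pipe)
    # The scaler inside the pipeline only ever sees the training part of the split
    fold_pipe.fit(X_train[train_idx], y_train[train_idx])
    # The held-out part is transformed with that scaler, then scored
    manual_scores.append(fold_pipe.score(X_train[val_idx], y_train[val_idx]))

# The built-in helper with the same splitter, for comparison
cv_scores = cross_val_score(pipe, X_train, y_train, cv=cv)
print(manual_scores, cv_scores)
```

Scaling X_train once before the loop would instead let the minimum and maximum of the held-out samples influence the scaler, which is the leak the pipeline avoids.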
This makes it easy to build a pipeline to chain together a whole variety of steps!
How would you mix and match different estimators in a single pipeline? Turn to page 330 to find the answer.