In [1]:
%load_ext watermark
In [2]:
%watermark -d -u -v -p scikit-learn,pandas,scipy,matplotlib
[More information](https://github.com/rasbt/watermark) about the `watermark` magic command extension.
The `Pipeline` class in scikit-learn's `pipeline` module offers a convenient way to execute a chain of normalization and pre-processing steps, as well as predictors and classifiers, on a dataset.
Such a pipeline is especially useful in the context of cross-validation, where we are interested in testing and comparing different feature selection techniques, dimensionality reduction approaches, and classifiers. All in all, `Pipeline`s are not only a great time-saver, but they also allow us to write clutter-free code and stay organized in our attempt to find the ideal combination of techniques for solving a pattern classification task.
In the current implementation of `Pipeline`, the different steps in the chain have to be classes that implement `fit` and `transform` methods (or, alternatively, a `fit_transform` method instead of `transform`).
But before we get bogged down in details, let us prepare the Iris dataset in order to have a look at a `Pipeline` in action.
An example of a simple pipeline that consists of three steps: standardizing the dataset, reducing its dimensionality via PCA, and training a naive Bayes classifier:
In [ ]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.naive_bayes import GaussianNB

clf = Pipeline(steps=[
    ('scaler', StandardScaler()),         # step 1: standardize the features
    ('reduce_dim', PCA(n_components=2)),  # step 2: reduce to 2 principal components
    ('classifier', GaussianNB())          # step 3: naive Bayes classifier
])

clf.fit(X_train, y_train)   # fitting on the training dataset
pred = clf.predict(X_test)  # classifying the test dataset
Since I don't want to distract from the main topic of this article, the `Pipeline`s, I want to keep the descriptions of the other steps and concepts as concise as possible. Below is a list of the particular topics with links to other articles for further reading.
- **Iris dataset:** A dataset of 150 flowers from three iris classes with 4 different features.
- **Cross-validation:** A statistical technique to estimate the prediction error rate by splitting the data into training, cross-validation, and test datasets. A prediction model is obtained using the training set, and model parameters are optimized on the cross-validation set, while the test set is held back primarily for empirical error estimation.
- **Standardization:** A data pre-processing step for re-scaling features from different measurements to match the proportions of a standard normal distribution (unit variance, centered at mean=0); see the short sketch after this list.
- **Principal Component Analysis (PCA):** A linear transformation technique that is commonly used to project a dataset (without utilizing class labels) onto a new feature space or feature subspace (for dimensionality reduction), where the new component axes are the directions that maximize the variance/spread of the data.
- **Linear Discriminant Analysis (LDA):** A linear transformation technique (related to Principal Component Analysis) that is commonly used to project a dataset onto a new feature space or feature subspace, where the new component axes maximize the spread between multiple classes; it can also be used for classification.
- **Naive Bayes classifier:** A classifier based on a statistical model (i.e., Bayes' theorem: calculating posterior probabilities based on the prior probability and the so-called likelihood) in the field of pattern classification. Naive Bayes assumes that all attributes are conditionally independent, so that computing the likelihood simplifies to the product of the conditional probabilities of observing individual attributes given a particular class label.
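To make the standardization step a bit more tangible, here is a minimal sketch that recomputes scikit-learn's `StandardScaler` transformation by hand with NumPy; the toy array `X_demo` is purely illustrative:
In [ ]:
import numpy as np
from sklearn.preprocessing import StandardScaler

X_demo = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 30.0]])  # toy data, purely illustrative

# manual z-scores: subtract the column mean, divide by the column std dev
z_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# StandardScaler performs the same computation (population std, ddof=0)
z_scaler = StandardScaler().fit_transform(X_demo)

print(np.allclose(z_manual, z_scaler))  # True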
In [3]:
import pandas as pd
df = pd.read_csv(
    filepath_or_buffer='https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',
    header=None,
    sep=',',
)
df.dropna(how="all", inplace=True)  # to drop the empty line at file-end

feature_dict = {i: label for i, label in zip(
    range(4),
    ('sepal length in cm',
     'sepal width in cm',
     'petal length in cm',
     'petal width in cm',))}
Since it is more convenient to work with numerical values, we will use the `LabelEncoder` from the scikit-learn library to convert the class labels into numbers: 1, 2, and 3.
In [4]:
from sklearn.preprocessing import LabelEncoder
X = df[[0,1,2,3]].values
y = df[4].values
enc = LabelEncoder()
label_encoder = enc.fit(y)
y = label_encoder.transform(y) + 1
label_dict = {1: 'Setosa', 2: 'Versicolor', 3:'Virginica'}
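As a quick sanity check (purely illustrative), we can inspect which original string label the fitted encoder mapped to which integer:
In [ ]:
# classes_ holds the original labels in the order they were encoded (0, 1, 2);
# we add 1 to mirror the +1 shift applied to y above
for i, name in enumerate(label_encoder.classes_):
    print(i + 1, '->', name)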
In [5]:
%matplotlib inline
In [9]:
from matplotlib import pyplot as plt
import numpy as np
import math
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(12, 6))

for ax, cnt in zip(axes.ravel(), range(4)):

    # set bin sizes
    min_b = math.floor(np.min(X[:, cnt]))
    max_b = math.ceil(np.max(X[:, cnt]))
    bins = np.linspace(min_b, max_b, 25)

    # plotting the histograms
    for lab, col in zip(range(1, 4), ('blue', 'red', 'green')):
        ax.hist(X[y == lab, cnt],
                color=col,
                label='class %s' % label_dict[lab],
                bins=bins,
                alpha=0.5,)
    ylims = ax.get_ylim()

    # plot annotation
    leg = ax.legend(loc='upper right', fancybox=True, fontsize=8)
    leg.get_frame().set_alpha(0.5)
    ax.set_ylim([0, max(ylims) + 2])
    ax.set_xlabel(feature_dict[cnt])
    ax.set_title('Iris histogram #%s' % str(cnt + 1))

    # hide axis ticks
    ax.tick_params(axis="both", which="both", bottom="off", top="off",
                   labelbottom="on", left="off", right="off", labelleft="on")

    # remove axis spines
    ax.spines["top"].set_visible(False)
    ax.spines["right"].set_visible(False)
    ax.spines["bottom"].set_visible(False)
    ax.spines["left"].set_visible(False)

axes[0][0].set_ylabel('count')
axes[1][0].set_ylabel('count')

fig.tight_layout()
plt.show()
Above, we used simple 1D histograms as a quick exploratory tool to get a rough idea of how well the classes are separated for each particular feature.
We can see that the petal dimensions (petal length and petal width) are much more useful features than the sepal dimensions, which show a large overlap between the 3 different Iris species.
In this step, we split our dataset into two subsets: A training dataset (60% of the samples) and a test dataset (40% of the samples from the original dataset).
In [10]:
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.40, random_state=12345)
It is important to note that the test set should only be used once for evaluating the prediction error of a classifier after training it on the training dataset; repetitive evaluations of different models using the same test dataset would be prone to overfitting.
A common approach to compare and evaluate different pre-processing techniques and models is cross-validation. Here, we will use a variant called "k-fold cross-validation" in order to compare different preprocessing steps and demonstrate the usefulness of a `Pipeline`.
In k-fold cross-validation, the original training dataset is split into k different subsets (the so-called "folds"), where 1 fold is retained as the test fold and the other k-1 folds are used for training.
E.g., if we chose k=4 (4 folds), the classifier would be trained on 3 of the folds and evaluated on the remaining 4th fold (the test fold). This procedure is repeated 4 times until every fold has been used as the test fold once, so that we can eventually calculate the average error rate of our model from the error rates of the individual iterations, which gives us an idea of how well our model generalizes (a more detailed article about different cross-validation methods is in preparation).
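To make the splitting procedure concrete, here is a minimal sketch that partitions 8 sample indices into k=4 folds with scikit-learn's `KFold` iterator (using the same `sklearn.cross_validation` module that is imported later in this article); the printed indices are purely illustrative:
In [ ]:
from sklearn.cross_validation import KFold

# 8 samples, 4 folds: each iteration holds out 2 samples as the test fold
for train_idx, test_idx in KFold(n=8, n_folds=4):
    print('train:', train_idx, 'test:', test_idx)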
Now, after we successfully prepared our example dataset, let us come back to the pipelines. Pipelines, as we have heard before, are extremely useful in a cross-validation task in order to compare different models.
And we can also write our own classes that can be used in such a pipeline. If we think back to the histograms of our Iris dataset, one interesting exploration could be to select the features "petal length" and "petal width" and compare a classifier trained on this subset against classifiers that were trained on the whole dataset, or classifiers that were trained on datasets after dimensionality reduction (e.g., Linear Discriminant Analysis or Principal Component Analysis).
Below, we will write a simple `ColumnSelector` that returns a subset of certain features (columns) from a NumPy array that represents our dataset.
The requirement for a valid `Pipeline` step is that it responds to the `fit` and `transform` methods; only the last object (the final estimator) is exempt from this and merely has to respond to `fit`.
In [11]:
class ColumnSelector(object):
    """ A feature selector for scikit-learn's Pipeline class that returns
        specified columns from a numpy array.
    """
    def __init__(self, cols):
        self.cols = cols

    def transform(self, X, y=None):
        return X[:, self.cols]

    def fit(self, X, y=None):
        return self
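As a quick sanity check of this class (purely illustrative), selecting the two petal columns from the full Iris feature matrix should yield a 150x2 array:
In [ ]:
col_sel = ColumnSelector(cols=(2, 3))
print(col_sel.fit(X, y).transform(X).shape)  # (150, 2)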
I thought that it was worthwhile to create a central module (https://github.com/rasbt/mlxtend) for such helper functions and classes, so that they can not only be easily imported, but also collected, documented, and optimized in one place. I am happy about any suggestions, ideas, and contributions!
For a simple example, let us do cross-validation to compare different preprocessing steps for training a naive Bayes classifier.
In [12]:
from sklearn.cross_validation import cross_val_score, KFold
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
from sklearn.lda import LDA
from sklearn.decomposition import PCA
clf_all = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('classifier', GaussianNB())
])

clf_petal = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('reduce_dim', ColumnSelector(cols=(2, 3))),
    ('classifier', GaussianNB())
])

clf_pca = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('reduce_dim', PCA(n_components=2)),
    ('classifier', GaussianNB())
])

clf_lda = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('reduce_dim', LDA(n_components=2)),
    ('classifier', GaussianNB())
])

# Constructing the k-fold cross validation iterator (k=5)
cv = KFold(n=X_train.shape[0],  # total number of samples
           n_folds=5,           # number of folds the dataset is divided into
           random_state=12345)

scores = [
    cross_val_score(clf, X_train, y_train, cv=cv, scoring='accuracy')
    for clf in [clf_all, clf_petal, clf_pca, clf_lda]
]
In [13]:
for score, label in zip(scores,
                        ['all attributes',
                         'Petal dimensions (columns 3 & 4)',
                         'PCA dim. red. (n=2)',
                         'LDA dim. red. (n=2)',
                         ]):
    print("Accuracy: {:.2%} (+/- {:.2%}), {:}".format(score.mean(), score.std(), label))
In general, the prediction accuracy is calculated as the number of correct classifications divided by the total number of classifications. In this case, the accuracies displayed above represent the averages over all 5 iterations of the cross-validation procedure.
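As a minimal sketch of that definition (the toy label arrays below are made up for illustration and are not results from the models above):
In [ ]:
import numpy as np

y_true = np.array([1, 2, 3, 1, 2, 3])  # toy ground-truth labels
y_pred = np.array([1, 2, 3, 1, 3, 3])  # toy predictions with one mistake

# accuracy = number of correct classifications / total number of classifications
print('{:.2%}'.format(np.mean(y_true == y_pred)))  # 83.33%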
Here, the accuracies differ only slightly. Interestingly, the prediction accuracies are the same for the whole dataset and for our feature subset that consists of only the petal lengths and widths. This indicates that we lose only minimal information by removing the sepal measurements (but note that the standard deviation of the scores increased). LDA tends to outperform PCA in this case, which is not unexpected: in contrast to PCA, LDA makes use of the class labels and tries to maximize class separability.
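To make this difference visible, here is a minimal sketch (reusing `StandardScaler`, `PCA`, `LDA`, and the plotting setup imported above) that projects the standardized training data onto 2 components via PCA and via LDA and plots the two projections side by side; the plotting details are purely illustrative:
In [ ]:
X_std = StandardScaler().fit_transform(X_train)
X_pca = PCA(n_components=2).fit_transform(X_std)
X_lda = LDA(n_components=2).fit(X_std, y_train).transform(X_std)

fig, axes = plt.subplots(ncols=2, figsize=(10, 4))
for ax, X_proj, title in zip(axes, (X_pca, X_lda), ('PCA projection', 'LDA projection')):
    # color the projected samples by their class label
    for lab, col in zip(range(1, 4), ('blue', 'red', 'green')):
        ax.scatter(X_proj[y_train == lab, 0], X_proj[y_train == lab, 1],
                   color=col, alpha=0.5, label=label_dict[lab])
    ax.set_title(title)
    ax.legend(loc='best', fontsize=8)
plt.show()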
Conveniently, we can reuse the pipeline objects from our cross-validation above to train the classifier on the whole training dataset and evaluate its performance on the separate test dataset.
Since the LDA approach to preprocessing seemed to perform best in cross-validation, let us have a look at how it performs on the "real" test dataset.
In [14]:
from sklearn import metrics
clf_lda.fit(X_train, y_train)
pred_test = clf_lda.predict(X_test)
print('Prediction accuracy for the test dataset')
print('{:.2%}'.format(metrics.accuracy_score(y_test, pred_test)))
print('\nConfusion Matrix of the Naive Bayes classifier')
print(metrics.confusion_matrix(y_test, pred_test))