Model evaluation with pipeline

In this notebook, I use a pipeline to preprocess data, build a model, and evaluate it with k-fold cross-validation. Together, the pipeline and the cross-validation routine perform the following steps:

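  1. Split the raw data into k folds. Select one fold for testing and the remaining k-1 for training (with k=3, one for testing and two for training).
  2. Preprocess the data by fitting the scaler on the training features and applying the same scaling to the test features.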
  3. Train a classifier on the training data.
  4. Apply the classifier to the test data.
  5. Record the accuracy score.
  6. Repeat steps 1-5 so that each fold is used once as the test set.
  7. Calculate the mean score for all the folds.
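
The steps above can be sketched as an explicit loop. This is a minimal illustration rather than the code this notebook runs: the names `folds`, `fold_scores`, and `mean_score` are my own, and I assume stratified 3-fold splitting without shuffling.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
pipeline = make_pipeline(StandardScaler(), SVC(C=1))

# Step 1: split into k=3 folds (stratified, no shuffling: an assumption)
folds = StratifiedKFold(n_splits=3)

fold_scores = []
for train_idx, test_idx in folds.split(X, y):
    # Fresh, unfitted copy so no state leaks between folds
    model = clone(pipeline)
    # Steps 2-3: fit the scaler and the classifier on the training folds only
    model.fit(X[train_idx], y[train_idx])
    # Steps 4-5: scale the test fold with the training statistics, predict, record accuracy
    fold_scores.append(model.score(X[test_idx], y[test_idx]))

# Step 7: mean accuracy across the folds
mean_score = np.mean(fold_scores)
print(fold_scores, mean_score)
```

Because the scaler sits inside the pipeline, its mean and variance are learned from the training folds only, so no information from the test fold leaks into preprocessing.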

In [2]:
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn import preprocessing
from sklearn import model_selection
from sklearn import svm

In [3]:
# load iris data
iris = load_iris()
X = iris.data
y = iris.target

Create classifier pipeline

  1. The pipeline preprocesses the data by scaling each feature to zero mean and unit variance.
  2. The pipeline trains an SVM classifier on the data with C=1. C is the regularization parameter that sets the penalty for margin violations. The higher the C, the less tolerant the model is of misclassified training points.
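
As a quick check of the scaling step, the sketch below applies `StandardScaler` to the iris features on its own and inspects the result. The variable names `col_means` and `col_stds` are illustrative and not part of the notebook.

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = load_iris().data

# Fit the scaler and transform the features in one step
X_scaled = StandardScaler().fit_transform(X)

# Each column should now have mean ~0 and standard deviation ~1
col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)
print(col_means.round(6), col_stds.round(6))
```

Here the scaler is fit on the full dataset only to show its effect. Inside the pipeline, it is fit on the training portion of each split.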

In [4]:
# Create a pipeline that scales the data then trains a support vector classifier
classifier_pipeline = make_pipeline(preprocessing.StandardScaler(), svm.SVC(C=1))

Cross validation

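Now apply the classifier pipeline to the feature and target data using 3-fold cross-validation. When cv is an integer and the estimator is a classifier, cross_val_score uses StratifiedKFold, which keeps the class proportions roughly equal in every fold.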


In [5]:
scores = model_selection.cross_val_score(classifier_pipeline, X, y, cv=3)
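
The splitting strategy can also be spelled out by passing a splitter object as `cv`. The following is a sketch under assumptions: shuffling and `random_state=0` are choices I made for illustration, so the scores will generally differ from those of the `cv=3` call above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
pipeline = make_pipeline(StandardScaler(), SVC(C=1))

# Explicit stratified splitter; shuffle and random_state are illustrative choices
splitter = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

# One accuracy score per fold
explicit_scores = cross_val_score(pipeline, X, y, cv=splitter)
print(explicit_scores)
```

Passing a splitter makes the fold assignment explicit and reproducible, which helps when comparing several models on the same splits.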

Model evaluation

To get a good measure of the model's accuracy, calculate the mean of the three fold scores.


In [6]:
scores


Out[6]:
array([ 0.98039216,  0.90196078,  0.97916667])

In [7]:
scores.mean()


Out[7]:
0.95383986928104569
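
The mean alone hides how much the scores vary between folds, so it may help to report the standard deviation alongside it. The sketch below reuses the three scores printed above; `fold_scores`, `mean_accuracy`, and `std_accuracy` are names I introduce for illustration.

```python
import numpy as np

# The three fold scores shown in Out[6]
fold_scores = np.array([0.98039216, 0.90196078, 0.97916667])

mean_accuracy = fold_scores.mean()
# Population standard deviation across folds (NumPy's default, ddof=0)
std_accuracy = fold_scores.std()

print(f"accuracy: {mean_accuracy:.3f} +/- {std_accuracy:.3f}")
```

With only three folds the spread is a rough indication at best, but it shows that one fold scored noticeably lower than the other two.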