Model evaluation with pipeline

In this notebook, I use a scikit-learn pipeline to preprocess the data, build a model, and evaluate it with k-fold cross-validation. Together, the pipeline and the cross-validation routine perform the following steps:

1. Split the raw data into k folds (here k = 3). Hold out one fold for testing and use the other two for training.
2. Preprocess the data by scaling the features, with the scaler fitted on the training data.
3. Train a classifier on the training data.
4. Apply the classifier to the test data.
5. Record the accuracy score.
6. Repeat steps 2-5 until each fold has served once as the test set.
7. Calculate the mean score over all the folds.
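
The steps listed above can be written out by hand. The following is a minimal sketch, assuming three stratified folds and the same scaler and SVC settings used later in this notebook. The names `folds`, `manual_scores`, and `manual_mean` are illustrative and do not appear in the original cells.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Step 1: split the data into k=3 folds (stratified, so class ratios are kept)
folds = StratifiedKFold(n_splits=3)

manual_scores = []
for train_idx, test_idx in folds.split(X, y):
    # Step 2: fit the scaler on the training folds only, then apply it to both parts
    scaler = StandardScaler().fit(X[train_idx])
    X_train = scaler.transform(X[train_idx])
    X_test = scaler.transform(X[test_idx])

    # Step 3: train the classifier on the scaled training data
    clf = SVC(C=1).fit(X_train, y[train_idx])

    # Steps 4-5: predict on the held-out fold and record the accuracy
    manual_scores.append(clf.score(X_test, y[test_idx]))

# Final step: average the per-fold accuracies
manual_mean = float(np.mean(manual_scores))
print(manual_scores, manual_mean)
```

Fitting the scaler inside the loop keeps statistics from the test fold out of the preprocessing, which is what the pipeline takes care of automatically.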



In [2]:

    
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn import preprocessing
from sklearn import model_selection
from sklearn import svm



In [3]:

    
# load iris data
iris = load_iris()
X = iris.data
y = iris.target

Create classifier pipeline

The pipeline first standardizes each feature to zero mean and unit variance.
It then trains an SVM classifier with C=1. C is the penalty parameter for margin violations: the higher C is, the less tolerant the model is of misclassified training points.
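
As a quick check of what the scaling step does, the sketch below applies `StandardScaler` to the iris features on its own and inspects the result. It is illustrative only: inside the pipeline, the scaler is fitted on the training folds rather than on the full data set. The names `X_scaled`, `col_means`, and `col_stds` are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# Fit the scaler and transform the features in one call
X_scaled = StandardScaler().fit_transform(X)

# Each column should now have a mean close to 0 and a standard deviation close to 1
col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)
print(np.round(col_means, 6), np.round(col_stds, 6))
```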



In [4]:

    
# Create a pipeline that scales the data then trains a support vector classifier
classifier_pipeline = make_pipeline(preprocessing.StandardScaler(), svm.SVC(C=1))
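
To get a feel for how C influences the result, one option is to rebuild the same pipeline with a few different values and compare the mean cross-validated accuracy. This is only a sketch: the grid `c_values` is a hypothetical choice made for illustration, not something taken from the notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Hypothetical grid of C values, chosen only for illustration
c_values = [0.01, 1, 100]

mean_accuracy = {}
for c in c_values:
    # Same structure as classifier_pipeline, with a different penalty parameter
    pipeline = make_pipeline(StandardScaler(), SVC(C=c))
    mean_accuracy[c] = cross_val_score(pipeline, X, y, cv=3).mean()

for c, acc in mean_accuracy.items():
    print(f"C={c}: mean accuracy {acc:.3f}")
```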

Cross validation

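Now apply the classifier pipeline to the feature and target data using cross-validation. Because the estimator is a classifier and cv is an integer, cross_val_score splits the data with StratifiedKFold (here, three folds).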



In [5]:

    
scores = model_selection.cross_val_score(classifier_pipeline, X, y, cv=3)
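
The integer `cv=3` can also be written as an explicit splitter object, which makes the fold strategy visible and easy to change. The sketch below assumes an unshuffled `StratifiedKFold` with three splits and checks that it yields the same scores as the integer form. The names `implicit_scores`, `explicit_cv`, and `explicit_scores` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
pipeline = make_pipeline(StandardScaler(), SVC(C=1))

# Implicit splitter: an integer cv combined with a classifier
implicit_scores = cross_val_score(pipeline, X, y, cv=3)

# Explicit splitter: three stratified folds, spelled out
explicit_cv = StratifiedKFold(n_splits=3)
explicit_scores = cross_val_score(pipeline, X, y, cv=explicit_cv)

# Compare the two sets of per-fold scores
same = np.allclose(implicit_scores, explicit_scores)
print(same)
```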

Model evaluation

To get a good estimate of the model's accuracy, calculate the mean of the three fold scores.



In [6]:

    
scores

    Out[6]:

array([ 0.98039216,  0.90196078,  0.97916667])



In [7]:

    
scores.mean()

    Out[7]:

0.95383986928104569
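
The mean alone does not show how much the folds disagree. As a small sketch, the three scores printed above can be summarized with their standard deviation as well. The array `fold_scores` simply copies the values from the output above, and the variable names are illustrative.

```python
import numpy as np

# Per-fold accuracies copied from the output above
fold_scores = np.array([0.98039216, 0.90196078, 0.97916667])

# Mean accuracy and the spread across folds
mean_score = fold_scores.mean()
std_score = fold_scores.std()

print(f"accuracy: {mean_score:.3f} +/- {std_score:.3f}")
```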