Here is an example of how to split the data of the iris dataset into a training and a test set. Let's re-use the iris dataset from the previous notebook; first we need to repeat some of its setup code:
In [ ]:
# all of this is taken from the notebook '04_iris_clustering.ipynb'
import numpy as np
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
n_samples, n_features = iris.data.shape
print(n_samples)
Next we need to shuffle the order of the samples and the targets to ensure that all classes are well represented on both sides of the split:
In [ ]:
indices = np.arange(n_samples)
indices[:10]
In [ ]:
np.random.RandomState(42).shuffle(indices)
indices[:10]
In [ ]:
X = iris.data[indices]
y = iris.target[indices]
We can now split the data using a 2/3 - 1/3 ratio:
In [ ]:
split = (n_samples * 2) // 3  # integer division so the result can be used as a slice index
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]
X_train.shape
In [ ]:
X_test.shape
In [ ]:
y_train.shape
In [ ]:
y_test.shape
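As a sanity check, we can count how many samples of each class ended up on each side of the split. This is a small addition to the original notebook, using the standard NumPy `np.bincount` function:
In [ ]:
# number of samples per class in the training and test sets;
# all three iris classes should appear in both
print(np.bincount(y_train))
print(np.bincount(y_test))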
We can now train a new linear classifier on the training set only:
In [ ]:
from sklearn.svm import SVC
clf = SVC(kernel='linear').fit(X_train, y_train)
To evaluate its quality we can compute the fraction of correct classifications on the test set:
In [ ]:
np.mean(clf.predict(X_test) == y_test)
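Equivalently, the `score` method that scikit-learn classifiers expose returns the same mean accuracy in a single call:
In [ ]:
# mean accuracy of the fitted classifier on the held-out test set
clf.score(X_test, y_test)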
This shows that the model has a predictive accuracy of 100%, which means that it was able to generalize perfectly from the training set to the test set: it is rarely this easy on real-life datasets, as we will see in the later sections.
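As a side note, scikit-learn ships a helper that performs the shuffle-and-split steps above in one call. Here is a minimal sketch, assuming a version of scikit-learn where the helper lives in `sklearn.model_selection` (older releases placed it elsewhere):
In [ ]:
from sklearn.model_selection import train_test_split

# shuffle and split in one step, holding out 1/3 of the samples
# for testing; the fixed seed makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=1/3, random_state=42)
X_train.shape, X_test.shape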