Here is an example of how to split the data of the iris dataset into a training and a test set. Let's re-use the iris dataset from the previous notebook; first we need to repeat some of its setup code:
In [ ]:
# all of this is taken from the notebook '04_iris_clustering.ipynb'
import numpy as np
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
n_samples, n_features = iris.data.shape
print(n_samples)
Next we need to shuffle the order of the samples and the targets to ensure that all classes are well represented on both sides of the split:
In [ ]:
indices = np.arange(n_samples)
indices[:10]
In [ ]:
np.random.RandomState(42).shuffle(indices)
indices[:10]
In [ ]:
X = iris.data[indices]
y = iris.target[indices]
We can now split the data using a 2/3 - 1/3 ratio:
In [ ]:
split = (n_samples * 2) // 3  # integer division so the result can be used as a slice index
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]
X_train.shape
In [ ]:
X_test.shape
In [ ]:
y_train.shape
In [ ]:
y_test.shape
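As a sanity check, we can count how many samples of each class ended up on each side of the split. This is a small addition to the original notebook, using the standard NumPy `np.bincount` function:
In [ ]:
# number of samples per class in the training and test sets;
# all three iris classes should appear in both
print(np.bincount(y_train))
print(np.bincount(y_test))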
We can now train a new linear classifier on the training set only:
In [ ]:
from sklearn.svm import SVC
clf = SVC(kernel='linear').fit(X_train, y_train)
To evaluate its quality we can compute the fraction of correct classifications on the test set:
In [ ]:
np.mean(clf.predict(X_test) == y_test)
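Equivalently, the `score` method that scikit-learn classifiers expose returns the same mean accuracy in a single call:
In [ ]:
# mean accuracy of the fitted classifier on the held-out test set
clf.score(X_test, y_test)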
This shows that the model has a predictive accuracy of 100%, which means that it was able to generalize perfectly from the training set to the test set: it is rarely this easy on real-life datasets, as we will see in the later sections.
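As a side note, scikit-learn ships a helper that performs the shuffle-and-split steps above in one call. Here is a minimal sketch, assuming a version of scikit-learn where the helper lives in `sklearn.model_selection` (older releases placed it elsewhere):
In [ ]:
from sklearn.model_selection import train_test_split

# shuffle and split in one step, holding out 1/3 of the samples
# for testing; the fixed seed makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=1/3, random_state=42)
X_train.shape, X_test.shape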