The quintessential iris dataset


In [1]:
from sklearn import datasets
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.model_selection import learning_curve
import matplotlib.pyplot as plt

Getting data

Load the iris dataset from scikit-learn and perform a train/test split


In [2]:
iris = datasets.load_iris()
X, Y = iris.data, iris.target
x_train, x_test, y_train, y_test = train_test_split(X, Y)
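Note that train_test_split shuffles randomly on every call, so each run of the notebook sees a different split. A sketch of a reproducible, stratified split (assuming the sklearn.model_selection API of recent scikit-learn versions; random_state=0 and test_size=0.25 are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
# Fixing random_state makes the split reproducible; stratify keeps the
# three class proportions roughly equal in the train and test sets.
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0,
    stratify=iris.target)
print(x_train.shape, x_test.shape)
```

With 150 samples and test_size=0.25, this yields 112 training and 38 test rows.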

Fitting a model

Let's try k-nearest neighbors (KNN). Fit the model on the training data and evaluate it on the held-out test set


In [4]:
model = KNeighborsClassifier()

In [5]:
model.fit(x_train, y_train)


Out[5]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In [6]:
y_hat = model.predict(x_test)
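The fitted model above uses the default n_neighbors=5, which is worth tuning. A quick sketch (not part of the original notebook) that compares a few candidate k values with 5-fold cross-validation:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
# Mean 5-fold cross-validated accuracy for each candidate k.
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             iris.data, iris.target, cv=5).mean()
          for k in (1, 3, 5, 7, 9)}
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

On a dataset this small the differences between k values are minor, but the same pattern scales to harder problems.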

Validating the model fit


In [26]:
conf_matrix = confusion_matrix(y_test, y_hat, labels=[0,1,2])
plt.imshow(conf_matrix, interpolation="nearest")
accuracy_score(y_test, y_hat)  # ~94.7%, not bad!


Out[26]:
0.94736842105263153
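Accuracy alone hides which classes get confused with each other. A sketch that prints the confusion matrix and a per-class breakdown (it re-creates the split with a fixed random_state, since the notebook's own split is unseeded):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, classification_report

iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)
model = KNeighborsClassifier().fit(x_train, y_train)
y_hat = model.predict(x_test)
# Rows are true classes, columns are predictions; off-diagonal
# entries count misclassifications.
cm = confusion_matrix(y_test, y_hat, labels=[0, 1, 2])
print(cm)
print(classification_report(y_test, y_hat, target_names=iris.target_names))
```

On iris, any confusion typically occurs between versicolor and virginica; setosa is linearly separable from the other two.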

Check whether we can reduce the size of the training set and still get decent accuracy


In [9]:
train_sizes, train_scores, test_scores = learning_curve(model, X, Y, cv=5)

In [30]:
plt.plot(train_sizes, np.mean(train_scores, axis=1),'o-', label="Training score")
plt.plot(train_sizes, np.mean(test_scores, axis=1),'o-', label="Testing score")
plt.legend(loc="lower right")


Out[30]:
<matplotlib.legend.Legend at 0x115132f90>
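The mean curves plotted above hide the fold-to-fold variance. A self-contained sketch that adds shaded one-standard-deviation bands around each curve (the Agg backend and the output filename are assumptions so the script runs headlessly):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no display needed
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import learning_curve
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
train_sizes, train_scores, test_scores = learning_curve(
    KNeighborsClassifier(), iris.data, iris.target, cv=5)
for scores, label in ((train_scores, "Training score"),
                      (test_scores, "Testing score")):
    mean, std = scores.mean(axis=1), scores.std(axis=1)
    plt.plot(train_sizes, mean, "o-", label=label)
    # Shade one standard deviation to show variability across folds.
    plt.fill_between(train_sizes, mean - std, mean + std, alpha=0.2)
plt.xlabel("Training set size")
plt.ylabel("Accuracy")
plt.legend(loc="lower right")
plt.savefig("learning_curve.png")
```

With the defaults, learning_curve evaluates five training-set sizes (linspace from 10% to 100% of the available training data), so each score array has shape (5, n_folds).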