Let's work with the wine dataset from earlier, slightly modified: this version has more instances and different target features.


In [ ]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

In [ ]:
import numpy as np

In [ ]:
df = pd.read_csv("data/wine.csv")

In [ ]:
df.columns

Instead of the wine cultivar, we have the wine color (red or white), as well as a binary is-red indicator and a high-quality indicator (0 or 1)
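
A quick look at these columns confirms the change (the names color and is_red are assumptions based on the description above; check df.columns to be sure):


In [ ]:
# Peek at the new target columns (names assumed from the description above)
df[['color', 'is_red', 'high_quality']].head()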


In [ ]:
df.high_quality.unique()

Let's set up our training and test sets


In [ ]:
train, test = train_test_split(df[['density','sulphates','residual_sugar','high_quality']], train_size=0.75)

We'll use just three columns (dimensions) for classification


In [ ]:
train

In [ ]:
x_train = train.iloc[:, :3]
y_train = train.iloc[:, 3]

In [ ]:
x_test = test.iloc[:, :3]
y_test = test.iloc[:, 3]

Let's start with a k of 1 to predict high quality


In [ ]:
clf = KNeighborsClassifier(n_neighbors=1)

In [ ]:
clf.fit(x_train, y_train)

In [ ]:
preds = clf.predict(x_test)

In [ ]:
# Fraction of test instances where the prediction matches the true label
accuracy = np.where(preds == y_test, 1, 0).sum() / float(len(test))

In [ ]:
print "Accuracy: %3f" % (accuracy,)

Not bad. Let's see what happens as k changes


In [ ]:
results = []
for k in range(1, 51, 2):
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(x_train, y_train)
    preds = clf.predict(x_test)
    accuracy = np.where(preds == y_test, 1, 0).sum() / float(len(test))
    print("Neighbors: %d, Accuracy: %.3f" % (k, accuracy))

    results.append([k, accuracy])

results = pd.DataFrame(results, columns=["k", "accuracy"])

plt.plot(results.k, results.accuracy)
plt.title("Accuracy with Increasing K")
plt.show()

Looks like about 80% is the best we can do. The way the curve plateaus suggests there's not much more to be gained by increasing k
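
To pin down the best k from the sweep, we can query the results frame built above directly:


In [ ]:
# Row of the results frame with the highest accuracy
results.loc[results.accuracy.idxmax()]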

We can also tune this a bit by not weighting every neighbor equally, but instead decreasing each neighbor's weight as its distance increases


In [ ]:
results = []
for k in range(1, 51, 2):
    clf = KNeighborsClassifier(n_neighbors=k, weights='distance')
    clf.fit(x_train, y_train)
    preds = clf.predict(x_test)
    accuracy = np.where(preds == y_test, 1, 0).sum() / float(len(test))
    print("Neighbors: %d, Accuracy: %.3f" % (k, accuracy))

    results.append([k, accuracy])

results = pd.DataFrame(results, columns=["k", "accuracy"])

plt.plot(results.k, results.accuracy)
plt.title("Accuracy with Increasing K")
plt.show()

Weighting by distance actually increases the accuracy of our predictions
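
Since a single train/test split can be noisy, cross-validation gives a more stable estimate. Here's a quick sketch (the choice of k=25 and the three feature columns are assumptions carried over from above, not tuned values):


In [ ]:
from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy for a distance-weighted model
# (k=25 and the feature columns are assumptions, not tuned values)
x = df[['density', 'sulphates', 'residual_sugar']]
y = df['high_quality']
scores = cross_val_score(KNeighborsClassifier(n_neighbors=25, weights='distance'), x, y, cv=5)
print("Mean CV accuracy: %.3f" % scores.mean())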