Let's work with the wine dataset from earlier, slightly modified: this version has more instances and different target features.


In [ ]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

In [ ]:
import numpy as np

In [ ]:
df = pd.read_csv("data/wine.csv")

In [ ]:
df.columns

Instead of the wine cultivar, we have the wine color (red or white), as well as a binary is-red indicator and a high-quality indicator (0 or 1)
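
A quick look at these columns confirms the change (the names color and is_red are assumptions based on the description above; check df.columns to be sure):


In [ ]:
# Peek at the new target columns (names assumed from the description above)
df[['color', 'is_red', 'high_quality']].head()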


In [ ]:
df.high_quality.unique()

Let's set up our training and test sets


In [ ]:
train, test = train_test_split(df[['density','sulphates','residual_sugar','high_quality']], train_size=0.75)

We'll use just three columns (dimensions) for classification


In [ ]:
train

In [ ]:
x_train = train.iloc[:, :3]
y_train = train.iloc[:, 3]

In [ ]:
x_test = test.iloc[:, :3]
y_test = test.iloc[:, 3]

Let's start with a k of 1 to predict high quality


In [ ]:
clf = KNeighborsClassifier(n_neighbors=1)

In [ ]:
clf.fit(x_train, y_train)

In [ ]:
preds = clf.predict(x_test)

In [ ]:
# Fraction of test instances where the prediction matches the true label
accuracy = np.where(preds == y_test, 1, 0).sum() / float(len(test))

In [ ]:
print "Accuracy: %3f" % (accuracy,)

Not bad. Let's see what happens as k changes


In [ ]:
results = []
for k in range(1, 51, 2):
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(x_train, y_train)
    preds = clf.predict(x_test)
    accuracy = np.where(preds == y_test, 1, 0).sum() / float(len(test))
    print("Neighbors: %d, Accuracy: %.3f" % (k, accuracy))

    results.append([k, accuracy])

results = pd.DataFrame(results, columns=["k", "accuracy"])

plt.plot(results.k, results.accuracy)
plt.title("Accuracy with Increasing K")
plt.show()

Looks like about 80% is the best we can do. The way the curve plateaus suggests there's not much more to be gained by increasing k
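
To pin down the best k from the sweep, we can query the results frame built above directly:


In [ ]:
# Row of the results frame with the highest accuracy
results.loc[results.accuracy.idxmax()]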

We can also tune this a bit by not weighting every neighbor equally, but instead decreasing each neighbor's weight as its distance increases


In [ ]:
results = []
for k in range(1, 51, 2):
    clf = KNeighborsClassifier(n_neighbors=k, weights='distance')
    clf.fit(x_train, y_train)
    preds = clf.predict(x_test)
    accuracy = np.where(preds == y_test, 1, 0).sum() / float(len(test))
    print("Neighbors: %d, Accuracy: %.3f" % (k, accuracy))

    results.append([k, accuracy])

results = pd.DataFrame(results, columns=["k", "accuracy"])

plt.plot(results.k, results.accuracy)
plt.title("Accuracy with Increasing K")
plt.show()

Weighting by distance actually increases the accuracy of our predictions
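
Since a single train/test split can be noisy, cross-validation gives a more stable estimate. Here's a quick sketch (the choice of k=25 and the three feature columns are assumptions carried over from above, not tuned values):


In [ ]:
from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy for a distance-weighted model
# (k=25 and the feature columns are assumptions, not tuned values)
x = df[['density', 'sulphates', 'residual_sugar']]
y = df['high_quality']
scores = cross_val_score(KNeighborsClassifier(n_neighbors=25, weights='distance'), x, y, cv=5)
print("Mean CV accuracy: %.3f" % scores.mean())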