In [167]:
import numpy as np
import pandas as pd
from collections import Counter
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from IPython import display
K Nearest Neighbours is an algorithm that classifies a new item by measuring its distance to items we already have labels for, and letting the closest ones vote on what it is.
Say we have a bunch of fruit whose types we already know. To classify a new fruit, KNN finds the most similar known fruits using what we can measure - with fruit this would be shape, size, weight, colour, etc. - and picks the label that most of those nearest neighbours share.
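To make that concrete, here is a toy sketch of the nearest-neighbours-vote idea (the fruit data and feature values are made up purely for illustration; the real analysis starts with the Iris dataset below):
In [ ]:
# hypothetical toy data: features are (weight in g, diameter in cm), labels are fruit names
known_fruit = [((150, 7.0), "apple"), ((140, 6.5), "apple"),
               ((120, 5.5), "orange"), ((10, 2.0), "grape")]
new_fruit = (130, 6.0)

# sort the known fruit by squared distance to the new fruit, smallest first
by_dist = sorted(known_fruit,
                 key=lambda fl: sum((a - b) ** 2 for a, b in zip(fl[0], new_fruit)))
# let the 3 closest neighbours vote on the label
Counter(label for _, label in by_dist[:3]).most_common(1)[0][0]  # -> 'apple'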
Anyways, let's start with the Iris dataset, which has 150 measurements of flowers:
In [151]:
iris = sns.load_dataset("iris")
print(f"Iris dataset shape: {iris.shape}")
iris.head()
Out[151]:
Now I'm sampling 5 random flowers from this data set so we can use our fancy new KNN algo to determine what kind of flower they are later on:
In [152]:
test = iris.sample(n=5)
test
Out[152]:
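Note that `sample` picks different rows every time the notebook runs; if you want the same 5 test flowers back on every run, you could pin the seed, something like:
In [ ]:
# hypothetical variant: random_state fixes the seed so the sample is reproducible
test = iris.sample(n=5, random_state=42)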
And here I am dropping the sampled flowers from the iris dataset, to make sure our algo hasn't seen the test flowers:
In [153]:
iris.drop(test.index, inplace=True)
print(iris.shape)
iris.head()
Out[153]:
Now to look at the data visually:
It's pretty clear that the species are different, though there is some overlap at the boundaries:
In [154]:
sns.pairplot(data=iris, hue="species")
Out[154]:
Looking at petal length variation across species:
In [155]:
sns.boxplot(x="species", y="petal_length", data=iris);
Now to actually write the algorithm and figure out which species the flowers in the test dataset belong to.
First, a helper function to calculate the Euclidean distance between two points:
In [156]:
def distance(x, y):
    """Returns the Euclidean distance between two points x and y."""
    assert len(x) == len(y)
    inner = 0
    for a, b in zip(x, y):
        inner += (a - b) ** 2
    return np.sqrt(inner)

distance((1, 5), [5, 5])
Out[156]:
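As an aside, the same distance can be computed with a NumPy one-liner; this sketch is equivalent to the loop above (`distance_np` is just an illustrative name and isn't used later):
In [ ]:
# vectorised Euclidean distance; np.asarray accepts tuples, lists and pandas Series alike
def distance_np(x, y):
    return np.linalg.norm(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))

distance_np((1, 5), [5, 5])  # -> 4.0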
Let's look at the values of one of the flowers in our test data (the third row, `test.iloc[2]`) and see if we can figure out what it is using KNN:
In [204]:
test.iloc[2]
Out[204]:
In [209]:
def knn(item, data, n=3):
    """Takes an item to classify (its 4 feature values) and a labelled dataset.
    Finds the n nearest neighbours by Euclidean distance and returns the most
    common species among them."""
    dist = []
    for i, row in data.iterrows():
        dist.append((i, distance(row[:4], item)))
    nearest = sorted(dist, key=lambda x: x[1])[:n]
    species = [data.loc[i[0]]["species"] for i in nearest]
    return Counter(species).most_common()[0][0]

knn(test.iloc[2][:4], iris)
Out[209]:
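For comparison, the same prediction can be cross-checked with scikit-learn's built-in classifier. This is just a sketch, assuming scikit-learn is installed; it isn't used anywhere else in this notebook:
In [ ]:
from sklearn.neighbors import KNeighborsClassifier

# fit on the 4 numeric feature columns, then predict the species of the same test flower
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(iris.iloc[:, :4], iris["species"])
clf.predict(test.iloc[[2], :4])

Back to our own implementation: run it on all five test flowers and collect the predictions: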
In [211]:
knn_species = []
for i, row in test.iterrows():
    knn_species.append(knn(row[:4], iris))
knn_species
Out[211]:
In [212]:
test["knn"] = knn_species
test
Out[212]:
All right! Our KNN algo got all 5 test species right!
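If the test set were bigger than 5 rows, eyeballing the table wouldn't scale; a quick programmatic check (using the columns already defined above) is:
In [ ]:
# fraction of test flowers where the KNN prediction matches the true species
(test["knn"] == test["species"]).mean()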
I wrote this KNN algo with the iris dataset specifically in mind, but it can be modified to deal with many different datasets.
The iris dataset is very simple, but usually I would normalize the data first, so that every attribute gets a fair chance to affect the distance instead of the feature with the largest scale dominating it.
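For example, here is a minimal min-max normalization sketch (one option among several; z-score scaling would also work) that rescales each feature column to the 0-1 range. The helper name `min_max_normalize` is just illustrative:
In [ ]:
# hypothetical helper: rescale each numeric feature column to the [0, 1] range
def min_max_normalize(df, feature_cols):
    out = df.copy()
    for col in feature_cols:
        lo, hi = df[col].min(), df[col].max()
        out[col] = (df[col] - lo) / (hi - lo)
    return out

features = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
iris_norm = min_max_normalize(iris, features)
iris_norm[features].describe().loc[["min", "max"]]

Note that the test flowers would have to be rescaled with the training set's min and max values, otherwise the distances wouldn't be comparable.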
In [ ]: