We will introduce the task of classification and look at a naive way of doing it, using a method called k-nearest neighbors (kNN).
Classification is one of supervised machine learning's most basic tasks.
In [1]:
import numpy as np
from IPython.display import HTML, display
import tabulate
import matplotlib.pyplot as plt
# toy dataset of weather observations labeled rainy or sunny
feature_names = ["Humidity (%)", "Pressure (kPa)"]
data = [[29, 101.7], [60, 98.6], [40, 101.1], [62, 99.9], [39, 103.2], [51, 97.6], [46, 102.1], [55, 100.2]]
labels = ["Sun","Rain","Sun","Rain","Sun","Rain","Sun","Rain"]
# display table
table_labels = np.array(['class']+feature_names).reshape((1, 1+len(feature_names)))
# cast the numeric features to strings so they concatenate cleanly with the label column
table_data = np.concatenate([np.array(labels).reshape(len(data), 1), np.array(data).astype(str)], axis=1)
table_full = np.concatenate([table_labels, table_data], axis=0)
display(HTML(tabulate.tabulate(table_full, tablefmt='html')))
We can plot these points on a scatterplot. In the following, a "+" marks "Rain" and a "−" marks "Sun" (no rain).
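A minimal sketch of such a plot, reusing data, labels, and feature_names from the cell above (the marker styles and sizes are arbitrary choices):

In [ ]:
# plot the toy weather data, one marker style per class
points = np.array(data)
rain = np.array(labels) == "Rain"
plt.figure(figsize=(6, 5))
plt.scatter(points[rain, 0], points[rain, 1], marker='+', s=80, label='Rain')
plt.scatter(points[~rain, 0], points[~rain, 1], marker='_', s=80, label='Sun')
plt.xlabel(feature_names[0])
plt.ylabel(feature_names[1])
plt.legend()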
Classification is the task of predicting the correct label, or category, of an unseen point. With two classes, we divide the feature space into two regions, one for each class. So when we receive a new point, we simply find which side of the partition the point falls on.
We will introduce a simple technique for classification called k-nearest neighbors classification (kNN). Before doing that, we are going to scale up our problem with a slightly more realistic dataset called Iris, which is commonly used to introduce data science tasks.
Iris is a dataset of 150 flower samples from three species of the Iris genus (Iris setosa, Iris versicolor, and Iris virginica). Each sample records its species (the class label), along with four features: sepal length, sepal width, petal length, and petal width.
In the next cell, we load the dataset and shuffle it.
In [2]:
import numpy as np
from sklearn.datasets import load_iris
# load iris and grab our data and labels
iris = load_iris()
labels, data = iris.target, iris.data
num_samples = len(labels) # size of our dataset
num_features = len(iris.feature_names) # number of columns/variables
# shuffle the dataset
shuffle_order = np.random.permutation(num_samples)
data = data[shuffle_order, :]
labels = labels[shuffle_order]
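Note that np.random.permutation gives a different order on every run. If you want the shuffle (and everything downstream) to be reproducible, you can seed NumPy's global generator before permuting; a minimal sketch, where the seed value 0 is an arbitrary choice:

In [ ]:
# seed the global generator so the permutation is the same on every run
np.random.seed(0)  # the seed value here is arbitrary
shuffle_order = np.random.permutation(num_samples)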
Let's view a table showing the first 20 samples.
In [3]:
label_names = np.array([iris.target_names[l] for l in labels])
table_labels = np.array(['class']+iris.feature_names).reshape((1, 1+num_features))
class_names = iris.target_names
# cast the numeric features to strings so they concatenate cleanly with the label column
table_data = np.concatenate([label_names.reshape(num_samples, 1), data.astype(str)], axis=1)[0:20]
# display table
table_full = np.concatenate([table_labels, table_data], axis=0)
display(HTML(tabulate.tabulate(table_full, tablefmt='html')))
For simplicity, we will restrict our attention to just the first two features, sepal length and sepal width. Let's plot the dataset.
In [4]:
# plot the original data
x, y, lab = data[:, 0], data[:, 1], labels
plt.figure(figsize=(8, 6))
plt.scatter(x, y, c=lab)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.title('Iris dataset')
Out[4]:
Suppose we are given a new point whose sepal length (x) and sepal width (y) are the following:
In [5]:
new_x, new_y = 6.5, 3.7
Let's plot it on the graph. What could its class be?
In [6]:
# plot the original data
x, y, lab = data[:, 0], data[:, 1], labels
plt.figure(figsize=(8, 6))
plt.scatter(x, y, c=lab)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.title('Iris dataset')
# put the new point on top
plt.scatter(new_x, new_y, c='grey', edgecolor='k')
plt.annotate('?', (new_x+0.45, new_y+0.25), fontsize=20, horizontalalignment='center', verticalalignment='center')
plt.annotate("", xytext=(new_x+0.4, new_y+0.2), xy=(new_x+0.05, new_y), arrowprops=dict(arrowstyle="->"))
Out[6]:
Our simple approach to predicting the new point's label is to find the point in the dataset that is closest to the new point and copy its label.
In [7]:
# squared Euclidean distance between the new point and each point in our labeled dataset
# (we can skip the square root, since it doesn't change which point is closest)
distances = np.sum((data[:,0:2] - [new_x, new_y])**2, axis=1)
# find the index of the point whose distance is lowest
closest_point = np.argmin(distances)
# take its label
new_label = labels[closest_point]
print('Predicted label: %d'%new_label)
That's it! This is k-nearest neighbors with k = 1. If k > 1, we find the k closest points and take a majority vote among their labels.
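To make the k > 1 case concrete, here is a minimal sketch that sorts the distances from the cell above, takes the k nearest labels, and picks the majority class (k = 5 is an arbitrary choice; ties are broken toward the lower class index by np.argmax):

In [ ]:
k = 5
# indices of the k smallest distances
nearest = np.argsort(distances)[:k]
# count how often each class appears among the k neighbors...
votes = np.bincount(labels[nearest])
# ...and take the most common one
new_label_k = np.argmax(votes)
print('Predicted label with k=%d: %d' % (k, new_label_k))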
We can now plot the newly-labeled point on top of the dataset.
In [8]:
# append the newly labeled point to our plotting arrays
x = np.append(x, new_x)
y = np.append(y, new_y)
lab = np.append(lab, new_label)
# scatter plot as before
plt.figure(figsize=(8, 6))
plt.scatter(x, y, c=lab)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.title('Iris dataset')
plt.annotate("", xytext=(x[closest_point]+0.02, y[closest_point]+0.02), xy=(new_x-0.02, new_y-0.02), arrowprops=dict(arrowstyle="->"))
Out[8]:
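As a sanity check, scikit-learn ships the same algorithm as KNeighborsClassifier. A minimal sketch, fit on the same two features; its prediction should agree with our hand-rolled version above:

In [ ]:
from sklearn.neighbors import KNeighborsClassifier
# fit a 1-nearest-neighbor classifier on the first two features
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(data[:, 0:2], labels)
print(knn.predict([[new_x, new_y]]))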