We will introduce the task of classification and look at a naive way of doing it, using a method called k-nearest neighbors (kNN).
Classification is one of supervised machine learning's most basic tasks.
In [1]:
import numpy as np
from IPython.display import HTML, display
import tabulate
import matplotlib.pyplot as plt
# toy dataset of weather observations labeled rainy or sunny
feature_names = ["Humidity (%)", "Pressure (kPa)"]
data = [[29, 101.7], [60, 98.6], [40, 101.1], [62, 99.9], [39, 103.2], [51, 97.6], [46, 102.1], [55, 100.2]]
labels = ["Sun","Rain","Sun","Rain","Sun","Rain","Sun","Rain"]
# display table
table_labels = np.array(['class']+feature_names).reshape((1, 1+len(feature_names)))
# cast the numeric features to strings so they concatenate cleanly with the label column
table_data = np.concatenate([np.array(labels).reshape(len(data), 1), np.array(data).astype(str)], axis=1)
table_full = np.concatenate([table_labels, table_data], axis=0)
display(HTML(tabulate.tabulate(table_full, tablefmt='html')))
We can plot these points on a scatterplot. In the following, a "+" marks "Rain" and a "−" marks "Sun" (no rain).
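A minimal sketch of such a plot, reusing data, labels, and feature_names from the cell above (the marker styles and sizes are arbitrary choices):

In [ ]:
# plot the toy weather data, one marker style per class
points = np.array(data)
rain = np.array(labels) == "Rain"
plt.figure(figsize=(6, 5))
plt.scatter(points[rain, 0], points[rain, 1], marker='+', s=80, label='Rain')
plt.scatter(points[~rain, 0], points[~rain, 1], marker='_', s=80, label='Sun')
plt.xlabel(feature_names[0])
plt.ylabel(feature_names[1])
plt.legend()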
Classification is the task of predicting the correct label, or category, of an unseen point. With two classes, we divide the feature space into two regions, one for each class. So when we receive a new point, we simply find which side of the partition the point falls on.
We will introduce a simple technique for classification called k-nearest neighbors classification (kNN). Before doing that, we are going to scale up our problem with a slightly more realistic dataset called Iris, which is commonly used to introduce data science tasks.
Iris is a dataset of 150 flower samples from three species of the Iris genus (Iris setosa, Iris versicolor, and Iris virginica). Each sample records its species (the class label), along with four features: sepal length, sepal width, petal length, and petal width.
In the next cell, we load the dataset and shuffle it.
In [2]:
import numpy as np
from sklearn.datasets import load_iris
# load iris and grab our data and labels
iris = load_iris()
labels, data = iris.target, iris.data
num_samples = len(labels) # size of our dataset
num_features = len(iris.feature_names) # number of columns/variables
# shuffle the dataset
shuffle_order = np.random.permutation(num_samples)
data = data[shuffle_order, :]
labels = labels[shuffle_order]
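Note that np.random.permutation gives a different order on every run. If you want the shuffle (and everything downstream) to be reproducible, you can seed NumPy's global generator before permuting; a minimal sketch, where the seed value 0 is an arbitrary choice:

In [ ]:
# seed the global generator so the permutation is the same on every run
np.random.seed(0)  # the seed value here is arbitrary
shuffle_order = np.random.permutation(num_samples)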
Let's view a table showing the first 20 samples.
In [3]:
label_names = np.array([iris.target_names[l] for l in labels])
table_labels = np.array(['class']+iris.feature_names).reshape((1, 1+num_features))
class_names = iris.target_names
# cast the numeric features to strings so they concatenate cleanly with the label column
table_data = np.concatenate([label_names.reshape(num_samples, 1), data.astype(str)], axis=1)[0:20]
# display table
table_full = np.concatenate([table_labels, table_data], axis=0)
display(HTML(tabulate.tabulate(table_full, tablefmt='html')))
For simplicity, we will restrict our attention to just the first two features, sepal length and sepal width. Let's plot the dataset.
In [4]:
# plot the original data
x, y, lab = data[:, 0], data[:, 1], labels
plt.figure(figsize=(8, 6))
plt.scatter(x, y, c=lab)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.title('Iris dataset')
Out[4]:
Suppose we are given a new point whose sepal length (x) and sepal width (y) are the following:
In [5]:
new_x, new_y = 6.5, 3.7
Let's plot it on the graph. What could its class be?
In [6]:
# plot the original data
x, y, lab = data[:, 0], data[:, 1], labels
plt.figure(figsize=(8, 6))
plt.scatter(x, y, c=lab)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.title('Iris dataset')
# put the new point on top
plt.scatter(new_x, new_y, c='grey', edgecolor='k')
plt.annotate('?', (new_x+0.45, new_y+0.25), fontsize=20, horizontalalignment='center', verticalalignment='center')
plt.annotate("", xytext=(new_x+0.4, new_y+0.2), xy=(new_x+0.05, new_y), arrowprops=dict(arrowstyle="->"))
Out[6]:
Our simple approach to predicting the new point's label is to find the point in the dataset that is closest to the new point and copy its label.
In [7]:
# squared Euclidean distance between the new point and each point in our labeled dataset
# (we can skip the square root, since it doesn't change which point is closest)
distances = np.sum((data[:,0:2] - [new_x, new_y])**2, axis=1)
# find the index of the point whose distance is lowest
closest_point = np.argmin(distances)
# take its label
new_label = labels[closest_point]
print('Predicted label: %d'%new_label)
That's it! This is k-nearest neighbors with k = 1. If k > 1, we find the k closest points and take a majority vote among their labels.
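To make the k > 1 case concrete, here is a minimal sketch that sorts the distances from the cell above, takes the k nearest labels, and picks the majority class (k = 5 is an arbitrary choice; ties are broken toward the lower class index by np.argmax):

In [ ]:
k = 5
# indices of the k smallest distances
nearest = np.argsort(distances)[:k]
# count how often each class appears among the k neighbors...
votes = np.bincount(labels[nearest])
# ...and take the most common one
new_label_k = np.argmax(votes)
print('Predicted label with k=%d: %d' % (k, new_label_k))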
We can now plot the newly-labeled point on top of the dataset.
In [8]:
# append the newly labeled point to our plotting arrays
x = np.append(x, new_x)
y = np.append(y, new_y)
lab = np.append(lab, new_label)
# scatter plot as before
plt.figure(figsize=(8, 6))
plt.scatter(x, y, c=lab)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.title('Iris dataset')
plt.annotate("", xytext=(x[closest_point]+0.02, y[closest_point]+0.02), xy=(new_x-0.02, new_y-0.02), arrowprops=dict(arrowstyle="->"))
Out[8]:
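As a sanity check, scikit-learn ships the same algorithm as KNeighborsClassifier. A minimal sketch, fit on the same two features; its prediction should agree with our hand-rolled version above:

In [ ]:
from sklearn.neighbors import KNeighborsClassifier
# fit a 1-nearest-neighbor classifier on the first two features
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(data[:, 0:2], labels)
print(knn.predict([[new_x, new_y]]))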