Tutorial: Creating a Simple Nearest Neighbor Classifier from Scratch

This is based on the K Nearest Neighbors classifier, a useful supervised machine learning classification algorithm. But really, we want to know what happens under the hood, i.e. when we call KNeighborsClassifier() from scikit-learn, what really happens?

The scope of this tutorial is to build a simple classifier from SCRATCH! But there are a few points to consider:

  • To keep the time to a minimum, I will be building a classifier which identifies only ONE neighbor (this will be further explained).
  • KNN finds the 'k' training points closest to a test point in space and classifies it by a majority vote over their labels; in this case there is just 1 neighbor, so the voting step disappears and the training algorithm stays very simple.
  • I will be using the Iris dataset that's built into scikit-learn. Since this dataset is already simplified and has only 4 dimensions (or variables), using the neighbors classifier will be easy. It becomes much harder as the number of dimensions increases.
  • I'll be importing a useful tool (Euclidean distance: Google it!) from SciPy (the SCIentific PYthon library).

1. Importing the dataset from sklearn

The Iris dataset is already included in sklearn, and more details about iris can be found here (https://en.wikipedia.org/wiki/Iris_flower_data_set).


In [1]:
from sklearn import datasets
iris = datasets.load_iris()

In [2]:
X = iris.data
# iris.data contains the features or independent variables.
y = iris.target
# iris.target contains the labels or the dependent variables.
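As an optional sanity check, you can print what was just loaded; the standard Iris data has 150 flowers, 4 measurements per flower, and 3 species:

print(X.shape)             # (150, 4) -> 150 flowers, each with 4 measurements
print(iris.feature_names)  # sepal length/width and petal length/width, in cm
print(iris.target_names)   # ['setosa' 'versicolor' 'virginica']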

2. Doing a train-test split by using 50% of the data as our training set

The train_test_split function, found in the cross_validation (now model_selection) module of sklearn, is a simple but powerful tool to randomly split the data into train and test datasets.


In [3]:
from sklearn.model_selection import train_test_split

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.50)
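One optional note: because no random_state is passed above, the split is different on every run, so the accuracy at the end will vary slightly. If you want a reproducible split, you could pass a seed explicitly (the value 42 below is arbitrary):

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=42)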

If you have used machine learning classifiers in Python before, remember that there are usually 5 steps involved (sketched with scikit-learn right after this list):

  • Select the model
  • Instantiate the model
  • Fit (train) the model
  • Predict the outcome
  • Check the accuracy
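For reference, here is a rough sketch of those five steps using scikit-learn's built-in KNeighborsClassifier, the very class this tutorial re-implements in miniature (the variable names below are just for illustration):

from sklearn.neighbors import KNeighborsClassifier   # 1. Select the model

knn = KNeighborsClassifier()                          # 2. Instantiate the model
knn.fit(X_train, y_train)                             # 3. Fit (train) the model
knn_pred = knn.predict(X_test)                        # 4. Predict the outcome
print(knn.score(X_test, y_test))                      # 5. Check the accuracy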

3. Importing the Euclidean distance and creating a function to call it

This works like the Pythagorean theorem, but it generalizes to any number of dimensions.


In [5]:
from scipy.spatial import distance  # Built-in module that provides distance functions.

                                    # Defining the n-dimensional distance as euc.
def euc(a, b):                      # a and b are lists of numeric features.
    return distance.euclidean(a, b) # Measure and return the distance between 2 points,
                                    # i.e. a training point and a test point.
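For intuition, the Euclidean distance between two points a and b is sqrt((a1-b1)^2 + (a2-b2)^2 + ...). A quick check using the classic 3-4-5 right triangle (purely illustrative, not part of the classifier):

print(euc([0, 0], [3, 4]))         # 5.0 -- the familiar 2-D Pythagorean case
print(euc([5, 1, 7], [5, 1, 7]))   # 0.0 -- identical points are at zero distance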

4. The Real Deal: Building (Coding up) the classifier

The details are explained in comment form line-by-line in the following section.


In [6]:
# First we implement a class. Classes let you structure code in a specific way. (source --> https://learnpythonthehardway.org/book/ex40.html)

class OneNeighborClassifier():                # This 'class' has 2 Methods : Fit and Predict
    
    #Each step is followed by a comment which explains how the classifier is working 
    
    def fit(self, X_train, y_train):          # Takes features and labels as input
        self.X_train = X_train                # Storing the X_train in self.X_train
        self.y_train = y_train                # Storing the y_train in self.y_train
                                              # In other words, the classifier simply memorizes the training values.
        
    def predict(self, X_test):                # Receives features from the testing data and returns predictions
        predictions = []                      # List of predictions, since X_test is a 2D array or a list of lists.
        for row in X_test:                    # Each row contains the features for one testing example
            label = self.closest(row)         # We are calling the function that we are creating in the next block
                                              # to find the closest training point from the test point
            predictions.append(label)         # Add the labels to the predictions list to fill it.
        return predictions                    # Return predictions as the output
    
    def closest(self, row):                   # Create the function closest such that -->
        best_dist = euc(row, self.X_train[0]) # Start with the distance between the test point and the first train point
        best_index = 0                        # Keep track of the index of the train point that is closest
        for i in range(1, len(self.X_train)): # Iterate over the remaining training points
            dist = euc(row, self.X_train[i])
            if dist < best_dist:              # The moment we find a closer one, we update our variables.
                best_dist = dist              # If dist is shorter than best_dist, then its the new best_dist
                best_index = i                # Using the index of best_dist to return label of the closest training pt.
        return self.y_train[best_index]       # Return that label
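If you ever want true k > 1 behavior, one possible extension (just a sketch, not part of this tutorial's classifier) is a helper that keeps the k smallest distances and takes a majority vote over their labels, for example with collections.Counter; the function name closest_k and the default k=3 are my own choices:

from collections import Counter

def closest_k(X_train, y_train, row, k=3):
    # Distance from this test row to every training point.
    dists = [euc(row, train_row) for train_row in X_train]
    # Indices of the k training points with the smallest distances.
    nearest = sorted(range(len(dists)), key=lambda i: dists[i])[:k]
    # Majority vote over the labels of those k neighbors.
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]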

5. Final Steps

The classifier is built to fit into the standard pipeline that we use in scikit-learn, i.e.:

  • Instantiate the classifier
  • Fit the model to train it
  • Predict labels for the test set
  • Check the accuracy of the predicted values against the real values

In [7]:
my_classifier = OneNeighborClassifier()
my_classifier.fit(X_train, y_train)

In [8]:
pred = my_classifier.predict(X_test)

In [9]:
from sklearn.metrics import accuracy_score
print ('Accuracy of the classifier is', accuracy_score(y_test, pred)*100, '%')


Accuracy of the classifier is 94.6666666667 %
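
A final note: since train_test_split was called without a random_state, the split (and therefore the exact accuracy) changes every time the notebook is re-run. As a sanity check, you could compare against scikit-learn's own KNeighborsClassifier with n_neighbors=1, which implements the same one-neighbor idea; on the same split its accuracy should essentially match ours:

from sklearn.neighbors import KNeighborsClassifier

sk_classifier = KNeighborsClassifier(n_neighbors=1)
sk_classifier.fit(X_train, y_train)
sk_pred = sk_classifier.predict(X_test)
print('Accuracy of sklearn 1-NN is', accuracy_score(y_test, sk_pred)*100, '%')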