Perceptron Classifier

Overview

This notebook shows how to implement a binary classifier using a perceptron with a sigmoidal unit. The data used for this project are handwritten digits. The notebook only shows the result, through a confusion matrix, for one possible pair of digits; however, it could easily be modified to output the confusion matrix for every possible digit pairing.

Zip Code Data

The zip code data for this notebook can be found at the website of "The Elements of Statistical Learning" (ESL). The data consist of 7291 training observations and 2007 test observations. Each observation is a 16 x 16 grayscale image of a handwritten zip code digit, ranging from zero to nine.

                Zero   One   Two  Three  Four  Five   Six  Seven  Eight  Nine  Total Observations
Training Data   1194  1005   731    658   652   556   664    645    542   644   7291
Test Data        359   264   198    166   200   160   170    147    166   177   2007
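Per-digit counts like those in the table above can be reproduced with np.unique once the files are loaded; in the notebook the label column is train_data[:, 0]. The snippet below is a sketch that uses a small synthetic label vector as a stand-in for that column:

```python
import numpy as np

# stand-in for the label column (column 0) of the training matrix;
# the notebook itself would use train_data[:, 0] here
labels = np.array([0, 1, 1, 2, 0, 9, 2, 2, 0])

# count how many observations carry each digit label
digits, counts = np.unique(labels, return_counts=True)

for digit, count in zip(digits, counts):
    print(digit, count)

print('total:', counts.sum())
```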

Building the Classifier

When building this classifier, Raul Rojas' Neural Networks - A Systematic Introduction was used as a reference. The following steps were taken to build the classifier:

  1. An initial vector of weights, $\vec{\beta}$, is created
  2. The sigmoid function is calculated for each observation, $S\left ( \mathbf{X_{i}}\right )=\frac{1}{1+e^{-\vec{\beta}^{T}\mathbf{X_{i}}}}$, and its output is thresholded: $\hat{y}=\begin{cases}1 & \text{ if } S\left ( \mathbf{X_{i}}\right )> 0.50\\ 0 & \text{ otherwise}\end{cases}$
  3. The sum of squared errors is collected, $E=\frac{1}{2}\sum_{i} \left ( \hat{y}_{i}-y_{i} \right )^{2}$
  4. The gradient of the error is calculated, $\nabla E$
  5. Gradient descent is used to update the weights, $\vec{\beta}_{new}=\vec{\beta}_{old} - \gamma \nabla E$
  6. Steps 2-5 are repeated until the difference between the current and previous errors is insignificant
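The six steps above can be sketched as a small vectorized NumPy implementation. This is an illustrative sketch, not the notebook's code: it keeps the sigmoid output continuous when forming the gradient (thresholding only the final predictions), since a thresholded output has zero derivative, and the toy data, zero initialization, step size, and iteration cap are all arbitrary choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_sketch(X, y, step_size=1.0, tol=1e-6, max_iter=5000):
    # step 1: initial weight vector (zeros here; the notebook uses a mean difference)
    beta = np.zeros(X.shape[1])
    previous_error = np.inf
    for _ in range(max_iter):
        s = sigmoid(X @ beta)                      # step 2: sigmoid output per observation
        error = 0.5 * np.sum((s - y) ** 2)         # step 3: sum of squared errors
        gradient = X.T @ ((s - y) * s * (1 - s))   # step 4: gradient of the error
        if abs(previous_error - error) < tol:      # step 6: stop when the error stops changing
            break
        beta -= step_size * gradient               # step 5: gradient-descent update
        previous_error = error
    return beta

# toy, linearly separable data: the first column is the bias
X = np.array([[1.0, -1.0], [1.0, -0.5], [1.0, 0.5], [1.0, 1.0]])
y = np.array([0, 0, 1, 1])

beta = train_sketch(X, y)
y_hat = (sigmoid(X @ beta) >= 0.5).astype(int)   # threshold for the final labels
```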

Python Code

Imported Libraries

In [1]:
import gzip
import numpy as np
Defined Functions
name: define_binary_data
description: function extracts all observations belonging to two class labels from an N x M matrix and returns an N_new x M matrix
dependencies: none
inputs: data - is an N x M matrix containing labels
label_idx - is the column containing the labels of the data
label_a - is labeled data from the N x M matrix that will be kept and re-labeled as 0
label_b - is labeled data from the N x M matrix that will be kept and re-labeled as 1
outputs: returns an N_new x M matrix composed of the two labels extracted from the original N x M matrix

In [2]:
def define_binary_data(data, label_idx, label_a, label_b):
    
    # extract observations with label_a and label_b from data
    data_label_a = data[data[:,label_idx] == label_a]
    data_label_b = data[data[:,label_idx] == label_b]
    
    # re-label data: label_a = 0 and label_b = 1
    data_label_a[:,label_idx] = 0
    data_label_b[:,label_idx] = 1
    
    # vertically stack matrices and randomly shuffle rows
    binary_data = np.vstack((data_label_a, data_label_b))
    np.random.shuffle(binary_data)
    
    return binary_data
name: add_bias
description: function takes a matrix of binary labeled data and returns that matrix with an added row of ones at index zero
dependencies: none
inputs: data - is an N x M matrix containing binary labeled data
outputs: returns an N+1 x M matrix composed of the original input matrix and an additional row for the bias

In [3]:
def add_bias(data):
    
    # create bias vector
    bias = np.ones((1,data.shape[1]))
    
    # return the bias vertically stacked on the data
    return np.vstack((bias, data))
name: initial_beta
description: function initializes a vector of weights by taking the mean difference of the two labeled data classes
dependencies: none
inputs: data - is an N x M matrix containing only binary labeled data with a bias added
outputs: returns an initial vector of weights (beta)

In [4]:
def initial_beta(data, label_idx):
    
    # extract labels 0 and 1 from data
    data_label_0 = data[data[:,label_idx] == 0][:,1:]
    data_label_1 = data[data[:,label_idx] == 1][:,1:]
    
    # calculate the means of the column vectors
    mean_0 = np.mean(data_label_0, axis=0)[:,np.newaxis]
    mean_1 = np.mean(data_label_1, axis=0)[:,np.newaxis]
    
    # return the difference of the means for labels 0 and 1
    return mean_0 - mean_1
name: prepare_data
description: function takes a matrix of raw data, extracts all observations involving two specified classes, and builds a new matrix of binary labeled data; it then adds a bias to the new matrix, creates an initial vector of weights based on it, and separates the observations from their labels
dependencies: define_binary_data, add_bias, initial_beta
inputs: data - is an N x M matrix containing labels
label_idx - is the column containing the labels of the data
label_a - is labeled data from the N x M matrix that will be kept and re-labeled as 0
label_b - is labeled data from the N x M matrix that will be kept and re-labeled as 1
outputs: returns a matrix of observations (X), a vector of labels (y), and an initial vector of weights (beta)

In [5]:
def prepare_data(train_data, label_idx, label_a, label_b):
    
    # generate a new matrix, from the raw data, based on the specified labels and add a bias row
    new_data = add_bias(define_binary_data(train_data, label_idx, label_a, label_b))
    
    # return beta, X, and y
    return new_data[:,1:], new_data[:,0].astype(int)[:,np.newaxis], initial_beta(new_data, label_idx)
name: train
description: function uses gradient descent to update beta (a vector of coefficients or weights) to minimize the error of the function output
dependencies: none
inputs: X - a matrix of observations
y - a vector of labels
beta - an initial vector of weights
outputs: returns an updated version of beta that minimizes the error of the function

In [6]:
def train(X, y, beta, step_size=0.1, significant_figures=1E-4):
    
    # initialize some variables
    error_flag = True
    previous_error = 0
    
    # while the difference between the current and previous error is significant
    while error_flag:

        # initialize error and error gradient to zeros
        error = 0
        error_gradient_sum = np.zeros(beta.shape)

        # iterate through each observation
        for observation_idx in range(X.shape[0]):

            # get current observation
            x = X[observation_idx][:,np.newaxis]

            # calculate y_hat using the sigmoid function
            y_hat = 1 / (1 + np.exp(np.dot(-beta.T, x)))
            
            # convert y_hat to binary output
            y_hat = np.where(y_hat >= 0.5, 0, 1) # should be: y_hats <= 0.5, 0, 1

            # calculate the error and add it to the sum of errors
            difference = y_hat - y[observation_idx]
            error += 0.5 * (difference)**2

            # calculate the gradient of the error and add it to the sum of gradient errors
            error_gradient = (difference * y_hat * (1 - y_hat)) * x
            error_gradient_sum = np.sum(np.hstack(( error_gradient_sum, error_gradient)), axis=1)[:,np.newaxis]

        # check if the difference between the current and previous error is significant
        if abs(previous_error - error) < significant_figures:
            error_flag = False

        # else update weights (beta)
        else:
            previous_error = error
            beta -= step_size * error_gradient_sum
            
    # return vector of weights
    return beta
name: predict
description: function predicts labels using a binary perceptron classifier
dependencies: none
inputs: X - is an N+1 x M matrix of observations with the bias added to the first row
beta - a vector of weights/coefficients
outputs: returns a vector of predicted labels

In [7]:
def predict(X, beta):
    
    # create an empty array to store y_hats
    y_hats = np.empty((X.shape[0], 1))
    
    # iterate through each observation
    for observation_idx in range(X.shape[0]):

        # get current observation
        x = X[observation_idx][:,np.newaxis]

        # calculate y_hat using the sigmoid function and add y_hat to the array of y_hats
        y_hats[observation_idx] = 1 / (1 + np.exp(np.dot(-beta.T, x)))
        
    # return y_hats converted to binary
    return np.where(y_hats >= 0.5, 0, 1) # should be: y_hats <= 0.5, 0, 1
name: confusion_scores
description: function creates a confusion matrix with the percentage of correctly/incorrectly classified labels
dependencies: none
inputs: y_hat - predicted labels
y - actual labels
outputs: returns confusion matrix

In [8]:
def confusion_scores(y_hat, y):
    
    # initialize a 2 x 2 matrix of zeros
    matrix = np.zeros((2,2))
    
    # increment the cell corresponding to actual (row) and predicted (col)
    for prediction_idx in range(y_hat.shape[0]):
        row = y[prediction_idx]
        col = y_hat[prediction_idx]
        matrix[row,col] += 1
        
    # get quantity of actual zeros and ones
    zeros = y[y[:] == 0].shape[0]
    ones = y[y[:] == 1].shape[0]
    
    # return confusion matrix scores
    return np.divide(matrix, np.array((zeros,ones))[:,np.newaxis]).round(2)

Extract the training data from its file


In [9]:
with gzip.open('../Data/zip.train.gz') as f:
    train_data = np.loadtxt(f)

Define the index of the labels in the training data and the two labels (chosen from the digits 0 through 9) to be used with the perceptron


In [10]:
label_idx = 0
label_a = 2
label_b = 3

Prepare the training data to be used in finding the correct weights


In [11]:
X_train, y_train, beta_initial = prepare_data(train_data, label_idx, label_a, label_b)

Update beta to minimize the error of the perceptron


In [12]:
beta = train(X_train, y_train, beta_initial)

Extract the testing data from its file


In [13]:
with gzip.open('../Data/zip.test.gz') as f:
    test_data = np.loadtxt(f)

Prepare the testing data to predict labels using a perceptron


In [14]:
X_test, y_test, _ = prepare_data(test_data, label_idx, label_a, label_b)

Use the new weight coefficients to classify the test set and print the confusion matrix scores


In [15]:
print(confusion_scores(predict(X_test, beta), y_test))


[[ 0.77  0.23]
 [ 0.01  0.99]]

Thoughts

First, some kind of feature reduction would likely improve the scores, but I'd also like to look at the difference between using a batch update and an inline update. The function train currently uses a batch update. In addition, there is an issue with the functions train and predict, as the logic for classifying based on a probability is the opposite of what it should be. I added comments where this occurs noting what the logic should be.
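For reference, an inline (online) update changes the weights after every observation rather than once per pass over the data. The sketch below is only an illustration of that idea, not a replacement for the notebook's train function: the name train_online, the toy data, the epoch count, and the zero initialization are all assumptions, and it keeps the sigmoid output continuous in the gradient rather than thresholding first.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_online(X, y, step_size=1.0, n_epochs=200, seed=0):
    # inline update: beta changes after every sample,
    # instead of accumulating a gradient sum over the whole pass
    rng = np.random.default_rng(seed)
    beta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in rng.permutation(X.shape[0]):   # visit samples in random order
            s = sigmoid(X[i] @ beta)            # continuous sigmoid output
            gradient = (s - y[i]) * s * (1 - s) * X[i]
            beta -= step_size * gradient        # update immediately
    return beta

# toy, linearly separable data: the first column is the bias
X = np.array([[1.0, -1.0], [1.0, -0.5], [1.0, 0.5], [1.0, 1.0]])
y = np.array([0, 0, 1, 1])

beta = train_online(X, y)
y_hat = (sigmoid(X @ beta) >= 0.5).astype(int)
```

Inline updates typically converge in fewer passes on redundant data but make the error sequence noisier, which is one axis of the batch-versus-inline comparison mentioned above.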