This notebook shows how to implement a binary classifier using a perceptron with a sigmoidal unit. The data used for this project are handwritten digits. The notebook only reports the result, as a confusion matrix, for one possible pair of digits; however, it could easily be modified to output the confusion matrices for all possible digit pairs.
The zip code data for this notebook can be found at the website of "The Elements of Statistical Learning" (ESL). The data consists of 7291 training observations and 2007 test observations. Each observation is a 16 x 16 grayscale image of a handwritten zip code digit, ranging from zero to nine.
| | Zero | One | Two | Three | Four | Five | Six | Seven | Eight | Nine | Total Observations |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Training Data | 1194 | 1005 | 731 | 658 | 652 | 556 | 664 | 645 | 542 | 644 | 7291 |
| Test Data | 359 | 264 | 198 | 166 | 200 | 160 | 170 | 147 | 166 | 177 | 2007 |
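Each row of the raw data therefore has 257 entries: the digit label in column 0 followed by the 256 grayscale pixel values. As a quick sanity check (a minimal sketch that assumes the training data has already been loaded as train_data, as in the cells below), a single observation can be reshaped back into its image:

label = train_data[0, 0]                    # the digit label is stored in column 0
image = train_data[0, 1:].reshape(16, 16)   # the remaining 256 values form the 16 x 16 image
print(image.shape)                          # (16, 16)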
When building this classifier, Raul Rojas' *Neural Networks - A Systematic Introduction* was used as a reference. The following steps were taken to build the classifier:
In [1]:
import gzip
import numpy as np
| name: | define_binary_data |
| description: | function extracts all rows belonging to two class labels from an N x M matrix and returns an N_new x M matrix containing only those rows |
| dependencies: | none |
| inputs: | data - an N x M matrix containing labels |
| label_idx - the column containing the labels of the data | |
| label_a - the label whose rows will be kept and re-labeled as 0 | |
| label_b - the label whose rows will be kept and re-labeled as 1 | |
| outputs: | returns an N_new x M matrix composed of the rows for the two labels extracted from the original N x M matrix |
In [2]:
def define_binary_data(data, label_idx, label_a, label_b):
    # extract the rows labeled label_a and label_b from the data
    data_label_a = data[data[:,label_idx] == label_a]
    data_label_b = data[data[:,label_idx] == label_b]
    # re-label the data: label_a becomes 0 and label_b becomes 1
    data_label_a[:,label_idx] = 0
    data_label_b[:,label_idx] = 1
    # vertically stack the matrices and randomly shuffle the rows
    binary_data = np.vstack((data_label_a, data_label_b))
    np.random.shuffle(binary_data)
    return binary_data
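A small, hypothetical usage example: given a toy matrix whose labels live in column 0, only the rows for the two requested labels survive, re-labeled as 0 and 1.

toy = np.array([[2., 0.5, 0.1],
                [7., 0.3, 0.9],
                [3., 0.2, 0.8],
                [2., 0.7, 0.4]])
binary = define_binary_data(toy, label_idx=0, label_a=2, label_b=3)
print(binary.shape)            # (3, 3): the row labeled 7 has been dropped
print(np.unique(binary[:,0]))  # [ 0.  1.]: labels 2 and 3 re-mapped to 0 and 1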
| name: | add_bias |
| description: | function takes a matrix of binary labeled data and returns that matrix with an added row of ones at index zero |
| dependencies: | none |
| inputs: | data - an N x M matrix containing binary labeled data |
| outputs: | returns an (N+1) x M matrix composed of the original input matrix and an additional row for the bias |
In [3]:
def add_bias(data):
    # create a bias vector of ones
    bias = np.ones((1, data.shape[1]))
    # return the bias vertically stacked on top of the data
    return np.vstack((bias, data))
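Continuing the hypothetical toy example from above, add_bias simply stacks a row of ones on top of the matrix:

with_bias = add_bias(binary)
print(with_bias.shape)  # (4, 3): one bias row added to the three observations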
| name: | initial_beta |
| description: | function initializes a vector of weights by taking the difference of the means of the two labeled data classes |
| dependencies: | none |
| inputs: | data - an N x M matrix containing only binary labeled data with a bias added |
| outputs: | returns an initial vector of weights (beta) |
In [4]:
def initial_beta(data, label_idx):
    # extract the rows labeled 0 and 1 from the data
    data_label_0 = data[data[:,label_idx] == 0][:,1:]
    data_label_1 = data[data[:,label_idx] == 1][:,1:]
    # calculate the means of the column vectors
    mean_0 = np.mean(data_label_0, axis=0)[:,np.newaxis]
    mean_1 = np.mean(data_label_1, axis=0)[:,np.newaxis]
    # return the difference of the means for labels 0 and 1
    return mean_0 - mean_1
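The idea behind this initialization is that beta = mean_0 - mean_1 points from the class-1 mean toward the class-0 mean, so the dot product of beta with an observation tends to be positive for class-0-like observations and negative for class-1-like ones. Continuing the hypothetical toy example (the name beta0 is just illustrative):

beta0 = initial_beta(with_bias, label_idx=0)
print(beta0.shape)  # (2, 1): one weight per feature column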
| name: | prepare_data |
| description: | function takes a matrix of raw data, extracts all the data for two specified classes, and builds a new matrix of binary labeled data; it then adds a bias to the new matrix, creates an initial vector of weights from it, and separates the observations from their labels |
| dependencies: | define_binary_data, add_bias, initial_beta |
| inputs: | data - an N x M matrix containing labels |
| label_idx - the column containing the labels of the data | |
| label_a - the label whose rows will be kept and re-labeled as 0 | |
| label_b - the label whose rows will be kept and re-labeled as 1 | |
| outputs: | returns a matrix of observations (X), a vector of labels (y), and an initial vector of weights (beta) |
In [5]:
def prepare_data(train_data, label_idx, label_a, label_b):
    # generate a new matrix, from the raw data, based on the specified labels and add a bias row
    new_data = add_bias(define_binary_data(train_data, label_idx, label_a, label_b))
    # return X, y, and the initial beta
    return new_data[:,1:], new_data[:,0].astype(int)[:,np.newaxis], initial_beta(new_data, label_idx)
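Continuing the hypothetical toy example, prepare_data chains the three helpers above and returns the observations, labels, and initial weights:

X_toy, y_toy, beta_toy = prepare_data(toy, label_idx=0, label_a=2, label_b=3)
print(X_toy.shape)     # (4, 2) - three observations plus the bias row
print(y_toy.shape)     # (4, 1)
print(beta_toy.shape)  # (2, 1)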
| name: | train |
| description: | function uses gradient descent to update beta (the vector of coefficients, or weights) to minimize the error of the classifier's output |
| dependencies: | none |
| inputs: | X - a matrix of observations |
| y - a vector of labels | |
| beta - an initial vector of weights | |
| outputs: | returns an updated version of beta that minimizes the error of the classifier |
In [6]:
def train(X, y, beta, step_size=0.1, significant_figures=1E-4):
    # initialize some variables
    error_flag = True
    previous_error = 0
    # while the difference between the current error and the previous error is significant
    while error_flag:
        # initialize the error and error gradient sum to zeros
        error = 0
        error_gradient_sum = np.zeros(beta.shape)
        # iterate through each observation
        for observation_idx in range(X.shape[0]):
            # get the current observation
            x = X[observation_idx][:,np.newaxis]
            # calculate y_hat using the sigmoid function
            y_hat = 1 / (1 + np.exp(np.dot(-beta.T, x)))
            # convert y_hat to a binary output
            y_hat = np.where(y_hat >= 0.5, 0, 1)  # should be: y_hat <= 0.5, 0, 1
            # calculate the error and add it to the sum of errors
            difference = y_hat - y[observation_idx]
            error += 0.5 * (difference)**2
            # calculate the gradient of the error and add it to the sum of error gradients
            error_gradient = (difference * y_hat * (1 - y_hat)) * x
            error_gradient_sum = np.sum(np.hstack((error_gradient_sum, error_gradient)), axis=1)[:,np.newaxis]
        # check if the difference between the current error and the previous error is significant
        if abs(previous_error - error) < significant_figures:
            error_flag = False
        # else update the weights (beta)
        else:
            previous_error = error
            beta -= step_size * error_gradient_sum
    # return the vector of weights
    return beta
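For reference, the per-observation gradient of the squared error for a sigmoid unit is computed from the continuous sigmoid output: with y_hat = sigmoid(beta·x) and E = 0.5 * (y_hat - y)^2, the gradient is dE/dbeta = (y_hat - y) * y_hat * (1 - y_hat) * x. Once y_hat has been thresholded to 0 or 1, the factor y_hat * (1 - y_hat) is always zero, so the continuous value is what carries the gradient information. A minimal sketch (the helper name sigmoid_gradient is hypothetical, not part of the classifier above):

def sigmoid_gradient(x, y, beta):
    # continuous sigmoid output for a single observation x
    s = 1 / (1 + np.exp(np.dot(-beta.T, x)))
    # gradient of E = 0.5 * (s - y)**2 with respect to beta
    return (s - y) * s * (1 - s) * x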
| name: | predict |
| description: | function predicts labels using the binary perceptron classifier |
| dependencies: | none |
| inputs: | X - an (N+1) x M matrix of observations with the bias added as the first row |
| beta - a vector of weights/coefficients | |
| outputs: | returns a vector of predicted labels |
In [7]:
def predict(X, beta):
    # create an empty array to store the y_hats
    y_hats = np.empty((X.shape[0], 1))
    # iterate through each observation
    for observation_idx in range(X.shape[0]):
        # get the current observation
        x = X[observation_idx][:,np.newaxis]
        # calculate y_hat using the sigmoid function and store it in the array of y_hats
        y_hats[observation_idx] = 1 / (1 + np.exp(np.dot(-beta.T, x)))
    # return y_hats converted to binary
    return np.where(y_hats >= 0.5, 0, 1)  # should be: y_hats <= 0.5, 0, 1
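Continuing the hypothetical toy example (X_toy and beta_toy from the prepare_data sketch above):

toy_predictions = predict(X_toy, beta_toy)
print(toy_predictions.T)  # a row of 0s and 1s, one entry per row of X_toy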
| name: | confusion_scores |
| description: | function creates a confusion matrix with the fraction of correctly/incorrectly classified labels |
| dependencies: | none |
| inputs: | y_hat - predicted labels |
| y - actual labels | |
| outputs: | returns the confusion matrix |
In [8]:
def confusion_scores(y_hat, y):
    # initialize a 2 x 2 matrix of zeros
    matrix = np.zeros((2,2))
    # increment the cell corresponding to the actual (row) and predicted (col) label
    for prediction_idx in range(y_hat.shape[0]):
        row = y[prediction_idx]
        col = y_hat[prediction_idx]
        matrix[row,col] += 1
    # get the number of actual zeros and ones
    zeros = y[y[:] == 0].shape[0]
    ones = y[y[:] == 1].shape[0]
    # return the confusion matrix scores
    return np.divide(matrix, np.array((zeros,ones))[:,np.newaxis]).round(2)
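A small, hypothetical example of how to read the output: rows correspond to the actual label (0 or 1), columns to the predicted label, and each row sums to 1 because the counts are divided by the number of actual observations in that class.

y_actual = np.array([[0], [0], [1], [1]])
y_predicted = np.array([[0], [1], [1], [1]])
print(confusion_scores(y_predicted, y_actual))
# [[0.5  0.5]   half of the actual 0s were classified correctly
#  [0.   1. ]]  all of the actual 1s were classified correctly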
Extract the training data from its file
In [9]:
with gzip.open('../Data/zip.train.gz') as f:
    train_data = np.loadtxt(f)
Define the index of the label column in the training data and the two labels (any two digits from 0 to 9) to be used with the perceptron
In [10]:
label_idx = 0
label_a = 2
label_b = 3
Prepare the training data to be used in finding the correct weights
In [11]:
X_train, y_train, beta_initial = prepare_data(train_data, label_idx, label_a, label_b)
Update beta to minimize the error of the perceptron
In [12]:
beta = train(X_train, y_train, beta_initial)
Extract the testing data from its file
In [13]:
with gzip.open('../Data/zip.test.gz') as f:
    test_data = np.loadtxt(f)
Prepare the testing data to predict labels using a perceptron
In [14]:
X_test, y_test, _ = prepare_data(test_data, label_idx, label_a, label_b)
Use the new weight coefficients to classify the test set and print the confusion matrix scores
In [15]:
print(confusion_scores(predict(X_test, beta), y_test))
There are a few ways this could be improved. First, some kind of feature reduction would likely improve the scores; I would also like to look at the difference between using a batch update versus an inline (per-observation) update, as the train function currently uses a batch update. In addition, there is an issue with the train and predict functions: the logic for classifying based on a probability is the opposite of what it should be. I added comments where this occurs noting what the logic should be.
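As a starting point for the batch versus inline comparison, here is a minimal, hypothetical sketch of an inline update variant; the name train_online, the fixed epochs parameter, and the use of the continuous sigmoid output in the gradient are assumptions, not part of the notebook above:

def train_online(X, y, beta, step_size=0.1, epochs=100):
    # update beta after every observation instead of once per pass over the data
    for _ in range(epochs):
        for observation_idx in range(X.shape[0]):
            x = X[observation_idx][:,np.newaxis]
            # continuous sigmoid output
            y_hat = 1 / (1 + np.exp(np.dot(-beta.T, x)))
            # gradient of 0.5 * (y_hat - y)**2 with respect to beta
            error_gradient = (y_hat - y[observation_idx]) * y_hat * (1 - y_hat) * x
            # inline (per-observation) weight update
            beta -= step_size * error_gradient
    return beta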