The goal of this notebook is to create a gaussian classifier that takes in images of handwritten zip code digits and predicts their corresponding labels, in this case a digit between zero and nine. First, the training data are divided into ten subsets based on their labels. Each subset is used to fit a probability density function (pdf). Once the pdfs have been obtained, the training data can be discarded. To classify an image from the test set, the image is evaluated under each of the ten pdfs, and the predicted label is that of the pdf which returns the largest probability. The predicted labels are then compared with the actual labels of the test set using a confusion matrix, in which the rows are the actual labels and the columns are the predicted labels. Each cell holds the proportion of observations from that actual class which received that predicted label. Ideally, the confusion matrix would be the identity matrix.
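In symbols, with a fitted class-conditional density $p_k$ for each digit $k$, the classifier assigns

$$\hat{y}(\mathbf{x}) = \underset{k \in \{0, \dots, 9\}}{\operatorname{arg\,max}} \; p_k(\mathbf{x})$$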
The zip code data for this notebook can be found at the website of "The Elements of Statistical Learning" (ESL). The data consist of 7291 training observations and 2007 test observations. Each observation is a 16 × 16 grayscale image of a handwritten zip code digit, ranging from zero to nine.
|               | Zero | One  | Two | Three | Four | Five | Six | Seven | Eight | Nine | Total Observations |
|---------------|------|------|-----|-------|------|------|-----|-------|-------|------|--------------------|
| Training Data | 1194 | 1005 | 731 | 658   | 652  | 556  | 664 | 645   | 542   | 644  | 7291               |
| Test Data     | 359  | 264  | 198 | 166   | 200  | 160  | 170 | 147   | 166   | 177  | 2007               |
When building the gaussian classifier, the following steps are performed for each class (label) of the training data: compute the class mean vector $\boldsymbol{\mu}$, compute the class covariance matrix $\mathbf{\Sigma}$, and construct the class's multivariate normal pdf from them.
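Concretely, for a class with mean $\boldsymbol{\mu}$, covariance $\mathbf{\Sigma}$, and dimensionality $d$, the fitted pdf (implemented in the `gaussian_distribution` function below) is

$$p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2} \, |\mathbf{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{T} \mathbf{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)$$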
It is possible that the determinant of sigma, $|\mathbf{\Sigma}|$, is zero, which creates a divide-by-zero error and makes computing the inverse of sigma, $\mathbf{\Sigma}^{-1}$, impossible. In such cases, regularization is used: $(\mathbf{\Sigma}+rI)$. Once a gaussian has been fitted to each class, each observation of the test data is classified and a confusion matrix is used to present the results.
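As a quick illustration of why the regularization works (a hypothetical 2 × 2 example, not part of the zip code data), adding $rI$ shifts every eigenvalue of $\mathbf{\Sigma}$ up by $r$, so a singular matrix becomes invertible:

In [ ]:
import numpy as np
# singular matrix: rows are linearly dependent, so det = 0
S = np.array([[1.0, 1.0],
              [1.0, 1.0]])
# after regularization det(S_reg) = 0.21, so the inverse exists
S_reg = S + 0.1 * np.identity(2)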
In [1]:
import gzip
import numpy as np
from numpy.linalg import inv
from numpy.linalg import det
mean: takes a matrix of data values and returns a column vector of mean values. Each index of the vector is the mean of the corresponding feature (column) of the matrix.
In [2]:
def mean(data):
    # calculate the mean of the observations for a label
    return np.mean(data, axis=0)[:,np.newaxis]
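A quick shape check (a hypothetical 3 × 2 data set): the result is a column vector, which is the shape the pdf construction below expects:

In [ ]:
mean(np.array([[1., 2.],
               [3., 4.],
               [5., 6.]]))
# -> array([[ 3.],
#           [ 4.]])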
sigma: takes a matrix of data values and returns the covariance matrix. If the covariance matrix is singular (its determinant is approximately zero), regularization is used: sigma += rI. The coefficient r is set to 0.1 by default.
In [3]:
def sigma(data, r=0.1):
    # calculate the covariance of the observations for a label
    sigma = np.cov(data.T)
    # if sigma is not invertible (singular), use regularization
    if abs(det(sigma)) <= 1E-9:
        return (sigma + r*np.identity(sigma.shape[0]))
    # else return the covariance unchanged
    return sigma
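For example (a hypothetical data set, using the `sigma` function above), perfectly correlated features produce a singular covariance matrix, so the regularized version is returned:

In [ ]:
# second feature is exactly twice the first, so np.cov is singular
X = np.array([[1., 2.],
              [2., 4.],
              [3., 6.]])
S = sigma(X)
det(S)   # -> 0.51 after regularization, instead of 0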
gaussian_distribution: takes the mean, covariance, and dimensionality of a class. It returns a lambda function that evaluates the multivariate normal distribution.
In [4]:
def gaussian_distribution(mean, sigma, dimension):
    # normalization constant of the multivariate normal pdf
    const = 1.0 / (((2*np.pi)**(dimension/2.0)) * np.sqrt(det(sigma)))
    # return the multivariate normal (gaussian) pdf for a label
    return lambda x: const * np.exp(-0.5 * np.dot(np.dot((x[:,np.newaxis]-mean).T, inv(sigma)), (x[:,np.newaxis]-mean)))
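As a sanity check (a hypothetical 2-D example, not the zip code data), the density of a standard bivariate normal at its mean should be $1/(2\pi) \approx 0.159$:

In [ ]:
mu = np.zeros((2, 1))
pdf = gaussian_distribution(mu, np.identity(2), 2)
pdf(np.zeros(2))   # -> array([[ 0.15915494]])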
gaussian_distributions: takes the training data and labels. It separates the training data by class (label) and returns a dict of gaussian distributions {key=label : value=gaussian_func}
In [5]:
def gaussian_distributions(train_data, labels):
# gaussian distributions for each label
distributions = {}
for label in labels:
# get training data for label x and remove col with label
label_data = train_data[train_data[:,0] == label][:,1:]
# get gaussian function for current label
distributions[label] = gaussian_distribution(mean(label_data), sigma(label_data), label_data.shape[1])
    # return the gaussian distributions
return distributions
gaussian_classifier: takes a dict of gaussian distributions, the test data, and the class labels. For each observation, it evaluates the probability under every distribution and keeps the label with the maximum probability. It returns an array of predicted labels.
In [6]:
def gaussian_classifier(gaussian_distributions, test_data, labels):
    # calculate the probability of every observation ...
    predicted_labels = []
    for observation in range(test_data.shape[0]):
        # ... under each gaussian distribution
        probabilities = []
        for label in sorted(gaussian_distributions.keys()):
            # keep the probability from each gaussian function
            probabilities.append(gaussian_distributions[label](test_data[observation,1:]))
        # add the label with the max probability
        predicted_labels.append(labels[probabilities.index(max(probabilities))])
    # return a numpy array of the predicted labels
    return np.array(predicted_labels)
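In 256 dimensions the raw densities can underflow to zero; a log-space variant (a minimal sketch, not part of the original notebook) would compare log-densities instead, and the argmax over log-densities selects the same labels:

In [ ]:
def log_gaussian_distribution(mean, sigma, dimension):
    # log of the normalization constant, via slogdet to avoid overflow
    sign, logdet = np.linalg.slogdet(sigma)
    const = -0.5 * (dimension * np.log(2*np.pi) + logdet)
    # return the log-pdf; argmax over these matches argmax over the pdfs
    return lambda x: const - 0.5 * np.dot(np.dot((x[:,np.newaxis]-mean).T, inv(sigma)), (x[:,np.newaxis]-mean))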
confusion_matrix: takes an array of actual labels, an array of predicted labels, and the class labels. It returns a matrix in which each cell is the proportion of an actual class's observations that received a given predicted label.
In [7]:
def confusion_matrix(actual, predicted, labels):
    # get the number of class labels and observations
    num_labels = len(set(actual))
    num_observations = actual.shape[0]
    # initialize a matrix of zeros
    matrix = np.zeros((num_labels,num_labels))
    # increment the cell corresponding to actual (row) and predicted (col)
    for observation_n in range(num_observations):
        row = labels.index(actual[observation_n])
        col = labels.index(predicted[observation_n])
        matrix[row, col] += 1
    # array indices correspond to the number of observations per class
    class_label_sums = []
    for label in labels:
        class_labels = actual[actual[:] == label]
        class_label_sums.append(class_labels.shape[0])
    # divide each row by its actual class count to get proportions
    return np.divide(matrix, np.array(class_label_sums)[:,np.newaxis])
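A small check of the row normalization (a hypothetical two-class example): with two observations per class, one misclassified zero gives row proportions of 0.5:

In [ ]:
actual = np.array([0., 0., 1., 1.])
predicted = np.array([0., 1., 1., 1.])
confusion_matrix(actual, predicted, [0.0, 1.0])
# -> array([[ 0.5,  0.5],
#           [ 0. ,  1. ]])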
Get the zip code training and testing data from the files.
In [8]:
with gzip.open('zip.train.gz') as f:
train_data = np.loadtxt(f)
with gzip.open('zip.test.gz') as f:
test_data = np.loadtxt(f)
Get a list of the classes and sort them.
In [9]:
class_labels = sorted(set(train_data[:,0]))
Get the gaussian distributions for each class.
In [10]:
distributions = gaussian_distributions(train_data, class_labels)
Predict the labels of the zip code test data.
In [11]:
predicted_labels = gaussian_classifier(distributions, test_data, class_labels)
Get the actual labels of the test data.
In [12]:
actual_labels = test_data[:,0]
Compare actual labels with predicted labels using a confusion matrix and display the results.
In [13]:
cm = confusion_matrix(actual_labels, predicted_labels, class_labels)
print(cm.round(2))
On ESL's website, they note that an accuracy of 97.5% or greater is good. This gaussian classifier achieves that goal for the zero class and comes close for the one and nine classes; however, it misses the mark on classes two through eight. Dimensionality reduction may help improve the accuracy on this data set.
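As a possible next step (a minimal sketch of that suggestion, not implemented in this notebook; the helper name `pca_projection` and the use of an SVD are assumptions), the 256 pixel features could be projected onto their top k principal components before the gaussians are fitted:

In [ ]:
def pca_projection(features, k):
    # center the features, then take the top k right singular vectors
    centered = features - features.mean(axis=0)
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    # d x k projection matrix; multiply the centered features by it
    return Vt[:k].T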