Gaussian Classifier

Overview

The goal of this notebook is to create a gaussian classifier that takes in images of handwritten zip code digits and predict their corresponding labels, in this case a digit between zero and nine. First, a set of training data will be divided into ten subsets based on their corresponding label. Each subset of data will be used to obtain a probability density function (pdf). Once the pdf’s have been obtained, the training data can be disregarded. To classify a given image from a test set, the image data is computed on each of the ten pdf’s. The predicted label assigned to the test data is that of the pdf which returns the largest probability. The predicted labels are then compared with the actual labels of the test set. This is accomplished using a confusion matrix, in which the rows are the actual labels and the columns are the predicted labels. The values in each of the cells are the proportion of labels correctly or incorrectly predicted. Ideally, the result of the confusion matrix would be the identity matrix.

Zip Code Data

The zip code data for this notebook can be found at the website of "The Elements of Statistical Learning" (ESL). The data consists of 7291 training observation and 2007 test observation. Each observation is 16 x 16 grayscale image of a handwritten zip code digit, ranging from zero to nine.

Zero One Two Three Four Five Six Seven Eight Nine Total Observations
Training Data 1194 1005 731 658 652 556 664 645 542 644 7291
Test Data 359 264 198 166 200 160 170 147 166 177 2007

Building the Classifier

When building the gaussian classifier, the following steps were preformed for each class (label) of the training data.

  1. Compute the mean $$\mu=\frac{1}{N}\sum^{N}x_{i}$$
  2. Compute the covariance (sigma) $$\mathbf{\Sigma}=\frac{1}{N}\sum^{N}\left(x_{i}-\mu\right)\left(x_{i}-\mu\right)^{T}$$
  3. Fit a multivariate normal distribution (gaussian) $$\frac{1}{(2\pi)^{\frac{k}{2}}|\mathbf{\Sigma}|^{-\frac{1}{2}}}e^{ -\frac{1}{2}(x-\mu)^{T}\mathbf{\Sigma}^{-1}(x-\mu)}$$

It is possiable that the determinante of sigma, $|\mathbf{\Sigma}|$, is zero, creating a divide by zero error and getting the inverse of sigma not possible, $\mathbf{\Sigma}^{-1}$. In such cases, regularization is used: $(\mathbf{\Sigma}+rI)$. Once fitting a gaussian to each class has been completed, each observation of the test data is classified and a confusion matrix is used to present the results.

Python Code

Imported Libraries

In [1]:
import gzip
import numpy as np
from sets import Set
from numpy.linalg import inv
from numpy.linalg import det

Defined Functions

get_mean: takes a matrix of data values and returns a vector of mean values. Each indicie of the vector is the mean value of the corresponding features of the matrix.


In [2]:
def mean(data):
    # caculate mean of observations for label x
    return np.mean(data, axis=0)[:,np.newaxis]

get_sigma: takes a matrix of data values and returns the covariance matrix. If the determinant of the covariance matrix is singular, regularzation is used: sigma += rI. The coefficient r is by default set to 0.1.


In [3]:
def sigma(data, r=0.1):
    # caculate the covariance of observations for label x
    sigma = np.cov(data.T)
    
    # if sigma is not invertable (singular) use regularization 
    if abs(det(sigma)) <= 1E-9: 
        return (sigma + r*np.identity(sigma.shape[0]))
    
    # else return covariance
    return sigma

gaussian_distribution: takes the mean, covariance and dimensionality of a class. It returns a lambda function that is the multivariate normal distribution.


In [4]:
def gaussian_distribution(mean, sigma, dimension):
    # return multivariate normal distribution (gaussian) for label x
    const = 1.0 / (((2*pi)**(dimension/2)) * np.sqrt(det(sigma)))
    return lambda x: const * exp(-0.5 * np.dot(np.dot((x[:,np.newaxis]-mean).T, inv(sigma)), (x[:,np.newaxis]-mean)))

gaussian_distributions: takes the training data and labels. It seperates the training data by class (label) and returns a dict of gaussian distributions {key=label : value=gaussian_func}


In [5]:
def gaussian_distributions(train_data, labels):
    # gaussian distributions for each label
    distributions = {}
    for label in labels:

        # get training data for label x and remove col with label
        label_data = train_data[train_data[:,0] == label][:,1:]

        # get gaussian function for current label
        distributions[label] = gaussian_distribution(mean(label_data), sigma(label_data), label_data.shape[1])
        
    # return gaussian distrabutions
    return distributions

gaussian_classifier: takes a dict of gaussian distributions, the test data and labels. Function takes each observation, finds the probability, then adds the label with the max probability to a list. It returns an array of predicted labels.


In [6]:
def gaussian_classifier(gaussian_distributions, test_data, labels):
    # caculate the probability for every observation ...
    predicted_labels = []
    for observation in range(test_data.shape[0]):
        
        # ... on each gausian distrabution
        probabilities = []
        for label in sorted(gaussian_distributions.keys()):
            
            # keep the probability for each gaussian function
            probabilities.append(gaussian_distributions[label](test_data[observation,1:]))
            
        # add the label with the max probability 
        predicted_labels.append(labels[probabilities.index(max(probabilities))])
    
    # return a numpy array of the predicted labels
    return np.array(predicted_labels)

confusion_matrix: takes an array of actual labels, an array of predicted labels, and the class labels. It returns a matrix of proportions of correctly and incorrectly labeled predictions.


In [7]:
def confusion_matrix(actual, predicted, labels):
    # get number of class labels and observations
    num_labels = len(Set(actual))
    num_observations = actual.shape[0]
    
    # initialize a matrix of zeros
    matrix = np.zeros((num_labels,num_labels))
    
    # increment cell coresponding to actual (row) and predicted (col)
    for observation_n in range(num_observations):
        row = labels.index(actual[observation_n])
        col = labels.index(predicted[observation_n])
        matrix[row, col] += 1
    
    # array indicies corespond with the number of class oberservations
    class_label_sums = []
    for label in labels:
        class_labels = actual[actual[:] == label]
        class_label_sums.append(class_labels.shape[0])
    
    # return confusion matrix proportions
    return np.divide(matrix, np.array(class_label_sums)[np.newaxis,:])

Execute Gaussian Classifier

Get the zip code training and testing data from the files.


In [8]:
with gzip.open('zip.train.gz') as f:
    train_data = np.loadtxt(f)
    
with gzip.open('zip.test.gz') as f:
    test_data = np.loadtxt(f)

Get a list of the classes and sort them.


In [9]:
class_labels = sorted(list(Set(train_data[:,0])))

Get the gaussian distrabutions for each class.


In [10]:
distributions = gaussian_distributions(train_data, class_labels)

Predicted labels of the zip code test data.


In [11]:
predicted_labels = gaussian_classifier(distributions, test_data, class_labels)

Get the actual labels of the test data.


In [12]:
actual_labels = test_data[:,0]

Compare actual labels with predicted labels using a confusion matrix and display the results.


In [13]:
confusion_matrix = confusion_matrix(actual_labels, predicted_labels, class_labels)
print confusion_matrix.round(2)


[[ 0.98  0.    0.03  0.    0.01  0.    0.    0.    0.01  0.  ]
 [ 0.    0.97  0.    0.    0.02  0.01  0.03  0.    0.    0.  ]
 [ 0.01  0.    0.91  0.01  0.02  0.01  0.    0.01  0.03  0.  ]
 [ 0.    0.    0.02  0.89  0.    0.06  0.    0.    0.03  0.  ]
 [ 0.    0.    0.03  0.    0.92  0.    0.    0.01  0.    0.05]
 [ 0.01  0.    0.    0.02  0.    0.92  0.    0.    0.02  0.01]
 [ 0.    0.    0.01  0.    0.01  0.02  0.95  0.    0.01  0.  ]
 [ 0.    0.    0.01  0.    0.02  0.    0.    0.94  0.01  0.02]
 [ 0.01  0.    0.02  0.02  0.    0.02  0.    0.    0.9   0.01]
 [ 0.    0.    0.    0.    0.    0.    0.    0.01  0.02  0.97]]

Thoughts

On ESL's website, they note that an accuracy 97.5% and greater is good. This gaussian classifier is obtaining that goal for the zero class and is close for the one and nine classes; however, it misses the mark on classes two through eight. Dimensionality reduction may help improve the accuracy rate of this data set.


In [13]: