Module 12 - Programming Assignment

This is the notebook for the Module 12 Programming Assignment.

Here are a few tips for using the IPython HTML notebook:

  1. You can use tab completion. Try typing le<tab> and see the available functions.
  2. You can change the type of cell by picking "Code" or "Markdown" from the menu at the left.
  3. If you keep typing in a Markdown text area, you will eventually get scroll bars. To prevent this, hit return when you come to the end of the window. Only a double return creates a new paragraph.
  4. You can find out more about Markdown text here: http://daringfireball.net/projects/markdown/ (Copy this link and put it in another tab for reference--Don't click it or you'll leave your notebook!).
  5. Every so often, restart the kernel, clear all output and run all code cells so you can be certain that you didn't define something out of order.

You should rename this notebook to be <your JHED id>_PR12.ipynb. Do it right now.

Make certain the entire notebook executes before you submit it. (See Tip #5 above.)

Change the following variables:


In [1]:
name = "Shehzad Qureshi"
jhed_id = "squresh6"
if name == "Student Name" or jhed_id == "sname1":
    raise Exception("Change the name and/or JHED ID...preferably to yours.")

Add whatever additional imports you require here. Stick with the standard libraries and those required by the class. The import gives you access to these functions: http://ipython.org/ipython-doc/stable/api/generated/IPython.core.display.html (Copy this link), which, among other things, will permit you to display HTML as the result of evaluated code (see HTML() or display_html()).


In [2]:
from IPython.core.display import *
from StringIO import StringIO
import numpy as np
from pprint import pprint
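
As a quick sanity check that the display helpers are available (a minimal, illustrative example; the markup string is arbitrary), evaluating an HTML object as the last expression in a cell renders it:

HTML("<b>display check</b>")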

In this assignment you will be using the mushroom data from the Decision Tree module:

http://archive.ics.uci.edu/ml/datasets/Mushroom

The assignment is to program a Naive Bayes Classifier for this data. You'll first need to calculate all of the necessary probabilities (don't forget to use +1 smoothing). You'll then need to have a classify() function that takes an unlabeled instance and returns the normalized probability distribution over the possible classes.

Which classifier do you prefer for this problem...the decision tree or the naive bayes classifier? Why?

Naive Bayes Classifier

We will now implement a Naive Bayes Classifier and run the classifier against the same Mushroom dataset as we did for the Decision Tree.

We're going to load the dataset beforehand because we will need it for testing.


In [3]:
data = np.loadtxt('agaricus-lepiota.data', delimiter=',', dtype='S1')

We need to be able to determine all domain values of a particular variable.


In [4]:
def get_domain_values(data, col):
    d = np.unique(data[:, col])
    return d

In [5]:
print get_domain_values(data, 0) == ['e', 'p']


[ True  True]

Now we need to be able to count the occurrences of a particular value of a variable. We also give the option of restricting the count to rows with a particular class label.


In [6]:
def get_domain_value_count(data, col, value, ycolumn=None, yvalue=None):
    vals = (data[:, col] == value)
    count = vals.sum()
    if ycolumn is None or yvalue is None:
        return count
    yvals = (data[:, ycolumn] == yvalue)
    count = np.logical_and(vals, yvals).sum()
    return count

In [7]:
print get_domain_value_count(data, 0, 'e') == 4208
print get_domain_value_count(data, 0, 'p') == 3916
print get_domain_value_count(data, 1, 'x', 0, 'e') == 1948


True
True
True

This is a method for counting the number of entries in a particular column.


In [8]:
def get_column_count(data, col):
    vals = data[:,col]
    count = len(vals)
    return count

In [9]:
get_column_count(data, 1) == 8124


Out[9]:
True

Now we need a method to calculate our probabilities. We will use the following formulas:

$P(label = e) = \frac {\# label = e}{\# labels}$

and

$P(attrib = x \mid label = e) = \frac{(\#\,attrib = x ~and~ label = e) + 1}{(\#\,label = e) + 1}$

We add 1 to both the numerator and the denominator of the latter formula for smoothing; this prevents a zero conditional probability from zeroing out the whole product.
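
For example, using the counts verified earlier (4208 edible labels out of 8124 rows, and 1948 rows whose attribute 1 is x among the edible rows):

$P(label = e) = \frac{4208}{8124} \approx 0.518 \qquad P(attrib_1 = x \mid label = e) = \frac{1948 + 1}{4208 + 1} = \frac{1949}{4209} \approx 0.463$

These match the test cell below and the first entries of the probability table further down.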


In [10]:
def calc_probability(data, col, value, ycolumn=None, yvalue=None):
    num_vals = get_domain_value_count(data, col, value)
    num_total = get_column_count(data, col)
    val_prob = float(num_vals) / (num_total)
    if ycolumn is None or yvalue is None:
        return val_prob
    num_vals = get_domain_value_count(data, col, value, ycolumn, yvalue)
    num_yvals = get_domain_value_count(data, ycolumn, yvalue)
    val_prob = float(num_vals + 1) / float(num_yvals + 1)
    return val_prob

In [11]:
print calc_probability(data, 0, 'e') == 4208/8124.
print calc_probability(data, 1, 'x', 0, 'e') == 1949/4209.


True
True

We're going to precalculate our probability table to save us from doing it over and over again. The output is a list of dictionaries: the first entry maps each class label to its prior probability, and each subsequent entry maps each attribute value to a dictionary of conditional probabilities, one per label. We pass in the domain values explicitly because cross-validation splits the dataset, and some domain values may not show up in a given training set.


In [12]:
def get_probability_table(data, domains):
    probs = []
    yvals = domains[0]
    probs.append({value : calc_probability(data, 0, value) for value in domains[0]})
    for i in range(1,len(data[0])):
        probs.append ({value : {yvalue : calc_probability(data, i, value, 0, yvalue) for yvalue in yvals} 
             for value in domains[i]})
    return probs

In [13]:
domains = [get_domain_values(data, i) for i in xrange(len(data[0]))]
%time probs = get_probability_table(data, domains)
pprint(probs[0])
pprint(probs[1])


Wall time: 344 ms
{'e': 0.517971442639094, 'p': 0.48202855736090594}
{'b': {'e': 0.0962223806129722, 'p': 0.012509573653306101},
 'c': {'e': 0.00023758612497030174, 'p': 0.0012764871074802144},
 'f': {'e': 0.37942504157757184, 'p': 0.3974980852693388},
 'k': {'e': 0.054407222618199094, 'p': 0.15343375031912176},
 's': {'e': 0.007840342124019958, 'p': 0.0002552974214960429},
 'x': {'e': 0.4630553575671181, 'p': 0.4363032933367373}}

This function calculates the unnormalized probability of a class label for a given example. The formula is as follows:

$score(c) = p(c) \cdot \prod_{i} p(a_i = v_i \mid c)$

where $c$ is the class label and $v_i$ is the example's value for attribute $a_i$.


In [14]:
def probability_of(probs, instance, label):
    conditionals = np.array([probs[i+1][instance[i]][label] for i in xrange(len(instance))])
    probability = probs[0][label] * np.product(conditionals)
    return probability

We will need to normalize the results so that they add up to 1.


In [15]:
def normalize(results, labels):
    total = np.sum(np.array([results[label] for label in labels]))
    r = {label : results[label]/total for label in labels}
    return r
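
A minimal check of the normalization with made-up unnormalized scores (illustrative values only, not taken from the dataset):

print normalize({'e': 0.03, 'p': 0.01}, ['e', 'p'])   # expect e: 0.75, p: 0.25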

We will need to determine the best label from the probabilities returned. This is as simple as taking the label with the maximum probability.


In [16]:
def find_best(results, labels):
    tmp = [(results[label], label) for label in labels]
    return sorted(tmp, reverse=True)[0][1]

This is our simple Naive Bayes Classification function. It calculates the probabilities of every label for the given example and returns the best one (along with all the results).


In [17]:
def classify(probs, instance, labels=['e', 'p']):
    results = {label : probability_of(probs, instance, label) for label in labels}
    results = normalize(results, labels)
    best = find_best(results, labels)
    return (best, results)

In [18]:
i = np.random.randint(0, len(data))
print i, data[i][0]
classify(probs, data[i][1:])


1826 e
Out[18]:
('e', {'e': 0.99999997411017272, 'p': 2.5889827185035588e-08})

Cross Validation

We'll use the same methodology as we did with Decision Trees to determine the error.

First we need to partition the data into folds.


In [19]:
def partition(dataset, fold, k):
    segments = np.array_split(dataset, k)
    test = segments[fold-1]
    training = [segments[i] for i in xrange(k) if i != (fold-1)]
    return np.concatenate(training), test
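
A quick sanity check on the full dataset, assuming 10 folds (np.array_split gives the first few folds one extra row when 8124 does not divide evenly):

training, test = partition(data, 1, 10)
print len(training), len(test)                 # expect 7311 813
print len(training) + len(test) == len(data)   # expect True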

Now we'll have to split the data into the examples (attributes) and their labels.


In [20]:
def split_data(dataset):
    x = dataset[:,1:]
    y = dataset[:,0]
    return x, y
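
A quick shape check on the full dataset (the mushroom data has 22 attributes plus the class label):

x, y = split_data(data)
print x.shape, y.shape   # expect (8124, 22) (8124,)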

This is our cross-validation function. It splits the dataset into training and test sets, calculates the probability table from the training set, and classifies the test set. The output is a list of (truth, predictions) pairs which will be used later for statistical analysis.


In [21]:
def cross_validation(dataset, folds=10):
    output = []
    domains = [get_domain_values(dataset, i) for i in xrange(len(dataset[0]))]
    for i in xrange(1, folds+1):
        training_set, test_set = partition(dataset, i, folds)
        probs = get_probability_table(training_set, domains)
        x, y = split_data(test_set)
        y_h = np.array([classify(probs, xval) for xval in x])
        output.append((y, y_h))
    return output

We'll need a way to print out our confusion matrix.


In [22]:
def confusion_matrix(y, y_h, pos, neg):
    tp = np.logical_and((y == pos), (y_h == pos)).sum()
    tn = np.logical_and((y == neg), (y_h == neg)).sum()
    fp = np.logical_and((y == neg), (y_h == pos)).sum()
    fn = np.logical_and((y == pos), (y_h == neg)).sum()
    return tp, tn, fp, fn
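
A tiny check on hand-made truth and prediction arrays, one instance of each outcome (illustrative only):

truth = np.array(['p', 'p', 'e', 'e'])
predicted = np.array(['p', 'e', 'p', 'e'])
print confusion_matrix(truth, predicted, 'p', 'e')   # expect (1, 1, 1, 1)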

In [23]:
def print_confusion_matrix(tp, tn, fp, fn, pos, neg):
    html = '''
    <table>
    <tr><td> </td><td>{0}</td><td>{1}</td></tr>
    <tr><td>{2}</td><td>{3}</td><td>{4}</td></tr>
    <tr><td>{5}</td><td>{6}</td><td>{7}</td></tr>
    </table>
    '''.format(pos, neg, pos, tp, fn, neg, fp, tn)
    h = HTML(html)
    return h

Along with the confusion matrix we will print out some stats.


In [24]:
def print_test_stats(total, tp, tn, fp, fn):
    accuracy = (tp + tn)/float(total)
    error = (fn + fp)/float(total)
    precision = float(tp)/(tp + fp)
    recall = float(tp)/(tp + fn)
    classified = tp + tn + fp + fn
    print "total:", total
    print "total classified: {0} ({1:.3%})".format(classified, classified/float(total))
    print "total unclassified: {0} ({1:.3%})".format(total - classified,  (total - classified)/float(total))
    print "accuracy: {0:.3%}".format(accuracy)
    print "error: {0:.3%}".format(error)
    print "precision: {0:.3%}".format(precision)
    print "recall: {0:.3%}".format(recall)

Now we'll do our 10-fold cross-validation test.


In [25]:
dataset = np.loadtxt('agaricus-lepiota.data', delimiter=',', dtype='S1')
np.random.shuffle(dataset)
%time output = cross_validation(dataset)


Wall time: 4.66 s

and print out the results.


In [26]:
cm = np.array([confusion_matrix(y, y_h[:,0], 'p', 'e') for y, y_h in output])
tp, tn, fp, fn = cm.sum(axis=0)
total = 0
for _, y_h in output:
    total += len(y_h)
print_test_stats(total, tp, tn, fp, fn)
print_confusion_matrix(tp, tn, fp, fn, 'p', 'e')


total: 8124
total classified: 8124 (100.000%)
total unclassified: 0 (0.000%)
accuracy: 95.507%
error: 4.493%
precision: 99.416%
recall: 91.216%
Out[26]:
      p      e
p   3572    344
e     21   4187

Summary

Unlike Decision Trees, Naive Bayes Classification uses probability theory to determine classification labels. It's called "naive" because it assumes that all attributes are conditionally independent given the class label, which in reality is rarely the case. Even with this assumption it still performs relatively well, as seen above.

I decided to take a stab at using a Random Forest implementation to see if I could reduce the error rate. Besides the fact that a 100-tree forest ran for 17 minutes, the results showed no improvement whatsoever. I presume this is because the probability distribution is roughly the same for each tree in the forest; if that is the case, then it makes sense that the forest showed no improvement.

