Module 12 - Programming Assignment

Directions

There are general instructions on Blackboard and in the Syllabus for Programming Assignments. This Notebook also has instructions specific to this assignment. Read all the instructions carefully and make sure you understand them. Please ask questions on the discussion boards or email me at EN605.445@gmail.com if you do not understand something.

You must follow the directions *exactly* or you will get a 0 on the assignment.

You must submit a zip file of your assignment and associated files (if there are any) to Blackboard. The zip file will be named after your JHED ID: <jhed_id>.zip. It will not include any other information. Inside this zip file should be the following directory structure:

<jhed_id>
    |
    +--module-01-programming.ipynb
    +--module-01-programming.html
    +--(any other files)

For example, do not name your directory programming_assignment_01 and do not name it smith122_pr1 or anything else. It must be only your JHED ID. Make sure you submit both an .ipynb and an .html version of your completed notebook. You can generate the HTML version using:

ipython nbconvert [notebookname].ipynb

or use the File menu.

Naive Bayes Classifier

In this assignment you will be using the mushroom data from the Decision Tree module:

http://archive.ics.uci.edu/ml/datasets/Mushroom

The assignment is to write a program that will learn and apply a Naive Bayes Classifier for this problem. You'll first need to calculate all of the necessary probabilities (don't forget to use +1 smoothing) using a learn function. You'll then need to have a classify function that takes your probabilities, a List of instances (possibly a list of 1) and returns a List of Tuples. Each Tuple is a class and the normalized probability of that class. The List should be sorted so that the probabilities are in descending order. For example,

[("e", 0.98), ("p", 0.02)]

When calculating the error rate of your classifier, you should pick the class with the highest probability (the first one in the list).
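
For instance, a minimal illustrative sketch of reading the predicted class off such a result:

result = [("e", 0.98), ("p", 0.02)]   # sorted, most probable class first
predicted_label = result[0][0]        # "e"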

As a reminder, the Naive Bayes Classifier generates the un-normalized probabilities from the numerator of Bayes Rule:

$$P(C|A) \propto P(A|C)P(C)$$

where C is the class and A are the attributes (data). Since the normalizer of Bayes Rule is the sum of all possible numerators and you have to calculate them all, the normalizer is just the sum of the probabilities.
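
Concretely, the normalized probability for each class is its numerator divided by the sum of the numerators over all classes:

$$P(c_i|A) = \frac{P(A|c_i)P(c_i)}{\sum_j P(A|c_j)P(c_j)}$$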

You'll also need an evaluate function as before. You should use the $error\_rate$ again.

Use the same testing procedure as last time, on two randomized subsets of the data:

  1. learn the probabilities for set 1
  2. classify set 2
  3. evaluate the predictions
  4. learn the probabilities for set 2
  5. classify set 1
  6. evaluate the predictions
  7. average the classification error.

Imports


In [1]:
from __future__ import division # so that 1/2 = 0.5 and not 0
from IPython.core.display import *
import csv, math, copy, random

attributes_domains
A helper function to return a dictionary of attributes and the possible domain values for each attribute.

This is used to initialize the Naive Bayes algorithm with the appropriate attributes and their domains. A '?' value is added to every domain in case a record is missing a value for a given attribute; in such a record, the value for that attribute is expected to be '?', indicating that the attribute value is unknown.

input:
None

return:

  • attributes: a dictionary with attribute names as keys and each attribute's domain as a list of strings.

In [2]:
def attributes_domains():
    return {
        'label': ['e', 'p', '?'],
        'cap-shape': ['b', 'c', 'x', 'f', 'k', 's', '?'],
        'cap-surface': ['f', 'g', 'y', 's', '?'],
        'cap-color': ['n', 'b', 'c', 'g', 'r', 'p', 'u', 'e', 'w', 'y', '?'],
        'bruises?': ['t', 'f', '?'],
        'odor': ['a', 'l', 'c', 'y', 'f', 'm', 'n', 'p', 's', '?'],
        'gill-attachment': ['a', 'd', 'f', 'n', '?'],
        'gill-spacing': ['c', 'w', 'd', '?'],
        'gill-size': ['b', 'n', '?'],
        'gill-color': ['k', 'n', 'b', 'h', 'g', 'r', 'o', 'p', 'u', 'e', 'w', 'y', '?'],
        'stalk-shape': ['e', 't', '?'],
        'stalk-root': ['b', 'c', 'u', 'e', 'z', 'r', '?'],
        'stalk-surface-above-ring': ['f', 'y', 'k', 's', '?'],
        'stalk-surface-below-ring': ['f', 'y', 'k', 's', '?'],
        'stalk-color-above-ring': ['n', 'b', 'c', 'g', 'o', 'p', 'e', 'w', 'y', '?'],
        'stalk-color-below-ring': ['n', 'b', 'c', 'g', 'o', 'p', 'e', 'w', 'y', '?'],
        'veil-type': ['p', 'u', '?'],
        'veil-color': ['n', 'o', 'w', 'y', '?'],
        'ring-number': ['n', 'o', 't', '?'],
        'ring-type': ['c', 'e', 'f', 'l', 'n', 'p', 's', 'z', '?'],
        'spore-print-color': ['k', 'n', 'b', 'h', 'r', 'o', 'u', 'w', 'y', '?'],
        'population': ['a', 'c', 'n', 's', 'v', 'y', '?'],
        'habitat': ['g', 'l', 'm', 'p', 'u', 'w', 'd', '?'],
    }
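
As a quick illustrative sanity check of the structure above (note the trailing '?' placeholder in every domain list):

domains = attributes_domains()
print(all(domain[-1] == '?' for domain in domains.values()))  # expected: True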

get_positive_label
A helper function to return the positive label for this implementation of a Naive Bayes Classifier. Used in case the positive label were ever to change. "Positive" in this context is simply derived from the data set: it is a POSITIVE thing to be able to eat a mushroom, thus the dataset label e is "positive". This is the ONLY reason it's called positive.

The label is used when counting label occurrences, building the count distribution, and computing the class priors.

input:
None

return:

  • the label, a string.

In [3]:
def get_positive_label():
    return 'e'

get_negative_label
A helper function to return the negative label for this implementation of a Naive Bayes Classifier. Used in case the negative label were ever to change. "Negative" in this context is simply derived from the data set: it is a NEGATIVE thing to eat a poisonous mushroom, thus the dataset label p is "negative". This is the ONLY reason it's called negative.

The label is used when counting label occurrences, building the count distribution, and computing the class priors.

input:
None

return:

  • the label, a string.

In [4]:
def get_negative_label():
    return 'p'

create_record
A helper function to create a record to be used in the Naive Bayes Classifier, given a record from the csv file.

Creates a dictionary that maps the attribute_name to the value of that attribute for a given record.

This is used to transform all of the data read in from the csv file into an easily usable dictionary for Naive Bayes Classifier.

input:

  • csv_record: a list of strings

return:

  • a dictionary that maps attribute_names to the value for that attribute.

In [5]:
def create_record(csv_record):
    return {
        'label': csv_record[0],
        'cap-shape': csv_record[1],
        'cap-surface': csv_record[2],
        'cap-color': csv_record[3],
        'bruises?': csv_record[4],
        'odor': csv_record[5],
        'gill-attachment': csv_record[6],
        'gill-spacing': csv_record[7],
        'gill-size': csv_record[8],
        'gill-color': csv_record[9],
        'stalk-shape': csv_record[10],
        'stalk-root': csv_record[11],
        'stalk-surface-above-ring': csv_record[12],
        'stalk-surface-below-ring': csv_record[13],
        'stalk-color-above-ring': csv_record[14],
        'stalk-color-below-ring': csv_record[15],
        'veil-type': csv_record[16],
        'veil-color': csv_record[17],
        'ring-number': csv_record[18],
        'ring-type': csv_record[19],
        'spore-print-color': csv_record[20],
        'population': csv_record[21],
        'habitat': csv_record[22],
    }

create_distribution_dict
A helper function to create a dictionary that holds the Naive Bayes Classifier distributions for all of the $P(a_i|c_i)$ probabilities, where $a_i$ is a value in a given attribute's domain and $c_i$ is a class label.

The dictionary has the following structure:

{
    (attribute, attribute_domain_value, 'label', label_value) : value
}

The key allows us to specify which attribute and which domain value we are creating the distribution for, and the 'label', label_value pair gives us the "given $c_i$" part of the distribution.

This dictionary is used first to hold an overall count for each (attribute value, label) pair, and is later used to hold the actual probability distribution for the Naive Bayes Classifier.

Note that each "count" is initialized to 1. This accounts for the "+1" smoothing needed later when calculating the probabilities $P(f_i | c_i)$.

This is an important function for the algorithm because it specifies how the distribution is stored.

input:
None

return:

  • a dictionary with the structure specified in the above description.

In [6]:
def create_distribution_dict():
    attributes_with_domains = attributes_domains()

    distribution = {}

    for attribute, domains in attributes_with_domains.iteritems():
        if attribute == 'label':
            continue
        for domain in domains:
            distribution[(attribute, domain, 'label', get_positive_label())] = 1
            distribution[(attribute, domain, 'label', get_negative_label())] = 1

    return distribution
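
As a small illustrative check, every count starts at 1 because of the +1 smoothing described above:

dist = create_distribution_dict()
print(dist[('odor', 'n', 'label', 'e')])  # expected: 1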

read_file
A helper function to read in the data from a CSV file, and transform it into a list of records, as described in the create_record description.

NOTE: If not given a path to a file, it assumes that the file "agaricus-lepiota.data" is in the local directory from which you are running this notebook.
The file can be found at https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data

Please also note that this file is the expected input format for this entire Naive Bayes Classifier implementation. Please do not try to run this with other data that is not in this format or that does not have the same attribute domains as this data set.

input:

  • path (optional): the path to the csv file you wish to read in.

return:

  • records: A list of records. Records have the shape described by the create_record description.

In [7]:
def read_file(path=None):
    if path is None:
        path = 'agaricus-lepiota.data'
    with open(path, 'r') as f:
        reader = csv.reader(f)
        csv_list = list(reader)

    records = []
    for value in csv_list:
        records.append(create_record(value))

    return records

create_distribution_key
A helper function that builds the key needed to access a given probability in the Naive Bayes distribution dictionary, described in create_distribution_dict.

input:

  • attribute: a String that specifies the attribute for the probability to access
  • domain_value: a string that specifies the domain value for the probability to access
  • label_value: a string that specifies which classification label to use when accessing the probability.

return:

  • a tuple with the structure: (attribute_name, domain_value, 'label', label_value)

In [8]:
def create_distribution_key(attribute, domain_value, label_value):
    return (attribute, domain_value, 'label', label_value)
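
For example, the key for the attribute 'odor' with value 'n' under the edible label would be built as:

print(create_distribution_key('odor', 'n', 'e'))  # ('odor', 'n', 'label', 'e')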

put_value_in_distribution
A helper function to increment the count by 1, in the distribution dictionary, of a given key.

Used when counting the number of occurrences of a particular $A=a_i, C=c_i$ pair when building out the distribution of the training set.

input:

  • distribution: a dictionary with the structure specified by create_distribution_dict
  • attribute: a String that specifies the attribute for the probability to access
  • domain_value: a string that specifies the domain value for the probability to access
  • label_value: a string that specifies which classification label to use when accessing the probability.

return: None


In [9]:
def put_value_in_distribution(distribution, attribute, domain_value, label_value):
    key = create_distribution_key(attribute, domain_value, label_value)
    distribution[key] += 1

get_label_count
A helper function that returns the number of records that have a given label.

This is used to get the total number of records with a given label.
This value is then used when calculating the normalized probabilities of the distribution, $$P(f_i | c_i) = \frac{Num((f_i,c_i)) + 1}{Num(c_i) + 1}$$

Specifically the $Num(c_i)$ part.

input:

  • records: a list of records.
  • label: a string, the label to count.

return:

  • count: the number of records with the specified label

In [10]:
def get_label_count(records, label):
    count = 0
    for record in records:
        if record['label'] == label:
            count += 1
    return count
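
A tiny illustrative usage with made-up records (only the 'label' field matters here):

sample = [{'label': 'e'}, {'label': 'p'}, {'label': 'e'}]
print(get_label_count(sample, 'e'))  # expected: 2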

create_percentages
A helper function that, given a distribution of counts for $(f_i, c_i)$, calculates the probability according to: $$P(f_i | c_i) = \frac{Num((f_i,c_i)) + 1}{Num(c_i) + 1}$$

The distribution already contains the "count" part of the probability, the $Num((f_i,c_i)) + 1$ term. To calculate the probability, we just divide by the appropriate denominator, which is passed in in the form of the counts for the positive and negative labels.

For each key in the distribution, we determine which $c_i$ it uses, and divide by the appropriate denominator.

These percentages or distributions are then used during the classification step.
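
As a small worked example with hypothetical counts: if the stored count for $(odor = n, c_i = e)$ is 101 (this already includes the +1) and the training set contains 2000 edible records, then

$$P(odor = n | e) = \frac{101}{2000 + 1} \approx 0.0505$$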

input:

  • pos_count: an int, the number of records with the "positive" label in the training set.
  • neg_count: an int, the number of records with the "negative" label in the training set.
  • distribution: a dictionary, with the structure specified in create_distribution_dict

return:

  • distribution: a dictionary, with the structure specified in create_distribution_dict, now with values that are probabilities rather than raw counts. Probability is calculated according to the above formula.

In [11]:
def create_percentages(pos_count, neg_count, distribution):
    pos_count_plus_1 = pos_count + 1
    neg_count_plus_1 = neg_count + 1

    pos_label = get_positive_label()
    neg_label = get_negative_label()

    for key in distribution:
        if key[3] == pos_label:
            distribution[key] = distribution[key] / pos_count_plus_1
        elif key[3] == neg_label:
            distribution[key] = distribution[key] / neg_count_plus_1

    return distribution

learn
The main function that learns the distribution for the Naive Bayes Classifier.

The function works as follows:

  • Create initial distribution counts
  • get positive label counts
  • get negative label counts
  • for each record in the training set:
    • For each attribute, and domain_value for the attribute:
      • put the value into the distribution (i.e. increment the count for that attribute, domain, and label tuple)
        • the corresponding key in the distribution is (attribute, domain_value, 'label', actual label for the record)
  • change the distribution from counts to probabilities
  • add special entries to the distribution for the probability of each possible label.
    • the probability of a given label is as follows: $P(c_i) = \frac{Num(c_i)}{\text{size of training set}}$

We then return the learned distribution as our Naive Bayes Classifier.
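
For example, with hypothetical numbers, if 4000 of 8000 training records are labeled e, then $P(e) = \frac{4000}{8000} = 0.5$.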

input:

  • records: a list of records, as described by the create_record function.

return:

  • distribution: a dictionary, with the structure specified in create_distribution_dict, with values that are the probabilities for each $A$ and $C$, so that we have $P(A=a_i | C=c_i)$

In [12]:
def learn(records):
    distribution = create_distribution_dict()
    pos_count = get_label_count(records, get_positive_label())
    neg_count = get_label_count(records, get_negative_label())

    for record in records:
        for attribute, domain_value in record.iteritems():
            if attribute == 'label':
                continue
            put_value_in_distribution(distribution, attribute, domain_value, record['label'])

    distribution = create_percentages(pos_count, neg_count, distribution)
    distribution[('label', get_positive_label())] = pos_count / (pos_count + neg_count)
    distribution[('label', get_negative_label())] = neg_count / (pos_count + neg_count)

    return distribution

calculate_probability_of
A helper function that calculates the un_normalized probability of a given instance (record), for a given label.

The un_normalized probability is calculated as follows: $$P(c_i) \prod_i P(f_i | c_i)$$

where $f_i$ is a given attribute and attribute value, and $c_i$ is a given label.

To calculate this, we iterate through the instance's (record's) attributes and their values, and create the key into the distribution from the attribute, the attribute's value, and the label we wish to calculate the probability for.

Each looked-up probability is multiplied into the running product of the other probabilities. The running product is initialized to $P(c_i)$ to take care of the initial multiplicative term.

The un_normalized probability is then returned.

This is used when classifying a record, to get the probability that the record should have a certain label. This is important because this probability is normalized after the probabilities for all labels have been computed, and then used to determine how likely it is that a record belongs to a given class label.

input:

  • distribution: a dictionary, with the structure specified in create_distribution_dict, with values that are the probabilities.
  • instance: a record, as described by create_record
  • label: a string that specifies a given label value.

return:

  • un_normalized_prob: a float that represents the un_normalized probability that a record belongs to the given class label.

In [13]:
def calculate_probability_of(distribution, instance, label):
    un_normalized_prob = distribution[('label', label)]
    for attribute, domain_value in instance.iteritems():
        if attribute == 'label':
            continue
        key = create_distribution_key(attribute, domain_value, label)
        un_normalized_prob *= distribution[key]

    return un_normalized_prob

normalize
A helper function that normalizes a list of probabilities. The list of probabilities is for a single record, and should have the following structure:

[(label, probability), (label, probability)]

These probabilities should be un_normalized probabilities for each label.

This function normalizes the probabilities by summing the probabilities for each label together, then calculating the normalized probability for each label by dividing the probability for that label by the sum of all the probabilities.

This normalized probability is then placed into a new list with the same structure and same corresponding label.

The list of normalized probabilities is then SORTED in descending order, i.e. the label with the highest probability is at index position 0 of the list.

This new normalized list of probabilities is then returned.

This function is important because it calculates the probabilities that are then used to choose which label should be used to describe a record. This is done during evaluation.

input:

  • probability_list: a list of tuples, as described by: [(label, probability), (label, probability)]

return:

  • normalized_list: a list of tuples, as described by: [(label, probability), (label, probability)] with the probabilities being normalized as described above.

In [14]:
def normalize(probability_list):
    sum_of_probabilities = 0

    normalized_list = []

    for prob_tuple in probability_list:
        sum_of_probabilities += prob_tuple[1]

    for prob_tuple in probability_list:
        normalized_prob = prob_tuple[1] / sum_of_probabilities

        normalized_list.append((prob_tuple[0], normalized_prob))

    normalized_list.sort(key=lambda x: x[1], reverse=True)
    return normalized_list
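
A small illustrative call with made-up un-normalized values (real un-normalized probabilities will be much smaller):

print(normalize([('e', 3.0), ('p', 1.0)]))  # expected: [('e', 0.75), ('p', 0.25)]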

classify_instance
A helper that does most of the work to classify a given instance of a record.

It works as follows:

  • create a list of possible labels
  • initialize results list.
  • for each label
    • calculate the un_normalized probability of the instance using calculate_probability_of
    • add the probability to the results list as a tuple of (label, un_normalized probability)
  • normalize the probabilities, using normalize

    • note that the list of results (a list of tuples) is now sorted in descending order by the value of the probability
  • return the normalized probabilities for that instance of a record.

This is important because this list describes the probabilities that this record should have a given label. The first tuple in the list is the one with the label that has the highest probability for this record.

input:

  • distribution: a dictionary, with the structure specified in create_distribution_dict, with values that are the probabilities.
  • instance: a record, as described by create_record

return:

  • probability_results: a List of tuples with the structure as: [(label, normalized probability), (label, normalized probability)] sorted in descending order by probability.
    NOTE: these are the probabilities for a SINGLE record.

In [15]:
def classify_instance(distribution, instance):
    labels = [get_positive_label(), get_negative_label()]

    probability_results = []

    for label in labels:
        probability = calculate_probability_of(distribution, instance, label)
        probability_results.append((label, probability))

    probability_results = normalize(probability_results)

    return probability_results

classify
A function to classify a list of instances (records).

Given a list of instances (records), classify each instance using classify_instance and put the result into a result list. Return the result list after each instance has been classified.

The Structure of the return list will be a List of lists where each inner list is a list of tuples, as described by the classify_instance function. An example will look as follows:

[ [('e', .999),('p', .001)], [('p', .78), ('e', .22)] ]

The first list [('e', .999),('p', .001)] corresponds to the probabilities for the first instance in the instances list, the second list corresponds to the second instance, and so on for each entry in the instances list.

input:

  • distribution: a dictionary, with the structure specified in create_distribution_dict, with values that are the probabilities.
  • instances: a list of records, each as described by create_record

return:

  • results: a list of lists of tuples as described above.

In [16]:
def classify(distribution, instances):
    results = []
    for instance in instances:
        results.append(classify_instance(distribution, instance))

    return results

evaluate
The main evaluation method. Uses a simple $\frac{\text{num errors}}{\text{total data points}}$ to calculate the error rate of the Naive Bayes Classifier.

Given a list of records (test_data) and a list of predicted classifications for that data set, run through both lists and compare the label for each record to the predicted classification. If they do not match, increase the number of errors seen.

The label for the predicted classification is at position 0 of the predicted probabilities list, and at position 0 of the tuple that holds the label and its probability. I.e., for a classifications list that is as follows:

[ [('e', .999),('p', .001)], [('p', .78), ('e', .22)] ]

The predicted label for record 1 is 'e' since the corresponding predicted probabilities are [('e', .999),('p', .001)]: the most likely label is at position 0 in the list, since the tuples are sorted from most probable to least probable. Position 0 of the list gives us ('e', .999). The label for this selected probability is then at position 0 of the tuple, which gives us 'e'.
This label is then compared to the actual label for the record for correctness.

Return the number of errors seen divided by the total number of data points. This is the error rate.

input:

  • test_data: a list of records
  • classifications: a list of lists of tuples, as described by the classify function.

return:

  • error_rate: a float that represents the number of errors / total number of data points.

In [17]:
def evaluate(test_data, classifications):
    number_of_errors = 0
    for record, classification in zip(test_data, classifications):
        if record['label'] != classification[0][0]:
            number_of_errors += 1

    return number_of_errors/len(test_data)
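
A tiny illustrative check with made-up records and predictions (only the 'label' field of each record is used here):

fake_records = [{'label': 'e'}, {'label': 'p'}]
fake_classifications = [[('e', 0.9), ('p', 0.1)], [('e', 0.6), ('p', 0.4)]]
print(evaluate(fake_records, fake_classifications))  # expected: 0.5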

Put your main function calls here.

Set up Training Sets

Shuffle the records to ensure no bias from data order before splitting into the two sets


In [18]:
test_records = read_file()

random.shuffle(test_records)

half_way = int(math.floor(len(test_records)/2))
set_1 = test_records[:half_way]
set_2 = test_records[half_way:]

Train Naive Bayes 1 on Set 1


In [19]:
distro_1 = learn(set_1)

Get Predicted Classifications for Set 2 From Naive Bayes 1


In [20]:
b1_c2 = classify(distro_1, set_2)

Evaluate Predicted Set 2 against Actual Set 2


In [21]:
evaluation_b1_c2 = evaluate(set_2, b1_c2)
print "Error Rate for Naive Bayes 1 with Set 2 = {}".format(evaluation_b1_c2)


Error Rate for Naive Bayes 1 with Set 2 = 0.0467749876908

Train Naive Bayes 2 on Set 2


In [22]:
distro_2 = learn(set_2)

Get Predicted Classifications for Set 1 From Naive Bayes 2


In [23]:
b2_c1 = classify(distro_2, set_1)

Evaluate Predicted Set 1 against Actual Set 1


In [24]:
evaluation_b2_c1 = evaluate(set_1, b2_c1)
print "Error Rate for Naive Bayes 2 with Set 1 = {}".format(evaluation_b2_c1)


Error Rate for Naive Bayes 2 with Set 1 = 0.0558838010832

Calculate Average Error for Both Naive Bayes Distributions


In [25]:
average_error = (evaluation_b1_c2 + evaluation_b2_c1)/2
print "Average Error Rate: {}".format(average_error)


Average Error Rate: 0.051329394387