There are general instructions on Blackboard and in the Syllabus for Programming Assignments. This Notebook also has instructions specific to this assignment. Read all the instructions carefully and make sure you understand them. Please ask questions on the discussion boards or email me at EN605.445@gmail.com
if you do not understand something.
You must submit a zip file of your assignment and associated files (if there are any) to Blackboard. The zip file will be named after your JHED ID: <jhed_id>.zip
. It will not include any other information. Inside this zip file should be the following directory structure:
<jhed_id>
|
+--module-01-programming.ipynb
+--module-01-programming.html
+--(any other files)
For example, do not name your directory programming_assignment_01
and do not name your directory smith122_pr1
or anything else. It must be only your JHED ID. Make sure you submit both an .ipynb and .html version of your completed notebook. You can generate the HTML version using:
ipython nbconvert [notebookname].ipynb
or use the File menu.
In this assignment you will be using the mushroom data from the Decision Tree module:
http://archive.ics.uci.edu/ml/datasets/Mushroom
The assignment is to write a program that will learn and apply a Naive Bayes Classifier for this problem. You'll first need to calculate all of the necessary probabilities (don't forget to use +1 smoothing) using a learn
function. You'll then need to have a classify
function that takes your probabilities, a List of instances (possibly a list of 1) and returns a List of Tuples. Each Tuple is a class and the normalized probability of that class. The List should be sorted so that the probabilities are in descending order. For example,
[("e", 0.98), ("p", 0.02)]
When calculating the error rate of your classifier, you should pick the class with the highest probability (the first one in the list).
As a reminder, the Naive Bayes Classifier generates the un-normalized probabilities from the numerator of Bayes Rule:
$$P(C|A) \propto P(A|C)P(C)$$ where $C$ is the class and $A$ are the attributes (data). Since the normalizer of Bayes Rule is the sum of all possible numerators and you have to calculate them all, the normalizer is just the sum of the un-normalized probabilities.
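As a purely illustrative example with made-up numbers: if the numerators come out to $P(A|e)P(e) = 0.003$ and $P(A|p)P(p) = 0.001$, the normalizer is $0.003 + 0.001 = 0.004$, so the normalized probabilities are $0.003 / 0.004 = 0.75$ for e and $0.001 / 0.004 = 0.25$ for p, giving [("e", 0.75), ("p", 0.25)].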
You'll also need an evaluate
function as before. You should use the $error\_rate$ again.
Use the same testing procedure as last time, on two randomized subsets of the data:
Imports
In [1]:
from __future__ import division # so that 1/2 = 0.5 and not 0
from IPython.core.display import *
import csv, math, copy, random
attributes_domains
A helper function that returns a dictionary mapping each attribute to the list of values (the domain) that attribute can take.
This is used to start the Naive Bayes algorithm with the appropriate attributes and their domains. A '?' value is added to every domain in case a record is missing a value for a given attribute; in that record the value for that attribute is expected to be '?', indicating that the value is unknown.
input:
None
return:
A dictionary mapping each attribute name (including 'label') to the list of single-character values it can take, with '?' included in every list.
In [2]:
def attributes_domains():
return {
'label': ['e', 'p', '?'],
'cap-shape': ['b', 'c', 'x', 'f', 'k', 's', '?'],
'cap-surface': ['f', 'g', 'y', 's', '?'],
'cap-color': ['n', 'b', 'c', 'g', 'r', 'p', 'u', 'e', 'w', 'y', '?'],
'bruises?': ['t', 'f', '?'],
'odor': ['a', 'l', 'c', 'y', 'f', 'm', 'n', 'p', 's', '?'],
'gill-attachment': ['a', 'd', 'f', 'n', '?'],
'gill-spacing': ['c', 'w', 'd', '?'],
'gill-size': ['b', 'n', '?'],
'gill-color': ['k', 'n', 'b', 'h', 'g', 'r', 'o', 'p', 'u', 'e', 'w', 'y', '?'],
'stalk-shape': ['e', 't', '?'],
'stalk-root': ['b', 'c', 'u', 'e', 'z', 'r', '?'],
'stalk-surface-above-ring': ['f', 'y', 'k', 's', '?'],
'stalk-surface-below-ring': ['f', 'y', 'k', 's', '?'],
'stalk-color-above-ring': ['n', 'b', 'c', 'g', 'o', 'p', 'e', 'w', 'y', '?'],
'stalk-color-below-ring': ['n', 'b', 'c', 'g', 'o', 'p', 'e', 'w', 'y', '?'],
'veil-type': ['p', 'u', '?'],
'veil-color': ['n', 'o', 'w', 'y', '?'],
'ring-number': ['n', 'o', 't', '?'],
'ring-type': ['c', 'e', 'f', 'l', 'n', 'p', 's', 'z', '?'],
'spore-print-color': ['k', 'n', 'b', 'h', 'r', 'o', 'u', 'w', 'y', '?'],
'population': ['a', 'c', 'n', 's', 'v', 'y', '?'],
'habitat': ['g', 'l', 'm', 'p', 'u', 'w', 'd', '?'],
}
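For example (a quick illustrative call, not part of the assignment code):
print attributes_domains()['gill-size']  # ['b', 'n', '?']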
get_positive_label
A helper function to return the positive label for this implementation of a Naive Bayes Classifier. Used in case the positive label ever needs to change. "Positive" in this context is simply derived from the data set: it is a POSITIVE thing to be able to eat a mushroom, so the data set's label e is "positive". This is the ONLY reason it's called positive.
The label is used when counting label occurrences and when computing the class priors.
input:
None
return:
The character 'e', used as the positive (edible) class label.
In [3]:
def get_positive_label():
return 'e'
get_negative_label
A helper function to return the negative label for this implementation of a Naive Bayes Classifier. Used in case the negative label ever needs to change. "Negative" in this context is simply derived from the data set: it is a NEGATIVE thing to eat a poisonous mushroom, so the data set's label p is "negative". This is the ONLY reason it's called negative.
The label is used when counting label occurrences and when computing the class priors.
input:
None
return:
The character 'p', used as the negative (poisonous) class label.
In [4]:
def get_negative_label():
return 'p'
create_record
A helper function to create a record to be used in the Naive Bayes Classifier, given a row from the csv file.
Creates a dictionary that maps each attribute name to the value of that attribute for the given record.
This is used to transform all of the data read in from the csv file into an easily usable dictionary for the Naive Bayes Classifier.
input:
csv_record: a single row from the csv file, as a list of single-character strings (the label first, then the 22 attribute values).
return:
A dictionary mapping each attribute name (including 'label') to that record's value.
In [5]:
def create_record(csv_record):
return {
'label': csv_record[0],
'cap-shape': csv_record[1],
'cap-surface': csv_record[2],
'cap-color': csv_record[3],
'bruises?': csv_record[4],
'odor': csv_record[5],
'gill-attachment': csv_record[6],
'gill-spacing': csv_record[7],
'gill-size': csv_record[8],
'gill-color': csv_record[9],
'stalk-shape': csv_record[10],
'stalk-root': csv_record[11],
'stalk-surface-above-ring': csv_record[12],
'stalk-surface-below-ring': csv_record[13],
'stalk-color-above-ring': csv_record[14],
'stalk-color-below-ring': csv_record[15],
'veil-type': csv_record[16],
'veil-color': csv_record[17],
'ring-number': csv_record[18],
'ring-type': csv_record[19],
'spore-print-color': csv_record[20],
'population': csv_record[21],
'habitat': csv_record[22],
}
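As a quick illustration (the row below is a made-up example in the data set's format, not necessarily an actual record):
example_row = ['p', 'x', 's', 'n', 't', 'p', 'f', 'c', 'n', 'k', 'e', 'e', 's', 's', 'w', 'w', 'p', 'w', 'o', 'p', 'k', 's', 'u']
record = create_record(example_row)
print record['label']      # 'p'
print record['cap-shape']  # 'x'
print record['odor']       # 'p'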
create_distribution_dict
A helper function to create a dictionary that holds the Naive Bayes Classifier distributions for all of the $P(a_i | c_i)$ probabilities, where $a_i$ is a value from the domain of an attribute in $A$ (the set of all attributes) and $c_i$ is a class label.
The dictionary has the following structure:
{
(attribute, attribute_domain_value, 'label', label_value) : value
}
The key allows us to specify which attribute and which domain value we are creating the distribution for, and the 'label' label_value part allows us to express the "given $c_i$" part of the distribution.
This dictionary is used first to hold an overall count for each combination, and later to hold the actual probability distribution for the Naive Bayes Classifier.
Note that every count is initialized to 1. This accounts for the "+1" smoothing that is needed later when calculating the probabilities $P(f_i | c_i)$.
This is an important method for the algorithm because this function specifies how the distribution is stored.
input:
None
return:
A dictionary keyed by (attribute, attribute_domain_value, 'label', label_value), with every value initialized to 1.
In [6]:
def create_distribution_dict():
attributes_with_domains = attributes_domains()
distribution = {}
for attribute, domains in attributes_with_domains.iteritems():
if attribute == 'label':
continue
for domain in domains:
distribution[(attribute, domain, 'label', get_positive_label())] = 1
distribution[(attribute, domain, 'label', get_negative_label())] = 1
return distribution
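A quick sketch of what the returned dictionary looks like (the key below is just one arbitrary example):
distribution = create_distribution_dict()
print distribution[('odor', 'n', 'label', 'e')]  # 1 -- every count starts at 1 for the +1 smoothing
print len(distribution)  # two entries (one per label) for every (attribute, domain value) pair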
read_file
A helper function to read in the data from a CSV file and transform it into a list of records, as described in the create_record
description.
NOTE: If not given a path to a file, it assumes the file is in the current working directory from which you are running this notebook, and that the file is named "agaricus-lepiota.data".
The file can be found at https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data
Please also note that this file format is the expected input for this entire Naive Bayes Classifier implementation. Please do not try to run this with other data that is not in this format or that does not have the same attributes and domains as this data set.
input:
path: an optional path to the data file; defaults to 'agaricus-lepiota.data' in the current directory.
return:
A list of records, each in the dictionary form described in the create_record
description.
In [7]:
def read_file(path=None):
if path is None:
path = 'agaricus-lepiota.data'
with open(path, 'r') as f:
reader = csv.reader(f)
csv_list = list(reader)
records = []
for value in csv_list:
records.append(create_record(value))
return records
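A usage sketch, assuming agaricus-lepiota.data has been downloaded into the notebook's directory:
records = read_file()
print len(records)         # 8124 for the full UCI mushroom data set
print records[0]['label']  # 'p' or 'e'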
create_distribution_key
A helper function that builds the key needed to access a given probability in the Naive Bayes distribution dictionary, described in create_distribution_dict
.
input:
attribute, domain_value, label_value: the attribute name, the attribute's value, and the class label for the desired entry.
return:
The tuple (attribute, domain_value, 'label', label_value) used as the dictionary key.
In [8]:
def create_distribution_key(attribute, domain_value, label_value):
return (attribute, domain_value, 'label', label_value)
put_value_in_distribution
A helper function to increment the count, by 1, of a given key in the distribution dictionary.
Used when counting the number of occurrences of a particular $A=a_i, C=c_i$ while building out the distribution from the training set.
input:
distribution: the count dictionary created by create_distribution_dict; attribute, domain_value, label_value: the entry to increment.
return: None
In [9]:
def put_value_in_distribution(distribution, attribute, domain_value, label_value):
key = create_distribution_key(attribute, domain_value, label_value)
distribution[key] += 1
get_label_count
A helper function that returns the number of records that have a given label.
This is used to get the total number of records with a given label. This value is then used when calculating the smoothed probabilities of the distribution, $$P(f_i | c_i) = \frac{Num((f_i,c_i)) + 1}{Num(c_i) + 1}$$
Specifically, it supplies the $Num(c_i)$ part.
input:
records: a list of records as created by create_record; label: the label to count.
return:
The number of records whose 'label' matches the given label.
In [10]:
def get_label_count(records, label):
count = 0
for record in records:
if record['label'] == label:
count += 1
return count
create_percentages
A helper function that, given a distribution of counts for each $(f_i, c_i)$, calculates the probability according to:
$$P(f_i | c_i) = \frac{Num((f_i,c_i)) + 1}{Num(c_i) + 1}$$
The distribution already contains the "count" for the probability, the $Num((f_i,c_i)) + 1$ part. To calculate the probability, we just divide by the appropriate denominator, which is passed in in the form of the counts for the positive and negative labels.
For each key in the distribution, we determine which $c_i$ it uses and divide by the appropriate denominator.
These percentages, or distributions, are then used during the classification step.
input:
pos_count, neg_count: the number of training records with the positive and negative labels; distribution: the count dictionary created by create_distribution_dict
return:
The same dictionary described in create_distribution_dict
, now with values that are probabilities rather than raw counts. Probability is calculated according to the above formula.
In [11]:
def create_percentages(pos_count, neg_count, distribution):
pos_count_plus_1 = pos_count + 1
neg_count_plus_1 = neg_count + 1
pos_label = get_positive_label()
neg_label = get_negative_label()
for key in distribution:
if key[3] == pos_label:
distribution[key] = distribution[key] / pos_count_plus_1
elif key[3] == neg_label:
distribution[key] = distribution[key] / neg_count_plus_1
return distribution
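As a small worked example with made-up counts: suppose there are 100 positive training records and 40 of them have odor 'n'. After counting, the dictionary holds $40 + 1 = 41$ for the key ('odor', 'n', 'label', 'e'), and create_percentages converts it to $$P(odor{=}n \mid e) = \frac{40 + 1}{100 + 1} \approx 0.406$$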
learn
The main function that learns the distribution for the Naive Bayes Classifier.
The function works as follows:
Create the count dictionary with create_distribution_dict (every count starts at 1 for the +1 smoothing).
Count the number of positive and negative records with get_label_count.
For each record, and for each attribute value in the record, increment the count for (attribute, value, 'label', record's label) with put_value_in_distribution.
Convert the counts to probabilities with create_percentages.
Store the class priors under the keys ('label', positive label) and ('label', negative label).
We then return the learned distribution as our Naive Bayes Classifier.
input:
records: a list of records, each in the form produced by the create_record
function.
return:
A dictionary with the structure described in create_distribution_dict
, with values that are the probabilities for each $A$ and $C$, so that we have $P(A=a_i | C=c_i)$, plus the class priors.
In [12]:
def learn(records):
distribution = create_distribution_dict()
pos_count = get_label_count(records, get_positive_label())
neg_count = get_label_count(records, get_negative_label())
for record in records:
for attribute, domain_value in record.iteritems():
if attribute == 'label':
continue
put_value_in_distribution(distribution, attribute, domain_value, record['label'])
distribution = create_percentages(pos_count, neg_count, distribution)
distribution[('label', get_positive_label())] = pos_count / (pos_count + neg_count)
distribution[('label', get_negative_label())] = neg_count / (pos_count + neg_count)
return distribution
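A usage sketch (assuming the data file is available locally; it trains on every record purely to illustrate the call):
records = read_file()
nb_distribution = learn(records)
print nb_distribution[('label', 'e')]              # class prior P(e)
print nb_distribution[('odor', 'n', 'label', 'e')] # smoothed P(odor = n | e)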
calculate_probability_of
A helper function that calculates the un_normalized probability of a given instance (record) for a given label.
The un_normalized probability is calculated as follows: $$P(c_i) \prod_i P(f_i | c_i)$$
Where $f_i$ is a given attribute and that attribute's value, and $c_i$ is a given label.
To calculate this, we iterate through the instance's (record's) attributes and their values, and create the key into the distribution from the attribute, the attribute's value, and the label we wish to calculate the probability for.
Each corresponding probability is multiplied into the running product. The running product is initialized to $P(c_i)$ to take care of the leading multiplicative term.
The un_normalized probability is then returned.
This is used when classifying a record, to get the probability that the record should have a certain label. This is important because this probability is normalized after the probabilities for all labels have been computed, and is then used to determine how likely it is that the record belongs to a given class.
input:
distribution: the learned dictionary described in create_distribution_dict, with values that are probabilities; instance: a record in the form produced by create_record; label: the class label to compute the probability for.
return:
The un_normalized probability of the instance for the given label.
In [13]:
def calculate_probability_of(distribution, instance, label):
un_normalized_prob = distribution[('label', label)]
for attribute, domain_value in instance.iteritems():
if attribute == 'label':
continue
key = create_distribution_key(attribute, domain_value, label)
un_normalized_prob *= distribution[key]
return un_normalized_prob
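A toy illustration of the product (the two-attribute "distribution" and instance below are hand-made numbers, not real mushroom data):
toy_distribution = {
    ('label', 'e'): 0.6,                    # P(e)
    ('odor', 'n', 'label', 'e'): 0.5,       # P(odor = n | e)
    ('gill-size', 'b', 'label', 'e'): 0.9,  # P(gill-size = b | e)
}
toy_instance = {'label': 'e', 'odor': 'n', 'gill-size': 'b'}
print calculate_probability_of(toy_distribution, toy_instance, 'e')  # 0.6 * 0.5 * 0.9 = 0.27 (up to floating point)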
normalize
A helper function that normalizes a list of probabilities. The list of probabilities is for a single record and should have the following structure:
[(label, probability), (label, probability)]
These should be the un_normalized probabilities for each label.
This function normalizes the probabilities by summing the probabilities for all labels together, then calculating the normalized probability for each label by dividing that label's probability by the sum of all the probabilities.
Each normalized probability is then placed into a new list with the same structure and the same corresponding label.
The list of normalized probabilities is then SORTED in descending order, i.e. the label with the highest probability is at index position 0 of the list.
This new normalized list of probabilities is then returned.
This function is important because it calculates the probabilities that are then used to choose which label should describe a record. This is done during validation.
input:
probability_list: a list of un-normalized probabilities with the structure [(label, probability), (label, probability)]
return:
A list with the same structure, [(label, probability), (label, probability)]
, with the probabilities normalized as described above and sorted in descending order.
In [14]:
def normalize(probability_list):
sum_of_probabilities = 0
normalized_list = []
for prob_tuple in probability_list:
sum_of_probabilities += prob_tuple[1]
for prob_tuple in probability_list:
normalized_prob = prob_tuple[1] / sum_of_probabilities
normalized_list.append((prob_tuple[0], normalized_prob))
normalized_list.sort(key=lambda x: x[1], reverse=True)
return normalized_list
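For example, with made-up un-normalized probabilities:
print normalize([('e', 0.003), ('p', 0.001)])  # approximately [('e', 0.75), ('p', 0.25)]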
classify_instance
A helper that does most of the work to classify a given instance of a record.
It works as follows:
calculate the un-normalized probability of the instance for each label, using calculate_probability_of
normalize the probabilities, using normalize
return the normalized probabilities for that instance of a record.
This is important because this list describes the probabilities that this record should have a given label. The first tuple in the list is the one whose label has the highest probability for this record.
input:
distribution: the learned dictionary described in create_distribution_dict, with values that are probabilities; instance: a record in the form produced by create_record
return:
[(label, normalized probability), (label, normalized probability)]
sorted in descending order by probability.
In [15]:
def classify_instance(distribution, instance):
labels = [get_positive_label(), get_negative_label()]
probability_results = []
for label in labels:
probability = calculate_probability_of(distribution, instance, label)
probability_results.append((label, probability))
probability_results = normalize(probability_results)
return probability_results
classify
A function to classify a list of instances (records).
Given a list of instances (records), classify each instance using classify_instance
and put the result into a result list. Return the result list after each instance has been classified.
The structure of the return list will be a list of lists, where each inner list is a list of tuples as described by the classify_instance
function. An example will look as follows:
[ [('e', .999),('p', .001)], [('p', .78), ('e', .22)] ]
The first list [('e', .999),('p', .001)]
corresponds to the probabilities for the first instance in the instances
list and the second list to the second instance of the instances
list. So on and so forth for each entry in the instances
list.
input:
distribution: the learned dictionary described in create_distribution_dict, with values that are probabilities; instances: a list of records in the form produced by create_record
return:
A list of classification lists, one per instance, as described above.
In [16]:
def classify(distribution, instances):
results = []
for instance in instances:
results.append(classify_instance(distribution, instance))
return results
evaluate
The main evaluation method. Uses a simple $\frac{Num\ Errors}{Total\ Data\ Points}$ formula to calculate the error rate of the Naive Bayes Classifier.
Given a list of records (test_data) and a list of predicted classifications for that data set, run through both lists and compare the label of each record to the predicted classification. If they do not match, increase the number of errors seen.
The label for the predicted classification is at position 0 of the predicted probabilities list, and at position 0 of the tuple that holds the label and its probability. I.e. for a classifications list that is as follows:
[ [('e', .999),('p', .001)], [('p', .78), ('e', .22)] ]
The predicted label for record 1 is 'e'
since the corresponding predicted probabilities are [('e', .999),('p', .001)]
, the most likely label is at position 0 in the list, since they are sorted from most probable to least probable. Position 0 of the list gives us ('e', .999)
. The label for this selected probability is then at position 0 of the tuple, which gives us 'e'
.
This label is then compared to the actual label for the record for correctness.
Return the number of errors seen divided by the total number of data points. This is the error rate.
input:
test_data: the list of records being evaluated; classifications: the corresponding list of predicted probability lists produced by the classify
function.
return:
The error rate: the number of errors divided by the total number of data points.
In [17]:
def evaluate(test_data, classifications):
number_of_errors = 0
for record, classification in zip(test_data, classifications):
if record['label'] != classification[0][0]:
number_of_errors += 1
return number_of_errors/len(test_data)
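A tiny hand-made check (the records and classifications below are invented just to show the arithmetic):
tiny_test = [{'label': 'e'}, {'label': 'p'}]
tiny_classifications = [[('e', 0.999), ('p', 0.001)], [('e', 0.78), ('p', 0.22)]]
print evaluate(tiny_test, tiny_classifications)  # one error out of two records = 0.5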
Put your main function calls here.
In [18]:
test_records = read_file()
random.shuffle(test_records)
half_way = int(math.floor(len(test_records)/2))
set_1 = test_records[:half_way]
set_2 = test_records[half_way:]
In [19]:
distro_1 = learn(set_1)
In [20]:
b1_c2 = classify(distro_1, set_2)
In [21]:
evaluation_b1_c2 = evaluate(set_2, b1_c2)
print "Error Rate for Naive Bayes 1 with Set 2 = {}".format(evaluation_b1_c2)
In [22]:
distro_2 = learn(set_2)
In [23]:
b2_c1 = classify(distro_2, set_1)
In [24]:
evaluation_b2_c1 = evaluate(set_1, b2_c1)
print "Error Rate for Naive Bayes 2 with Set 1 = {}".format(evaluation_b2_c1)
In [25]:
average_error = (evaluation_b1_c2 + evaluation_b2_c1)/2
print "Average Error Rate: {}".format(average_error)