HW1.0.0: Define big data. Provide an example of a big data problem in your domain of expertise.
One of the simplest definitions of "big data" is data so large that it cannot be processed by traditional methods [1]. This includes data that cannot be loaded into memory or, often, even stored on disk on a normal laptop. Most definitions of "big data" include the "3 V's": volume, variety, and velocity (sometimes with the addition of a fourth V, veracity). "Volume" refers to the large storage space required, which can reach zettabytes in modern applications [1]. "Variety" refers to the many different types and sources of data, both structured and unstructured [1]. "Velocity" corresponds to the need to process streaming data, often in real time. One source of big data collected and analyzed for environmental remediation is sensor data from remediation process monitoring, including real-time pressure, temperature, and flow-rate measurements.
[1] Shanahan, J. MIDS W261, Week 1, 1.3, Course notes
HW1.0.1: In 500 words (English, pseudocode, or a combination), describe how to estimate the bias, the variance, and the error for a test dataset T when polynomial regression models of degree 1, 2, 3, 4, and 5 are considered. How would you select a model?
data = load_data()
poly_deg = [1, 2, 3, 4, 5]
train_data = random_sample(data)
test_data = data[not in train_data]
B = <number of bootstrap samples>
for k in poly_deg:
{
    for b in range(B):
    {
        boot_sample = bootstrap_sample(train_data)
        regr_model = fit_regr(k, boot_sample)
        predictions[b] = regr_model.predict(test_data)
        test_error[k, b] = mean((predictions[b] - yTest)^2)
    }
    bias_sq[k] = mean((mean over b of predictions[b] - yTest)^2)
    variance[k] = mean over test points of (var over b of predictions[b])
    test_error[k] = mean over b of test_error[k, b]
}
Select the model that minimizes the test error; this is also approximately the point at which the sum of squared bias and variance is minimized.
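As a concrete illustration of the pseudocode above, the following sketch uses numpy.polyfit on synthetic data; the data-generating function, sample sizes, and number of bootstrap replicates B are illustrative assumptions, not part of the assignment.

import numpy as np

np.random.seed(0)
# hypothetical 1-D data: nonlinear signal plus noise
x = np.random.uniform(-3, 3, 200)
y = np.sin(x) + np.random.normal(0, 0.3, 200)
# hold out a test set T
idx = np.random.permutation(len(x))
train_idx, test_idx = idx[:150], idx[150:]
xTrain, yTrain = x[train_idx], y[train_idx]
xTest, yTest = x[test_idx], y[test_idx]
B = 100  # number of bootstrap samples
for k in [1, 2, 3, 4, 5]:
    preds = np.zeros((B, len(xTest)))
    for b in range(B):
        # bootstrap resample the training data and fit a degree-k polynomial
        sample = np.random.choice(len(xTrain), len(xTrain), replace=True)
        coeffs = np.polyfit(xTrain[sample], yTrain[sample], k)
        preds[b] = np.polyval(coeffs, xTest)
    mean_pred = preds.mean(axis=0)                # average prediction at each test point
    bias_sq = np.mean((mean_pred - yTest) ** 2)   # squared bias relative to the observed test targets
    variance = np.mean(preds.var(axis=0))         # variance of predictions across bootstrap fits
    test_error = np.mean((preds - yTest) ** 2)    # overall test MSE
    print('degree %d: bias^2=%.3f  var=%.3f  error=%.3f' % (k, bias_sq, variance, test_error))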
pNaiveBayes.sh counts words in parallel via a Unix, poor-man's map-reduce framework.
HW1.1: Read through the provided control script (pNaiveBayes.sh) and all of its comments.
In [402]:
'''
HW1.1. Read through the provided control script (pNaiveBayes.sh)
and all of its comments. When you are comfortable with their
purpose and function, respond to the remaining homework questions below.
A simple cell in the notebook with a print statement with a "done" string will
suffice here. (Don't forget to include the question number and the question
in the cell as a multiline comment!)
'''
def hw1_1():
print 'done'
hw1_1()
HW1.2: Provide a mapper/reducer pair that, when executed by pNaiveBayes.sh will determine the number of occurrences of a single, user-specified word. Examine the word “assistance” and report your results.
In [24]:
'''
HW1.2. Provide a mapper/reducer pair that, when executed by pNaiveBayes.sh
will determine the number of occurrences of a single, user-specified word.
Examine the word “assistance” and report your results.
- mapper.py counts all occurrences of a single word, and
- reducer.py collates the counts of the single word.
CROSSCHECK: >grep assistance enronemail_1h.txt|cut -d$'\t' -f4| grep assistance|wc -l
8
'''
# make directory for problem and change to that dir
!mkdir ~/Documents/W261/hw1/hw1_2
%cd ~/Documents/W261/hw1/hw1_2/
Mapper: The mapper processes each email in its chunk, counts all occurrences of the user-specified word, and emits the total count for the chunk.
In [106]:
%%writefile ./mapper.py
#!/usr/bin/python
import sys
import re
# Data filename and word to count passed in as command line arguments
filename = sys.argv[1]
findword = sys.argv[2]
# Initialize word count
count = 0
# regular expression for extracting word tokens from email content
WORD_RE = re.compile(r"[\w']+")
# Read emails in from file
with open (filename, "r") as myfile:
# Loop over each email (one email per line)
for line in myfile.readlines():
# email content is in last column of input file
content = line.split('\t')[-1]
# extract words from content
words = WORD_RE.findall(content.lower())
# loop over words in email
for w in words:
# if word matches the word we are looking for, increment counter
if w==findword.lower():
count+=1
# output total count for mapper
print count
Reducer: The reducer sums the counts of the specified word from each mapper
In [107]:
%%writefile ./reducer.py
#!/usr/bin/python
import sys
# initialize total count
sum = 0
# loop over mappers
for i in range(1,len(sys.argv)):
# Read mapper output from file
with open (sys.argv[i], "r") as countfile:
# loop over lines in mapper output
for line in countfile.readlines():
# add count to total
sum += int(line)
# output total count
print sum
In [504]:
# Run mapper and reducer cells first to write mapper.py and reducer.py
def hw1_2():
# move to directory for problem and copy data and map reduce script into dir
%cd ~/Documents/W261/hw1/hw1_2/
!cp ../enronemail_1h.txt ./
!cp ../pNaiveBayes.sh ./
# set mapper and reducer permissions
!chmod a+x ./mapper.py
!chmod a+x ./reducer.py
# pNaiveBayes.sh m wordlist
# with m = number of mappers
# wordlist = words in vocab
!./pNaiveBayes.sh 2 "assistance"
# total count from mapreduce output
total = !cat enronemail_1h.txt.output
# show results
print('\nThere are %d occurrences of the word "assistance" in the email contents' % int(total[0]))
hw1_2()
HW1.3: MapReduce to classify email messages by a single, user-specified word using Multinomial Naive Bayes
In [505]:
'''
HW1.3. Provide a mapper/reducer pair that, when executed by pNaiveBayes.sh
will classify the email messages by a single, user-specified word.
Examine the word “assistance” and report your results.
- mapper.py is same as in part (2), and
- reducer.py performs a single word Naive Bayes classification.
'''
# make directory for problem and change to that dir
!mkdir ~/Documents/W261/hw1/hw1_3/
%cd ~/Documents/W261/hw1/hw1_3/
!cp ../enronemail_1h_cleaned.txt ./
!cp ../pNaiveBayes.sh ./
Mapper: The mapper counts occurrences of the given word in each email
In [455]:
%%writefile ./mapper.py
#!/usr/bin/python
import sys
import re
from collections import defaultdict
# Data filename and word(s) in vocab passed in as command line arguments
filename = sys.argv[1]
vocabwords = sys.argv[2].lower().split(' ')
# regular expression for extracting word tokens from email content
WORD_RE = re.compile(r"[\w']+")
# Read emails in from file
with open (filename, "r") as myfile:
# Loop over each email (one email per line)
for line in myfile.readlines():
# email content is in last column of input file
linelist = line.split('\t')
content = linelist[-1]
# extract words from content
words = WORD_RE.findall(content.lower())
# Initialize word counts
vocabWordsCount = dict()
for v in vocabwords:
vocabWordsCount[v]=0
# loop over words in email
for w in words:
# if word is in vocab, increment count
if w in vocabWordsCount.keys():
vocabWordsCount[w]+=1
# output count of each vocab word in each email
for k in vocabWordsCount.keys():
print linelist[0]+'\t'+linelist[1]+'\t'+k+'\t'+str(vocabWordsCount[k])
Reducer: The reducer uses the single-word count for each email to make a multinomial Naive Bayes prediction.
P(C|D) = P(D|C)P(C)/P(D)
P(C|D) ∝ P(D|C)P(C)
P(spam | D) ∝ P(assistance | spam)^(count of 'assistance' in D) * P(spam)
P(assistance | spam) = (count of 'assistance' in all spam emails + 1) / (count of all vocab words in all spam emails + vocabulary size)
With vocabulary = {'assistance'}, the count of all vocab words in all spam emails equals the count of 'assistance' in all spam emails, and the vocabulary size is 1, so
P(assistance | spam) = 1
P(spam | D) ∝ P(spam)
In other words, with a one-word vocabulary the smoothed likelihood is 1 for both classes, and the classifier reduces to comparing the class priors.
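As a quick numerical check of the smoothing formula above (the counts and prior here are hypothetical, not taken from the Enron file), a one-word vocabulary always yields a smoothed likelihood of 1, so the per-email score reduces to the log prior:

from math import log

# hypothetical counts, for illustration only
count_assistance_in_spam = 5   # occurrences of 'assistance' across all spam emails
total_vocab_in_spam = 5        # one-word vocabulary, so the totals coincide
vocab_size = 1
# Laplace-smoothed conditional probability
p_assistance_given_spam = 1.0 * (count_assistance_in_spam + 1) / (total_vocab_in_spam + vocab_size)
print(p_assistance_given_spam)          # 1.0, regardless of the count
# the per-email spam score therefore reduces to the log prior
prior_spam = 0.44                       # hypothetical prior
spam_score = log(prior_spam) + 3 * log(p_assistance_given_spam)  # 3 = hypothetical count in this email
print(spam_score == log(prior_spam))    # True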
In [462]:
%%writefile ./reducer.py
#!/usr/bin/python
import sys
from collections import defaultdict
from math import log
# initialize counters
vocabWordCounts_spam = defaultdict(int)
vocabWordCounts_ham = defaultdict(int)
totalCount = 0
spamCount = 0
wordcounts = defaultdict(dict)
# initialize set of email ids in spam and ham
spam_ids = set()
ham_ids = set()
# loop over mappers
for i in range(1,len(sys.argv)):
# read mapper output from each file
with open (sys.argv[i], "r") as countfile:
# loop over lines in mapper output
for line in countfile.readlines():
line = line.strip()
linelist = line.split('\t')
# add count of word on line to dictionary by email ids
# {email_ids: {word1: count_of_word1, word2: count_of_word2,...}}
wordcounts[linelist[0]][linelist[2]]=int(linelist[3])
# class label for email is in second column
spam = int(linelist[1])
# add email id to correct set (spam or ham) and increment word counts
if spam==1:
spam_ids.add(linelist[0])
vocabWordCounts_spam[linelist[2]]+=int(linelist[3])
else:
ham_ids.add(linelist[0])
vocabWordCounts_ham[linelist[2]]+=int(linelist[3])
# number of emails in spam, ham, and total
spamCount = len(spam_ids)
hamCount = len(ham_ids)
totalCount = spamCount+hamCount
# prior probabilities of spam and ham calculated as (number of spam emails)/(total number of emails)
prior_spam = 1.0*spamCount/totalCount
prior_ham = 1.0-prior_spam
# sum number of words in vocab in spam and ham
total_all_vocab_in_spam = 0
total_all_vocab_in_ham = 0
for vw in vocabWordCounts_spam.keys():
total_all_vocab_in_spam+=vocabWordCounts_spam[vw]
for vw in vocabWordCounts_ham.keys():
total_all_vocab_in_ham+=vocabWordCounts_ham[vw]
# calc the total number of words in vocab as union of words in spam and ham
vocab_spam = set(vocabWordCounts_spam.keys())
vocab_ham = set(vocabWordCounts_ham.keys())
vocab = vocab_spam.union(vocab_ham)
vocab_size = len(vocab)
# initialize dicts for storing conditional probabilities for each vocab word in spam/ham
pr_vw_given_spam = dict()
pr_vw_given_ham = dict()
'''
Conditional Probabilities calculated as:
pr_vw_given_spam = (total_vw_in_spam+1)/(total_all_vocab_in_spam + vocab_size)
pr_vw_given_ham = (total_vw_in_ham+1)/(total_all_vocab_in_ham + vocab_size)
'''
# loop over all vocab words and store conditional probabilities
for vw in vocab:
if vw in vocabWordCounts_spam.keys():
pr_vw_given_spam[vw]=1.0*(vocabWordCounts_spam[vw]+1)/(total_all_vocab_in_spam + vocab_size)
else:
pr_vw_given_spam[vw]=1.0*(0+1)/(total_all_vocab_in_spam + vocab_size)
if vw in vocabWordCounts_ham.keys():
pr_vw_given_ham[vw]=1.0*(vocabWordCounts_ham[vw]+1)/(total_all_vocab_in_ham + vocab_size)
else:
pr_vw_given_ham[vw]=1.0*(0+1)/(total_all_vocab_in_ham + vocab_size)
# loop over emails to calculate spam/ham scores and make NB prediction
for email_id,email in wordcounts.iteritems():
# initialize scores as priors
spam_score = log(prior_spam)
ham_score = log(prior_ham)
# loop over vocab words in email and counts
for w,c in email.iteritems():
# add count*conditional probability for each vocab word in email
spam_score+=c*log(pr_vw_given_spam[w])
ham_score+=c*log(pr_vw_given_ham[w])
# true class
spam = 1*(email_id in spam_ids)
# predicted class
predict = 1*(spam_score>ham_score)
# output results as 'ID \t truth \t prediction'
print email_id+'\t'+str(spam)+'\t'+str(predict)
In [506]:
# Run mapper and reducer cells first to write mapper.py and reducer.py
def score(outputfile):
'''
Input: reducer Naive Bayes output file with format
ID \t truth \t prediction
Returns: Accuracy of classifier prediction
'''
# initialize count of examples and correct predictions
n = 0
correct=0
# read lines from reducer output file
with open (outputfile, "r") as outfile:
for line in outfile.readlines():
# increment count of number of examples
n+=1
# split line
lineList = line.replace('\n','').split('\t')
# true class is in second column
truth=int(lineList[1])
# predicted class is in third column
predict=int(lineList[2])
# increment number of correct examples if prediction matches truth
correct+=(1*truth==predict)
# return percent of correct predictions (accuracy)
return 1.0*correct/n
def hw1_3():
# move to directory for problem and copy data and map reduce script into dir
%cd ~/Documents/W261/hw1/hw1_3/
!cp ../enronemail_1h.txt ./
!cp ../pNaiveBayes.sh ./
# set mapper and reducer permissions
!chmod a+x ./mapper.py
!chmod a+x ./reducer.py
# pNaiveBayes.sh m wordlist
# with m = number of mappers
# wordlist = words in vocab
!./pNaiveBayes.sh 2 "assistance"
# show output file
!cat enronemail_1h_cleaned.txt.output
# calculate and show accuracy
outputfile = 'enronemail_1h_cleaned.txt.output'
accuracy = score(outputfile)
print '\nAccuracy for single word multinomial NB classifier: ',accuracy
hw1_3()
HW1.4: MapReduce to classify email messages by a list of one or more user-specified words using multinomial Naive Bayes
In [507]:
'''
HW1.4. Provide a mapper/reducer pair that, when executed by pNaiveBayes.sh
will classify the email messages by a list of one or more user-specified words.
Examine the words “assistance”, “valium”, and “enlargementWithATypo”
and report your results
- mapper.py counts all occurrences of a list of words, and
- reducer.py performs the multiple-word Naive Bayes classification via the chosen list.
'''
# make directory for problem and change to that dir
!mkdir ~/Documents/W261/hw1/hw1_4/
%cd ~/Documents/W261/hw1/hw1_4/
!cp ../enronemail_1h_cleaned.txt ./
!cp ../pNaiveBayes.sh ./
In [465]:
%%writefile ./mapper.py
#!/usr/bin/python
import sys
import re
# Data filename and word to count passed in as command line arguments
filename = sys.argv[1]
vocabwords = sys.argv[2].lower().split(' ')
# regular expression for extracting word tokens from email content
WORD_RE = re.compile(r"[\w']+")
# initialize dictionary for storing word counts
vocabWordsCount = dict()
# Read emails in from file
with open (filename, "r") as myfile:
# Loop over each email (one email per line)
for line in myfile.readlines():
# email content is in last column of input file
linelist = line.split('\t')
content = linelist[-1]
# extract words from content
words = WORD_RE.findall(content.lower())
# add vocab words to word count dictionary keys
for v in vocabwords:
vocabWordsCount[v]=0
# loop over words in email
for w in words:
            # if the word is in the vocab, increment its counter
if w in vocabWordsCount.keys():
vocabWordsCount[w]+=1
# output email ID, spam class, word, count for each vocab word
for k in vocabWordsCount.keys():
print linelist[0]+'\t'+linelist[1]+'\t'+k+'\t'+str(vocabWordsCount[k])
In [466]:
%%writefile ./reducer.py
#!/usr/bin/python
import sys
from collections import defaultdict
from math import log
# initialize counters
vocabWordCounts_spam = defaultdict(int)
vocabWordCounts_ham = defaultdict(int)
totalCount = 0
spamCount = 0
wordcounts = defaultdict(dict)
# initialize set of email ids in spam and ham
spam_ids = set()
ham_ids = set()
# loop over mappers
for i in range(1,len(sys.argv)):
# read mapper output from each file
with open (sys.argv[i], "r") as countfile:
# loop over lines in mapper output
for line in countfile.readlines():
line = line.strip()
linelist = line.split('\t')
# add count of word on line to dictionary by email ids
# {email_ids: {word1: count_of_word1, word2: count_of_word2,...}}
wordcounts[linelist[0]][linelist[2]]=int(linelist[3])
# class label for email is in second column
spam = int(linelist[1])
# add email id to correct set (spam or ham) and increment word counts
if spam==1:
spam_ids.add(linelist[0])
vocabWordCounts_spam[linelist[2]]+=int(linelist[3])
else:
ham_ids.add(linelist[0])
vocabWordCounts_ham[linelist[2]]+=int(linelist[3])
# number of emails in spam, ham, and total
spamCount = len(spam_ids)
hamCount = len(ham_ids)
totalCount = spamCount+hamCount
# prior probabilities of spam and ham calculated as (number of spam emails)/(total number of emails)
prior_spam = 1.0*spamCount/totalCount
prior_ham = 1.0-prior_spam
# sum number of words in vocab in spam and ham
total_all_vocab_in_spam = 0
total_all_vocab_in_ham = 0
for vw in vocabWordCounts_spam.keys():
total_all_vocab_in_spam+=vocabWordCounts_spam[vw]
for vw in vocabWordCounts_ham.keys():
total_all_vocab_in_ham+=vocabWordCounts_ham[vw]
# calc the total number of words in vocab as union of words in spam and ham
vocab_spam = set(vocabWordCounts_spam.keys())
vocab_ham = set(vocabWordCounts_ham.keys())
vocab = vocab_spam.union(vocab_ham)
vocab_size = len(vocab)
# initialize dicts for storing conditional probabilities for each vocab word in spam/ham
pr_vw_given_spam = dict()
pr_vw_given_ham = dict()
'''
Conditional Probabilities calculated as:
pr_vw_given_spam = (total_vw_in_spam+1)/(total_all_vocab_in_spam + vocab_size)
pr_vw_given_ham = (total_vw_in_ham+1)/(total_all_vocab_in_ham + vocab_size)
'''
# loop over all vocab words and store conditional probabilities
for vw in vocab:
if vw in vocabWordCounts_spam.keys():
pr_vw_given_spam[vw]=1.0*(vocabWordCounts_spam[vw]+1)/(total_all_vocab_in_spam + vocab_size)
else:
pr_vw_given_spam[vw]=1.0*(0+1)/(total_all_vocab_in_spam + vocab_size)
if vw in vocabWordCounts_ham.keys():
pr_vw_given_ham[vw]=1.0*(vocabWordCounts_ham[vw]+1)/(total_all_vocab_in_ham + vocab_size)
else:
pr_vw_given_ham[vw]=1.0*(0+1)/(total_all_vocab_in_ham + vocab_size)
# loop over emails to calculate spam/ham scores and make NB prediction
for email_id,email in wordcounts.iteritems():
# initialize scores as priors
spam_score = log(prior_spam)
ham_score = log(prior_ham)
# loop over vocab words in email and counts
for w,c in email.iteritems():
# add count*conditional probability for each vocab word in email
spam_score+=c*log(pr_vw_given_spam[w])
ham_score+=c*log(pr_vw_given_ham[w])
# true class
spam = 1*(email_id in spam_ids)
# predicted class
predict = 1*(spam_score>ham_score)
# output results as 'ID \t truth \t prediction'
print email_id+'\t'+str(spam)+'\t'+str(predict)
In [508]:
# Run mapper and reducer cells first to write mapper.py and reducer.py
def score(outputfile):
'''
Input: reducer Naive Bayes output file with format
ID \t truth \t prediction
Returns: Accuracy of classifier prediction
'''
# initialize count of examples and correct predictions
n = 0
correct=0
# read lines from reducer output file
with open (outputfile, "r") as outfile:
for line in outfile.readlines():
# increment count of number of examples
n+=1
# split line
lineList = line.replace('\n','').split('\t')
# true class is in second column
truth=int(lineList[1])
# predicted class is in third column
predict=int(lineList[2])
# increment number of correct examples if prediction matches truth
correct+=(1*truth==predict)
# return percent of correct predictions (accuracy)
return 1.0*correct/n
def hw1_4():
# move to directory for problem and copy data and map reduce script into dir
%cd ~/Documents/W261/hw1/hw1_4/
!cp ../enronemail_1h.txt ./
!cp ../pNaiveBayes.sh ./
# set mapper and reducer permissions
!chmod a+x ./mapper.py
!chmod a+x ./reducer.py
# pNaiveBayes.sh m wordlist
# with m = number of mappers
# wordlist = words in vocab
!./pNaiveBayes.sh 4 "assistance valium enlargementWithATypo"
# show output file
!cat enronemail_1h_cleaned.txt.output
# calculate and show accuracy
outputfile = 'enronemail_1h_cleaned.txt.output'
accuracy = score(outputfile)
print 'Accuracy: ',accuracy
hw1_4()
HW1.5: MapReduce to classify email messages by all words present using multinomial Naive Bayes
In [509]:
'''
HW1.5. Provide a mapper/reducer pair that, when executed by pNaiveBayes.sh
will classify the email messages by all words present.
To do so, make sure that
- mapper.py counts all occurrences of all words, and
- reducer.py performs a word-distribution-wide Naive Bayes classification.
'''
# make directory for problem and change to that dir
!mkdir ~/Documents/W261/hw1/hw1_5/
%cd ~/Documents/W261/hw1/hw1_5/
!cp ../enronemail_1h_cleaned.txt ./
!cp ../pNaiveBayes.sh ./
In [490]:
%%writefile ./mapper.py
#!/usr/bin/python
import sys
import re
from collections import defaultdict
# Data filename and word-list argument passed in as command line arguments
filename = sys.argv[1]
# the word-list argument ("*" in HW1.5) is parsed for interface consistency but not used,
# since this mapper counts every word in each email
vocabwords = sys.argv[2].lower().split(' ')
# regular expression for extracting word tokens from email content
WORD_RE = re.compile(r"[\w']+")
# Read emails in from file
with open (filename, "r") as myfile:
# Loop over each email (one email per line)
for line in myfile.readlines():
# email content is in last column of input file
linelist = line.split('\t')
content = linelist[-1]
# extract words from content
words = WORD_RE.findall(content.lower())
# loop over words in email and count each, store in dict {word: count,...}
vocabWordsCount = defaultdict(int)
for w in words:
vocabWordsCount[w]+=1
# output email ID, spam class, word, count for each word
for k in vocabWordsCount.keys():
print linelist[0]+'\t'+linelist[1]+'\t'+k+'\t'+str(vocabWordsCount[k])
In [491]:
%%writefile ./reducer.py
#!/usr/bin/python
import sys
from collections import defaultdict
from math import log
# initialize counters
vocabWordCounts_spam = defaultdict(int)
vocabWordCounts_ham = defaultdict(int)
totalCount = 0
spamCount = 0
wordcounts = defaultdict(dict)
# initialize set of email ids in spam and ham
spam_ids = set()
ham_ids = set()
# loop over mappers
for i in range(1,len(sys.argv)):
# read mapper output from each file
with open (sys.argv[i], "r") as countfile:
# loop over lines in mapper output
for line in countfile.readlines():
line = line.strip()
linelist = line.split('\t')
# add count of word on line to dictionary by email ids
# {email_ids: {word1: count_of_word1, word2: count_of_word2,...}}
wordcounts[linelist[0]][linelist[2]]=int(linelist[3])
# class label for email is in second column
spam = int(linelist[1])
# add email id to correct set (spam or ham) and increment word counts
if spam==1:
spam_ids.add(linelist[0])
vocabWordCounts_spam[linelist[2]]+=int(linelist[3])
else:
ham_ids.add(linelist[0])
vocabWordCounts_ham[linelist[2]]+=int(linelist[3])
# number of emails in spam, ham, and total
spamCount = len(spam_ids)
hamCount = len(ham_ids)
totalCount = spamCount+hamCount
# prior probabilities of spam and ham calculated as (number of spam emails)/(total number of emails)
prior_spam = 1.0*spamCount/totalCount
prior_ham = 1.0-prior_spam
# sum number of words in vocab in spam and ham
total_all_vocab_in_spam = 0
total_all_vocab_in_ham = 0
for vw in vocabWordCounts_spam.keys():
total_all_vocab_in_spam+=vocabWordCounts_spam[vw]
for vw in vocabWordCounts_ham.keys():
total_all_vocab_in_ham+=vocabWordCounts_ham[vw]
# calc the total number of words in vocab as union of words in spam and ham
vocab_spam = set(vocabWordCounts_spam.keys())
vocab_ham = set(vocabWordCounts_ham.keys())
vocab = vocab_spam.union(vocab_ham)
vocab_size = len(vocab)
# initialize dicts for storing conditional probabilities for each vocab word in spam/ham
pr_vw_given_spam = dict()
pr_vw_given_ham = dict()
'''
Conditional Probabilities calculated as:
pr_vw_given_spam = (total_vw_in_spam+1)/(total_all_vocab_in_spam + vocab_size)
pr_vw_given_ham = (total_vw_in_ham+1)/(total_all_vocab_in_ham + vocab_size)
'''
# loop over all vocab words and store conditional probabilities
for vw in vocab:
if vw in vocabWordCounts_spam.keys():
pr_vw_given_spam[vw]=1.0*(vocabWordCounts_spam[vw]+1)/(total_all_vocab_in_spam + vocab_size)
else:
pr_vw_given_spam[vw]=1.0*(0+1)/(total_all_vocab_in_spam + vocab_size)
if vw in vocabWordCounts_ham.keys():
pr_vw_given_ham[vw]=1.0*(vocabWordCounts_ham[vw]+1)/(total_all_vocab_in_ham + vocab_size)
else:
pr_vw_given_ham[vw]=1.0*(0+1)/(total_all_vocab_in_ham + vocab_size)
# loop over emails to calculate spam/ham scores and make NB prediction
for email_id,email in wordcounts.iteritems():
# initialize scores as priors
spam_score = log(prior_spam)
ham_score = log(prior_ham)
# loop over vocab words in email and counts
for w,c in email.iteritems():
# add count*conditional probability for each vocab word in email
spam_score+=c*log(pr_vw_given_spam[w])
ham_score+=c*log(pr_vw_given_ham[w])
# true class
spam = 1*(email_id in spam_ids)
# predicted class
predict = 1*(spam_score>ham_score)
# output results as 'ID \t truth \t prediction'
print email_id+'\t'+str(spam)+'\t'+str(predict)
In [510]:
# Run mapper and reducer cells first to write mapper.py and reducer.py
def score(outputfile):
'''
Input: reducer Naive Bayes output file with format
ID \t truth \t prediction
Returns: Accuracy of classifier prediction
'''
# initialize count of examples and correct predictions
n = 0
correct=0
# read lines from reducer output file
with open (outputfile, "r") as outfile:
for line in outfile.readlines():
# increment count of number of examples
n+=1
# split line
lineList = line.replace('\n','').split('\t')
# true class is in second column
truth=int(lineList[1])
# predicted class is in third column
predict=int(lineList[2])
# increment number of correct examples if prediction matches truth
correct+=(1*truth==predict)
# return percent of correct predictions (accuracy)
return 1.0*correct/n
def hw1_5():
# move to directory for problem and copy data and map reduce script into dir
%cd ~/Documents/W261/hw1/hw1_5/
!cp ../enronemail_1h.txt ./
!cp ../pNaiveBayes.sh ./
# set mapper and reducer permissions
    !chmod a+x ./mapper.py
    !chmod a+x ./reducer.py
# pNaiveBayes.sh m wordlist
# with m = number of mappers
# wordlist = words in vocab
!./pNaiveBayes.sh 4 "*"
# show output file
!cat enronemail_1h_cleaned.txt.output
# calculate and show accuracy
outputfile = 'enronemail_1h_cleaned.txt.output'
accuracy = score(outputfile)
print 'Multinomial Naive Bayes Classifier Accuracy: ',accuracy
hw1_5()
HW1.6: Benchmark the HW1.5 classifier against the SciKit-Learn Naive Bayes implementations
In [478]:
'''
HW1.6 Benchmark your code with the Python SciKit-Learn implementation of Naive Bayes
Training error = misclassification rate with respect to a training set.
It is more formally defined here:
Let DF represent the training set in the following:
Err(Model, DF) = |{(X, c(X)) ∈ DF : c(X) != Model(X)}| / |DF|
Where || denotes set cardinality;
c(X) denotes the class of the tuple X in DF;
and Model(X) denotes the class inferred by the Model “Model”
In this exercise, please complete the following:
— Run the Multinomial Naive Bayes algorithm (using default settings)
from SciKit-Learn over the same training data used in HW1.5 and
report the Training error (please note some data preparation might
be needed to get the Multinomial Naive Bayes algorithm from SciKit-Learn
to run over this dataset)
— Run the Bernoulli Naive Bayes algorithm from SciKit-Learn (using default settings)
over the same training data used in HW1.5 and report the Training error
— Run the Multinomial Naive Bayes algorithm you developed for HW1.5
over the same data used in HW1.5 and report the Training error
— Please prepare a table to present your results
— Explain/justify any differences in terms of training error rates
over the dataset in HW1.5 between your Multinomial Naive Bayes
implementation (in Map Reduce) versus the Multinomial Naive Bayes
implementation in SciKit-Learn
— Discuss the performance differences in terms of training error rates
over the dataset in HW1.5 between the Multinomial Naive Bayes
implementation in SciKit-Learn with the Bernoulli Naive Bayes
implementation in SciKit-Learn
'''
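Translating the Err(Model, DF) definition above directly into code, a minimal sketch (the function and argument names are illustrative, not part of the assignment) could look like:

def training_error(model, X, y):
    # fraction of training examples whose predicted class differs from the true class
    predictions = model.predict(X)
    return float(sum(p != t for p, t in zip(predictions, y))) / len(y)

Note that this is simply one minus the accuracy reported by score() above and by sklearn's .score() method.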
In [501]:
def hw1_6():
import numpy as np
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.feature_extraction.text import CountVectorizer
import sys
import re
from collections import defaultdict
# make directory for problem and change to that dir
!mkdir ~/Documents/W261/hw1/hw1_6/
%cd ~/Documents/W261/hw1/hw1_6/
!cp ../enronemail_1h_cleaned.txt ./
# regular expression for extracting word tokens from email content
WORD_RE = re.compile(r"[\w']+")
# initialize lists for storing training data (email contents),
# email ids, email class labels
train_ids = []
train_data = []
train_labels = []
# Read emails in from file
filename = 'enronemail_1h_cleaned.txt'
with open (filename, "r") as myfile:
# Loop over each email (one email per line)
for line in myfile.readlines():
# store email ids, contents, and spam labels
linelist = line.split('\t')
email_id = linelist[0]
content = linelist[-1]
label = linelist[1]
train_ids.append(email_id)
train_data.append(content)
train_labels.append(label)
    # Construct feature matrix for training data using CountVectorizer
    # (the constructor takes no document argument; the raw documents go to fit_transform)
    cv_train = CountVectorizer()
    tdf_train = cv_train.fit_transform(train_data)
'''
# loop through words in training data and include those with greater than 3 characters
vocab = []
for w in cv_train.vocabulary_.keys():
if len(w)>3:
vocab.append(w)
    # Construct feature matrix for training data using CountVectorizer
    # with vocab of words greater than length 3
    cv_train = CountVectorizer(vocabulary=vocab)
tdf_train = cv_train.fit_transform(train_data)
'''
# Multinomial Naive Bayes classifier
print "\nMultinomial Naive Bayes"
multNB = MultinomialNB()
multNB.fit(tdf_train, train_labels)
print("accuracy: %0.2f " % multNB.score(tdf_train, train_labels))
# Bernoulli Naive Bayes classifier
print "\nBernoulli Naive Bayes"
binNB = BernoulliNB()
binNB.fit(tdf_train, train_labels)
print("accuracy: %0.2f " % binNB.score(tdf_train, train_labels))
hw1_6()
The training accuracy (and therefore the training error, which is one minus the accuracy) is the same for the sklearn implementation of multinomial Naive Bayes (NB) and the MapReduce implementation in HW1.5, which is consistent with both using add-one (Laplace) smoothing and empirical class priors.
The sklearn multinomial NB classifier has greater training-set accuracy than the sklearn Bernoulli NB classifier because it uses more information: the multinomial model conditions on the count of each word in an email, while the Bernoulli model conditions only on whether each word is present or absent.
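A minimal sketch of that feature difference (using a hypothetical count matrix rather than the Enron data): BernoulliNB binarizes its input by default, so it only sees word presence/absence, while MultinomialNB keeps the raw counts.

import numpy as np
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

# hypothetical count matrix: 4 emails x 3 vocabulary words, with hypothetical spam labels
X_counts = np.array([[3, 0, 1],
                     [0, 2, 0],
                     [5, 0, 2],
                     [0, 1, 0]])
y = np.array([1, 0, 1, 0])
# MultinomialNB models the counts directly
MultinomialNB().fit(X_counts, y)
# BernoulliNB (binarize=0.0 by default) thresholds the counts at zero,
# so fitting it on X_counts uses only the information in this binarized matrix
X_binary = (X_counts > 0).astype(int)
BernoulliNB().fit(X_counts, y)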