DATASCI W261: Machine Learning at Scale

Week 1, Homework 1

Katrina Adams

kradams@ischool.berkeley.edu
15 September 2015


HW 1.0

HW1.0.0: Define big data. Provide an example of a big data problem in your domain of expertise.

One of the simplest definitions of "big data" is data that is too large to be processed by traditional methods [1]. This includes data that cannot be loaded into memory or, often, even stored on disk on a normal laptop. Most definitions of "big data" also include the "3 V's": volume, velocity, and variety (sometimes with a fourth V, veracity). "Volume" refers to the large storage space required, up to zettabytes in modern applications [1]. "Variety" refers to the many different types and sources of data, both structured and unstructured [1]. "Velocity" refers to the need to process streaming data, often in real time. An example from my domain of expertise, environmental remediation, is sensor data collected for remediation process monitoring, including real-time pressure, temperature, and flow rate measurements.

[1] Shanahan, J. MIDS W261, Week 1, 1.3, Course notes

HW1.0.1: In 500 words (English or pseudo code or a combination) describe how to estimate the bias, the variance, and the error for a test dataset T when polynomial regression models of degree 1, 2, 3, 4, and 5 are considered. How would you select a model?

data = load_data()
poly_degrees = [1, 2, 3, 4, 5]
train_data = random_sample(data)              # training split (with labels yTrain)
test_data = data[not in train_data]           # held-out test set T (with labels yTest)
B = < number of bootstrap samples >

for k in poly_degrees:
    for b in range(B):
        boot_sample = bootstrap_sample(train_data)        # resample training data with replacement
        regr_model = fit_regr(k, boot_sample)             # fit degree-k polynomial regression
        predictions[b] = regr_model.predict(test_data)    # keep this bootstrap's test predictions
        error[k, b] = mean((predictions[b] - yTest)**2)   # test MSE for this bootstrap

    test_error[k] = mean over b of error[k, b]                                  # estimated test error
    bias_sq[k]    = mean((mean over b of predictions[b] - yTest)**2)            # estimated squared bias
    variance[k]   = mean over test points of variance over b of predictions[b]  # estimated variance

Select the model that minimizes the estimated test error; this is also approximately the point at which the sum of squared bias and variance is minimized.
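
For concreteness, the procedure above can be turned into a short runnable sketch. The snippet below is only an illustration: it assumes numpy is available, uses synthetic sine-plus-noise data in place of the actual dataset, and lets np.polyfit/np.polyval stand in for fit_regr and predict.

import numpy as np

np.random.seed(0)

# hypothetical data: y = sin(x) + noise (a stand-in, not the homework dataset)
x = np.random.uniform(0, 3, 200)
y = np.sin(x) + np.random.normal(0, 0.3, 200)

# hold out a fixed test set T
test_idx = np.random.choice(200, 50, replace=False)
is_train = np.ones(200, dtype=bool)
is_train[test_idx] = False
xTrain, yTrain = x[is_train], y[is_train]
xTest, yTest = x[~is_train], y[~is_train]

B = 100  # number of bootstrap samples

for k in [1, 2, 3, 4, 5]:
    preds = np.zeros((B, len(xTest)))
    for b in range(B):
        # bootstrap sample of the training data (sample with replacement)
        idx = np.random.randint(0, len(xTrain), len(xTrain))
        coefs = np.polyfit(xTrain[idx], yTrain[idx], k)
        preds[b, :] = np.polyval(coefs, xTest)

    test_error = np.mean((preds - yTest) ** 2)             # average test MSE over bootstraps
    bias_sq = np.mean((preds.mean(axis=0) - yTest) ** 2)   # squared bias of the mean prediction
    variance = np.mean(preds.var(axis=0))                  # prediction variance across bootstraps
    print('degree %d: error=%.3f, bias^2=%.3f, variance=%.3f' % (k, test_error, bias_sq, variance))

Running this prints the estimated test error, squared bias, and variance for each degree; the degree with the smallest estimated test error would be selected.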


Spam Filter: Multinomial Naive Bayes Classifier

Word counting is done in parallel via a Unix-based, poor-man's MapReduce framework (pNaiveBayes.sh).
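
The provided pNaiveBayes.sh is not reproduced here, but the general control flow it implements (chunk the input file, run one mapper per chunk concurrently, then feed every mapper's output to a single reducer) can be sketched in Python. The chunking scheme, file names, and the poor_mans_mapreduce function below are illustrative assumptions, not the actual script.

import subprocess

def poor_mans_mapreduce(datafile, m, wordlist):
    # split the input file into m roughly equal chunks (one email per line)
    with open(datafile) as infile:
        lines = infile.readlines()
    chunk_names = []
    for i in range(m):
        name = '%s.chunk.%d' % (datafile, i)
        with open(name, 'w') as chunk:
            chunk.writelines(lines[i::m])
        chunk_names.append(name)

    # "map" phase: launch one mapper process per chunk and let them run concurrently
    procs = []
    for name in chunk_names:
        out = open(name + '.counts', 'w')
        procs.append((subprocess.Popen(['./mapper.py', name, wordlist], stdout=out), out))
    for p, out in procs:
        p.wait()
        out.close()

    # "reduce" phase: a single reducer reads all of the mapper output files
    with open(datafile + '.output', 'w') as final:
        subprocess.call(['./reducer.py'] + [n + '.counts' for n in chunk_names], stdout=final)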

HW1.1: Read through the provided control script (pNaiveBayes.sh) and all of its comments.


In [402]:
'''
    HW1.1. Read through the provided control script (pNaiveBayes.sh)
    and all of its comments. When you are comfortable with their
    purpose and function, respond to the remaining homework questions below. 
    A simple cell in the notebook with a print statement with a "done" string will 
    suffice here. (don't forget to include the Question Number and the question 
    in the cell as a multiline comment!)
'''
def hw1_1():
    print 'done'
    
hw1_1()


done

HW1.2: Provide a mapper/reducer pair that, when executed by pNaiveBayes.sh will determine the number of occurrences of a single, user-specified word. Examine the word “assistance” and report your results.


In [24]:
'''
    HW1.2. Provide a mapper/reducer pair that, when executed by pNaiveBayes.sh
    will determine the number of occurrences of a single, user-specified word. 
    Examine the word “assistance” and report your results.

    - mapper.py counts all occurrences of a single word, and
    - reducer.py collates the counts of the single word.

    CROSSCHECK: >grep assistance enronemail_1h.txt|cut -d$'\t' -f4| grep assistance|wc -l
       8    (note: wc -l counts the number of emails containing the word, not the total number of occurrences)
'''

# make directory for problem and change to that dir
!mkdir ~/Documents/W261/hw1/hw1_2
%cd ~/Documents/W261/hw1/hw1_2/

Mapper: The mapper reads the emails in its chunk, counts all occurrences of the specified word, and emits the total count for the chunk.


In [106]:
%%writefile ./mapper.py
#!/usr/bin/python
import sys
import re

# Data filename and word to count passed in as command line arguments
filename = sys.argv[1]
findword = sys.argv[2]

# Initialize word count
count = 0

# regular expression for extracting word tokens from email content
WORD_RE = re.compile(r"[\w']+")

# Read emails in from file
with open (filename, "r") as myfile:
    # Loop over each email (one email per line)
    for line in myfile.readlines():
        # email content is in last column of input file
        content = line.split('\t')[-1]
        # extract words from content
        words = WORD_RE.findall(content.lower())
        
        # loop over words in email
        for w in words:
            # if word matches the word we are looking for, increment counter
            if w==findword.lower():
                count+=1
    
    # output total count for mapper 
    print count


Overwriting ./mapper.py

Reducer: The reducer sums the counts of the specified word from each mapper.


In [107]:
%%writefile ./reducer.py
#!/usr/bin/python

import sys

# initialize total count (avoid shadowing the built-in sum)
total = 0

# loop over mappers
for i in range(1,len(sys.argv)):
    # Read mapper output from file
    with open (sys.argv[i], "r") as countfile:
        # loop over lines in mapper output
        for line in countfile.readlines():
            # add count to total
            total += int(line)

# output total count
print total


Overwriting ./reducer.py

In [504]:
# Run mapper and reducer cells first to write mapper.py and reducer.py

def hw1_2():
    
    # move to directory for problem and copy data and map reduce script into dir
    %cd ~/Documents/W261/hw1/hw1_2/
    !cp ../enronemail_1h.txt ./
    !cp ../pNaiveBayes.sh ./
    
    # set mapper and reducer permissions
    !chmod a+x ./mapper.py
    !chmod a+x ./reducer.py
    
    # pNaiveBayes.sh m wordlist
    #    with m = number of mappers
    #         wordlist = words in vocab
    !./pNaiveBayes.sh 2 "assistance"
    
    # total count from mapreduce output
    total = !cat enronemail_1h.txt.output
    # show results
    print('\nThere are %d occurrences of the word "assistance" in the email contents' % int(total[0]))
    
hw1_2()


/Users/davidadams/Documents/W261/hw1/hw1_2

There are 9 occurrences of the word "assistance" in the email contents
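
The MapReduce total (9) and the grep crosscheck (8) measure different things: wc -l counts the number of emails whose content contains the word at least once, while the mapper counts every occurrence. A quick single-process check like the hypothetical cell below, which uses the same tokenizer as mapper.py, counts both quantities in one pass; if one email contains "assistance" twice, it should report 9 occurrences across 8 emails.

import re

# count total occurrences and the number of emails containing the word
WORD_RE = re.compile(r"[\w']+")
total_occurrences = 0
emails_containing = 0
with open('enronemail_1h.txt') as f:
    for line in f:
        words = WORD_RE.findall(line.split('\t')[-1].lower())
        c = words.count('assistance')
        total_occurrences += c
        emails_containing += (c > 0)
print('%d occurrences in %d emails' % (total_occurrences, emails_containing))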

HW1.3: MapReduce to classify email messages by a single, user-specified word using Multinomial Naive Bayes


In [505]:
'''
    HW1.3. Provide a mapper/reducer pair that, when executed by pNaiveBayes.sh
    will classify the email messages by a single, user-specified word. 
    Examine the word “assistance” and report your results. 
    
    - mapper.py is same as in part (2), and
    - reducer.py performs a single word Naive Bayes classification.
'''

# make directory for problem and change to that dir
!mkdir ~/Documents/W261/hw1/hw1_3/
%cd ~/Documents/W261/hw1/hw1_3/
!cp ../enronemail_1h_cleaned.txt ./
!cp ../pNaiveBayes.sh ./


mkdir: /Users/davidadams/Documents/W261/hw1/hw1_3/: File exists
/Users/davidadams/Documents/W261/hw1/hw1_3

Mapper: The mapper counts occurrences of the given word in each email and emits one count per email.


In [455]:
%%writefile ./mapper.py
#!/usr/bin/python

import sys
import re
from collections import defaultdict

# Data filename and word(s) in vocab passed in as command line arguments
filename = sys.argv[1]
vocabwords = sys.argv[2].lower().split(' ')

# regular expression for extracting word tokens from email content
WORD_RE = re.compile(r"[\w']+")

# Read emails in from file
with open (filename, "r") as myfile:
    # Loop over each email (one email per line)
    for line in myfile.readlines():
        
        # email content is in last column of input file
        linelist = line.split('\t')
        content = linelist[-1]
        # extract words from content
        words = WORD_RE.findall(content.lower())
        
        # Initialize word counts
        vocabWordsCount = dict()
        for v in vocabwords:
            vocabWordsCount[v]=0
        
        # loop over words in email
        for w in words:
            # if word is in vocab, increment count
            if w in vocabWordsCount.keys():
                vocabWordsCount[w]+=1
        
        # output count of each vocab word in each email
        for k in vocabWordsCount.keys():
            print linelist[0]+'\t'+linelist[1]+'\t'+k+'\t'+str(vocabWordsCount[k])


Overwriting ./mapper.py

Reducer: The reducer uses the single-word count for each email to make a multinomial Naive Bayes prediction.

By Bayes' rule (dropping P(D), which is the same for both classes; "~" denotes "proportional to"):

P(C|D) = P(D|C)P(C)/P(D)
P(C|D) ~ P(D|C)P(C)
P(spam | D) ~ P(assistance | spam)^n * P(spam),   where n = count of 'assistance' in the email

With Laplace (add-one) smoothing:

P(assistance | spam) = (count of 'assistance' in all spam emails + 1) / (count of all vocab words in all spam emails + vocabulary size)

Because the vocabulary here is just {'assistance'}, the count of all vocab words in spam emails equals the count of 'assistance' in spam emails and the vocabulary size is 1, so

P(assistance | spam) = 1    (and likewise P(assistance | ham) = 1)

The likelihood term therefore contributes nothing, and the decision reduces to comparing the class priors:

P(spam | D) ~ P(spam)

Since ham is the majority class in this dataset, every email is predicted as ham, which is why the accuracy reported below equals the fraction of ham emails (0.56).


In [462]:
%%writefile ./reducer.py
#!/usr/bin/python

import sys
from collections import defaultdict
from math import log

# initialize counters
vocabWordCounts_spam = defaultdict(int)
vocabWordCounts_ham = defaultdict(int)
totalCount = 0
spamCount = 0
wordcounts = defaultdict(dict)

# initialize set of email ids in spam and ham
spam_ids = set()
ham_ids = set()

# loop over mappers
for i in range(1,len(sys.argv)):
    # read mapper output from each file
    with open (sys.argv[i], "r") as countfile:
        # loop over lines in mapper output
        for line in countfile.readlines():
            
            line = line.strip()
            linelist = line.split('\t')
            
            # add count of word on line to dictionary by email ids
            #    {email_ids: {word1: count_of_word1, word2: count_of_word2,...}}
            wordcounts[linelist[0]][linelist[2]]=int(linelist[3])
            
            # class label for email is in second column
            spam = int(linelist[1])
            
            # add email id to correct set (spam or ham) and increment word counts
            if spam==1:
                spam_ids.add(linelist[0])
                vocabWordCounts_spam[linelist[2]]+=int(linelist[3])
            else:
                ham_ids.add(linelist[0])
                vocabWordCounts_ham[linelist[2]]+=int(linelist[3])

# number of emails in spam, ham, and total
spamCount = len(spam_ids)
hamCount = len(ham_ids)
totalCount = spamCount+hamCount

# prior probabilities of spam and ham calculated as (number of spam emails)/(total number of emails)
prior_spam = 1.0*spamCount/totalCount
prior_ham = 1.0-prior_spam

# sum number of words in vocab in spam and ham
total_all_vocab_in_spam = 0
total_all_vocab_in_ham = 0
for vw in vocabWordCounts_spam.keys():
    total_all_vocab_in_spam+=vocabWordCounts_spam[vw]
for vw in vocabWordCounts_ham.keys():
    total_all_vocab_in_ham+=vocabWordCounts_ham[vw]

# calc the total number of words in vocab as union of words in spam and ham
vocab_spam = set(vocabWordCounts_spam.keys())
vocab_ham = set(vocabWordCounts_ham.keys())
vocab = vocab_spam.union(vocab_ham)
vocab_size = len(vocab)

# initialize dicts for storing conditional probabilities for each vocab word in spam/ham
pr_vw_given_spam = dict()
pr_vw_given_ham = dict()

'''
Conditional Probabilities calculated as:
pr_vw_given_spam = (total_vw_in_spam+1)/(total_all_vocab_in_spam + vocab_size)
pr_vw_given_ham = (total_vw_in_ham+1)/(total_all_vocab_in_ham + vocab_size)
'''

# loop over all vocab words and store conditional probabilities
for vw in vocab:
    if vw in vocabWordCounts_spam.keys():
        pr_vw_given_spam[vw]=1.0*(vocabWordCounts_spam[vw]+1)/(total_all_vocab_in_spam + vocab_size)
    else:
        pr_vw_given_spam[vw]=1.0*(0+1)/(total_all_vocab_in_spam + vocab_size)
    if vw in vocabWordCounts_ham.keys():
        pr_vw_given_ham[vw]=1.0*(vocabWordCounts_ham[vw]+1)/(total_all_vocab_in_ham + vocab_size)
    else:
        pr_vw_given_ham[vw]=1.0*(0+1)/(total_all_vocab_in_ham + vocab_size)

# loop over emails to calculate spam/ham scores and make NB prediction
for email_id,email in wordcounts.iteritems():
    
    # initialize scores as priors
    spam_score = log(prior_spam)
    ham_score = log(prior_ham)

    # loop over vocab words in email and counts
    for w,c in email.iteritems():
        # add count*conditional probability for each vocab word in email
        spam_score+=c*log(pr_vw_given_spam[w])
        ham_score+=c*log(pr_vw_given_ham[w])

    # true class
    spam = 1*(email_id in spam_ids)
    
    # predicted class
    predict = 1*(spam_score>ham_score)
    
    # output results as 'ID \t truth \t prediction'
    print email_id+'\t'+str(spam)+'\t'+str(predict)


Overwriting ./reducer.py

In [506]:
# Run mapper and reducer cells first to write mapper.py and reducer.py

def score(outputfile):
    '''
        Input: reducer Naive Bayes output file with format
                ID \t truth \t prediction
        Returns: Accuracy of classifier prediction
    '''
    # initialize count of examples and correct predictions
    n = 0
    correct=0
    
    # read lines from reducer output file
    with open (outputfile, "r") as outfile:
        for line in outfile.readlines():
            # increment count of number of examples
            n+=1
            
            # split line
            lineList = line.replace('\n','').split('\t')
            
            # true class is in second column
            truth=int(lineList[1])
            
            # predicted class is in third column
            predict=int(lineList[2])
            
            # increment number of correct examples if prediction matches truth
            correct += int(truth==predict)

    # return fraction of correct predictions (accuracy)
    return 1.0*correct/n
    
    
def hw1_3():
    
    # move to directory for problem and copy data and map reduce script into dir
    %cd ~/Documents/W261/hw1/hw1_3/
    !cp ../enronemail_1h.txt ./
    !cp ../pNaiveBayes.sh ./
    
    # set mapper and reducer permissions
    !chmod a+x ./mapper.py
    !chmod a+x ./reducer.py
    
    # pNaiveBayes.sh m wordlist
    #    with m = number of mappers
    #         wordlist = words in vocab
    !./pNaiveBayes.sh 2 "assistance"
    
    # show output file
    !cat enronemail_1h_cleaned.txt.output
    
    # calculate and show accuracy
    outputfile = 'enronemail_1h_cleaned.txt.output'
    accuracy = score(outputfile)
    print '\nAccuracy for single word multinomial NB classifier: ',accuracy
    
hw1_3()


/Users/davidadams/Documents/W261/hw1/hw1_3
0010.2003-12-18.GP	1	0
0010.2001-06-28.SA_and_HP	1	0
0001.2000-01-17.beck	0	0
0018.1999-12-14.kaminski	0	0
0005.1999-12-12.kaminski	0	0
0011.2001-06-29.SA_and_HP	1	0
0008.2004-08-01.BG	1	0
0009.1999-12-14.farmer	0	0
0017.2003-12-18.GP	1	0
0011.2001-06-28.SA_and_HP	1	0
0015.2001-07-05.SA_and_HP	1	0
0015.2001-02-12.kitchen	0	0
0009.2001-06-26.SA_and_HP	1	0
0017.1999-12-14.kaminski	0	0
0012.2000-01-17.beck	0	0
0003.2000-01-17.beck	0	0
0004.2001-06-12.SA_and_HP	1	0
0008.2001-06-12.SA_and_HP	1	0
0007.2001-02-09.kitchen	0	0
0016.2004-08-01.BG	1	0
0015.2000-06-09.lokay	0	0
0005.1999-12-14.farmer	0	0
0016.1999-12-15.farmer	0	0
0013.2004-08-01.BG	1	0
0005.2003-12-18.GP	1	0
0012.2001-02-09.kitchen	0	0
0003.2001-02-08.kitchen	0	0
0009.2001-02-09.kitchen	0	0
0006.2001-02-08.kitchen	0	0
0014.2003-12-19.GP	1	0
0010.1999-12-14.farmer	0	0
0010.2004-08-01.BG	1	0
0014.1999-12-14.kaminski	0	0
0006.1999-12-13.kaminski	0	0
0011.1999-12-14.farmer	0	0
0013.1999-12-14.kaminski	0	0
0001.2001-02-07.kitchen	0	0
0008.2001-02-09.kitchen	0	0
0007.2003-12-18.GP	1	0
0017.2004-08-02.BG	1	0
0014.2004-08-01.BG	1	0
0006.2003-12-18.GP	1	0
0016.2001-07-05.SA_and_HP	1	0
0008.2003-12-18.GP	1	0
0014.2001-07-04.SA_and_HP	1	0
0001.2001-04-02.williams	0	0
0012.2000-06-08.lokay	0	0
0014.1999-12-15.farmer	0	0
0009.2000-06-07.lokay	0	0
0001.1999-12-10.farmer	0	0
0008.2001-06-25.SA_and_HP	1	0
0017.2001-04-03.williams	0	0
0014.2001-02-12.kitchen	0	0
0016.2001-07-06.SA_and_HP	1	0
0015.1999-12-15.farmer	0	0
0009.1999-12-13.kaminski	0	0
0001.2000-06-06.lokay	0	0
0011.2004-08-01.BG	1	0
0004.2004-08-01.BG	1	0
0018.2003-12-18.GP	1	0
0002.1999-12-13.farmer	0	0
0016.2003-12-19.GP	1	0
0004.1999-12-14.farmer	0	0
0015.2003-12-19.GP	1	0
0006.2004-08-01.BG	1	0
0009.2003-12-18.GP	1	0
0007.1999-12-14.farmer	0	0
0005.2000-06-06.lokay	0	0
0010.1999-12-14.kaminski	0	0
0007.2000-01-17.beck	0	0
0003.1999-12-14.farmer	0	0
0003.2004-08-01.BG	1	0
0017.2004-08-01.BG	1	0
0013.2001-06-30.SA_and_HP	1	0
0003.1999-12-10.kaminski	0	0
0012.1999-12-14.farmer	0	0
0004.1999-12-10.kaminski	0	0
0018.2001-07-13.SA_and_HP	1	0
0002.2001-02-07.kitchen	0	0
0007.2004-08-01.BG	1	0
0012.1999-12-14.kaminski	0	0
0005.2001-06-23.SA_and_HP	1	0
0007.1999-12-13.kaminski	0	0
0017.2000-01-17.beck	0	0
0006.2001-06-25.SA_and_HP	1	0
0006.2001-04-03.williams	0	0
0005.2001-02-08.kitchen	0	0
0002.2003-12-18.GP	1	0
0003.2003-12-18.GP	1	0
0013.2001-04-03.williams	0	0
0004.2001-04-02.williams	0	0
0010.2001-02-09.kitchen	0	0
0001.1999-12-10.kaminski	0	0
0013.1999-12-14.farmer	0	0
0015.1999-12-14.kaminski	0	0
0012.2003-12-19.GP	1	0
0016.2001-02-12.kitchen	0	0
0002.2004-08-01.BG	1	0
0002.2001-05-25.SA_and_HP	1	0
0011.2003-12-18.GP	1	0

Accuracy for single word multinomial NB classifier:  0.56


In [507]:
'''
    HW1.4. Provide a mapper/reducer pair that, when executed by pNaiveBayes.sh
    will classify the email messages by a list of one or more user-specified words. 
    Examine the words “assistance”, “valium”, and “enlargementWithATypo” 
    and report your results
    
    - mapper.py counts all occurrences of a list of words, and
    - reducer.py performs the multiple-word Naive Bayes classification via the chosen list.
'''

# make directory for problem and change to that dir
!mkdir ~/Documents/W261/hw1/hw1_4/
%cd ~/Documents/W261/hw1/hw1_4/
!cp ../enronemail_1h_cleaned.txt ./
!cp ../pNaiveBayes.sh ./


mkdir: /Users/davidadams/Documents/W261/hw1/hw1_4/: File exists
/Users/davidadams/Documents/W261/hw1/hw1_4

In [465]:
%%writefile ./mapper.py
#!/usr/bin/python

import sys
import re

# Data filename and list of vocab words passed in as command line arguments
filename = sys.argv[1]
vocabwords = sys.argv[2].lower().split(' ')

# regular expression for extracting word tokens from email content
WORD_RE = re.compile(r"[\w']+")

# initialize dictionary for storing word counts
vocabWordsCount = dict()

# Read emails in from file
with open (filename, "r") as myfile:
    # Loop over each email (one email per line)
    for line in myfile.readlines():
        # email content is in last column of input file
        linelist = line.split('\t')
        content = linelist[-1]
        # extract words from content
        words = WORD_RE.findall(content.lower())
        
        # reset the count of each vocab word for this email
        for v in vocabwords:
            vocabWordsCount[v]=0
        
        # loop over words in email
        for w in words:
            # if word is in vocab, increment its counter
            if w in vocabWordsCount.keys():
                vocabWordsCount[w]+=1
        
        # output email ID, spam class, word, count for each vocab word
        for k in vocabWordsCount.keys():
            print linelist[0]+'\t'+linelist[1]+'\t'+k+'\t'+str(vocabWordsCount[k])


Overwriting ./mapper.py

In [466]:
%%writefile ./reducer.py
#!/usr/bin/python

import sys
from collections import defaultdict
from math import log

# initialize counters
vocabWordCounts_spam = defaultdict(int)
vocabWordCounts_ham = defaultdict(int)
totalCount = 0
spamCount = 0
wordcounts = defaultdict(dict)

# initialize set of email ids in spam and ham
spam_ids = set()
ham_ids = set()

# loop over mappers
for i in range(1,len(sys.argv)):
    # read mapper output from each file
    with open (sys.argv[i], "r") as countfile:
        # loop over lines in mapper output
        for line in countfile.readlines():
            
            line = line.strip()
            linelist = line.split('\t')
            
            # add count of word on line to dictionary by email ids
            #    {email_ids: {word1: count_of_word1, word2: count_of_word2,...}}
            wordcounts[linelist[0]][linelist[2]]=int(linelist[3])
            
            # class label for email is in second column
            spam = int(linelist[1])
            
            # add email id to correct set (spam or ham) and increment word counts
            if spam==1:
                spam_ids.add(linelist[0])
                vocabWordCounts_spam[linelist[2]]+=int(linelist[3])
            else:
                ham_ids.add(linelist[0])
                vocabWordCounts_ham[linelist[2]]+=int(linelist[3])

# number of emails in spam, ham, and total
spamCount = len(spam_ids)
hamCount = len(ham_ids)
totalCount = spamCount+hamCount

# prior probabilities of spam and ham calculated as (number of spam emails)/(total number of emails)
prior_spam = 1.0*spamCount/totalCount
prior_ham = 1.0-prior_spam

# sum number of words in vocab in spam and ham
total_all_vocab_in_spam = 0
total_all_vocab_in_ham = 0
for vw in vocabWordCounts_spam.keys():
    total_all_vocab_in_spam+=vocabWordCounts_spam[vw]
for vw in vocabWordCounts_ham.keys():
    total_all_vocab_in_ham+=vocabWordCounts_ham[vw]

# calc the total number of words in vocab as union of words in spam and ham
vocab_spam = set(vocabWordCounts_spam.keys())
vocab_ham = set(vocabWordCounts_ham.keys())
vocab = vocab_spam.union(vocab_ham)
vocab_size = len(vocab)

# initialize dicts for storing conditional probabilities for each vocab word in spam/ham
pr_vw_given_spam = dict()
pr_vw_given_ham = dict()

'''
Conditional Probabilities calculated as:
pr_vw_given_spam = (total_vw_in_spam+1)/(total_all_vocab_in_spam + vocab_size)
pr_vw_given_ham = (total_vw_in_ham+1)/(total_all_vocab_in_ham + vocab_size)
'''

# loop over all vocab words and store conditional probabilities
for vw in vocab:
    if vw in vocabWordCounts_spam.keys():
        pr_vw_given_spam[vw]=1.0*(vocabWordCounts_spam[vw]+1)/(total_all_vocab_in_spam + vocab_size)
    else:
        pr_vw_given_spam[vw]=1.0*(0+1)/(total_all_vocab_in_spam + vocab_size)
    if vw in vocabWordCounts_ham.keys():
        pr_vw_given_ham[vw]=1.0*(vocabWordCounts_ham[vw]+1)/(total_all_vocab_in_ham + vocab_size)
    else:
        pr_vw_given_ham[vw]=1.0*(0+1)/(total_all_vocab_in_ham + vocab_size)

# loop over emails to calculate spam/ham scores and make NB prediction
for email_id,email in wordcounts.iteritems():
    
    # initialize scores as priors
    spam_score = log(prior_spam)
    ham_score = log(prior_ham)
    
    # loop over vocab words in email and counts
    for w,c in email.iteritems():
        # add count*conditional probability for each vocab word in email
        spam_score+=c*log(pr_vw_given_spam[w])
        ham_score+=c*log(pr_vw_given_ham[w])
    
    # true class
    spam = 1*(email_id in spam_ids)
    
    # predicted class
    predict = 1*(spam_score>ham_score)
    
    # output results as 'ID \t truth \t prediction'
    print email_id+'\t'+str(spam)+'\t'+str(predict)


Overwriting ./reducer.py

In [508]:
# Run mapper and reducer cells first to write mapper.py and reducer.py

def score(outputfile):
    '''
        Input: reducer Naive Bayes output file with format
                ID \t truth \t prediction
        Returns: Accuracy of classifier prediction
    '''
    # initialize count of examples and correct predictions
    n = 0
    correct=0
    
    # read lines from reducer output file
    with open (outputfile, "r") as outfile:
        for line in outfile.readlines():
            # increment count of number of examples
            n+=1
            
            # split line
            lineList = line.replace('\n','').split('\t')
            
            # true class is in second column
            truth=int(lineList[1])
            
            # predicted class is in third column
            predict=int(lineList[2])
            
            # increment number of correct examples if prediction matches truth
            correct += int(truth==predict)

    # return fraction of correct predictions (accuracy)
    return 1.0*correct/n
    
    
def hw1_4():
    
    # move to directory for problem and copy data and map reduce script into dir
    %cd ~/Documents/W261/hw1/hw1_4/
    !cp ../enronemail_1h.txt ./
    !cp ../pNaiveBayes.sh ./
    
    # set mapper and reducer permissions
    !chmod a+x ./mapper.py
    !chmod a+x ./reducer.py
    
    # pNaiveBayes.sh m wordlist
    #    with m = number of mappers
    #         wordlist = words in vocab
    !./pNaiveBayes.sh 4 "assistance valium enlargementWithATypo"
    
    # show output file
    !cat enronemail_1h_cleaned.txt.output
    
    # calculate and show accuracy
    outputfile = 'enronemail_1h_cleaned.txt.output'
    accuracy = score(outputfile)
    print 'Accuracy: ',accuracy
    
hw1_4()


/Users/davidadams/Documents/W261/hw1/hw1_4
0010.2003-12-18.GP	1	0
0010.2001-06-28.SA_and_HP	1	0
0001.2000-01-17.beck	0	0
0018.1999-12-14.kaminski	0	0
0005.1999-12-12.kaminski	0	0
0011.2001-06-29.SA_and_HP	1	0
0008.2004-08-01.BG	1	0
0009.1999-12-14.farmer	0	0
0017.2003-12-18.GP	1	0
0011.2001-06-28.SA_and_HP	1	0
0015.2001-07-05.SA_and_HP	1	0
0015.2001-02-12.kitchen	0	0
0009.2001-06-26.SA_and_HP	1	0
0017.1999-12-14.kaminski	0	0
0012.2000-01-17.beck	0	0
0003.2000-01-17.beck	0	0
0004.2001-06-12.SA_and_HP	1	0
0008.2001-06-12.SA_and_HP	1	0
0007.2001-02-09.kitchen	0	0
0016.2004-08-01.BG	1	0
0015.2000-06-09.lokay	0	0
0005.1999-12-14.farmer	0	0
0016.1999-12-15.farmer	0	0
0013.2004-08-01.BG	1	0
0005.2003-12-18.GP	1	0
0012.2001-02-09.kitchen	0	0
0003.2001-02-08.kitchen	0	0
0009.2001-02-09.kitchen	0	0
0006.2001-02-08.kitchen	0	0
0014.2003-12-19.GP	1	0
0010.1999-12-14.farmer	0	0
0010.2004-08-01.BG	1	0
0014.1999-12-14.kaminski	0	0
0006.1999-12-13.kaminski	0	0
0011.1999-12-14.farmer	0	0
0013.1999-12-14.kaminski	0	0
0001.2001-02-07.kitchen	0	0
0008.2001-02-09.kitchen	0	0
0007.2003-12-18.GP	1	0
0017.2004-08-02.BG	1	0
0014.2004-08-01.BG	1	0
0006.2003-12-18.GP	1	0
0016.2001-07-05.SA_and_HP	1	0
0008.2003-12-18.GP	1	0
0014.2001-07-04.SA_and_HP	1	0
0001.2001-04-02.williams	0	0
0012.2000-06-08.lokay	0	0
0014.1999-12-15.farmer	0	0
0009.2000-06-07.lokay	0	0
0001.1999-12-10.farmer	0	0
0008.2001-06-25.SA_and_HP	1	0
0017.2001-04-03.williams	0	0
0014.2001-02-12.kitchen	0	0
0016.2001-07-06.SA_and_HP	1	0
0015.1999-12-15.farmer	0	0
0009.1999-12-13.kaminski	0	0
0001.2000-06-06.lokay	0	0
0011.2004-08-01.BG	1	0
0004.2004-08-01.BG	1	0
0018.2003-12-18.GP	1	0
0002.1999-12-13.farmer	0	0
0016.2003-12-19.GP	1	0
0004.1999-12-14.farmer	0	0
0015.2003-12-19.GP	1	0
0006.2004-08-01.BG	1	0
0009.2003-12-18.GP	1	0
0007.1999-12-14.farmer	0	0
0005.2000-06-06.lokay	0	0
0010.1999-12-14.kaminski	0	0
0007.2000-01-17.beck	0	0
0003.1999-12-14.farmer	0	0
0003.2004-08-01.BG	1	0
0017.2004-08-01.BG	1	0
0013.2001-06-30.SA_and_HP	1	0
0003.1999-12-10.kaminski	0	0
0012.1999-12-14.farmer	0	0
0004.1999-12-10.kaminski	0	0
0018.2001-07-13.SA_and_HP	1	0
0002.2001-02-07.kitchen	0	0
0007.2004-08-01.BG	1	0
0012.1999-12-14.kaminski	0	0
0005.2001-06-23.SA_and_HP	1	0
0007.1999-12-13.kaminski	0	0
0017.2000-01-17.beck	0	0
0006.2001-06-25.SA_and_HP	1	0
0006.2001-04-03.williams	0	0
0005.2001-02-08.kitchen	0	0
0002.2003-12-18.GP	1	0
0003.2003-12-18.GP	1	0
0013.2001-04-03.williams	0	0
0004.2001-04-02.williams	0	0
0010.2001-02-09.kitchen	0	0
0001.1999-12-10.kaminski	0	0
0013.1999-12-14.farmer	0	0
0015.1999-12-14.kaminski	0	0
0012.2003-12-19.GP	1	0
0016.2001-02-12.kitchen	0	0
0002.2004-08-01.BG	1	0
0002.2001-05-25.SA_and_HP	1	0
0011.2003-12-18.GP	1	0
Accuracy:  0.56

In [509]:
'''
    HW1.5. Provide a mapper/reducer pair that, when executed by pNaiveBayes.sh
    will classify the email messages by all words present.
    To do so, make sure that

    - mapper.py counts all occurrences of all words, and
    - reducer.py performs a word-distribution-wide Naive Bayes classification.
'''

# make directory for problem and change to that dir
!mkdir ~/Documents/W261/hw1/hw1_5/
%cd ~/Documents/W261/hw1/hw1_5/
!cp ../enronemail_1h_cleaned.txt ./
!cp ../pNaiveBayes.sh ./


mkdir: /Users/davidadams/Documents/W261/hw1/hw1_5/: File exists
/Users/davidadams/Documents/W261/hw1/hw1_5

In [490]:
%%writefile ./mapper.py
#!/usr/bin/python
import sys
import re
from collections import defaultdict

# Data filename and vocab words passed in as command line arguments
filename = sys.argv[1]
vocabwords = sys.argv[2].lower().split(' ')

# regular expression for extracting word tokens from email content
WORD_RE = re.compile(r"[\w']+")

# Read emails in from file
with open (filename, "r") as myfile:
    # Loop over each email (one email per line)
    for line in myfile.readlines():
        
        # email content is in last column of input file
        linelist = line.split('\t')
        content = linelist[-1]
        # extract words from content
        words = WORD_RE.findall(content.lower())
        
        # loop over words in email and count each, store in dict {word: count,...}
        vocabWordsCount = defaultdict(int)
        for w in words:
            vocabWordsCount[w]+=1
        
        # output email ID, spam class, word, count for each word
        for k in vocabWordsCount.keys():
            print linelist[0]+'\t'+linelist[1]+'\t'+k+'\t'+str(vocabWordsCount[k])


Overwriting ./mapper.py

In [491]:
%%writefile ./reducer.py
#!/usr/bin/python

import sys
from collections import defaultdict
from math import log

# initialize counters
vocabWordCounts_spam = defaultdict(int)
vocabWordCounts_ham = defaultdict(int)
totalCount = 0
spamCount = 0
wordcounts = defaultdict(dict)

# initialize set of email ids in spam and ham
spam_ids = set()
ham_ids = set()

# loop over mappers
for i in range(1,len(sys.argv)):
    # read mapper output from each file
    with open (sys.argv[i], "r") as countfile:
        # loop over lines in mapper output
        for line in countfile.readlines():
            
            line = line.strip()
            linelist = line.split('\t')
            
            # add count of word on line to dictionary by email ids
            #    {email_ids: {word1: count_of_word1, word2: count_of_word2,...}}
            wordcounts[linelist[0]][linelist[2]]=int(linelist[3])
            
            # class label for email is in second column
            spam = int(linelist[1])
            
            # add email id to correct set (spam or ham) and increment word counts
            if spam==1:
                spam_ids.add(linelist[0])
                vocabWordCounts_spam[linelist[2]]+=int(linelist[3])
            else:
                ham_ids.add(linelist[0])
                vocabWordCounts_ham[linelist[2]]+=int(linelist[3])

# number of emails in spam, ham, and total
spamCount = len(spam_ids)
hamCount = len(ham_ids)
totalCount = spamCount+hamCount

# prior probabilities of spam and ham calculated as (number of spam emails)/(total number of emails)
prior_spam = 1.0*spamCount/totalCount
prior_ham = 1.0-prior_spam

# sum number of words in vocab in spam and ham
total_all_vocab_in_spam = 0
total_all_vocab_in_ham = 0
for vw in vocabWordCounts_spam.keys():
    total_all_vocab_in_spam+=vocabWordCounts_spam[vw]
for vw in vocabWordCounts_ham.keys():
    total_all_vocab_in_ham+=vocabWordCounts_ham[vw]
    
# calc the total number of words in vocab as union of words in spam and ham
vocab_spam = set(vocabWordCounts_spam.keys())
vocab_ham = set(vocabWordCounts_ham.keys())
vocab = vocab_spam.union(vocab_ham)
vocab_size = len(vocab)

# initialize dicts for storing conditional probabilities for each vocab word in spam/ham
pr_vw_given_spam = dict()
pr_vw_given_ham = dict()

'''
Conditional Probabilities calculated as:
pr_vw_given_spam = (total_vw_in_spam+1)/(total_all_vocab_in_spam + vocab_size)
pr_vw_given_ham = (total_vw_in_ham+1)/(total_all_vocab_in_ham + vocab_size)
'''

# loop over all vocab words and store conditional probabilities
for vw in vocab:
    if vw in vocabWordCounts_spam.keys():
        pr_vw_given_spam[vw]=1.0*(vocabWordCounts_spam[vw]+1)/(total_all_vocab_in_spam + vocab_size)
    else:
        pr_vw_given_spam[vw]=1.0*(0+1)/(total_all_vocab_in_spam + vocab_size)
    if vw in vocabWordCounts_ham.keys():
        pr_vw_given_ham[vw]=1.0*(vocabWordCounts_ham[vw]+1)/(total_all_vocab_in_ham + vocab_size)
    else:
        pr_vw_given_ham[vw]=1.0*(0+1)/(total_all_vocab_in_ham + vocab_size)

# loop over emails to calculate spam/ham scores and make NB prediction
for email_id,email in wordcounts.iteritems():
    
    # initialize scores as priors
    spam_score = log(prior_spam)
    ham_score = log(prior_ham)
    
    # loop over vocab words in email and counts
    for w,c in email.iteritems():
        # add count*conditional probability for each vocab word in email
        spam_score+=c*log(pr_vw_given_spam[w])
        ham_score+=c*log(pr_vw_given_ham[w])
        
    # true class
    spam = 1*(email_id in spam_ids)
    
    # predicted class
    predict = 1*(spam_score>ham_score)
    
    # output results as 'ID \t truth \t prediction'
    print email_id+'\t'+str(spam)+'\t'+str(predict)


Overwriting ./reducer.py

In [510]:
# Run mapper and reducer cells first to write mapper.py and reducer.py

def score(outputfile):
    '''
        Input: reducer Naive Bayes output file with format
                ID \t truth \t prediction
        Returns: Accuracy of classifier prediction
    '''
    # initialize count of examples and correct predictions
    n = 0
    correct=0
    
    # read lines from reducer output file
    with open (outputfile, "r") as outfile:
        for line in outfile.readlines():
            # increment count of number of examples
            n+=1
            
            # split line
            lineList = line.replace('\n','').split('\t')
            
            # true class is in second column
            truth=int(lineList[1])
            
            # predicted class is in third column
            predict=int(lineList[2])
            
            # increment number of correct examples if prediction matches truth
            correct += int(truth==predict)

    # return fraction of correct predictions (accuracy)
    return 1.0*correct/n
    
def hw1_5():
    
    # move to directory for problem and copy data and map reduce script into dir
    %cd ~/Documents/W261/hw1/hw1_5/
    !cp ../enronemail_1h.txt ./
    !cp ../pNaiveBayes.sh ./
    
    # set mapper and reducer permissions
    !chmod a+x ./mapper.py
    !chmod a+x ./reducer.py
    
    # pNaiveBayes.sh m wordlist
    #    with m = number of mappers
    #         wordlist = words in vocab
    !./pNaiveBayes.sh 4 "*"
    
    # show output file
    !cat enronemail_1h_cleaned.txt.output
    
    # calculate and show accuracy
    outputfile = 'enronemail_1h_cleaned.txt.output'
    accuracy = score(outputfile)
    print 'Multinomial Naive Bayes Classifier Accuracy: ',accuracy
    
hw1_5()


/Users/davidadams/Documents/W261/hw1/hw1_5
0010.2003-12-18.GP	1	0
0010.2001-06-28.SA_and_HP	1	1
0001.2000-01-17.beck	0	0
0018.1999-12-14.kaminski	0	0
0005.1999-12-12.kaminski	0	0
0011.2001-06-29.SA_and_HP	1	1
0008.2004-08-01.BG	1	1
0009.1999-12-14.farmer	0	0
0017.2003-12-18.GP	1	1
0011.2001-06-28.SA_and_HP	1	1
0015.2001-07-05.SA_and_HP	1	1
0015.2001-02-12.kitchen	0	0
0009.2001-06-26.SA_and_HP	1	1
0017.1999-12-14.kaminski	0	0
0012.2000-01-17.beck	0	0
0003.2000-01-17.beck	0	0
0004.2001-06-12.SA_and_HP	1	1
0008.2001-06-12.SA_and_HP	1	1
0007.2001-02-09.kitchen	0	0
0016.2004-08-01.BG	1	1
0015.2000-06-09.lokay	0	0
0005.1999-12-14.farmer	0	0
0016.1999-12-15.farmer	0	0
0013.2004-08-01.BG	1	1
0005.2003-12-18.GP	1	1
0012.2001-02-09.kitchen	0	0
0003.2001-02-08.kitchen	0	0
0009.2001-02-09.kitchen	0	0
0006.2001-02-08.kitchen	0	0
0014.2003-12-19.GP	1	1
0010.1999-12-14.farmer	0	0
0010.2004-08-01.BG	1	1
0014.1999-12-14.kaminski	0	0
0006.1999-12-13.kaminski	0	0
0011.1999-12-14.farmer	0	0
0013.1999-12-14.kaminski	0	0
0001.2001-02-07.kitchen	0	0
0008.2001-02-09.kitchen	0	0
0007.2003-12-18.GP	1	1
0017.2004-08-02.BG	1	1
0014.2004-08-01.BG	1	1
0006.2003-12-18.GP	1	1
0016.2001-07-05.SA_and_HP	1	1
0008.2003-12-18.GP	1	1
0014.2001-07-04.SA_and_HP	1	1
0001.2001-04-02.williams	0	0
0012.2000-06-08.lokay	0	0
0014.1999-12-15.farmer	0	0
0009.2000-06-07.lokay	0	0
0001.1999-12-10.farmer	0	0
0008.2001-06-25.SA_and_HP	1	1
0017.2001-04-03.williams	0	0
0014.2001-02-12.kitchen	0	0
0016.2001-07-06.SA_and_HP	1	1
0015.1999-12-15.farmer	0	0
0009.1999-12-13.kaminski	0	0
0001.2000-06-06.lokay	0	0
0011.2004-08-01.BG	1	1
0004.2004-08-01.BG	1	1
0018.2003-12-18.GP	1	1
0002.1999-12-13.farmer	0	0
0016.2003-12-19.GP	1	1
0004.1999-12-14.farmer	0	0
0015.2003-12-19.GP	1	1
0006.2004-08-01.BG	1	1
0009.2003-12-18.GP	1	1
0007.1999-12-14.farmer	0	0
0005.2000-06-06.lokay	0	0
0010.1999-12-14.kaminski	0	0
0007.2000-01-17.beck	0	0
0003.1999-12-14.farmer	0	0
0003.2004-08-01.BG	1	1
0017.2004-08-01.BG	1	1
0013.2001-06-30.SA_and_HP	1	1
0003.1999-12-10.kaminski	0	0
0012.1999-12-14.farmer	0	0
0004.1999-12-10.kaminski	0	0
0018.2001-07-13.SA_and_HP	1	1
0002.2001-02-07.kitchen	0	0
0007.2004-08-01.BG	1	1
0012.1999-12-14.kaminski	0	0
0005.2001-06-23.SA_and_HP	1	1
0007.1999-12-13.kaminski	0	0
0017.2000-01-17.beck	0	0
0006.2001-06-25.SA_and_HP	1	1
0006.2001-04-03.williams	0	0
0005.2001-02-08.kitchen	0	0
0002.2003-12-18.GP	1	1
0003.2003-12-18.GP	1	1
0013.2001-04-03.williams	0	0
0004.2001-04-02.williams	0	0
0010.2001-02-09.kitchen	0	0
0001.1999-12-10.kaminski	0	1
0013.1999-12-14.farmer	0	0
0015.1999-12-14.kaminski	0	0
0012.2003-12-19.GP	1	1
0016.2001-02-12.kitchen	0	0
0002.2004-08-01.BG	1	1
0002.2001-05-25.SA_and_HP	1	1
0011.2003-12-18.GP	1	1
Multinomial Naive Bayes Classifier Accuracy:  0.98

In [478]:
'''

    HW1.6 Benchmark your code with the Python SciKit-Learn implementation of Naive Bayes

    Training error = misclassification rate with respect to a training set. 
    It is more formally defined here:

        Let DF represent the training set in the following:
        Err(Model, DF) = |{(X, c(X)) ∈ DF : c(X) != Model(X)}| / |DF|

        Where || denotes set cardinality; 
        c(X) denotes the class of the tuple X in DF; 
        and Model(X) denotes the class inferred by the Model “Model”

    In this exercise, please complete the following:

    — Run the Multinomial Naive Bayes algorithm (using default settings) 
        from SciKit-Learn over the same training data used in HW1.5 and 
        report the Training error (please note some data preparation might 
        be needed to get the Multinomial Naive Bayes algorithm from SciKit-Learn 
        to run over this dataset)
    — Run the Bernoulli Naive Bayes algorithm from SciKit-Learn (using default settings) 
        over the same training data used in HW1.5 and report the Training error 
    — Run the Multinomial Naive Bayes algorithm you developed for HW1.5 
        over the same data used HW1.5 and report the Training error 
    — Please prepare a table to present your results
    — Explain/justify any differences in terms of training error rates 
        over the dataset in HW1.5 between your Multinomial Naive Bayes 
        implementation (in Map Reduce) versus the Multinomial Naive Bayes 
        implementation in SciKit-Learn
    — Discuss the performance differences in terms of training error rates 
        over the dataset in HW1.5 between the  Multinomial Naive Bayes 
        implementation in SciKit-Learn with the  Bernoulli Naive Bayes 
        implementation in SciKit-Learn
'''


/Users/davidadams/Documents/W261/hw1/hw1_6
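
As a side note, the training-error definition quoted above can be written directly as a small helper. The training_error function below is an illustrative addition, not part of the provided assignment code; it simply returns 1 minus the accuracy that model.score reports.

# compute the fraction of training examples the fitted model misclassifies
def training_error(model, X, y):
    predictions = model.predict(X)
    wrong = sum(1 for truth, pred in zip(y, predictions) if truth != pred)
    return float(wrong) / len(y)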

In [501]:
def hw1_6():
    
    import numpy as np
    from sklearn.naive_bayes import MultinomialNB, BernoulliNB
    from sklearn.feature_extraction.text import CountVectorizer
    import sys
    import re
    from collections import defaultdict

    # make directory for problem and change to that dir
    !mkdir ~/Documents/W261/hw1/hw1_6/
    %cd ~/Documents/W261/hw1/hw1_6/
    !cp ../enronemail_1h_cleaned.txt ./


    # regular expression for extracting word tokens from email content
    WORD_RE = re.compile(r"[\w']+")

    # initialize lists for storing training data (email contents), 
    #        email ids, email class labels
    train_ids = []
    train_data = []
    train_labels = []
    
    # Read emails in from file
    filename = 'enronemail_1h_cleaned.txt'
    with open (filename, "r") as myfile:
        # Loop over each email (one email per line)
        for line in myfile.readlines():
            
            # store email ids, contents, and spam labels
            linelist = line.split('\t')
            email_id = linelist[0]
            content = linelist[-1]
            label = linelist[1]

            train_ids.append(email_id)
            train_data.append(content)            
            train_labels.append(label)


    # Construct feature matrix for training data using CountVectorizer
    cv_train = CountVectorizer()
    tdf_train = cv_train.fit_transform(train_data)
    
    '''
    # loop through words in training data and include those with greater than 3 characters
    vocab = []
    for w in cv_train.vocabulary_.keys():
        if len(w)>3:
            vocab.append(w)
    
    # Construct feature matrix for training data using CountVectorizer
    #      with vocab of words of length greater than 3
    cv_train = CountVectorizer(vocabulary=vocab)
    tdf_train = cv_train.fit_transform(train_data)
    '''
    # Multinomial Naive Bayes classifier
    print "\nMultinomial Naive Bayes"
    multNB = MultinomialNB()
    multNB.fit(tdf_train, train_labels)
    print("accuracy: %0.2f " % multNB.score(tdf_train, train_labels))

    # Bernoulli Naive Bayes classifier
    print "\nBernoulli Naive Bayes"
    binNB = BernoulliNB()
    binNB.fit(tdf_train, train_labels)
    print("accuracy: %0.2f " % binNB.score(tdf_train, train_labels))
    
hw1_6()


mkdir: /Users/davidadams/Documents/W261/hw1/hw1_6/: File exists
/Users/davidadams/Documents/W261/hw1/hw1_6

Multinomial Naive Bayes
accuracy: 0.98 

Bernoulli Naive Bayes
accuracy: 0.81 

Results (training error = 1 - training accuracy):

    Classifier                            Training accuracy    Training error
    MapReduce multinomial NB (HW1.5)      0.98                 0.02
    sklearn MultinomialNB                 0.98                 0.02
    sklearn BernoulliNB                   0.81                 0.19

The training error is the same for the sklearn implementation of multinomial Naive Bayes (NB) and the MapReduce implementation from HW1.5 (0.02 in both cases). This is expected, since both implement the same multinomial model with add-one smoothing over the same training data; any remaining differences would come from details such as tokenization rather than from the algorithm itself.

The sklearn multinomial NB classifier has a lower training error than the sklearn Bernoulli NB classifier because the multinomial model uses the count of each word in an email, whereas the Bernoulli model uses only the presence or absence of each word and therefore discards frequency information.
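
To make the count-versus-presence distinction concrete, the toy example below (an illustrative sketch assuming scikit-learn is available, not part of the homework data) fits both classifiers on a tiny count matrix. BernoulliNB binarizes its features by default (binarize=0.0), so a word that occurs five times carries no more weight than a word that occurs once, while MultinomialNB weights the evidence by the counts; on skewed counts like these the two models can disagree.

import numpy as np
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

# toy count matrix: two "words" (columns) in three "documents" (rows)
X = np.array([[10, 0],   # class 1 document, dominated by word A
              [1,  0],   # class 0 document with a single occurrence of word A
              [0,  3]])  # class 0 document containing only word B
y = [1, 0, 0]

test_doc = np.array([[5, 0]])  # unseen document in which word A occurs 5 times

print(MultinomialNB().fit(X, y).predict(test_doc))  # uses the count of 5
print(BernoulliNB().fit(X, y).predict(test_doc))    # sees only "word A is present"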

