HW3.0:
What is a merge sort? Where is it used in Hadoop?
What is a combiner function in the context of Hadoop? Give an example where it can be used and justify why it should be used in the context of this problem.
What is the Hadoop shuffle?
What is the Apriori algorithm? Describe an example use in your domain of expertise. Define confidence and lift.
Merge sort: Merge sort combines two or more sorted lists into a single sorted list. In Hadoop, merge sort is used during the shuffle to merge the sorted map outputs into a single key-sorted input stream for each reducer.
Combiner: A combiner is a function that aggregates mapper output locally, in memory, before it is sent to the reducers. Hadoop may apply the combiner zero or more times, on either the map side or the reduce side, so it must be associative and commutative. Using a combiner reduces the communication required between mappers and reducers, and it helps when mappers generate many key,value pairs with the same keys. For example, in the word count problem, a combiner can sum word counts locally so that each mapper sends one (word, count) pair per word rather than one pair per occurrence, greatly reducing the number of pairs that must be sorted and sent to the reducers.
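As an illustrative sketch (not one of the scripts used below), a word-count combiner for Hadoop streaming can reuse the same summing logic as the reducer; because it only sums, it can safely be applied zero or more times:
#!/usr/bin/python
# hypothetical combiner for word count: sum the 1s emitted by the mapper,
# so each map task ships one (word, count) pair per word instead of one
# pair per occurrence
import sys
from collections import defaultdict

counts = defaultdict(int)
for line in sys.stdin:
    word, c = line.strip().split('\t')
    counts[word] += int(c)
for word, c in counts.iteritems():
    print '%s\t%d' % (word, c)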
Hadoop Shuffle: The Hadoop shuffle is the process of moving key,value pairs from the mappers to the reducers. It includes partitioning the pairs by key, sorting them, optionally applying the combiner, and merging them into each reducer's input stream.
Apriori Algorithm: The Apriori algorithm finds frequent itemsets using the observation that every subset of a frequent itemset must itself be frequent. To find frequent itemsets of length k, it starts from the frequent itemsets of length k-1, generates candidate itemsets of length k from them, and prunes the candidates to retain only those that meet the minimum support. Product recommendation is a classic application of the Apriori algorithm. A product recommendation example in environmental engineering could be an equipment supplier identifying components that are frequently purchased together and recommending those items when a customer is looking to purchase one of them.
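A minimal sketch of one candidate-generation-and-prune iteration (illustrative only; apriori_step, frequent_prev, transactions, and min_support are assumed names, with itemsets represented as frozensets):
from itertools import combinations

def apriori_step(frequent_prev, transactions, min_support, k):
    # join frequent (k-1)-itemsets into candidate k-itemsets, keeping only
    # candidates all of whose (k-1)-subsets are themselves frequent
    candidates = set()
    for a in frequent_prev:
        for b in frequent_prev:
            union = a | b
            if len(union) == k and all(frozenset(s) in frequent_prev
                                       for s in combinations(union, k - 1)):
                candidates.add(union)
    # count the candidates in the transactions and keep those meeting support
    counts = dict((c, sum(1 for t in transactions if c <= t)) for c in candidates)
    return set(c for c, n in counts.items() if n >= min_support)
For pairs (k = 2), frequent_prev would be the frequent single items, each wrapped as a one-element frozenset.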
Confidence is the proportion of baskets containing a rule's left-hand itemset that also contain the added item. For example, for the rule {milk, diaper} -> {beer}, the confidence equals the number of baskets that contain milk, diapers, and beer divided by the number of baskets that contain milk and diapers. Confidence reflects the certainty of the discovered pattern.
Lift measures the dependence between the rule's left-hand itemset and the added item. It is calculated as the confidence of the rule divided by the probability (support fraction) of the added item in the data set; a lift greater than 1 means the items occur together more often than they would if they were independent.
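In formula form, for a rule A ⇒ B over a set of baskets: confidence(A ⇒ B) = count(A ∪ B) / count(A), and lift(A ⇒ B) = confidence(A ⇒ B) / P(B). As a made-up illustration: if 1,000 baskets contain {milk, diaper}, 200 of those also contain beer, and beer appears in 5% of all baskets, then confidence = 200/1000 = 0.2 and lift = 0.2/0.05 = 4, i.e. beer is four times as likely in those baskets as in the data set overall.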
HW3.1.: Online browsing behavior dataset
Use the online browsing behavior dataset at:
https://www.dropbox.com/s/zlfyiwa70poqg74/ProductPurchaseData.txt?dl=0
Each line in this dataset represents a browsing session of a customer.
On each line, each string of 8 characters represents the id of an item browsed during that session. The items are separated by spaces.
Do some exploratory data analysis of this dataset. Report your findings such as number of unique products; largest basket, etc. using Hadoop Map-Reduce.
In [4]:
'''
HW3.1.: Exploratory data analysis
Do some exploratory data analysis of this dataset.
Report your findings such as number of unique products;
largest basket, etc. using Hadoop Map-Reduce.
'''
# make directory for problem and change to that dir
!mkdir ~/Documents/W261/hw3/hw3_1/
%cd ~/Documents/W261/hw3/hw3_1/
In [5]:
'''
Find number of unique products and
distribution of product frequencies
'''
!mkdir ~/Documents/W261/hw3/hw3_1/hw3_1_1/
%cd ~/Documents/W261/hw3/hw3_1/hw3_1_1/
In [25]:
%%writefile mapper_countitems.py
#!/usr/bin/python
import sys
# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into item ids
    items = line.split(' ')
    # write the results to STDOUT (standard output);
    # tab-delimited: item <tab> 1
    for i in items:
        print '%s\t%d' % (i, 1)
In [26]:
%%writefile reducer_countitems.py
#!/usr/bin/python
#from operator import itemgetter
import sys
from collections import defaultdict
counts = defaultdict(int)
# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from the mapper
    item, c = line.split('\t')
    # convert count (currently a string) to int
    try:
        c = int(c)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue
    counts[item] += c
# emit the total count for each unique item
for item, count in counts.iteritems():
    print '%s\t%d' % (item, count)
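As an optional local sanity check (an assumed step, not required by the assignment), the two scripts can be piped together from the shell before submitting the streaming job; sort stands in for the Hadoop shuffle:
!cat ~/Documents/W261/hw3/ProductPurchaseData.txt | python mapper_countitems.py | sort | python reducer_countitems.py | head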
In [27]:
%%writefile mapper_sortitemcounts.py
#!/usr/bin/python
import sys
# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse item and count from the count job's output
    item, count = line.split('\t')
    count = int(count)
    # write count first so Hadoop sorts by count;
    # tab-delimited
    print '%d\t%s' % (count, item)
In [28]:
%%writefile reducer_sortitemcounts.py
#!/usr/bin/python
#from operator import itemgetter
import sys
# this reducer just passes the sorted counts through,
# swapping the fields back to item <tab> count
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the count-first output of the mapper
    count, item = line.split('\t')
    count = int(count)
    print '%s\t%d' % (item, count)
In [32]:
# Run mapper and reducer cells first to write mapper.py and reducer.py
def hw3_1():
    def problemsetup():
        # mkdir on hdfs; this dir has to match the machine's
        # local user account (davidadams)
        !hdfs dfs -mkdir -p /user/davidadams
        # change to problem dir and copy data into dir
        %cd ~/Documents/W261/hw3/hw3_1/hw3_1_1
        !cp ~/Documents/W261/hw3/ProductPurchaseData.txt ./
        # put data file into HDFS
        !hdfs dfs -rm ProductPurchaseData.txt
        !hdfs dfs -put ProductPurchaseData.txt /user/davidadams
        return None

    def runhadoop():
        # run Hadoop streaming mapreduce to count items in input dataset
        !hdfs dfs -rm -r hw3_1_countitems_Output
        !hadoop jar ~/Documents/W261/hw2/hadoop-*streaming*.jar -mapper mapper_countitems.py -reducer reducer_countitems.py -input ProductPurchaseData.txt -output hw3_1_countitems_Output
        # show item counts output file
        print '\n================================================='
        !hdfs dfs -cat hw3_1_countitems_Output/part-00000
        # run Hadoop streaming mapreduce to sort item counts numerically
        !hdfs dfs -rm -r hw3_1_sorteditemcounts_Output
        !hadoop jar ~/Documents/W261/hw2/hadoop-*streaming*.jar -D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator -D mapred.text.key.comparator.options=-n -mapper mapper_sortitemcounts.py -reducer reducer_sortitemcounts.py -input hw3_1_countitems_Output/part-00000 -output hw3_1_sorteditemcounts_Output
        # show sorted item counts output file
        print '\n================================================='
        !hdfs dfs -cat hw3_1_sorteditemcounts_Output/part-00000
        return None

    def countproducts(productlistfile):
        '''
        Input: sorted item list file from reducer output
        Returns: number of unique products
        '''
        # initialize count of unique products
        numproducts = 0
        # read lines from reducer output file (one product per line)
        with open(productlistfile, "r") as outfile:
            for line in outfile.readlines():
                numproducts += 1
        return numproducts

    problemsetup()
    runhadoop()
    # calculate number of unique products
    !hdfs dfs -cat hw3_1_sorteditemcounts_Output/part-00000 > 'productlist.txt'
    productlistfile = 'productlist.txt'
    numproducts = countproducts(productlistfile)
    print '\n================================================='
    print '\nNumber of unique products: ', numproducts
hw3_1()
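The driver above reports only the number of unique products; for the largest basket, a minimal additional streaming job could use a mapper like the sketch below (hypothetical; a single reducer would then keep the maximum of the emitted sizes):
#!/usr/bin/python
# hypothetical mapper for the largest-basket statistic: emit a constant key
# and the number of items browsed in each session
import sys
for line in sys.stdin:
    items = line.strip().split(' ')
    print 'basketsize\t%d' % len(items)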
Exploratory data analysis
HW3.2: Apriori Algorithm
Note: for this part the writeup will require a specific rule ordering but the program need not sort the output.
List the top 5 rules with corresponding confidence scores in decreasing order of confidence score for frequent (100< count) itemsets of size 2.
A rule is of the form:
(item1) ⇒ item2.
Fix the ordering of the rule lexicographically (left to right), and break ties in confidence (between rules, if any exist) by taking the first ones in lexicographically increasing order.
Use Hadoop MapReduce to complete this part of the assignment;
use a single mapper and single reducer; use a combiner if you think it will help and justify.
In [55]:
'''
HW3.2. (Computationally prohibitive but then again Hadoop can handle this)
Note: for this part the writeup will require a specific rule
ordering but the program need not sort the output.
List the top 5 rules with corresponding confidence scores
in decreasing order of confidence score
for frequent (100< count) itemsets of size 2.
A rule is of the form:
(item1) ⇒ item2.
Fix the ordering of the rule lexicographically (left to right),
and break ties in confidence (between rules, if any exist)
by taking the first ones in lexicographically increasing order.
Use Hadoop MapReduce to complete this part of the assignment;
use a single mapper and single reducer; use a combiner if you think it will help and justify.
'''
# make directory for problem and change to that dir
!mkdir ~/Documents/W261/hw3/hw3_2/
%cd ~/Documents/W261/hw3/hw3_2/
!cp ../ProductPurchaseData.txt ./
In [129]:
%%writefile ./mapper2.py
#!/usr/bin/python
import sys
# input comes from STDIN: each line is one basket of items
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into items
    items = line.split(' ')
    # loop over items in basket
    for item in items:
        # output (item, '*') with count 1 for counting each single item
        print str((item, '*')) + '\t' + str(1)
        # second loop over items in basket
        for item2 in items:
            if item2 > item:
                # output pairs in lexicographic order only,
                # so each pair is emitted once per basket
                print str((item, item2)) + '\t' + str(1)
In [130]:
%%writefile reducer2.py
#!/usr/bin/python
import sys
from ast import literal_eval
from collections import defaultdict

# items[item1][item2] = number of baskets containing the pair
items = dict()
# counts[item] = number of baskets containing the item
counts = defaultdict(int)

# input comes from STDIN
for line in sys.stdin:
    # remove leading/trailing white space
    line = line.strip()
    # mapper output key is (item1, item2) or (item1, '*'), value is a count
    pair, count = line.split('\t')
    pair = literal_eval(pair)
    count = int(count)
    if pair[0] not in items:
        items[pair[0]] = defaultdict(int)
    if pair[1] == '*':
        # increment count for the single item
        counts[pair[0]] += count
    else:
        # increment count for the pair
        items[pair[0]][pair[1]] += count

# keep only items above the minimum support (count > 100)
freqitems = dict()
freqitemcounts = dict()
for item, count in counts.iteritems():
    if count > 100:
        freqitems[item] = items[item]
        freqitemcounts[item] = count

# calculate confidence for rules (item) => item2 built from frequent items
confidence = dict()
for item, count in freqitemcounts.iteritems():
    # loop over items that appear in the same baskets as the frequent item
    for item2, paircount in freqitems[item].iteritems():
        confidence[(item, item2)] = 1.0 * paircount / count

# sort by decreasing confidence, breaking ties in lexicographic order
confsorted = sorted(confidence.items(), key=lambda kv: (-kv[1], kv[0]))

# output the top five rules
for c in confsorted[:5]:
    print c
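As with HW3.1, the pair of scripts can be smoke-tested locally (an assumed optional step) before running the streaming job:
!cat ~/Documents/W261/hw3/ProductPurchaseData.txt | python mapper2.py | sort | python reducer2.py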
In [131]:
# Run mapper and reducer cells first to write mapper.py and reducer.py
def hw3_2():
    def problemsetup():
        # mkdir on hdfs
        !hdfs dfs -mkdir -p /user/davidadams
        # change to problem dir and copy data into dir
        %cd ~/Documents/W261/hw3/hw3_2/
        !cp ~/Documents/W261/hw3/ProductPurchaseData.txt ./
        # put data file into HDFS
        !hdfs dfs -rm ProductPurchaseData.txt
        !hdfs dfs -put ProductPurchaseData.txt /user/davidadams
        return None

    def runhadoop():
        # run Hadoop streaming mapreduce to find the top-confidence rules
        !hdfs dfs -rm -r hw3_2Output
        !hadoop jar ~/Documents/W261/hw2/hadoop-*streaming*.jar -mapper mapper2.py -reducer reducer2.py -input ProductPurchaseData.txt -output hw3_2Output
        # show rules output file
        print '\n================================================='
        print 'Top 5 Rules with confidence values:\n'
        !hdfs dfs -cat hw3_2Output/part-00000
        return None

    problemsetup()
    runhadoop()
    return None
hw3_2()
HW3.4 Apriori Algorithm Conceptual Exercise
Suppose that you wished to perform the Apriori algorithm once again, though this time now with the goal of listing the top 5 rules with corresponding confidence scores in decreasing order of confidence score for itemsets of size 3 using Hadoop MapReduce.
A rule is now of the form:
(item1, item2) ⇒ item3
Recall that the Apriori algorithm is iterative for increasing itemset size, working off of the frequent itemsets of the previous size to explore ONLY the NECESSARY subset of a large combinatorial space.
Describe how you might design a framework to perform this exercise.
In particular, focus on the following:
— map-reduce steps required
— enumeration of item sets and filtering for frequent candidates
Finding frequent itemsets with 3 items would require multiple MapReduce steps. The first job is essentially the one above: its mapper emits single items and item pairs with counts, and its reducer totals those counts and eliminates any item or pair that does not meet the minimum support, leaving the frequent pairs. The second job works only off those frequent pairs: its mapper emits keys of the form ((item1, item2), item3) for candidate triples built from frequent pairs, plus the frequent pairs themselves for counting. Its reducer is analogous to the first one: it totals the counts, eliminates candidates that do not meet the minimum support, and then calculates confidence scores for the (item1, item2) ⇒ item3 rules, as sketched below.
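A minimal sketch of that second-pass mapper (illustrative only; it assumes the frequent pairs from the first pass are shipped to every map task as a tab-delimited file named freq_pairs.txt, e.g. via the streaming -files option):
#!/usr/bin/python
# hypothetical second-pass mapper: emit candidate triples whose component
# pairs are all frequent, plus a '*' record for counting each frequent pair
import sys
from itertools import combinations

# load the frequent pairs found in the first pass (assumed format: item1<tab>item2)
freq_pairs = set()
with open('freq_pairs.txt') as f:
    for fline in f:
        parts = fline.strip().split('\t')
        freq_pairs.add((parts[0], parts[1]))

for line in sys.stdin:
    items = sorted(set(line.strip().split(' ')))
    # count the frequent pairs themselves (needed for confidence denominators)
    for i1, i2 in combinations(items, 2):
        if (i1, i2) in freq_pairs:
            print str(((i1, i2), '*')) + '\t' + str(1)
    # Apriori prune: emit a triple only if all of its pairs are frequent
    for i1, i2, i3 in combinations(items, 3):
        if ((i1, i2) in freq_pairs and (i1, i3) in freq_pairs
                and (i2, i3) in freq_pairs):
            print str(((i1, i2), i3)) + '\t' + str(1)
The reducer would mirror reducer2.py, totaling the counts, applying the minimum support, and dividing each triple count by its pair count to obtain the confidence of (item1, item2) ⇒ item3.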
In [ ]: