HW3.0:
What is a merge sort? Where is it used in Hadoop?
What is a combiner function in the context of Hadoop? Give an example where it can be used and justify why it should be used in the context of this problem.
What is the Hadoop shuffle?
What is the Apriori algorithm? Describe an example use in your domain of expertise. Define confidence and lift.
Merge sort: Merge sort combines two or more sorted lists into a single sorted list. In Hadoop, merge sort is used during the shuffle to merge the sorted map outputs into a single key-sorted input stream for each reducer.
Combiner: A combiner is a function that aggregates mapper output locally, in memory, before it is sent to the reducers. Hadoop may apply the combiner zero or more times, on either the map side or the reduce side, so it must be associative and commutative. Using a combiner reduces the communication required between mappers and reducers, and it helps when mappers generate many key,value pairs with the same keys. For example, in the word count problem, a combiner can sum word counts locally so that each mapper sends one (word, count) pair per word rather than one pair per occurrence, greatly reducing the number of pairs that must be sorted and sent to the reducers.
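As an illustrative sketch (not one of the scripts used below), a word-count combiner for Hadoop streaming can reuse the same summing logic as the reducer; because it only sums, it can safely be applied zero or more times:
#!/usr/bin/python
# hypothetical combiner for word count: sum the 1s emitted by the mapper,
# so each map task ships one (word, count) pair per word instead of one
# pair per occurrence
import sys
from collections import defaultdict

counts = defaultdict(int)
for line in sys.stdin:
    word, c = line.strip().split('\t')
    counts[word] += int(c)
for word, c in counts.iteritems():
    print '%s\t%d' % (word, c)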
Hadoop Shuffle: The Hadoop shuffle is the process of moving key,value pairs from the mappers to the reducers. It includes partitioning the pairs by key, sorting them, optionally applying the combiner, and merging them into each reducer's input stream.
Apriori Algorithm: The Apriori algorithm finds frequent itemsets using the observation that every subset of a frequent itemset must itself be frequent. To find frequent itemsets of length k, it starts from the frequent itemsets of length k-1, generates candidate itemsets of length k from them, and prunes the candidates to retain only those that meet the minimum support. Product recommendation is a classic application of the Apriori algorithm. A product recommendation example in environmental engineering could be an equipment supplier identifying components that are frequently purchased together and recommending those items when a customer is looking to purchase one of them.
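A minimal sketch of one candidate-generation-and-prune iteration (illustrative only; apriori_step, frequent_prev, transactions, and min_support are assumed names, with itemsets represented as frozensets):
from itertools import combinations

def apriori_step(frequent_prev, transactions, min_support, k):
    # join frequent (k-1)-itemsets into candidate k-itemsets, keeping only
    # candidates all of whose (k-1)-subsets are themselves frequent
    candidates = set()
    for a in frequent_prev:
        for b in frequent_prev:
            union = a | b
            if len(union) == k and all(frozenset(s) in frequent_prev
                                       for s in combinations(union, k - 1)):
                candidates.add(union)
    # count the candidates in the transactions and keep those meeting support
    counts = dict((c, sum(1 for t in transactions if c <= t)) for c in candidates)
    return set(c for c, n in counts.items() if n >= min_support)
For pairs (k = 2), frequent_prev would be the frequent single items, each wrapped as a one-element frozenset.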
Confidence is the proportion of baskets containing a rule's left-hand itemset that also contain the added item. For example, for the rule {milk, diaper} -> {beer}, the confidence equals the number of baskets that contain milk, diapers, and beer divided by the number of baskets that contain milk and diapers. Confidence reflects the certainty of the discovered pattern.
Lift measures the dependence between the rule's left-hand itemset and the added item. It is calculated as the confidence of the rule divided by the probability (support fraction) of the added item in the data set; a lift greater than 1 means the items occur together more often than they would if they were independent.
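In formula form, for a rule A ⇒ B over a set of baskets: confidence(A ⇒ B) = count(A ∪ B) / count(A), and lift(A ⇒ B) = confidence(A ⇒ B) / P(B). As a made-up illustration: if 1,000 baskets contain {milk, diaper}, 200 of those also contain beer, and beer appears in 5% of all baskets, then confidence = 200/1000 = 0.2 and lift = 0.2/0.05 = 4, i.e. beer is four times as likely in those baskets as in the data set overall.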
HW3.1.: Online browsing behavior dataset
Use the online browsing behavior dataset at:
https://www.dropbox.com/s/zlfyiwa70poqg74/ProductPurchaseData.txt?dl=0
Each line in this dataset represents a browsing session of a customer.
On each line, each string of 8 characters represents the id of an item browsed during that session. The items are separated by spaces.
Do some exploratory data analysis of this dataset. Report your findings such as number of unique products; largest basket, etc. using Hadoop Map-Reduce.
In [4]:
'''
HW3.1.: Exploratory data analysis
Do some exploratory data analysis of this dataset.
Report your findings such as number of unique products;
largest basket, etc. using Hadoop Map-Reduce.
'''
# make directory for problem and change to that dir
!mkdir ~/Documents/W261/hw3/hw3_1/
%cd ~/Documents/W261/hw3/hw3_1/
In [5]:
'''
Find number of unique products and
distribution of product frequencies
'''
!mkdir ~/Documents/W261/hw3/hw3_1/hw3_1_1/
%cd ~/Documents/W261/hw3/hw3_1/hw3_1_1/
In [25]:
%%writefile mapper_countitems.py
#!/usr/bin/python
import sys
# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into item ids
    items = line.split(' ')
    # write the results to STDOUT (standard output);
    # tab-delimited: item <tab> 1
    for i in items:
        print '%s\t%d' % (i, 1)
In [26]:
%%writefile reducer_countitems.py
#!/usr/bin/python
#from operator import itemgetter
import sys
from collections import defaultdict
counts = defaultdict(int)
# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from the mapper
    item, c = line.split('\t')
    # convert count (currently a string) to int
    try:
        c = int(c)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue
    counts[item] += c
# emit the total count for each unique item
for item, count in counts.iteritems():
    print '%s\t%d' % (item, count)
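As an optional local sanity check (an assumed step, not required by the assignment), the two scripts can be piped together from the shell before submitting the streaming job; sort stands in for the Hadoop shuffle:
!cat ~/Documents/W261/hw3/ProductPurchaseData.txt | python mapper_countitems.py | sort | python reducer_countitems.py | head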
In [27]:
%%writefile mapper_sortitemcounts.py
#!/usr/bin/python
import sys
# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse item and count from the count job's output
    item, count = line.split('\t')
    count = int(count)
    # write count first so Hadoop sorts by count;
    # tab-delimited
    print '%d\t%s' % (count, item)
In [28]:
%%writefile reducer_sortitemcounts.py
#!/usr/bin/python
#from operator import itemgetter
import sys
# this reducer just passes the sorted counts through,
# swapping the fields back to item <tab> count
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the count-first output of the mapper
    count, item = line.split('\t')
    count = int(count)
    print '%s\t%d' % (item, count)
In [32]:
# Run mapper and reducer cells first to write mapper.py and reducer.py
def hw3_1():
    def problemsetup():
        # mkdir on hdfs; this dir has to match the machine's
        # local user account (davidadams)
        !hdfs dfs -mkdir -p /user/davidadams
        # change to problem dir and copy data into dir
        %cd ~/Documents/W261/hw3/hw3_1/hw3_1_1
        !cp ~/Documents/W261/hw3/ProductPurchaseData.txt ./
        # put data file into HDFS
        !hdfs dfs -rm ProductPurchaseData.txt
        !hdfs dfs -put ProductPurchaseData.txt /user/davidadams
        return None

    def runhadoop():
        # run Hadoop streaming mapreduce to count items in input dataset
        !hdfs dfs -rm -r hw3_1_countitems_Output
        !hadoop jar ~/Documents/W261/hw2/hadoop-*streaming*.jar -mapper mapper_countitems.py -reducer reducer_countitems.py -input ProductPurchaseData.txt -output hw3_1_countitems_Output
        # show item counts output file
        print '\n================================================='
        !hdfs dfs -cat hw3_1_countitems_Output/part-00000
        # run Hadoop streaming mapreduce to sort item counts numerically
        !hdfs dfs -rm -r hw3_1_sorteditemcounts_Output
        !hadoop jar ~/Documents/W261/hw2/hadoop-*streaming*.jar -D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator -D mapred.text.key.comparator.options=-n -mapper mapper_sortitemcounts.py -reducer reducer_sortitemcounts.py -input hw3_1_countitems_Output/part-00000 -output hw3_1_sorteditemcounts_Output
        # show sorted item counts output file
        print '\n================================================='
        !hdfs dfs -cat hw3_1_sorteditemcounts_Output/part-00000
        return None

    def countproducts(productlistfile):
        '''
        Input: sorted item list file from reducer output
        Returns: number of unique products
        '''
        # initialize count of unique products
        numproducts = 0
        # read lines from reducer output file (one product per line)
        with open(productlistfile, "r") as outfile:
            for line in outfile.readlines():
                numproducts += 1
        return numproducts

    problemsetup()
    runhadoop()
    # calculate number of unique products
    !hdfs dfs -cat hw3_1_sorteditemcounts_Output/part-00000 > 'productlist.txt'
    productlistfile = 'productlist.txt'
    numproducts = countproducts(productlistfile)
    print '\n================================================='
    print '\nNumber of unique products: ', numproducts
hw3_1()
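The driver above reports only the number of unique products; for the largest basket, a minimal additional streaming job could use a mapper like the sketch below (hypothetical; a single reducer would then keep the maximum of the emitted sizes):
#!/usr/bin/python
# hypothetical mapper for the largest-basket statistic: emit a constant key
# and the number of items browsed in each session
import sys
for line in sys.stdin:
    items = line.strip().split(' ')
    print 'basketsize\t%d' % len(items)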
Exploratory data analysis
HW3.2: Apriori Algorithm
Note: for this part the writeup will require a specific rule ordering but the program need not sort the output.
List the top 5 rules with corresponding confidence scores in decreasing order of confidence score for frequent (100< count) itemsets of size 2.
A rule is of the form:
(item1) ⇒ item2.
Fix the ordering of the rule lexicographically (left to right), and break ties in confidence (between rules, if any exist) by taking the first ones in lexicographically increasing order.
Use Hadoop MapReduce to complete this part of the assignment;
use a single mapper and single reducer; use a combiner if you think it will help and justify.
In [55]:
'''
HW3.2. (Computationally prohibitive but then again Hadoop can handle this)
Note: for this part the writeup will require a specific rule
ordering but the program need not sort the output.
List the top 5 rules with corresponding confidence scores
in decreasing order of confidence score
for frequent (100< count) itemsets of size 2.
A rule is of the form:
(item1) ⇒ item2.
Fix the ordering of the rule lexicographically (left to right),
and break ties in confidence (between rules, if any exist)
by taking the first ones in lexicographically increasing order.
Use Hadoop MapReduce to complete this part of the assignment;
use a single mapper and single reducer; use a combiner if you think it will help and justify.
'''
# make directory for problem and change to that dir
!mkdir ~/Documents/W261/hw3/hw3_2/
%cd ~/Documents/W261/hw3/hw3_2/
!cp ../ProductPurchaseData.txt ./
In [129]:
%%writefile ./mapper2.py
#!/usr/bin/python
import sys
# input comes from STDIN: each line is one basket of items
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into items
    items = line.split(' ')
    # loop over items in basket
    for item in items:
        # output (item, '*') with count 1 for counting each single item
        print str((item, '*')) + '\t' + str(1)
        # second loop over items in basket
        for item2 in items:
            if item2 > item:
                # output pairs in lexicographic order only,
                # so each pair is emitted once per basket
                print str((item, item2)) + '\t' + str(1)
In [130]:
%%writefile reducer2.py
#!/usr/bin/python
import sys
from ast import literal_eval
from collections import defaultdict

# items[item1][item2] = number of baskets containing the pair
items = dict()
# counts[item] = number of baskets containing the item
counts = defaultdict(int)

# input comes from STDIN
for line in sys.stdin:
    # remove leading/trailing white space
    line = line.strip()
    # mapper output key is (item1, item2) or (item1, '*'), value is a count
    pair, count = line.split('\t')
    pair = literal_eval(pair)
    count = int(count)
    if pair[0] not in items:
        items[pair[0]] = defaultdict(int)
    if pair[1] == '*':
        # increment count for the single item
        counts[pair[0]] += count
    else:
        # increment count for the pair
        items[pair[0]][pair[1]] += count

# keep only items above the minimum support (count > 100)
freqitems = dict()
freqitemcounts = dict()
for item, count in counts.iteritems():
    if count > 100:
        freqitems[item] = items[item]
        freqitemcounts[item] = count

# calculate confidence for rules (item) => item2 built from frequent items
confidence = dict()
for item, count in freqitemcounts.iteritems():
    # loop over items that appear in the same baskets as the frequent item
    for item2, paircount in freqitems[item].iteritems():
        confidence[(item, item2)] = 1.0 * paircount / count

# sort by decreasing confidence, breaking ties in lexicographic order
confsorted = sorted(confidence.items(), key=lambda kv: (-kv[1], kv[0]))

# output the top five rules
for c in confsorted[:5]:
    print c
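As with HW3.1, the pair of scripts can be smoke-tested locally (an assumed optional step) before running the streaming job:
!cat ~/Documents/W261/hw3/ProductPurchaseData.txt | python mapper2.py | sort | python reducer2.py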
In [131]:
# Run mapper and reducer cells first to write mapper.py and reducer.py
def hw3_2():
    def problemsetup():
        # mkdir on hdfs
        !hdfs dfs -mkdir -p /user/davidadams
        # change to problem dir and copy data into dir
        %cd ~/Documents/W261/hw3/hw3_2/
        !cp ~/Documents/W261/hw3/ProductPurchaseData.txt ./
        # put data file into HDFS
        !hdfs dfs -rm ProductPurchaseData.txt
        !hdfs dfs -put ProductPurchaseData.txt /user/davidadams
        return None

    def runhadoop():
        # run Hadoop streaming mapreduce to find the top-confidence rules
        !hdfs dfs -rm -r hw3_2Output
        !hadoop jar ~/Documents/W261/hw2/hadoop-*streaming*.jar -mapper mapper2.py -reducer reducer2.py -input ProductPurchaseData.txt -output hw3_2Output
        # show rules output file
        print '\n================================================='
        print 'Top 5 Rules with confidence values:\n'
        !hdfs dfs -cat hw3_2Output/part-00000
        return None

    problemsetup()
    runhadoop()
    return None
hw3_2()
HW3.4 Apriori Algorithm Conceptual Exercise
Suppose that you wished to perform the Apriori algorithm once again, though this time now with the goal of listing the top 5 rules with corresponding confidence scores in decreasing order of confidence score for itemsets of size 3 using Hadoop MapReduce.
A rule is now of the form:
(item1, item2) ⇒ item3
Recall that the Apriori algorithm is iterative for increasing itemset size, working off of the frequent itemsets of the previous size to explore ONLY the NECESSARY subset of a large combinatorial space.
Describe how you might design a framework to perform this exercise.
In particular, focus on the following:
— map-reduce steps required
— enumeration of item sets and filtering for frequent candidates
Finding frequent itemsets with 3 items would require multiple MapReduce steps. The first job is essentially the one above: its mapper emits single items and item pairs with counts, and its reducer totals those counts and eliminates any item or pair that does not meet the minimum support, leaving the frequent pairs. The second job works only off those frequent pairs: its mapper emits keys of the form ((item1, item2), item3) for candidate triples built from frequent pairs, plus the frequent pairs themselves for counting. Its reducer is analogous to the first one: it totals the counts, eliminates candidates that do not meet the minimum support, and then calculates confidence scores for the (item1, item2) ⇒ item3 rules, as sketched below.
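A minimal sketch of that second-pass mapper (illustrative only; it assumes the frequent pairs from the first pass are shipped to every map task as a tab-delimited file named freq_pairs.txt, e.g. via the streaming -files option):
#!/usr/bin/python
# hypothetical second-pass mapper: emit candidate triples whose component
# pairs are all frequent, plus a '*' record for counting each frequent pair
import sys
from itertools import combinations

# load the frequent pairs found in the first pass (assumed format: item1<tab>item2)
freq_pairs = set()
with open('freq_pairs.txt') as f:
    for fline in f:
        parts = fline.strip().split('\t')
        freq_pairs.add((parts[0], parts[1]))

for line in sys.stdin:
    items = sorted(set(line.strip().split(' ')))
    # count the frequent pairs themselves (needed for confidence denominators)
    for i1, i2 in combinations(items, 2):
        if (i1, i2) in freq_pairs:
            print str(((i1, i2), '*')) + '\t' + str(1)
    # Apriori prune: emit a triple only if all of its pairs are frequent
    for i1, i2, i3 in combinations(items, 3):
        if ((i1, i2) in freq_pairs and (i1, i3) in freq_pairs
                and (i2, i3) in freq_pairs):
            print str(((i1, i2), i3)) + '\t' + str(1)
The reducer would mirror reducer2.py, totaling the counts, applying the minimum support, and dividing each triple count by its pair count to obtain the confidence of (item1, item2) ⇒ item3.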
In [ ]: