DATASCI W261: Machine Learning at Scale

Katrina Adams

kradams@ischool.berkeley.edu

MIDS W261, Week 1

4 September 2015

This notebook provides a poor man's Hadoop (a parallel grep-and-count) through the command line and Python. Please insert the Python code yourself where indicated.

Map


In [9]:
%%writefile mapper.py
#!/usr/bin/python
import sys
import re

WORD_RE = re.compile(r"[\w']+")

# Usage: ./mapper.py <word> <filename>
findword = sys.argv[1].lower()
filename = sys.argv[2]

count = 0
with open(filename, "r") as myfile:
    for line in myfile:
        words = WORD_RE.findall(line.lower())
        # count each line at most once, mirroring `grep -i <word> <file> | wc -l`
        if findword in words:
            count += 1
print(count)


Writing mapper.py

In [10]:
!chmod a+x mapper.py
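
As a quick sanity check, the mapper can be run on its own (assuming License.txt, the file used in the run at the end of this notebook, is already in the working directory); it prints the number of lines containing the word:

In [ ]:
!./mapper.py COPYRIGHT License.txt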

Reduce


In [11]:
%%writefile reducer.py
#!/usr/bin/python
import sys

# Sum the per-chunk counts arriving on stdin, one integer per line
total = 0
for line in sys.stdin:
    total += int(line)
print(total)


Writing reducer.py

In [12]:
!chmod a+x reducer.py
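
A quick check of the reducer on its own: piping a few per-chunk counts through it should print their total (15 here):

In [ ]:
!printf "3\n5\n7\n" | ./reducer.py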

Write script to file


In [13]:
%%writefile pGrepCount.sh
#!/bin/bash
ORIGINAL_FILE=$1
FIND_WORD=$2
BLOCK_SIZE=$3
CHUNK_FILE_PREFIX=$ORIGINAL_FILE.split
SORTED_CHUNK_FILES=$CHUNK_FILE_PREFIX*.sorted  # defined but not used in this script
usage()
{
    echo "Parallel grep"
    echo "usage: pGrepCount filename word chunksize"
    echo "greps for $FIND_WORD in $ORIGINAL_FILE and counts the number of matching lines"
    echo "Note: $ORIGINAL_FILE will be split into chunks of up to $BLOCK_SIZE bytes each;"
    echo "each chunk will be grep-counted for $FIND_WORD in parallel"
}
if [ $# -ne 3 ]; then
    usage
    exit 1
fi
#SPLIT $ORIGINAL_FILE INTO CHUNKS
split -b $BLOCK_SIZE $ORIGINAL_FILE $CHUNK_FILE_PREFIX
#DISTRIBUTE: map each chunk in parallel as a background job
for file in $CHUNK_FILE_PREFIX*
do
    #grep -i $FIND_WORD $file | wc -l >$file.intermediateCount &
    ./mapper.py $FIND_WORD $file >$file.intermediateCount &
done
wait
#MERGE THE INTERMEDIATE COUNTS INTO A TOTAL
#numOfInstances=$(cat $CHUNK_FILE_PREFIX*.intermediateCount | cut -f 1 | paste -sd+ - | bc)
numOfInstances=$(cat $CHUNK_FILE_PREFIX*.intermediateCount | ./reducer.py)
echo "found [$numOfInstances] [$FIND_WORD] in the file [$ORIGINAL_FILE]"


Writing pGrepCount.sh
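
To see what the split step produces before running the whole script, split can be run by hand (a sketch, assuming License.txt is present; the exact chunk suffixes, e.g. .splitaa, depend on your split implementation):

In [ ]:
!split -b 4k License.txt License.txt.split && ls License.txt.split*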

Run the file


In [14]:
!chmod a+x pGrepCount.sh

Usage: pGrepCount filename word chunksize


In [22]:
!./pGrepCount.sh License.txt COPYRIGHT 4k


found [57] [COPYRIGHT] in the file [License.txt]
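
As a cross-check, the serial pipeline from the commented-out line in pGrepCount.sh can be run over the whole (unsplit) file; it should report the same line count (grep matches substrings while the mapper matches whole word tokens, so the two can differ in edge cases):

In [ ]:
!grep -i COPYRIGHT License.txt | wc -l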

In [ ]: