In [30]:
%%javascript
/**********************************************************************************************
Known MathJax issue with Chrome - a rounding issue adds a border to the right of MathJax markup
https://github.com/mathjax/MathJax/issues/1300
A quick hack to fix this based on stackoverflow discussions:
http://stackoverflow.com/questions/34277967/chrome-rendering-mathjax-equations-with-a-trailing-vertical-line
**********************************************************************************************/
$('.math>span').css("border-left-color","transparent")
In [31]:
%reload_ext autoreload
%autoreload 2
Course Lead: Dr James G. Shanahan (email Jimi via James.Shanahan AT gmail.com)
Name: Jason Sanchez (26989981)
Class: MIDS w261 (Section Fall 2016 Group 2)
Email: jason.sanchez@iSchool.Berkeley.edu
Week: 1
Due Time: HW is due the Tuesday of the following week by 8AM (West coast time). I.e., Tuesday, Sept 6, 2016 in the case of this homework.
Submit Date: Sept 6, 2016
MIDS UC Berkeley, Machine Learning at Scale DATSCIW261 ASSIGNMENT #1
Version 2016-09-2
=== INSTRUCTIONS for SUBMISSIONS === Follow the instructions for submissions carefully.
https://docs.google.com/forms/d/1ZOr9RnIe_A06AcZDB6K1mJN4vrLeSmS2PD6Xm3eOiis/viewform?usp=send_form
HW1 can be completed locally on your computer
W1.0.0 Prepare your bio and include it in this HW submission. Please limit to 100 words. Count the words in your bio and print the length of your bio (in terms of words) in a separate cell.
Fill in the following information [Optional]
%%writefile bio
Hi! My name is Jason Sanchez. I live in San Francisco. I started MIDS in Fall 2015 and expect to finish in Summer 2017. Why so long? I am a cofounder of a startup that is using data science to modernize the insurance industry.
After completing the first week of class, I am extremely hopeful that this class will define my Berkeley experience. This class seems difficult, but extremely worth the effort and I would like to feel comfortable tackling big data problems in production environments.
In [23]:
!wc -w < bio
Big data problems are ones that cannot be solved in a reasonable amount of time using one computer.
Processing trip data from telematics devices (i.e. systems that monitor driving behavior) in real time.
bias: The difference in predictions between the trained model and the true function that generated the data (not including the irreducible noise).
variance: The variability in the predictions of a model given different training data.
Simple models (i.e. models that produce simple decision boundaries) have high bias and adding a lot more data won't dramatically increase the accuracy of the model. Such models are said to underfit the data because they do not have the ability to use the data available to better infer the true decision boundary.
Complex models that produce highly complex decision boundaries tend to have high variance - especially in smaller datasets. This variance is reducible by adding more data. In fact, adding an infinite amount of data would drive the variance of any model down to zero (leaving only the bias).
The problem is we don't have an infinite amount of data and we must make a tradeoff between bias and variance. By plotting the cross validation error rates attained from moving from a simple model to a complex model (in terms of the hyperparameters of the model), we can see where the inflection point between bias and variance occurs.
When the model is simple, it misses a lot of the signal and thus the error is high. Increasing the complexity or flexibility of the model tends to increase the accuracy because more of the signal is being learned. Increasing the complexity of the model further can then lead to performance degradation because the model starts to learn patterns that only exist in the training dataset.
Therefore, in machine learning we must balance the bias and variance of a model given the size of the dataset available.
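A minimal sketch of how that cross-validation curve could be produced (assuming numpy, scikit-learn, and matplotlib are available; the polynomial-regression setup, noise level, and degree range are illustrative choices, not part of the assignment):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Noisy sample from an arbitrary smooth function (illustrative data only)
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=100)

# Sweep model complexity (polynomial degree) and record cross-validated error
degrees = range(1, 13)
cv_error = []
for d in degrees:
    model = make_pipeline(PolynomialFeatures(d), LinearRegression())
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    cv_error.append(-scores.mean())

plt.plot(list(degrees), cv_error, marker="o")
plt.xlabel("Polynomial degree (model complexity)")
plt.ylabel("5-fold cross-validated MSE")
plt.show()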
Write a program called alice_words.py that creates a text file named alice_words.txt containing an alphabetical listing of all the words, and the number of times each occurs, in the text version of Alice’s Adventures in Wonderland. (You can obtain a free plain text version of the book, along with many others, from here.) The first 10 lines of your output file should look something like this (the counts are not totally precise):
In [24]:
# check what the current directory is and change it if necessary using something like: %cd W261MasterDir
!pwd
In [1]:
ls
In [8]:
!curl 'http://www.gutenberg.org/cache/epub/11/pg11.txt' -o alice.txt
In [9]:
#display the first few lines
!head alice.txt
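With alice.txt downloaded, here is a minimal standalone sketch of what alice_words.py could look like (the tokenizing regex is one reasonable choice, mirroring the one used later in this notebook):
#!/usr/bin/python
# alice_words.py -- count the words in alice.txt and write an alphabetical
# listing of "word count" pairs to alice_words.txt
import re
from collections import Counter

with open("alice.txt") as f:
    words = re.findall(r"[a-z]+", f.read().lower())

counts = Counter(words)
with open("alice_words.txt", "w") as out:
    for word in sorted(counts):
        out.write("%s %d\n" % (word, counts[word]))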
In [5]:
#example of a regular expression to detect words in a string.
import re
line = """ 0017.2000-01-17.beck 0 global risk management operations " congratulations, sally!!! kk ----------------------forwarded by kathy"""
re.findall(r'[a-z]+', line.lower())[0:10]
Out[5]:
wordCounts={}
This is one way to do word counting, but not the best. A defaultdict is like a regular dictionary, except that when you try to look up a key it doesn’t contain, it first adds a value for it using a zero-argument function you provided when you created it. In order to use defaultdicts, you have to import them.
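For example, defaultdict(int) uses int() (which returns 0) as that zero-argument function, so a missing key silently starts at 0:
from collections import defaultdict

counts = defaultdict(int)    # int() supplies the default value 0
counts["alice"] += 1         # no KeyError: "alice" starts at 0, then becomes 1
print(counts["alice"])       # 1
print(counts["rabbit"])      # 0 -- the lookup itself inserts the default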
In [6]:
# Here is an example of word counting with a defaultdict (a dictionary structure with a
# nice default behaviour when a key does not exist in the dictionary)
import re
from collections import defaultdict
line = """ 0017.2000-01-17.beck 0 global risk management operations " congratulations, sally!!! kk ----------------------forwarded by kathy kokas/corp/enron on 01/17/2000 08:08 pm--------------------------- from: rick causey 01/17/2000 06:04 pm sent by: enron announcements to: all enron worldwide cc: subject: global risk management operations recognizing enron , s increasing worldwide presence in the wholesale energy business and the need to insure outstanding internal controls for all of our risk management activities, regardless of location, a global risk management operations function has been created under the direction of sally w. beck, vice president. in this role, sally will report to rick causey, executive vice president and chief accounting officer. sally , s responsibilities with regard to global risk management operations will mirror those of other recently created enron global functions. in this role, sally will work closely with all enron geographic regions and wholesale companies to insure that each entity receives individualized regional support while also focusing on the following global responsibilities: 1. enhance communication among risk management operations professionals. 2. assure the proliferation of best operational practices around the globe. 3. facilitate the allocation of human resources. 4. provide training for risk management operations personnel. 5. coordinate user requirements for shared operational systems. 6. oversee the creation of a global internal control audit plan for risk management activities. 7. establish procedures for opening new risk management operations offices and create key benchmarks for measuring on-going risk controls. each regional operations team will continue its direct reporting relationship within its business unit, and will collaborate with sally in the delivery of these critical items. the houston-based risk management operations team under sue frusco , s leadership, which currently supports risk management activities for south america and australia, will also report directly to sally. sally retains her role as vice president of energy operations for enron north america, reporting to the ena office of the chairman. she has been in her current role over energy operations since 1997, where she manages risk consolidation and reporting, risk management administration, physical product delivery, confirmations and cash management for ena , s physical commodity trading, energy derivatives trading and financial products trading. sally has been with enron since 1992, when she joined the company as a manager in global credit. prior to joining enron, sally had four years experience as a commercial banker and spent seven years as a registered securities principal with a regional investment banking firm. she also owned and managed a retail business for several years. please join me in supporting sally in this additional coordination role for global risk management operations."""
wordCounts = defaultdict(int)
for word in re.findall(r'[a-z]+', line.lower()):
    #if word in ["a"]:
    #    print(word, "\n")
    wordCounts[word] += 1
for key in sorted(wordCounts)[0:10]:
    print(key, wordCounts[key])
In [433]:
with open("alice.txt", "r") as alice:
print(len(re.findall(r'alice+', alice.read().lower())))
At first glance the above result doesn't make sense in light of the result at the bottom: the counts differ because the first cell lowercases the whole text, so the pattern also matches every capitalized 'Alice', while the last cell only matches occurrences that are already lowercase.
In [435]:
with open("alice.txt", "r") as alice:
print(len(re.findall(r'[Aa]lice+', alice.read())))
In [440]:
with open("alice.txt", "r") as alice:
print(len(re.findall(r'alice+', alice.read())))
In [446]:
%%writefile pWordCount.sh
#!/bin/bash
## pWordCount.sh
## Author: James G. Shanahan
## Usage: pWordCount.sh m wordlist testFile.txt
## Input:
## m = number of processes (maps), e.g., 4
## wordlist = a space-separated list of words in quotes, e.g., "the and of"; if set to "*", then all words are used
## inputFile = a text input file
##
## Instructions: Read this script and its comments closely.
## Do your best to understand the purpose of each command,
## and focus on how arguments are supplied to mapper.py/reducer.py,
## as this will determine how the python scripts take input.
## When you are comfortable with the unix code below,
## answer the questions on the LMS for HW1 about the starter code.
usage()
{
echo ERROR: No arguments supplied
echo
echo To run use
echo " pWordCount.sh m wordlist inputFile"
echo Input:
echo " number of processes/maps, EG, 4"
echo " wordlist = a space-separated list of words in quotes, e.g., 'the and of'"
echo " inputFile = a text input file"
}
if [ $# -eq 0 ] # I removed a hash after the $ sign
then
usage
exit 1
fi
## collect user input
m=$1 ## the number of parallel processes (maps) to run
wordlist=$2 ## if set to "*", then all words are used
## a text file
data=$3
## Clean up remaining chunks from a previous run
rm -f $data.chunk.*
## 'wc' determines the number of lines in the data
## 'perl -pe' regex strips the piped wc output to a number
linesindata=`wc -l $data | perl -pe 's/^.*?(\d+).*?$/$1/'`
## determine the lines per chunk for the desired number of processes
linesinchunk=`echo "$linesindata/$m+1" | bc`
## split the original file into chunks by line
split -l $linesinchunk $data $data.chunk.
## assign python mappers (mapper.py) to the chunks of data
## and emit their output to temporary files
for datachunk in $data.chunk.*; do
## feed word list to the python mapper here and redirect STDOUT to a temporary file on disk
####
####
./mapper.py "$wordlist" <$datachunk > $datachunk.counts &
####
####
done
## wait for the mappers to finish their work
wait
###----------------------------------------------------------------------------------------
# Sorting magic
hash_to_bucket ()
{
# Takes as input a string and n_buckets and assigns the string to one of the buckets.
# This function is the bottleneck and takes forever to run
string=$1
n_buckets=$2
# Convert to checksum then take the mod of it
echo -n $string | cksum | cut -f1 -d" " | awk -v n="$n_buckets" '{print $1"%"n}' | bc
}
sorted_file_prefix=sorted_pwordcount
rm -f $sorted_file_prefix* *output
# For each file with counts, create collated files
for file in $data.chunk.*.counts
do
cat $file | while read line
do
key=$(echo $line | cut -f1 -d" ")
hash_key=$(hash_to_bucket $key $m)
echo $line >> $sorted_file_prefix.$hash_key
done
done
# Sort these files
for file in $sorted_file_prefix*
do
sort $file -o $file
done
#
###----------------------------------------------------------------------------------------
## 'ls' makes a list of the temporary count files
## 'perl -pe' regex replaces line breaks with spaces
countfiles=`\ls $sorted_file_prefix* | perl -pe 's/\n/ /'` # file reference replaced with the sorted files from above
## feed the list of countfiles to the python reducer and redirect STDOUT to disk
####
####
for file in $sorted_file_prefix*
do
<$file ./reducer.py >> $data.output
done
sort $data.output -o $data.output
####
####
## clean up the data chunks and temporary count files
rm $data.chunk.*
In [447]:
# Change the execution privileges to make the shell script executable by all
!chmod a+x pWordCount.sh
In [369]:
%%writefile mapper.py
#!/usr/bin/python
import sys
import re
findword = sys.argv[1]
findword = findword.lower()
if findword != "*":
    findword = set(findword.split())
for line in sys.stdin:
    # Turn the line into a collection of lowercase words
    for word in re.findall(r'[a-z]+', line.lower()):
        if (word in findword) or (findword == "*"):
            print("%s 1" % word)
In [370]:
!chmod +x mapper.py
In [417]:
%%writefile reducer.py
#!/usr/bin/python
import sys
# Assumes the input stream is sorted by word (as pWordCount.sh guarantees) and
# that each mapper line carries a count of 1, so total can start at 1.
total = 1
current_word = None
for countStr in sys.stdin:
    word, count = countStr.split()
    if current_word == word:
        total += int(count)
    elif current_word is None:
        current_word = word
    else:
        print(current_word, total)
        current_word = word
        total = 1
print(current_word, total)
In [418]:
! chmod +x reducer.py
I.e., insert code
In [441]:
!chmod a+x mapper.py
!chmod a+x reducer.py
!chmod a+x pWordCount.sh
! ./pWordCount.sh
Write the mapper.py/reducer.py combination to perform WordCount using the command-line MapReduce framework, producing an alphabetical listing of all the words, and the number of times each occurs, in the text version of Alice’s Adventures in Wonderland. (You can obtain a free plain text version of the book, along with many others, from here.) The first 10 lines of your output file should look something like this (the counts are not totally precise):
To do so, make sure of the following:
Here, mapper.py will read in a portion (i.e., a single record corresponding to a row) of the input data, count the number of occurrences of the words in question, and print/emit a count to the output stream, while the reducer is responsible for reading in the counts of each word and summarizing them before printing that summary to the output stream. See the example notebook and the video in section 1.12.1, Poor Man's MapReduce Using Command Line (Part 2), located at: https://learn.datascience.berkeley.edu/mod/page/view.php?id=10961
NOTE: In your Python notebook, create a cell to save your mapper/reducer to disk using magic commands (see example here).
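As a quick sanity check (a sketch, not part of the required submission), the mapper and reducer saved above can also be exercised in a single Unix pipeline with an explicit sort -k1,1 collation step between them; this mimics what pWordCount.sh does with one map process:
# Sketch: single-process map -> collate-by-key -> reduce, using the files written above
!cat alice.txt | ./mapper.py "*" | sort -k1,1 | ./reducer.py | head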
In [442]:
! ./pWordCount.sh 4 "*" alice.txt
In [450]:
!head alice.txt.output
Change the mapper.py/reducer.py combination so that you get only the number of words starting with an uppercase letter, and the number of words starting with a lowercase letter, for Alice in Wonderland available here. In other words, you need an output file with only 2 lines: one giving the number of words starting with a lowercase letter ('a' to 'z'), and the other giving the number of words starting with an uppercase letter ('A' to 'Z'). In pWordCount.sh, please insert a sort command between the mappers (after the for loop) and the reducer calls to collate the output key-value pair records by key from the mappers, e.g., sort -k1,1. Use "man sort" to learn more about Unix sorts.
In [451]:
%%writefile mapper.py
#!/usr/bin/python
import sys
import re
for line in sys.stdin:
    # Classify each word by whether its first letter is uppercase or lowercase
    for word in re.findall(r'[a-zA-Z]+', line):
        if word[0].isupper():
            print("Capital 1")
        else:
            print("Lower 1")
In [452]:
! ./pWordCount.sh 4 "*" alice.txt
In [453]:
! cat alice.txt.output
Provide an example of bias-variance in action for a simulated function y = f(x), e.g., y = sin(x + x^2). Provide code, data, and graphs.
Using a bias-variance decomposition analysis on your chosen problem, describe how you would decide which model to choose when you don't know the true function, and how this choice compares to the choice you made using the true function.
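A sketch of one way to set this up (the noise level, training-set size, polynomial degrees, and number of resampled training sets are all illustrative assumptions):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(261)

def true_f(x):
    # The simulated true function
    return np.sin(x + x ** 2)

x_grid = np.linspace(0, 2, 200).reshape(-1, 1)

def fit_predict(degree):
    # Fit one polynomial model on a fresh noisy training sample, predict on x_grid
    x_train = rng.uniform(0, 2, size=(30, 1))
    y_train = true_f(x_train).ravel() + rng.normal(scale=0.2, size=30)
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    return model.fit(x_train, y_train).predict(x_grid)

for degree in (1, 3, 9):
    # Refit on many resampled training sets to estimate bias^2 and variance
    preds = np.array([fit_predict(degree) for _ in range(200)])
    avg_pred = preds.mean(axis=0)
    bias_sq = ((avg_pred - true_f(x_grid).ravel()) ** 2).mean()
    variance = preds.var(axis=0).mean()
    print("degree %d: bias^2 = %.4f, variance = %.4f" % (degree, bias_sq, variance))
    plt.plot(x_grid.ravel(), avg_pred, label="degree %d (average prediction)" % degree)

plt.plot(x_grid.ravel(), true_f(x_grid).ravel(), "k--", label="true f(x)")
plt.legend()
plt.show()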
In [ ]: