Homework 7

Use this notebook to work on your answers and check solutions. You can then submit your functions using "HW7_submission.ipynb" or write them directly in a file named "hw7_answers.py". Note that "hw7_answers.py" is the only file that will be collected and graded for this assignment.

You will use the cereal dataset from last week.



In [ ]:

    
# __future__ imports must be the first statement in the cell
from __future__ import division
import pandas as pd
import numpy as np



In [ ]:

    
%%sh 
## RUN BUT DO NOT EDIT THIS CELL
## run this cell to download the cereal dataset into your current directory
cp /home/data/cereal/cereal.csv .



In [ ]:

    
## RUN BUT DO NOT EDIT THIS CELL

# load the data, define ratingID
cer = pd.read_csv('cereal.csv', skiprows=[1], delimiter=';')
cer['ratingID'] = cer['rating'].apply(lambda x: 0 if x<55 else 1)

Question 1

Write a function called "get_corrs" which takes one argument:

df, which is a pandas data frame

and returns:

m, a correlation matrix for the numerical variables in df.
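
As a rough illustration of the idea (not the graded solution), the sketch below builds a correlation matrix from the numeric columns of a small made-up frame. The frame `toy` and its column names are hypothetical; it assumes that non-numeric columns should simply be dropped before calling `corr()`, which computes Pearson correlations by default.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (not the cereal data): one text column, three numeric.
toy = pd.DataFrame({
    'label': ['a', 'b', 'c', 'd'],
    'x': [1.0, 2.0, 3.0, 4.0],
    'y': [2.0, 4.0, 6.0, 8.0],
    'z': [4.0, 3.0, 2.0, 1.0],
})

# Assumption: "numerical variables" means columns with a numeric dtype.
numeric = toy.select_dtypes(include=[np.number])

# Pairwise Pearson correlations between the remaining columns.
m = numeric.corr()
print(m)
```

Here 'y' is a positive multiple of 'x' and 'z' decreases linearly as 'x' increases, so the off-diagonal entries come out at the extremes of the correlation range.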



In [ ]:

    
#### play with code here #####

Sample output:

In [1]: get_corrs(cer[['name','calories','carbo','sugars']])
Out[1]:           calories     carbo    sugars
        calories  1.000000  0.250681  0.562340
        carbo     0.250681  1.000000 -0.331665
        sugars    0.562340 -0.331665  1.000000

Question 2

Write a function called "get_corr_pairs" which takes one argument:

df, which is a pandas data frame

and returns:

corr_pairs, a dictionary whose keys are the names of the numerical columns of df, and whose values are lists of the names of the other columns whose correlation coefficient with the key has magnitude 0.3 or greater.

You can use your function from question 1 to get the correlation values.
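
One possible way to turn a correlation matrix into such a dictionary is sketched below on made-up data. The helper name `corr_pairs_sketch` and the frame `toy` are hypothetical. Two choices here are assumptions rather than requirements stated above: a column is never listed as its own partner, and a numeric column with no partner is kept as a key with an empty list.

```python
import numpy as np
import pandas as pd

def corr_pairs_sketch(df, cutoff=0.3):
    """Map each numeric column to the other columns with |r| >= cutoff."""
    m = df.select_dtypes(include=[np.number]).corr()
    pairs = {}
    for col in m.columns:
        # Assumption: skip the column itself (its self-correlation is always 1).
        pairs[col] = [other for other in m.columns
                      if other != col and abs(m.loc[col, other]) >= cutoff]
    return pairs

# Hypothetical toy frame: 'y' tracks 'x' exactly, 'w' is uncorrelated with both.
toy = pd.DataFrame({
    'label': list('abcde'),
    'x': [1, 2, 3, 4, 5],
    'y': [2, 4, 6, 8, 10],
    'w': [1, -1, 0, -1, 1],
})
result = corr_pairs_sketch(toy)
print(result)
```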



In [ ]:

    
#### play with code here #####

Sample output:

In [1]: get_corr_pairs(cer[['name','fat','sugars','rating']])
Out[1]: {'fat': ['rating'], 'rating': ['fat', 'sugars'], 'sugars': ['rating']}

Short explanation: the correlation between 'fat' and 'rating' is -0.409, and between 'sugars' and 'rating' it is -0.760; the remaining correlation, between 'fat' and 'sugars', has magnitude < 0.3.

Question 3

Write a function called "sample_cereal" which takes two arguments:

df, which is a pandas data frame
kind, which is a string that can take value 'up' or 'down'

and returns:

a pandas data frame in which the target class 'ratingID' is balanced, using upsampling if kind='up' and downsampling if kind='down'.
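
The sketch below shows one way the two balancing strategies could work, on a made-up frame. The helper `balance_sketch`, its `seed` parameter, and the column name 'flag' are hypothetical. It assumes that upsampling keeps every original row and adds rows drawn with replacement from the smaller class, and that downsampling draws rows without replacement from the larger class.

```python
import pandas as pd

def balance_sketch(df, target, kind, seed=0):
    """Return a copy of df with equal counts for every value of `target`."""
    if kind not in ('up', 'down'):
        raise ValueError("kind must be 'up' or 'down'")
    counts = df[target].value_counts()
    # Grow every class to the largest size, or shrink to the smallest.
    n = int(counts.max() if kind == 'up' else counts.min())
    parts = []
    for value, count in counts.items():
        group = df[df[target] == value]
        if count == n:
            parts.append(group)
        elif kind == 'up':
            # Keep the originals and add rows drawn with replacement.
            extra = group.sample(n=n - int(count), replace=True, random_state=seed)
            parts.append(pd.concat([group, extra]))
        else:
            # Keep a random subset drawn without replacement.
            parts.append(group.sample(n=n, replace=False, random_state=seed))
    return pd.concat(parts).sort_index()

# Hypothetical toy frame: one positive row and two negative rows.
toy = pd.DataFrame({'name': ['a', 'b', 'c'], 'flag': [1, 0, 0]})
up = balance_sketch(toy, 'flag', 'up')
down = balance_sketch(toy, 'flag', 'down')
print(up)
print(down)
```

With a single positive row, upsampling can only repeat that row, while downsampling keeps one of the two negative rows.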



In [ ]:

    
#### play with code here #####

Sample output:

In [1]: sample_cereal(cer.loc[3:5,['name','mfr','type','calories','protein','ratingID']], 'up')
Out[1]:    name                       mfr  type  calories  protein  ratingID
        3  All-Bran with Extra Fiber  K    C     50        4        1
        3  All-Bran with Extra Fiber  K    C     50        4        1
        4  Almond Delight             R    C     110       2        0
        5  Apple Cinnamon Cheerios    G    C     110       2        0

Short explanation: The input has one positive sample and two negative samples. Sampling with replacement from a single row can return only that row, so upsampling the smaller class simply replicates the row for "All-Bran with Extra Fiber".

Question 4

Write a function called "find_H" which takes two arguments:

df, which is a pandas data frame
cname, which is the name of the target column (that should correspond to a categorical variable)

and returns:

H, the entropy of the column cname (use logarithm base 2)
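
For reference, the entropy of a categorical column is H = -sum_i p_i log2(p_i), where p_i is the fraction of rows taking the i-th value. A minimal sketch of that formula on made-up data follows; the helper name `entropy_sketch` is hypothetical, and it takes a Series rather than a data frame and column name.

```python
import numpy as np
import pandas as pd

def entropy_sketch(series):
    """Shannon entropy (in bits) of the values in a pandas Series."""
    # Relative frequency of each observed value.
    p = series.value_counts(normalize=True).to_numpy()
    # Drop zero probabilities so that 0 * log2(0) contributes nothing.
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Hypothetical toy columns: an even split and a 3-to-1 split.
even = entropy_sketch(pd.Series([0, 0, 1, 1]))
skewed = entropy_sketch(pd.Series([0, 0, 0, 1]))
print(even, skewed)
```

An even two-way split gives the maximum of one bit, and the entropy drops as one class starts to dominate.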



In [ ]:

    
#### play with code here #####

Sample output:

In [1]: find_H(cer.iloc[:20], 'ratingID')
Out[1]: 0.60984030471640038

Question 5

Write a function called "info_gain" which takes four arguments:

df, which is a pandas data frame
cname, which is the name of the target column (that should correspond to a categorical variable)
csplit, which is the name of a numeric column in df
threshold, which is a numeric value

and returns:

info_gain, the information gain you get in column cname by splitting the dataset on the threshold value in column csplit.



In [ ]:

    
#### play with code here #####

Sample output:

In [1]: info_gain(cer.iloc[:20], 'ratingID', 'sugars', 7.0)
Out[1]: 0.2280667035464144

Note: for a probability of 0, use the fact that lim_(p->0+) p log(p) = 0.
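
The information gain of a split is the entropy of the target before the split minus the size-weighted average entropy of the two parts. A rough sketch on made-up data is given below. The helper names are hypothetical, and one detail is an assumption not stated in the question: rows whose value is strictly below the threshold go to one part, and rows at or above it go to the other. The limit convention from the note is handled by dropping zero probabilities and treating an empty part as having zero entropy.

```python
import numpy as np
import pandas as pd

def entropy_bits(series):
    """Shannon entropy (in bits); an empty Series counts as 0."""
    if len(series) == 0:
        return 0.0
    p = series.value_counts(normalize=True).to_numpy()
    p = p[p > 0]  # convention: p * log(p) -> 0 as p -> 0+
    return float(-(p * np.log2(p)).sum())

def info_gain_sketch(df, target, split_col, threshold):
    """Entropy of `target` minus the weighted entropy after the split."""
    # Assumption: strictly-below goes to one part, at-or-above to the other.
    below = df[df[split_col] < threshold]
    above = df[df[split_col] >= threshold]
    n = len(df)
    weighted = sum(len(part) / n * entropy_bits(part[target])
                   for part in (below, above))
    return entropy_bits(df[target]) - weighted

# Hypothetical toy frame: the classes separate cleanly at x = 2.5.
toy = pd.DataFrame({'y': [0, 0, 1, 1], 'x': [1.0, 2.0, 3.0, 4.0]})
perfect = info_gain_sketch(toy, 'y', 'x', 2.5)
useless = info_gain_sketch(toy, 'y', 'x', 0.5)
print(perfect, useless)
```

A threshold that separates the classes perfectly recovers the full entropy of the target, while a threshold that leaves one part empty gains nothing.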