Use this notebook to work on your answers and check solutions. You can then submit your functions using "HW7_submission.ipynb" or write them directly in a file named "hw7_answers.py". Note that "hw7_answers.py" is the only file that will be collected and graded for this assignment.
You will use the cereal dataset from last week.
In [ ]:
from __future__ import division  # __future__ imports must be the first statement in the cell
import pandas as pd
import numpy as np
In [ ]:
%%sh
## RUN BUT DO NOT EDIT THIS CELL
## run this cell to download the cereal dataset into your current directory
cp /home/data/cereal/cereal.csv .
In [ ]:
## RUN BUT DO NOT EDIT THIS CELL
# load the data, define ratingID
cer = pd.read_csv('cereal.csv', skiprows=[1], delimiter=';')
cer['ratingID'] = cer['rating'].apply(lambda x: 0 if x<55 else 1)
In [ ]:
#### play with code here #####
Write a function called "get_corr_pairs" which takes one argument:
and returns:
You can use your function from question 1 to get the correlation values.
In [ ]:
#### play with code here #####
In [1]: get_corr_pairs(cer[['name','fat','sugars','rating']])
Out[1]: {'fat': ['rating'], 'rating': ['fat', 'sugars'], 'sugars': ['rating']}
Short explanation: the correlation between 'fat' and 'rating' is -0.409, 'sugars' and 'rating' is -0.760; the remaining correlations have magnitude < 0.3.
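The selection rule implied by the example can be sketched as follows. This is only an illustration under stated assumptions, not the expected solution: the function name `corr_pairs_sketch`, the `threshold=0.3` default (inferred from the explanation above), the `>=` comparison, and the toy data are all hypothetical, and the correlations are computed directly with `DataFrame.corr()` rather than with your function from question 1.

```python
import pandas as pd

def corr_pairs_sketch(df, threshold=0.3):
    """Map each numeric column to the other columns whose |correlation| >= threshold.

    Hypothetical helper: the name, the default threshold, and the choice of
    >= (rather than >) are assumptions, not part of the assignment text.
    """
    # Keep numeric columns only; text columns such as 'name' cannot be correlated.
    corr = df.select_dtypes(include="number").corr()
    pairs = {}
    for col in corr.columns:
        partners = [other for other in corr.columns
                    if other != col and abs(corr.loc[col, other]) >= threshold]
        if partners:  # columns without a strong partner are left out
            pairs[col] = sorted(partners)
    return pairs

# Toy data (made up for illustration, not taken from cereal.csv).
toy = pd.DataFrame({
    "label": ["a", "b", "c", "d", "e"],
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],   # perfectly correlated with x
    "z": [1, -1, 0, -1, 1],  # uncorrelated with x and y
})
print(corr_pairs_sketch(toy))
```

Here only `x` and `y` are paired with each other; `z` and the text column `label` do not appear in the result.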
Write a function called "sample_cereal" which takes two arguments:
and returns:
In [ ]:
#### play with code here #####
In [1]: sample_cereal(cer.loc[3:5,['name','mfr','type','calories','protein','ratingID']], 'up')
Out[1]:
                        name mfr type  calories  protein  ratingID
3  All-Bran with Extra Fiber   K    C        50        4         1
3  All-Bran with Extra Fiber   K    C        50        4         1
4             Almond Delight   R    C       110        2         0
5    Apple Cinnamon Cheerios   G    C       110        2         0
Short explanation: the input has only one positive sample (ratingID = 1) and two negative samples (ratingID = 0). Sampling with replacement from a single row can only return that row, so up-sampling the smaller class merely replicates the row for "All-Bran with Extra Fiber".
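One way to read the 'up' case is sketched below. This is not the required implementation: the name `upsample_sketch`, its parameters, and the choices to append the extra rows to the original frame and sort by index are assumptions made to reproduce the shape of the example output.

```python
import pandas as pd

def upsample_sketch(df, label_col="ratingID", random_state=0):
    """Balance two classes by re-drawing rows of the smaller class with replacement.

    Hypothetical helper: the signature and the 'append extras, then sort by
    index' behavior are assumptions, not part of the assignment text.
    """
    counts = df[label_col].value_counts()
    n_extra = counts.max() - counts.min()
    if n_extra == 0:
        return df.copy()  # classes are already balanced
    minority_rows = df[df[label_col] == counts.idxmin()]
    # replace=True allows the same row to be drawn more than once.
    extra = minority_rows.sample(n=n_extra, replace=True, random_state=random_state)
    return pd.concat([df, extra]).sort_index()

# Three rows mirroring the example above (a subset of its columns).
small = pd.DataFrame(
    {"name": ["All-Bran with Extra Fiber", "Almond Delight", "Apple Cinnamon Cheerios"],
     "calories": [50, 110, 110],
     "ratingID": [1, 0, 0]},
    index=[3, 4, 5],
)
balanced = upsample_sketch(small)
print(balanced)
```

Because the minority class contains a single row, the result is the same for any `random_state`: row 3 appears twice.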
In [ ]:
#### play with code here #####
Write a function called "info_gain" which takes four arguments:
and returns:
In [ ]:
#### play with code here #####
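As a starting point for experimenting, here is a minimal sketch of entropy-based information gain for a binary split on a numeric column. The four arguments of "info_gain" are not listed above, so the signature used here (`df`, `feature`, `threshold`, `target`), the function names, the use of Shannon entropy as the impurity measure, and the toy data are all assumptions rather than the assignment's specification.

```python
import numpy as np
import pandas as pd

def entropy(labels):
    """Shannon entropy, in bits, of a pandas Series of class labels."""
    # value_counts only reports labels that occur, so every probability is > 0.
    probs = labels.value_counts(normalize=True).values
    return float(-(probs * np.log2(probs)).sum())

def info_gain_sketch(df, feature, threshold, target):
    """Entropy reduction from splitting df into feature < threshold and feature >= threshold.

    Hypothetical signature: the argument names and order are assumptions.
    """
    left = df[df[feature] < threshold][target]
    right = df[df[feature] >= threshold][target]
    n = float(len(df))
    # Weight each child's entropy by its share of the rows; skip empty children.
    children = sum(len(part) / n * entropy(part)
                   for part in (left, right) if len(part) > 0)
    return entropy(df[target]) - children

# Toy data (made up for illustration, not taken from cereal.csv).
toy = pd.DataFrame({"sugars": [1, 2, 10, 12], "ratingID": [1, 1, 0, 0]})
print(info_gain_sketch(toy, "sugars", 5, "ratingID"))    # split separates the classes perfectly
print(info_gain_sketch(toy, "sugars", 1.5, "ratingID"))  # a less informative split
```

A threshold that separates the two classes perfectly recovers the full parent entropy (1 bit for this balanced toy frame), while a poorer threshold yields a smaller gain.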