Use this notebook to work on your answers and check solutions. You can then submit your functions using "HW7_submission.ipynb" or write them directly in a file named "hw7_answers.py". Note that "hw7_answers.py" is the only file that will be collected and graded for this assignment.
You will use the cereal dataset from last week.
In [1]:
from __future__ import division  # __future__ imports must come before any other statement
import pandas as pd
import numpy as np
In [2]:
%%sh
## RUN BUT DO NOT EDIT THIS CELL
## run this cell to download the cereal dataset into your current directory
cp /home/data/cereal/cereal.csv .
In [3]:
## RUN BUT DO NOT EDIT THIS CELL
# load the data, define ratingID
cer = pd.read_csv('cereal.csv', skiprows=[1], delimiter=';')
cer['ratingID'] = cer['rating'].apply(lambda x: 0 if x<55 else 1)
In [4]:
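def get_corrs(df):
    # correlate numeric columns only; non-numeric ones such as 'name' are dropped
    return df.select_dtypes(include=[np.number]).corr()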
In [5]:
get_corrs(cer[['name','calories','carbo','sugars']])
Out[5]:
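Write a function called "get_corr_pairs" which takes one argument, a DataFrame (df), and returns a dictionary mapping each numeric column name to the list of other columns whose correlation with it has a magnitude of at least 0.3.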
You can use your function from question 1 to get the correlation values.
In [17]:
def get_corr_pairs(df):
    cmat = get_corrs(df)
    d = {}
    for key in cmat.columns:
        row = cmat.loc[key]
        # keep columns whose correlation with `key` has magnitude >= 0.3,
        # skipping the trivial self-correlation
        d[key] = [c for c in row[row.abs() >= 0.3].index if c != key]
    return d
In [18]:
get_corr_pairs(cer[['name','fat','sugars','rating']])
Out[18]:
In [1]: get_corr_pairs(cer[['name','fat','sugars','rating']])
Out[1]: {'fat': ['rating'], 'rating': ['fat', 'sugars'], 'sugars': ['rating']}
Short explanation: the correlation between 'fat' and 'rating' is -0.409, 'sugars' and 'rating' is -0.760; the remaining correlations have magnitude < 0.3.
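The thresholding rule above can be cross-checked without pandas. The following is a minimal sketch that uses only the standard library; the helper names `pearson` and `corr_pairs` and the toy columns are made up for illustration and are not part of the assignment or of cereal.csv.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = float(len(x))
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def corr_pairs(columns, threshold=0.3):
    """Map each column name to the other columns with |r| >= threshold."""
    return {
        a: [b for b in columns
            if b != a and abs(pearson(columns[a], columns[b])) >= threshold]
        for a in columns
    }

# made-up toy data, not taken from cereal.csv
toy = {
    'x': [1, 2, 3, 4, 5],
    'y': [2, 4, 6, 8, 10],   # exactly 2 * x, so perfectly correlated with x
    'z': [1, -1, 0, -1, 1],  # constructed to have zero covariance with x
}
pairs = corr_pairs(toy)
print(pairs)
```

Here 'x' and 'y' list each other, while 'z' maps to an empty list because its correlation with both other columns is zero.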
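Write a function called "sample_cereal" which takes two arguments, a DataFrame (df) and the kind of sampling (kind, either 'up' or 'down'), and returns a DataFrame in which the two ratingID classes have the same number of rows: 'up' resamples the smaller class with replacement to the size of the larger one, and 'down' samples the larger class without replacement to the size of the smaller one.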
In [21]:
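def sample_cereal(df, kind):
    id_counts = df.groupby('ratingID').ratingID.count()
    # idxmax/idxmin return the class labels of the larger and smaller class
    o1 = df[df.ratingID == id_counts.idxmax()]  # majority class
    o2 = df[df.ratingID == id_counts.idxmin()]  # minority class
    if kind == 'up':
        # draw from the minority class with replacement up to the majority size
        return pd.concat([o1, o2.iloc[np.random.choice(len(o2), len(o1), replace=True)]])
    elif kind == 'down':
        # draw from the majority class without replacement down to the minority size
        return pd.concat([o2, o1.iloc[np.random.choice(len(o1), len(o2), replace=False)]])
    else:
        print('This kind of sampling is not recognized! Returning original DataFrame!')
        return df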
In [32]:
sample_cereal(cer.loc[3:5,['name','mfr','type','calories','protein','ratingID']], 'up')
Out[32]:
In [34]:
sample_cereal(cer.loc[3:5,['name','mfr','type','calories','protein','ratingID']], 'down')
Out[34]:
In [1]: sample_cereal(cer.loc[3:5,['name','mfr','type','calories','protein','ratingID']], 'up')
Out[1]:
                        name  mfr  type  calories  protein  ratingID
3  All-Bran with Extra Fiber    K     C        50        4         1
3  All-Bran with Extra Fiber    K     C        50        4         1
4             Almond Delight    R     C       110        2         0
5    Apple Cinnamon Cheerios    G     C       110        2         0
Short explanation: the input has one positive sample and two negative samples. Sampling with replacement from a single row can only return that row, so up-sampling the smaller class merely replicates the row for "All-Bran with Extra Fiber".
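The same balancing idea can be sketched with the standard library alone, which makes the with-replacement versus without-replacement distinction explicit. The function name `balance` and its `seed` parameter are hypothetical, and the sketch assumes the label takes exactly two values; the three rows reuse the names and ratingID values from the example above.

```python
import random

def balance(rows, label, kind, seed=None):
    """Balance a list of dict rows on a two-valued `label` key ('up' or 'down')."""
    rng = random.Random(seed)
    groups = {}
    for row in rows:
        groups.setdefault(row[label], []).append(row)
    # assumes exactly two classes; the shorter group is the minority
    minority, majority = sorted(groups.values(), key=len)
    if kind == 'up':
        # sample the minority class WITH replacement up to the majority size
        return majority + [rng.choice(minority) for _ in majority]
    if kind == 'down':
        # sample the majority class WITHOUT replacement down to the minority size
        return minority + rng.sample(majority, len(minority))
    raise ValueError('kind must be "up" or "down"')

rows = [
    {'name': 'All-Bran with Extra Fiber', 'ratingID': 1},
    {'name': 'Almond Delight', 'ratingID': 0},
    {'name': 'Apple Cinnamon Cheerios', 'ratingID': 0},
]
up = balance(rows, 'ratingID', 'up', seed=0)
down = balance(rows, 'ratingID', 'down', seed=0)
print(len(up), len(down))
```

Up-sampling yields four rows (two per class), and down-sampling yields two rows (one per class). Unlike sample_cereal, this sketch raises an error on an unrecognized kind instead of returning the input.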
In [39]:
def find_H(df, cname):
    if cname not in df.columns:
        print('Column name not recognized!')
        return 0
    # relative frequency of each distinct value in the column
    p = df.groupby(cname)[cname].count() / len(df)
    # Shannon entropy in bits
    return -sum(p * np.log2(p))
In [43]:
find_H(cer.iloc[:20], 'ratingID')
Out[43]:
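As a sanity check on find_H, the entropy it computes is H = -sum(p_i * log2(p_i)) over the relative frequencies p_i of the distinct values in the column. A minimal standard-library sketch of the same quantity follows; the helper name `entropy` and the label lists are made up for illustration.

```python
from collections import Counter
from math import log

def entropy(labels):
    """Shannon entropy, in bits, of a sequence of discrete labels."""
    n = float(len(labels))
    return -sum((c / n) * log(c / n, 2) for c in Counter(labels).values())

balanced = entropy([0, 1, 0, 1])   # two equally likely classes
pure = entropy([1, 1, 1, 1])       # a single class
skewed = entropy([0, 0, 0, 1])     # a 3:1 class ratio
print(balanced, skewed)
```

A perfectly balanced binary column carries 1 bit, a pure column carries 0 bits, and anything in between falls strictly between those two values.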
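Write a function called "info_gain" which takes four arguments, a DataFrame (df), the name of the label column (cname), the name of the column to split on (csplit), and a threshold (thr), and returns the information gain obtained by splitting the rows into csplit < thr and csplit >= thr.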
In [47]:
#### play with code here #####
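def info_gain(df, cname, csplit, thr):
    # entropy of the label column before splitting
    H0 = find_H(df, cname)
    # split the rows on the threshold
    o1 = df[df[csplit] < thr]
    o2 = df[df[csplit] >= thr]
    R1 = find_H(o1, cname)
    R2 = find_H(o2, cname)
    # parent entropy minus the size-weighted entropies of the two parts
    return H0 - len(o1)/len(df)*R1 - len(o2)/len(df)*R2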
In [48]:
info_gain(cer.iloc[:20], 'ratingID', 'sugars', 7.0)
Out[48]:
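To make the information-gain arithmetic concrete, here is a small standard-library sketch that mirrors the structure of info_gain: parent entropy minus the size-weighted entropies of the two parts. The names `entropy` and `gain` and the four-row toy data are hypothetical and are not drawn from cereal.csv.

```python
from collections import Counter
from math import log

def entropy(labels):
    """Shannon entropy, in bits; an empty list contributes 0."""
    n = float(len(labels))
    return -sum((c / n) * log(c / n, 2) for c in Counter(labels).values())

def gain(labels, values, thr):
    """Information gain of splitting on values < thr versus values >= thr."""
    left = [l for l, v in zip(labels, values) if v < thr]
    right = [l for l, v in zip(labels, values) if v >= thr]
    n = float(len(labels))
    return (entropy(labels)
            - len(left) / n * entropy(left)
            - len(right) / n * entropy(right))

# made-up toy data
labels = [1, 1, 0, 0]
sugars = [2, 3, 9, 12]
perfect = gain(labels, sugars, 7.0)  # the threshold separates the classes cleanly
useless = gain(labels, sugars, 1.0)  # every row lands on the same side
print(perfect, useless)
```

A threshold that separates the classes completely recovers the full parent entropy (1 bit here), while a threshold that leaves all rows on one side gains nothing.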