Homework 7

Use this notebook to work on your answers and check your solutions. You can then submit your functions using "HW7_submission.ipynb" or write them directly in a file named "hw7_answers.py". Note that "hw7_answers.py" is the only file that will be collected and graded for this assignment.

You will use the cereal dataset from last week.



In [1]:

    
from __future__ import division  # __future__ imports must come before other statements
import pandas as pd
import numpy as np



In [2]:

    
%%sh 
## RUN BUT DO NOT EDIT THIS CELL
## run this cell to download the cereal dataset into your current directory
cp /home/data/cereal/cereal.csv .



In [3]:

    
## RUN BUT DO NOT EDIT THIS CELL

# load the data, define ratingID
cer = pd.read_csv('cereal.csv', skiprows=[1], delimiter=';')
cer['ratingID'] = cer['rating'].apply(lambda x: 0 if x<55 else 1)

Question 1

Write a function called "get_corrs" which takes one argument:

df, which is a pandas data frame

and returns:

m, a correlation matrix for the numerical variables in df.



In [4]:

    
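def get_corrs(df):
    # keep only numeric columns so that string columns such as 'name' are ignored
    return df.select_dtypes(include=[np.number]).corr()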



In [5]:

    
get_corrs(cer[['name','calories','carbo','sugars']])

Sample output:

In [1]: get_corrs(cer[['name','calories','carbo','sugars']])
Out[1]:
          calories     carbo    sugars
calories  1.000000  0.250681  0.562340
carbo     0.250681  1.000000 -0.331665
sugars    0.562340 -0.331665  1.000000

Question 2

Write a function called "get_corr_pairs" which takes one argument:

df, which is a pandas data frame

and returns:

corr_pairs, a dictionary where keys are names of columns of df corresponding to numerical features, and values are arrays of names of columns whose correlation coefficient with the key has magnitude 0.3 or greater.

You can use your function from question 1 to get the correlation values.



In [17]:

    
def get_corr_pairs(df):
    cmat = get_corrs(df)
    d = {}
    for key in cmat.columns:
        row = cmat.loc[key]
        # columns whose |correlation| with key is at least 0.3, excluding key itself
        d[key] = [c for c in row.index[row.abs() >= 0.3] if c != key]
    return d



In [18]:

    
get_corr_pairs(cer[['name','fat','sugars','rating']])

Out[18]:

{'fat': ['rating'], 'rating': ['fat', 'sugars'], 'sugars': ['rating']}

Sample output:

In [1]: get_corr_pairs(cer[['name','fat','sugars','rating']])
Out[1]: {'fat': ['rating'], 'rating': ['fat', 'sugars'], 'sugars': ['rating']}

Short explanation: the correlation between 'fat' and 'rating' is -0.409, 'sugars' and 'rating' is -0.760; the remaining correlations have magnitude < 0.3.
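
The thresholding step described above can be sketched without pandas. This is only an illustration: the function name `pairs_above` is hypothetical, the 'fat'/'rating' and 'sugars'/'rating' values are the ones quoted in the explanation, and the 'fat'/'sugars' value of 0.2 is a placeholder standing in for "magnitude < 0.3", not a measured value.

```python
# Correlation values as a nested dict. The fat/rating and sugars/rating entries
# are the ones quoted in the explanation; the fat/sugars entry (0.2) is only a
# placeholder for a correlation whose magnitude is below 0.3.
corr = {
    'fat':    {'fat': 1.0,    'sugars': 0.2,    'rating': -0.409},
    'sugars': {'fat': 0.2,    'sugars': 1.0,    'rating': -0.760},
    'rating': {'fat': -0.409, 'sugars': -0.760, 'rating': 1.0},
}

def pairs_above(corr, cutoff=0.3):
    """Map each column to the other columns with |correlation| >= cutoff."""
    return {
        key: [other for other, r in row.items()
              if other != key and abs(r) >= cutoff]
        for key, row in corr.items()
    }

result = pairs_above(corr)
print(result)
```

Comparing on the absolute value is what lets the negative correlations with 'rating' pass the cutoff.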

Question 3

Write a function called "sample_cereal" which takes two arguments:

df, which is a pandas data frame
kind, which is a string that can take value 'up' or 'down'

and returns:

a pandas data frame with a balanced target class 'ratingID', using upsampling if kind='up' and downsampling if kind='down'.



In [21]:

    
def sample_cereal(df, kind):
    id_counts = df.groupby('ratingID').ratingID.count()
    o1 = df[df.ratingID==id_counts.idxmax()]  # rows of the majority class
    o2 = df[df.ratingID==id_counts.idxmin()]  # rows of the minority class
    if (kind=='up'):
        # draw minority rows with replacement until they match the majority count
        return pd.concat([o1, o2.iloc[np.random.choice(len(o2), len(o1), replace=True)]])
    elif (kind=='down'):
        # draw majority rows without replacement down to the minority count
        return pd.concat([o2, o1.iloc[np.random.choice(len(o1), len(o2), replace=False)]])
    else:
        print('This kind of sampling is not recognized! Returning original DataFrame!')
        return df



In [32]:

    
sample_cereal(cer.loc[3:5,['name','mfr','type','calories','protein','ratingID']], 'up')

Out[32]:

   name                       mfr  type  calories  protein  ratingID
4  Almond Delight             R    C     110       2        0
5  Apple Cinnamon Cheerios    G    C     110       2        0
3  All-Bran with Extra Fiber  K    C     50        4        1
3  All-Bran with Extra Fiber  K    C     50        4        1



In [34]:

    
sample_cereal(cer.loc[3:5,['name','mfr','type','calories','protein','ratingID']], 'down')

Out[34]:

   name                       mfr  type  calories  protein  ratingID
3  All-Bran with Extra Fiber  K    C     50        4        1
4  Almond Delight             R    C     110       2        0

Sample output:

In [1]: sample_cereal(cer.loc[3:5,['name','mfr','type','calories','protein','ratingID']], 'up')
Out[1]:
   name                       mfr  type  calories  protein  ratingID
3  All-Bran with Extra Fiber  K    C     50        4        1
3  All-Bran with Extra Fiber  K    C     50        4        1
4  Almond Delight             R    C     110       2        0
5  Apple Cinnamon Cheerios    G    C     110       2        0

Short explanation: The input has only one positive sample and two negative samples. Sampling with replacement from a single row can only ever return that row, so up-sampling the smaller class merely replicates the row for "All-Bran with Extra Fiber".
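
The same behaviour can be reproduced on plain Python lists. The sketch below is not the graded solution: the helper `balance`, its `label_of` and `seed` parameters, and the tuple layout are assumptions made for illustration, and it assumes exactly two classes are present.

```python
import random
from collections import Counter

def balance(rows, label_of, kind, seed=None):
    """Balance two classes in a list of rows by up- or down-sampling.

    Assumes exactly two distinct labels occur in rows.
    """
    rng = random.Random(seed)
    counts = Counter(label_of(r) for r in rows)
    (major, _), (minor, _) = counts.most_common(2)
    big = [r for r in rows if label_of(r) == major]
    small = [r for r in rows if label_of(r) == minor]
    if kind == 'up':
        # sample the minority class WITH replacement up to the majority size
        return big + rng.choices(small, k=len(big))
    if kind == 'down':
        # sample the majority class WITHOUT replacement down to the minority size
        return small + rng.sample(big, k=len(small))
    raise ValueError("kind must be 'up' or 'down'")

# (name, ratingID) pairs taken from the three rows used in the example above
rows = [('All-Bran with Extra Fiber', 1),
        ('Almond Delight', 0),
        ('Apple Cinnamon Cheerios', 0)]

up = balance(rows, label_of=lambda r: r[1], kind='up', seed=0)
down = balance(rows, label_of=lambda r: r[1], kind='down', seed=0)
print(up)
print(down)
```

With a single minority row, every draw made with replacement returns that row, which is why it appears twice in the up-sampled result regardless of the seed.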

Question 4

Write a function called "find_H" which takes two arguments:

df, which is a pandas data frame
cname, which is the name of the target column (that should correspond to a categorical variable)

and returns:

H, the entropy in the column cname (use logarithm base 2)
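
As a reminder of the quantity being asked for, the entropy of a column with class proportions p_i is H = -sum_i p_i * log2(p_i). A minimal sketch on a plain list of labels follows; the function name `entropy` is an assumption for illustration and is not the required "find_H" signature.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    n = len(labels)
    # only classes that actually occur are counted, so log2(0) never arises
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

print(entropy([0, 0, 1, 1]))        # two equally likely classes
print(entropy([1] * 3 + [0] * 17))  # a 3-versus-17 split of 20 labels
```

Two equally likely classes give exactly 1 bit, and a 3-versus-17 split of 20 labels gives about 0.6098 bits.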



In [39]:

    
def find_H(df, cname):
    if (cname not in df.columns.values):
        print('Column name not recognized!')
        return 0
    # relative frequency of each class present in the column
    p = df.groupby(cname)[cname].count()/len(df)
    return -sum(p*np.log2(p))



In [43]:

    
find_H(cer.iloc[:20], 'ratingID')

Out[43]:

0.60984030471640038

Sample output:

In [1]: find_H(cer.iloc[:20], 'ratingID')
Out[1]: 0.60984030471640038

Question 5

Write a function called "info_gain" which takes four arguments:

df, which is a pandas data frame
cname, which is the name of the target column (that should correspond to a categorical variable)
csplit, which is the name of a numeric column in df
threshold, which is a numeric value

and returns:

info_gain, the information gain you get in column cname by splitting the dataset on the threshold value in column csplit.
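
Information gain is the entropy of the target before the split minus the size-weighted average of the entropies of the two partitions. The sketch below shows this on plain lists; the names `entropy` and `information_gain` and the toy data are assumptions for illustration, and the choice of `<` versus `>=` for the two sides of the split is one possible convention.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Entropy reduction from splitting labels on values < threshold."""
    left = [y for x, y in zip(values, labels) if x < threshold]
    right = [y for x, y in zip(values, labels) if x >= threshold]
    n = len(labels)
    # weight each partition's entropy by its share of the rows
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted

# toy data: low values belong to class 1, high values to class 0
values = [1, 2, 8, 9]
labels = [1, 1, 0, 0]
print(information_gain(values, labels, 7))    # perfect split
print(information_gain(values, labels, 100))  # everything lands on one side
```

A threshold that separates the classes perfectly recovers the full 1 bit of entropy, while a threshold that leaves one side empty gains nothing.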



In [47]:

    
#### play with code here #####

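def info_gain(df, cname, csplit, threshold):
    H0 = find_H(df, cname)  # entropy before the split
    o1 = df[df[csplit]<threshold]
    o2 = df[df[csplit]>=threshold]
    R1 = find_H(o1, cname)
    R2 = find_H(o2, cname)
    # subtract the size-weighted entropies of the two partitions
    return H0 - len(o1)/len(df)*R1 - len(o2)/len(df)*R2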



In [48]:

    
info_gain(cer.iloc[:20], 'ratingID', 'sugars', 7.0)

Out[48]:

0.2280667035464144

Sample output:

In [1]: info_gain(cer.iloc[:20], 'ratingID', 'sugars', 7.0)
Out[1]: 0.2280667035464144

Note: for a probability of 0, use the fact that lim_(p->0+) p log(p) = 0.
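
One way to apply this convention when working from an explicit probability vector is to skip zero entries, since their contribution is defined to be 0. The helper name `entropy_from_probs` below is hypothetical.

```python
from math import log2

def entropy_from_probs(probs):
    """Entropy (base 2) of a probability vector, treating 0 * log2(0) as 0."""
    # zero-probability terms are skipped, which matches the limit p*log(p) -> 0
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy_from_probs([0.5, 0.5, 0.0]))
print(entropy_from_probs([1.0, 0.0]))
```

Without the `p > 0` guard, `log2(0)` would raise a `ValueError` in pure Python; with NumPy arrays the analogous expression evaluates to NaN.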
