One student emailed with the following question:

Right now I'm trying to edit the example entropy function to the one you wrote on the board in class.

My question is the example code only has one P_s, right? Our goal is to add four different values but I don't understand how to code P_w, P_h and so on. Would you give me more details and advice on this?

As a hint, here is the entropy formula I wrote on the board, rewritten so that it looks up each category explicitly by its label.
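For reference, the normalized entropy that the functions in this notebook compute can be written out as below. The symbols n_i (the count for category i), p_i (its share of the total) and K (the number of categories) are notation introduced here for illustration, with the usual convention that a term with p_i = 0 contributes 0.

$$
E = \frac{-\sum_{i=1}^{K} p_i \ln p_i}{\ln K},
\qquad
p_i = \frac{n_i}{\sum_{j=1}^{K} n_j}
$$

For the entropy5 calculation K = 5; for entropy4, K = 4 and Other is left out of both the sum and the total.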



In [8]:

    
def entropy(series):
    """Normalized Shannon Index"""
    # a series in which all the entries are equal should result in normalized entropy of 1.0
    
    # eliminate 0s
    series1 = series[series!=0]

    # if fewer than two nonzero entries remain (i.e., 0 or 1), return 0
    
    if len(series1) > 1:
        # calculate the maximum possible entropy for given length of input series
        max_s = -np.log(1.0/len(series))
    
        total = float(sum(series1))
        p = series1.astype('float')/float(total)
        return sum(-p*np.log(p))/max_s
    else:
        return 0.0



In [7]:

    
# supporting imports

import numpy as np
from pandas import Series



In [24]:

    
def entropy_term(p):
    """Individual Shannon entropy term -- handles the case in which p is 0"""
    if p == 0:
        return 0
    else:
        return -p*np.log(p)

def entropy5_explicit_labels(series):
    """entropy5 calculation for an input Series with 5 categories"""
    # calculate the normalizing term -- what's the maximum entropy
    # there are five categories here
    max_s = -np.log(1.0/5)
    total = float(series['White']+series['Black']+series['Asian']+ \
                  series['Hispanic']+series['Other'])
    
    s = entropy_term(series['White']/total) + \
        entropy_term(series['Black']/total) + \
        entropy_term(series['Asian']/total) + \
        entropy_term(series['Hispanic']/total) + \
        entropy_term(series['Other']/total)
    
    s = s/max_s
    return s

def entropy4_explicit_labels(series):
    """entropy4 calculation for an input Series with 4 categories"""
    # calculate the normalizing term -- what's the maximum entropy
    # there are four categories here
    max_s = -np.log(1.0/4)
    # don't include Other in the total
    total = float(series['White']+series['Black']+series['Asian']+ \
                  series['Hispanic'])
    
    s = entropy_term(series['White']/total) + \
        entropy_term(series['Black']/total) + \
        entropy_term(series['Asian']/total) + \
        entropy_term(series['Hispanic']/total) 
    
    s = s/max_s
    return s



In [30]:

    
# Using the population figures for the Houston Metro Area
# Make a pandas Series out of the dict

houston = Series({'Asian': 384596,
 'Black': 998883,
 'Hispanic': 2099412,
 'Other': 103437,
 'White': 2360472})

Note how the entropy function can do both the entropy5 and the entropy4 calculation: the only thing that changes is the subset of the houston Series passed into entropy.



In [36]:

    
# comparing two ways of doing the entropy5 calculation
(entropy(houston[['White', 'Black', 'Asian', 'Hispanic', 'Other']]),
 entropy5_explicit_labels(houston))









    Out[36]:





(0.79628076626851163, 0.79628076626851163)



In [37]:

    
# comparing two ways of doing the entropy4 calculation 
# don't include Other

(entropy(houston[['White', 'Black', 'Asian', 'Hispanic']]),
 entropy4_explicit_labels(houston))









    Out[37]:





(0.87642479416885899, 0.87642479416885899)
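The two explicit-label functions above differ only in which labels they sum over, so the same idea can be sketched with the label list passed in as an argument. This is only one possible way to write it: the helper name entropy_for_labels and its labels parameter are hypothetical and not part of the original notebook. The Houston figures are copied from the cell above.

```python
import numpy as np
from pandas import Series

def entropy_for_labels(series, labels):
    """Normalized Shannon entropy over the given labels (hypothetical helper)."""
    # pick out only the requested categories, e.g. leave out 'Other'
    counts = series[labels].astype(float)
    p = counts / counts.sum()
    # drop zero shares, since a term with p == 0 contributes nothing
    p = p[p > 0]
    if len(p) < 2:
        return 0.0
    # the maximum entropy depends on how many categories were requested
    max_s = np.log(len(labels))
    return float(-(p * np.log(p)).sum() / max_s)

houston = Series({'Asian': 384596,
                  'Black': 998883,
                  'Hispanic': 2099412,
                  'Other': 103437,
                  'White': 2360472})

five = ['White', 'Black', 'Asian', 'Hispanic', 'Other']
four = ['White', 'Black', 'Asian', 'Hispanic']

print(entropy_for_labels(houston, five))
print(entropy_for_labels(houston, four))
```

The two printed values should agree with Out[36] and Out[37] up to floating-point rounding.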

Writing an entropy_rice function is left to the reader.


