In this exercise, reproduce some of the findings from the Smithsonian Magazine article "What Makes Houston the Next Great American City?", specifically the calculation represented in the article's bar chart, whose caption reads:
To assess the parity of the four major U.S. ethnic and racial groups, Rice University researchers used a scale called the Entropy Index. It ranges from 0 (a population has just one group) to 1 (all groups are equivalent). Edging New York for the most balanced diversity, Houston had an Entropy Index of 0.874 (orange bar).
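The Entropy Index in question is the Shannon entropy of the group proportions, normalized by its maximum possible value so that it lands in [0, 1]. For $k$ groups with proportions $p_1, \dots, p_k$, a standard formulation (and the one the entropy function below implements) is

$$E = \frac{-\sum_{i=1}^{k} p_i \ln p_i}{\ln k}$$

The numerator is 0 when a single group holds the whole population and reaches its maximum, $\ln k$, when all groups have equal shares.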
The research report cited by Smithsonian Magazine is Houston Region Grows More Racially/Ethnically Diverse, With Small Declines in Segregation: A Joint Report Analyzing Census Data from 1990, 2000, and 2010, by the Kinder Institute for Urban Research & the Hobby Center for the Study of Texas.
In the report, you'll find the following quotes:
How does Houston’s racial/ethnic diversity compare to the racial/ethnic diversity of other large metropolitan areas? The Houston metropolitan area is the most racially/ethnically diverse.
....
Houston is one of the most racially/ethnically diverse metropolitan areas in the nation as well. *It is the most diverse of the 10 largest U.S. metropolitan areas.* [emphasis mine] Unlike the other large metropolitan areas, all four major racial/ethnic groups have substantial representation in Houston with Latinos and Anglos occupying roughly equal shares of the population.
....
Houston has the highest entropy score of the 10 largest metropolitan areas, 0.874. New York is a close second with a score of 0.872.
....
Your task is:
Tabulate all the metropolitan/micropolitan statistical areas. Remember that you have to group the various entities that show up separately in the Census API but belong to the same area. You should find 942 metropolitan/micropolitan statistical areas in the 2010 Census.
Calculate the normalized Shannon index (entropy5) using the categories of White, Black, Hispanic, Asian, and Other, as outlined in the Day_07_G_Calculating_Diversity notebook.
Calculate the normalized Shannon index (entropy4) by not considering the Other category. In other words, assume that the total population is the sum of White, Black, Hispanic, and Asian.
Figure out how exactly the entropy score was calculated in the report from Rice University. Since you'll find that the reported entropy score matches neither entropy5 nor entropy4, you'll need to play around with the entropy calculation to figure out how to use 4 categories to get the score for Houston to come out to "0.874" and that for NYC to be "0.872". [I think I've done so and get 0.873618 and 0.872729, respectively.] One reading that reproduces these numbers is written out just after this list.
Add a calculation of the Gini-Simpson diversity index using the five categories of White, Black, Hispanic, Asian, and Other.
Note where the Bay Area stands in terms of the diversity index.
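For reference, here is one reading of the Rice score that is consistent with the numbers above (and the one implemented in rice_entropy below): keep only the four named groups in the sum, but leave their proportions as shares of the total population including Other (so they sum to slightly less than 1), and still normalize by $\ln 4$:

$$E_{\text{rice}} = \frac{-\sum_{i \in \{W,\,B,\,H,\,A\}} p_i \ln p_i}{\ln 4}, \qquad p_i = \frac{n_i}{n_{\text{total}}}$$

The Gini-Simpson index over the five categories is

$$GS = 1 - \sum_{i=1}^{5} p_i^2,$$

which can be read as the probability that two people drawn at random belong to different groups.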
For bonus points:
Deliverable: msas_df.
HAVE FUN: ASK QUESTIONS AND WORK TOGETHER.
Below is testing code to help make sure you are on the right track. A key assumption made here is that you will end up with a pandas DataFrame called msas_df, indexed by the FIPS code of a metropolitan/micropolitan area (e.g., Houston's code is 26420) and with (at least) the following columns: Total, White, Black, Hispanic, Asian, Other, p_White, p_Black, p_Hispanic, p_Asian, p_Other, entropy4, entropy5, entropy_rice, and gini_simpson.
You should have 942 rows, one for each MSA. You can compare your results for entropy5 and entropy_rice with mine.
In [320]:
# FILL IN WITH YOUR CODE
from __future__ import division
%pylab --no-import-all inline
In [321]:
import numpy as np
import matplotlib.pyplot as plt
from pandas import DataFrame, Series, Index
import pandas as pd
from itertools import islice
In [322]:
import census
import us
import settings
In [323]:
c = census.Census(key=settings.CENSUS_KEY)
In [324]:
def msas(variables="NAME"):
    # yield one record per (state, MSA) pair; MSAs that span several
    # states show up once per state and must be grouped later
    for state in us.STATES:
        geo = {'for': 'metropolitan statistical area/micropolitan statistical area:*',
               'in': 'state:{state_fips}'.format(state_fips=state.fips)}
        for msa in c.sf1.get(variables, geo=geo):
            yield msa

def states(variables='NAME'):
    geo = {'for': 'state:*'}
    states_fips = set([state.fips for state in us.states.STATES])
    # need to filter out non-states
    for r in c.sf1.get(variables, geo=geo):
        if r['state'] in states_fips:
            yield r
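As a quick sanity check (not part of the deliverable), you can peek at the first couple of raw records the generator yields, using the islice imported above; each record is a dict keyed by the requested variable names plus the geography codes, and the same MSA code can appear under several states:
In [ ]:
# peek at the first two raw records from the Census API
list(islice(msas(), 2))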
In [325]:
def convert_P005_to_int(df):
    # do conversion in place
    df[list(P005_vars)] = df[list(P005_vars)].astype('int')
    return df

def convert_to_rdotmap(row):
    """takes the P005 variables and maps to a Series with MSA code, Name,
    Total, White, Black, Hispanic, Asian, and Other"""
    return pd.Series({'MSAS': row['metropolitan statistical area/micropolitan statistical area'],
                      'Total': row['P0050001'],
                      'White': row['P0050003'],
                      'Black': row['P0050004'],
                      'Asian': row['P0050006'],
                      'Hispanic': row['P0050010'],
                      'Other': row['P0050005'] + row['P0050007'] + row['P0050008'] + row['P0050009'],
                      'Name': row['NAME']
                     }, index=['MSAS', 'Name', 'Total', 'White', 'Black', 'Hispanic', 'Asian', 'Other'])

def normalize(s):
    """take a Series and divide each item by the sum so that the new Series adds up to 1.0"""
    total = np.sum(s)
    return s.astype('float') / total

def entropy(series):
    """Normalized Shannon index"""
    # a series in which all the entries are equal should result in normalized entropy of 1.0
    # eliminate 0s, since 0*log(0) is taken to be 0
    series1 = series[series != 0]
    if len(series) > 1:
        # the maximum possible entropy for an input series of this length
        max_s = -np.log(1.0/len(series))
        total = float(sum(series1))
        p = series1.astype('float')/total
        return sum(-p*np.log(p))/max_s
    else:
        # a series of length 0 or 1 has no diversity
        return 0.0
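A couple of quick checks against the endpoints of the scale (an even split should score 1.0, a single group 0.0):
In [ ]:
# endpoints of the normalized scale
print entropy(Series([25, 25, 25, 25]))   # even split -> 1.0
print entropy(Series([100, 0, 0, 0]))     # one group  -> 0.0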
In [326]:
def proportion_race(df):
    # add a p_<race> column holding each race's share of the total population
    races = ["White", "Black", "Asian", "Hispanic", "Other"]
    for race in races:
        df["p_" + race] = df[[race, "Total"]].apply(lambda x: x[0]/x[1], axis=1)
    return df
In [327]:
def rice_entropy(series):
    """entropy over the first four categories only, assuming the input
    proportions were computed against the full five-group total"""
    if len(series) > 4:
        series = series[:4]
    max_s = -np.log(1.0/len(series))
    rice = -1*sum([i*np.log(i) for i in series])
    return rice/max_s
In [328]:
def gini_simpson(series):
    # probability that two randomly chosen individuals belong to different groups
    return 1 - sum([i**2 for i in series])
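Again a quick endpoint check: five equal shares of 0.2 should give 1 - 5(0.2)^2 = 0.8, and a single group gives 0.0:
In [ ]:
print gini_simpson([0.2]*5)             # -> 0.8
print gini_simpson([1.0, 0, 0, 0, 0])   # -> 0.0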
In [369]:
def diversity(r):
    """builds msas_df from the raw Census records: groups the per-state
    records by MSA code, then adds the entropy5, entropy4, p_<race>,
    entropy_rice, and gini_simpson columns"""
    df = DataFrame(r)
    df = convert_P005_to_int(df)
    df1 = df.apply(convert_to_rdotmap, axis=1)
    # combine the per-state rows belonging to the same MSA
    df1 = df1.groupby('MSAS').sum()
    df1['entropy5'] = df1[['Asian','Black','Hispanic','White','Other']].apply(entropy, axis=1)
    df1['entropy4'] = df1[['Asian','Black','Hispanic','White']].apply(entropy, axis=1)
    df1 = proportion_race(df1)
    df1['entropy_rice'] = df1[['p_Asian','p_Black','p_Hispanic','p_White','p_Other']].apply(rice_entropy, axis=1)
    df1['gini_simpson'] = df1[['p_Asian','p_Black','p_Hispanic','p_White','p_Other']].apply(gini_simpson, axis=1)
    return df1
In [356]:
def P005_range(n0, n1):
    """generate the P005 variable names P005nnnn for n0 <= i < n1"""
    return tuple(('P005' + "{i:04d}".format(i=i) for i in xrange(n0, n1)))

P005_vars = P005_range(1, 18)
P005_vars_str = ",".join(P005_vars)
P005_vars_with_name = ['NAME'] + list(P005_vars)
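The generated names run from P0050001 through P0050017; a quick look to confirm:
In [ ]:
# confirm the generated variable names and their count
print P005_vars[0], P005_vars[-1], len(P005_vars)   # P0050001 P0050017 17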
In [331]:
# Testing rice_entropy on Houston's group proportions;
# the fifth entry (Other) is dropped by the function
houston = [0.396931, 0.167970, 0.064673, 0.353032, 0.017394]
rice_entropy(houston)
Out[331]:
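Given the figures quoted above, this should come out near 0.8736, which rounds to the report's 0.874 for Houston.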
In [352]:
r=list(msas(P005_vars_with_name))
In [370]:
msas_df=diversity(r)
In [372]:
msas_df.sort_index(by='Total', ascending=False)[:10].sort_index(by='entropy_rice',
ascending=False)
Out[372]:
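If everything is wired up correctly, Houston (FIPS 26420) should sit at the top of this table, with New York (35620) a close second, matching the report's 0.874 vs. 0.872.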
In [366]:
# Testing code
def to_unicode(vals):
    return [unicode(v) for v in vals]

def test_msas_df(msas_df):
    min_set_of_columns = set(['Asian', 'Black', 'Hispanic', 'Other', 'Total', 'White',
                              'entropy4', 'entropy5', 'entropy_rice', 'gini_simpson',
                              'p_Asian', 'p_Black', 'p_Hispanic', 'p_Other', 'p_White'])
    assert min_set_of_columns & set(msas_df.columns) == min_set_of_columns
    # https://www.census.gov/geo/maps-data/data/tallies/national_geo_tallies.html
    # 366 metro areas + 576 micropolitan areas
    assert len(msas_df) == 942
    # total number of people in metro/micro areas
    assert msas_df.Total.sum() == 289261315
    assert msas_df['White'].sum() == 180912856
    assert msas_df['Other'].sum() == 8540181
    # the top 10 metros by population, listed in descending order by entropy_rice
    top_10_metros = msas_df.sort_index(by='Total', ascending=False)[:10]
    msa_codes_in_top_10_pop_sorted_by_entropy_rice = list(top_10_metros.sort_index(by='entropy_rice',
                                                          ascending=False).index)
    assert to_unicode(msa_codes_in_top_10_pop_sorted_by_entropy_rice) == [u'26420', u'35620', u'47900', u'31100', u'19100',
        u'33100', u'16980', u'12060', u'37980', u'14460']
    np.testing.assert_allclose(top_10_metros.sort_index(by='entropy_rice', ascending=False)['entropy5'],
        [0.79628076626851163, 0.80528601550164602, 0.80809418318973791, 0.7980698349711991,
         0.75945930510650161, 0.74913610558765376, 0.73683277781032397, 0.72964862063970914,
         0.64082509648457675, 0.55697288400004963])
    np.testing.assert_allclose(top_10_metros.sort_index(by='entropy_rice', ascending=False)['entropy_rice'],
        [0.87361766576115552, 0.87272877244078051, 0.85931803868749834, 0.85508015237749468,
         0.82169723530719896, 0.81953527301129059, 0.80589423784325431, 0.78602596561378812,
         0.68611350427640316, 0.56978827050565117])
In [373]:
# you are on the right track if test_msas_df doesn't complain
test_msas_df(msas_df)
In [340]:
# save your DataFrame to a CSV and upload the CSV to bCourses
msas_df.to_csv("msas_2010.csv", encoding="UTF-8")
In [341]:
# load back the CSV and test again
df = DataFrame.from_csv("msas_2010.csv", encoding="UTF-8")
test_msas_df(df)