Exercise

In this exercise, reproduce some of the findings from What Makes Houston the Next Great American City? | Travel | Smithsonian, specifically the calculation represented in the article's bar chart, whose caption reads:

To assess the parity of the four major U.S. ethnic and racial groups, Rice University researchers used a scale called the Entropy Index. It ranges from 0 (a population has just one group) to 1 (all groups are equivalent). Edging New York for the most balanced diversity, Houston had an Entropy Index of 0.874 (orange bar).

The research report cited by Smithsonian Magazine is Houston Region Grows More Racially/Ethnically Diverse, With Small Declines in Segregation: A Joint Report Analyzing Census Data from 1990, 2000, and 2010, by the Kinder Institute for Urban Research & the Hobby Center for the Study of Texas.

In the report, you'll find the following quotes:

How does Houston’s racial/ethnic diversity compare to the racial/ethnic diversity of other large metropolitan areas? The Houston metropolitan area is the most racially/ethnically diverse.

....

Houston is one of the most racially/ethnically diverse metropolitan areas in the nation as well. *It is the most diverse of the 10 largest U.S. metropolitan areas.* [emphasis mine] Unlike the other large metropolitan areas, all four major racial/ethnic groups have substantial representation in Houston with Latinos and Anglos occupying roughly equal shares of the population.

....

Houston has the highest entropy score of the 10 largest metropolitan areas, 0.874. New York is a close second with a score of 0.872.

....

Your task is:

  1. Tabulate all the metropolitan/micropolitan statistical areas. Remember that you have to group entities that show up separately in the Census API but belong to the same area. You should find 942 metropolitan/micropolitan statistical areas in the 2010 Census.

  2. Calculate the normalized Shannon index (entropy5) using the five categories of White, Black, Hispanic, Asian, and Other, as outlined in the Day_07_G_Calculating_Diversity notebook.

  3. Calculate the normalized Shannon index (entropy4) by leaving out the Other category. In other words, assume that the total population is the sum of White, Black, Hispanic, and Asian.

  4. Figure out exactly how the entropy score was calculated in the report from Rice University. Since the reported score matches neither entropy5 nor entropy4, you'll need to play around with the entropy calculation to figure out how to use 4 categories so that the score for Houston comes out to "0.874" and that for NYC to "0.872". [I think I've done so and get 0.873618 and 0.872729 respectively; see the sketch after this list.]

  5. Add a calculation of the Gini-Simpson diversity index using the five categories of White, Black, Hispanic, Asian, and Other.

  6. Note where the Bay Area stands in terms of the diversity index.
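
To make tasks 2-4 concrete, here is a minimal sketch of the normalized Shannon index. The shares below are toy numbers, not real Census figures; the full versions (entropy5, entropy4, entropy_rice) are implemented later in this notebook.

import numpy as np

def shannon_normalized(shares):
    """Normalized Shannon index: -sum(p*log(p)) / log(k) for k categories."""
    p = np.asarray(shares, dtype=float)
    p = p / p.sum()   # normalize to proportions
    p = p[p > 0]      # treat 0*log(0) as 0
    return -np.sum(p * np.log(p)) / np.log(len(shares))

toy_shares = [0.40, 0.17, 0.06, 0.35, 0.02]   # toy White/Black/Asian/Hispanic/Other mix
print(shannon_normalized(toy_shares))         # 1.0 would mean all five groups are equal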

For bonus points:

  • Make a bar chart in the style used in the Smithsonian Magazine article.

Deliverable:

  1. Upload your notebook to a gist, render it in nbviewer, and enter the nbviewer URL (e.g., http://nbviewer.ipython.org/gist/rdhyee/60b6c0b0aad7fd531938).
  2. On bCourses, upload the CSV version of your msas_df.

Hispanic or Latino Origin and Racial Subcategories

http://www.census.gov/developers/data/sf1.xml

compare to http://www.census.gov/prod/cen2010/briefs/c2010br-02.pdf

I think the P005 table might be the key one; P0050001 is its total population.

  • P0010001 = P0050001
  • P0050001 = P0050002 + P0050010

P0050002 Not Hispanic or Latino (total) =

  • P0050003 Not Hispanic White only
  • P0050004 Not Hispanic Black only
  • P0050006 Not Hispanic Asian only
  • Not Hispanic Other, which should equal P0050002 - (P0050003 + P0050004 + P0050006), i.e. the sum of:

    • P0050005 Not Hispanic: American Indian/ American Indian and Alaska Native alone
    • P0050007 Not Hispanic: Native Hawaiian and Other Pacific Islander alone
    • P0050008 Not Hispanic: Some Other Race alone
    • P0050009 Not Hispanic: Two or More Races
P0050010 Hispanic or Latino (total) = P0050011 + ... + P0050017
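
These identities are easy to sanity-check against the API for a single state. Here's a quick sketch, assuming the c = census.Census(settings.CENSUS_KEY) client created below (state 06 is California):

p005_vars = ['P005%04d' % i for i in range(1, 18)]
row = c.sf1.get(p005_vars, geo={'for': 'state:06'})[0]
total_of = lambda nums: sum(int(row['P005%04d' % i]) for i in nums)
assert int(row['P0050001']) == int(row['P0050002']) + int(row['P0050010'])
assert int(row['P0050002']) == total_of([3, 4, 5, 6, 7, 8, 9])
assert int(row['P0050010']) == total_of(range(11, 18))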

From Hispanic and Latino Americans (Wikipedia):

While the two terms are sometimes used interchangeably, Hispanic is a narrower term which mostly refers to persons of Spanish speaking origin or ancestry, while Latino is more frequently used to refer more generally to anyone of Latin American origin or ancestry, including Brazilians.

and

The Census Bureau's 2010 census does provide a definition of the terms Latino or Hispanic and is as follows: “Hispanic or Latino” refers to a person of Cuban, Mexican, Puerto Rican, South or Central American, or other Spanish culture or origin regardless of race. It allows respondents to self-define whether they were Latino or Hispanic and then identify their specific country or place of origin.[52] On its website, the Census Bureau defines "Hispanic" or "Latino" persons as being "persons who trace their origin [to]... Spanish speaking Central and South America countries, and other Spanish cultures".

In the Racial Dot Map: "Whites are coded as blue; African-Americans, green; Asians, red; Hispanics, orange; and all other racial categories are coded as brown."

In this notebook, we will relate the Racial Dot Map 5-category scheme to the P005* variables.
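
Concretely, the five-category scheme lines up with the P005 variables as follows (the same mapping that convert_to_rdotmap implements below):

RDOTMAP_TO_P005 = {
    'White':    ['P0050003'],  # Not Hispanic: White alone
    'Black':    ['P0050004'],  # Not Hispanic: Black alone
    'Asian':    ['P0050006'],  # Not Hispanic: Asian alone
    'Hispanic': ['P0050010'],  # Hispanic or Latino of any race
    'Other':    ['P0050005', 'P0050007', 'P0050008', 'P0050009'],  # remaining Not Hispanic
}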


In [1]:
%pylab --no-import-all inline


Populating the interactive namespace from numpy and matplotlib

In [2]:
import numpy as np
import matplotlib.pyplot as plt
from pandas import DataFrame, Series, Index
import pandas as pd

from itertools import islice

In [3]:
import census
import us

import settings

The census documentation has example URLs, but they need your API key to work. In this notebook, we'll use the IPython notebook HTML display mechanism to help out.
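
For example, a raw SF1 request looks like this (substitute your own key):

http://api.census.gov/data/2010/sf1?key=YOUR_KEY&get=NAME,P0010001&for=state:*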


In [4]:
c = census.Census(key=settings.CENSUS_KEY)

In [5]:
# generators for the various census geographic entities of interest

def states(variables='NAME'):
    geo={'for':'state:*'}
    states_fips = set([state.fips for state in us.states.STATES])
    # need to filter out non-states
    for r in c.sf1.get(variables, geo=geo):
        if r['state'] in states_fips:
            yield r
            
def counties(variables='NAME'):
    """ask for all the states in one call"""
    
    # tabulate a set of fips codes for the states
    states_fips = set([s.fips for s in us.states.STATES])
    
    geo={'for':'county:*',
             'in':'state:*'}    
    for county in c.sf1.get(variables, geo=geo):
        # keep only counties whose state FIPS code is in our set of states
        if county['state'] in states_fips:
            yield county
        

def counties2(variables='NAME'):
    """generator for all counties"""
    
    # since we can get all the counties in one call, 
    # this function is for demonstrating the use of walking through 
    # the states to get at the counties

    for state in us.states.STATES:
        geo={'for':'county:*',
             'in':'state:{fips}'.format(fips=state.fips)}
        for county in c.sf1.get(variables, geo=geo):
            yield county

            
def tracts(variables='NAME'):
    for state in us.states.STATES:
        
        # handy to print out state to monitor progress
        # print state.fips, state
        counties_in_state={'for':'county:*',
             'in':'state:{fips}'.format(fips=state.fips)}
        
        for county in c.sf1.get('NAME', geo=counties_in_state):
            
            # print county['state'], county['NAME']
            tracts_in_county = {'for':'tract:*',
              'in': 'state:{s_fips} county:{c_fips}'.format(s_fips=state.fips, 
                                                            c_fips=county['county'])}
            
            for tract in c.sf1.get(variables,geo=tracts_in_county):
                yield tract


def msas(variables="NAME"):
    
     for state in us.STATES:
        geo = {'for':'metropolitan statistical area/micropolitan statistical area:*', 
               'in':'state:{state_fips}'.format(state_fips=state.fips)
               }
    
        for msa in c.sf1.get(variables, geo=geo):
            yield msa
            
def block_groups(variables='NAME'):
    # http://api.census.gov/data/2010/sf1?get=P0010001&for=block+group:*&in=state:02+county:170
    # let's use the county generator
    for county in counties(variables):
        geo = {'for':'block group:*',
               'in':'state:{state} county:{county}'.format(state=county['state'],
                                                county=county['county'])
               }
        for block_group in c.sf1.get(variables, geo):
            yield block_group
    
    
def blocks(variables='NAME'):
    # http://api.census.gov/data/2010/sf1?get=P0010001&for=block:*&in=state:02+county:290+tract:00100
    
    # make use of the tract generator
    for tract in tracts(variables):
        geo={'for':'block:*',
             'in':'state:{state} county:{county} tract:{tract}'.format(state=tract['state'],
                                                                       county=tract['county'],
                                                                       tract=tract['tract'])
             }
        for block in c.sf1.get(variables, geo):
            yield block
        
def csas(variables="NAME"):
    # http://api.census.gov/data/2010/sf1?get=P0010001&for=combined+statistical+area:*&in=state:24
    for state in us.STATES:
        geo = {'for':'combined statistical area:*', 
               'in':'state:{state_fips}'.format(state_fips=state.fips)
               }
    
        for csa in c.sf1.get(variables, geo=geo):
            yield csa

def districts(variables="NAME"):
    # http://api.census.gov/data/2010/sf1?get=P0010001&for=congressional+district:*&in=state:24
    for state in us.STATES:
        geo = {'for':'congressional district:*', 
               'in':'state:{state_fips}'.format(state_fips=state.fips)
               }
    
        for district in c.sf1.get(variables, geo=geo):
            yield district    
            
def zip_code_tabulation_areas(variables="NAME"):
    # http://api.census.gov/data/2010/sf1?get=P0010001&for=zip+code+tabulation+area:*&in=state:02
    for state in us.STATES:
        geo = {'for':'zip code tabulation area:*', 
               'in':'state:{state_fips}'.format(state_fips=state.fips)
               }
    
        for zip_code_tabulation_area in c.sf1.get(variables, geo=geo):
            yield zip_code_tabulation_area
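
Since these are all lazy generators, islice (imported above) gives a cheap way to spot-check a few rows without walking a whole level of the hierarchy; for example:

# peek at the first three census tracts; only the first county's tracts are fetched
list(islice(tracts('NAME,P0010001'), 3))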

In [6]:
def census_labels(prefix='P005', n0=1, n1=17, field_width=4, include_name=True, join=False):
    """convenience function to generate census labels"""
    
    label_format = "{i:0%dd}" % (field_width)
    
    variables = [prefix + label_format.format(i=i) for i in xrange(n0,n1+1)]
    if include_name:
        variables = ['NAME'] + variables

    if join:
        return ",".join(variables)
    else:
        return variables
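
# e.g. census_labels() -> ['NAME', 'P0050001', 'P0050002', ..., 'P0050017']
# and census_labels(join=True) -> 'NAME,P0050001,...,P0050017' (handy for raw API URLs)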

def rdot_labels(other=True):
    if other:
        return ['White', 'Black', 'Asian', 'Hispanic', 'Other']
    else:
        return ['White', 'Black', 'Asian', 'Hispanic']
    
FINAL_LABELS = ['NAME', 'Total'] + rdot_labels() + ['p_White', 'p_Black', 'p_Asian', 'p_Hispanic', 'p_Other'] + ['entropy5', 'entropy4', 'entropy_rice', 'gini_simpson']
    
def convert_to_rdotmap(row):
    """takes the P005 variables and maps to a series with White, Black, Asian, Hispanic, Other
    Total"""
    return pd.Series({'Total':row['P0050001'],
                      'White':row['P0050003'],
                      'Black':row['P0050004'],
                      'Asian':row['P0050006'],
                      'Hispanic':row['P0050010'],
                      'Other': row['P0050005'] + row['P0050007'] + row['P0050008'] + row['P0050009'],
                      }, index=['Total', 'White', 'Black', 'Hispanic', 'Asian', 'Other'])


def normalize(s):
    """take a Series and divide each item by the sum so that the new series adds up to 1.0"""
    total = np.sum(s)
    return s.astype('float') / total
    
def normalize_relabel(s):
    """take a Series and divide each item by the sum so that the new series adds up to 1.0
    Also relabel the indices by adding p_ prefix"""
    total = np.sum(s)
    new_index = list(Series(s.index).apply(lambda x: "p_"+x))
    return Series(list(s.astype('float') / total),new_index)

def entropy(series):
    """Normalized Shannon Index"""
    # a series in which all the entries are equal should result in normalized entropy of 1.0
    
    # eliminate 0s
    series1 = series[series!=0]

    # if there are fewer than 2 nonzero categories, the entropy is 0
    
    if len(series1) > 1:
        # calculate the maximum possible entropy for given length of input series
        max_s = -np.log(1.0/len(series))
    
        total = float(sum(series1))
        p = series1.astype('float')/float(total)
        return sum(-p*np.log(p))/max_s
    else:
        return 0.0
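
# e.g. entropy(Series([1, 1, 1, 1])) == 1.0 (four equal groups)
# and entropy(Series([1, 0, 0, 0])) == 0.0 (a single group, no diversity)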

def gini_simpson(s):
    # https://en.wikipedia.org/wiki/Diversity_index#Gini.E2.80.93Simpson_index
    s1 = normalize(s)
    return 1-np.sum(s1*s1)
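
# e.g. gini_simpson(Series([1, 1, 1, 1])) == 0.75: the probability that two
# randomly drawn individuals belong to different groups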

def entropy_rice(series):
    """hard code how Rice U did calculation 
    This function takes the entropy5 calculation and removes the contribution from 'Other'
    """
    # pass in a Series with 
    # 'Asian','Black','Hispanic','White','Other'
    # http://kinder.rice.edu/uploadedFiles/Urban_Research_Center/Media/Houston%20Region%20Grows%20More%20Ethnically%20Diverse%202-13.pdf

    s0 = normalize(series)
    s_other = s0['Other']*np.log(s0['Other']) if s0['Other'] > 0 else 0.0
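    # how this works: entropy(series) is H5/log(5) and np.log(0.2) == -log(5), so
    # np.log(0.2)*entropy(series) == -H5; subtracting s_other leaves sum(p*log(p))
    # over the four non-Other groups (shares still normalized over all five), and
    # dividing by np.log(0.25) == -log(4) rescales against a four-group maximum.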
    return (np.log(0.2)*entropy(series) - s_other)/np.log(0.25)

def diversity(df):
    """Takes a df with the P005 variables and does entropy calculation"""
    # convert populations to int
    df[census_labels(include_name=False)] = df[census_labels(include_name=False)].astype('int')
    df = pd.concat((df, df.apply(convert_to_rdotmap, axis=1)),axis=1)
    df = pd.concat((df,df[rdot_labels()].apply(normalize_relabel,axis=1)), axis=1)
    df['entropy5'] = df.apply(lambda x:entropy(x[rdot_labels()]), axis=1)
    df['entropy4'] = df.apply(lambda x:entropy(x[rdot_labels(other=False)]), axis=1)
    df['entropy_rice'] = df.apply(lambda x:entropy_rice(x[rdot_labels()]), axis=1)
    df['gini_simpson'] = df.apply(lambda x:gini_simpson(x[rdot_labels()]), axis=1)
    return df

States


In [7]:
# grab states, convert populations to int
states_df = DataFrame(list(states(census_labels())))
states_df = diversity(states_df)

In [8]:
states_df[FINAL_LABELS].head()


Out[8]:
NAME Total White Black Asian Hispanic Other p_White p_Black p_Asian p_Hispanic p_Other entropy5 entropy4 entropy_rice gini_simpson
0 Alabama 4779736 3204402 1244437 52937 185602 92358 0.670414 0.260357 0.011075 0.038831 0.019323 0.541001 0.570292 0.573075 0.480755
1 Alaska 710231 455320 21949 37459 39249 156254 0.641087 0.030904 0.052742 0.055262 0.220004 0.646677 0.475235 0.510480 0.533815
2 Arizona 6392017 3695647 239101 170509 1895149 391611 0.578166 0.037406 0.026675 0.296487 0.061266 0.663524 0.643529 0.646914 0.571955
3 Arkansas 2915918 2173469 447102 35647 186050 73650 0.745381 0.153331 0.012225 0.063805 0.025258 0.515025 0.526205 0.530902 0.416039
4 California 37253956 14956253 2163804 4775070 14013719 1345110 0.401468 0.058083 0.128176 0.376167 0.036107 0.796994 0.843670 0.838778 0.676216

5 rows × 16 columns


In [9]:
states_df.sort_index(by='entropy5', ascending=False)[FINAL_LABELS].head()


Out[9]:
NAME Total White Black Asian Hispanic Other p_White p_Black p_Asian p_Hispanic p_Other entropy5 entropy4 entropy_rice gini_simpson
11 Hawaii 1360301 309343 19904 513294 120842 396918 0.227408 0.014632 0.377339 0.088835 0.291787 0.833108 0.750762 0.707954 0.712656
4 California 37253956 14956253 2163804 4775070 14013719 1345110 0.401468 0.058083 0.128176 0.376167 0.036107 0.796994 0.843670 0.838778 0.676216
28 Nevada 2700551 1462081 208058 191047 716501 122864 0.541401 0.077043 0.070744 0.265317 0.045496 0.751622 0.774363 0.771193 0.623482
32 New York 19378102 11304247 2783857 1406194 3416922 466882 0.583352 0.143660 0.072566 0.176329 0.024093 0.732727 0.787727 0.785917 0.602124
43 Texas 25145561 11397345 2886825 948426 9460921 452044 0.453255 0.114805 0.037717 0.376246 0.017977 0.727466 0.793870 0.792449 0.638073

5 rows × 16 columns

Counties


In [10]:
r = list(counties(census_labels()))

In [11]:
counties_df = DataFrame(r)
counties_df = diversity(counties_df)
counties_df[FINAL_LABELS].head()


Out[11]:
NAME Total White Black Asian Hispanic Other p_White p_Black p_Asian p_Hispanic p_Other entropy5 entropy4 entropy_rice gini_simpson
0 Autauga County 54571 42154 9595 467 1310 1045 0.772462 0.175826 0.008558 0.024005 0.019149 0.441816 0.453294 0.458294 0.371372
1 Baldwin County 182265 152200 16966 1340 7992 3767 0.835048 0.093084 0.007352 0.043848 0.020668 0.388299 0.386196 0.392968 0.291627
2 Barbour County 27457 12837 12820 107 1387 306 0.467531 0.466912 0.003897 0.050515 0.011145 0.580086 0.636407 0.637309 0.560717
3 Bibb County 22915 17191 5024 22 406 272 0.750207 0.219245 0.000960 0.017718 0.011870 0.421943 0.448712 0.451897 0.388665
4 Blount County 57322 50952 724 115 4626 905 0.888873 0.012630 0.002006 0.080702 0.015788 0.274015 0.263741 0.270876 0.202978

5 rows × 16 columns


In [12]:
counties_df.sort_index(by='entropy5', ascending=False)[FINAL_LABELS].head()


Out[12]:
NAME Total White Black Asian Hispanic Other p_White p_Black p_Asian p_Hispanic p_Other entropy5 entropy4 entropy_rice gini_simpson
1868 Queens County 2230722 616727 395881 508334 613750 96030 0.276470 0.177468 0.227879 0.275135 0.043049 0.925644 0.989171 0.976964 0.762589
68 Aleutians West Census Area 5561 1745 318 1575 726 1197 0.313792 0.057184 0.283222 0.130552 0.215249 0.920216 0.882623 0.829850 0.754673
186 Alameda County 1510271 514559 184126 390524 339889 81173 0.340706 0.121916 0.258579 0.225052 0.053747 0.910834 0.957875 0.944102 0.748656
233 Solano County 413344 168628 58743 59027 99356 27590 0.407960 0.142116 0.142804 0.240371 0.066748 0.897416 0.926901 0.911537 0.730745
67 Aleutians East Borough 3141 425 212 1113 385 1006 0.135307 0.067494 0.354346 0.122572 0.320280 0.896064 0.864996 0.777253 0.733972

5 rows × 16 columns

MSAs


In [13]:
# msas: the API returns one row per (MSA, state) pair, so multi-state
# metros like New York appear several times; we group them by MSA code below

r = list(msas(census_labels()))

In [14]:
len(r)


Out[14]:
1013

In [15]:
df=DataFrame(r)
df[census_labels(include_name=False)] = df[census_labels(include_name=False)].astype('int')

msas_grouped = df.groupby('metropolitan statistical area/micropolitan statistical area')

#df1 = msas_grouped.apply(lambda x:Series((list(x['NAME']), sum(x['P0050001'])), index=['msas','total_pop'])).sort_index(by='total_pop', ascending=False)
df1 = msas_grouped.apply(lambda x:Series((list(x['NAME']), ), 
                                         index=['msas']))


df2 = msas_grouped.sum()
df3 = pd.concat((df1,df2), axis=1)
df3['NAME'] = df3.apply(lambda x: "; ".join(x['msas']), axis=1)

In [16]:
msas_df = diversity(df3)

In [17]:
# grab the ten most populous msas and sort by entropy_rice
msas_df.sort_index(by='Total', ascending=False)[:10].sort_index(by='entropy_rice', ascending=False)[FINAL_LABELS]


Out[17]:
NAME Total White Black Asian Hispanic Other p_White p_Black p_Asian p_Hispanic p_Other entropy5 entropy4 entropy_rice gini_simpson
metropolitan statistical area/micropolitan statistical area
26420 Houston-Sugar Land-Baytown, TX Metro Area 5946800 2360472 998883 384596 2099412 103437 0.396931 0.167970 0.064673 0.353032 0.017394 0.796281 0.876425 0.873618 0.685115
35620 New York-Northern New Jersey-Long Island, NY-N... 18897109 9233812 3044096 1860840 4327560 430801 0.488636 0.161088 0.098472 0.229006 0.022797 0.805286 0.876454 0.872729 0.672625
47900 Washington-Arlington-Alexandria, DC-VA-MD-WV M... 5582170 2711258 1409473 513919 770795 176725 0.485700 0.252496 0.092064 0.138082 0.031659 0.808094 0.864206 0.859318 0.671797
31100 Los Angeles-Long Beach-Santa Ana, CA Metro Area 12828837 4056820 859086 1858148 5700862 353921 0.316227 0.066965 0.144842 0.444379 0.027588 0.798070 0.859159 0.855080 0.676304
19100 Dallas-Fort Worth-Arlington, TX Metro Area 6371773 3201677 941695 337815 1752166 138420 0.502478 0.147792 0.053017 0.274989 0.021724 0.759459 0.824101 0.821697 0.646772
33100 Miami-Fort Lauderdale-Pompano Beach, FL Metro ... 5564635 1937939 1096536 122082 2312929 95149 0.348260 0.197054 0.021939 0.415648 0.017099 0.749136 0.821351 0.819535 0.666348
16980 Chicago-Joliet-Naperville, IL-IN-WI Metro Area... 9461105 5204489 1613644 526857 1957080 159035 0.550093 0.170556 0.055687 0.206855 0.016809 0.736833 0.807444 0.805894 0.622136
12060 Atlanta-Sandy Springs-Marietta, GA Metro Area 5268860 2671757 1679979 252510 547400 117214 0.507084 0.318851 0.047925 0.103893 0.022247 0.729649 0.787682 0.786026 0.627614
37980 Philadelphia-Camden-Wilmington, PA-NJ-DE-MD Me... 5965343 3875845 1204303 293656 468168 123371 0.649727 0.201883 0.049227 0.078481 0.020681 0.640825 0.685528 0.686114 0.528088
14460 Boston-Cambridge-Quincy, MA-NH Metro Area (par... 4552402 3408585 301533 292786 410516 138982 0.748744 0.066236 0.064315 0.090176 0.030529 0.556973 0.565366 0.569788 0.421795

10 rows × 16 columns


In [18]:
# Testing code

def to_unicode(vals):
    return [unicode(v) for v in vals]

def test_msas_df(msas_df):

    min_set_of_columns =  set(['Asian','Black','Hispanic', 'Other', 'Total', 'White',
     'entropy4', 'entropy5', 'entropy_rice', 'gini_simpson','p_Asian', 'p_Black',
     'p_Hispanic', 'p_Other','p_White'])  
    
    assert min_set_of_columns & set(msas_df.columns) == min_set_of_columns
    
    # https://www.census.gov/geo/maps-data/data/tallies/national_geo_tallies.html
    # 366 metro areas
    # 576 micropolitan areas
    
    assert len(msas_df) == 942  
    
    # total number of people in metro/micro areas
    
    assert msas_df.Total.sum() == 289261315
    assert msas_df['White'].sum() == 180912856
    assert msas_df['Other'].sum() == 8540181
    
    # MSA codes of the ten most populous metros, in descending order by entropy_rice
    top_10_metros = msas_df.sort_index(by='Total', ascending=False)[:10]
    msa_codes_in_top_10_pop_sorted_by_entropy_rice = list(top_10_metros.sort_index(by='entropy_rice', 
                                                ascending=False).index) 
    
    assert to_unicode(msa_codes_in_top_10_pop_sorted_by_entropy_rice)== [u'26420', u'35620', u'47900', u'31100', u'19100', 
        u'33100', u'16980', u'12060', u'37980', u'14460']


    top_10_metro = msas_df.sort_index(by='Total', ascending=False)[:10]
    
    
    np.testing.assert_allclose(top_10_metro.sort_index(by='entropy_rice', ascending=False)['entropy5'], 
    [0.79628076626851163, 0.80528601550164602, 0.80809418318973791, 0.7980698349711991,
     0.75945930510650161, 0.74913610558765376, 0.73683277781032397, 0.72964862063970914,
     0.64082509648457675, 0.55697288400004963])
    
    np.testing.assert_allclose(top_10_metro.sort_index(by='entropy_rice', ascending=False)['entropy_rice'],
    [0.87361766576115552,
     0.87272877244078051,
     0.85931803868749834,
     0.85508015237749468,
     0.82169723530719896,
     0.81953527301129059,
     0.80589423784325431,
     0.78602596561378812,
     0.68611350427640316,
     0.56978827050565117])

In [19]:
# you are on the right track if test_msas_df doesn't complain
test_msas_df(msas_df)

In [20]:
# code to save your dataframe to a CSV
# upload the CSV to bCourses
# uncomment to run
# msas_df.to_csv("msas_2010.csv", encoding="UTF-8")

In [21]:
# load back the CSV and test again
# df = DataFrame.from_csv("msas_2010.csv", encoding="UTF-8")
# test_msas_df(df)

Appendix: what if we used all the P005 categories?


In [22]:
all_categories = census_labels('P005',2,10, include_name=False) + \
                 census_labels('P005',11,17, include_name=False)
all_categories


Out[22]:
['P0050002',
 'P0050003',
 'P0050004',
 'P0050005',
 'P0050006',
 'P0050007',
 'P0050008',
 'P0050009',
 'P0050010',
 'P0050011',
 'P0050012',
 'P0050013',
 'P0050014',
 'P0050015',
 'P0050016',
 'P0050017']

In [23]:
msas_df['entropy_all'] = msas_df.apply(lambda x:entropy(x[all_categories]), axis=1)

In [24]:
msas_df.sort_index(by='entropy_all', ascending=False)[FINAL_LABELS + ['entropy_all']][:20]


Out[24]:
NAME Total White Black Asian Hispanic Other p_White p_Black p_Asian p_Hispanic p_Other entropy5 entropy4 entropy_rice gini_simpson entropy_all
metropolitan statistical area/micropolitan statistical area
44700 Stockton, CA Metro Area 685306 245919 48540 94547 266341 29959 0.358846 0.070830 0.137963 0.388645 0.043716 0.828052 0.869824 0.862634 0.694223 0.695439
31100 Los Angeles-Long Beach-Santa Ana, CA Metro Area 12828837 4056820 859086 1858148 5700862 353921 0.316227 0.066965 0.144842 0.444379 0.027588 0.798070 0.859159 0.855080 0.676304 0.689053
23420 Fresno, CA Metro Area 930450 304522 45005 86856 468070 25997 0.327285 0.048369 0.093348 0.503058 0.027940 0.732562 0.780302 0.778371 0.627984 0.685842
40140 Riverside-San Bernardino-Ontario, CA Metro Area 4224851 1546666 301523 249899 1996402 130361 0.366088 0.071369 0.059150 0.472538 0.030856 0.736344 0.779591 0.777447 0.633143 0.680298
25260 Hanford-Corcoran, CA Metro Area 152982 53879 10314 5339 77866 5584 0.352192 0.067420 0.034900 0.508988 0.036501 0.702746 0.729483 0.728699 0.609796 0.676985
32900 Merced, CA Metro Area 255793 81599 8785 18183 140485 6741 0.319004 0.034344 0.071085 0.549214 0.026353 0.679215 0.719629 0.719421 0.589674 0.675128
41500 Salinas, CA Metro Area 415057 136435 11300 23777 230003 13542 0.328714 0.027225 0.057286 0.554148 0.032627 0.662618 0.688024 0.688723 0.579780 0.672056
46700 Vallejo-Fairfield, CA Metro Area 413344 168628 58743 59027 99356 27590 0.407960 0.142116 0.142804 0.240371 0.066748 0.897416 0.926901 0.911537 0.730745 0.670699
12540 Bakersfield-Delano, CA Metro Area 839631 323794 45377 33100 413033 24327 0.385638 0.054044 0.039422 0.491922 0.028973 0.686088 0.722859 0.722509 0.603981 0.668019
26420 Houston-Sugar Land-Baytown, TX Metro Area 5946800 2360472 998883 384596 2099412 103437 0.396931 0.167970 0.064673 0.353032 0.017394 0.796281 0.876425 0.873618 0.685115 0.661636
31460 Madera-Chowchilla, CA Metro Area 150865 57380 5009 2533 80992 4951 0.380340 0.033202 0.016790 0.536851 0.032817 0.618488 0.634707 0.637158 0.564671 0.659675
17500 Clewiston, FL Micro Area 39140 13650 5057 275 19243 915 0.348748 0.129203 0.007026 0.491645 0.023378 0.685630 0.733128 0.732653 0.619370 0.658762
10740 Albuquerque, NM Metro Area 887077 374214 19766 16769 414222 62106 0.421851 0.022282 0.018904 0.466952 0.070012 0.662122 0.629810 0.634408 0.598243 0.658328
29820 Las Vegas-Paradise, NV Metro Area 1951269 935955 194821 165121 568644 86728 0.479665 0.099843 0.084622 0.291423 0.044447 0.800982 0.835903 0.830088 0.665889 0.654037
33700 Modesto, CA Metro Area 514453 240423 13065 24712 215658 20595 0.467337 0.025396 0.048035 0.419199 0.040033 0.675950 0.691203 0.691824 0.601313 0.653970
24380 Grants, NM Micro Area 27213 5857 221 136 9934 11065 0.215228 0.008121 0.004998 0.365046 0.406607 0.702078 0.552323 0.551140 0.654998 0.652172
41740 San Diego-Carlsbad-San Marcos, CA Metro Area 3095313 1500047 146600 328058 991348 129260 0.484619 0.047362 0.105985 0.320274 0.041760 0.764654 0.795817 0.792070 0.647349 0.652044
41940 San Jose-Sunnyvale-Santa Clara, CA Metro Area 1836911 648063 42686 566764 510396 69002 0.352800 0.023238 0.308542 0.277856 0.037564 0.805816 0.852024 0.846600 0.701179 0.649439
47300 Visalia-Porterville, CA Metro Area 442179 143935 5497 14204 268065 10478 0.325513 0.012432 0.032123 0.606236 0.023696 0.573133 0.598715 0.601417 0.524771 0.647226
41860 San Francisco-Oakland-Fremont, CA Metro Area 4335391 1840372 349895 994616 938794 211714 0.424500 0.080707 0.229418 0.216542 0.048834 0.859532 0.901183 0.891526 0.711379 0.645037

20 rows × 17 columns


In [25]:
msas_df.sort_index(by='P0050001', ascending=False).head()


Out[25]:
msas P0050001 P0050002 P0050003 P0050004 P0050005 P0050006 P0050007 P0050008 P0050009 P0050010 P0050011 P0050012 P0050013 P0050014 P0050015 P0050016 P0050017 NAME Total
metropolitan statistical area/micropolitan statistical area
35620 [New York-Northern New Jersey-Long Island, NY-... 18897109 14569549 9233812 3044096 31377 1860840 4859 93753 300812 4327560 1943852 318520 61255 17421 3729 1670891 311892 New York-Northern New Jersey-Long Island, NY-N... 18897109 ...
31100 [Los Angeles-Long Beach-Santa Ana, CA Metro Area] 12828837 7127975 4056820 859086 25102 1858148 30821 30960 267038 5700862 2710537 48532 65858 26521 4627 2545313 299474 Los Angeles-Long Beach-Santa Ana, CA Metro Area 12828837 ...
16980 [Chicago-Joliet-Naperville, IL-IN-WI Metro Are... 9461105 7504025 5204489 1613644 12777 526857 1975 13026 131257 1957080 979392 32349 23748 5944 986 815750 98911 Chicago-Joliet-Naperville, IL-IN-WI Metro Area... 9461105 ...
19100 [Dallas-Fort Worth-Arlington, TX Metro Area] 6371773 4619607 3201677 941695 24758 337815 5431 9049 99182 1752166 959603 20176 18632 3688 765 668721 80581 Dallas-Fort Worth-Arlington, TX Metro Area 6371773 ...
37980 [Philadelphia-Camden-Wilmington, PA-NJ-DE-MD M... 5965343 5497175 3875845 1204303 9541 293656 1563 10971 101296 468168 192506 37477 6799 2110 653 191036 37587 Philadelphia-Camden-Wilmington, PA-NJ-DE-MD Me... 5965343 ...

5 rows × 35 columns

Plot courtesy of AJ


In [26]:
top_10_metros = msas_df.sort_index(by='Total', ascending=False)[:10]
top_10_metros['City'] = top_10_metros['NAME'].apply(lambda name: name.split('-')[0])
top_10_metros.sort(columns=['entropy_rice'], inplace=True, ascending=True)

cities = pd.Series(top_10_metros['City'])

diversity_vals = pd.Series(top_10_metros['entropy_rice'])  # renamed to avoid shadowing the diversity() function above

p_white = pd.Series(top_10_metros['p_White'])
p_asian = pd.Series(top_10_metros['p_Asian'])
p_black = pd.Series(top_10_metros['p_Black'])
p_latino = pd.Series(top_10_metros['p_Hispanic'])

In [27]:
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(10, 8))
ax = plt.subplot(111)

# y axis locations for diversity and races
y_div = np.arange(len(cities))*2
y_race = (np.arange(len(cities))*2)+1

# diversity bars
pDiversity = ax.barh(y_div, diversity_vals, alpha=0.4)

# stacked horizontal bars
pWhite = ax.barh(y_race, p_white, color='b')
pLatino = ax.barh(y_race, p_latino, color='g', left=p_white)
pBlack = ax.barh(y_race, p_black, color='r', left=p_white+p_latino)
pAsian = ax.barh(y_race, p_asian, color='c', left=p_white+p_latino+p_black)

plt.yticks(y_race, cities)

# legend foo https://stackoverflow.com/questions/4700614/how-to-put-the-legend-out-of-the-plot
# Shrink the current axis's height on the bottom to make room for the legend
box = ax.get_position()
ax.set_position([box.x0, box.y0 + box.height * 0.1,
                 box.width, box.height * 0.85])

# Put a legend below current axis
ax.legend((pWhite, pLatino, pBlack, pAsian, pDiversity), ('White', 'Latino', 'Black', 'Asian', 'Diversity'),
          loc='upper center', bbox_to_anchor=(0.5, -0.05),
          fancybox=True, shadow=True, ncol=5)

plt.show()

# If you want to save it (filename is just an example):
# fig.savefig('diversity_top10.png', bbox_inches='tight')