Exercise

Look at the Smithsonian Magazine article "What Makes Houston the Next Great American City?", specifically the calculation represented in the chart [figure not shown] whose caption is:

To assess the parity of the four major U.S. ethnic and racial groups, Rice University researchers used a scale called the Entropy Index. It ranges from 0 (a population has just one group) to 1 (all groups are equivalent). Edging New York for the most balanced diversity, Houston had an Entropy Index of 0.874 (orange bar).

The research report cited by Smithsonian Magazine is "Houston Region Grows More Racially/Ethnically Diverse, With Small Declines in Segregation: A Joint Report Analyzing Census Data from 1990, 2000, and 2010" by the Kinder Institute for Urban Research & the Hobby Center for the Study of Texas.

In the report, you'll find the following quotes:

How does Houston’s racial/ethnic diversity compare to the racial/ethnic diversity of other large metropolitan areas? The Houston metropolitan area is the most racially/ethnically diverse.

....

Houston is one of the most racially/ethnically diverse metropolitan areas in the nation as well. *It is the most diverse of the 10 largest U.S. metropolitan areas.* [emphasis mine] Unlike the other large metropolitan areas, all four major racial/ethnic groups have substantial representation in Houston with Latinos and Anglos occupying roughly equal shares of the population.

....

Houston has the highest entropy score of the 10 largest metropolitan areas, 0.874. New York is a close second with a score of 0.872.

....

Tasks in this notebook:

  1. Tabulate all the metropolitan/micropolitan statistical areas. Remember that you have to group various entities that show up separately in the Census API but which belong to the same area. You should find 942 metropolitan/micropolitan statistical areas in the 2010 Census.

  2. Calculate the normalized Shannon index (entropy5) using the categories of White, Black, Hispanic, Asian, and Other, as outlined in the Day_07_G_Calculating_Diversity notebook.

  3. Calculate the normalized Shannon index (entropy4) by not considering the Other category. In other words, assume that the total population is the sum of White, Black, Hispanic, and Asian.

  4. Figure out exactly how the entropy score was calculated in the report from Rice University. Since you'll find that the reported entropy score matches neither entropy5 nor entropy4, you'll need to play around with the entropy calculation to figure out how to use 4 categories to get the score for Houston to come out to "0.874" and that for NYC to be "0.872". [I think I've done so and get 0.873618 and 0.872729 respectively.]

  5. Add a calculation of the Gini-Simpson diversity index using the five categories of White, Black, Hispanic, Asian, and Other.

  6. Note where the Bay Area stands in terms of the diversity index.

  7. Make a bar chart in the style used in the Smithsonian Magazine.
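
The two index families named in tasks 2 and 5 can be sketched in plain Python before touching any Census data. This is a minimal sketch, not the notebook's solution: the function names `normalized_shannon` and `gini_simpson` and the toy counts are assumptions made for illustration. Shannon entropy is divided by the log of the number of categories, so an even split scores 1 and a single-group population scores 0.

```python
import math

def normalized_shannon(counts):
    """Shannon entropy of the shares, divided by log(number of categories)."""
    k = len(counts)
    total = float(sum(counts))
    if k < 2 or total == 0:
        return 0.0
    # skip empty categories: p*log(p) tends to 0 as p tends to 0
    shares = [c / total for c in counts if c > 0]
    h = sum(-p * math.log(p) for p in shares)
    return h / math.log(k)

def gini_simpson(counts):
    """1 - sum(p**2): chance that two random draws (with replacement) differ in group."""
    total = float(sum(counts))
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)

even = [25, 25, 25, 25]    # hypothetical: four equal groups
single = [100, 0, 0, 0]    # hypothetical: one group only

print(normalized_shannon(even), gini_simpson(even))
print(normalized_shannon(single), gini_simpson(single))
```

Unlike the normalized Shannon index, the Gini-Simpson value for an even split is 1 - 1/K rather than 1, so the two scales are not directly comparable.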


In [25]:
# FILL IN WITH YOUR CODE

import census
import settings
import us


import time
from pandas import DataFrame, Series, Index
import pandas as pd
from itertools import islice
import numpy as np


c = census.Census(key=settings.CENSUS_KEY)

In [26]:
def msas(variables="NAME"):
    """Yield one record per (area, state) pair by querying the SF1 API state by state."""
    for state in us.STATES:
        geo = {'for':'metropolitan statistical area/micropolitan statistical area:*', 
               'in':'state:{state_fips}'.format(state_fips=state.fips)
               }
    
        for msa in c.sf1.get(variables, geo=geo):
            yield msa

In [27]:
def P005_range(n0,n1): 
    return tuple(('P005'+ "{i:04d}".format(i=i) for i in xrange(n0,n1)))

P005_vars = P005_range(1,18)
P005_vars_str = ",".join(P005_vars)
P005_vars_with_name = ['NAME'] + list(P005_vars)


# http://manishamde.github.io/blog/2013/03/07/pandas-and-python-top-10/#create
def convert_to_rdotmap(row):
    """takes the P005 variables and maps to a series with White, Black, Asian, Hispanic, Other
    Total and Name"""
    return pd.Series({'Total':row['P0050001'],
                      'White':row['P0050003'],
                      'Black':row['P0050004'],
                      'Asian':row['P0050006'],
                      'Hispanic':row['P0050010'],
                      'Other': row['P0050005'] + row['P0050007'] + row['P0050008'] + row['P0050009'],
                      }, index=['Total', 'White', 'Black', 'Hispanic', 'Asian', 'Other'])


def normalize(s):
    """take a Series and divide each item by the sum so that the new series adds up to 1.0"""
    total = np.sum(s)
    return s.astype('float') / total


def entropy(series):
    """Normalized Shannon Index"""
    # a series in which all the entries are equal should result in normalized entropy of 1.0
    
    # eliminate 0s, since 0*log(0) is taken to be 0
    series1 = series[series!=0]

    # with fewer than 2 categories there is nothing to compare, so return 0
    
    if len(series) > 1:
        # maximum possible entropy for the number of categories in the input
        # (empty categories still count toward the maximum)
        max_s = -np.log(1.0/len(series))
    
        total = float(sum(series1))
        p = series1.astype('float')/float(total)
        return sum(-p*np.log(p))/max_s
    else:
        return 0.0

    
def convert_P005_to_int(df):
    # do conversion in place
    df[list(P005_vars)] = df[list(P005_vars)].astype('int')
#     df.NAME = df.NAME

    return df
    

def diversity(r):

    """Returns a DataFrame with the following columns:
    Total, White, Black, Hispanic, Asian, Other, entropy5, entropy4
    """
    df = DataFrame(r)
    df = convert_P005_to_int(df)
    df1 = df.apply(convert_to_rdotmap, axis=1)
    
    df1['entropy5'] = df1[['Asian','Black','Hispanic','White','Other']].apply(entropy,axis=1)
    df1['entropy4'] = df1[['Asian','Black','Hispanic','White']].apply(entropy,axis=1)
    return df1

In [28]:
msa_list = list(islice(msas(P005_vars_with_name), None))
len(msa_list)


Out[28]:
1013
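
The 1013 rows returned above exceed the 942 areas expected because the query runs once per state, so an area that straddles a state line comes back once for each state it touches. Summing the counts by area code merges those pieces. A minimal sketch with made-up rows (the area codes, state codes, populations, and the helper name `total_by_msa` are all hypothetical):

```python
from collections import defaultdict

# Hypothetical API rows: area "A" spans two states, area "B" lies in one.
rows = [
    {"msa": "A", "state": "01", "P0050001": "1000"},
    {"msa": "A", "state": "02", "P0050001": "250"},
    {"msa": "B", "state": "02", "P0050001": "400"},
]

def total_by_msa(rows, variable="P0050001"):
    """Sum one count variable over all state pieces of each area."""
    totals = defaultdict(int)
    for row in rows:
        # counts are converted from strings before summing
        totals[row["msa"]] += int(row[variable])
    return dict(totals)

merged = total_by_msa(rows)
print(len(rows), len(merged))
```

In pandas the same step is a `groupby` on the area-code column followed by `sum()`.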

In [29]:
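dr = DataFrame(msa_list)
dr = convert_P005_to_int(dr)

# an area that crosses state lines appears once per state;
# summing by area code merges those rows into a single area
grouped = dr.groupby('metropolitan statistical area/micropolitan statistical area').sum()
grouped.head()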


Out[29]:
P0050001 P0050002 P0050003 P0050004 P0050005 P0050006 P0050007 P0050008 P0050009 P0050010 P0050011 P0050012 P0050013 P0050014 P0050015 P0050016 P0050017
metropolitan statistical area/micropolitan statistical area
10020 57999 56618 46305 8246 195 1148 4 53 667 1381 617 40 14 12 1 550 147
10100 40602 40048 37774 192 1095 358 42 17 570 554 264 7 28 2 0 152 101
10140 72797 66525 59282 762 3005 995 177 72 2232 6272 2543 41 320 37 5 2745 581
10180 165252 130144 112735 11549 655 2110 113 170 2812 35108 18773 635 429 99 19 13098 2055
10220 37492 35969 25973 879 6331 244 17 12 2513 1523 714 23 181 0 0 413 192

5 rows × 17 columns
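
As a quick sanity check on the five-category scheme used in `convert_to_rdotmap`, the sketch below recomputes the categories for the first row of the table above (area 10020) in plain Python. The dictionary `row` simply copies the displayed values; the check that the five categories add up to the total is an illustration added here, not part of the original notebook.

```python
# Values copied from the row for area 10020 in the table above.
row = {
    "P0050001": 57999,  # total population
    "P0050003": 46305,  # mapped to White
    "P0050004": 8246,   # mapped to Black
    "P0050005": 195,    # folded into Other
    "P0050006": 1148,   # mapped to Asian
    "P0050007": 4,      # folded into Other
    "P0050008": 53,     # folded into Other
    "P0050009": 667,    # folded into Other
    "P0050010": 1381,   # mapped to Hispanic
}

categories = {
    "White": row["P0050003"],
    "Black": row["P0050004"],
    "Asian": row["P0050006"],
    "Hispanic": row["P0050010"],
    "Other": sum(row[v] for v in ("P0050005", "P0050007", "P0050008", "P0050009")),
}

# The five categories should account for everyone in the total.
print(categories["Other"], sum(categories.values()) == row["P0050001"])
```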


In [30]:
df_diversity = diversity(grouped)

In [31]:
#'p_Asian', 'p_Black', 'p_Hispanic', 'p_Other','p_White'
df_diversity['p_Asian'] = df_diversity['Asian']/df_diversity['Total']
df_diversity['p_Black'] = df_diversity['Black']/df_diversity['Total']
df_diversity['p_Hispanic'] = df_diversity['Hispanic']/df_diversity['Total']
df_diversity['p_Other'] = df_diversity['Other']/df_diversity['Total']
df_diversity['p_White'] = df_diversity['White']/df_diversity['Total']

In [32]:
def giniSimpson(series):
    """Gini-Simpson diversity index: 1 - sum(p_i**2)"""
    # sum(p*(1-p)) equals 1 - sum(p**2) because the shares p add up to 1
    
    # eliminate 0s
    series1 = series[series!=0]
    
    if len(series) > 1:
        total = float(sum(series1))
        p = series1.astype('float')/float(total)
        return sum(p*(1-p))
    else: 
        return 0.0 
    

def entropyRice(series):
    """Rice Entropy Index

    Expects the five shares p_Asian, p_Black, p_Hispanic, p_Other, p_White, each a
    fraction of the total population. The Other term is removed from the entropy
    sum and the result is normalized by log(4)."""
    
    # eliminate 0s
    series1 = series[series!=0]
    
    if len(series) > 1:
        max_s = -np.log(1.0/4) #only 4 races are considered in this Entropy
        p = series1.astype('float')
        p_other = float(series['p_Other'])
        # an empty Other category contributes nothing (avoids 0*log(0) = nan)
        E_other = -p_other*np.log(p_other) if p_other > 0 else 0.0
        
        return (sum(-p*np.log(p)) - E_other)/max_s
    else: 
        return 0.0
    
df_diversity['gini_simpson'] = df_diversity[['Asian','Black','Hispanic','White','Other']].apply(giniSimpson,axis=1)

df_diversity['entropy_rice'] = df_diversity[['p_Asian', 'p_Black','p_Hispanic', 'p_Other','p_White']].apply(entropyRice,axis=1)

In [33]:
msas_df = df_diversity
len(msas_df)


Out[33]:
942

In [34]:
# Testing code

def to_unicode(vals):
    return [unicode(v) for v in vals]

def test_msas_df(msas_df):

    min_set_of_columns =  set(['Asian','Black','Hispanic', 'Other', 'Total', 'White',
     'entropy4', 'entropy5', 'entropy_rice', 'gini_simpson','p_Asian', 'p_Black',
     'p_Hispanic', 'p_Other','p_White'])  
    
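    #--> what does this assert mean?
    # Answer: every required column must be present in msas_df. Intersecting the
    # required set with the actual columns must give back the full required set.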
    assert min_set_of_columns & set(msas_df.columns) == min_set_of_columns
    
    # https://www.census.gov/geo/maps-data/data/tallies/national_geo_tallies.html
    # 366 metro areas
    # 576 micropolitan areas
    
    assert len(msas_df) == 942  
    
    # total number of people in metro/micro areas
    
    assert msas_df.Total.sum() == 289261315
    assert msas_df['White'].sum() == 180912856
    assert msas_df['Other'].sum() == 8540181
    
    # list of msas in descending order by entropy_rice 
    # calculate the top 10 metros by population
    top_10_metros = msas_df.sort_index(by='Total', ascending=False)[:10]
    
    msa_codes_in_top_10_pop_sorted_by_entropy_rice = list(top_10_metros.sort_index(by='entropy_rice', 
                                                ascending=False).index) 
    
    assert to_unicode(msa_codes_in_top_10_pop_sorted_by_entropy_rice)== [u'26420', u'35620', u'47900', u'31100', u'19100', 
        u'33100', u'16980', u'12060', u'37980', u'14460']


    top_10_metro = msas_df.sort_index(by='Total', ascending=False)[:10]
    
    list(top_10_metro.sort_index(by='entropy_rice', ascending=False)['entropy5'])
    
    np.testing.assert_allclose(top_10_metro.sort_index(by='entropy_rice', ascending=False)['entropy5'], 
    [0.79628076626851163, 0.80528601550164602, 0.80809418318973791, 0.7980698349711991,
     0.75945930510650161, 0.74913610558765376, 0.73683277781032397, 0.72964862063970914,
     0.64082509648457675, 0.55697288400004963])
    
    np.testing.assert_allclose(top_10_metro.sort_index(by='entropy_rice', ascending=False)['entropy_rice'],
    [0.87361766576115552,
     0.87272877244078051,
     0.85931803868749834,
     0.85508015237749468,
     0.82169723530719896,
     0.81953527301129059,
     0.80589423784325431,
     0.78602596561378812,
     0.68611350427640316,
     0.56978827050565117])

In [35]:
# you are on the right track if test_msas_df doesn't complain
test_msas_df(msas_df)
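
The first assertion in `test_msas_df` intersects the required column names with the columns actually present and demands that nothing is lost, which is a subset test. A small sketch with made-up column sets (`required`, `present`, and `incomplete` are hypothetical names) suggests that it agrees with Python's built-in subset operator:

```python
required = {"Total", "White", "entropy5"}          # hypothetical required columns
present = {"Total", "White", "Black", "entropy5"}  # hypothetical DataFrame columns

# Form used in the test: the intersection keeps every required name.
via_intersection = (required & present) == required
# Equivalent, more direct spelling.
via_subset = required <= present

# Dropping one required column makes both checks fail.
incomplete = present - {"entropy5"}
print(via_intersection, via_subset, required <= incomplete)
```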

In [36]:
# code to save your dataframe to a CSV
msas_df.to_csv("msas_2010.csv", encoding="UTF-8")

In [37]:
# load back the CSV and test again
df12 = DataFrame.from_csv("msas_2010.csv", encoding="UTF-8")

test_msas_df(df12)

In [38]:
top_10_metro = msas_df.sort_index(by='Total', ascending=False)[:10]

top_10_metro.sort_index(by='entropy_rice',ascending=False)


Out[38]:
Total White Black Hispanic Asian Other entropy5 entropy4 p_Asian p_Black p_Hispanic p_Other p_White gini_simpson entropy_rice
metropolitan statistical area/micropolitan statistical area
26420 5946800 2360472 998883 2099412 384596 103437 0.796281 0.876425 0.064673 0.167970 0.353032 0.017394 0.396931 0.685115 0.873618
35620 18897109 9233812 3044096 4327560 1860840 430801 0.805286 0.876454 0.098472 0.161088 0.229006 0.022797 0.488636 0.672625 0.872729
47900 5582170 2711258 1409473 770795 513919 176725 0.808094 0.864206 0.092064 0.252496 0.138082 0.031659 0.485700 0.671797 0.859318
31100 12828837 4056820 859086 5700862 1858148 353921 0.798070 0.859159 0.144842 0.066965 0.444379 0.027588 0.316227 0.676304 0.855080
19100 6371773 3201677 941695 1752166 337815 138420 0.759459 0.824101 0.053017 0.147792 0.274989 0.021724 0.502478 0.646772 0.821697
33100 5564635 1937939 1096536 2312929 122082 95149 0.749136 0.821351 0.021939 0.197054 0.415648 0.017099 0.348260 0.666348 0.819535
16980 9461105 5204489 1613644 1957080 526857 159035 0.736833 0.807444 0.055687 0.170556 0.206855 0.016809 0.550093 0.622136 0.805894
12060 5268860 2671757 1679979 547400 252510 117214 0.729649 0.787682 0.047925 0.318851 0.103893 0.022247 0.507084 0.627614 0.786026
37980 5965343 3875845 1204303 468168 293656 123371 0.640825 0.685528 0.049227 0.201883 0.078481 0.020681 0.649727 0.528088 0.686114
14460 4552402 3408585 301533 410516 292786 138982 0.556973 0.565366 0.064315 0.066236 0.090176 0.030529 0.748744 0.421795 0.569788

10 rows × 15 columns
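
The three entropy variants can be cross-checked by hand against the table above. The sketch below is a plain-Python restatement under one reading of the notebook's functions, not the report's documented method: `entropy5` uses all five shares, `entropy4` renormalizes the four named groups so they sum to 1, and the Rice-style score keeps the four shares as fractions of the total population (Other stays in the denominator) and divides by log 4. The helper names are assumptions; the counts are copied from the rows for areas 26420 and 35620.

```python
import math

def _h(shares):
    """Unnormalized Shannon entropy of a list of shares (zeros are skipped)."""
    return sum(-p * math.log(p) for p in shares if p > 0)

def entropy5(white, black, hispanic, asian, other):
    total = float(white + black + hispanic + asian + other)
    return _h([c / total for c in (white, black, hispanic, asian, other)]) / math.log(5)

def entropy4(white, black, hispanic, asian, other):
    total4 = float(white + black + hispanic + asian)   # Other dropped from the total
    return _h([c / total4 for c in (white, black, hispanic, asian)]) / math.log(4)

def entropy_rice(white, black, hispanic, asian, other):
    total = float(white + black + hispanic + asian + other)  # Other kept in the total
    return _h([c / total for c in (white, black, hispanic, asian)]) / math.log(4)

# (White, Black, Hispanic, Asian, Other) counts from the table above
area_26420 = (2360472, 998883, 2099412, 384596, 103437)
area_35620 = (9233812, 3044096, 4327560, 1860840, 430801)

for counts in (area_26420, area_35620):
    print(round(entropy5(*counts), 6),
          round(entropy4(*counts), 6),
          round(entropy_rice(*counts), 6))
```

Note that the table's 0.872729 for area 35620 rounds to 0.873 rather than the report's 0.872, so the match with the published figure is close but not exact.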


In [39]:
# 6. Note where the Bay Area stands in terms of the diversity index.
def msas_for_bayarea(variables="NAME"):
    geo = {'for':'metropolitan statistical area/micropolitan statistical area:*', 
           'in':'state:{state_fips}'.format(state_fips='06')
           }
    for msa_ca in c.sf1.get(variables, geo=geo):
        yield msa_ca


msa_ca_list = list(islice(msas_for_bayarea(['P0050001','NAME']), None))

In [40]:
"""The following DataFrame lists the MSAs in CA. Which of them are in the Bay Area? Look through the list and handpick them manually."""
DataFrame(msa_ca_list)
# list(DataFrame(msa_ca_list)['metropolitan statistical area/micropolitan statistical area'].astype('int'))
#  df[list(P005_vars)] = df[list(P005_vars)].astype('int')


Out[40]:
NAME P0050001 metropolitan statistical area/micropolitan statistical area state
0 Bakersfield-Delano, CA Metro Area 839631 12540 06
1 Bishop, CA Micro Area 18546 13860 06
2 Chico, CA Metro Area 220000 17020 06
3 Clearlake, CA Micro Area 64665 17340 06
4 Crescent City, CA Micro Area 28610 18860 06
5 El Centro, CA Metro Area 174528 20940 06
6 Eureka-Arcata-Fortuna, CA Micro Area 134623 21700 06
7 Fresno, CA Metro Area 930450 23420 06
8 Hanford-Corcoran, CA Metro Area 152982 25260 06
9 Los Angeles-Long Beach-Santa Ana, CA Metro Area 12828837 31100 06
10 Madera-Chowchilla, CA Metro Area 150865 31460 06
11 Merced, CA Metro Area 255793 32900 06
12 Modesto, CA Metro Area 514453 33700 06
13 Napa, CA Metro Area 136484 34900 06
14 Oxnard-Thousand Oaks-Ventura, CA Metro Area 823318 37100 06
15 Phoenix Lake-Cedar Ridge, CA Micro Area 55365 38020 06
16 Red Bluff, CA Micro Area 63463 39780 06
17 Redding, CA Metro Area 177223 39820 06
18 Riverside-San Bernardino-Ontario, CA Metro Area 4224851 40140 06
19 Sacramento--Arden-Arcade--Roseville, CA Metro ... 2149127 40900 06
20 Salinas, CA Metro Area 415057 41500 06
21 San Diego-Carlsbad-San Marcos, CA Metro Area 3095313 41740 06
22 San Francisco-Oakland-Fremont, CA Metro Area 4335391 41860 06
23 San Jose-Sunnyvale-Santa Clara, CA Metro Area 1836911 41940 06
24 San Luis Obispo-Paso Robles, CA Metro Area 269637 42020 06
25 Santa Barbara-Santa Maria-Goleta, CA Metro Area 423895 42060 06
26 Santa Cruz-Watsonville, CA Metro Area 262382 42100 06
27 Santa Rosa-Petaluma, CA Metro Area 483878 42220 06
28 Stockton, CA Metro Area 685306 44700 06
29 Susanville, CA Micro Area 34895 45000 06
30 Truckee-Grass Valley, CA Micro Area 98764 46020 06
31 Ukiah, CA Micro Area 87841 46380 06
32 Vallejo-Fairfield, CA Metro Area 413344 46700 06
33 Visalia-Porterville, CA Metro Area 442179 47300 06
34 Yuba City, CA Metro Area 166892 49700 06

35 rows × 4 columns


In [41]:
"""I am considering the following 4 areas as bay area MSAs"""
# Napa, CA Metro Area	 136484	 34900	 06
# San Francisco-Oakland-Fremont, CA Metro Area	 4335391	 41860	 06
# San Jose-Sunnyvale-Santa Clara, CA Metro Area	 1836911	 41940	 06
# Santa Rosa-Petaluma, CA Metro Area	 483878	 42220	 06
df_bay = msas_df[msas_df.index.isin(['34900', '41860', '41940', '42220'])]

In [42]:
df_bay


Out[42]:
Total White Black Hispanic Asian Other entropy5 entropy4 p_Asian p_Black p_Hispanic p_Other p_White gini_simpson entropy_rice
metropolitan statistical area/micropolitan statistical area
34900 136484 76967 2440 44010 8986 4081 0.648671 0.676360 0.065839 0.017878 0.322455 0.029901 0.563927 0.572460 0.677380
41860 4335391 1840372 349895 938794 994616 211714 0.859532 0.901183 0.229418 0.080707 0.216542 0.048834 0.424500 0.711379 0.891526
41940 1836911 648063 42686 510396 566764 69002 0.805816 0.852024 0.308542 0.023238 0.277856 0.037564 0.352800 0.701179 0.846600
42220 483878 320027 6769 120430 17777 18875 0.576115 0.572311 0.036739 0.013989 0.248885 0.039008 0.661380 0.497566 0.577569

4 rows × 15 columns
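
One way to read task 6 off the two tables is to pool the `entropy_rice` values of the 10 largest areas with those of the four Bay Area areas and rank them. The sketch below hard-codes the values displayed above (keyed by area code) rather than querying `msas_df`, so it only places the Bay Area relative to the 10 largest metros, not among all 942 areas. The variable names are assumptions made for illustration.

```python
# entropy_rice values copied from the two tables above, keyed by area code
top_10 = {
    "26420": 0.873618, "35620": 0.872729, "47900": 0.859318, "31100": 0.855080,
    "19100": 0.821697, "33100": 0.819535, "16980": 0.805894, "12060": 0.786026,
    "37980": 0.686114, "14460": 0.569788,
}
bay_area = {"34900": 0.677380, "41860": 0.891526, "41940": 0.846600, "42220": 0.577569}

pooled = dict(top_10)
pooled.update(bay_area)

# Area codes from most to least diverse by this score
ranking = sorted(pooled, key=pooled.get, reverse=True)

for code in sorted(bay_area):
    print("{0}: rank {1} of {2}".format(code, ranking.index(code) + 1, len(ranking)))
```

By this score San Francisco-Oakland-Fremont (41860) sits above every one of the 10 largest areas, including 26420 at 0.873618, while Santa Rosa-Petaluma (42220) falls near the bottom of the pooled list.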


In [48]:
# Bar chart of population shares by group for the 10 largest MSAs
def createBarchart(msas_df,df_bay):
    #http://matplotlib.org/examples/api/barchart_demo.html
    %pylab --no-import-all inline
    import numpy as np
    import matplotlib.pyplot as plt
    n_groups = 10
    top_10_metro = msas_df.sort_index(by='Total', ascending=False)[:10]
    top_10_metro = top_10_metro.sort_index(by='entropy_rice',ascending=False) #  sorting by Entropy_rice
    
#     #adding bayarea MSAs to Top 10
#     top_10_metro= top_10_metro.add(df_bay, fill_value=0)
#     print len(top_10_metro)
    
    white   = tuple(top_10_metro['p_White'])
    hispanic   = tuple(top_10_metro['p_Hispanic'])
    black   = tuple(top_10_metro['p_Black'])
    asian   = tuple(top_10_metro['p_Asian'])
    other   = tuple(top_10_metro['p_Other'])
    
    fig, ax = plt.subplots(figsize=(14,5))
    
    index = np.arange(n_groups)
    bar_width = 0.15
    
    opacity = 0.7
    error_config = {'ecolor': '0.3'}
    
    rects1 = plt.bar(index, white, bar_width,
                     alpha=opacity,
                     color='b',
                     label='white')
    
    rects2 = plt.bar(index + bar_width, hispanic, bar_width,
                     alpha=opacity,
                     color='r',
                     label='hispanic')
    rects3 = plt.bar(index + 2*bar_width, black, bar_width,
                     alpha=opacity,
                     color='c',
                     label='black')
    
    rects4 = plt.bar(index + 3*bar_width, asian, bar_width,
                     alpha=opacity,
                     color='y',
                     label='asian')
    rects5 = plt.bar(index + 4*bar_width, other, bar_width,
                     alpha=opacity,
                     color='g',
                     label='other')
    

    plt.xlabel('MSA IDs')
    plt.ylabel('Share of population')
    plt.title('Population shares by group in the 10 largest MSAs, ordered by entropy_rice')
    plt.xticks(index + 2*bar_width, tuple(top_10_metro.index),  rotation=45)
    plt.legend()
    plt.yticks(np.arange(0,1.1,.1))
    plt.tight_layout()
    plt.show()


    
    
createBarchart(msas_df,df_bay)
##'34900', '41860', '41940', '42220' are Bay area MSAs


Populating the interactive namespace from numpy and matplotlib

In [49]:
# Bar chart of population shares by group for the Bay Area MSAs
def createBarchart(msas_df,df_bay):
    #http://matplotlib.org/examples/api/barchart_demo.html
    %pylab --no-import-all inline
    import numpy as np
    import matplotlib.pyplot as plt
    n_groups = 4
    top_10_metro = msas_df.sort_index(by='Total', ascending=False)[:10]
    top_10_metro = top_10_metro.sort_index(by='entropy_rice',ascending=False) #  sorting by Entropy_rice
    
    # bayarea MSAs
    top_10_metro= df_bay
    print len(top_10_metro)
    
    white   = tuple(top_10_metro['p_White'])
    hispanic   = tuple(top_10_metro['p_Hispanic'])
    black   = tuple(top_10_metro['p_Black'])
    asian   = tuple(top_10_metro['p_Asian'])
    other   = tuple(top_10_metro['p_Other'])
    
    fig, ax = plt.subplots(figsize=(14,5))
    
    index = np.arange(n_groups)
    bar_width = 0.10
    
    opacity = 0.7
    error_config = {'ecolor': '0.3'}
    
    rects1 = plt.bar(index, white, bar_width,
                     alpha=opacity,
                     color='b',
                     label='white')
    
    rects2 = plt.bar(index + bar_width, hispanic, bar_width,
                     alpha=opacity,
                     color='r',
                     label='hispanic')
    rects3 = plt.bar(index + 2*bar_width, black, bar_width,
                     alpha=opacity,
                     color='c',
                     label='black')
    
    rects4 = plt.bar(index + 3*bar_width, asian, bar_width,
                     alpha=opacity,
                     color='y',
                     label='asian')
    rects5 = plt.bar(index + 4*bar_width, other, bar_width,
                     alpha=opacity,
                     color='g',
                     label='other')
    

    plt.xlabel('MSA IDs')
    plt.ylabel('Share of population')
    plt.title('Population shares by group in the Bay Area MSAs')
    plt.xticks(index + 2*bar_width, tuple(top_10_metro.index),  rotation=45)
    plt.legend()
    plt.yticks(np.arange(0,1.1,.1))
    plt.tight_layout()
    plt.show()


    
    
createBarchart(msas_df,df_bay)
##'34900', '41860', '41940', '42220' are Bay area MSAs


Populating the interactive namespace from numpy and matplotlib
4

In [53]:
# Bar charts of population shares by group: Bay Area MSAs next to the 10 largest MSAs
def createBarchart(msas_df,df_bay):
    #http://matplotlib.org/examples/api/barchart_demo.html
    %pylab --no-import-all inline
    import numpy as np
    import matplotlib.pyplot as plt
    n_groups = 10
    top_10_metro = msas_df.sort_index(by='Total', ascending=False)[:10]
    top_10_metro = top_10_metro.sort_index(by='entropy_rice',ascending=False) #  sorting by Entropy_rice
    
#     #adding bayarea MSAs to Top 10
#     top_10_metro= top_10_metro.add(df_bay, fill_value=0)
#     print len(top_10_metro)
    
    white   = tuple(top_10_metro['p_White'])
    hispanic   = tuple(top_10_metro['p_Hispanic'])
    black   = tuple(top_10_metro['p_Black'])
    asian   = tuple(top_10_metro['p_Asian'])
    other   = tuple(top_10_metro['p_Other'])
    
    fig, (ax2, ax1) = plt.subplots(nrows=1, ncols=2, figsize=(18,5))
    
    index = np.arange(n_groups)
    bar_width = 0.20
    
    opacity = 0.4
    error_config = {'ecolor': '0.3'}
    
#     plt.subplot(1, 2, 1)
    rects1 = ax1.bar(index, white, bar_width,
                     alpha=opacity,
                     color='b',
                     label='white')
    
    rects2 = ax1.bar(index + bar_width, hispanic, bar_width,
                     alpha=opacity,
                     color='r',
                     label='hispanic')
    rects3 = ax1.bar(index + 2*bar_width, black, bar_width,
                     alpha=opacity,
                     color='c',
                     label='black')
    
    rects4 = ax1.bar(index + 3*bar_width, asian, bar_width,
                     alpha=opacity,
                     color='y',
                     label='asian')
    rects5 = ax1.bar(index + 4*bar_width, other, bar_width,
                     alpha=opacity,
                     color='g',
                     label='other')


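    ax1.set_xlabel('MSA IDs')
    ax1.set_ylabel('Share of population')
    ax1.set_yticks(np.arange(0,1.1,.1))
    # set the tick positions first, then label them with the MSA codes
    ax1.set_xticks(index + 2*bar_width)
    ax1.set_xticklabels(tuple(top_10_metro.index))
    ax1.set_title('10 largest MSAs')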
    
   
    
    
    white   = tuple(df_bay['p_White'])
    hispanic   = tuple(df_bay['p_Hispanic'])
    black   = tuple(df_bay['p_Black'])
    asian   = tuple(df_bay['p_Asian'])
    other   = tuple(df_bay['p_Other'])
        
    index1 = np.arange(4)
    bar_width = 0.15
    
#     plt.subplot(1, 2, 2)
    rects1 = ax2.bar(index1, white, bar_width,
                     alpha=opacity,
                     color='b',
                     label='white')
    
    rects2 = ax2.bar(index1 + bar_width, hispanic, bar_width,
                     alpha=opacity,
                     color='r',
                     label='hispanic')
    rects3 = ax2.bar(index1 + 2*bar_width, black, bar_width,
                     alpha=opacity,
                     color='c',
                     label='black')
    
    rects4 = ax2.bar(index1 + 3*bar_width, asian, bar_width,
                     alpha=opacity,
                     color='y',
                     label='asian')
    rects5 = ax2.bar(index1 + 4*bar_width, other, bar_width,
                     alpha=opacity,
                     color='g',
                     label='other')


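    ax2.set_xlabel('MSA IDs')
    ax2.set_ylabel('Share of population')
    # set the tick positions first, then label them with the MSA codes
    ax2.set_xticks(index1 + 2*bar_width)
    ax2.set_xticklabels(tuple(df_bay.index))
    ax2.set_yticks(np.arange(0,1.1,.1))
    ax2.set_title('Bay Area MSAs')

    fig.suptitle('Population shares by group: Bay Area MSAs vs. the 10 largest MSAs', fontsize=14)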

#     plt.title('Barchart showing desities of Top 10 most Diverse MSAs')
    ax2.legend()
    plt.tight_layout()
    plt.show()


    
    
createBarchart(msas_df,df_bay)
##'34900', '41860', '41940', '42220' are Bay area MSAs


Populating the interactive namespace from numpy and matplotlib

In [52]:
# Stacked bar chart of population shares by group for the 10 largest MSAs and the Bay Area MSAs
def createBarchart(msas_df):
    #http://matplotlib.org/examples/api/barchart_demo.html
    %pylab --no-import-all inline
    import numpy as np
    import matplotlib.pyplot as plt
    
    N = 14
    top_10_metro = msas_df.sort_index(by='Total', ascending=False)[:10]
    top_10_metro = top_10_metro.sort_index(by='entropy_rice',ascending=False) #  sorting by Entropy_rice
    
    #adding bayarea MSAs to Top 10 (df_bay is the global DataFrame defined above)
    top_10_metro = pd.concat([top_10_metro, df_bay])
    print(len(top_10_metro))
    
    fig, ax = plt.subplots(figsize=(14,5))
    asian   = tuple(top_10_metro['p_Asian'])
    black   = tuple(top_10_metro['p_Black'])
    hispanic   = tuple(top_10_metro['p_Hispanic'])
    white   = tuple(top_10_metro['p_White'])
    other   = tuple(top_10_metro['p_Other'])

    ind = np.arange(N)    # the x locations for the groups
    width = 0.35       # the width of the bars: can also be len(x) sequence
    
    comb = lambda a, b: tuple(x+y for x, y in zip(a, b))
   
    ao = comb(asian, other)
    bao = comb(black, ao)
    hbao = comb(hispanic, bao)
    
    p5 = plt.bar(ind, other,   width, color='g', alpha=0.8)
    p4 = plt.bar(ind, asian,   width, color='y', alpha=0.8, bottom=other)
    p3 = plt.bar(ind, black,   width, color='c', alpha=0.8, bottom=ao)
    p2 = plt.bar(ind, hispanic,   width, color='r', alpha=0.8, bottom=bao)
    p1 = plt.bar(ind, white,   width, color='b', alpha=0.8,bottom=hbao)

#   p2 = plt.bar(ind, womenMeans, width, color='y', bottom=menMeans)
    
    #crimson', 'burlywood', 'chartreuse'
    plt.ylabel('Share of population')
    plt.xlabel('MSA IDs')

    plt.title('Population shares by group in the 10 largest MSAs and the Bay Area MSAs')
    plt.xticks(ind+width/2., tuple(top_10_metro.index), rotation=45 )
    plt.yticks(np.arange(0,1.1,.1))
    plt.legend( (p1[0], p2[0],p3[0], p4[0],p5[0]), ('White', 'Hispanic', 'Black', 'Asian', 'Others') )
    
    plt.show()
    
    
createBarchart(msas_df)


Populating the interactive namespace from numpy and matplotlib
14