Exercise

In this exercise, reproduce some of the findings from the Smithsonian Magazine travel article "What Makes Houston the Next Great American City?", specifically the calculation represented in the article's diversity bar chart, whose caption reads:

To assess the parity of the four major U.S. ethnic and racial groups, Rice University researchers used a scale called the Entropy Index. It ranges from 0 (a population has just one group) to 1 (all groups are equivalent). Edging New York for the most balanced diversity, Houston had an Entropy Index of 0.874 (orange bar).

The research report referenced by Smithsonian Magazine is Houston Region Grows More Racially/Ethnically Diverse, With Small Declines in Segregation: A Joint Report Analyzing Census Data from 1990, 2000, and 2010, by the Kinder Institute for Urban Research & the Hobby Center for the Study of Texas.

In the report, you'll find the following quotes:

How does Houston’s racial/ethnic diversity compare to the racial/ethnic diversity of other large metropolitan areas? The Houston metropolitan area is the most racially/ethnically diverse.

....

Houston is one of the most racially/ethnically diverse metropolitan areas in the nation as well. *It is the most diverse of the 10 largest U.S. metropolitan areas.* [emphasis mine] Unlike the other large metropolitan areas, all four major racial/ethnic groups have substantial representation in Houston with Latinos and Anglos occupying roughly equal shares of the population.

....

Houston has the highest entropy score of the 10 largest metropolitan areas, 0.874. New York is a close second with a score of 0.872.

....

Your task is:

  1. Tabulate all the metropolitan/micropolitan statistical areas. Remember that you have to group various entities that show up separately in the Census API but that belong to the same area. You should find 942 metropolitan/micropolitan statistical areas in the 2010 Census.

  2. Calculate the normalized Shannon index (entropy5) using the categories of White, Black, Hispanic, Asian, and Other, as outlined in the Day_07_G_Calculating_Diversity notebook.

  3. Calculate the normalized Shannon index (entropy4) by excluding the Other category. In other words, assume that the total population is the sum of White, Black, Hispanic, and Asian.

  4. Figure out how exactly the entropy score was calculated in the report from Rice University. Since you'll find that the reported entropy score matches neither entropy5 nor entropy4, you'll need to play around with the entropy calculation to figure out how to use 4 categories to get the score for Houston to come out to "0.874" and that for NYC to be "0.872". [I think I've done so and get 0.873618 and 0.872729, respectively.]

  5. Add a calculation of the Gini-Simpson diversity index using the five categories of White, Black, Hispanic, Asian, and Other.

  6. Note where the Bay Area stands in terms of the diversity index.
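As a rough sketch of the measures in steps 2, 3, and 5 (assuming natural logarithms and normalization by the log of the category count; step 4's Rice variant is deliberately left for you to reverse-engineer):

```python
import numpy as np

def entropy_normalized(counts):
    """Normalized Shannon index: entropy divided by ln(k), where k is
    the number of categories (zero counts still contribute to k)."""
    counts = np.asarray(counts, dtype=float)
    k = len(counts)
    p = counts / counts.sum()
    p = p[p > 0]  # treat 0 * log(0) as 0
    return -np.sum(p * np.log(p)) / np.log(k)

def gini_simpson(counts):
    """Gini-Simpson index, 1 - sum(p_i^2): the probability that two
    random draws come from different categories."""
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# An even split over 5 groups maximizes both measures:
print(round(entropy_normalized([20, 20, 20, 20, 20]), 6))  # -> 1.0
print(round(gini_simpson([20, 20, 20, 20, 20]), 6))        # -> 0.8
```

Note the normalization: dividing by ln(5) (or ln(4) for entropy4) is what maps the Shannon index onto the 0-to-1 scale described in the caption quoted above.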

For bonus points:

  • make a bar chart in the style used in the Smithsonian Magazine

Deliverable:

  1. Upload your notebook as a gist, render it in nbviewer, and then enter the nbviewer URL (e.g., http://nbviewer.ipython.org/gist/rdhyee/60b6c0b0aad7fd531938)
  2. On bCourses, upload the CSV version of your msas_df.
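For the CSV deliverable, pandas' to_csv is enough. The sketch below uses a one-row stand-in frame (only Houston's total, 5946800, is real) and an assumed filename, msas_df.csv, just so it runs standalone:

```python
import pandas as pd

# Stand-in for the real 942-row msas_df; the filename "msas_df.csv"
# is an assumption, not specified by the assignment.
msas_df = pd.DataFrame({'Total': [5946800]},
                       index=pd.Index(['26420'], name='FIPS'))
msas_df.to_csv('msas_df.csv')
```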

HAVE FUN: ASK QUESTIONS AND WORK TOGETHER

Constraints

Below is testing code to help make sure you are on the right track. A key assumption made here is that you will end up with a Pandas DataFrame called msas_df, indexed by the FIPS code of a metropolitan/micropolitan area (e.g., Houston's code is 26420) and with the following columns:

  • Total
  • White
  • Black
  • Hispanic
  • Asian
  • Other
  • p_White
  • p_Black
  • p_Hispanic
  • p_Asian
  • p_Other
  • entropy4
  • entropy5
  • entropy_rice
  • gini_simpson

You should have 942 rows, one for each MSA. You can compare your results for entropy5 and entropy_rice with mine.
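The checks below illustrate the kind of testing meant, using a tiny hand-built stand-in for msas_df so the snippet runs on its own (Houston's total of 5946800 is real; the group counts are invented for illustration):

```python
import pandas as pd

# Two-row stand-in for msas_df, indexed by MSA FIPS code.
msas_df = pd.DataFrame(
    {'Total':    [5946800, 100],
     'White':    [2000000, 40],
     'Black':    [1000000, 30],
     'Hispanic': [2500000, 20],
     'Asian':    [350000, 5],
     'Other':    [96800, 5]},
    index=['26420', '99999'])

for grp in ['White', 'Black', 'Hispanic', 'Asian', 'Other']:
    msas_df['p_' + grp] = msas_df[grp] / msas_df['Total']

p_cols = ['p_White', 'p_Black', 'p_Hispanic', 'p_Asian', 'p_Other']

# the five group counts should partition the total
assert (abs(msas_df[p_cols].sum(axis=1) - 1.0) < 1e-9).all()
# Houston's FIPS code should appear in the index
assert '26420' in msas_df.index
```

Against the real frame you would also assert len(msas_df) == 942.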


In [43]:
# FILL IN WITH YOUR CODE

import census
import settings
import us


import time
from pandas import DataFrame, Series, Index
import pandas as pd
import numpy as np
from itertools import islice

c = census.Census(key=settings.CENSUS_KEY)

In [44]:
# def msas(variables="NAME"):
#     # tabulate a set of fips codes for the states
#     states_fips = set([s.fips for s in us.states.STATES])
#     geo = {'for':'metropolitan statistical area/micropolitan statistical area:*', 
#                'in':'state:*'}
               
#     for msa in c.sf1.get(variables, geo=geo):
#         if msa['state'] in states_fips:
#             yield msa


def msas(variables="NAME"):
    for state in us.STATES:
        geo = {'for': 'metropolitan statistical area/micropolitan statistical area:*',
               'in': 'state:{state_fips}'.format(state_fips=state.fips)}

        for msa in c.sf1.get(variables, geo=geo):
            yield msa


msa_list = list(islice(msas("NAME,P0010001"), None))
len(msa_list)


Out[44]:
1013

In [45]:
msa_list[:5]


Out[45]:
[{u'NAME': u'Albertville, AL Micro Area',
  u'P0010001': u'93019',
  u'metropolitan statistical area/micropolitan statistical area': u'10700',
  u'state': u'01'},
 {u'NAME': u'Alexander City, AL Micro Area',
  u'P0010001': u'53155',
  u'metropolitan statistical area/micropolitan statistical area': u'10760',
  u'state': u'01'},
 {u'NAME': u'Anniston-Oxford, AL Metro Area',
  u'P0010001': u'118572',
  u'metropolitan statistical area/micropolitan statistical area': u'11500',
  u'state': u'01'},
 {u'NAME': u'Auburn-Opelika, AL Metro Area',
  u'P0010001': u'140247',
  u'metropolitan statistical area/micropolitan statistical area': u'12220',
  u'state': u'01'},
 {u'NAME': u'Birmingham-Hoover, AL Metro Area',
  u'P0010001': u'1128047',
  u'metropolitan statistical area/micropolitan statistical area': u'13820',
  u'state': u'01'}]

In [46]:
df = DataFrame(msa_list)

In [82]:
df.groupby('metropolitan statistical area/micropolitan statistical area').count()
type(df)


Out[82]:
pandas.core.frame.DataFrame

In [61]:
df[['P0010001', 'state']] = \
    df[['P0010001', 'state']].astype('int')

In [92]:
# Initially tried:
#   grouped_msa = df.groupby('metropolitan statistical area/micropolitan statistical area').sum()
# which groups by MSA (with name, population, and state summed as columns),
# but that did not work for recovering the list of states per MSA,
# so keep the GroupBy object instead:

grouped_msa = df.groupby('metropolitan statistical area/micropolitan statistical area')

In [94]:
s = grouped_msa['P0010001'].sum()
s


Out[94]:
metropolitan statistical area/micropolitan statistical area
10020                                                           57999
10100                                                           40602
10140                                                           72797
10180                                                          165252
10220                                                           37492
10300                                                           99892
10420                                                          703200
10460                                                           63797
10500                                                          157308
10540                                                          116672
10580                                                          870716
10620                                                           60585
10660                                                           31255
10700                                                           93019
10740                                                          887077
...
49060                                                           36311
49100                                                           51461
49180                                                          477717
49260                                                           20081
49300                                                          114520
49340                                                          798552
49380                                                           21378
49420                                                          243231
49460                                                           22438
49540                                                           28065
49620                                                          434972
49660                                                          565773
49700                                                          166892
49740                                                          195751
49780                                                           86074
Name: P0010001, Length: 942, dtype: int32

In [201]:
type(s)


Out[201]:
pandas.core.series.Series

In [ ]:
# to get the msa_ids
msa_ids = s.index
msa_ids

In [119]:
msa_ids


Out[119]:
Index([u'10020', u'10100', u'10140', u'10180', u'10220', u'10300', u'10420', u'10460', u'10500', u'10540', u'10580', u'10620', u'10660', u'10700', u'10740', u'10760', u'10780', u'10820', u'10860', u'10880', u'10900', u'10940', u'10980', u'11020', u'11060', u'11100', u'11140', u'11180', u'11220', u'11260', u'11300', u'11340', u'11380', u'11420', u'11460', u'11500', u'11540', u'11580', u'11620', u'11660', u'11700', u'11740', u'11780', u'11820', u'11860', u'11900', u'11940', u'11980', u'12020', u'12060', u'12100', u'12140', u'12180', u'12220', u'12260', u'12300', u'12380', u'12420', u'12460', u'12540', u'12580', u'12620', u'12660', u'12700', u'12740', u'12780', u'12820', u'12860', u'12900', u'12940', u'12980', u'13020', u'13060', u'13100', u'13140', u'13180', u'13220', u'13260', u'13300', u'13340', u'13380', u'13420', u'13460', u'13500', u'13540', u'13620', u'13660', u'13700', u'13740', u'13780', u'13820', u'13860', u'13900', u'13940', u'13980', u'14020', u'14060', u'14100', u'14140', u'14180', ...], dtype='object')

In [111]:
# to list the states making up each MSA
for msa in s.index:
    print msa, s[msa], list(grouped_msa.get_group(msa).state)


10020 57999 [22]
10100 40602 [46]
10140 72797 [53]
10180 165252 [48]
10220 37492 [40]
10300 99892 [26]
10420 703200 [39]
10460 63797 [35]
10500 157308 [13]
10540 116672 [41]
10580 870716 [36]
10620 60585 [37]
10660 31255 [27]
10700 93019 [1]
10740 887077 [35]
10760 53155 [1]
10780 153922 [22]
10820 36009 [27]
10860 40838 [48]
10880 111408 [26]
10900 821173 [34, 42]
10940 42476 [26]
10980 29598 [26]
11020 127089 [42]
11060 26446 [40]
11100 249881 [48]
11140 37829 [13]
11180 89542 [19]
11220 50219 [36]
11260 380821 [2]
11300 131636 [18]
11340 187126 [45]
11380 14786 [48]
11420 34185 [18]
11460 344791 [26]
11500 118572 [1]
11540 225666 [55]
11580 34862 [12]
11620 56980 [40]
11660 22995 [5]
11700 424858 [37]
11740 53139 [39]
11780 101497 [39]
11820 37039 [41]
11860 16924 [20]
11900 64757 [39]
11940 52266 [47]
11980 78532 [48]
12020 192541 [13]
12060 5268860 [13]
12100 274549 [34]
12140 42223 [18]
12180 80026 [36]
12220 140247 [1]
12260 556877 [13, 45]
12300 122151 [23]
12380 39163 [27]
12420 1716289 [48]
12460 27842 [13]
12540 839631 [6]
12580 2710489 [24]
12620 153923 [23]
12660 61976 [55]
12700 215888 [25]
12740 59534 [50]
12780 50976 [40]
12820 27979 [22]
12860 60079 [36]
12900 36647 [5]
12940 802484 [22]
12980 136146 [26]
13020 107771 [26]
13060 36702 [48]
13100 22311 [31]
13140 388745 [48]
13180 88759 [55]
13220 78859 [54]
13260 46134 [18]
13300 31861 [48]
13340 45858 [39]
13380 201140 [53]
13420 44442 [27]
13460 157733 [41]
13500 28933 [45]
13540 37125 [50]
13620 39361 [33, 50]
13660 42798 [26]
13700 35012 [48]
13740 158050 [30]
13780 251725 [36]
13820 1128047 [1]
13860 18546 [6]
13900 108779 [38]
13940 45607 [16]
13980 162958 [51]
14020 192714 [18]
14060 169572 [17]
14100 85562 [42]
14140 107342 [51, 54]
14180 46480 [5]
14220 47168 [22]
14260 616561 [16]
14300 33915 [48]
14340 26306 [19]
14380 51079 [37]
14420 22150 [48]
14460 4552402 [25, 33]
14500 294567 [8]
14540 125953 [21]
14580 89513 [30]
14620 43450 [42]
14660 91067 [27]
14700 83877 [29]
14740 251133 [53]
14780 33718 [48]
14820 33090 [37]
14860 916829 [9]
14940 49975 [49]
15020 34869 [28]
15060 22364 [41]
15100 31965 [46]
15140 18787 [47]
15180 406220 [48]
15220 38106 [48]
15260 112370 [13]
15340 43784 [39]
15380 1135509 [36]
15420 43021 [16]
15460 47656 [17, 19]
15500 151131 [37]
15540 211261 [50]
15580 34200 [30]
15620 47584 [26]
15660 55186 [13]
15700 32618 [24]
15740 40087 [39]
15780 31488 [5]
15820 24512 [21]
15860 46824 [8]
15900 37069 [17]
15940 404422 [39]
15980 618754 [12]
16020 96275 [17, 29]
16060 60218 [17]
16100 53829 [35]
16180 55274 [32]
16220 75450 [56]
16260 46163 [49]
16300 257940 [19]
16340 41475 [13]
16380 40814 [39]
16420 31499 [21]
16460 39437 [17]
16500 75455 [53]
16540 149618 [42]
16580 231891 [17]
16620 304284 [54]
16660 64921 [17]
16700 664607 [45]
16740 1758038 [37, 45]
16820 201559 [51]
16860 528143 [13, 47]
16900 33140 [45]
16940 91738 [56]
16980 9461105 [17, 18, 55]
17020 220000 [6]
17060 78064 [39]
17140 2130151 [18, 21, 39]
17200 43742 [33]
17220 94196 [54]
17260 26151 [28]
17300 273949 [21, 47]
17340 64665 [6]
17380 34145 [28]
17420 115788 [47]
17460 2077240 [39]
17500 39140 [12]
17540 49116 [19]
17580 48376 [35]
17660 138494 [16]
17700 35471 [20]
17740 45248 [26]
17780 228660 [48]
17820 645613 [8]
17860 172786 [29]
17900 767598 [45]
17940 80956 [47]
17980 294865 [1, 13]
18020 76794 [18]
18060 59779 [28]
18100 32237 [31]
18140 1836536 [39]
18180 146445 [33]
18220 24277 [18]
18260 106042 [47]
18300 63043 [41]
18340 35637 [21]
18380 23439 [13]
18420 37057 [28]
18460 43041 [13]
18500 98990 [36]
18580 428185 [48]
18620 47735 [48]
18660 49336 [36]
18700 85579 [41]
18740 36901 [39]
18820 38124 [18]
18860 28610 [6]
18880 180822 [12]
18900 56053 [47]
18940 61773 [22]
18980 80406 [1]
19020 46689 [51]
19060 103299 [24, 54]
19100 6371773 [48]
19140 142227 [13]
19180 81625 [17]
19220 53174 [21]
19260 106561 [51]
19300 182265 [1]
19340 379690 [17, 19]
19380 841502 [39]
19460 153829 [1]
19500 110768 [17]
19540 34387 [18]
19580 39037 [39]
19620 48879 [48]
19660 494593 [12]
19700 25095 [35]
19740 2543482 [8]
19760 35654 [22]
19780 569633 [19]
19820 4296250 [26]
19860 24982 [38]
19900 32062 [45]
19940 36031 [17]
19980 33848 [20]
20020 145639 [1]
20060 50731 [13]
20100 162310 [10]
20140 58414 [13]
20180 81642 [42]
20220 93653 [19]
20260 279771 [27, 55]
20300 21904 [48]
20340 45048 [40]
20380 114678 [37]
20420 51334 [8]
20460 42416 [40]
20500 504357 [37]
20540 38335 [47]
20580 54258 [48]
20620 107841 [39]
20660 37782 [24]
20700 169842 [42]
20740 161151 [55]
20780 59507 [8]
20820 34242 [17]
20900 41280 [48]
20940 174528 [6]
20980 41639 [5]
21020 64094 [37]
21060 119736 [21]
21120 22119 [40]
21140 197559 [18]
21220 50805 [32]
21260 40915 [53]
21300 88830 [36]
21340 800647 [48]
21380 36480 [20]
21420 60580 [40]
21460 100199 [1]
21500 280566 [42]
21540 37069 [26]
21580 40246 [35]
21640 29970 [1, 13]
21660 351715 [41]
21700 134623 [6]
21740 21118 [56]
21780 358676 [18, 21]
21820 97581 [2]
21860 20840 [27]
21900 56418 [54]
21980 24877 [32]
22020 208777 [27, 38]
22060 64142 [27]
22100 65359 [29]
22140 130044 [35]
22180 366383 [37]
22220 463204 [5, 29]
22260 57303 [27]
22280 51980 [32]
22300 74782 [39]
22340 27172 [13]
22380 134421 [4]
22420 425790 [26]
22500 205566 [45]
22520 147137 [1]
22540 101633 [55]
22580 67810 [37]
22620 28258 [5]
22660 299630 [8]
22700 38013 [19]
22780 52274 [29]
22800 43001 [19, 29]
22820 28159 [8]
22840 71109 [1]
22860 52334 [22]
22900 298592 [5, 40]
22980 27695 [13]
23060 416257 [18]
23140 33224 [18]
23180 70706 [21]
23240 24837 [48]
23300 47711 [17]
23340 36691 [31]
23380 60944 [39]
23420 930450 [6]
23460 104430 [1]
23500 55342 [45]
23540 264275 [12]
23580 179684 [13]
23620 38437 [48]
23660 70626 [17]
23700 71492 [35]
23780 36776 [20]
23820 46997 [32]
23860 60158 [45]
23900 101407 [42]
23940 46133 [56]
23980 52272 [21]
24020 128923 [36]
24100 55531 [36]
24140 122623 [37]
24180 59672 [48]
24220 98461 [27, 38]
24260 72726 [31]
24300 146723 [8]
24340 774160 [26]
24380 27213 [35]
24420 82713 [41]
24460 27674 [20]
24500 81327 [30]
24540 252825 [8]
24580 306241 [55]
24620 68831 [47]
24660 723801 [37]
24700 25740 [18]
24740 51137 [28]
24780 189510 [37]
24820 52959 [39]
24860 636986 [45]
24900 42914 [28]
24940 69661 [45]
24980 21906 [28]
25060 248820 [28]
25100 20640 [40]
25180 269140 [24, 54]
25220 121097 [22]
25260 152982 [6]
25300 38948 [29]
25340 54181 [47]
25380 24913 [17]
25420 549475 [42]
25460 45233 [5]
25500 125228 [51]
25540 1212381 [9]
25580 37906 [31]
25620 142842 [28]
25660 16096 [30]
25700 28452 [20]
25720 23530 [49]
25740 74801 [30]
25760 21757 [5]
25780 45422 [37]
25820 19372 [48]
25860 365497 [37]
25900 185079 [15]
25940 187010 [45]
25980 77917 [13]
26020 64727 [35]
26100 263801 [26]
26140 141236 [12]
26180 953207 [15]
26220 22346 [41]
26260 31606 [5]
26300 96024 [5]
26340 38784 [26]
26380 208178 [22]
26420 5946800 [48]
26460 63096 [36]
26480 49683 [47]
26500 45913 [42]
26540 37124 [18]
26580 287702 [21, 39, 54]
26620 417593 [1]
26660 67861 [48]
26700 17398 [46]
26740 64511 [20]
26780 36651 [27]
26820 130374 [16]
26860 88880 [42]
26900 1756241 [18]
26940 29450 [28]
26980 152586 [19]
27020 30591 [26, 55]
27060 101564 [36]
27100 160248 [26]
27140 539057 [28]
27180 115425 [47]
27220 31464 [16, 56]
27260 1345596 [12]
27300 40902 [17]
27340 177772 [37]
27380 50845 [48]
27420 21100 [38]
27460 134905 [36]
27500 160331 [55]
27540 54734 [18]
27620 149807 [29]
27660 31594 [22]
27700 30099 [13]
27740 198716 [47]
27780 143679 [42]
27860 121026 [5]
27900 175518 [29]
27940 31275 [2]
27980 154834 [15]
28020 326589 [26]
28060 90928 [30]
28100 113449 [17]
28140 2035334 [20, 29]
28180 67091 [15]
28260 52591 [31]
28300 77117 [33]
28340 47536 [18]
28380 31953 [29]
28420 253340 [53]
28500 49625 [48]
28540 13477 [2]
28580 73090 [12]
28620 33920 [37]
28660 405300 [48]
28700 309544 [47, 51]
28740 182493 [36]
28780 32477 [48]
28820 59495 [37]
28860 30038 [29]
28900 66380 [41]
28940 698030 [47]
28980 13592 [2]
29020 98688 [18]
29060 60088 [33]
29100 133665 [27, 55]
29140 201789 [18]
29180 273738 [22]
29220 40716 [47]
29260 25748 [41]
29300 67044 [13]
29340 199607 [22]
29380 67531 [12]
29420 200186 [4]
29460 602095 [12]
29500 13833 [48]
29540 519445 [42]
29580 76652 [45]
29620 464036 [26]
29660 36299 [56]
29700 250304 [48]
29740 209233 [35]
29780 29393 [35]
29820 1951269 [32]
29860 84823 [28]
29900 36157 [37]
29940 110826 [20]
29980 41869 [47]
30020 124098 [40]
30060 35571 [29]
30100 174724 [33, 50]
30140 133568 [42]
30220 22935 [48]
30260 44947 [42]
30280 30617 [47]
30300 60888 [16, 53]
30340 107702 [23]
30380 46682 [42]
30420 26370 [31]
30460 472099 [21]
30500 105151 [24]
30580 22952 [20]
30620 106331 [39]
30660 30305 [17]
30700 302157 [31]
30740 78265 [37]
30780 699757 [5]
30820 39238 [42]
30860 125442 [16, 49]
30900 38966 [18]
30940 58849 [21]
30980 214369 [48]
31020 102410 [53]
31060 17950 [35]
31100 12828837 [6]
31140 1283566 [18, 21]
31180 284890 [48]
31260 86771 [48]
31300 134168 [37]
31340 252634 [51]
31380 32612 [17]
31420 232293 [13]
31460 150865 [6]
31500 32428 [18]
31540 568593 [55]
31580 46920 [21]
31620 24552 [5]
31660 51599 [36]
31700 400721 [33]
31740 127081 [20]
31820 81442 [55]
31860 96740 [27]
31900 124475 [39]
31920 42750 [48]
31940 65778 [26, 55]
31980 70061 [18]
32020 66501 [39]
32060 66357 [17]
32100 67077 [26]
32140 25857 [27]
32180 23370 [29]
32220 65631 [48]
32260 40648 [19]
32270 74749 [55]
32280 35021 [47]
32300 67972 [51]
32340 23370 [29]
32380 51749 [19]
32460 37121 [21]
32500 31360 [21]
32540 45837 [40]
32580 774769 [48]
32620 53535 [28]
32660 39839 [47]
32700 29180 [20]
32740 88765 [42]
32780 203206 [41]
32820 1316100 [5, 28, 47]
32860 43857 [55]
32900 255793 [6]
32940 107449 [28]
32980 28743 [55]
33020 25529 [29]
33060 31848 [40]
33100 5564635 [12]
33140 111467 [18]
33180 28691 [21]
33220 83629 [26]
33260 136872 [48]
33300 55149 [13]
33340 1555908 [55]
33380 41207 [22]
33420 28111 [48]
33460 3279833 [27, 55]
33500 69540 [38]
33540 109299 [30]
33580 22835 [46]
33620 25414 [29]
33660 412992 [1]
33700 514453 [6]
33740 176441 [22]
33780 152021 [26]
33820 36842 [55]
33860 374536 [1]
33940 41276 [8]
33980 66469 [37]
34020 54650 [22]
34060 129709 [54]
34100 136608 [47]
34140 37244 [16]
34180 89120 [53]
34220 45498 [13]
34260 41513 [5]
34300 27038 [16]
34340 73673 [37]
34380 70311 [26]
34420 32334 [48]
34460 44396 [21]
34500 47284 [17]
34540 60921 [39]
34580 116901 [53]
34620 117671 [18]
34660 37191 [21]
34700 54132 [19]
34740 172188 [26]
34780 70990 [40]
34820 269291 [45]
34860 64524 [48]
34900 136484 [6]
34940 321520 [12]
34980 1589934 [47]
35020 53119 [22, 28]
35060 39566 [22]
35100 126802 [37]
35140 37508 [45]
35220 49462 [18]
35260 91108 [42]
35300 862477 [9]
35340 73240 [22]
35380 1167764 [22]
35420 92582 [39]
35460 35662 [47]
35500 36842 [19]
35580 25893 [27]
35620 18897109 [34, 36, 42]
35660 156813 [26]
35700 47420 [4]
35740 48271 [31]
35820 37590 [31]
35840 702281 [12]
35860 28525 [18]
35900 69340 [37]
35940 59626 [39]
35980 274055 [9]
36020 78506 [53]
36060 46039 [54]
36100 331298 [12]
36140 97265 [34]
36180 51454 [24]
36220 137130 [48]
36260 547184 [49]
36300 111944 [36]
36340 54984 [42]
36380 39996 [12]
36420 1252987 [40]
36460 80317 [36]
36500 252264 [53]
36540 865350 [19, 31]
36580 62259 [36]
36620 53936 [16, 41]
36660 83384 [22]
36700 92501 [45]
36740 2134411 [12]
36780 166994 [55]
36820 22381 [19]
36860 154908 [17]
36900 35625 [19]
36940 36576 [27]
36980 114752 [21]
37020 70648 [26]
37060 47351 [28]
37100 823318 [6]
37140 98762 [17, 21]
37220 43946 [32]
37260 74364 [12]
37300 58458 [48]
37340 543376 [12]
37380 95696 [12]
37420 23464 [48]
37460 168852 [12]
37500 42090 [5]
37540 32330 [47]
37580 49793 [48]
37620 162056 [39, 54]
37660 21607 [20]
37700 162246 [28]
37740 53597 [4]
37780 13783 [48]
37800 33309 [19]
37820 87062 [41]
37860 448991 [12]
37900 379186 [17]
37940 36903 [18]
37980 5965343 [10, 24, 34, 42]
38020 55365 [6]
38060 4192887 [4]
38100 55834 [28]
38180 19988 [46]
38200 23421 [22]
38220 100258 [5]
38260 39134 [20]
38300 2356285 [42]
38340 131219 [25]
38380 36273 [48]
38420 51208 [55]
38460 82128 [36]
38500 47051 [18]
38540 90656 [16]
38580 58258 [39, 54]
38620 46562 [40]
38700 38950 [17]
38740 42794 [29]
38780 19846 [35]
38820 71404 [53]
38860 514098 [23]
38900 2226009 [41, 53]
38940 424107 [12]
39020 79499 [39]
39060 148289 [42]
39100 670301 [36]
39140 211033 [4]
39220 21403 [49]
39260 20978 [41]
39300 1600852 [25, 44]
39340 526810 [49]
39380 159063 [8]
39420 44776 [53]
39460 159978 [12]
39500 77314 [17, 29]
39540 195408 [55]
39580 1130490 [37]
39660 126382 [46]
39700 22134 [48]
39740 411442 [42]
39780 63463 [6]
39820 177223 [6]
39860 46183 [27]
39900 425417 [32]
39940 50778 [16]
39980 68917 [18]
40060 1258251 [51]
40080 99972 [21]
40100 60968 [48]
40140 4224851 [6]
40180 40123 [56]
40220 308707 [51]
40260 76790 [37]
40300 53497 [17]
40340 186011 [27]
40380 1054323 [36]
40420 349431 [17]
40460 46639 [37]
40500 39736 [23]
40540 43806 [56]
40580 152392 [37]
40620 45156 [29]
40660 96317 [13]
40700 107667 [41]
40740 65645 [35]
40760 20497 [35]
40780 83939 [5]
40820 63009 [22]
40860 61642 [50]
40900 2149127 [6]
40940 45657 [4]
40980 200169 [26]
41060 189093 [27]
41100 138115 [49]
41140 127329 [20, 29]
41180 2812896 [17, 29]
41220 50513 [13]
41260 31946 [42]
41420 390738 [41]
41460 61697 [20]
41500 415057 [6]
41540 125203 [24]
41580 138428 [37]
41620 1124197 [49]
41660 111823 [48]
41700 2142508 [48]
41740 3095313 [6]
41780 77079 [39]
41820 57866 [37]
41860 4335391 [6]
41940 1836911 [6]
42020 269637 [6]
42060 423895 [6]
42100 262382 [6]
42140 144170 [35]
42220 483878 [6]
42300 38520 [26]
42340 347611 [13]
42380 62622 [42]
42420 37660 [31]
42460 53227 [1]
42500 24181 [18]
42540 563631 [42]
42580 197145 [10]
42620 77076 [5]
42660 3439809 [53]
42680 138028 [12]
42700 98786 [12]
42740 42201 [29]
42780 39702 [42]
42820 43820 [1]
42860 74273 [45]
42900 35251 [36]
42940 89889 [47]
42980 42376 [18]
43060 69442 [40]
43100 115507 [55]
43140 98078 [37]
43180 45058 [47]
43220 60699 [53]
43260 29116 [56]
43300 120877 [48]
43320 107449 [4]
43340 398604 [22]
43380 49423 [39]
43420 131346 [4]
43460 39191 [29]
43500 29514 [35]
43540 27994 [8]
43580 143577 [19, 31, 46]
43620 228261 [46]
43660 16921 [48]
43700 63063 [21]
43740 77742 [42]
43780 319224 [18, 26]
43860 88247 [37]
43900 284307 [45]
43940 24097 [46]
43980 16667 [19]
44020 16667 [19]
44060 471221 [53]
44100 210170 [17]
44140 692942 [25]
44180 436712 [29]
44220 138333 [39]
44260 47671 [28]
44300 153990 [42]
44340 70217 [13]
44380 159437 [37]
44420 118502 [51]
44500 37890 [48]
44540 22709 [8]
44580 58498 [17]
44600 124454 [39, 54]
44620 70019 [55]
44660 77350 [40]
44700 685306 [6]
44740 20260 [19]
44780 61295 [26]
44860 35161 [48]
44900 26015 [13]
44940 107456 [45]
44980 94528 [42]
45000 34895 [6]
45020 15216 [48]
45060 662577 [36]
45140 46987 [40]
45180 82291 [1]
45220 367413 [12]
45260 12093 [22]
45300 2783243 [12]
45340 32937 [35]
45380 34800 [17]
45460 172425 [18]
45500 136027 [5, 48]
45520 25213 [41]
45540 93420 [12]
45580 27153 [13]
45620 44720 [13]
45640 162878 [37]
45660 56745 [39]
45700 40118 [13]
45740 26175 [13]
45780 651429 [39]
45820 233870 [20]
45860 189927 [9]
45900 143372 [26]
45940 366513 [34]
45980 32899 [1]
46020 98764 [6]
46060 980263 [4]
46100 100210 [47]
46140 937478 [40]
46180 136268 [28]
46220 219461 [1]
46260 21452 [1]
46300 99604 [16]
46340 209714 [48]
46380 87841 [6]
46420 28961 [45]
46460 38620 [21, 47]
46500 40097 [39]
46540 299397 [36]
46620 26405 [48]
46660 139588 [13]
46700 413344 [6]
46740 34215 [1]
46780 28744 [39]
46820 13864 [46]
46860 32588 [49]
46900 13535 [48]
46980 48773 [28]
47020 115384 [48]
47080 36346 [13]
47180 38440 [18]
47220 156898 [34]
47260 1671683 [37, 51]
47300 442179 [6]
47340 32888 [18]
47380 234906 [48]
47420 22897 [27, 38]
47460 58781 [53]
47500 38892 [45]
47540 45949 [39]
47580 139900 [13]
47620 41815 [42]
47660 52595 [29]
47700 77358 [18]
47780 31648 [18]
47820 47759 [37]
47900 5582170 [11, 24, 51, 54]
47920 29030 [39]
47940 167819 [19]
47980 33130 [46]
48020 83686 [55]
48060 116229 [36]
48100 27731 [12]
48140 134063 [55]
48180 55070 [13]
48220 27469 [40]
48300 110884 [53]
48460 40400 [29]
48500 20634 [28]
48540 147950 [39, 54]
48580 102228 [55]
48620 623061 [20]
48660 151306 [48]
48700 116111 [42]
48740 118428 [9]
48780 22398 [38]
48820 42239 [27]
48900 362315 [37]
48940 42040 [39]
48980 81234 [37]
49020 128472 [51, 54]
49060 36311 [20]
49100 51461 [27]
49180 477717 [37]
49260 20081 [40]
49300 114520 [39]
49340 798552 [25]
49380 21378 [27]
49420 243231 [53]
49460 22438 [46]
49540 28065 [28]
49620 434972 [42]
49660 565773 [39, 42]
49700 166892 [6]
49740 195751 [4]
49780 86074 [39]

In [73]:
def P005_range(n0,n1): 
    return tuple(('P005'+ "{i:04d}".format(i=i) for i in xrange(n0,n1)))

P005_vars = P005_range(1,18)
P005_vars_str = ",".join(P005_vars)
P005_vars_with_name = ['NAME'] + list(P005_vars)


# http://manishamde.github.io/blog/2013/03/07/pandas-and-python-top-10/#create
def convert_to_rdotmap(row):
    """Map the P005 variables of a row to a Series with
    Name, Total, White, Black, Hispanic, Asian, and Other."""
    return pd.Series({'Total':row['P0050001'],
                      'White':row['P0050003'],
                      'Black':row['P0050004'],
                      'Asian':row['P0050006'],
                      'Hispanic':row['P0050010'],
                      'Other': row['P0050005'] + row['P0050007'] + row['P0050008'] + row['P0050009'],
                      'Name': row['NAME']
                      }, index=['Name', 'Total', 'White', 'Black', 'Hispanic', 'Asian', 'Other'])


def normalize(s):
    """take a Series and divide each item by the sum so that the new series adds up to 1.0"""
    total = np.sum(s)
    return s.astype('float') / total


def entropy(series):
    """Normalized Shannon index: entropy divided by the maximum possible
    entropy for the number of categories, so an even split over all
    categories returns 1.0."""
    # drop zero-count categories (0 * log 0 is taken as 0), but keep the
    # original length for the normalization term
    series1 = series[series != 0]

    if len(series) > 1:
        # maximum possible entropy for the number of input categories
        max_s = -np.log(1.0 / len(series))

        total = float(sum(series1))
        p = series1.astype('float') / total
        return sum(-p * np.log(p)) / max_s
    else:
        # 0 or 1 categories: no diversity
        return 0.0

    
def convert_P005_to_int(df):
    # do conversion in place
    df[list(P005_vars)] = df[list(P005_vars)].astype('int')
    return df
    

def diversity(r):
    """Return a DataFrame with Name, Total, the five group counts
    (White, Black, Hispanic, Asian, Other), and the entropy5/entropy4
    diversity columns."""
    df = DataFrame(r)
    df = convert_P005_to_int(df)
    df1 = df.apply(convert_to_rdotmap, axis=1)

    df1['entropy5'] = df1[['Asian','Black','Hispanic','White','Other']].apply(entropy,axis=1)
    df1['entropy4'] = df1[['Asian','Black','Hispanic','White']].apply(entropy,axis=1)
    return df1
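The cell above stops at entropy5/entropy4. Here is a sketch of the two remaining columns: gini_simpson as asked in task 5, and one *candidate* for entropy_rice (an assumption for you to verify against the 0.874/0.872 targets, not the published formula): take proportions against the five-group Total but sum over only the four named groups, still normalizing by ln 4.

```python
import numpy as np
import pandas as pd

def gini_simpson(series):
    """Gini-Simpson index over the given category counts: 1 - sum(p_i^2)."""
    p = series.astype('float') / series.sum()
    return 1.0 - np.sum(p ** 2)

def entropy_rice(row):
    """Candidate Rice entropy score (assumed, not confirmed): proportions
    relative to the five-group Total, summed over four groups, over ln(4)."""
    p = row[['Asian', 'Black', 'Hispanic', 'White']].astype('float') / row['Total']
    p = p[p > 0]
    return -np.sum(p * np.log(p)) / np.log(4)

# Applied to the df1 returned by diversity():
#   df1['gini_simpson'] = df1[['Asian', 'Black', 'Hispanic', 'White', 'Other']].apply(gini_simpson, axis=1)
#   df1['entropy_rice'] = df1.apply(entropy_rice, axis=1)
```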

In [225]:
#http://api.census.gov/data/2010/sf1?get=P0010001&for=metropolitan+statistical+area/micropolitan+statistical+area:11260&in=state:02
def get_all_msas(variables='NAME'):
    count = 0
    for msa in s.index:
        for state in list(grouped_msa.get_group(msa).state):
            geo = {'for': 'metropolitan statistical area/micropolitan statistical area:{msa_fips}'.format(msa_fips=msa),
                   'in': 'state:{state_fips}'.format(state_fips=state)}

            for each_msa in c.sf1.get(variables, geo=geo):
                count += 1
                # debugging output for the New York MSA (35620)
                if msa == '35620':
                    print state
                    print each_msa
                yield each_msa

In [226]:
r = list(get_all_msas(P005_vars_with_name))


34
{u'P0050015': u'738', u'P0050014': u'4916', u'P0050011': u'693977', u'NAME': u'New York-Northern New Jersey-Long Island, NY-NJ-PA Metro Area (part)', u'metropolitan statistical area/micropolitan statistical area': u'35620', u'P0050016': u'419929', u'P0050017': u'84217', u'P0050012': u'59911', u'state': u'34', u'P0050009': u'93679', u'P0050008': u'23517', u'P0050013': u'13553', u'P0050005': u'7789', u'P0050004': u'773074', u'P0050007': u'1304', u'P0050006': u'608231', u'P0050001': u'6471215', u'P0050010': u'1277241', u'P0050003': u'3686380', u'P0050002': u'5193974'}
36
{u'P0050015': u'2988', u'P0050014': u'12491', u'P0050011': u'1246568', u'NAME': u'New York-Northern New Jersey-Long Island, NY-NJ-PA Metro Area (part)', u'metropolitan statistical area/micropolitan statistical area': u'35620', u'P0050016': u'1249846', u'P0050017': u'227263', u'P0050012': u'258337', u'state': u'36', u'P0050009': u'206336', u'P0050008': u'70159', u'P0050013': u'47653', u'P0050005': u'23461', u'P0050004': u'2267972', u'P0050007': u'3542', u'P0050006': u'1252026', u'P0050001': u'12368525', u'P0050010': u'3045146', u'P0050003': u'5499883', u'P0050002': u'9323379'}
42
{u'P0050015': u'3', u'P0050014': u'14', u'P0050011': u'3307', u'NAME': u'New York-Northern New Jersey-Long Island, NY-NJ-PA Metro Area (part)', u'metropolitan statistical area/micropolitan statistical area': u'35620', u'P0050016': u'1116', u'P0050017': u'412', u'P0050012': u'272', u'state': u'42', u'P0050009': u'797', u'P0050008': u'77', u'P0050013': u'49', u'P0050005': u'127', u'P0050004': u'3050', u'P0050007': u'13', u'P0050006': u'583', u'P0050001': u'57369', u'P0050010': u'5173', u'P0050003': u'47549', u'P0050002': u'52196'}

In [227]:
len(r)


Out[227]:
1013

In [297]:
test1 = DataFrame(r)
len(test1)
# test.P0050001.astype('int').sum()


Out[297]:
1013

In [298]:
import numpy as np
df_diversity = diversity(r)
df_diversity


Out[298]:
Name Total White Black Hispanic Asian Other entropy5 entropy4
0 Abbeville, LA Micro Area 57999 46305 8246 1381 1148 919 0.428364 0.445662
1 Aberdeen, SD Micro Area 40602 37774 192 554 358 1724 0.203138 0.113942
2 Aberdeen, WA Micro Area 72797 59282 762 6272 995 5486 0.422324 0.321742
3 Abilene, TX Metro Area 165252 112735 11549 35108 2110 3750 0.570100 0.597267
4 Ada, OK Micro Area 37492 25973 879 1523 244 8873 0.525798 0.282587
5 Adrian, MI Micro Area 99892 87483 2397 7614 492 1906 0.312895 0.300877
6 Akron, OH Metro Area 703200 579151 83984 10733 14124 15208 0.396953 0.394132
7 Alamogordo, NM Micro Area 63797 33716 2048 22026 716 5291 0.665741 0.617931
8 Albany, GA Metro Area 157308 68820 81671 3270 1455 2092 0.548816 0.594085
9 Albany-Lebanon, OR Micro Area 116672 101579 456 9127 1078 4432 0.316341 0.260671
10 Albany-Schenectady-Troy, NY Metro Area 870716 721505 62963 36027 26983 23238 0.423665 0.414162
11 Albemarle, NC Micro Area 60585 49877 6590 2166 1108 844 0.405878 0.424194
12 Albert Lea, MN Micro Area 31255 27701 200 2750 232 372 0.274817 0.275758
13 Albertville, AL Micro Area 93019 78060 1389 11238 475 1857 0.354368 0.347732
14 Albuquerque, NM Metro Area 887077 374214 19766 414222 16769 62106 0.662122 0.629810
15 Alexander City, AL Micro Area 53155 36442 14606 1272 211 624 0.482905 0.520690
16 Alexandria, LA Metro Area 153922 98984 45378 4349 1598 3613 0.546936 0.568047
17 Alexandria, MN Micro Area 36009 34974 146 341 163 385 0.104223 0.079173
18 Alice, TX Micro Area 40838 8062 156 32254 134 232 0.357935 0.392505
19 Allegan, MI Micro Area 111408 99945 1264 7454 620 2125 0.269400 0.249417
20 Allentown-Bethlehem-Easton, PA-NJ Metro Area (... 108692 93165 3592 7659 2642 1634 0.363594 0.371422
21 Allentown-Bethlehem-Easton, PA-NJ Metro Area (... 712481 552576 31325 98939 17525 12116 0.477843 0.501141
22 Alma, MI Micro Area 42476 37144 2341 2301 145 545 0.317042 0.322703
23 Alpena, MI Micro Area 29598 28623 78 304 148 445 0.114741 0.078099
24 Altoona, PA Metro Area 127089 121495 2051 1230 689 1624 0.148198 0.124293
25 Altus, OK Micro Area 26446 17403 1918 5538 303 1284 0.615835 0.604190
26 Amarillo, TX Metro Area 249881 160881 14570 63116 6319 4995 0.601431 0.640350
27 Americus, GA Micro Area 37829 17025 18061 1878 454 411 0.598709 0.659004
28 Ames, IA Metro Area 89542 77812 2125 2695 5369 1541 0.344768 0.343437
29 Amsterdam, NY Micro Area 50219 42732 712 5654 356 765 0.337035 0.339575
30 Anchorage, AK Metro Area 380821 256490 16125 25362 24283 58561 0.648631 0.524013
31 Anderson, IN Metro Area 131636 113577 10887 4189 547 2436 0.335387 0.328977
32 Anderson, SC Metro Area 187126 147362 29810 5447 1384 3123 0.427670 0.442685
33 Andrews, TX Micro Area 14786 7083 199 7195 85 224 0.530728 0.568132
34 Angola, IN Micro Area 34185 32513 155 982 162 373 0.154586 0.137582
35 Ann Arbor, MI Metro Area 344791 248675 43152 13860 27005 12099 0.585306 0.590587
36 Anniston-Oxford, AL Metro Area 118572 87285 24177 3893 830 2387 0.481695 0.498171
37 Appleton, WI Metro Area 225666 203691 1918 8049 6235 5773 0.276392 0.241187
38 Arcadia, FL Micro Area 34862 19549 4338 10425 169 381 0.633690 0.699896
39 Ardmore, OK Micro Area 56980 41684 3297 3641 547 7811 0.550703 0.406843
40 Arkadelphia, AR Micro Area 22995 16177 5366 926 120 406 0.506412 0.533308
41 Asheville, NC Metro Area 424858 366448 18976 27100 3664 8670 0.349423 0.340762
42 Ashland, OH Micro Area 53139 51352 395 501 275 616 0.119532 0.094279
43 Ashtabula, OH Micro Area 101497 92126 3467 3441 372 2091 0.260059 0.234344
44 Astoria, OR Micro Area 37039 32295 163 2838 445 1298 0.317363 0.268305
45 Atchison, KS Micro Area 16924 15152 844 383 70 475 0.284117 0.244408
46 Athens, OH Micro Area 64757 58779 1741 1002 1741 1494 0.269547 0.239279
47 Athens, TN Micro Area 52266 47273 2021 1482 385 1105 0.270483 0.245305
48 Athens, TX Micro Area 78532 63494 4813 8490 318 1417 0.421412 0.431882
49 Athens-Clarke County, GA Metro Area 192541 130515 37182 15313 6065 3466 0.598788 0.641689
50 Atlanta-Sandy Springs-Marietta, GA Metro Area 5268860 2671757 1679979 547400 252510 117214 0.729649 0.787682
51 Atlantic City-Hammonton, NJ Metro Area 274549 160871 40882 46241 20419 6136 0.730078 0.787988
52 Auburn, IN Micro Area 42223 40406 141 1031 186 459 0.139699 0.120237
53 Auburn, NY Micro Area 80026 73098 3009 1896 387 1636 0.248565 0.221134
54 Auburn-Opelika, AL Metro Area 140247 97900 31674 4571 3615 2487 0.537048 0.569332
55 Augusta-Richmond County, GA-SC Metro Area (part) 369793 190907 145579 15474 8146 9687 0.634125 0.666177
56 Augusta-Richmond County, GA-SC Metro Area (part) 187084 123762 48994 9238 1412 3678 0.551061 0.581431
57 Augusta-Waterville, ME Micro Area 122151 116549 642 1504 886 2570 0.151290 0.104215
58 Austin, MN Micro Area 39163 32975 791 4138 636 623 0.369001 0.375469
59 Austin-Round Rock-San Marcos, TX Metro Area 1716289 938474 120510 538313 80980 38012 0.688892 0.739489
... ... ... ... ... ... ... ... ...

1013 rows × 9 columns


In [299]:
temp_diversity = df_diversity


---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-299-a1d55402ef66> in <module>()
      1 temp_diversity = df_diversity
----> 2 df_diversity = df_diversity.groupby('metropolitan statistical area/micropolitan statistical area').sum()

/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/pandas/core/generic.pyc in groupby(self, by, axis, level, as_index, sort, group_keys, squeeze)
   2432         axis = self._get_axis_number(axis)
   2433         return groupby(self, by, axis=axis, level=level, as_index=as_index,
-> 2434                        sort=sort, group_keys=group_keys, squeeze=squeeze)
   2435 
   2436     def asfreq(self, freq, method=None, how=None, normalize=False):

/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/pandas/core/groupby.pyc in groupby(obj, by, **kwds)
    787         raise TypeError('invalid type: %s' % type(obj))
    788 
--> 789     return klass(obj, by, **kwds)
    790 
    791 

/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/pandas/core/groupby.pyc in __init__(self, obj, keys, axis, level, grouper, exclusions, selection, as_index, sort, group_keys, squeeze)
    236         if grouper is None:
    237             grouper, exclusions = _get_grouper(obj, keys, axis=axis,
--> 238                                                level=level, sort=sort)
    239 
    240         self.grouper = grouper

/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/pandas/core/groupby.pyc in _get_grouper(obj, key, axis, level, sort)
   1610             exclusions.append(gpr)
   1611             name = gpr
-> 1612             gpr = obj[gpr]
   1613 
   1614         if isinstance(gpr, Categorical) and len(gpr) != len(obj):

/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/pandas/core/frame.pyc in __getitem__(self, key)
   1633             return self._getitem_multilevel(key)
   1634         else:
-> 1635             return self._getitem_column(key)
   1636 
   1637     def _getitem_column(self, key):

/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/pandas/core/frame.pyc in _getitem_column(self, key)
   1640         # get column
   1641         if self.columns.is_unique:
-> 1642             return self._get_item_cache(key)
   1643 
   1644         # duplicate columns & possible reduce dimensionaility

/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/pandas/core/generic.pyc in _get_item_cache(self, item)
    981         res = cache.get(item)
    982         if res is None:
--> 983             values = self._data.get(item)
    984             res = self._box_item_values(item, values)
    985             cache[item] = res

/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/pandas/core/internals.pyc in get(self, item)
   2752                 return self.get_for_nan_indexer(indexer)
   2753 
-> 2754             _, block = self._find_block(item)
   2755             return block.get(item)
   2756         else:

/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/pandas/core/internals.pyc in _find_block(self, item)
   3063 
   3064     def _find_block(self, item):
-> 3065         self._check_have(item)
   3066         for i, block in enumerate(self.blocks):
   3067             if item in block:

/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/pandas/core/internals.pyc in _check_have(self, item)
   3070     def _check_have(self, item):
   3071         if item not in self.items:
-> 3072             raise KeyError('no item named %s' % com.pprint_thing(item))
   3073 
   3074     def reindex_axis(self, new_axis, indexer=None, method=None, axis=0,

KeyError: u'no item named metropolitan statistical area/micropolitan statistical area'

In [263]:
type(df_diversity)


Out[263]:
pandas.core.frame.DataFrame

In [264]:
#'Asian','Black','Hispanic', 'Other', 'Total', 'White',
#  'entropy4', 'entropy5', 'entropy_rice', 'gini_simpson',
#'p_Asian', 'p_Black', 'p_Hispanic', 'p_Other','p_White'
df_diversity['p_Asian'] = df_diversity['Asian']/df_diversity['Total']
df_diversity['p_Black'] = df_diversity['Black']/df_diversity['Total']
df_diversity['p_Hispanic'] = df_diversity['Hispanic']/df_diversity['Total']
df_diversity['p_Other'] = df_diversity['Other']/df_diversity['Total']
df_diversity['p_White'] = df_diversity['White']/df_diversity['Total']

In [265]:
def entropyRice(series):
    """Rice-style entropy: normalized 5-group Shannon entropy minus
    the contribution of the 'Other' category"""
    
    # eliminate 0s
    series1 = series[series!=0]
    
    if len(series) > 1:
        # maximum possible entropy for the full number of categories (ln 5 here)
        max_s = -np.log(1.0/len(series))
    
        total = float(sum(series1))
        p = series1.astype('float')/total
        p_other = series['Other'].astype('float')/total
        E_other = -(p_other*np.log(p_other))/max_s
        return (sum(-p*np.log(p))/max_s) - E_other
    else: 
        return 0.0

def giniSimpson(series):
    """Gini-Simpson index: sum(p*(1-p)), i.e. 1 - sum(p**2)"""
    
    # eliminate 0s
    series1 = series[series!=0]
    
    if len(series) > 1:
        total = float(sum(series1))
        p = series1.astype('float')/total
        return sum(p*(1-p))
    else: 
        return 0.0 
    
    
df_diversity['gini_simpson'] = df_diversity[['Asian','Black','Hispanic','White','Other']].apply(giniSimpson,axis=1)

df_diversity['entropy_rice'] = df_diversity[['Asian','Black','Hispanic','White','Other']].apply(entropyRice,axis=1)

In [266]:
df_diversity.head()


Out[266]:
Total White Black Hispanic Asian Other entropy5 entropy4 p_Asian p_Black p_Hispanic p_Other p_White gini_simpson entropy_rice
Name
Abbeville, LA Micro Area 57999 46305 8246 1381 1148 919 0.428364 0.445662 0.019793 0.142175 0.023811 0.015845 0.798376 0.341173 0.387557
Aberdeen, SD Micro Area 40602 37774 192 554 358 1724 0.203138 0.113942 0.008817 0.004729 0.013645 0.042461 0.930348 0.132363 0.119791
Aberdeen, WA Micro Area 72797 59282 762 6272 995 5486 0.422324 0.321742 0.013668 0.010467 0.086157 0.075360 0.814347 0.323441 0.301262
Abilene, TX Metro Area 165252 112735 11549 35108 2110 3750 0.570100 0.597267 0.012768 0.069887 0.212451 0.022693 0.682201 0.483905 0.516722
Ada, OK Micro Area 37492 25973 879 1523 244 8873 0.525798 0.282587 0.006508 0.023445 0.040622 0.236664 0.692761 0.461830 0.313886

5 rows × 15 columns


In [267]:
msas_df = df_diversity
#df_diversity.sort_index(by="entropy5", ascending=False).head()

In [285]:
assert len(msas_df) == 942  
type(msas_df)


Out[285]:
pandas.core.frame.DataFrame

In [295]:
top_10_metros = msas_df.sort_index(by='Total', ascending=False)[:10]
msa_codes_in_top_10_pop_sorted_by_entropy_rice = list(top_10_metros.sort_index(by='entropy_rice', 
                                                ascending=False).index) 
type(top_10_metros)
top_10_metros
# msa_codes_in_top_10_pop_sorted_by_entropy_rice
to_unicode(msa_codes_in_top_10_pop_sorted_by_entropy_rice)
# [u'Houston-Sugar Land-Baytown, TX Metro Area',
#  u'New York-Northern New Jersey-Long Island, NY-NJ-PA Metro Area (part)',
#  u'Washington-Arlington-Alexandria, DC-VA-MD-WV Metro Area (part)',
#  u'Los Angeles-Long Beach-Santa Ana, CA Metro Area',
#  u'Dallas-Fort Worth-Arlington, TX Metro Area',
#  u'Miami-Fort Lauderdale-Pompano Beach, FL Metro Area',
#  u'Chicago-Joliet-Naperville, IL-IN-WI Metro Area (part)',
#  u'Atlanta-Sandy Springs-Marietta, GA Metro Area',
#  u'Philadelphia-Camden-Wilmington, PA-NJ-DE-MD Metro Area (part)',
#  u'Boston-Cambridge-Quincy, MA-NH Metro Area (part)']


Out[295]:
[u'Houston-Sugar Land-Baytown, TX Metro Area',
 u'New York-Northern New Jersey-Long Island, NY-NJ-PA Metro Area (part)',
 u'Washington-Arlington-Alexandria, DC-VA-MD-WV Metro Area (part)',
 u'Los Angeles-Long Beach-Santa Ana, CA Metro Area',
 u'Dallas-Fort Worth-Arlington, TX Metro Area',
 u'Miami-Fort Lauderdale-Pompano Beach, FL Metro Area',
 u'Chicago-Joliet-Naperville, IL-IN-WI Metro Area (part)',
 u'Atlanta-Sandy Springs-Marietta, GA Metro Area',
 u'Philadelphia-Camden-Wilmington, PA-NJ-DE-MD Metro Area (part)',
 u'Boston-Cambridge-Quincy, MA-NH Metro Area (part)']

In [155]:
# Testing code

def to_unicode(vals):
    return [unicode(v) for v in vals]

def test_msas_df(msas_df):

    min_set_of_columns =  set(['Asian','Black','Hispanic', 'Other', 'Total', 'White',
     'entropy4', 'entropy5', 'entropy_rice', 'gini_simpson','p_Asian', 'p_Black',
     'p_Hispanic', 'p_Other','p_White'])  
    
    # the intersection equals min_set_of_columns only if every required column is present
    assert min_set_of_columns & set(msas_df.columns) == min_set_of_columns
    
    # https://www.census.gov/geo/maps-data/data/tallies/national_geo_tallies.html
    # 366 metro areas
    # 576 micropolitan areas
    
    assert len(msas_df) == 942  
    
    # total number of people in metro/micro areas
    
    assert msas_df.Total.sum() == 289261315
    assert msas_df['White'].sum() == 180912856
    assert msas_df['Other'].sum() == 8540181
    
    # list of MSAs in descending order by entropy_rice 
    # calculate the top 10 metros by population
    top_10_metros = msas_df.sort_index(by='Total', ascending=False)[:10]
    
    msa_codes_in_top_10_pop_sorted_by_entropy_rice = list(top_10_metros.sort_index(by='entropy_rice', 
                                                ascending=False).index) 
    
    assert to_unicode(msa_codes_in_top_10_pop_sorted_by_entropy_rice)== [u'26420', u'35620', u'47900', u'31100', u'19100', 
        u'33100', u'16980', u'12060', u'37980', u'14460']


    top_10_metro = msas_df.sort_index(by='Total', ascending=False)[:10]
    
    list(top_10_metro.sort_index(by='entropy_rice', ascending=False)['entropy5'])
    
    np.testing.assert_allclose(top_10_metro.sort_index(by='entropy_rice', ascending=False)['entropy5'], 
    [0.79628076626851163, 0.80528601550164602, 0.80809418318973791, 0.7980698349711991,
     0.75945930510650161, 0.74913610558765376, 0.73683277781032397, 0.72964862063970914,
     0.64082509648457675, 0.55697288400004963])
    
    np.testing.assert_allclose(top_10_metro.sort_index(by='entropy_rice', ascending=False)['entropy_rice'],
    [0.87361766576115552,
     0.87272877244078051,
     0.85931803868749834,
     0.85508015237749468,
     0.82169723530719896,
     0.81953527301129059,
     0.80589423784325431,
     0.78602596561378812,
     0.68611350427640316,
     0.56978827050565117])

In [270]:
# you are on the right track if test_msas_df doesn't complain
test_msas_df(msas_df)


---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-270-824fb53aa28c> in <module>()
      1 # you are on the right track if test_msas_df doesn't complain
----> 2 test_msas_df(msas_df)

<ipython-input-155-1e1577c3e222> in test_msas_df(msas_df)
     33 
     34     assert to_unicode(msa_codes_in_top_10_pop_sorted_by_entropy_rice)== [u'26420', u'35620', u'47900', u'31100', u'19100', 
---> 35         u'33100', u'16980', u'12060', u'37980', u'14460']
     36 
     37 

AssertionError: 

In [ ]:
# code to save your dataframe to a CSV
# upload the CSV to bCourses
# uncomment to run
# msas_df.to_csv("msas_2010.csv", encoding="UTF-8")

In [ ]:
# load back the CSV and test again
# df = DataFrame.from_csv("msas_2010.csv", encoding="UTF-8")
# test_msas_df(df)

In [ ]: