In [1]:

    
%pylab --no-import-all inline









    



Populating the interactive namespace from numpy and matplotlib

Some Context

The US Census is complex, so it's good, even essential, to have a framing question to guide your explorations so that you don't get distracted or lost.

I started thinking about the census in 2002 when I saw a woman I knew in the following SF Chronicle article:

Claremont-Elmwood / Homogeneity in Berkeley? Well, yeah - SFGate

At that point I thought it should be easy for regular people to do census calculations.

In the summer of 2013, I wrote the following note to Greg Wilson about diversity calculations:

notes for Greg Wilson about an example Data Science Workflow

There's a whole cottage industry in musing on "diversity" in the USA:

The Most Diverse Cities In The US - Business Insider -- using 4 categories: Vallejo.
Most And Least Diverse Cities: Brown University Study Evaluates Diversity In The U.S.
The Top 10 Most Diverse Cities in America -- LA?

and let's not forget the Racial Dot Map and some background.



In [2]:

    
# Shows the version of pandas that we are using
!pip show pandas









    



---
Name: pandas
Version: 0.13.0
Location: /Users/Morgan/anaconda/lib/python2.7/site-packages
Requires:



In [3]:

    
#  import useful classes of pandas
import numpy as np
import pandas as pd
from pandas import Series, DataFrame, Index

http://www.census.gov/developers/

Dependency: to start with -- let's use the Python module: https://pypi.python.org/pypi/census/

pip install -U  census

Things we'd like to be able to do:

calculate the population of California.
then calculate the population of every geographic entity going down to census block if possible.
for a given geographic unit, can we get the racial/ethnic breakdown?

Figuring out the Census Data is a Big Jigsaw Puzzle

Some starting points:

We focus first on the API -- and I hope we can come back later to processing the bulk data from the Census FTP site.

Prerequisites: Getting and activating key

Fill out the form at http://www.census.gov/developers/tos/key_request.html

"Your request for a new API key has been successfully submitted. Please check your email. In a few minutes you should receive a message with instructions on how to activate your new key."

Click on the activation link in that email; it has the form http://api.census.gov/data/KeySignup?validate={key}

Then create a settings.py in the same directory as this notebook (or somewhere else in your Python path) that defines CENSUS_KEY, so that the notebook can read it as settings.CENSUS_KEY
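A minimal sketch of what such a settings.py might contain. Reading the key from an environment variable named CENSUS_KEY is an assumption made here to keep the key out of version control, and the fallback string is a hypothetical placeholder; a plain string assignment would work just as well.

```python
# settings.py -- a minimal sketch, not the author's actual file.
import os

# Assumption: the key has been exported in the shell as CENSUS_KEY.
# The fallback below is a hypothetical placeholder, not a working key.
CENSUS_KEY = os.environ.get("CENSUS_KEY", "YOUR-CENSUS-API-KEY")
```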



In [5]:

    
import settings



In [6]:

    
# This cell should run successfully if settings.CENSUS_KEY is set to a string

try:
    import settings
    # basestring covers both str and unicode in Python 2
    assert isinstance(settings.CENSUS_KEY, basestring)
except Exception as e:
    print "error in importing settings to get at settings.CENSUS_KEY", e

states module



In [7]:

    
# let's figure out a bit about the us module, in particular, us.states
# https://github.com/unitedstates/python-us

from us import states

for (i, state) in enumerate(states.STATES):
    print i, state.name, state.fips









    



0 Alabama 01
1 Alaska 02
2 Arizona 04
3 Arkansas 05
4 California 06
5 Colorado 08
6 Connecticut 09
7 Delaware 10
8 District of Columbia 11
9 Florida 12
10 Georgia 13
11 Hawaii 15
12 Idaho 16
13 Illinois 17
14 Indiana 18
15 Iowa 19
16 Kansas 20
17 Kentucky 21
18 Louisiana 22
19 Maine 23
20 Maryland 24
21 Massachusetts 25
22 Michigan 26
23 Minnesota 27
24 Mississippi 28
25 Missouri 29
26 Montana 30
27 Nebraska 31
28 Nevada 32
29 New Hampshire 33
30 New Jersey 34
31 New Mexico 35
32 New York 36
33 North Carolina 37
34 North Dakota 38
35 Ohio 39
36 Oklahoma 40
37 Oregon 41
38 Pennsylvania 42
39 Rhode Island 44
40 South Carolina 45
41 South Dakota 46
42 Tennessee 47
43 Texas 48
44 Utah 49
45 Vermont 50
46 Virginia 51
47 Washington 53
48 West Virginia 54
49 Wisconsin 55
50 Wyoming 56

Questions to ponder: How many states are in the list? Is DC included in the states list? How do you access the territories?

Formulating URL requests by hand

It's immensely useful to be able to access the census API directly by creating a URL with the proper parameters -- as well as by using the census package.
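As a rough sketch of how such a URL could be assembled from its parts, here is a small helper written for Python 3 with the standard library only (the cells in this notebook are Python 2). The function name build_census_url and the key "MY_KEY" are hypothetical; the endpoint and the parameter names key, get, for and in are the ones used in the requests in this notebook.

```python
from urllib.parse import urlencode

# Endpoint for the 2010 SF1 data set, as used in this notebook.
SF1_2010_URL = "http://api.census.gov/data/2010/sf1"


def build_census_url(key, variables, for_clause, in_clause=None):
    """Return a query URL for the 2010 SF1 endpoint (hypothetical helper)."""
    params = [("key", key), ("get", ",".join(variables)), ("for", for_clause)]
    if in_clause is not None:
        params.append(("in", in_clause))
    # Leave ':', '*' and ',' unescaped so the result reads like a
    # hand-written URL.
    return SF1_2010_URL + "?" + urlencode(params, safe=":*,")


# "MY_KEY" is a placeholder, not a real API key.
states_url = build_census_url("MY_KEY", ["P0010001", "NAME"], "state:*")
counties_url = build_census_url(
    "MY_KEY", ["NAME", "P0010001"], "county:*", in_clause="state:06"
)
print(states_url)
print(counties_url)
```

Keeping the parameters in a list of pairs preserves their order, which makes the generated URL easy to compare against one typed by hand.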



In [8]:

    
import requests



In [9]:

    
# get the total population of all states
url = "http://api.census.gov/data/2010/sf1?key={key}&get=P0010001,NAME&for=state:*".format(key=settings.CENSUS_KEY)



In [10]:

    
# note the structure of the response
r = requests.get(url)
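The response body is a JSON list of lists whose first row is a header. The sketch below (Python 3, standard library only) works through that structure on an abbreviated, hard-coded sample rather than a live request. The Alabama and Alaska rows are copied from this notebook's output; the Puerto Rico figure is the difference between the two totals reported in this notebook, and its state code 72 is filled in from the FIPS code list.

```python
import json

# Abbreviated stand-in for r.text: header row first, then one row per state.
sample_body = """[["P0010001", "NAME", "state"],
 ["4779736", "Alabama", "01"],
 ["710231", "Alaska", "02"],
 ["3725789", "Puerto Rico", "72"]]"""

rows = json.loads(sample_body)
header, records = rows[0], rows[1:]

# Pair each value with its column name; every value arrives as a string.
as_dicts = [dict(zip(header, record)) for record in records]

total = sum(int(d["P0010001"]) for d in as_dicts)
total_without_pr = sum(
    int(d["P0010001"]) for d in as_dicts if d["NAME"] != "Puerto Rico"
)
print(total, total_without_pr)
```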

Total Population



In [11]:

    
# FILL IN
# drop the header record
from itertools import islice
# total population including PR is 312471327



In [14]:

    
state_data=r.json()
state_data=state_data[1:]
# print state_data[0]
total_pop_US_withPR=sum([int(i[0]) for i in state_data])
print str(total_pop_US_withPR) + " with PR"









    



312471327 with PR



In [15]:

    
# FILL IN
# exclude PR:  308745538
total_pop_US=sum([int(i[0]) for i in state_data if i[1]!="Puerto Rico"])
print str(total_pop_US) + " without PR"









    



308745538 without PR



In [16]:

    
# let's now create a DataFrame from r.json()

df = DataFrame(r.json()[1:], columns=r.json()[0])
df.head()









    Out[16]:

   P0010001        NAME state
0   4779736     Alabama    01
1    710231      Alaska    02
2   6392017     Arizona    04
3   2915918    Arkansas    05
4  37253956  California    06

5 rows × 3 columns



In [38]:

    
# FILL IN
# calculate the total population using df
df['P0010001'].astype(int).sum()









    Out[38]:





312471327



In [39]:

    
# FILL IN -- now calculate the total population excluding Puerto Rico
df[df['NAME'] != 'Puerto Rico']['P0010001'].astype(int).sum()









    Out[39]:





308745538

Focusing on SF1 + 2010 census

How do we map out the geographical hierarchy and pull out total population figures?

Nation
Regions
Divisions
State
County
Census Tract
Block Group
Census Block

Questions

What identifiers are used for these various geographic entities?
Can we get an enumeration of each of these entities?
How do we figure out which census tract, block group, and census block a given location is in?
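As a partial answer to the first question: census geographic identifiers are built from FIPS codes. A block identifier concatenates a 2-digit state code, a 3-digit county code, a 6-digit tract code and a 4-digit block code, and the block group is the first digit of the block code. The sketch below (Python 3) splits such an identifier into its parts. The function name and the sample identifier are hypothetical: 06 is California's FIPS code as listed above, but the remaining digits are made up to show the layout and are not checked against real data.

```python
from collections import namedtuple

BlockId = namedtuple("BlockId", "state county tract block_group block")


def split_block_geoid(geoid):
    """Split a 15-digit block identifier into its component codes."""
    if len(geoid) != 15 or not geoid.isdigit():
        raise ValueError("expected a 15-digit block identifier")
    block = geoid[11:15]
    return BlockId(
        state=geoid[0:2],
        county=geoid[2:5],
        tract=geoid[5:11],
        block_group=block[0],  # first digit of the block code
        block=block,
    )


# Hypothetical identifier, used only to illustrate the layout.
parts = split_block_geoid("060014001001000")
print(parts)
```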

Total Population of California

2010 Census Summary File 1

P0010001 ("total population") is listed in 2010 SF1 API Variables [XML].



In [40]:

    
from settings import CENSUS_KEY
import census

c = census.Census(CENSUS_KEY)
c.sf1.get(('NAME', 'P0010001'), {'for': 'state:%s' % states.CA.fips})









    Out[40]:





[{u'NAME': u'California', u'P0010001': u'37253956', u'state': u'06'}]



In [41]:

    
"population of California: {0}".format(
        int(c.sf1.get(('NAME', 'P0010001'), {'for': 'state:%s' % states.CA.fips})[0]['P0010001']))









    Out[41]:





'population of California: 37253956'

Let's try to get at the counties of California and their populations



In [42]:

    
ca_counties = c.sf1.get(('NAME', 'P0010001'), geo={'for': 'county:*', 'in': 'state:%s' % states.CA.fips})
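The result is a list of dicts, one per county, with the population held as a string. A small sketch (Python 3, standard library only) of why the conversion to int matters before ranking: the four records below are copied from this notebook's county output, and sorting the raw strings would put Sierra County ahead of Alameda County.

```python
# Sample records with the same keys as the county results in this notebook.
sample_counties = [
    {"NAME": "Alameda County", "P0010001": "1510271", "county": "001", "state": "06"},
    {"NAME": "Alpine County", "P0010001": "1175", "county": "003", "state": "06"},
    {"NAME": "Los Angeles County", "P0010001": "9818605", "county": "037", "state": "06"},
    {"NAME": "Sierra County", "P0010001": "3240", "county": "091", "state": "06"},
]

# Numeric sort: convert the population to int in the sort key.
by_population = sorted(
    sample_counties, key=lambda rec: int(rec["P0010001"]), reverse=True
)

# String sort, shown only for contrast: "3240" > "1510271" as text.
by_text = sorted(sample_counties, key=lambda rec: rec["P0010001"], reverse=True)

print([rec["NAME"] for rec in by_population])
print([rec["NAME"] for rec in by_text])
```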



In [43]:

    
# create a DataFrame, convert the 'P0010001' column
# show by descending population
df = DataFrame(ca_counties)
df['P0010001'] = df['P0010001'].astype('int')
df.sort_index(by='P0010001', ascending=False)









    Out[43]:

                      NAME  P0010001  county  state
18      Los Angeles County   9818605     037     06
36        San Diego County   3095313     073     06
29           Orange County   3010232     059     06
32        Riverside County   2189641     065     06
35   San Bernardino County   2035210     071     06
42      Santa Clara County   1781642     085     06
0           Alameda County   1510271     001     06
33       Sacramento County   1418788     067     06
6      Contra Costa County   1049025     013     06
9            Fresno County    930450     019     06
14             Kern County    839631     029     06
55          Ventura County    823318     111     06
37    San Francisco County    805235     075     06
40        San Mateo County    718451     081     06
38      San Joaquin County    685306     077     06
49       Stanislaus County    514453     099     06
48           Sonoma County    483878     097     06
53           Tulare County    442179     107     06
41    Santa Barbara County    423895     083     06
26         Monterey County    415057     053     06
47           Solano County    413344     095     06
30           Placer County    348432     061     06
39  San Luis Obispo County    269637     079     06
43       Santa Cruz County    262382     087     06
23           Merced County    255793     047     06
20            Marin County    252409     041     06
3             Butte County    220000     007     06
56             Yolo County    200849     113     06
8         El Dorado County    181058     017     06
44           Shasta County    177223     089     06
12         Imperial County    174528     025     06
15            Kings County    152982     031     06
19           Madera County    150865     039     06
27             Napa County    136484     055     06
11         Humboldt County    134623     023     06
28           Nevada County     98764     057     06
50           Sutter County     94737     101     06
22        Mendocino County     87841     045     06
57             Yuba County     72155     115     06
16             Lake County     64665     033     06
51           Tehama County     63463     103     06
54         Tuolumne County     55365     109     06
34       San Benito County     55269     069     06
4         Calaveras County     45578     009     06
46         Siskiyou County     44900     093     06
2            Amador County     38091     005     06
17           Lassen County     34895     035     06
7         Del Norte County     28610     015     06
10            Glenn County     28122     021     06
5            Colusa County     21419     011     06
31           Plumas County     20007     063     06
13             Inyo County     18546     027     06
21         Mariposa County     18251     043     06
25             Mono County     14202     051     06
52          Trinity County     13786     105     06
24            Modoc County      9686     049     06
45           Sierra County      3240     091     06
1            Alpine County      1175     003     06

58 rows × 4 columns



In [44]:

    
#http://stackoverflow.com/a/13130357/7782
count,division = np.histogram(df['P0010001'])
df['P0010001'].hist(bins=division)









    Out[44]:





<matplotlib.axes.AxesSubplot at 0x109a4ce50>
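With its default arguments, np.histogram splits the range from the smallest to the largest value into 10 equal-width bins and returns the counts together with the 11 bin edges, which the cell above passes on to hist. The sketch below (Python 3, standard library only) imitates that binning on six county populations copied from the table above. The function equal_width_counts is a hypothetical stand-in, not NumPy's implementation, and its floating-point rounding at exact bin edges may differ from NumPy's.

```python
# A few county populations copied from the table above.
populations = [9818605, 3095313, 3010232, 1510271, 805235, 1175]


def equal_width_counts(values, n_bins=10):
    """Count values in n_bins equal-width bins spanning min..max."""
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("need at least two distinct values")
    width = (hi - lo) / n_bins
    edges = [lo + i * width for i in range(n_bins + 1)]
    counts = [0] * n_bins
    for v in values:
        # The last bin is closed on the right, so the maximum lands in it.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts, edges


counts, edges = equal_width_counts(populations)
print(counts)
print(edges[0], edges[-1])
```

Most counties fall in the first few bins, which is why the plotted histogram is so heavily skewed toward the left.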


