In [1]:
%pylab --no-import-all inline


Populating the interactive namespace from numpy and matplotlib

Some Context

The US Census is complex, so it's good, even essential, to have a framing question to guide your explorations so that you don't get distracted or lost.

I started thinking about the census in 2002, when I saw a woman I knew in the following SF Chronicle article:

Claremont-Elmwood / Homogeneity in Berkeley? Well, yeah - SFGate

I thought at that point that it should be easy for regular people to do census calculations.

In the summer of 2013, I wrote the following note to Greg Wilson about diversity calculations:

notes for Greg Wilson about an example Data Science Workflow

There's a whole cottage industry in musing on "diversity" in the USA:

and let's not forget the Racial Dot Map and some background.


In [2]:
# Shows the version of pandas that we are using
!pip show pandas


---
Name: pandas
Version: 0.13.0
Location: /Users/Morgan/anaconda/lib/python2.7/site-packages
Requires: 

In [3]:
#  import useful classes of pandas
import numpy as np
import pandas as pd
from pandas import Series, DataFrame, Index

http://www.census.gov/developers/

Dependency: to start with -- let's use the Python module: https://pypi.python.org/pypi/census/

pip install -U  census

Things we'd like to be able to do:

  • calculate the population of California.
  • then calculate the population of every geographic entity going down to census block if possible.
  • for a given geographic unit, can we get the racial/ethnic breakdown?

Figuring out the Census Data is a Big Jigsaw Puzzle

Some starting points:

We focus first on the API, and I hope we can come back to processing the bulk data from the Census FTP site.

Prerequisites: Getting and activating key

"Your request for a new API key has been successfully submitted. Please check your email. In a few minutes you should receive a message with instructions on how to activate your new key."

Then create a settings.py in the same directory as this notebook (or somewhere else in your Python path) to hold settings.CENSUS_KEY
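
A minimal sketch of what such a settings.py might contain. The value shown is a placeholder, not a real key; replace it with the key you activated by email.

```python
# settings.py: a hypothetical minimal version; keep this file out of version control
# Placeholder value: replace with your own activated Census API key.
CENSUS_KEY = "YOUR_CENSUS_API_KEY"
```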


In [5]:
import settings

In [6]:
# This cell should run successfully if you have a string set up to represent your census key

try:
    import settings
    # basestring covers both str and unicode in Python 2
    assert isinstance(settings.CENSUS_KEY, basestring)
except Exception as e:
    print "error in importing settings to get at settings.CENSUS_KEY", e

states module


In [7]:
# let's figure out a bit about the us module, in particular, us.states
# https://github.com/unitedstates/python-us

from us import states

for (i, state) in enumerate(states.STATES):
    print i, state.name, state.fips


0 Alabama 01
1 Alaska 02
2 Arizona 04
3 Arkansas 05
4 California 06
5 Colorado 08
6 Connecticut 09
7 Delaware 10
8 District of Columbia 11
9 Florida 12
10 Georgia 13
11 Hawaii 15
12 Idaho 16
13 Illinois 17
14 Indiana 18
15 Iowa 19
16 Kansas 20
17 Kentucky 21
18 Louisiana 22
19 Maine 23
20 Maryland 24
21 Massachusetts 25
22 Michigan 26
23 Minnesota 27
24 Mississippi 28
25 Missouri 29
26 Montana 30
27 Nebraska 31
28 Nevada 32
29 New Hampshire 33
30 New Jersey 34
31 New Mexico 35
32 New York 36
33 North Carolina 37
34 North Dakota 38
35 Ohio 39
36 Oklahoma 40
37 Oregon 41
38 Pennsylvania 42
39 Rhode Island 44
40 South Carolina 45
41 South Dakota 46
42 Tennessee 47
43 Texas 48
44 Utah 49
45 Vermont 50
46 Virginia 51
47 Washington 53
48 West Virginia 54
49 Wisconsin 55
50 Wyoming 56

Questions to ponder: How many states are in the list? Is DC included in the states list? How do we access the territories?

Formulating URL requests by hand

It's immensely useful to be able to access the census API directly by creating a URL with the proper parameters -- as well as by using the census package.
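
To make the pieces of such a URL explicit, here is a small sketch. The helper name build_census_url and the placeholder key are inventions for illustration; the base URL and the key, get, for, and in parameters are the ones used by the requests in this notebook. No URL escaping is attempted, so it only suits simple values like the ones shown.

```python
def build_census_url(key, variables, geo_for, geo_in=None,
                     base="http://api.census.gov/data/2010/sf1"):
    """Assemble a Census API query URL from its parts (hypothetical helper)."""
    params = [
        ("key", key),                  # your API key
        ("get", ",".join(variables)),  # comma-separated variable names
        ("for", geo_for),              # the geography to return, e.g. "state:*"
    ]
    if geo_in is not None:
        params.append(("in", geo_in))  # the enclosing geography, e.g. "state:06"
    query = "&".join("{0}={1}".format(name, value) for name, value in params)
    return "{0}?{1}".format(base, query)


# "YOUR_KEY" is a placeholder, not a working key
url_all_states = build_census_url("YOUR_KEY", ["P0010001", "NAME"], "state:*")
url_ca_counties = build_census_url("YOUR_KEY", ["P0010001", "NAME"],
                                   "county:*", geo_in="state:06")
```

Keeping the parameters in a list rather than one long format string makes it easier to see which part of the request changes when you move from states to counties.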


In [8]:
import requests

In [9]:
# get the total population of all states
url = "http://api.census.gov/data/2010/sf1?key={key}&get=P0010001,NAME&for=state:*".format(key=settings.CENSUS_KEY)

In [10]:
# note the structure of the response
r = requests.get(url)
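
The response is a list of lists: the first row holds the column names and every later row holds string values. The sketch below works on a hand-copied excerpt of three states rather than a live response, so the names sample, records, and subtotal are illustrative only.

```python
# A hand-copied excerpt shaped like r.json(): header row first, then data rows
sample = [
    ["P0010001", "NAME", "state"],
    ["4779736", "Alabama", "01"],
    ["710231", "Alaska", "02"],
    ["6392017", "Arizona", "04"],
]

header, rows = sample[0], sample[1:]

# Pair each value with its column name so fields can be read by name
records = [dict(zip(header, row)) for row in rows]

# Populations arrive as strings, so convert before summing
subtotal = sum(int(rec["P0010001"]) for rec in records)
```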

Total Population


In [11]:
# FILL IN
# drop the header record
from itertools import islice
# total population including PR is 312471327

In [14]:
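state_data = r.json()
state_data = state_data[1:]  # drop the header row
# print state_data[0]
# each row is [P0010001, NAME, state]; the population arrives as a string
total_pop_US_withPR = sum([int(i[0]) for i in state_data])
print str(total_pop_US_withPR) + " with PR"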


312471327 with PR

In [15]:
# FILL IN
# exclude PR:  308745538
total_pop_US = sum([int(i[0]) for i in state_data if i[1] != "Puerto Rico"])
print str(total_pop_US) + " without PR"


308745538 without PR

In [16]:
# let's now create a DataFrame from r.json()

df = DataFrame(r.json()[1:], columns=r.json()[0])
df.head()


Out[16]:
P0010001 NAME state
0 4779736 Alabama 01
1 710231 Alaska 02
2 6392017 Arizona 04
3 2915918 Arkansas 05
4 37253956 California 06

5 rows × 3 columns


In [38]:
# FILL IN
# calculate the total population using df
# convert the string column to integers, then sum it
df['P0010001'].astype(int).sum()


Out[38]:
312471327

In [39]:
# FILL IN -- now calculate the total population excluding Puerto Rico
# filter out the Puerto Rico row before converting and summing
df[df['NAME'] != 'Puerto Rico']['P0010001'].astype(int).sum()


Out[39]:
308745538

Focusing on SF1 + 2010 Census

How do we map out the geographical hierarchy and pull out total population figures?

  1. Nation
  2. Regions
  3. Divisions
  4. State
  5. County
  6. Census Tract
  7. Block Group
  8. Census Block

Questions

  • What identifiers are used for these various geographic entities?
  • Can we get an enumeration of each of these entities?
  • How to figure out which census tract, block group, census block one is in?
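
As a partial hint for the identifier question: as I understand the 2010 conventions, a census block is identified by a 15-digit GEOID that nests the state (2 digits), county (3), tract (6), and block (4), with the block group given by the first digit of the block code. The sketch below assumes those widths. The helper name split_block_geoid is hypothetical, and the example GEOID is made up: state 06 and county 001 match California and Alameda County as they appear in this notebook, while the tract and block digits are invented.

```python
def split_block_geoid(geoid):
    """Split a 15-digit census block GEOID into its nested parts (hypothetical helper)."""
    if len(geoid) != 15 or not geoid.isdigit():
        raise ValueError("expected a 15-digit block GEOID, got %r" % (geoid,))
    return {
        "state": geoid[0:2],          # 2-digit state FIPS code
        "county": geoid[2:5],         # 3-digit county code within the state
        "tract": geoid[5:11],         # 6-digit tract code within the county
        "block_group": geoid[11:12],  # first digit of the block code
        "block": geoid[11:15],        # 4-digit block code within the tract
    }


# Illustrative GEOID only; the tract and block digits are invented
parts = split_block_geoid("060014001001000")
```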

Total Population of California

2010 Census Summary File 1

P0010001, listed in the 2010 SF1 API Variables [XML], is the "total population" variable.


In [40]:
from settings import CENSUS_KEY
import census

# create a client with the API key, then ask SF1 for California's name and total population
c = census.Census(CENSUS_KEY)
c.sf1.get(('NAME', 'P0010001'), {'for': 'state:%s' % states.CA.fips})


Out[40]:
[{u'NAME': u'California', u'P0010001': u'37253956', u'state': u'06'}]

In [41]:
"population of California: {0}".format(
        int(c.sf1.get(('NAME', 'P0010001'), {'for': 'state:%s' % states.CA.fips})[0]['P0010001']))


Out[41]:
'population of California: 37253956'

Let's try to get at the counties of California and their populations


In [42]:
ca_counties = c.sf1.get(('NAME', 'P0010001'), geo={'for': 'county:*', 'in': 'state:%s' % states.CA.fips})

In [43]:
# create a DataFrame, convert the 'P0010001' column
# show by descending population
df = DataFrame(ca_counties)
df['P0010001'] = df['P0010001'].astype('int')
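# sort_index(by=...) matches pandas 0.13; in pandas 0.17 and later use
# df.sort_values('P0010001', ascending=False) instead
df.sort_index(by='P0010001', ascending=False)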


Out[43]:
NAME P0010001 county state
18 Los Angeles County 9818605 037 06
36 San Diego County 3095313 073 06
29 Orange County 3010232 059 06
32 Riverside County 2189641 065 06
35 San Bernardino County 2035210 071 06
42 Santa Clara County 1781642 085 06
0 Alameda County 1510271 001 06
33 Sacramento County 1418788 067 06
6 Contra Costa County 1049025 013 06
9 Fresno County 930450 019 06
14 Kern County 839631 029 06
55 Ventura County 823318 111 06
37 San Francisco County 805235 075 06
40 San Mateo County 718451 081 06
38 San Joaquin County 685306 077 06
49 Stanislaus County 514453 099 06
48 Sonoma County 483878 097 06
53 Tulare County 442179 107 06
41 Santa Barbara County 423895 083 06
26 Monterey County 415057 053 06
47 Solano County 413344 095 06
30 Placer County 348432 061 06
39 San Luis Obispo County 269637 079 06
43 Santa Cruz County 262382 087 06
23 Merced County 255793 047 06
20 Marin County 252409 041 06
3 Butte County 220000 007 06
56 Yolo County 200849 113 06
8 El Dorado County 181058 017 06
44 Shasta County 177223 089 06
12 Imperial County 174528 025 06
15 Kings County 152982 031 06
19 Madera County 150865 039 06
27 Napa County 136484 055 06
11 Humboldt County 134623 023 06
28 Nevada County 98764 057 06
50 Sutter County 94737 101 06
22 Mendocino County 87841 045 06
57 Yuba County 72155 115 06
16 Lake County 64665 033 06
51 Tehama County 63463 103 06
54 Tuolumne County 55365 109 06
34 San Benito County 55269 069 06
4 Calaveras County 45578 009 06
46 Siskiyou County 44900 093 06
2 Amador County 38091 005 06
17 Lassen County 34895 035 06
7 Del Norte County 28610 015 06
10 Glenn County 28122 021 06
5 Colusa County 21419 011 06
31 Plumas County 20007 063 06
13 Inyo County 18546 027 06
21 Mariposa County 18251 043 06
25 Mono County 14202 051 06
52 Trinity County 13786 105 06
24 Modoc County 9686 049 06
45 Sierra County 3240 091 06
1 Alpine County 1175 003 06

58 rows × 4 columns


In [44]:
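# http://stackoverflow.com/a/13130357/7782
# np.histogram returns the bin counts and the bin edges (10 equal-width bins by default)
count, division = np.histogram(df['P0010001'])
# reuse the same bin edges when plotting the histogram
df['P0010001'].hist(bins=division)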


Out[44]:
<matplotlib.axes.AxesSubplot at 0x109a4ce50>
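
To see what the counts and bin edges mean, here is a pure-Python sketch of equal-width binning. It approximates the default behaviour of np.histogram (10 equal-width bins, with the maximum value counted in the last bin) and is not numpy's implementation. The function name equal_width_histogram is hypothetical, and the five populations are copied from the county table above.

```python
def equal_width_histogram(values, bins=10):
    """Count values in equal-width bins between min and max (hypothetical helper)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("need at least two distinct values")
    width = (hi - lo) / float(bins)
    # bins + 1 edges; the last edge is exactly the maximum
    edges = [lo + i * width for i in range(bins)] + [hi]
    counts = [0] * bins
    for v in values:
        # the maximum would index one past the end, so clamp it into the last bin
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts, edges


# Los Angeles, San Diego, Alameda, San Francisco, Alpine (from the table above)
populations = [9818605, 3095313, 1510271, 805235, 1175]
counts, edges = equal_width_histogram(populations)
```

With only a handful of very large counties, most of the counts land in the first few bins, which is the same skew the plotted histogram shows.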
