In [1]:
%pylab --no-import-all inline
The US Census is complex, so it's good, even essential, to have a framing question to guide your explorations so that you don't get distracted or lost.
I started thinking about the census in 2002 when I saw a woman I knew in the following SF Chronicle article:
Claremont-Elmwood / Homogeneity in Berkeley? Well, yeah - SFGate
I thought at that point that it should be easy for regular people to do census calculations.
In the summer of 2013, I wrote the following note to Greg Wilson about diversity calculations:
notes for Greg Wilson about an example Data Science Workflow
There's a whole cottage industry in musing on "diversity" in the USA:
The Most Diverse Cities In The US - Business Insider -- using 4 categories: Vallejo.
Most And Least Diverse Cities: Brown University Study Evaluates Diversity In The U.S.
and let's not forget the Racial Dot Map and some background.
In [2]:
# Shows the version of pandas that we are using
!pip show pandas
In [3]:
# import useful classes of pandas
import numpy as np
import pandas as pd
from pandas import Series, DataFrame, Index
http://www.census.gov/developers/
Dependency: to start with -- let's use the Python module: https://pypi.python.org/pypi/census/
pip install -U census
Things we'd like to be able to do:
Some starting points:
We focus first on the API -- and I hope we can come back to processing the bulk data from the Census FTP site.
"Your request for a new API key has been successfully submitted. Please check your email. In a few minutes you should receive a message with instructions on how to activate your new key."
Then create a settings.py in the same directory as this notebook (or somewhere else in your Python path) to hold settings.CENSUS_KEY
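As a minimal sketch, settings.py can be a one-line module. The key value below is a placeholder I made up, not a real key; substitute the one you receive by email.

```python
# settings.py -- hypothetical example; the value below is a placeholder.
# Replace it with your own key and keep this file out of version control.
CENSUS_KEY = "YOUR_CENSUS_API_KEY"
```

Any notebook in the same directory can then `import settings` and read `settings.CENSUS_KEY`.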
In [5]:
import settings
In [6]:
# This cell should run successfully if you have a string set up to represent your census key
try:
    import settings
    # the key should be a string (str, or unicode under Python 2)
    assert isinstance(settings.CENSUS_KEY, (str, type(u"")))
except Exception as e:
    print("error in importing settings to get at settings.CENSUS_KEY: {0}".format(e))
In [7]:
# let's figure out a bit about the us module, in particular, us.states
# https://github.com/unitedstates/python-us
from us import states
for (i, state) in enumerate(states.STATES):
    print("{0} {1} {2}".format(i, state.name, state.fips))
Questions to ponder: How many states are in the list? Is DC included in the states list? How to access the territories?
It's immensely useful to be able to access the census API directly by creating a URL with the proper parameters, as well as through the census package.
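One way to assemble such a URL without hand-formatting the query string is sketched below. The helper name `build_census_url` and the placeholder key are my own inventions for illustration; the endpoint and the `key`, `get`, and `for` parameters are the ones used in this notebook. Note that `urlencode` percent-encodes characters such as `,`, `:` and `*`, which a server should decode back to the same values.

```python
try:
    from urllib.parse import urlencode, urlsplit, parse_qs  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2
    from urlparse import urlsplit, parse_qs

BASE_URL = "http://api.census.gov/data/2010/sf1"


def build_census_url(key, variables, geography):
    """Hypothetical helper: build a query URL for the 2010 SF1 endpoint."""
    params = [
        ("key", key),
        ("get", ",".join(variables)),  # comma-separated variable names
        ("for", geography),            # e.g. "state:*" for all states
    ]
    # urlencode escapes reserved characters in each value
    return BASE_URL + "?" + urlencode(params)


# "YOUR_CENSUS_API_KEY" is a placeholder, not a real key
url = build_census_url("YOUR_CENSUS_API_KEY", ["P0010001", "NAME"], "state:*")

# Decode the query string again to check that the parameters round-trip
query = parse_qs(urlsplit(url).query)
```

In the notebook you would pass `settings.CENSUS_KEY` instead of the placeholder and hand the resulting URL to `requests.get`.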
In [8]:
import requests
In [9]:
# get the total population of all states
url = "http://api.census.gov/data/2010/sf1?key={key}&get=P0010001,NAME&for=state:*".format(key=settings.CENSUS_KEY)
In [10]:
# note the structure of the response
r = requests.get(url)
In [11]:
# FILL IN
# drop the header record
from itertools import islice
# total population including PR is 312471327
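Here is a small sketch of dropping the header record with `islice`. The rows are made-up sample data in the same shape as the API response used here (a header row followed by one row per state); the names and numbers are not census figures.

```python
from itertools import islice

# Made-up rows shaped like r.json(): header first, then one row per state
sample_response = [
    ["P0010001", "NAME", "state"],
    ["100", "Alpha", "01"],
    ["250", "Beta", "02"],
    ["50", "Gamma", "03"],
]

header = sample_response[0]

# islice(rows, 1, None) skips the header row without copying the list
records = [dict(zip(header, row)) for row in islice(sample_response, 1, None)]

# The API returns counts as strings, so convert before summing
total = sum(int(rec["P0010001"]) for rec in records)

# Excluding one row by name works the same way as excluding Puerto Rico
total_without_gamma = sum(
    int(rec["P0010001"]) for rec in records if rec["NAME"] != "Gamma"
)
```

Turning each row into a dict keyed by the header makes the later code less dependent on column order.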
In [14]:
state_data = r.json()
state_data = state_data[1:]  # drop the header record
# print(state_data[0])
total_pop_US_withPR = sum(int(i[0]) for i in state_data)
print(str(total_pop_US_withPR) + " with PR")
In [15]:
# FILL IN
# exclude PR: 308745538
total_pop_US = sum(int(i[0]) for i in state_data if i[1] != "Puerto Rico")
print(str(total_pop_US) + " without PR")
In [16]:
# let's now create a DataFrame from r.json()
df = DataFrame(r.json()[1:], columns=r.json()[0])
df.head()
Out[16]:
In [38]:
# FILL IN
# calculate the total population using df
sum(int(i[0]) for i in df.values)
Out[38]:
In [39]:
# FILL IN -- now calculate the total population excluding Puerto Rico
sum(int(i[0]) for i in df.values if i[1] != "Puerto Rico")
Out[39]:
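The same totals can also be computed column-wise rather than row by row. This is a sketch on a small made-up DataFrame with the same column names as the API response; the values are not census figures.

```python
import pandas as pd

# Made-up sample with the same columns as the API response
df_sample = pd.DataFrame(
    [["100", "Alpha", "01"], ["250", "Beta", "02"], ["50", "Gamma", "03"]],
    columns=["P0010001", "NAME", "state"],
)

# The counts arrive as strings, so convert the column before summing
pop = df_sample["P0010001"].astype(int)
total = pop.sum()

# A boolean mask on NAME drops one row, as you would for Puerto Rico
total_without_gamma = pop[df_sample["NAME"] != "Gamma"].sum()
```

Working on the column directly avoids indexing rows by position, so it keeps working if the column order changes.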
How to map out the geographical hierarchy and pull out total population figures?
Questions
P0010001 is listed in the 2010 SF1 API Variables [XML] as "total population".
In [40]:
from settings import CENSUS_KEY
import census
c = census.Census(CENSUS_KEY)
c.sf1.get(('NAME', 'P0010001'), {'for': 'state:%s' % states.CA.fips})
Out[40]:
In [41]:
"population of California: {0}".format(
int(c.sf1.get(('NAME', 'P0010001'), {'for': 'state:%s' % states.CA.fips})[0]['P0010001']))
Out[41]:
Let's try to get at the counties of California and their populations
In [42]:
ca_counties = c.sf1.get(('NAME', 'P0010001'), geo={'for': 'county:*', 'in': 'state:%s' % states.CA.fips})
In [43]:
# create a DataFrame, convert the 'P0010001' column
# show by descending population
df = DataFrame(ca_counties)
df['P0010001'] = df['P0010001'].astype('int')
df.sort_index(by='P0010001', ascending=False)
Out[43]:
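A note for readers on newer pandas releases: `sort_index(by=...)` was deprecated in favor of `sort_values`, so the sort above may need adjusting depending on the version reported by `pip show pandas`. A minimal sketch on made-up county rows (not census figures):

```python
import pandas as pd

# Made-up rows shaped like the county results: name plus population as a string
counties = pd.DataFrame(
    {
        "NAME": ["A County", "B County", "C County"],
        "P0010001": ["300", "1200", "75"],
    }
)
counties["P0010001"] = counties["P0010001"].astype(int)

# sort_values orders rows by a column; ascending=False puts the largest first
by_population = counties.sort_values(by="P0010001", ascending=False)
```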
In [44]:
#http://stackoverflow.com/a/13130357/7782
count,division = np.histogram(df['P0010001'])
df['P0010001'].hist(bins=division)
Out[44]:
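To see what `np.histogram` hands back before it is passed to `hist`, here is a sketch on a few made-up population values. With the default settings it returns counts for 10 equal-width bins and the 11 bin edges that bound them.

```python
import numpy as np

# Made-up population values, not census figures
populations = np.array([75, 300, 1200, 5000, 9800])

# Default is 10 equal-width bins spanning min..max of the data
count, division = np.histogram(populations)

# count[i] is the number of values between division[i] and division[i + 1]
n_bins = len(count)
n_edges = len(division)
```

Passing `division` as `bins` to `hist` makes the plotted bars use exactly these edges.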