In [3]:
# our usual pylab import
%pylab --no-import-all inline
For background, see Mapping Census Data, including the scan of the 10-question form. Keep in mind what people were asked and the range of data available in the census.
We'll use the census API to get an understanding of some of the geographic entities in the 2010 census. We'll specifically be using the variable P0010001, the total population.
What you will do in this notebook:
We will make use of pandas in this notebook.
I often have the following diagram in mind to help understand the relationship among entities. Also use the list of example URLs -- it'll come in handy.
It's helpful to have a concrete instance of a place to work with, especially when dealing with rather intangible entities like census tracts, block groups, and blocks. You can use the American FactFinder site to look up the census geographies corresponding to any given US address.
Let's use Cafe Milano in Berkeley as an example. You can verify the following results by typing the address into http://factfinder2.census.gov/faces/nav/jsf/pages/searchresults.xhtml?refresh=t.
2522 Bancroft Way, BERKELEY, CA, 94704
In [4]:
# YouTube video I made on how to use the American Factfinder site to look up addresses
from IPython.display import YouTubeVideo
YouTubeVideo('HeXcliUx96Y')
Out[4]:
In [5]:
# standard numpy, pandas, matplotlib imports
import numpy as np
import matplotlib.pyplot as plt
from pandas import DataFrame, Series, Index
import pandas as pd
In [6]:
# check that CENSUS_KEY is defined
import census
import us
import requests
import settings
assert settings.CENSUS_KEY is not None
The census documentation has example URLs but needs your API key to work. In this notebook, we'll use the IPython notebook HTML display mechanism to help out.
In [7]:
c = census.Census(key=settings.CENSUS_KEY)
Note: we can use c.sf1 to access the 2010 census (SF1: Census Summary File 1; the API offers 2010, 2000, and 1990 -- 2010 is the default).
see documentation: sunlightlabs/census
Let's make a DataFrame named states_df with columns NAME, P0010001 (for population), and state (to hold the FIPS code). Make sure to exclude Puerto Rico.
In [27]:
# call the API and instantiate `df`
df = DataFrame(c.sf1.get('NAME,P0010001', geo={'for':'state:*'}))
type(df['P0010001'])
Out[27]:
In [28]:
# convert the population to integer
df['P0010001'] = df['P0010001'].astype(int)
df.head()
Out[28]:
In [26]:
type(df['P0010001'])
Out[26]:
In [24]:
len(df)  # expecting 51, got 52 -- Puerto Rico is included
#df.NAME
#df['NAME'] == 'California'  # for each elem in the series, asks whether it is "California"
states_df = df[df['NAME'] != 'Puerto Rico']
Out[24]:
You can filter Puerto Rico (PR) in a number of ways -- use the way you're most comfortable with.
Optional fun: filter out PR in the following way: build an np.array holding the FIPS codes of the states.
In [11]:
## FILL IN
## calculate states_fips so that PR not included
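One possible approach, sketched with a toy stand-in for the API response (the real `df` comes from `c.sf1.get` and needs an API key, and the real `states_fips` array would hold all 51 codes for the states plus DC):

```python
import numpy as np
from pandas import DataFrame

# toy stand-in for the API result -- the real df comes from c.sf1.get
df = DataFrame({'NAME': ['California', 'Texas', 'Puerto Rico'],
                'P0010001': [37253956, 25145561, 3725789],
                'state': ['06', '48', '72']})

# np.array holding the FIPS codes of the states (a subset here; the real one has 51 entries)
states_fips = np.array(['06', '48'])

# keep only rows whose state FIPS code appears in states_fips
states_df = df[df.state.isin(states_fips)]
```

Since Puerto Rico's FIPS code (72) is not in the array, its row is dropped.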
If states_df is calculated properly, the following asserts will pass silently.
In [12]:
# check that we have three columns
assert set(states_df.columns) == set((u'NAME', u'P0010001', u'state'))
# check that the total 2010 census population is correct
assert np.sum(states_df.P0010001) == 308745538
# check that the number of states+DC is 51
assert len(states_df) == 51
Looking at http://api.census.gov/data/2010/sf1/geo.html, we see
state-county: http://api.census.gov/data/2010/sf1?get=P0010001&for=county:*
if we want to grab all counties in one go. Alternatively, you can grab counties state-by-state:
http://api.census.gov/data/2010/sf1?get=P0010001&for=county:*&in=state:06
for all counties in the state with FIPS code 06
(which is what state?)
In [ ]:
# Here's a way to use translate
# http://api.census.gov/data/2010/sf1?get=P0010001&for=county:*
# into a call using the census.Census object
r = c.sf1.get('NAME,P0010001', geo={'for':'county:*'})
# ask yourself what len(r) means and what it should be
len(r)
In [ ]:
# let's try out one of the `census` object convenience methods
# instead of using `c.sf1.get`
r = c.sf1.state_county('NAME,P0010001',census.ALL,census.ALL)
r
In [ ]:
# convert the json from the API into a DataFrame
# coerce to integer the P0010001 column
df = DataFrame(r)
df['P0010001'] = df['P0010001'].astype('int')
# display the first records
df.head()
In [ ]:
# calculate the total population
# what happens when you google the number you get?
np.sum(df['P0010001'])
In [ ]:
# often you can use dot notation to access a DataFrame column
df.P0010001.head()
In [ ]:
## FILL IN
## compute counties_df
## counties_df should have same columns as df
## filter out PR -- what's the total population now
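A sketch of one way to do the filtering, again with made-up stand-in rows rather than the full API response (Puerto Rico's municipios carry state FIPS code 72):

```python
from pandas import DataFrame

# toy stand-in for the state_county API response
df = DataFrame({'NAME': ['Alameda County', 'Los Angeles County', 'Adjuntas Municipio'],
                'P0010001': [1510271, 9818605, 19483],
                'state': ['06', '06', '72'],
                'county': ['001', '037', '001']})

# drop Puerto Rico (state FIPS 72); the columns stay the same as df's
counties_df = df[df.state != '72']
```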
Check properties of counties_df
In [ ]:
# number of counties
assert len(counties_df) == 3143 #3143 county/county-equivs in US
In [ ]:
# check that the total population by adding all counties == population by adding all states
assert np.sum(counties_df['P0010001']) == np.sum(states_df.P0010001)
In [ ]:
# check we have same columns between counties_df and df
set(counties_df.columns) == set(df.columns)
From Mapping Census Data:
In [ ]:
# take a look at the current structure of counties_df
counties_df.head()
If you add up all the counties on a state-by-state basis, do you get the same population for each state?
Use set_index to make the FIPS code for the state the index for states_df. Then compute a county_fips column in counties_df, group the counties by state, and compare the summed populations with those in states_df.
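Here is a sketch of that comparison with toy numbers (the populations below are made up to keep the example small):

```python
from pandas import DataFrame

# made-up state total, indexed by the state FIPS code
states_df = DataFrame({'NAME': ['California'],
                       'P0010001': [100],
                       'state': ['06']}).set_index('state')

# made-up county rows for the same state
counties_df = DataFrame({'P0010001': [60, 40],
                         'state': ['06', '06'],
                         'county': ['001', '037']})

# a county's full FIPS code is the 2-digit state code + 3-digit county code
counties_df['county_fips'] = counties_df.state + counties_df.county

# sum the county populations state-by-state and compare with states_df
state_totals = counties_df.groupby('state')['P0010001'].sum()
print((state_totals == states_df['P0010001']).all())
```

The comparison works because both `state_totals` and `states_df['P0010001']` are indexed by the state FIPS code, so pandas aligns them before comparing.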
In [ ]:
## FILL IN
Let's look at home: California state and Alameda County
In [ ]:
# boolean indexing to pull up California
states_df[states_df.NAME == 'California']
In [ ]:
# use .ix -- most general indexing
# http://pandas.pydata.org/pandas-docs/dev/indexing.html#different-choices-for-indexing-loc-iloc-and-ix
states_df.ix['06']
In [ ]:
# California counties
counties_df[counties_df.state=='06']
In [ ]:
counties_df[counties_df.NAME == 'Alameda County']
In [ ]:
counties_df[counties_df.NAME == 'Alameda County']['P0010001']
Different ways to read off the population of Alameda County -- still looking for the best way
In [ ]:
counties_df[counties_df.NAME == 'Alameda County']['P0010001'].to_dict().values()[0]
In [ ]:
list(counties_df[counties_df.NAME == 'Alameda County']['P0010001'].iteritems())[0][1]
In [ ]:
int(counties_df[counties_df.NAME == 'Alameda County']['P0010001'].values)
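One more option, offered as a matter of taste rather than *the* best way: positional indexing with .iloc avoids the dict/iteritems round-trips (toy one-row frame below):

```python
from pandas import DataFrame

# toy one-row stand-in for counties_df
counties_df = DataFrame({'NAME': ['Alameda County'],
                         'P0010001': [1510271]})

# filter to the row, take the column, then grab the first value positionally
pop = counties_df[counties_df.NAME == 'Alameda County']['P0010001'].iloc[0]
```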
If you know the FIPS code for Alameda County, just read off the population using .ix
In [ ]:
# this is like accessing a cell in a spreadsheet -- row, col
ALAMEDA_COUNTY_FIPS = '06001'
counties_df.ix[ALAMEDA_COUNTY_FIPS,'P0010001']
In [ ]:
counties_df.ix[ALAMEDA_COUNTY_FIPS,'county']
In [ ]:
## FILL IN
## generate a DataFrame named alameda_county_tracts_df by
## calling the census api and the state-county-tract technique
## how many census tracts in Alameda County?
## if you add up the population, what do you get?
## generate the FIPS code for each tract
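As a hint for the last step: a tract's full FIPS code concatenates the 2-digit state, 3-digit county, and 6-digit tract codes. A sketch with a single made-up row (the real rows come from the API call):

```python
from pandas import DataFrame

# toy stand-in for one row of alameda_county_tracts_df
alameda_county_tracts_df = DataFrame({'state': ['06'],
                                      'county': ['001'],
                                      'tract': ['422800']})

# FIPS code = state (2 digits) + county (3 digits) + tract (6 digits)
alameda_county_tracts_df['FIPS'] = (alameda_county_tracts_df.state +
                                    alameda_county_tracts_df.county +
                                    alameda_county_tracts_df.tract)
```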
In [ ]:
# confirm that you can find the census tract in which Cafe Milano is located
# Cafe Milano is in tract 4228
MILANO_TRACT_ID = '422800'
alameda_county_tracts_df[alameda_county_tracts_df.tract==MILANO_TRACT_ID]
In [13]:
## FILL IN
## try to reproduce the generator I show in class for all the census tracts
## start to think about how to do this for other geographical entities
import time
import us
from itertools import islice
def census_tracts(variable=('NAME','P0010001'), sleep_time=1.0):
for state in us.states.STATES:
print(state)
for tract in c.sf1.get(variable,
geo={'for':"tract:*",
'in':'state:{state_fips}'.format(state_fips=state.fips)
}):
yield tract
# don't hit the API more than once a second
time.sleep(sleep_time)
# limit the number of tracts we crawl until we're ready to get all of them
tracts_df = DataFrame(list(islice(census_tracts(), 100)))
tracts_df['P0010001'] = tracts_df['P0010001'].astype('int')
In [14]:
tracts_df.head()
Out[14]:
In [15]:
## EXERCISE for next time
## write a generator for all census places
In [ ]:
#add county
We can compare the total number of tracts we calculate to:
https://www.census.gov/geo/maps-data/data/tallies/tractblock.html
and
https://www.census.gov/geo/maps-data/data/docs/geo_tallies/Tract_Block2010.txt