In [3]:
# our usual pylab import
%pylab --no-import-all inline


Populating the interactive namespace from numpy and matplotlib

Goal

For background, see Mapping Census Data, including the scan of the 10-question form. Keep in mind what people were asked and the range of data available in the census.

Using the census API to get an understanding of some of the geographic entities in the 2010 census. We'll specifically be using the variable P0010001, the total population.

What you will do in this notebook:

  • Sum the population of the states (or state-like entity like DC) to get the total population of the nation
  • Add up the counties for each state and validate the sums
  • Add up the census tracts for each county and validate the sums

We will make use of pandas in this notebook.

I often have the following diagram in mind to help understand the relationship among entities. Also use the list of example URLs -- it'll come in handy.

Working out the geographical hierarchy for Cafe Milano

It's helpful to have a concrete instance of a place to work with, especially when dealing with rather intangible entities like census tracts, block groups, and blocks. You can use the American FactFinder site to look up for any given US address the corresponding census geographies.

Let's use Cafe Milano in Berkeley as an example. You can verify the following results by typing in the address into http://factfinder2.census.gov/faces/nav/jsf/pages/searchresults.xhtml?refresh=t.

https://www.evernote.com/shard/s1/sh/dc0bfb96-4965-4fbf-bc28-c9d4d0080782/2bd8c92a045d62521723347d62fa2b9d

2522 Bancroft Way, BERKELEY, CA, 94704

  • State: California
  • County: Alameda County
  • County Subdivision: Berkeley CCD, Alameda County, California
  • Census Tract: Census Tract 4228, Alameda County, California
  • Block Group: Block Group 1, Census Tract 4228, Alameda County, California
  • Block: Block 1001, Block Group 1, Census Tract 4228, Alameda County, California

In [4]:
# YouTube video I made on how to use the American Factfinder site to look up addresses
from IPython.display import YouTubeVideo
YouTubeVideo('HeXcliUx96Y')


Out[4]:

In [5]:
#  standard numpy, pandas, matplotlib imports

import numpy as np
import matplotlib.pyplot as plt
from pandas import DataFrame, Series, Index
import pandas as pd

In [6]:
# check that CENSUS_KEY is defined
import census
import us

import requests

import settings
assert settings.CENSUS_KEY is not None

The census documentation has example URLs but needs your API key to work. In this notebook, we'll use the IPython notebook HTML display mechanism to help out.


In [7]:
c = census.Census(key=settings.CENSUS_KEY)

Note: we can use c.sf1 to access 2010 census (SF1: Census Summary File 1 (2010, 2000, 1990) available in API -- 2010 is the default)

see documentation: sunlightlabs/census

Summing up populations by state

Let's make a DataFrame named states_df with columns NAME, P0010001 (for population), and state (to hold the FIPS code). Make sure to exclude Puerto Rico.


In [27]:
# call the API and instantiate `df`
df = DataFrame(c.sf1.get('NAME,P0010001', geo={'for':'state:*'}))
type(df['P0010001'])


Out[27]:
pandas.core.series.Series

In [28]:
# convert the population to integer
df['P0010001'] = df['P0010001'].astype(np.int)
df.head()


Out[28]:
NAME P0010001 state
0 Alabama 4779736 01
1 Alaska 710231 02
2 Arizona 6392017 04
3 Arkansas 2915918 05
4 California 37253956 06

In [26]:
type(df['P0010001'])


Out[26]:
pandas.core.series.Series

In [24]:
len(df) #expecting 51, got 52
#df.NAME
#df['NAME'] =='California' #for each elem in the series, asks if it is "California
sates_df = df[df['NAME'] !='Puerito Rico']


Out[24]:
0     False
1     False
2     False
3     False
4      True
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22    False
23    False
24    False
25    False
26    False
27    False
28    False
29    False
30    False
31    False
32    False
33    False
34    False
35    False
36    False
37    False
38    False
39    False
40    False
41    False
42    False
43    False
44    False
45    False
46    False
47    False
48    False
49    False
50    False
51    False
Name: NAME, Length: 52

You can filter Puerto Rico (PR) in a number of ways -- use the way you're most comfortable with.

Optional fun: filter PR in the following way

  • calculate a np.array holding the the fips of the states
  • then use numpy.in1d, which is a analogous to the in operator to test membership in a list

In [11]:
## FILL IN
## calculate states_fips so that PR not included

If states_df is calculated properly, the following asserts will pass silently.


In [12]:
# check that we have three columns
assert set(states_df.columns) == set((u'NAME', u'P0010001', u'state'))

# check that the total 2010 census population is correct
assert np.sum(states_df.P0010001) == 308745538 

# check that the number of states+DC is 51
assert len(states_df) == 51


---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-12-4f45f2a4240a> in <module>()
      1 # check that we have three columns
----> 2 assert set(states_df.columns) == set((u'NAME', u'P0010001', u'state'))
      3 
      4 # check that the total 2010 census population is correct
      5 assert np.sum(states_df.P0010001) == 308745538

NameError: name 'states_df' is not defined

Counties

Looking at http://api.census.gov/data/2010/sf1/geo.html, we see

state-county: http://api.census.gov/data/2010/sf1?get=P0010001&for=county:*

if we want to grab all counties in one go, or you can grab counties state-by-state:

http://api.census.gov/data/2010/sf1?get=P0010001&for=county:*&in=state:06

for all counties in the state with FIPS code 06 (which is what state?)


In [ ]:
# Here's a way to use translate 
# http://api.census.gov/data/2010/sf1?get=P0010001&for=county:*
# into a call using the census.Census object

r = c.sf1.get('NAME,P0010001', geo={'for':'county:*'})

# ask yourself what len(r) means and what it should be
len(r)

In [ ]:
# let's try out one of the `census` object convenience methods
# instead of using `c.sf1.get`

r = c.sf1.state_county('NAME,P0010001',census.ALL,census.ALL)
r

In [ ]:
# convert the json from the API into a DataFrame
# coerce to integer the P0010001 column

df = DataFrame(r)
df['P0010001'] = df['P0010001'].astype('int')

# display the first records
df.head()

In [ ]:
# calculate the total population 
# what happens when you google the number you get?

np.sum(df['P0010001'])

In [ ]:
# often you can use dot notation to access a DataFrame column
df.P0010001.head()

In [ ]:
## FILL IN
## compute counties_df
## counties_df should have same columns as df
## filter out PR -- what's the total population now

Check properties of counties_df


In [ ]:
# number of counties
assert len(counties_df) == 3143 #3143 county/county-equivs in US

In [ ]:
# check that the total population by adding all counties == population by adding all states

assert np.sum(counties_df['P0010001']) == np.sum(states_df.P0010001)

In [ ]:
# check we have same columns between counties_df and df
set(counties_df.columns) == set(df.columns)

Using FIPS code as the Index

From Mapping Census Data:

  • Each state (SUMLEV = 040) has a 2-digit FIPS ID; Delaware's is 10.
  • Each county (SUMLEV = 050) within a state has a 3-digit FIPS ID, appended to the 2-digit state ID. New Castle County, Delaware, has FIPS ID 10003.
  • Each Census Tract (SUMLEV = 140) within a county has a 6-digit ID, appended to the county code. The Tract in New Castle County DE that contains most of the the UD campus has FIPS ID 10003014502.
  • Each Block Group (SUMLEV = 150) within a Tract has a single digit ID appended to the Tract ID. The center of campus in the northwest corner of the tract is Block Group100030145022.
  • Each Block (SUMLEV = 750) within a Block Group is identified by three more digits appended to the Block Group ID. Pearson Hall is located in Block 100030145022009.

In [ ]:
# take a look at the current structure of counties_df

counties_df.head()

If you add all the counties on a state-by-state basisi, do you get the same populations for each state?

  • use set_index to make the FIPS code for the state the index for states_df
  • calculate the FIPS code for the counties and make the county FIPS code the index of of county_fips
  • use groupby on counties_df to compare the populations of states with that in states_df

In [ ]:
## FILL IN

Counties in California

Let's look at home: California state and Alameda County


In [ ]:
# boolean indexing to pull up California
states_df[states_df.NAME == 'California']

In [ ]:
# use .ix -- most general indexing 
# http://pandas.pydata.org/pandas-docs/dev/indexing.html#different-choices-for-indexing-loc-iloc-and-ix
states_df.ix['06']

In [ ]:
# California counties

counties_df[counties_df.state=='06']

In [ ]:
counties_df[counties_df.NAME == 'Alameda County']

In [ ]:
counties_df[counties_df.NAME == 'Alameda County']['P0010001']

Different ways to read off the population of Alameda County -- still looking for the best way


In [ ]:
counties_df[counties_df.NAME == 'Alameda County']['P0010001'].to_dict().values()[0]

In [ ]:
list(counties_df[counties_df.NAME == 'Alameda County']['P0010001'].iteritems())[0][1]

In [ ]:
int(counties_df[counties_df.NAME == 'Alameda County']['P0010001'].values)

If you know the FIPS code for Alameda County, just read off the population using .ix


In [ ]:
# this is like accessing a cell in a spreadsheet -- row, col

ALAMEDA_COUNTY_FIPS = '06001'

counties_df.ix[ALAMEDA_COUNTY_FIPS,'P0010001']

Reading off all the tracts in Alameda County


In [ ]:
counties_df.ix[ALAMEDA_COUNTY_FIPS,'county']

In [ ]:
## FILL IN 
## generate a DataFrame named alameda_county_tracts_df by
## calling the census api and the state-county-tract technique

## how many census tracts in Alameda County?
## if you add up the population, what do you get?
## generate the FIPS code for each tract

In [ ]:
# confirm that you can find the census tract in which Cafe Milano is located
# Cafe Milano is in tract 4228

MILANO_TRACT_ID = '422800'
alameda_county_tracts_df[alameda_county_tracts_df.tract==MILANO_TRACT_ID]

Using Generators to yield all the tracts in the country


In [13]:
## FILL IN
## try to reproduce the generator I show in class for all the census tracts
## start to think about how to do this for other geographical entities

import time
import us

from itertools import islice

def census_tracts(variable=('NAME','P0010001'), sleep_time=1.0):
    
    for state in us.states.STATES:
        print state
        for tract in c.sf1.get(variable, 
                    geo={'for':"tract:*", 
                        'in':'state:{state_fips}'.format(state_fips=state.fips)
                        }):
            yield tract
        # don't hit the API more than once a second    
        time.sleep(sleep_time)
 
# limit the number of tracts we crawl for until we're reading to get all of them        
tracts_df = DataFrame(list(islice(census_tracts(), 100)))
tracts_df['P0010001'] = tracts_df['P0010001'].astype('int')


Alabama

In [14]:
tracts_df.head()


Out[14]:
NAME P0010001 county state tract
0 Census Tract 201 1912 001 01 020100
1 Census Tract 202 2170 001 01 020200
2 Census Tract 203 3373 001 01 020300
3 Census Tract 204 4386 001 01 020400
4 Census Tract 205 10766 001 01 020500

In [15]:
## EXERCISE for next time
## write a generator all census places

In [ ]:
#add county

Compare with Tabulations