In [ ]:
%matplotlib inline

Some Context

The US Census is complex, so it's good, even essential, to have a framing question to guide your explorations so that you don't get distracted or lost.

I started thinking about the census in 2002, when I saw a woman I knew in the following SF Chronicle article:

Claremont-Elmwood / Homogeneity in Berkeley? Well, yeah - SFGate

I thought at that point that it should be easy for regular people to do census calculations.

In the summer of 2013, I wrote the following note to Greg Wilson about diversity calculations:

notes for Greg Wilson about an example Data Science Workflow

There's a whole cottage industry in musing on "diversity" in the USA:

and let's not forget the Racial Dot Map and some background.


In [ ]:
# import NumPy, pandas, and some commonly used pandas classes
import numpy as np
import pandas as pd
from pandas import Series, DataFrame, Index

http://www.census.gov/developers/

Dependency: use a special version of the census module:

In BCE, go to the shell and type

. activate py34

to use the Python 3.4 environment. Then:

pip install -e git+https://github.com/rdhyee/census.git#egg=census

Things we'd like to be able to do:

  • calculate the population of California.
  • then calculate the population of every geographic entity going down to census block if possible.
  • for a given geographic unit, can we get the racial/ethnic breakdown?

Figuring out the Census Data is a Big Jigsaw Puzzle

Some starting points:

We focus first on the API, and I hope we can come back later to processing the bulk data from the Census FTP site.

Prerequisites: Getting and activating key

"Your request for a new API key has been successfully submitted. Please check your email. In a few minutes you should receive a message with instructions on how to activate your new key."

Then create a settings.py file in the same directory as this notebook (or somewhere else on your Python path) that defines CENSUS_KEY, so that the notebook can read it as settings.CENSUS_KEY.
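
For concreteness, here is a minimal sketch of what settings.py might contain. The environment-variable fallback and the placeholder string are my assumptions, not something the Census site prescribes; the only thing the notebook needs is a module-level string named CENSUS_KEY.

```python
# settings.py (a sketch; only the name CENSUS_KEY is required by this notebook)
import os

# Assumption: read the key from an environment variable named CENSUS_KEY if it
# is set, otherwise fall back to a placeholder that you replace by hand.
CENSUS_KEY = os.environ.get("CENSUS_KEY", "PASTE-YOUR-ACTIVATED-KEY-HERE")
```

Keeping the key in its own module (and out of version control) means the notebook itself can be shared without exposing it.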


In [ ]:
import settings

In [ ]:
# This cell should run successfully if settings.CENSUS_KEY is set to a string

try:
    import settings
    # Python 3 has no separate unicode type, so checking for str is enough
    assert isinstance(settings.CENSUS_KEY, str)
except Exception as e:
    print("error in importing settings to get at settings.CENSUS_KEY", e)

states module


In [ ]:
# let's figure out a bit about the us module, in particular, us.states
# https://github.com/unitedstates/python-us

from us import states

for (i, state) in enumerate(states.STATES):
    print (i, state.name, state.fips)

Questions to ponder: How many states are in the list? Is DC included in the states list? How do you access the territories?

Formulating URL requests by hand

It's immensely useful to be able to access the census API directly by creating a URL with the proper parameters, as well as by using the census package.
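
As a rough sketch of what "the proper parameters" look like, the helper below assembles a query URL from its parts using only the standard library. The function name build_sf1_url and the "YOUR_KEY" placeholder are hypothetical; the base URL and the key/get/for parameters are the ones used in this notebook, and the optional "in" parameter is the one that restricts a query to a parent geography.

```python
from urllib.parse import urlencode

# Same endpoint as the hand-written URL in this notebook (2010 Census, SF1).
BASE_URL = "http://api.census.gov/data/2010/sf1"


def build_sf1_url(key, variables, geo_for, geo_in=None):
    """Return a 2010 SF1 query URL (hypothetical helper, not part of any package)."""
    params = [("key", key), ("get", ",".join(variables)), ("for", geo_for)]
    if geo_in is not None:
        params.append(("in", geo_in))
    # Keep ',', ':' and '*' readable instead of percent-encoding them.
    return BASE_URL + "?" + urlencode(params, safe=",:*")


states_url = build_sf1_url("YOUR_KEY", ["P0010001", "NAME"], "state:*")
counties_url = build_sf1_url("YOUR_KEY", ["NAME", "P0010001"], "county:*", "state:06")
print(states_url)
print(counties_url)
```

Building the URL from a list of parameters makes it easier to see which piece selects the variables (get) and which pieces select the geography (for, in).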


In [ ]:
import requests

In [ ]:
# get the total population of all states
url = "http://api.census.gov/data/2010/sf1?key={key}&get=P0010001,NAME&for=state:*".format(key=settings.CENSUS_KEY)

In [ ]:
# note the structure of the response
r = requests.get(url)
print(r.content.decode('utf-8'))
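
The response is a JSON list of lists: the first row is a header and every later row describes one geographic entity, with all values returned as strings. The sketch below works on a hand-typed three-row subset rather than a live response, so it runs without a key; the column order is what I would expect for this query, so treat it as an assumption and check it against what you actually get back.

```python
from itertools import islice

# Hand-typed subset shaped like the API response: a header row followed by
# one row per state. Population values arrive as strings, not integers.
sample = [
    ["P0010001", "NAME", "state"],
    ["37253956", "California", "06"],
    ["563626", "Wyoming", "56"],
    ["3725789", "Puerto Rico", "72"],
]

header = sample[0]
# islice(sample, 1, None) skips the header row without copying the list.
records = [dict(zip(header, row)) for row in islice(sample, 1, None)]

# Convert to int before adding; summing the raw strings would fail.
subset_total = sum(int(rec["P0010001"]) for rec in records)
print(records[0]["NAME"], subset_total)
```

The same pattern applies to the full list returned by r.json().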

Total Population


In [ ]:
# FILL IN
# drop the header record
from itertools import islice
# total population including PR is 312471327

In [ ]:
# FILL IN
# exclude PR:  308745538

In [ ]:
# let's now create a DataFrame from r.json()

df = DataFrame(r.json()[1:], columns=r.json()[0])
df['P0010001'] = df['P0010001'].astype(int)

In [ ]:
# FILL IN
# calculate the total population using df

df.P0010001.sum()

In [ ]:
# FILL IN -- now calculate the total population excluding Puerto Rico

Focusing on SF1 and the 2010 Census

How do we map out the geographical hierarchy and pull out total population figures?

  1. Nation
  2. Regions
  3. Divisions
  4. State
  5. County
  6. Census Tract
  7. Block Group
  8. Census Block

Questions

  • What identifiers are used for these various geographic entities?
  • Can we get an enumeration of each of these entities?
  • How can we figure out which census tract, block group, and census block a given location is in?
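
As a partial answer to the identifier question, the lower levels of the hierarchy are identified by concatenated FIPS codes: 2 digits for the state, 3 for the county, 6 for the census tract, and 4 for the block, with the block group given by the first digit of the block code. The sketch below splits a 15-digit block identifier into those parts. The helper name split_block_geoid and the example string are mine and purely illustrative; regions and divisions use separate codes that are not covered here.

```python
from collections import namedtuple

BlockId = namedtuple("BlockId", "state county tract block_group block")


def split_block_geoid(geoid):
    """Split a 15-digit census block identifier into its FIPS components."""
    if len(geoid) != 15 or not geoid.isdigit():
        raise ValueError("expected a 15-digit block identifier, got %r" % geoid)
    block = geoid[11:15]
    return BlockId(
        state=geoid[0:2],      # e.g. '06' is California
        county=geoid[2:5],     # county code within the state
        tract=geoid[5:11],     # tract code within the county
        block_group=block[0],  # first digit of the block code
        block=block,           # block code within the tract
    )


# Illustrative identifier, typed by hand to show the layout.
parts = split_block_geoid("060014001001000")
print(parts)
```

Because each level is only unique within its parent, a tract or block code is meaningful only together with the state and county codes in front of it.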

Total Population of California

2010 Census Summary File 1

P0010001 is the "total population" variable, as listed in 2010 SF1 API Variables [XML].


In [ ]:
import settings
import census

# create a client for the Census API using the key stored in settings.py
c = census.Census(settings.CENSUS_KEY)
c.sf1.get(('NAME', 'P0010001'), {'for': 'state:%s' % states.CA.fips})

In [ ]:
"population of California: {0}".format(
        int(c.sf1.get(('NAME', 'P0010001'), {'for': 'state:%s' % states.CA.fips})[0]['P0010001']))

Let's try to get at the counties of California and their populations


In [ ]:
ca_counties = c.sf1.get(('NAME', 'P0010001'), geo={'for': 'county:*', 'in': 'state:%s' % states.CA.fips})
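
The geo argument mirrors the API's "for" and "in" parameters: "for" names the level you want rows for, and "in" names the parents that contain it. The sketch below builds such dictionaries for deeper levels of the hierarchy. The helper make_geo is hypothetical, and joining multiple parents with a space (for example 'state:06 county:001') is my understanding of the form the census package expects, so treat that as an assumption to verify.

```python
def make_geo(target, within=()):
    """Build a geo dict such as {'for': 'county:*', 'in': 'state:06'}.

    target is a (level, code) pair; within is a sequence of (level, code)
    pairs for the enclosing geographies, outermost first.
    """
    geo = {"for": "%s:%s" % target}
    if within:
        # Assumption: multiple parents are separated by a single space.
        geo["in"] = " ".join("%s:%s" % pair for pair in within)
    return geo


county_geo = make_geo(("county", "*"), [("state", "06")])
tract_geo = make_geo(("tract", "*"), [("state", "06"), ("county", "001")])
print(county_geo)
print(tract_geo)
```

With a working key, a dictionary like county_geo could be passed as the geo argument of c.sf1.get in place of the literal dictionary used in the cell above.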

In [ ]:
# create a DataFrame, convert the 'P0010001' column
# show by descending population
df = DataFrame(ca_counties)
df['P0010001'] = df['P0010001'].astype('int')
df.sort_values(by='P0010001', ascending=False)

In [ ]:
# http://stackoverflow.com/a/13130357/7782
count, division = np.histogram(df['P0010001'])
df['P0010001'].hist(bins=division)

In [ ]: