Goal: to practice using generators to yield geographical entities of various types.
Generators are a bit complicated, and I won't try to explain all the intricacies here. I will show you how to use yield in a function definition to return a generator. From the definition of a generator:
A function which returns an iterator. It looks like a normal function except that it contains yield statements for producing a series of values usable in a for-loop or that can be retrieved one at a time with the next() function. Each yield temporarily suspends processing, remembering the location execution state (including local variables and pending try-statements). When the generator resumes, it picks-up where it left-off (in contrast to functions which start fresh on every invocation).
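To make the suspend-and-resume behavior in that definition concrete, here is a minimal sketch. The countdown function and the events list are made up for this illustration; the code is written so that it should run on both Python 2 and Python 3.

```python
# Hypothetical example: record what the generator body does and when.
events = []

def countdown(start):
    events.append("body started")
    current = start
    while current > 0:
        yield current
        # execution picks up here on the following next() call,
        # with the local variable `current` still intact
        events.append("resumed after yielding %d" % current)
        current -= 1

gen = countdown(3)
events_before_first_next = list(events)  # calling countdown() runs none of its body

first = next(gen)     # runs the body up to the first yield
second = next(gen)    # resumes after the first yield, then suspends again
remaining = list(gen) # drains whatever is left
```

Note that nothing inside countdown runs until the first next() call, and that each later call continues from the yield where the generator was suspended.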
For some background on Python generators:
Why use generators: http://stackoverflow.com/a/102632/7782
Generators are good for calculating large sets of results (in particular calculations involving loops themselves) where you don't know if you are going to need all results, or where you don't want to allocate the memory for all results at the same time.
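As a small sketch of that point (the squares generator is a made-up example, not part of the census code), a generator can describe an endless series of results while the caller computes only as many as it ends up needing:

```python
def squares():
    """Yield 0, 1, 4, 9, ... without ever building a list."""
    n = 0
    while True:
        yield n * n
        n += 1

# We don't know in advance how many squares we need, so we just stop
# as soon as we find the first one greater than 50.
first_big = None
for sq in squares():
    if sq > 50:
        first_big = sq
        break
```

Only nine squares are ever computed here, and none of them is kept in memory after it has been inspected.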
Let's also practice using itertools.islice and enumerate, two of my favorite constructions in Python.
From http://api.census.gov/data/2010/sf1/geo.html, the geographic entities we are specifically interested in for this exercise are:
state-county-tract
state-place
In [1]:
# usual imports for numpy, pandas, matplotlib
import numpy as np
import matplotlib.pyplot as plt
from pandas import DataFrame, Series, Index
import pandas as pd
In [2]:
# check that CENSUS_KEY is defined
import census
import us
import settings
assert settings.CENSUS_KEY is not None
In [3]:
# instantiate our Census object
c = census.Census(key=settings.CENSUS_KEY)
In [4]:
import string
print list(string.lowercase)
In [5]:
def abcs():
    """a generator that yields the letters of the alphabet, one at a time"""
    alphabet = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k',
                'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v',
                'w', 'x', 'y', 'z']
    for letter in alphabet:
        yield letter

# a generator that gives you the letters of the alphabet a letter at a time
say_abcs = abcs()
In [6]:
# run this line over and over again to see the letters one at a time
next(say_abcs)
Out[6]:
In [7]:
# you can use list to grab all the items in an iterator. But be careful if the number
# of items is large or even infinite! In this case, we're ok
list(abcs())
Out[7]:
Demonstration of how to use enumerate:
Return an enumerate object. sequence must be a sequence, an iterator, or some other object which supports iteration. The next() method of the iterator returned by enumerate() returns a tuple containing a count (from start which defaults to 0) and the values obtained from iterating over sequence
In [8]:
for (i, letter) in enumerate(abcs()):
    print i, letter
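The quoted documentation also mentions a start value for the count. As a small sketch (first_letters is a made-up stand-in for a generator like abcs), passing 1 as the second argument numbers the items from one instead of zero:

```python
def first_letters():
    """Hypothetical generator yielding just three letters."""
    for letter in "abc":
        yield letter

# enumerate accepts any iterable, including a generator; the second
# argument is the value the count starts from
numbered = list(enumerate(first_letters(), 1))
```

Here numbered should come out as [(1, 'a'), (2, 'b'), (3, 'c')].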
You can use itertools.islice to return parts of the iterator.
Make an iterator that returns selected elements from the iterable. If start is non-zero, then elements from the iterable are skipped until start is reached. Afterward, elements are returned consecutively unless step is set higher than one which results in items being skipped. If stop is None, then iteration continues until the iterator is exhausted, if at all; otherwise, it stops at the specified position. Unlike regular slicing, islice() does not support negative values for start, stop, or step. Can be used to extract related fields from data where the internal structure has been flattened (for example, a multi-line report may list a name field on every third line).
In [9]:
# let's get the first 10 letters of the alphabet
from itertools import islice
list(islice(abcs(), 10))
Out[9]:
In [10]:
# you can use None to get all items in islice
# from docs: "If stop is None, then iteration continues until the iterator is exhausted,"
list(islice(abcs(), None))
Out[10]:
In [11]:
# itertools.count can in principle generate an infinite sequence
# http://www.python.org/doc//current/library/itertools.html#itertools.count
from itertools import count
# count starting from zero
my_counter = count(0)
In [16]:
# try it out
next(my_counter)
Out[16]:
In [17]:
# DON'T do list(count(0)) -> you'll be trying to generate an infinite list
# but use an upper limit
list(islice(count(0),10))
Out[17]:
In [18]:
# start, stop
list(islice(count(),1,3))
Out[18]:
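The cells above use the stop and the start/stop forms of islice. The documentation quoted earlier also describes a step argument, which can be sketched as follows. The report list is invented data standing in for the "name field on every third line" case from the docs.

```python
from itertools import count, islice

# start, stop, step: every third number below 10
every_third = list(islice(count(), 0, 10, 3))

# Hypothetical flattened report: name, city, state repeated for each record
report = ["Alice", "Oakland", "CA",
          "Bob", "Reno", "NV"]

# stop=None means "keep going until the iterable is exhausted"
names = list(islice(report, 0, None, 3))
cities = list(islice(report, 1, None, 3))
```

With this data, every_third is [0, 3, 6, 9], names is ['Alice', 'Bob'] and cities is ['Oakland', 'Reno'].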
In [19]:
# get the syntax down for getting counties from CA -- so that then we can use it later
r = c.sf1.get('NAME,P0010001',
              geo={'for': 'county:*',
                   'in': 'state:{fips}'.format(fips=us.states.CA.fips)})
r[:5]
Out[19]:
With the census API, you can get the counties either with a single call or state by state. The counties generator below takes the first approach, while counties2 takes the second. Although counties is more efficient in most cases I can think of, it is useful to know how to make calls on a state-by-state basis. For example, when we need to query at the census tract level or below, we will have to work state by state.
In [20]:
def counties(variables='NAME'):
    """generator for all counties, fetched with a single call covering all states"""
    # tabulate a set of fips codes for the states
    states_fips = set([s.fips for s in us.states.STATES])
    geo = {'for': 'county:*',
           'in': 'state:*'}
    for county in c.sf1.get(variables, geo=geo):
        # eliminate counties that aren't in a state or DC
        if county['state'] in states_fips:
            yield county

def counties2(variables='NAME'):
    """generator for all counties, fetched state by state"""
    # since we can get all the counties in one call,
    # this function is for demonstrating the use of walking through
    # the states to get at the counties
    for state in us.states.STATES:
        geo = {'for': 'county:*',
               'in': 'state:{fips}'.format(fips=state.fips)}
        for county in c.sf1.get(variables, geo=geo):
            yield county
In [21]:
counties_list = list(counties('NAME,P0010001'))
In [22]:
# add up the population to make sure we have the total right
counties_df = DataFrame(counties_list)
counties_df.P0010001 = counties_df.P0010001.astype('int')
counties_df.P0010001.sum()
Out[22]:
One reason for writing all the counties in the form of a Python generator is that you can easily control the number of counties you work with at any given time, and then easily scale out to get all of them.
In [23]:
# make a list of the first ten counties
from itertools import islice
list(islice(counties2(),10))
Out[23]:
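A side effect worth noticing: when a generator walks through the states one at a time, taking only a few items with islice should also mean that only the first few per-state requests are made. The sketch below illustrates this with stand-ins, since it does not call the census API: FAKE_DATA, fake_get and fake_counties are all hypothetical names invented for this example.

```python
from itertools import islice

# Hypothetical data keyed by made-up state codes; not real census results
FAKE_DATA = {"01": ["A1", "A2", "A3"],
             "02": ["B1", "B2"],
             "03": ["C1"]}

calls = []  # records which "states" were actually requested

def fake_get(state_fips):
    """Stand-in for one per-state API call."""
    calls.append(state_fips)
    return FAKE_DATA[state_fips]

def fake_counties():
    """Same shape as a state-by-state generator: one request per state."""
    for state_fips in sorted(FAKE_DATA):
        for county in fake_get(state_fips):
            yield county

first_four = list(islice(fake_counties(), 4))
```

After this runs, first_four holds the three items from "01" plus the first item from "02", and calls shows that "03" was never requested.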
The following generator loops through all the states to get at the individual counties to then get at the census tracts.
In [24]:
def tracts(variables='NAME'):
    """generator for all census tracts, walking state by state and county by county"""
    for state in us.states.STATES:
        # handy to print out state to monitor progress
        print state.fips, state
        counties_in_state = {'for': 'county:*',
                             'in': 'state:{fips}'.format(fips=state.fips)}
        for county in c.sf1.get('NAME', geo=counties_in_state):
            # print county['state'], county['NAME']
            tracts_in_county = {'for': 'tract:*',
                                'in': 'state:{s_fips} county:{c_fips}'.format(s_fips=state.fips,
                                                                              c_fips=county['county'])}
            for tract in c.sf1.get(variables, geo=tracts_in_county):
                yield tract
In [25]:
r = list(islice(tracts('NAME,P0010001'),10))
tracts_df = DataFrame(r)
tracts_df.P0010001 = tracts_df.P0010001.astype('int')
tracts_df['FIPS'] = tracts_df.apply(lambda s: s['state']+s['county']+s['tract'], axis=1)
print "number of tracts", len(tracts_df)
print "total pop", tracts_df.P0010001.sum()
tracts_df.head()
Out[25]:
It's good to save the DataFrame so we can load the census tracts without having to call the census API again.
I/O: http://pandas.pydata.org/pandas-docs/dev/io.html
Today, we'll use pickle format and look at other formats.
In [26]:
TRACT_FILE_PICKLE = "tracts.pickle"
# UNCOMMENT THIS LINE TO SAVE YOUR FILE
# tracts_df.to_pickle(TRACT_FILE_PICKLE)
Let's read the DataFrame from disk to confirm that we were able to save the file properly.
In [27]:
df = pd.read_pickle(TRACT_FILE_PICKLE)
df.head()
Out[27]:
In [28]:
# UNCOMMENT TO DO COMPARISON
# you can compare the DataFrame in memory to the one loaded from disk
# np.all(tracts_df == df)