Goals

To practice using generators to yield geographical entities of various types.

Generators are a bit complicated, and I won't try to explain all the intricacies here. I will show you how to use yield in a function definition to return a generator. From Definition of a generator:

A function which returns an iterator. It looks like a normal function except that it contains yield statements for producing a series of values usable in a for-loop or that can be retrieved one at a time with the next() function. Each yield temporarily suspends processing, remembering the location execution state (including local variables and pending try-statements). When the generator resumes, it picks-up where it left-off (in contrast to functions which start fresh on every invocation).

For some background on Python generators:

Why use generators: http://stackoverflow.com/a/102632/7782

Generators are good for calculating large sets of results (in particular calculations involving loops themselves) where you don't know if you are going to need all results, or where you don't want to allocate the memory for all results at the same time.
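To make the "you don't know if you are going to need all results" point concrete, here is a small sketch. It is written in Python 3 syntax (in the Python 2 used by this notebook, xrange would replace range), and the names squares and computed are made up for illustration. The loop stops as soon as it finds what it wants, and the log shows that the rest of the values were never calculated.

```python
def squares(limit, log):
    """Yield n*n for n in range(limit), recording each n actually computed."""
    for n in range(limit):
        log.append(n)  # record that this value was really computed
        yield n * n

computed = []
first_big = None
for value in squares(10**9, computed):
    if value > 50:
        first_big = value
        break  # stop early; the remaining values are never computed

print(first_big)      # 64
print(len(computed))  # 9
```

Even though the generator could produce a billion squares, only nine were computed and no list of results was ever allocated.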

Let's also practice using itertools.islice and enumerate -- two of my favorite constructions in Python.

From http://api.census.gov/data/2010/sf1/geo.html, the geographic entities we are specifically interested in for this exercise:

state-county
state-county-tract
state-place
state-metropolitan statistical area/micropolitan statistical area
state-metropolitan statistical area/micropolitan statistical area-metropolitan division
state-combined statistical area



In [1]:

    
# usual imports for numpy, pandas, matplotlib

import numpy as np
import matplotlib.pyplot as plt
from pandas import DataFrame, Series, Index
import pandas as pd



In [2]:

    
# check that CENSUS_KEY is defined

import census
import us

import settings
assert settings.CENSUS_KEY is not None



In [3]:

    
# instantiate our Census object

c = census.Census(key=settings.CENSUS_KEY)

A bit of warmup with Generators



In [4]:

    
import string
print list(string.lowercase)









    



['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']



In [5]:

    
def abcs():
    """a generator that yields the letters of the alphabet one at a time"""
    alphabet = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 
                'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 
                'w', 'x', 'y', 'z']
    for letter in alphabet:
        yield letter

# a generator that gives you the letters of the alphabet a letter at a time         
say_abcs =  abcs()



In [6]:

    
# run this line over and over again to see the letters one at a time
say_abcs.next()









    Out[6]:





'a'
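If you keep running that cell, the generator eventually runs out of letters. The sketch below shows what happens then, using a shorter, made-up generator named abc_short. It uses the built-in next(gen), which works in Python 2.6+ and Python 3, whereas the gen.next() method used above exists only in Python 2.

```python
def abc_short():
    """Yield three letters, then finish."""
    for letter in "abc":
        yield letter

gen = abc_short()
letters = [next(gen), next(gen), next(gen)]  # ['a', 'b', 'c']

# A fourth call raises StopIteration because the generator is exhausted.
try:
    next(gen)
    exhausted = False
except StopIteration:
    exhausted = True

# next() also accepts a default that is returned instead of raising.
fallback = next(gen, None)

print(letters, exhausted, fallback)
```

A for-loop handles StopIteration for you, which is why you rarely see it when iterating normally.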



In [7]:

    
# you can use list to grab all the items in an iterator.  But be careful if the number
# of items is large or even infinite!  In this case, we're ok

list(abcs())









    Out[7]:





['a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'j',
 'k',
 'l',
 'm',
 'n',
 'o',
 'p',
 'q',
 'r',
 's',
 't',
 'u',
 'v',
 'w',
 'x',
 'y',
 'z']

Demonstration of how to use enumerate:

Return an enumerate object. sequence must be a sequence, an iterator, or some other object which supports iteration. The next() method of the iterator returned by enumerate() returns a tuple containing a count (from start which defaults to 0) and the values obtained from iterating over sequence



In [8]:

    
for (i, letter) in enumerate(abcs()):
    print i, letter









    



0 a
1 b
2 c
3 d
4 e
5 f
6 g
7 h
8 i
9 j
10 k
11 l
12 m
13 n
14 o
15 p
16 q
17 r
18 s
19 t
20 u
21 v
22 w
23 x
24 y
25 z
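The documentation excerpt above mentions a start value that defaults to 0. A minimal sketch of changing it (Python 3 syntax; the list letters is just sample data):

```python
letters = ["a", "b", "c"]

# start=1 makes the count begin at 1 instead of the default 0
numbered = list(enumerate(letters, start=1))

print(numbered)  # [(1, 'a'), (2, 'b'), (3, 'c')]
```

Only the counter changes; the items themselves come out in the same order.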

You can use itertools.islice to return parts of the iterator.

Make an iterator that returns selected elements from the iterable. If start is non-zero, then elements from the iterable are skipped until start is reached. Afterward, elements are returned consecutively unless step is set higher than one which results in items being skipped. If stop is None, then iteration continues until the iterator is exhausted, if at all; otherwise, it stops at the specified position. Unlike regular slicing, islice() does not support negative values for start, stop, or step. Can be used to extract related fields from data where the internal structure has been flattened (for example, a multi-line report may list a name field on every third line).



In [9]:

    
# let's get the first 10 letters of the alphabet

from itertools import islice
list(islice(abcs(), 10))









    Out[9]:





['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']



In [10]:

    
# you can use None to get all items in islice
# from docs: "If stop is None, then iteration continues until the iterator is exhausted,"

list(islice(abcs(), None))









    Out[10]:





['a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'j',
 'k',
 'l',
 'm',
 'n',
 'o',
 'p',
 'q',
 'r',
 's',
 't',
 'u',
 'v',
 'w',
 'x',
 'y',
 'z']



In [11]:

    
# itertools.count can in principle generate an infinite sequence
# http://www.python.org/doc//current/library/itertools.html#itertools.count

from itertools import count

# count starting at zero
my_counter = count(0)



In [16]:

    
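# try it out -- each call returns the next integer, so rerunning this cell
# gives 0, 1, 2, ... (the output below is from the fifth run)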
my_counter.next()









    Out[16]:





4



In [17]:

    
# DON'T do list(count(0))  -> you'll be trying to generate an infinite list
# but use an upper limit

list(islice(count(0),10))









    Out[17]:





[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]



In [18]:

    
# start, stop
list(islice(count(),1,3))









    Out[18]:





[1, 2]
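The islice documentation quoted earlier also describes a step argument, which the cells above don't exercise. Here is a short sketch in Python 3 syntax; the report list is made-up data standing in for the "name field on every third line" case from the docs.

```python
from itertools import count, islice

# start=0, stop=12, step=3: take every third item of an infinite counter
every_third = list(islice(count(), 0, 12, 3))
print(every_third)  # [0, 3, 6, 9]

# a flattened report where each record spans three lines: name, age, state
report = ["Ann", "42", "NY", "Bob", "37", "CA", "Cy", "29", "TX"]
names = list(islice(report, 0, None, 3))
print(names)  # ['Ann', 'Bob', 'Cy']
```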

Generator for US Counties



In [19]:

    
# get the syntax down for getting counties from CA -- so that then we can use it later

r = c.sf1.get('NAME,P0010001', geo={'for':'county:*',
                                'in':'state:{fips}'.format(fips=us.states.CA.fips)})
r[:5]









    Out[19]:





[{u'NAME': u'Alameda County',
  u'P0010001': u'1510271',
  u'county': u'001',
  u'state': u'06'},
 {u'NAME': u'Alpine County',
  u'P0010001': u'1175',
  u'county': u'003',
  u'state': u'06'},
 {u'NAME': u'Amador County',
  u'P0010001': u'38091',
  u'county': u'005',
  u'state': u'06'},
 {u'NAME': u'Butte County',
  u'P0010001': u'220000',
  u'county': u'007',
  u'state': u'06'},
 {u'NAME': u'Calaveras County',
  u'P0010001': u'45578',
  u'county': u'009',
  u'state': u'06'}]

With the Census API, you can get the counties either with a single call or state-by-state. The counties generator below takes the first approach, while counties2 takes the second. Although counties is more efficient in most cases I can think of, it is useful to know how to make calls on a state-by-state basis. For example, when we want to query at the census tract level or below, we will need to work state-by-state.



In [20]:

    
def counties(variables='NAME'):
    """generator for all counties, asking for all the states in one call"""
    
    # tabulate a set of fips codes for the states
    states_fips = set([s.fips for s in us.states.STATES])
    
    geo={'for':'county:*',
             'in':'state:*'}    
    for county in c.sf1.get(variables, geo=geo):
        # skip counties whose state code isn't one of the states or DC
        if county['state'] in states_fips:
            yield county
        

def counties2(variables='NAME'):
    """generator for all counties"""
    
    # since we can get all the counties in one call, 
    # this function is for demonstrating the use of walking through 
    # the states to get at the counties

    for state in us.states.STATES:
        geo={'for':'county:*',
             'in':'state:{fips}'.format(fips=state.fips)}
        for county in c.sf1.get(variables, geo=geo):
            yield county



In [21]:

    
counties_list = list(counties('NAME,P0010001'))



In [22]:

    
# add up the population to make sure we have the total right
counties_df = DataFrame(counties_list)
counties_df.P0010001 = counties_df.P0010001.astype('int')
counties_df.P0010001.sum()









    Out[22]:





308745538

One reason for writing the counties in the form of a Python generator is that you can easily control the number of counties you work with at any given time -- and then easily scale out to get all of them.



In [23]:

    
# make a list of the first ten counties

from itertools import islice
list(islice(counties2(),10))









    Out[23]:





[{u'NAME': u'Autauga County', u'county': u'001', u'state': u'01'},
 {u'NAME': u'Baldwin County', u'county': u'003', u'state': u'01'},
 {u'NAME': u'Barbour County', u'county': u'005', u'state': u'01'},
 {u'NAME': u'Bibb County', u'county': u'007', u'state': u'01'},
 {u'NAME': u'Blount County', u'county': u'009', u'state': u'01'},
 {u'NAME': u'Bullock County', u'county': u'011', u'state': u'01'},
 {u'NAME': u'Butler County', u'county': u'013', u'state': u'01'},
 {u'NAME': u'Calhoun County', u'county': u'015', u'state': u'01'},
 {u'NAME': u'Chambers County', u'county': u'017', u'state': u'01'},
 {u'NAME': u'Cherokee County', u'county': u'019', u'state': u'01'}]
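To see why this keeps the work small, here is an offline sketch that mimics the shape of counties2 without touching the Census API. Everything in it is a stand-in: FAKE_COUNTIES, fake_get_counties, and fake_counties2 are hypothetical names, and the data is a tiny hand-made sample. It is written in Python 3 syntax. The calls list records which states were actually requested.

```python
from itertools import islice

# Hypothetical stand-in data: state FIPS code -> county names.
FAKE_COUNTIES = {
    "01": ["Autauga County", "Baldwin County", "Barbour County"],
    "02": ["Aleutians East Borough", "Anchorage Municipality"],
    "04": ["Apache County", "Cochise County"],
}

calls = []  # records which states were actually "fetched"

def fake_get_counties(state_fips):
    """Hypothetical replacement for the API call; logs each request."""
    calls.append(state_fips)
    return [{"NAME": name, "state": state_fips}
            for name in FAKE_COUNTIES[state_fips]]

def fake_counties2():
    """Same shape as counties2: one request per state, counties yielded one by one."""
    for state_fips in sorted(FAKE_COUNTIES):
        for county in fake_get_counties(state_fips):
            yield county

first_four = list(islice(fake_counties2(), 4))
print([row["NAME"] for row in first_four])
print(calls)  # ['01', '02']
```

Asking for four counties triggers requests for only the first two states; the third state is never fetched because the generator is suspended before it gets there.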

Generator for Census Tracts

The following generator loops through all the states, then through the counties of each state, and finally through the census tracts of each county.



In [24]:

    
def tracts(variables='NAME'):
    for state in us.states.STATES:
        
        # handy to print out state to monitor progress
        print state.fips, state
        counties_in_state={'for':'county:*',
             'in':'state:{fips}'.format(fips=state.fips)}
        
        for county in c.sf1.get('NAME', geo=counties_in_state):
            
            # print county['state'], county['NAME']
            tracts_in_county = {'for':'tract:*',
              'in': 'state:{s_fips} county:{c_fips}'.format(s_fips=state.fips, 
                                                            c_fips=county['county'])}
            
            for tract in c.sf1.get(variables,geo=tracts_in_county):
                yield tract
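The three-level walk (state, then county, then tract) can be tried offline with the sketch below. It assumes a small hand-made dictionary, FAKE_GEO, in place of the API; fake_tracts and tract_fips are hypothetical helper names, and the codes are sample values only. It is written in Python 3 syntax. The helper joins the 2-digit state, 3-digit county, and 6-digit tract codes into a single 11-digit identifier.

```python
# Hypothetical nested data: state FIPS -> county FIPS -> tract codes.
FAKE_GEO = {
    "01": {"001": ["020100", "020200"], "003": ["010100"]},
    "02": {"013": ["000100"]},
}

def fake_tracts():
    """Walk states, then counties, then tracts, yielding one dict per tract."""
    for state in sorted(FAKE_GEO):
        for county in sorted(FAKE_GEO[state]):
            for tract in FAKE_GEO[state][county]:
                yield {"state": state, "county": county, "tract": tract}

def tract_fips(row):
    """Concatenate the state, county, and tract codes into one identifier."""
    return row["state"] + row["county"] + row["tract"]

ids = [tract_fips(row) for row in fake_tracts()]
print(ids)
```

Because the codes are strings with leading zeros, they are concatenated rather than added; converting them to integers first would lose the zero padding.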



In [25]:

    
r = list(islice(tracts('NAME,P0010001'),10))
tracts_df = DataFrame(r)
tracts_df.P0010001 = tracts_df.P0010001.astype('int')
tracts_df['FIPS'] = tracts_df.apply(lambda s: s['state']+s['county']+s['tract'], axis=1)
print "number of tracts", len(tracts_df)
print "total pop", tracts_df.P0010001.sum()
tracts_df.head()









    



01 Alabama
number of tracts 10
total pop 48357






    Out[25]:






  
    
      
               NAME  P0010001 county state   tract         FIPS
0  Census Tract 201      1912    001    01  020100  01001020100
1  Census Tract 202      2170    001    01  020200  01001020200
2  Census Tract 203      3373    001    01  020300  01001020300
3  Census Tract 204      4386    001    01  020400  01001020400
4  Census Tract 205     10766    001    01  020500  01001020500

It's good to save the DataFrame so we can load the census tracts without having to call the Census API again.

I/O: http://pandas.pydata.org/pandas-docs/dev/io.html

Today, we'll use pickle format and look at other formats.



In [26]:

    
TRACT_FILE_PICKLE = "tracts.pickle"

# UNCOMMENT THIS LINE TO SAVE YOUR FILE
# tracts_df.to_pickle(TRACT_FILE_PICKLE)

Let's read the DataFrame from disk to confirm that we were able to save the file properly.



In [27]:

    
df = pd.read_pickle(TRACT_FILE_PICKLE)
df.head()









    Out[27]:






  
    
      
               NAME  P0010001 county state   tract         FIPS
0  Census Tract 201      1912    001    01  020100  01001020100
1  Census Tract 202      2170    001    01  020200  01001020200
2  Census Tract 203      3373    001    01  020300  01001020300
3  Census Tract 204      4386    001    01  020400  01001020400
4  Census Tract 205     10766    001    01  020500  01001020500



In [28]:

    
# UNCOMMENT TO DO COMPARISON
# you can compare the DataFrame in memory to the one loaded from disk
# np.all(tracts_df == df)
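The save, load, and compare cycle can also be tried without pandas or the Census API. The sketch below uses the standard-library pickle module on a plain list of dicts as a stand-in for the DataFrame; in the notebook, to_pickle and pd.read_pickle play the corresponding roles. The file name tracts_demo.pickle and the temporary directory are assumptions made for the example, and the code is in Python 3 syntax.

```python
import os
import pickle
import tempfile

# a couple of sample rows shaped like the tract records above
rows = [
    {"NAME": "Census Tract 201", "P0010001": 1912, "tract": "020100"},
    {"NAME": "Census Tract 202", "P0010001": 2170, "tract": "020200"},
]

# hypothetical file name, written to a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), "tracts_demo.pickle")

# save to disk
with open(path, "wb") as f:
    pickle.dump(rows, f)

# load it back
with open(path, "rb") as f:
    loaded = pickle.load(f)

# compare the in-memory data to what came back from disk
round_trip_ok = (loaded == rows)
print(round_trip_ok)  # True
```

Only unpickle files you created yourself or otherwise trust, since loading a pickle can execute arbitrary code.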
