Goals

To practice using generators to yield geographical entities of various types.

Generators are a bit complicated, and I won't try to explain all the intricacies here. Instead, I will show you how to use yield in a function definition to return a generator. From the Python glossary definition of a generator:

A function which returns an iterator. It looks like a normal function except that it contains yield statements for producing a series of values usable in a for-loop or that can be retrieved one at a time with the next() function. Each yield temporarily suspends processing, remembering the location execution state (including local variables and pending try-statements). When the generator resumes, it picks-up where it left-off (in contrast to functions which start fresh on every invocation).
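
To make the "picks up where it left off" behavior concrete, here is a minimal sketch. The name running_total is made up for this illustration; the built-in next() is used because it works on both Python 2.6+ and Python 3.

```python
def running_total(numbers):
    """Yield the cumulative sum after each number."""
    total = 0  # local state that survives between yields
    for n in numbers:
        total += n
        yield total  # execution pauses here until the next request

gen = running_total([1, 2, 3])
first = next(gen)   # runs the body up to the first yield
second = next(gen)  # resumes after the yield; total is still 1
third = next(gen)
print((first, second, third))
```

Each call to next() continues the same loop, so total keeps accumulating instead of being reset to zero as it would be in an ordinary function call.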

For some background on Python generators:

Why use generators: http://stackoverflow.com/a/102632/7782

Generators are good for calculating large sets of results (in particular calculations involving loops themselves) where you don't know if you are going to need all results, or where you don't want to allocate the memory for all results at the same time.
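
As a rough illustration of that point, the sketch below searches an endless stream of squares and stops as soon as it finds what it needs. The names squares and first_square_above are hypothetical, chosen only for this example.

```python
def squares():
    """Yield 0, 1, 4, 9, ... without ever building a list."""
    n = 0
    while True:
        yield n * n
        n += 1

def first_square_above(limit):
    """Return the first perfect square greater than limit."""
    for sq in squares():
        if sq > limit:
            return sq  # stop early; later squares are never computed

result = first_square_above(50)
print(result)
```

No list of squares is ever allocated, and the caller decides how far the computation goes.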

Let's also practice using itertools.islice and enumerate -- two of my favorite constructions in Python.

From http://api.census.gov/data/2010/sf1/geo.html, the geographic entities we are specifically interested in for this exercise are:

  • state-county
  • state-county-tract

  • state-place

  • state-metropolitan statistical area/micropolitan statistical area
  • state-metropolitan statistical area/micropolitan statistical area-metropolitan division
  • state-combined statistical area

In [1]:
# usual imports for numpy, pandas, matplotlib

import numpy as np
import matplotlib.pyplot as plt
from pandas import DataFrame, Series, Index
import pandas as pd

In [2]:
# check that CENSUS_KEY is defined

import census
import us

import settings
assert settings.CENSUS_KEY is not None

In [3]:
# instantiate our Census object

c = census.Census(key=settings.CENSUS_KEY)

A bit of warmup with Generators


In [4]:
import string
print list(string.lowercase)


['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']

In [5]:
def abcs():
    """a generator that yields the letters of the alphabet one at a time"""
    alphabet = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 
                'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 
                'w', 'x', 'y', 'z']
    for letter in alphabet:
        yield letter

# a generator that gives you the letters of the alphabet a letter at a time         
say_abcs =  abcs()

In [6]:
# run this line over and over again to see the letters one at a time
say_abcs.next()


Out[6]:
'a'

In [7]:
# you can use list to grab all the items in an iterator.  But be careful if the number
# of items is large or even infinite!  In this case, we're ok

list(abcs())


Out[7]:
['a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'j',
 'k',
 'l',
 'm',
 'n',
 'o',
 'p',
 'q',
 'r',
 's',
 't',
 'u',
 'v',
 'w',
 'x',
 'y',
 'z']

Demonstration of how to use enumerate:

Return an enumerate object. sequence must be a sequence, an iterator, or some other object which supports iteration. The next() method of the iterator returned by enumerate() returns a tuple containing a count (from start which defaults to 0) and the values obtained from iterating over sequence

In [8]:
for (i, letter) in enumerate(abcs()):
    print i, letter


0 a
1 b
2 c
3 d
4 e
5 f
6 g
7 h
8 i
9 j
10 k
11 l
12 m
13 n
14 o
15 p
16 q
17 r
18 s
19 t
20 u
21 v
22 w
23 x
24 y
25 z
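
The docs excerpt above also mentions a start value for the count. A small sketch, using a made-up three-letter generator so the result stays short:

```python
def first_letters():
    """Yield a few letters; a stand-in for any iterator."""
    for letter in 'abc':
        yield letter

# start=1 makes the count begin at 1 instead of the default 0
numbered = list(enumerate(first_letters(), start=1))
print(numbered)
```

This is handy when the count is meant for human-readable numbering rather than for indexing.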

You can use itertools.islice to return parts of the iterator.

Make an iterator that returns selected elements from the iterable. If start is non-zero, then elements from the iterable are skipped until start is reached. Afterward, elements are returned consecutively unless step is set higher than one which results in items being skipped. If stop is None, then iteration continues until the iterator is exhausted, if at all; otherwise, it stops at the specified position. Unlike regular slicing, islice() does not support negative values for start, stop, or step. Can be used to extract related fields from data where the internal structure has been flattened (for example, a multi-line report may list a name field on every third line).

In [9]:
# let's get the first 10 letters of the alphabet

from itertools import islice
list(islice(abcs(), 10))


Out[9]:
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

In [10]:
# you can use None to get all items in islice
# from docs: "If stop is None, then iteration continues until the iterator is exhausted,"

list(islice(abcs(), None))


Out[10]:
['a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'j',
 'k',
 'l',
 'm',
 'n',
 'o',
 'p',
 'q',
 'r',
 's',
 't',
 'u',
 'v',
 'w',
 'x',
 'y',
 'z']

In [11]:
# itertools.count can in principle generate an infinite sequence
# http://www.python.org/doc//current/library/itertools.html#itertools.count

from itertools import count

# count starting from zero
my_counter = count(0)

In [16]:
# try it out: each call returns the next integer
# (this cell was run several times, hence the output of 4)
my_counter.next()


Out[16]:
4

In [17]:
# DON'T do list(count(0)) -> you'd be trying to build an infinite list;
# instead, use islice to set an upper limit

list(islice(count(0),10))


Out[17]:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [18]:
# start, stop
list(islice(count(),1,3))


Out[18]:
[1, 2]
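
islice also takes a step argument, which covers the "name field on every third line" case from the docs excerpt. The report lines below are invented purely for illustration.

```python
from itertools import count, islice

# hypothetical flattened report: three lines per record
report = ['name: Ann', 'age: 31', 'city: Oslo',
          'name: Bob', 'age: 27', 'city: Lima']

# start=0, stop=None, step=3 -> every third line, beginning with the first
names = list(islice(report, 0, None, 3))

# start, stop, step on an infinite counter
evens = list(islice(count(), 0, 10, 2))

print(names)
print(evens)
```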

Generator for US Counties


In [19]:
# get the syntax down for getting counties from CA -- so that then we can use it later

r = c.sf1.get('NAME,P0010001', geo={'for':'county:*',
                                'in':'state:{fips}'.format(fips=us.states.CA.fips)})
r[:5]


Out[19]:
[{u'NAME': u'Alameda County',
  u'P0010001': u'1510271',
  u'county': u'001',
  u'state': u'06'},
 {u'NAME': u'Alpine County',
  u'P0010001': u'1175',
  u'county': u'003',
  u'state': u'06'},
 {u'NAME': u'Amador County',
  u'P0010001': u'38091',
  u'county': u'005',
  u'state': u'06'},
 {u'NAME': u'Butte County',
  u'P0010001': u'220000',
  u'county': u'007',
  u'state': u'06'},
 {u'NAME': u'Calaveras County',
  u'P0010001': u'45578',
  u'county': u'009',
  u'state': u'06'}]

With the census API, you can get the counties with a single call or state by state. The counties generator below takes the first approach, while counties2 takes the second. Although counties is more efficient in most cases I can think of, it is useful to know how to make calls on a state-by-state basis. For example, when we query at the census tract level or below, we will need to work state by state.


In [20]:
def counties(variables='NAME'):
    """generator for all counties, fetched with a single API call"""
    
    # tabulate a set of fips codes for the states
    states_fips = set([s.fips for s in us.states.STATES])
    
    geo = {'for': 'county:*',
           'in': 'state:*'}
    for county in c.sf1.get(variables, geo=geo):
        # keep only counties that belong to a state or DC
        if county['state'] in states_fips:
            yield county
        

def counties2(variables='NAME'):
    """generator for all counties, fetched one state at a time"""
    
    # since we can get all the counties in one call, 
    # this function is for demonstrating the use of walking through 
    # the states to get at the counties

    for state in us.states.STATES:
        geo={'for':'county:*',
             'in':'state:{fips}'.format(fips=state.fips)}
        for county in c.sf1.get(variables, geo=geo):
            yield county

In [21]:
counties_list = list(counties('NAME,P0010001'))

In [22]:
# add up the population to make sure we have the total right
counties_df = DataFrame(counties_list)
counties_df.P0010001 = counties_df.P0010001.astype('int')
counties_df.P0010001.sum()


Out[22]:
308745538

One reason for writing the county queries as Python generators is that you can easily control the number of counties you work with at any given time -- and then easily scale out to get all of them.


In [23]:
# make a list of the first ten counties

from itertools import islice
list(islice(counties2(),10))


Out[23]:
[{u'NAME': u'Autauga County', u'county': u'001', u'state': u'01'},
 {u'NAME': u'Baldwin County', u'county': u'003', u'state': u'01'},
 {u'NAME': u'Barbour County', u'county': u'005', u'state': u'01'},
 {u'NAME': u'Bibb County', u'county': u'007', u'state': u'01'},
 {u'NAME': u'Blount County', u'county': u'009', u'state': u'01'},
 {u'NAME': u'Bullock County', u'county': u'011', u'state': u'01'},
 {u'NAME': u'Butler County', u'county': u'013', u'state': u'01'},
 {u'NAME': u'Calhoun County', u'county': u'015', u'state': u'01'},
 {u'NAME': u'Chambers County', u'county': u'017', u'state': u'01'},
 {u'NAME': u'Cherokee County', u'county': u'019', u'state': u'01'}]
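
A side benefit of the state-by-state generator is that islice stops it before it reaches states it doesn't need. The sketch below shows this without touching the Census API: FAKE_COUNTIES, fake_get and fake_counties are all hypothetical stand-ins, and the calls list simply records which states were requested.

```python
from itertools import islice

# hypothetical stand-in for the Census API: state FIPS -> county names
FAKE_COUNTIES = {
    '01': ['Autauga County', 'Baldwin County', 'Barbour County'],
    '02': ['Aleutians East Borough', 'Anchorage Municipality'],
    '04': ['Apache County', 'Cochise County'],
}

calls = []  # records which states were actually requested

def fake_get(state_fips):
    """Pretend to make one API call for a single state."""
    calls.append(state_fips)
    return [{'NAME': name, 'state': state_fips}
            for name in FAKE_COUNTIES[state_fips]]

def fake_counties():
    """Yield counties state by state, in the style of counties2."""
    for state_fips in sorted(FAKE_COUNTIES):
        for county in fake_get(state_fips):
            yield county

first_four = list(islice(fake_counties(), 4))
print([row['NAME'] for row in first_four])
print(calls)
```

Only the first two states are ever requested, because the generator is suspended once islice has its four items. With a real API, that would mean fewer network calls while you are still experimenting.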

Generator for Census Tracts

The following generator loops through all the states to get at the individual counties to then get at the census tracts.


In [24]:
def tracts(variables='NAME'):
    """generator for all census tracts, walking state by state, then county by county"""
    for state in us.states.STATES:
        
        # handy to print out state to monitor progress
        print state.fips, state
        counties_in_state={'for':'county:*',
             'in':'state:{fips}'.format(fips=state.fips)}
        
        for county in c.sf1.get('NAME', geo=counties_in_state):
            
            # print county['state'], county['NAME']
            tracts_in_county = {'for':'tract:*',
              'in': 'state:{s_fips} county:{c_fips}'.format(s_fips=state.fips, 
                                                            c_fips=county['county'])}
            
            for tract in c.sf1.get(variables,geo=tracts_in_county):
                yield tract

In [25]:
r = list(islice(tracts('NAME,P0010001'),10))
tracts_df = DataFrame(r)
tracts_df.P0010001 = tracts_df.P0010001.astype('int')
tracts_df['FIPS'] = tracts_df.apply(lambda s: s['state']+s['county']+s['tract'], axis=1)
print "number of tracts", len(tracts_df)
print "total pop", tracts_df.P0010001.sum()
tracts_df.head()


01 Alabama
number of tracts 10
total pop 48357
Out[25]:
NAME P0010001 county state tract FIPS
0 Census Tract 201 1912 001 01 020100 01001020100
1 Census Tract 202 2170 001 01 020200 01001020200
2 Census Tract 203 3373 001 01 020300 01001020300
3 Census Tract 204 4386 001 01 020400 01001020400
4 Census Tract 205 10766 001 01 020500 01001020500
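
The FIPS column is just three fixed-width codes joined together: 2 digits for the state, 3 for the county and 6 for the tract, giving an 11-character tract identifier. A minimal sketch of the same construction outside pandas follows; tract_geoid is a hypothetical helper name, not part of any library.

```python
def tract_geoid(state, county, tract):
    """Join zero-padded FIPS parts into an 11-character tract ID."""
    # pad on the left with zeros to widths 2, 3 and 6
    return '{0:0>2}{1:0>3}{2:0>6}'.format(state, county, tract)

geoid = tract_geoid('01', '001', '020100')
print(geoid)
```

Keeping the codes as strings matters here: converting them to integers would drop the leading zeros and break the fixed-width layout.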

It's good to save the DataFrame so we can load the census tracts without having to call the census API again.

I/O: http://pandas.pydata.org/pandas-docs/dev/io.html

Today, we'll use the pickle format; we'll look at other formats later.


In [26]:
TRACT_FILE_PICKLE = "tracts.pickle"

# UNCOMMENT THIS LINE TO SAVE YOUR FILE
# tracts_df.to_pickle(TRACT_FILE_PICKLE)

Let's read the DataFrame from disk to confirm that we were able to save the file properly.


In [27]:
df = pd.read_pickle(TRACT_FILE_PICKLE)
df.head()


Out[27]:
NAME P0010001 county state tract FIPS
0 Census Tract 201 1912 001 01 020100 01001020100
1 Census Tract 202 2170 001 01 020200 01001020200
2 Census Tract 203 3373 001 01 020300 01001020300
3 Census Tract 204 4386 001 01 020400 01001020400
4 Census Tract 205 10766 001 01 020500 01001020500

In [28]:
# UNCOMMENT TO DO COMPARISON
# you can compare the DataFrame in memory to the one loaded from disk
# np.all(tracts_df == df)