Midterm

The midterm is scheduled for Day 17 (2014-03-18). It will probably consist mostly of multiple-choice questions.

The goal of this notebook is to help students prepare for the midterm by highlighting what we've covered so far.

Suggestions about How to Prepare

  • Read through all the materials from the course so far and outline what you do and don't understand.
  • Focus on key concepts and on the programming constructs that come up repeatedly.

Open Data

Working definition of open data

From http://en.wikipedia.org/w/index.php?title=Special:Cite&page=Open_data&id=532390265:

Open data is the idea that certain data should be freely available to everyone to use and republish as they wish, without restrictions from copyright, patents or other mechanisms of control.

http://opendefinition.org/:

A piece of content or data is open if anyone is free to use, reuse, and redistribute it — subject only, at most, to the requirement to attribute and/or share-alike.

Readings from Day 1

World Populations

Day_01_B_World_Population.ipynb

  • How was the JSON data from Wikipedia and the CIA Factbook produced?
  • Why do the totals from the two sources differ?

Racial Dot Map (as a framing example)

The Racial Dot Map: One Dot Per Person | Weldon Cooper Center for Public Service

  • What is the Racial Dot Map displaying?
  • How would you get data relevant to the Racial Dot Map from the Census API?

Census API

Day_02_A_US_Census_API.ipynb

  • What's the purpose of an API key?
  • What is pip, and how do you use it?
  • Remember that you sometimes have to filter out Puerto Rico from the results

In [2]:
# set up your census object
# example from https://github.com/sunlightlabs/census

from census import Census
from us import states

import settings

c = Census(settings.CENSUS_KEY)
for (i, state) in enumerate(states.STATES):
    print i, state.name, state.fips


0 Alabama 01
1 Alaska 02
2 Arizona 04
3 Arkansas 05
4 California 06
5 Colorado 08
6 Connecticut 09
7 Delaware 10
8 District of Columbia 11
9 Florida 12
10 Georgia 13
11 Hawaii 15
12 Idaho 16
13 Illinois 17
14 Indiana 18
15 Iowa 19
16 Kansas 20
17 Kentucky 21
18 Louisiana 22
19 Maine 23
20 Maryland 24
21 Massachusetts 25
22 Michigan 26
23 Minnesota 27
24 Mississippi 28
25 Missouri 29
26 Montana 30
27 Nebraska 31
28 Nevada 32
29 New Hampshire 33
30 New Jersey 34
31 New Mexico 35
32 New York 36
33 North Carolina 37
34 North Dakota 38
35 Ohio 39
36 Oklahoma 40
37 Oregon 41
38 Pennsylvania 42
39 Rhode Island 44
40 South Carolina 45
41 South Dakota 46
42 Tennessee 47
43 Texas 48
44 Utah 49
45 Vermont 50
46 Virginia 51
47 Washington 53
48 West Virginia 54
49 Wisconsin 55
50 Wyoming 56

In [3]:
import requests
# get the total population of all states
url = "http://api.census.gov/data/2010/sf1?key={key}&get=P0010001,NAME&for=state:*".format(key=settings.CENSUS_KEY)
r = requests.get(url)

r.json()[:5]


Out[3]:
[[u'P0010001', u'NAME', u'state'],
 [u'4779736', u'Alabama', u'01'],
 [u'710231', u'Alaska', u'02'],
 [u'6392017', u'Arizona', u'04'],
 [u'2915918', u'Arkansas', u'05']]
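The response above is a list of lists whose first row holds the column names. A minimal sketch of turning such a response into a DataFrame follows. It reuses the four rows shown in Out[3] and appends a Puerto Rico row (FIPS 72, 2010 Census population) purely for illustration, on the assumption that a `state:*` query can return Puerto Rico alongside the states. The names `rows`, `states_df`, and `no_pr_df` are made up for this example.

```python
import pandas as pd

# First rows of the response shown above, plus a Puerto Rico row
# added here for illustration (assumption: it can appear in state:* results).
rows = [
    ["P0010001", "NAME", "state"],
    ["4779736", "Alabama", "01"],
    ["710231", "Alaska", "02"],
    ["6392017", "Arizona", "04"],
    ["2915918", "Arkansas", "05"],
    ["3725789", "Puerto Rico", "72"],
]

# The first row holds the column names; the remaining rows are data.
states_df = pd.DataFrame(rows[1:], columns=rows[0])

# Every value arrives as a string, so cast before doing arithmetic.
states_df["P0010001"] = states_df["P0010001"].astype(int)

# Drop Puerto Rico (FIPS code 72) to keep only the states (and DC).
no_pr_df = states_df[states_df["state"] != "72"]

total = no_pr_df["P0010001"].sum()
```

Filtering on the FIPS code rather than on the name avoids depending on how the name is spelled in the response.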

In [4]:
c.sf1.get(('NAME', 'P0010001'), {'for': 'state:%s' % states.CA.fips})


Out[4]:
[{u'NAME': u'California', u'P0010001': u'37253956', u'state': u'06'}]
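The wrapper returns a list with one dict per geography, and the population is a string. A small sketch of pulling a number out of that result, using the value copied from Out[4] (the variable names are only for illustration):

```python
# Result copied from Out[4] above: a list with one dict per geography.
result = [{"NAME": "California", "P0010001": "37253956", "state": "06"}]

# Values come back as strings; convert the population before using it as a number.
ca = result[0]
ca_population = int(ca["P0010001"])

# A name -> population lookup built from the same kind of list.
pop_by_name = {row["NAME"]: int(row["P0010001"]) for row in result}
```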

Learning the Basics of NumPy and Pandas

Day_06_D_Assignment.ipynb: exercise to write a generator for Census Places (answer: Day_06_E_Assignment_Answers.ipynb)


In [9]:
# You should understand how this works.

import pandas as pd
from pandas import DataFrame

import census
import settings
import us

from itertools import islice

c=census.Census(settings.CENSUS_KEY)

def places(variables="NAME"):
    
    for state in us.states.STATES:
        print state
        geo = {'for':'place:*', 'in':'state:{s_fips}'.format(s_fips=state.fips)}
        for place in c.sf1.get(variables, geo=geo):
            yield place

r = list(islice(places("NAME,P0010001"), None))
places_df = DataFrame(r)
places_df.P0010001 = places_df.P0010001.astype('int')

places_df['FIPS'] = places_df.apply(lambda s: s['state']+s['place'], axis=1)

# print "number of places", len(places_df)
# print "total pop", places_df.P0010001.sum()
# places_df.head()

assert places_df.P0010001.sum() == 228457238
# number of places in 2010 Census
assert len(places_df) == 29261
# places_df


Alabama
Alaska
Arizona
Arkansas
California
Colorado
Connecticut
Delaware
District of Columbia
Florida
Georgia
Hawaii
Idaho
Illinois
Indiana
Iowa
Kansas
Kentucky
Louisiana
Maine
Maryland
Massachusetts
Michigan
Minnesota
Mississippi
Missouri
Montana
Nebraska
Nevada
New Hampshire
New Jersey
New Mexico
New York
North Carolina
North Dakota
Ohio
Oklahoma
Oregon
Pennsylvania
Rhode Island
South Carolina
South Dakota
Tennessee
Texas
Utah
Vermont
Virginia
Washington
West Virginia
Wisconsin
Wyoming
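To see why `places()` is written as a generator and why `islice` is handy with it, here is a sketch that avoids the Census API entirely. `fake_places` is a hypothetical stand-in that yields made-up records; only `itertools.islice` is the same as in the cell above.

```python
from itertools import islice

def fake_places():
    """Stand-in for places(): yields one dict per place, lazily."""
    # Hypothetical sample records; the real generator calls the Census API.
    for state_fips in ["01", "02", "04"]:
        for n in range(3):
            yield {"state": state_fips, "place": "%05d" % n}

# islice(gen, 4) pulls only the first four items from the generator,
# so the remaining states are never visited.
first_four = list(islice(fake_places(), 4))

# islice(gen, None) applies no limit, so it consumes the whole generator.
everything = list(islice(fake_places(), None))
```

Passing a small number instead of `None` is a cheap way to test the pipeline on a few places before requesting all of them.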

Apply and lambda functions

apply + lambda functions: Day_06_A_Apply_Lambda.ipynb
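As a reminder of the pattern, here is a minimal sketch of `apply` with a `lambda` over rows, mirroring the `FIPS` column built for `places_df` earlier. The rows below are hypothetical placeholders, not real place codes.

```python
import pandas as pd

# Hypothetical rows shaped like the records in places_df.
df = pd.DataFrame({
    "NAME": ["Place A", "Place B"],
    "state": ["06", "36"],
    "place": ["00001", "00002"],
})

# axis=1 passes each row to the lambda as a Series.
df["FIPS"] = df.apply(lambda row: row["state"] + row["place"], axis=1)

# For simple string concatenation, the vectorized form gives the same result.
fips_vectorized = df["state"] + df["place"]
```

`apply` is the more general tool; the vectorized form is usually faster when a column-wise operation exists.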

P005* variables in the census

http://www.census.gov/developers/data/sf1.xml

compare to http://www.census.gov/prod/cen2010/briefs/c2010br-02.pdf

I think P0050001 might be the key category:

  • P0010001 = P0050001
  • P0050001 = P0050002 + P0050010

P0050002 Not Hispanic or Latino (total) =

  • P0050003 Not Hispanic White only
  • P0050004 Not Hispanic Black only
  • P0050006 Not Hispanic Asian only
  • Not Hispanic Other (should also equal P0050002 - (P0050003 + P0050004 + P0050006))

    • P0050005 Not Hispanic: American Indian and Alaska Native alone
    • P0050007 Not Hispanic: Native Hawaiian and Other Pacific Islander alone
    • P0050008 Not Hispanic: Some Other Race alone
    • P0050009 Not Hispanic: Two or More Races
  • P0050010 Hispanic or Latino

P0050010 = P0050011 + ... + P0050017 (the sum of P0050011 through P0050017)
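The relationships among the P005* variables can be checked with plain arithmetic. The sketch below uses made-up counts for one hypothetical geography (not real Census figures) and computes the "Other" category both ways described above.

```python
# Made-up counts for one hypothetical geography (not real Census figures).
row = {
    "P0050001": 1000,  # total population
    "P0050002": 800,   # not Hispanic or Latino (total)
    "P0050003": 500,   # not Hispanic: White alone
    "P0050004": 120,   # not Hispanic: Black alone
    "P0050005": 10,    # not Hispanic: American Indian and Alaska Native alone
    "P0050006": 130,   # not Hispanic: Asian alone
    "P0050007": 5,     # not Hispanic: Native Hawaiian and Other Pacific Islander alone
    "P0050008": 5,     # not Hispanic: Some Other Race alone
    "P0050009": 30,    # not Hispanic: Two or More Races
    "P0050010": 200,   # Hispanic or Latino
}

# "Other" computed by subtraction from the non-Hispanic total ...
other_by_subtraction = row["P0050002"] - (
    row["P0050003"] + row["P0050004"] + row["P0050006"]
)

# ... and by adding up the remaining non-Hispanic categories.
other_by_sum = sum(
    row[k] for k in ["P0050005", "P0050007", "P0050008", "P0050009"]
)

# The two top-level pieces should add up to the total.
total_check = row["P0050002"] + row["P0050010"]
```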

"Whites are coded as blue; African-Americans, green; Asians, red; Hispanics, orange; and all other racial categories are coded as brown."

Census has lots of interesting data (optional)

Day_07_E_Census_fields.ipynb is an exploration of the concepts and variables in the 2010 Census.

Groupby

Day_07_F_Groupby.ipynb: gives you background on how to understand and use groupby in Pandas. Don't miss AJ's Day_10_Groupby_Examples.ipynb, which should be helpful, especially if you found Day_07_F_Groupby.ipynb obscure.
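As a quick refresher on the split-apply-combine idea behind groupby, here is a minimal sketch with a hypothetical place-level table (made-up populations): rows are split by state, and an aggregate is computed for each group.

```python
import pandas as pd

# Hypothetical place-level rows (made-up populations).
df = pd.DataFrame({
    "state": ["06", "06", "36", "36", "36"],
    "P0010001": [100, 200, 10, 20, 30],
})

# Split by state, then sum the population within each group.
by_state = df.groupby("state")["P0010001"].sum()

# size() counts the rows (places) in each group.
counts = df.groupby("state").size()
```

The result of each aggregation is indexed by the grouping key, so `by_state["06"]` looks up one state's total.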

Census Metro Diversity Exercise

Day_07_G_Calculating_Diversity.ipynb: a prelude to the big diversity-calculation assignment Day_08_A_Metro_Diversity.ipynb
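One common way to put a number on diversity is the Shannon entropy of the group shares. The sketch below is only an illustration of that idea, and the notebooks may define diversity differently. The five counts stand for the five Racial Dot Map categories and are made up.

```python
from math import log

def entropy(counts):
    """Shannon entropy (natural log) of the shares implied by counts."""
    total = float(sum(counts))
    # Skip empty groups: the limit of p * log(p) as p -> 0 is 0.
    shares = [c / total for c in counts if c > 0]
    return -sum(p * log(p) for p in shares)

# Made-up counts for White, Black, Asian, Hispanic, Other.
one_group_only = entropy([1000, 0, 0, 0, 0])        # no diversity
evenly_split = entropy([200, 200, 200, 200, 200])   # maximum for five groups
```

Entropy is 0 when everyone falls in one group and reaches its maximum, log(5) for five groups, when the groups are equal in size.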

Projects

not a focal point for the midterm (though, of course, it's good for projects to be in the background of your thinking)

Relevant references:

Plotting and Mapping preparation

I will assume that you've read Chapter 8 of PfDA and can run Day_11_B_Setting_Up_for_PfDA.ipynb.

Study the overview slide: Day 12: Overview of Plotting Options.

Note some fundamental conceptual aspects of matplotlib (as I outline in Day_12_A_Matplotlib_Intro.ipynb) and try to make basic plots on your own (line plots, scatter plots, bar plots).
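If you want a starting point for the three basic plot types, here is a minimal sketch using matplotlib's figure and axes objects. The data are arbitrary, and the `Agg` backend line is only there so the code runs outside a notebook; inside a notebook you would normally omit it.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit inside a notebook
import matplotlib.pyplot as plt

# Arbitrary sample data.
x = [1, 2, 3, 4]
y = [1, 4, 9, 16]

# One figure containing three axes side by side.
fig, axes = plt.subplots(1, 3, figsize=(9, 3))

axes[0].plot(x, y)       # line plot
axes[0].set_title("line")

axes[1].scatter(x, y)    # scatter plot
axes[1].set_title("scatter")

axes[2].bar(x, y)        # bar plot
axes[2].set_title("bar")

fig.tight_layout()
```

Working with explicit `fig` and `axes` objects, rather than only the `plt` functions, makes it clearer which plot each call modifies.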

Baby Names

Day_12_B_Baby_Names_Starter.ipynb#Names-that-are-both-M-and-F

Before you use Day_13_C_Baby_Names_MF_Completed.ipynb, try the approach in Day_13_B_Baby_Names_MF_Starter.ipynb
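One possible way to find names recorded under both sexes is to pivot births by name and sex, then keep the names that have a value in both columns. This is a sketch, not necessarily the approach in the completed notebook. The column names (`name`, `sex`, `births`) are assumed to follow the layout used in PfDA, and the records are made up.

```python
import pandas as pd

# Made-up records in a (name, sex, births) layout; not real baby-name counts.
names = pd.DataFrame({
    "name": ["Alex", "Alex", "Mary", "John", "Jordan", "Jordan"],
    "sex": ["F", "M", "F", "M", "F", "M"],
    "births": [40, 60, 100, 90, 30, 70],
})

# One row per name, one column per sex; missing combinations become NaN.
by_name_sex = names.pivot_table(
    index="name", columns="sex", values="births", aggfunc="sum"
)

# Names with births recorded under both F and M have no NaN in their row.
both = by_name_sex.dropna()
ambi_names = sorted(both.index)
```

From `both` you can go further, for example by computing each name's share of female births to see how evenly it is split.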

Assignment in nbviewer.ipython.org/github/rdhyee/working-open-data-2014/blob/master/notebooks/Day_13_B_Baby_Names_MF_Starter.ipynb:

Submit a notebook that describes what you've learned about the nature of ambigendered names in the baby names database. (Due date: originally Monday, March 10; now Wed, March 12 at 11:5pm --> bCourses assignment) I'm interested in seeing what you do with the data set in this regard. At a minimum, show that you are able to run Day_13_C_Baby_Names_MF_Completed. Be creative and have fun.


In [5]: