Welcome to the first project in Data 8! We will be exploring possible connections between water usage, geography, and income in California. The water data for this project was procured from the California State Water Resources Control Board and curated by the Pacific Institute. The map data includes US topography, California counties, and ZIP codes.
The dataset on income comes from the IRS (documented here). We have identified some interesting columns in the dataset, but a full description of all the columns (and a definition of the population in the dataset and some interesting anonymization procedures they used) is available here.
While collaboration is encouraged on this and other assignments, sharing answers is never okay. In particular, posting code or other assignment answers publicly on Piazza (or elsewhere) is academic dishonesty. It will result in a reduced project grade at a minimum. If you wish to ask a question and include code, you must make it a private post.
You may complete the project with up to one partner. Partnerships are an exception to the rule against sharing answers. If you have a partner, one person in the partnership should submit your project on Gradescope and include the other partner in the submission. (Gradescope will prompt you to fill this in.)
Your partner must be in your lab section. You can ask your TA to pair you with someone from your lab if you’re unable to find a partner. (That will happen in lab the week the project comes out.)
Part of the project will be due early. Parts 1 and 2 of the project (out of 3) are due Tuesday, September 27th at 5PM. Unlike the final submission, this early checkpoint will be graded for completion. It will be worth approximately 10% of the total project grade. Simply submit your partially-completed notebook as a PDF, as you would submit any other notebook. (See the note above on submitting with a partner.)
The entire project (parts 1, 2, and 3) will be due Tuesday, October 4 at 5PM. (Again, see the note above on submitting with a partner.)
As usual, run the cell below to prepare the automatic tests. Passing the automatic tests does not guarantee full credit on any question. The tests are provided to help catch some common errors, but it is your responsibility to answer the questions correctly.
In [ ]:
# Run this cell, but please don't change it.
import numpy as np
import math
from datascience import *
# These lines set up the plotting functionality and formatting.
import matplotlib
matplotlib.use('Agg', warn=False)
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
# These lines load the tests.
from client.api.assignment import load_assignment
tests = load_assignment('project1.ok')
First, load the data. Loading may take some time.
In [ ]:
# Run this cell, but please don't change it.
districts = Map.read_geojson('water_districts.geojson')
zips = Map.read_geojson('ca_zips.geojson.gz')
usage_raw = Table.read_table('water_usage.csv', dtype={'pwsid': str})
income_raw = Table.read_table('ca_income_by_zip.csv', dtype={'ZIP': str}).drop('STATEFIPS', 'STATE', 'agi_stub')
wd_vs_zip = Table.read_table('wd_vs_zip.csv', dtype={'PWSID': str, 'ZIP': str}).set_format(make_array(2, 3), PercentFormatter)
The districts
and zips
data sets are Map
objects. Documentation on mapping in the datascience
package can be found at data8.org/datascience/maps.html. To view a map of California's water districts, run the cell below. Click on a district to see its description.
In [ ]:
districts.format(width=400, height=200)
A Map
is a collection of regions and other features such as points and markers, each of which has a string id
and various properties. You can view the features of the districts
map as a table using Table.from_records
.
In [ ]:
district_table = Table.from_records(districts.features)
district_table.show(3)
To display a Map
containing only two features from the district_table
, call Map
on an array containing those two features from the feature
column.
Question 1.1. Draw a map of the Alameda County Water District (row 0) and the East Bay Municipal Utilities District (row 2).
In [ ]:
# Fill in the next line so the last line draws a map of those two districts.
alameda_and_east_bay = ...
Map(alameda_and_east_bay, height=300, width=300)
In [ ]:
_ = tests.grade('q11')
Hint: If scrolling becomes slow on your computer, you can clear maps for the cells above by running Cell > All Output > Clear
from the Cell
menu.
Let's look at the income_raw
table, which comes from the IRS. We're going to link this information about incomes to our information about water. First, we need to investigate the income data and get it into a more usable form.
In [ ]:
income_raw
Some observations:
'N02650'
is the number of returns that included a total income amount, and 'A02650'
is the total amount of total income (in thousands of dollars) from those returns.For the analysis we're about to do, we won't need to use the information about tax brackets. We will need to know the total income, total number of returns, and other totals from each ZIP code.
Question 2.1. Assign the name income_by_zipcode
to a table with just one row per ZIP code. When you group according to ZIP code, the remaining columns should be summed. In other words, for any other column such as 'N02650'
, the value of 'N02650'
in a row corresponding to ZIP code 90210 (for example) should be the sum of the values of 'N02650'
in the 6 rows of income_raw
corresponding to ZIP code 90210.
In [ ]:
income_by_zipcode = ...
income_by_zipcode
In [ ]:
_ = tests.grade('q21')
Your income_by_zipcode
table probably has column names like N1 sum
, which looks a little weird.
Question 2.2. Relabel the columns in income_by_zipcode
to match the labels in income_raw
Hint: Inspect income_raw.labels
and income_by_zipcode.labels
to find the differences you need to change.
Hint 2: Since there are many columns, it will be easier to relabel each of them by using a for
statement. See Chapter 8 of the textbook for details.
Hint 3: You can use the replace
method of a string to remove excess content. See lab02 for examples.
Hint 4: To create a new table from an existing table with one label replaced, use relabeled
. To change a label in an existing table permanently, use relabel
. Both methods take two arguments: the old label and the new label. You can solve this problem with either one, but relabel
is simpler.
In [ ]:
...
...
income_by_zipcode
In [ ]:
_ = tests.grade('q22')
Question 2.3.
Create a table called income
with one row per ZIP code and the following columns.
ZIP
column with the same contents as 'ZIP'
from income_by_zipcode
.num returns
column containing the total number of tax returns that include a total income amount (column 'N02650'
from income_by_zipcode
).total income ($)
column containing the total income in all tax returns in thousands of dollars (column 'A02650'
from income_by_zipcode
).num farmers
column containing the number of farmer returns (column 'SCHF'
from income_by_zipcode
).
In [ ]:
income = Table().with_columns(
...
...
...
...
)
income.set_format('total income ($)', NumberFormatter(0)).show(5)
In [ ]:
_ = tests.grade('q23')
Question 2.4. All ZIP codes with less than 100 returns (or some other special conditions) are grouped together into one ZIP code with a special code. Remove the row for that ZIP code from the income
table.
Hint 1: This ZIP code value has far more returns than any of the other ZIP codes. Try using group
and sort
to find it.
Hint 2: To remove a row in the income
table using where
, assign income
to the smaller table using the following expression structure:
income = income.where(...)
Hint 3: Each ZIP code is represented as a string, not an int
.
In [ ]:
income = ...
In [ ]:
_ = tests.grade('q24')
Because each ZIP code has a different number of people, computing the average income across several ZIP codes requires some care. This will come up several times in this project. Here is a simple example:
Question 2.5 Among all the tax returns that
what is the average total income? Express the answer in dollars as an int
rounded to the nearest dollar.
In [ ]:
# Our solution took several lines of code.
average_income = ...
average_income
In [ ]:
_ = tests.grade('q25')
Question 2.6. Among all California tax returns that include a total income amount, what is the average total income? Express the answer in dollars as an int
rounded to the nearest dollar.
In [ ]:
avg_total = ...
avg_total
Farms use water, so it's plausible that farming is an important factor in water usage. Here, we will check for a relationship between farming and income.
Among the tax returns in California for ZIP codes represented in the incomes
table, is there an association between income and living in a ZIP code with a lot of farmers?
We'll try to answer the question in 3 ways.
Question 2.7. Make a scatter plot with one point for each ZIP code. Display the average income in dollars on the vertical axis and the proportion of returns that are from farmers on the horizontal axis.
In [ ]:
# Write code to make a scatter plot here.
...
Question 2.8. From the graph, can you say whether ZIP codes with more farmers typically have lower or higher average income than ZIP codes with few or no farmers? Can you say how much lower or higher?
Write your answer here, replacing this text.
Question 2.9. Compare the average incomes for two groups of tax returns: those in ZIP codes with a greater-than-average proportion of farmers and those in ZIP codes with a less-than-average (or average) proportion. Make sure both of these values are displayed (preferably in a table). Then, describe your findings.
Hint: Make sure your result correctly accounts for the number of tax returns in each ZIP code, as in questions 2.5 and 2.6.
In [ ]:
# Build and display a table with two rows:
# 1) incomes of returns in ZIP codes with a greater-than-average proportion of farmers
# 2) incomes of returns in other ZIP codes
Write your answer here, replacing this text.
Question 2.10. The graph below displays two histograms: the distribution of average incomes of ZIP codes that have above-average proportions of farmers, and that of ZIP codes with below-average proportions of farmers.
Are ZIP codes with below-average proportions of farmers more or less likely to have very low incomes? Explain how your answer is consistent with your answer to question 2.8.
Write your answer here, replacing this text.
ZIP codes cover all the land in California and do not overlap. Here's a map of all of them.
Question 2.11. Among the ZIP codes represented in the incomes
table, is there an association between high average income and some aspect of the ZIP code's location? If so, describe one aspect of the location that is clearly associated with high income.
Answer the question by drawing a map of all ZIP codes that have an average income above 100,000 dollars. Then, describe an association that you observe.
In order to create a map of certain ZIP codes, you need to:
high_average_zips
.high_average_zips
with the zip_features
table to find the region for each ZIP code of interest.Map(...)
on the column of features (provided).
In [ ]:
# Write code to draw a map of only the high-income ZIP codes.
# We have filled in some of it and suggested names for variables
# you might want to define.
zip_features = Table.from_records(zips.features)
high_average_zips = ...
high_zips_with_region = ...
Map(high_zips_with_region.column('feature'), width=400, height=300)
Write your answer here, replacing this text.
We will now investigate water usage in California. The usage
table contains three columns:
PWSID
: The Public Water Supply Identifier of the districtPopulation
: Estimate of average population served in 2015Water
: Average residential water use (gallons per person per day) in 2014-2015
In [ ]:
# Run this cell to create the usage table.
usage_raw.set_format(4, NumberFormatter)
max_pop = usage_raw.select(0, 'population').group(0, max).relabeled(1, 'Population')
avg_water = usage_raw.select(0, 'res_gpcd').group(0, np.mean).relabeled(1, 'Water')
usage = max_pop.join('pwsid', avg_water).relabeled(0, 'PWSID')
usage
Question 3.1. Draw a map of the water districts, colored by the per capita water usage in each district.
Use the districts.color(...)
method to generate the map. It takes as its first argument a two-column table with one row per district that has the district PWSID
as its first column. The label of the second column is used in the legend of the map, and the values are used to color each region.
In [ ]:
# We have filled in the call to districts.color(...). Set per_capita_usage
# to an appropriate table so that a map of all the water districts is
# displayed.
per_capita_usage = ...
districts.color(per_capita_usage, key_on='feature.properties.PWSID')
In [ ]:
_ = tests.grade('q31')
Question 3.2. Based on the map above, which part of California appears to use more water per person: the San Francisco area or the Los Angeles area?
Write your answer here, replacing this text.
Next, we will try to match each ZIP code with a water district. ZIP code boundaries do not always line up with water districts, and one water district often covers multiple ZIP codes, so this process is imprecise. It is even the case that some water districts overlap each other. Nonetheless, we can continue our analysis by matching each ZIP code to the water district with the largest geographic overlap.
The table wd_vs_zip
describes the proportion of land in each ZIP code that is contained in each water district and vice versa. (The proportions are approximate because they do not correctly account for discontiguous districts, but they're mostly accurate.)
In [ ]:
wd_vs_zip.show(5)
Question 3.3. Complete the district_for_zip
function that takes a ZIP code as its argument. It returns the PWSID with the largest value of ZIP in District
for that zip_code
, if that value is at least 50%. Otherwise, it returns the string 'No District'
.
In [ ]:
def district_for_zip(zip_code):
zip_code = str(zip_code) # Ensure that the ZIP code is a string, not an integer
districts = ...
at_least_half = ...
if at_least_half:
...
else:
return 'No District'
district_for_zip(94709)
In [ ]:
_ = tests.grade('q33')
This function can be used to associate each ZIP code in the income
table with a PWSID
and discard ZIP codes that do not lie (mostly) in a water district.
In [ ]:
zip_pwsids = income.apply(district_for_zip, 'ZIP')
income_with_pwsid = income.with_column('PWSID', zip_pwsids).where('PWSID', are.not_equal_to("No District"))
income_with_pwsid.set_format(2, NumberFormatter(0)).show(5)
Question 3.4. Create a table called district_data
with one row per PWSID and the following columns:
PWSID
: The ID of the districtPopulation
: Population estimateWater
: Average residential water use (gallons per person per day) in 2014-2015Income
: Average income in dollars of all tax returns in ZIP codes that are (mostly) contained in the district according to income_with_pwsid
.Hint: First create a district_income
table that sums the incomes and returns for ZIP codes in each water district.
In [ ]:
district_income = ...
district_data = ...
district_data.set_format(make_array('Population', 'Water', 'Income'), NumberFormatter(0))
In [ ]:
_ = tests.grade('q34')
Question 3.5. The bay_districts
table gives the names of all water districts in the San Francisco Bay Area. Is there an association between water usage and income among Bay Area water districts? Use the tables you have created to compare water usage between the 10 Bay Area water districts with the highest average income and the rest of the Bay Area districts, then describe the association. Do not include any districts in your analysis for which you do not have income information.
The names below are just suggestions; you may perform the analysis in any way you wish.
Note: Some Bay Area water districts may not appear in your district_data
table. That's ok. Perform your analysis only on the subset of districts where you have both water usage & income information.
In [ ]:
bay_districts = Table.read_table('bay_districts.csv')
bay_water_vs_income = ...
top_10 = ...
...
Complete this one-sentence conclusion: In the Bay Area, people in the top 10 highest-income water districts used an average of _________
more gallons of water per person per day than people in the rest of the districts.
Question 3.6. In one paragraph, summarize what you have discovered through the analyses in this project and suggest what analysis should be conducted next to better understand California water usage, income, and geography. What additional data would be helpful in performing this next analysis?
Replace this line with your conclusion paragraph.
Congratulations - you've finished Project 1 of Data 8!
To submit:
Run All
from the Cell
menu to ensure that you have executed all cells, including the test cells. Make sure that the visualizations you create are actually displayed.Download as PDF via LaTeX (.pdf)
from the File
menu. (Sometimes that seems to fail. If it does, you can download as HTML, open the .html file in your browser, and print it to a PDF.)project1.pdf
) to Gradescope.If you cannot submit online, come to office hours for assistance. The office hours schedule appears on data8.org/weekly.
In [ ]:
# For your convenience, you can run this cell to run all the tests at once!
import os
_ = [tests.grade(q[:-3]) for q in os.listdir("tests") if q.startswith('q')]
If you want, draw some more maps below.
In [ ]:
# Your extensions here (completely optional)