This notebook contains the first homework for this class, and is due on Friday, October 23rd, 2016 at 11:59 p.m.. Please make sure to get started early, and come by the instructors' office hours if you have any questions. Office hours and locations can be found in the course syllabus. IMPORTANT: While it's fine if you talk to other people in class about this homework - and in fact we encourage it! - you are responsible for creating the solutions for this homework on your own, and each student must submit their own homework assignment.
Some links that you may find helpful:
Put your name here!
Look around online and find a model that you think is interesting. This model can be of anything that you are intrigued by, and as simple or complicated as you want. Write a paragraph or two describing the model, and identifying the components of the model that we talked about in class - the model's inputs, inner workings, outputs, and how one might decide if this is a good model or not. Make sure to cite your sources by providing links to the web pages that you looked at to create this description. You can either just paste the URL into the cell below, or do something prettier, like this: google. The syntax for that second one is [google](http://google.com)
.
ANSWER: There's not really a "solution" to this problem, per se. We want to see that students have found an actual model online (as opposed to something else that may have inputs and outputs and so on, but which is not a model). The model can be descriptive (i.e., carbon cycle), predictive (black hole mergers and gravitational waves), or some sort of statistical model (i.e., fantasy baseball-type modeling of player performance).
Part 1. Consider this: What volume of gasoline does a typical automobile (car, SUV, or pickup) use during its entire lifetime? How does the weight of the total fuel consumed compare to the weight of the car? How about the price of the fuel compared to the price of the car?
Come up with a simple order-of-magnitude approximation for each of those three questions, and in the cell below this one write a paragraph or two addressing each of the questions above. What are the factors you need to consider? What range of values might they have? In what way is your estimate limited? (Also, to add a twist: does it matter what type of car you choose?)
Note: if you use a Google search or two to figure out what range of values you might want to use, include links to the relevant web page(s). As described above, you can either just paste the URL, or do something prettier, like this: google. The syntax for that second one is [google](http://google.com)
.
ANSWER:
We want to figure out how much gas is consumed by a typical car during its lifetime, and then answer some questions relating to that. To figure this out, we need to know the volume of a gallon of gas ($V_{1 gal}$), the mass of a gallon of gas ($M_{1 gal}$), the cost of a gallon of gas ($C_{1 gal}$), the fuel efficiency of the car in question ($M_{PG}$), and the number of miles that a car is tyipcally driven ($M_{drive}$). In that case, the model would be:
$V_{gas} = V_{1 gal} * N_{gals} = V_{1 gal} * \frac{M_{drive}}{M_{PG}}$
and for the total mass of gas:
$M_{gas} = M_{1 gal} * N_{gals} = M_{1 gal} * \frac{M_{drive}}{M_{PG}}$
and the cost for all of the gas:
$C_{gas} = C_{1 gal} * N_{gals} = C_{1 gal} * \frac{M_{drive}}{M_{PG}}$
The various quantities we need are:
This estimate is limited because the cost of gasoline varies over time and fuel efficiency also varies over time with weather and also with the age and maintenance of the car. It's also somewhat hard to tell how long a given car is driven, because cars are often resold.
Part 2. In the space below, write a Python program to model the answer to all three of those questions, and keep track of the answers in a numpy array. Plot your answers to both questions in some convenient way (probably not a scatter plot - look at the matplotlib gallery for inspiration!). Do the answers you get make sense to you?
In [ ]:
# write any code you need here!
# Create additional cells if you need them by using the
# 'Insert' menu at the top of the browser window.
import numpy as np
C_gas = [2.0, 3.0, 4.0, 5.0]
M_drive = [1.0e+5, 2.0e+5, 3.0e+5, 4.0e+5, 5.0e+5]
M_pg = [8,15,25,35,45,60]
V1g = 0.003785 # in m^3
M1g = 2.9 # in kg
gals_gas = []
cost_gas = []
for Md in M_drive:
for Mpg in M_pg:
gals_gas.append(Md/Mpg)
for Cg in C_gas:
cost_gas.append(Cg*Md/Mpg)
gals_gas = np.array(gals_gas)
cost_gas = np.array(cost_gas)
vol_gas = gals_gas * V1g
mass_gas = gals_gas * M1g
In [ ]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.hist(vol_gas,bins=20)
plt.title("Volume of gas")
plt.xlabel("Volume of gas [cubic meters]")
plt.ylabel("number")
In [ ]:
plt.hist(mass_gas,bins=20)
plt.title("Mass of gas")
plt.xlabel("Mass [kg]")
plt.ylabel("number")
In [ ]:
plt.hist(cost_gas,bins=20)
plt.title("Cost of gas")
plt.xlabel("Cost [dollars]")
plt.ylabel("number")
Answers to questions (based on this model):
It does matter what type of car I choose - smaller, more fuel efficient cars tend to have better mileage, which radically affects the other values.
This is real data. And, you're some of the first people with the chance to analyze it and make your results public.
We want you to create a new Jupyter notebook to answer this question (which will be uploaded separately, as a second document to this notebook) that we can post publicly, to the world, on the CMSE Department Homepage.
Your presentation should try to answer the following questions:
pandas
.
In [ ]:
# THIS CELL READS IN THE FLINT DATASET - DO NOT CHANGE ANYTHING!
# Make plots inline
%matplotlib inline
# Make inline plots vector graphics instead of raster graphics
from IPython.display import set_matplotlib_formats
set_matplotlib_formats('pdf', 'svg')
# import modules for plotting and data analysis
import matplotlib.pyplot # as plt
import pandas
import numpy as np
import functools
def add_bottle_id_column(data, key_name):
data['bottleID'] = np.repeat(key_name, data.shape[0])
return data
'''
Loads the flint water quality dataset from the spreadsheet.
This manipulation is necessary because (1) the data is in a spreadsheet
rather than a CSV file or something else, and (2) the data is spread out
across multiple sheets in the spreadsheet.
'''
def load_flint_water_data():
flint_water_data = pandas.read_excel(
# NOTE: uncomment the following line and comment out the one after that if
# you have problems getting this to run on a Windows machine.
#io = “https://github.com/ComputationalModeling/flint-water-data/raw/f6093bba145b1745b68bac2964b341fa30f3a08a/Flint%20Lead%20Kits%20ICP%20Data.xlsx”,
io = "Flint_Lead_Kits_ICP_Data.xlsx",
sheetname = [
"Sub_B1-8.15",
"Sub_B2-8.15",
"Sub_B3-8.15",
"Sub_B1-3.16",
"Sub_B2-3.16",
"Sub_B3-3.16",
"Sub_B1-7.16",
"Sub_B2-7.16",
"Sub_B3-7.16"],
header = 0,
skiprows = 3,
names = [
"Sample",
"208Pb",
"",
"23Na",
"25Mg",
"27Al",
"28Si",
"31P",
"PO4",
"34S",
"35Cl",
"39K",
"43Ca",
"47Ti",
"51V",
"52Cr",
"54Fe",
"55Mn",
"59Co",
"60Ni",
"65Cu",
"66Zn",
"75As",
"78Se",
"88Sr",
"95Mo",
"107Ag",
"111Cd",
"112Sn",
"137Ba",
"238U"
]
)
data_with_id = [
add_bottle_id_column(value, key)
for key, value
in flint_water_data.items()]
# collapse dataframes into one long dataframe
flint_water_data = functools.reduce(lambda x,y: x.append(y), data_with_id)
return flint_water_data
def add_date_and_bottle_number(flint_data):
flint_data['bottle_number'] = flint_data['bottleID'].apply(lambda x: x.split('-')[0])
flint_data['date_collected'] = flint_data['bottleID'].apply(lambda x: x.split('-')[1])
return(flint_data)
bottle_map = {
'Sub_B1': 'bottle1',
'Sub_B2': 'bottle2',
'Sub_B3': 'bottle3'
}
date_map = {
'8.15': '2015-08-01',
'3.16': '2016-03-01',
'7.16': '2016-07-01'
}
flint_data = load_flint_water_data()
flint_data = add_date_and_bottle_number(flint_data)
flint_data = flint_data.replace(
{'bottle_number': bottle_map,
'date_collected': date_map })
flint_data['date_collected'] = pandas.DatetimeIndex(flint_data['date_collected'])
flint_data = flint_data.drop('bottleID', axis = 1)
# the end result is that you have a data frame called "flint_data"
In [ ]:
flint_data.columns
In [ ]:
# break data into individual days
aug15 = flint_data[flint_data['date_collected'] == '2015-08-01']
mar16 = flint_data[flint_data['date_collected'] == '2016-03-01']
jul16 = flint_data[flint_data['date_collected'] == '2016-07-01']
In [ ]:
# break data into individual bottles (for each day)
# note: data is in ppb, and we want to think about it in mg/L.
# 1 part per million = 1 mg/L, so 1000 ppb = 1 mg/L.
# The EPA action limit is 0.015 mg/L, or 15 ppb.
aug15_b1 = aug15[aug15['bottle_number']=='bottle1']
aug15_b2 = aug15[aug15['bottle_number']=='bottle2']
aug15_b3 = aug15[aug15['bottle_number']=='bottle3']
mar16_b1 = mar16[mar16['bottle_number']=='bottle1']
mar16_b2 = mar16[mar16['bottle_number']=='bottle2']
mar16_b3 = mar16[mar16['bottle_number']=='bottle3']
jul16_b1 = jul16[jul16['bottle_number']=='bottle1']
jul16_b2 = jul16[jul16['bottle_number']=='bottle2']
jul16_b3 = jul16[jul16['bottle_number']=='bottle3']
Note: The raw data is in ppb, and we want to think about it in mg/L.
1 part per million = 1 mg/L, so 1000 ppb = 1 mg/L.
The EPA action limit is 0.015 mg/L, or 15 ppb.
Our strategy is to figure out what fraction of the houses are above the action limit for each day and for each bottle, and make a line plot of that showing the temporal evolution of the fraction of samples above the action limit.
In [ ]:
aug15_b1['208Pb'].describe()
In [ ]:
# list containing all 3 days
days = [1,2,3]
'''
What I do here is a very pithy way of describing things; what I'm doing is counting
the number of lead samples above 15 ppb for a given bottle on a given day, and then
dividing by the total number of lead samples for a given bottle on a given day.
'''
frac_aug15_b1 = aug15_b1['208Pb'][aug15_b1['208Pb'] > 15.0].count()/aug15_b1['208Pb'].count()
frac_aug15_b2 = aug15_b2['208Pb'][aug15_b2['208Pb'] > 15.0].count()/aug15_b2['208Pb'].count()
frac_aug15_b3 = aug15_b3['208Pb'][aug15_b3['208Pb'] > 15.0].count()/aug15_b3['208Pb'].count()
frac_mar16_b1 = mar16_b1['208Pb'][mar16_b1['208Pb'] > 15.0].count()/mar16_b1['208Pb'].count()
frac_mar16_b2 = mar16_b2['208Pb'][mar16_b2['208Pb'] > 15.0].count()/mar16_b2['208Pb'].count()
frac_mar16_b3 = mar16_b3['208Pb'][mar16_b3['208Pb'] > 15.0].count()/mar16_b3['208Pb'].count()
frac_jul16_b1 = jul16_b1['208Pb'][jul16_b1['208Pb'] > 15.0].count()/jul16_b1['208Pb'].count()
frac_jul16_b2 = jul16_b2['208Pb'][jul16_b2['208Pb'] > 15.0].count()/jul16_b2['208Pb'].count()
frac_jul16_b3 = jul16_b3['208Pb'][jul16_b3['208Pb'] > 15.0].count()/jul16_b3['208Pb'].count()
# empty lists for bottle 1, 2, 3 evolution:
b1_evol = [frac_aug15_b1,frac_mar16_b1,frac_jul16_b1]
b2_evol = [frac_aug15_b2,frac_mar16_b2,frac_jul16_b2]
b3_evol = [frac_aug15_b3,frac_mar16_b3,frac_jul16_b3]
In [ ]:
b1, = plt.plot(days,b1_evol,'bo--')
b2, = plt.plot(days,b2_evol,'ro--')
b3, = plt.plot(days,b3_evol,'go--')
key, = plt.plot([0.9,3.1],[0.1,0.1],'k-')
#plt.xlim(0.5,3.5)
plt.ylim(0.0,0.2)
plt.xlabel("Sample date")
plt.ylabel("Fraction of samples above limit")
plt.title("Fraction of samples above action limit")
plt.legend([b1,b2,b3,key],['Bottle 1', 'Bottle 2', 'Bottle 3', 'EPA cutoff'])
plt.xticks([1,2,3],['2015-08-01','2016-03-01','2016-07-01'])
plt.margins(0.3)
#plt.legend()
Figure 1: (above). This figure shows the fraction of samples above the EPA's "action limit" of 15 mg/L for each of the three days of sampling, for all three bottles sampled from each house. Bottle 1 typically has the highest lead value, bottle 2 the second-highest, and bottle 3 the third highest, which is what one would expect. In the first two dates, Bottle 1 has more than 10% of samples above the action limit, and in the third date it is slightly below the action limit. Overall, this indicates that the problem is getting slightly better over time, but is still worrisome.
In [ ]:
aug15_b1_000 = aug15_b1['208Pb'].quantile(0.0)/1000
aug15_b1_025 = aug15_b1['208Pb'].quantile(0.25)/1000
aug15_b1_050 = aug15_b1['208Pb'].quantile(0.5)/1000
aug15_b1_075 = aug15_b1['208Pb'].quantile(0.75)/1000
aug15_b1_090 = aug15_b1['208Pb'].quantile(0.90)/1000
aug15_b1_095 = aug15_b1['208Pb'].quantile(0.95)/1000
aug15_b1_100 = aug15_b1['208Pb'].quantile(1.0)/1000
mar16_b1_000 = mar16_b1['208Pb'].quantile(0.0)/1000
mar16_b1_025 = mar16_b1['208Pb'].quantile(0.25)/1000
mar16_b1_050 = mar16_b1['208Pb'].quantile(0.5)/1000
mar16_b1_075 = mar16_b1['208Pb'].quantile(0.75)/1000
mar16_b1_090 = mar16_b1['208Pb'].quantile(0.90)/1000
mar16_b1_095 = mar16_b1['208Pb'].quantile(0.95)/1000
mar16_b1_100 = mar16_b1['208Pb'].quantile(1.0)/1000
jul16_b1_000 = jul16_b1['208Pb'].quantile(0.0)/1000
jul16_b1_025 = jul16_b1['208Pb'].quantile(0.25)/1000
jul16_b1_050 = jul16_b1['208Pb'].quantile(0.5)/1000
jul16_b1_075 = jul16_b1['208Pb'].quantile(0.75)/1000
jul16_b1_090 = jul16_b1['208Pb'].quantile(0.90)/1000
jul16_b1_095 = jul16_b1['208Pb'].quantile(0.95)/1000
jul16_b1_100 = jul16_b1['208Pb'].quantile(1.0)/1000
b1_000 = [aug15_b1_000, mar16_b1_000, jul16_b1_000]
b1_025 = [aug15_b1_025, mar16_b1_025, jul16_b1_025]
b1_050 = [aug15_b1_050, mar16_b1_050, jul16_b1_050]
b1_075 = [aug15_b1_075, mar16_b1_075, jul16_b1_075]
b1_090 = [aug15_b1_090, mar16_b1_090, jul16_b1_090]
b1_095 = [aug15_b1_095, mar16_b1_095, jul16_b1_095]
b1_100 = [aug15_b1_100, mar16_b1_100, jul16_b1_100]
In [ ]:
b000, = plt.plot(days,b1_000,linewidth=2)
b025, = plt.plot(days,b1_025,linewidth=2)
b050, = plt.plot(days,b1_050,linewidth=2)
b075, = plt.plot(days,b1_075,linewidth=2)
b090, = plt.plot(days,b1_090,linewidth=2)
b095, = plt.plot(days,b1_095,linewidth=2)
b100, = plt.plot(days,b1_100,linewidth=2)
key, = plt.plot([0.9,3.1],[0.015,0.015],'k--')
plt.yscale('log')
plt.xlim(0.5,4.0)
plt.ylim(1.0e-4,3.0)
plt.xlabel("Sample date")
plt.ylabel("Lead content [mg/L]")
plt.title("Bottle 1 lead content (various data percentiles)")
plt.legend([b000,b025,b050,b075,b090,b095,b100,key],['min','25%','50%','75%','90%','95%','max','action'],loc='upper right')
plt.xticks([1,2,3],['2015-08-01','2016-03-01','2016-07-01'])
plt.margins(0.3)
Figure 2: (above). This figure shows the lead content in water samples from Bottle 1 at various percentiles in each of the three dates, which effectively shows how the distributions change over time. The various lines show where the minimim sample value is, 25th percentile, etc., all the way up to the maximum sample value. This shows that the maximum lead measurements for Bottle 1 are always really high, but that the situation overall seems to be improving (as indicated by the values for houses between the 50th and 95th percentile generally decreasing over time). Nearly 10% of the houses are still over the action limit as of July 2016, and probably around 8% are still over it.
Summary of results: Figures 1 and 2 (above) show that things were bad in August 2015 - around 16% of houses had lead limits above the EPA max limit of 0.015 mg/L for the first sample taken from each house, and quite a few of them vastly exceeded the EPA action limit (10% were at least 2x that limit, and 5% were 4x that limit). Over time the situation has gotten better - slightly less than 10% of the samples were over the limit as of July 2016 for the first sample taken from each house - but quite a few of the houses are still substantially over the EPA limit.
One could take the two figures shown to say that the problem is getting better and that nothing further should be done, because 10% of the houses are no longer above the EPA's action limit for lead. This is a reasonable argument, but doesn't consider the whole story - some houses still have extraordinarily high lead levels, and almost 10% of the houses DO have lead levels above the EPA action limit.
In [ ]:
from IPython.display import HTML
HTML(
"""
<iframe
src="https://docs.google.com/forms/d/e/1FAIpQLSd0yvuDR2XP5QhWHJZTZHgsSi84QAZU7x-C9NEA40y6NnArAA/viewform?embedded=true"
width="80%"
height="1200px"
frameborder="0"
marginheight="0"
marginwidth="0">
Loading...
</iframe>
"""
)