Homework #1

This notebook contains the first homework for this class, and is due on Friday, October 23rd, 2016 at 11:59 p.m.. Please make sure to get started early, and come by the instructors' office hours if you have any questions. Office hours and locations can be found in the course syllabus. IMPORTANT: While it's fine if you talk to other people in class about this homework - and in fact we encourage it! - you are responsible for creating the solutions for this homework on your own, and each student must submit their own homework assignment.

Some links that you may find helpful:

Your name

Put your name here!


Section 1: Find me a model, any model

Look around online and find a model that you think is interesting. This model can be of anything that you are intrigued by, and as simple or complicated as you want. Write a paragraph or two describing the model, and identifying the components of the model that we talked about in class - the model's inputs, inner workings, outputs, and how one might decide if this is a good model or not. Make sure to cite your sources by providing links to the web pages that you looked at to create this description. You can either just paste the URL into the cell below, or do something prettier, like this: google. The syntax for that second one is [google](http://google.com).

ANSWER: There's not really a "solution" to this problem, per se. We want to see that students have found an actual model online (as opposed to something else that may have inputs and outputs and so on, but which is not a model). The model can be descriptive (i.e., carbon cycle), predictive (black hole mergers and gravitational waves), or some sort of statistical model (i.e., fantasy baseball-type modeling of player performance).


Section 2: Car conundrum

Part 1. Consider this: What volume of gasoline does a typical automobile (car, SUV, or pickup) use during its entire lifetime? How does the weight of the total fuel consumed compare to the weight of the car? How about the price of the fuel compared to the price of the car?

Come up with a simple order-of-magnitude approximation for each of those three questions, and in the cell below this one write a paragraph or two addressing each of the questions above. What are the factors you need to consider? What range of values might they have? In what way is your estimate limited? (Also, to add a twist: does it matter what type of car you choose?)

Note: if you use a Google search or two to figure out what range of values you might want to use, include links to the relevant web page(s). As described above, you can either just paste the URL, or do something prettier, like this: google. The syntax for that second one is [google](http://google.com).

ANSWER:

We want to figure out how much gas is consumed by a typical car during its lifetime, and then answer some questions relating to that. To figure this out, we need to know the volume of a gallon of gas ($V_{1 gal}$), the mass of a gallon of gas ($M_{1 gal}$), the cost of a gallon of gas ($C_{1 gal}$), the fuel efficiency of the car in question ($M_{PG}$), and the number of miles that a car is tyipcally driven ($M_{drive}$). In that case, the model would be:

$V_{gas} = V_{1 gal} * N_{gals} = V_{1 gal} * \frac{M_{drive}}{M_{PG}}$

and for the total mass of gas:

$M_{gas} = M_{1 gal} * N_{gals} = M_{1 gal} * \frac{M_{drive}}{M_{PG}}$

and the cost for all of the gas:

$C_{gas} = C_{1 gal} * N_{gals} = C_{1 gal} * \frac{M_{drive}}{M_{PG}}$

The various quantities we need are:

  • $V_{1 gal}$ - 3785 cubic cm (via google)
  • $M_{1 gal}$ - 2.9 kg (via Google)
  • $C_{1 gal}$ - between \$2 - \$5 depending on where you live and when it is
  • $M_{drive}$ - between $10^5$ and $5 \times 10^5$ miles
  • $M_{PG}$ - between $8-60$ miles per gallon depending on the type of car you drive
  • Mass of car: between 1000-4000 kg depending on the type of car you drive (cars with poorer fuel efficiency tend to be more massive)

This estimate is limited because the cost of gasoline varies over time and fuel efficiency also varies over time with weather and also with the age and maintenance of the car. It's also somewhat hard to tell how long a given car is driven, because cars are often resold.

Part 2. In the space below, write a Python program to model the answer to all three of those questions, and keep track of the answers in a numpy array. Plot your answers to both questions in some convenient way (probably not a scatter plot - look at the matplotlib gallery for inspiration!). Do the answers you get make sense to you?


In [ ]:
# write any code you need here!  
# Create additional cells if you need them by using the 
# 'Insert' menu at the top of the browser window.
import numpy as np

C_gas = [2.0, 3.0, 4.0, 5.0]
M_drive = [1.0e+5, 2.0e+5, 3.0e+5, 4.0e+5, 5.0e+5]
M_pg = [8,15,25,35,45,60]

V1g = 0.003785  # in m^3
M1g = 2.9  # in kg

gals_gas = []
cost_gas = []

for Md in M_drive:
    for Mpg in M_pg:
        gals_gas.append(Md/Mpg)
        for Cg in C_gas:
            cost_gas.append(Cg*Md/Mpg)

gals_gas = np.array(gals_gas)  
cost_gas = np.array(cost_gas)

vol_gas = gals_gas * V1g
mass_gas = gals_gas * M1g

In [ ]:
%matplotlib inline
import matplotlib.pyplot as plt

plt.hist(vol_gas,bins=20)
plt.title("Volume of gas")
plt.xlabel("Volume of gas [cubic meters]")
plt.ylabel("number")

In [ ]:
plt.hist(mass_gas,bins=20)
plt.title("Mass of gas")
plt.xlabel("Mass [kg]")
plt.ylabel("number")

In [ ]:
plt.hist(cost_gas,bins=20)
plt.title("Cost of gas")
plt.xlabel("Cost [dollars]")
plt.ylabel("number")

Answers to questions (based on this model):

  • The volume of gas ranges from a few cubic meters to 200+, with a median value around 40-50 cubic meters. This is typically much more than the volome of the car itself.
  • The mass of gas ranges from a few thousand kg to 150,000+ kg, with a median value around 25,000 kg. This is typically much more than the mass of the car.
  • The cost of gas ranges from a few tens of thousands of dollars to much more than 300,000, with a median of somewhere in the 30-40,000 range. The price of the fuel is comparable to the price of the car, on average.

It does matter what type of car I choose - smaller, more fuel efficient cars tend to have better mileage, which radically affects the other values.


Section 3: Get the Lead Out (continued from your in-class assignment)

You're going to make a Jupyter Notebook. We'll feature our class's work on the CMSE Homepage

This is real data. And, you're some of the first people with the chance to analyze it and make your results public.

We want you to create a new Jupyter notebook to answer this question (which will be uploaded separately, as a second document to this notebook) that we can post publicly, to the world, on the CMSE Department Homepage.

Your Notebook Presentation Should Answer These Questions:

Your presentation should try to answer the following questions:

  1. How bad was the lead level situation in August, 2015 when the first lead levels were taken?
  2. How has the lead situation changed since August, 2015?
    1. Is it getting better? If so, show your readers and convince them
    2. Is it getting worse? If so, show your readers and convince them

How you answer the questions is up to you. But, remember to:

  1. State your positions clearly.
  2. Justify your positions with graphics, calculations, and written analysis to explain why you think what you think.
  3. Consider counterarguments. Could someone try to use the same data to arrive at a different conclusion than yours? If they could, explain that conclusion and (if appropriate) why you think that position is flawed.
  4. Do your best. Write as clearly as you can, use what you know, and don't confuse sizzle with quality. You don't need fancy pants visual and graphical animations to be persuasive. The humble scatterplot and its cousins the bar chart and histogram are extraordinarily powerful when you craft them carefully. And all of them are built-in to pandas.
  • The conclusions you draw matter. These are Flint resident's actual living conditions.
  • Any numerical conclusions you draw should be backed up by your code. If you say the average lead level was below EPA limits, you'll need to be able to back up that claim in your notebook either with graphical evidence or numerical evidence (calculations). So, make sure to show your (computational) work!
  • Your analysis is a check on the scientific community. The more eyes we have looking at this data and offering reproducible analyses (which Jupyter Notebooks are), the more confidence we can have in the data.
  • You may find other results online, but you still have to do your own analysis to decide whether you agree with their results.

In [ ]:
# THIS CELL READS IN THE FLINT DATASET - DO NOT CHANGE ANYTHING!

# Make plots inline
%matplotlib inline

# Make inline plots vector graphics instead of raster graphics
from IPython.display import set_matplotlib_formats
set_matplotlib_formats('pdf', 'svg')

# import modules for plotting and data analysis
import matplotlib.pyplot # as plt
import pandas
import numpy as np
import functools

def add_bottle_id_column(data, key_name):
    data['bottleID'] = np.repeat(key_name, data.shape[0])
    return data

'''
Loads the flint water quality dataset from the spreadsheet.  
This manipulation is necessary because (1) the data is in a spreadsheet
rather than a CSV file or something else, and (2) the data is spread out
across multiple sheets in the spreadsheet.
'''
def load_flint_water_data():
    flint_water_data = pandas.read_excel(
        # NOTE: uncomment the following line and comment out the one after that if
        # you have problems getting this to run on a Windows machine.
        #io = “https://github.com/ComputationalModeling/flint-water-data/raw/f6093bba145b1745b68bac2964b341fa30f3a08a/Flint%20Lead%20Kits%20ICP%20Data.xlsx”,
        io = "Flint_Lead_Kits_ICP_Data.xlsx",
        sheetname = [
            "Sub_B1-8.15",
            "Sub_B2-8.15",
            "Sub_B3-8.15",
            "Sub_B1-3.16",
            "Sub_B2-3.16",
            "Sub_B3-3.16",
            "Sub_B1-7.16",
            "Sub_B2-7.16",
            "Sub_B3-7.16"],
        header = 0,
        skiprows = 3,
        names = [
            "Sample",
            "208Pb",
            "",
            "23Na",
            "25Mg",
            "27Al",
            "28Si",
            "31P",
            "PO4",
            "34S",
            "35Cl",
            "39K",
            "43Ca",
            "47Ti",
            "51V",
            "52Cr",
            "54Fe",
            "55Mn",
            "59Co",
            "60Ni",
            "65Cu",
            "66Zn",
            "75As",
            "78Se",
            "88Sr",
            "95Mo",
            "107Ag",
            "111Cd",
            "112Sn",
            "137Ba",
            "238U"
        ]    
    )
    data_with_id = [
    add_bottle_id_column(value, key)
        for key, value
        in flint_water_data.items()]
    # collapse dataframes into one long dataframe
    flint_water_data = functools.reduce(lambda x,y: x.append(y), data_with_id)
    return flint_water_data

def add_date_and_bottle_number(flint_data): 
    flint_data['bottle_number'] = flint_data['bottleID'].apply(lambda x: x.split('-')[0])
    flint_data['date_collected'] = flint_data['bottleID'].apply(lambda x: x.split('-')[1])
    return(flint_data)

bottle_map = {
    'Sub_B1': 'bottle1',
    'Sub_B2': 'bottle2',
    'Sub_B3': 'bottle3'
}

date_map = {
    '8.15': '2015-08-01',
    '3.16': '2016-03-01',
    '7.16': '2016-07-01'
}


flint_data = load_flint_water_data()
flint_data = add_date_and_bottle_number(flint_data)
flint_data = flint_data.replace(
    {'bottle_number': bottle_map,
     'date_collected': date_map })
flint_data['date_collected'] = pandas.DatetimeIndex(flint_data['date_collected'])
flint_data = flint_data.drop('bottleID', axis = 1)

# the end result is that you have a data frame called "flint_data"

In [ ]:
flint_data.columns

In [ ]:
# break data into individual days

aug15 = flint_data[flint_data['date_collected'] == '2015-08-01']
mar16 = flint_data[flint_data['date_collected'] == '2016-03-01']
jul16 = flint_data[flint_data['date_collected'] == '2016-07-01']

In [ ]:
# break data into individual bottles (for each day)

# note: data is in ppb, and we want to think about it in mg/L.
# 1 part per million = 1 mg/L, so 1000 ppb = 1 mg/L.  
# The EPA action limit is 0.015 mg/L, or 15 ppb.

aug15_b1 = aug15[aug15['bottle_number']=='bottle1']
aug15_b2 = aug15[aug15['bottle_number']=='bottle2']
aug15_b3 = aug15[aug15['bottle_number']=='bottle3']

mar16_b1 = mar16[mar16['bottle_number']=='bottle1']
mar16_b2 = mar16[mar16['bottle_number']=='bottle2']
mar16_b3 = mar16[mar16['bottle_number']=='bottle3']

jul16_b1 = jul16[jul16['bottle_number']=='bottle1']
jul16_b2 = jul16[jul16['bottle_number']=='bottle2']
jul16_b3 = jul16[jul16['bottle_number']=='bottle3']

Note: The raw data is in ppb, and we want to think about it in mg/L. 1 part per million = 1 mg/L, so 1000 ppb = 1 mg/L.
The EPA action limit is 0.015 mg/L, or 15 ppb.

Our strategy is to figure out what fraction of the houses are above the action limit for each day and for each bottle, and make a line plot of that showing the temporal evolution of the fraction of samples above the action limit.


In [ ]:
aug15_b1['208Pb'].describe()

In [ ]:
# list containing all 3 days
days = [1,2,3]


'''
What I do here is a very pithy way of describing things; what I'm doing is counting 
the number of lead samples above 15 ppb for a given bottle on a given day, and then 
dividing by the total number of lead samples for a given bottle on a given day. 
'''

frac_aug15_b1 = aug15_b1['208Pb'][aug15_b1['208Pb'] > 15.0].count()/aug15_b1['208Pb'].count()
frac_aug15_b2 = aug15_b2['208Pb'][aug15_b2['208Pb'] > 15.0].count()/aug15_b2['208Pb'].count()
frac_aug15_b3 = aug15_b3['208Pb'][aug15_b3['208Pb'] > 15.0].count()/aug15_b3['208Pb'].count()

frac_mar16_b1 = mar16_b1['208Pb'][mar16_b1['208Pb'] > 15.0].count()/mar16_b1['208Pb'].count()
frac_mar16_b2 = mar16_b2['208Pb'][mar16_b2['208Pb'] > 15.0].count()/mar16_b2['208Pb'].count()
frac_mar16_b3 = mar16_b3['208Pb'][mar16_b3['208Pb'] > 15.0].count()/mar16_b3['208Pb'].count()

frac_jul16_b1 = jul16_b1['208Pb'][jul16_b1['208Pb'] > 15.0].count()/jul16_b1['208Pb'].count()
frac_jul16_b2 = jul16_b2['208Pb'][jul16_b2['208Pb'] > 15.0].count()/jul16_b2['208Pb'].count()
frac_jul16_b3 = jul16_b3['208Pb'][jul16_b3['208Pb'] > 15.0].count()/jul16_b3['208Pb'].count()


# empty lists for bottle 1, 2, 3 evolution:

b1_evol = [frac_aug15_b1,frac_mar16_b1,frac_jul16_b1]
b2_evol = [frac_aug15_b2,frac_mar16_b2,frac_jul16_b2]
b3_evol = [frac_aug15_b3,frac_mar16_b3,frac_jul16_b3]

In [ ]:
b1, = plt.plot(days,b1_evol,'bo--')
b2, = plt.plot(days,b2_evol,'ro--')
b3, = plt.plot(days,b3_evol,'go--')
key, = plt.plot([0.9,3.1],[0.1,0.1],'k-')
#plt.xlim(0.5,3.5)
plt.ylim(0.0,0.2)
plt.xlabel("Sample date")
plt.ylabel("Fraction of samples above limit")
plt.title("Fraction of samples above action limit")
plt.legend([b1,b2,b3,key],['Bottle 1', 'Bottle 2', 'Bottle 3', 'EPA cutoff'])
plt.xticks([1,2,3],['2015-08-01','2016-03-01','2016-07-01'])
plt.margins(0.3)
#plt.legend()

Figure 1: (above). This figure shows the fraction of samples above the EPA's "action limit" of 15 mg/L for each of the three days of sampling, for all three bottles sampled from each house. Bottle 1 typically has the highest lead value, bottle 2 the second-highest, and bottle 3 the third highest, which is what one would expect. In the first two dates, Bottle 1 has more than 10% of samples above the action limit, and in the third date it is slightly below the action limit. Overall, this indicates that the problem is getting slightly better over time, but is still worrisome.


In [ ]:
aug15_b1_000 = aug15_b1['208Pb'].quantile(0.0)/1000
aug15_b1_025 = aug15_b1['208Pb'].quantile(0.25)/1000
aug15_b1_050 = aug15_b1['208Pb'].quantile(0.5)/1000
aug15_b1_075 = aug15_b1['208Pb'].quantile(0.75)/1000
aug15_b1_090 = aug15_b1['208Pb'].quantile(0.90)/1000
aug15_b1_095 = aug15_b1['208Pb'].quantile(0.95)/1000
aug15_b1_100 = aug15_b1['208Pb'].quantile(1.0)/1000

mar16_b1_000 = mar16_b1['208Pb'].quantile(0.0)/1000
mar16_b1_025 = mar16_b1['208Pb'].quantile(0.25)/1000
mar16_b1_050 = mar16_b1['208Pb'].quantile(0.5)/1000
mar16_b1_075 = mar16_b1['208Pb'].quantile(0.75)/1000
mar16_b1_090 = mar16_b1['208Pb'].quantile(0.90)/1000
mar16_b1_095 = mar16_b1['208Pb'].quantile(0.95)/1000
mar16_b1_100 = mar16_b1['208Pb'].quantile(1.0)/1000

jul16_b1_000 = jul16_b1['208Pb'].quantile(0.0)/1000
jul16_b1_025 = jul16_b1['208Pb'].quantile(0.25)/1000
jul16_b1_050 = jul16_b1['208Pb'].quantile(0.5)/1000
jul16_b1_075 = jul16_b1['208Pb'].quantile(0.75)/1000
jul16_b1_090 = jul16_b1['208Pb'].quantile(0.90)/1000
jul16_b1_095 = jul16_b1['208Pb'].quantile(0.95)/1000
jul16_b1_100 = jul16_b1['208Pb'].quantile(1.0)/1000

b1_000 = [aug15_b1_000, mar16_b1_000, jul16_b1_000]
b1_025 = [aug15_b1_025, mar16_b1_025, jul16_b1_025]
b1_050 = [aug15_b1_050, mar16_b1_050, jul16_b1_050]
b1_075 = [aug15_b1_075, mar16_b1_075, jul16_b1_075]
b1_090 = [aug15_b1_090, mar16_b1_090, jul16_b1_090]
b1_095 = [aug15_b1_095, mar16_b1_095, jul16_b1_095]
b1_100 = [aug15_b1_100, mar16_b1_100, jul16_b1_100]

In [ ]:
b000, = plt.plot(days,b1_000,linewidth=2)
b025, = plt.plot(days,b1_025,linewidth=2)
b050, = plt.plot(days,b1_050,linewidth=2)
b075, = plt.plot(days,b1_075,linewidth=2)
b090, = plt.plot(days,b1_090,linewidth=2)
b095, = plt.plot(days,b1_095,linewidth=2)
b100, = plt.plot(days,b1_100,linewidth=2)
key, = plt.plot([0.9,3.1],[0.015,0.015],'k--')

plt.yscale('log')

plt.xlim(0.5,4.0)
plt.ylim(1.0e-4,3.0)
plt.xlabel("Sample date")
plt.ylabel("Lead content [mg/L]")
plt.title("Bottle 1 lead content (various data percentiles)")
plt.legend([b000,b025,b050,b075,b090,b095,b100,key],['min','25%','50%','75%','90%','95%','max','action'],loc='upper right')
plt.xticks([1,2,3],['2015-08-01','2016-03-01','2016-07-01'])
plt.margins(0.3)

Figure 2: (above). This figure shows the lead content in water samples from Bottle 1 at various percentiles in each of the three dates, which effectively shows how the distributions change over time. The various lines show where the minimim sample value is, 25th percentile, etc., all the way up to the maximum sample value. This shows that the maximum lead measurements for Bottle 1 are always really high, but that the situation overall seems to be improving (as indicated by the values for houses between the 50th and 95th percentile generally decreasing over time). Nearly 10% of the houses are still over the action limit as of July 2016, and probably around 8% are still over it.

Summary of results: Figures 1 and 2 (above) show that things were bad in August 2015 - around 16% of houses had lead limits above the EPA max limit of 0.015 mg/L for the first sample taken from each house, and quite a few of them vastly exceeded the EPA action limit (10% were at least 2x that limit, and 5% were 4x that limit). Over time the situation has gotten better - slightly less than 10% of the samples were over the limit as of July 2016 for the first sample taken from each house - but quite a few of the houses are still substantially over the EPA limit.

One could take the two figures shown to say that the problem is getting better and that nothing further should be done, because 10% of the houses are no longer above the EPA's action limit for lead. This is a reasonable argument, but doesn't consider the whole story - some houses still have extraordinarily high lead levels, and almost 10% of the houses DO have lead levels above the EPA action limit.


Section 4: Feedback (required!)


In [ ]:
from IPython.display import HTML
HTML(
"""
<iframe
    src="https://docs.google.com/forms/d/e/1FAIpQLSd0yvuDR2XP5QhWHJZTZHgsSi84QAZU7x-C9NEA40y6NnArAA/viewform?embedded=true" 
    width="80%" 
    height="1200px" 
    frameborder="0" 
    marginheight="0" 
    marginwidth="0">
    Loading...
</iframe>
"""
)

Congratulations, you're done!

How to submit this assignment

Log into the course Desire2Learn website (d2l.msu.edu) and go to the "Homework assignments" folder. There will be a dropbox labeled "Homework 1". Upload this notebook there.