In the Salt Lake area, where smog will periodically hide the mountains that surround the city, air pollution is a consistent concern. The smog is a result of the pollution trapped by the unique topographic features of the Wasatch area. In this notebook, we will explore the relationship between air quality (pm25 and ozone) and income and elevation in the Salt Lake valley. This will allow us to explore questions related to social justice such as, "Is air pollution exposure equally distributed across different socio-economic groups?"
In [3]:
%matplotlib inline
In [4]:
import pandas as pd
import statsmodels
import statsmodels.formula.api as smf
import patsy
import os
import matplotlib.pyplot as plt
import seaborn as sns
import folium
from ipywidgets import widgets
In [ ]:
DATADIR = "/home/jovyan/DATA/AirQuality"
In [ ]:
os.listdir(DATADIR)
In our data directory we have two csv files (csv stands for Comma Separated Values). For different regions in the Salt Lake valley (zip code), we have computed the average
These files have been created by Dr. Daniel L. Mendoza at the University of Utah.
In [ ]:
#original
pm25 = pd.read_csv(os.path.join(DATADIR, "Class_PM25_Data.csv"))
pm25.head(), pm25.tail()
In [5]:
pm25 = pd.read_csv("Class_PM25_Data.csv")
pm25.tail()
Out[5]:
NaN value for Income in our last data rowPandas uses NaN (Not a number) to represent missing values. That is our data file did not have an income value for the last row. There are a variety of approaches for dealing with missing data, but we are just going to drop that value using the Pandas dataframe dropna method.
In [6]:
pm25 = pm25.dropna()
pm25.tail()
Out[6]:
In [ ]:
#original
ozone = pd.read_csv(os.path.join(DATADIR, "Class_Ozone_Data.csv"))
ozone.tail()
In [7]:
ozone = pd.read_csv("Class_Ozone_Data.csv")
ozone = ozone.dropna()
ozone.tail()
Out[7]:
In statistics, it is always a good idea to start by plotting your data to get an idea of what it looks like. This can help you determine which analyses would be useful to perform. Here we are going to be using Seaborn to do some exploratory visualization.
Our first plot is going to be a histogram of our particle measurements.
Seaborn has several options for plotting histograms. We are going to use the distplot() function. Note that we set the kernal density estimate argument (a smoothing function) to True. What happens when you change it to False?
In [13]:
fig1, ax1 = plt.subplots(1)
sns.distplot(pm25["PM25_MAR_8"], ax=ax1, kde=True)
ax1.set_xlabel("PM25")
ax1.set_ylabel("Proportion")
ax1.set_title("Histogram of PM25 for March 8, 2016")
Out[13]:
Next we'll use jointplot to visualize the relationship between income and pollution levels. jointplot has a number of options that determine what kind of joint plot to generate. The default is "scatter", but here we'll use the regression fit. Notice we also added kde to the marginal histograms with the regression fit. Any of the following types can be plotted using jointplot by changing the 'kind' argument:
In [11]:
sns.jointplot(x="Income", y="PM25_MAR_8", data=pm25, kind="reg",
color='purple');
We will use statsmodels to do ordinary least squares regression.
mod = smf.ols(formula='PM25_MAR_8 ~ Income', data=pm25)
PM25_MAR_8, and our independent variable, Income.'PM25_MAR_8 ~ Income'
data=pm25
pm25 datasetmod.fit() fits the model to the datares.summary() provides a detailed report on how well the model fit the data.
In [14]:
mod = smf.ols(formula='PM25_MAR_8 ~ Income', data=pm25)
res = mod.fit()
res.summary()
Out[14]:
Income variable
In [15]:
sns.jointplot(y="Income", x="Elevation", data=pm25, kind="reg",
color='purple');
In [16]:
mod = smf.ols(formula='PM25_MAR_8 ~ Elevation', data=pm25)
res = mod.fit()
res.summary()
Out[16]:
In [19]:
mod = smf.ols(formula='PM25_MAR_8 ~ Elevation + Income + Elevation:Income', data=pm25)
res = mod.fit()
res.summary()
Out[19]:
Ozone is another cause for concern in air quality. It is created by chemical reactions between oxides of nitrogen (NOx) and volatile organic compounds (VOC) in the presence of sunlight. Emissions from industrial facilities and electric utilities, motor vehicle exhaust, gasoline vapors, and chemical solvents are some of the major sources of NOx and VOC. Ground level ozone can cause breathing problems in sensitive populations and can also be harmful to delicate ecosystems.
In [20]:
fig2, ax2 = plt.subplots(1)
sns.distplot(ozone["Ozone_AUG_10"], ax=ax2, label="10:00")
sns.distplot(ozone["Ozone_AUG_15"], ax=ax2, color='r', label="15:00")
ax2.set_xlabel("Ozone Density")
ax2.legend()
Out[20]:
In [21]:
sns.jointplot(x="Income", y="Ozone_AUG_15", data=ozone, kind="reg",
color='purple');
Here it looks like we have the opposite relationship between income and ozone. As income increases, ozone exposure increases. Why might this be? (Click one of the options.)
In [17]:
button1 = widgets.Button(description='a. Ozone levels decrease with elevation',layout=widgets.Layout(width='40%', height='40px'))
button2 = widgets.Button(description='b. Wealthy people spend more time outdoors',layout=widgets.Layout(width='40%', height='40px'))
button3 = widgets.Button(description='c. Ozone levels increase with elevation, as does income',layout=widgets.Layout(width='40%', height='40px'))
button4 = widgets.Button(description='d. Too much noise; relationship is not obvious',layout=widgets.Layout(width='40%', height='40px'))
display(button1,button2,button3, button4)
def on_correct_clicked(b):
print('Correct.')
def on_incorrect_clicked(b):
print('Not quite.')
button1.on_click(on_incorrect_clicked)
button2.on_click(on_incorrect_clicked)
button3.on_click(on_correct_clicked)
button4.on_click(on_incorrect_clicked)
In [22]:
mod = smf.ols(formula='Ozone_AUG_15 ~ Elevation', data=ozone)
res = mod.fit()
res.summary()
Out[22]:
In [23]:
mod = smf.ols(formula='Ozone_AUG_15 ~ Income', data=ozone)
res = mod.fit()
res.summary()
Out[23]:
In [ ]: