Air Quality and Social Justice

In the Salt Lake area, where smog will periodically hide the mountains that surround the city, air pollution is a consistent concern. The smog is a result of the pollution trapped by the unique topographic features of the Wasatch area. In this notebook, we will explore the relationship between air quality (pm25 and ozone) and income and elevation in the Salt Lake valley. This will allow us to explore questions related to social justice such as, "Is air pollution exposure equally distributed across different socio-economic groups?"

Tell our plotting routines to draw the graphics in our web browser


In [3]:
%matplotlib inline

Import packages that we are going to use

These include

  • Pandas: A package for reading and manipulating tabular data.
  • Seaborn: Seaborn is a plotting package that provides "nice" pre-defined plots for a variety of common graph types.
  • Folium: Folium is a good package for visualizing data on a map.
  • statsmodels: This package provides a number of common statistical functions for analysing data.

In [4]:
import pandas as pd
import statsmodels
import statsmodels.formula.api as smf
import patsy
import os
import matplotlib.pyplot as plt
import seaborn as sns
import folium
from ipywidgets import widgets

Tell our program where our data are located


In [ ]:
DATADIR = "/home/jovyan/DATA/AirQuality"

In [ ]:
os.listdir(DATADIR)

What are these data

In our data directory we have two csv files (csv stands for Comma Separated Values). For different regions in the Salt Lake valley (zip code), we have computed the average

  1. Income
  2. Elevation
  3. pm25 levels for March 8, 2016
  4. ozone levels at 10:00 am and 3:00 PM from August 2016.

These files have been created by Dr. Daniel L. Mendoza at the University of Utah.

Use Pandas to read in the data

Pandas reads the data into a Pandas dataframe which we assing to the variable pm25. Dataframes have two methods for looking at the data head() and tail()


In [ ]:
#original
pm25 = pd.read_csv(os.path.join(DATADIR, "Class_PM25_Data.csv"))
pm25.head(), pm25.tail()

In [5]:
pm25 = pd.read_csv("Class_PM25_Data.csv")
pm25.tail()


Out[5]:
Block_Group_ID Income Elevation PM25_MAR_8
60 4.903510e+11 69188.0 1321.630859 4.409883
61 4.903510e+11 69493.0 1312.114014 3.989309
62 4.903510e+11 92551.0 1557.853394 1.537066
63 4.903510e+11 60817.0 1552.666016 2.252555
64 4.903600e+11 NaN 1275.227905 3.717521

Notice the NaN value for Income in our last data row

Pandas uses NaN (Not a number) to represent missing values. That is our data file did not have an income value for the last row. There are a variety of approaches for dealing with missing data, but we are just going to drop that value using the Pandas dataframe dropna method.


In [6]:
pm25 = pm25.dropna()
pm25.tail()


Out[6]:
Block_Group_ID Income Elevation PM25_MAR_8
59 4.903510e+11 35375.0 1305.029541 3.595146
60 4.903510e+11 69188.0 1321.630859 4.409883
61 4.903510e+11 69493.0 1312.114014 3.989309
62 4.903510e+11 92551.0 1557.853394 1.537066
63 4.903510e+11 60817.0 1552.666016 2.252555

Now let's repeat this with the ozone data


In [ ]:
#original
ozone = pd.read_csv(os.path.join(DATADIR, "Class_Ozone_Data.csv"))
ozone.tail()

In [7]:
ozone = pd.read_csv("Class_Ozone_Data.csv")
ozone = ozone.dropna()
ozone.tail()


Out[7]:
Block_Group_ID Income Elevation Ozone_AUG_10 Ozone_AUG_15
59 490351140001 35375.0 1305.029541 22.561118 48.027862
60 490351142001 69188.0 1321.630859 30.320732 57.617152
61 490351142002 69493.0 1312.114014 31.675820 59.208955
62 490351152091 92551.0 1557.853394 36.326890 58.199097
63 490351152092 60817.0 1552.666016 36.182543 57.752619

Exploratory Plots

In statistics, it is always a good idea to start by plotting your data to get an idea of what it looks like. This can help you determine which analyses would be useful to perform. Here we are going to be using Seaborn to do some exploratory visualization.

Our first plot is going to be a histogram of our particle measurements.

Seaborn has several options for plotting histograms. We are going to use the distplot() function. Note that we set the kernal density estimate argument (a smoothing function) to True. What happens when you change it to False?


In [13]:
fig1, ax1 = plt.subplots(1)
sns.distplot(pm25["PM25_MAR_8"], ax=ax1, kde=True)
ax1.set_xlabel("PM25")
ax1.set_ylabel("Proportion")
ax1.set_title("Histogram of PM25 for March 8, 2016")


Out[13]:
Text(0.5,1,'Histogram of PM25 for March 8, 2016')

Next we'll use jointplot to visualize the relationship between income and pollution levels. jointplot has a number of options that determine what kind of joint plot to generate. The default is "scatter", but here we'll use the regression fit. Notice we also added kde to the marginal histograms with the regression fit. Any of the following types can be plotted using jointplot by changing the 'kind' argument:

  • "scatter"
  • "reg"
  • "resid"
  • "kde"
  • "hex"

In [11]:
sns.jointplot(x="Income", y="PM25_MAR_8", data=pm25, kind="reg",
             color='purple');


We can see that there is a reasonable linear relationship between pollution and income

Let's quantify this relationship

We will use statsmodels to do ordinary least squares regression.

mod = smf.ols(formula='PM25_MAR_8 ~ Income', data=pm25)
  • Use Patsy to specify what our regression is. This formula is stating a linear relationship between our dependent variable, PM25_MAR_8, and our independent variable, Income.
'PM25_MAR_8 ~ Income'
data=pm25
  • Use the pm25 dataset
  • mod.fit() fits the model to the data
  • res.summary() provides a detailed report on how well the model fit the data.

In [14]:
mod = smf.ols(formula='PM25_MAR_8 ~ Income', data=pm25)
res = mod.fit()
res.summary()


Out[14]:
OLS Regression Results
Dep. Variable: PM25_MAR_8 R-squared: 0.458
Model: OLS Adj. R-squared: 0.449
Method: Least Squares F-statistic: 52.32
Date: Sun, 13 May 2018 Prob (F-statistic): 8.48e-10
Time: 14:17:09 Log-Likelihood: -77.678
No. Observations: 64 AIC: 159.4
Df Residuals: 62 BIC: 163.7
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 4.8552 0.256 18.981 0.000 4.344 5.367
Income -3.74e-05 5.17e-06 -7.233 0.000 -4.77e-05 -2.71e-05
Omnibus: 6.338 Durbin-Watson: 1.194
Prob(Omnibus): 0.042 Jarque-Bera (JB): 5.446
Skew: 0.624 Prob(JB): 0.0657
Kurtosis: 3.697 Cond. No. 1.22e+05

This is a lot of information. What are the key points?

Let's start by looking at our overall model.

  • Prob (F-statistic): 8.48e-10 ($8.48e^{-10}$).
    • This is the probability that the linear relationship between our variables is purely due to chance.
  • R-squared: 0.458
    • This is the proportion of the variability in our data that is explained by our model, so about half of the variability in the data is explained by income.
  • Cond. No. 1.22e+05 ($1.22e^{05}$)
    • A large condition number indicates numeric problems with our model/data and mean the results are less reliable.

Now let's look at our Income variable

  • coef=-3.74e-05 ($-3.74e^{-05}$). This is the slope of the line. The slope is negative, meaning as income increases, air pollution decreases.
  • P=0.000. This is the "p-value" and describes the probability that the linear relationship is just random chance. Since our p-value is less than 0.001, we can accept our hypothesis that income and pollution are related.

Let's repeat this for the relationship between Income and Elevation


In [15]:
sns.jointplot(y="Income", x="Elevation", data=pm25, kind="reg",
             color='purple');


The relationship between Income and Elevation is strong

Now let us use Elevation to predict PM25_MAR_8


In [16]:
mod = smf.ols(formula='PM25_MAR_8 ~ Elevation', data=pm25)
res = mod.fit()
res.summary()


Out[16]:
OLS Regression Results
Dep. Variable: PM25_MAR_8 R-squared: 0.451
Model: OLS Adj. R-squared: 0.442
Method: Least Squares F-statistic: 50.84
Date: Sun, 13 May 2018 Prob (F-statistic): 1.28e-09
Time: 14:18:24 Log-Likelihood: -78.095
No. Observations: 64 AIC: 160.2
Df Residuals: 62 BIC: 164.5
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 19.1668 2.247 8.531 0.000 14.675 23.658
Elevation -0.0119 0.002 -7.130 0.000 -0.015 -0.009
Omnibus: 1.175 Durbin-Watson: 0.842
Prob(Omnibus): 0.556 Jarque-Bera (JB): 0.988
Skew: 0.013 Prob(JB): 0.610
Kurtosis: 2.392 Cond. No. 2.90e+04

Can we build a model that uses both Elevation and Income to predict pollution levels?


In [19]:
mod = smf.ols(formula='PM25_MAR_8 ~ Elevation + Income + Elevation:Income', data=pm25)
res = mod.fit()
res.summary()


Out[19]:
OLS Regression Results
Dep. Variable: PM25_MAR_8 R-squared: 0.580
Model: OLS Adj. R-squared: 0.559
Method: Least Squares F-statistic: 27.61
Date: Sun, 13 May 2018 Prob (F-statistic): 2.42e-11
Time: 14:29:07 Log-Likelihood: -69.505
No. Observations: 64 AIC: 147.0
Df Residuals: 60 BIC: 155.6
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 22.4080 6.418 3.492 0.001 9.571 35.245
Elevation -0.0135 0.005 -2.811 0.007 -0.023 -0.004
Income -0.0002 0.000 -1.624 0.110 -0.000 3.82e-05
Elevation:Income 1.034e-07 7.42e-08 1.393 0.169 -4.5e-08 2.52e-07
Omnibus: 3.191 Durbin-Watson: 1.242
Prob(Omnibus): 0.203 Jarque-Bera (JB): 2.305
Skew: 0.386 Prob(JB): 0.316
Kurtosis: 3.517 Cond. No. 4.74e+09

What happens when we include both predictors?

Now let's look at ozone

Ozone is another cause for concern in air quality. It is created by chemical reactions between oxides of nitrogen (NOx) and volatile organic compounds (VOC) in the presence of sunlight. Emissions from industrial facilities and electric utilities, motor vehicle exhaust, gasoline vapors, and chemical solvents are some of the major sources of NOx and VOC. Ground level ozone can cause breathing problems in sensitive populations and can also be harmful to delicate ecosystems.

We have ozone measurements at two times: 10:00 AM and 3:00 PM (15:00)

How are these measurements different? Let's use distplot() to visualize the histograms side-by-side.

What do you think might account for differences between the two time points?


In [20]:
fig2, ax2 = plt.subplots(1)
sns.distplot(ozone["Ozone_AUG_10"], ax=ax2, label="10:00")
sns.distplot(ozone["Ozone_AUG_15"], ax=ax2, color='r', label="15:00")
ax2.set_xlabel("Ozone Density")
ax2.legend()


Out[20]:
<matplotlib.legend.Legend at 0x1e2851499e8>

Ozone vs pm25

Use the same plotting and statistical functions we used above to explore the relationship between ozone and income and elevation


In [21]:
sns.jointplot(x="Income", y="Ozone_AUG_15", data=ozone, kind="reg",
             color='purple');


Here it looks like we have the opposite relationship between income and ozone. As income increases, ozone exposure increases. Why might this be? (Click one of the options.)


In [17]:
button1 = widgets.Button(description='a. Ozone levels decrease with elevation',layout=widgets.Layout(width='40%', height='40px'))
button2 = widgets.Button(description='b. Wealthy people spend more time outdoors',layout=widgets.Layout(width='40%', height='40px'))
button3 = widgets.Button(description='c. Ozone levels increase with elevation, as does income',layout=widgets.Layout(width='40%', height='40px'))
button4 = widgets.Button(description='d. Too much noise; relationship is not obvious',layout=widgets.Layout(width='40%', height='40px'))
display(button1,button2,button3, button4)

def on_correct_clicked(b):
    print('Correct.')
def on_incorrect_clicked(b):
    print('Not quite.')
    
button1.on_click(on_incorrect_clicked)
button2.on_click(on_incorrect_clicked)
button3.on_click(on_correct_clicked)
button4.on_click(on_incorrect_clicked)


Please try again.
Correct.

In [22]:
mod = smf.ols(formula='Ozone_AUG_15 ~ Elevation', data=ozone)
res = mod.fit()
res.summary()


Out[22]:
OLS Regression Results
Dep. Variable: Ozone_AUG_15 R-squared: 0.124
Model: OLS Adj. R-squared: 0.110
Method: Least Squares F-statistic: 8.786
Date: Sun, 13 May 2018 Prob (F-statistic): 0.00430
Time: 14:38:35 Log-Likelihood: -193.18
No. Observations: 64 AIC: 390.4
Df Residuals: 62 BIC: 394.7
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 12.3968 13.569 0.914 0.364 -14.728 39.521
Elevation 0.0299 0.010 2.964 0.004 0.010 0.050
Omnibus: 5.342 Durbin-Watson: 0.926
Prob(Omnibus): 0.069 Jarque-Bera (JB): 5.559
Skew: 0.343 Prob(JB): 0.0621
Kurtosis: 4.271 Cond. No. 2.90e+04

In [23]:
mod = smf.ols(formula='Ozone_AUG_15 ~ Income', data=ozone)
res = mod.fit()
res.summary()


Out[23]:
OLS Regression Results
Dep. Variable: Ozone_AUG_15 R-squared: 0.327
Model: OLS Adj. R-squared: 0.316
Method: Least Squares F-statistic: 30.06
Date: Sun, 13 May 2018 Prob (F-statistic): 8.16e-07
Time: 14:38:35 Log-Likelihood: -184.78
No. Observations: 64 AIC: 373.6
Df Residuals: 62 BIC: 377.9
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 45.7387 1.363 33.548 0.000 43.013 48.464
Income 0.0002 2.76e-05 5.483 0.000 9.6e-05 0.000
Omnibus: 27.944 Durbin-Watson: 1.597
Prob(Omnibus): 0.000 Jarque-Bera (JB): 77.778
Skew: 1.245 Prob(JB): 1.29e-17
Kurtosis: 7.792 Cond. No. 1.22e+05

In [ ]: