Examining Weather Data and Air Quality Data

In this notebook we are going to learn how to read tabular data (e.g. spreadsheets) with the python package Pandas. Pandas is a very useful tool for datascience applications and provides a number of built in functions that makes it much easier for us to write programs.



In [ ]:

    
%matplotlib inline



In [ ]:

    
import os
import pandas as pd
import datetime
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import YouTubeVideo



In [ ]:

    
#!pip install xlrd

We need to create a variable that will tell our program where the data are located



In [ ]:

    
DATADIR = os.path.join(os.path.expanduser("~"), "DATA", "TimeSeries", "EPA")
os.path.exists(DATADIR)

What files are in the directory?



In [ ]:

    
files = os.listdir(DATADIR)
files

Read the air quality data

We'll use the Pandas read_excel function to read in our data into a Pandas dataframe. Pandas will automatically recongize column names, and data types (e.g. text, numbers). After we read in the data, we'll take a quick look at what it looks like.



In [ ]:

    
slc = pd.read_excel(os.path.join(DATADIR, 'Salt_Lake_2016_PM25.xlsx'))

A dataframe is an object with attributes and methods.

Some important attributes are

columns
shape

Useful methods include head and tail



In [ ]:

    
print(slc.columns)
print(slc.shape)

In addition to looking at the column names, we can also look at the data

See what happens when you pass an integer number to head (e.g. head(15). Try changing head to tail.



In [ ]:

    
slc.head(10)

There is lots of stuff here, more than we're interested in. `Thow it away`

What columns do we want to keep?

What is the difference between `Time local` and `Time GMT`?

Hint: GMT is more appropriately called UTC

There are lots of ways to throw data away

We can tell Pandas what columns to read when reading in the data.
We can tell Pandas to drop particular columns
We can create a new pandas dataframe by explicitly stating which columns we want to use. (This is the approach we will use.)



In [ ]:

    
YouTubeVideo("A4ZysWTWXEk")



In [ ]:

    
slc = slc[["Date Local", "Time Local", 
           "Sample Measurement", "MDL", 
           "Latitude", "Longitude", "Site Num"]]

Comments:

Dates and times are split into separate columns
We have both local time and UTC time

Merging Dates and Time

Dates and times are an important data type, and tricky---think leap years, time zones, etc, days of the week, etc. Luckily, Python comes with date and time objects and lots of functions and methods for working with them.

We'll use the datetime object combine method to merge the date and time columns.



In [ ]:

    
obs=20
print(slc.loc[obs]["Date Local"], slc.loc[obs]["Time Local"])
print(datetime.datetime.combine(slc.loc[obs]["Date Local"], 
                                slc.loc[obs]["Time Local"]))

Applying `datetime.combine` to all the dates and times in our dataframe

We're going to create a new column called "Date/Time Local" using the dataframe method apply. apply takes a function and applies it to the data in the dataframe. In this case we going to do some fancy Python and create what is called an anonymous function with a lambda statement.

See if you can describe what we are doing.



In [ ]:

    
slc["Date/Time Local"] = \
slc.apply(lambda x: datetime.datetime.combine(x["Date Local"],
                                              x["Time Local"]), 
          axis=1)
slc["Date/Time Local"].tail()

Let's look at the data

Since we have two different measurement sites, we're going to select only the data for site 3006.



In [ ]:

    
slc[slc["Site Num"]==3006].plot(x="Date Local", 
                                y="Sample Measurement")

Read in weather data



In [ ]:

    
slc_weather = pd.read_excel(os.path.join(DATADIR, 'SLC_Weather_2016.xlsx'))
slc_weather.head()

The data file uses the 2nd row to describe the units.

This is confusing Pandas
Let's skip the second row
This file uses a "-" to indicate data are missing. We need to tell Pandas this.



In [ ]:

    
slc_weather = pd.read_excel(os.path.join(DATADIR, 
                                         'SLC_Weather_2016.xlsx'), 
                            skiprows=[1],
                           na_values='-')
slc_weather.head()



In [ ]:

    
slc_weather['Day'][0]



In [ ]:

    
slc_weather.plot(x="Day", y="High")

Our Weather Data Have Resolution of Days

Our pollutant data has resolution of hours

What should we do?

We want to aggregate the data across days.

How might we do this?

What was the maximum value?
What was the minimum value?
What was the sum of the value?
What was the average (mean) of the value?



In [ ]:

    
slc.groupby("Date Local", as_index=False).aggregate(np.mean).head()

Group and take sum?



In [ ]:

    
slc.groupby("Date Local", as_index=False).aggregate(np.sum).head()

Now we need to combine the pollution data with the weather data



In [ ]:

    
slc_day_all = slc_day.merge(slc_weather, 
                            left_on="Date Local", 
                            right_on="Day")
slc_day_all.head()

Explore the Relationship between various weather variables and `Sample Measurement`



In [ ]:

    
f, ax1 = plt.subplots(1)
slc_day_all[slc_day_all["Site Num"]==3006].plot(x="Date Local", 
                                                y="High", ax=ax1)
slc_day_all[slc_day_all["Site Num"]==3006].plot(secondary_y=True, x="Date Local", 
                                                y="Sample Measurement", ax=ax1)

<span xmlns:dct="http://purl.org/dc/terms/" property="dct:title">University of Uah Data Science for Health</span> by <span xmlns:cc="http://creativecommons.org/ns#" property="cc:attributionName">Brian E. Chapman</span> is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.|



In [ ]: