In this notebook we are going to learn how to read tabular data (e.g. spreadsheets) with the python package Pandas. Pandas is a very useful tool for datascience applications and provides a number of built in functions that makes it much easier for us to write programs.
In [ ]:
%matplotlib inline
In [ ]:
import os
import pandas as pd
import datetime
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import YouTubeVideo
In [ ]:
#!pip install xlrd
In [ ]:
DATADIR = os.path.join(os.path.expanduser("~"), "DATA", "TimeSeries", "EPA")
os.path.exists(DATADIR)
In [ ]:
files = os.listdir(DATADIR)
files
In [ ]:
slc = pd.read_excel(os.path.join(DATADIR, 'Salt_Lake_2016_PM25.xlsx'))
A dataframe is an object with attributes and methods.
Some important attributes are
Useful methods include head
and tail
In [ ]:
print(slc.columns)
print(slc.shape)
In [ ]:
slc.head(10)
Thow it away
Time local
and Time GMT
?GMT
is more appropriately called UTC
In [ ]:
YouTubeVideo("A4ZysWTWXEk")
In [ ]:
slc = slc[["Date Local", "Time Local",
"Sample Measurement", "MDL",
"Latitude", "Longitude", "Site Num"]]
Dates and times are an important data type, and tricky---think leap years, time zones, etc, days of the week, etc. Luckily, Python comes with date and time objects and lots of functions and methods for working with them.
datetime
object combine
method to merge the date and time columns.
In [ ]:
obs=20
print(slc.loc[obs]["Date Local"], slc.loc[obs]["Time Local"])
print(datetime.datetime.combine(slc.loc[obs]["Date Local"],
slc.loc[obs]["Time Local"]))
datetime.combine
to all the dates and times in our dataframeWe're going to create a new column called "Date/Time Local" using the dataframe method apply
. apply
takes a function and applies it to the data in the dataframe. In this case we going to do some fancy Python and create what is called an anonymous function with a lambda
statement.
See if you can describe what we are doing.
In [ ]:
slc["Date/Time Local"] = \
slc.apply(lambda x: datetime.datetime.combine(x["Date Local"],
x["Time Local"]),
axis=1)
slc["Date/Time Local"].tail()
In [ ]:
slc[slc["Site Num"]==3006].plot(x="Date Local",
y="Sample Measurement")
In [ ]:
slc_weather = pd.read_excel(os.path.join(DATADIR, 'SLC_Weather_2016.xlsx'))
slc_weather.head()
In [ ]:
slc_weather = pd.read_excel(os.path.join(DATADIR,
'SLC_Weather_2016.xlsx'),
skiprows=[1],
na_values='-')
slc_weather.head()
In [ ]:
slc_weather['Day'][0]
In [ ]:
slc_weather.plot(x="Day", y="High")
In [ ]:
slc.groupby("Date Local", as_index=False).aggregate(np.mean).head()
In [ ]:
slc.groupby("Date Local", as_index=False).aggregate(np.sum).head()
In [ ]:
slc_day_all = slc_day.merge(slc_weather,
left_on="Date Local",
right_on="Day")
slc_day_all.head()
In [ ]:
f, ax1 = plt.subplots(1)
slc_day_all[slc_day_all["Site Num"]==3006].plot(x="Date Local",
y="High", ax=ax1)
slc_day_all[slc_day_all["Site Num"]==3006].plot(secondary_y=True, x="Date Local",
y="Sample Measurement", ax=ax1)
<span xmlns:dct="http://purl.org/dc/terms/" property="dct:title">University of Uah Data Science for Health</span> by <span xmlns:cc="http://creativecommons.org/ns#" property="cc:attributionName">Brian E. Chapman</span> is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.|
In [ ]: