Before working on this assignment, please read these instructions fully. In the submission area, you will notice that you can click the link to Preview the Grading for each step of the assignment. These are the criteria that will be used for peer grading. Please familiarize yourself with the criteria before beginning the assignment.
An NOAA dataset has been stored in the file data/C2A2_data/BinnedCsvs_d400/fb441e62df2d58994928907a91895ec62c2c42e6cd075c2700843b89.csv. The data for this assignment comes from a subset of The National Centers for Environmental Information (NCEI) Daily Global Historical Climatology Network (GHCN-Daily). GHCN-Daily comprises daily climate records from thousands of land surface stations across the globe.
Each row in the assignment datafile corresponds to a single observation.
The following variables are provided to you: the station ID, the Date of the observation (YYYY-MM-DD), the Element (TMAX for maximum temperature, TMIN for minimum temperature), and the Data_Value (temperature in tenths of degrees Celsius).
For this assignment, you must plot the record high and record low temperatures by day of the year over the period 2005-2014 as line graphs, overlay a scatter of any 2015 observations that broke a record high or low, and remove leap days (February 29) from the dataset.
The data you have been given comes from stations near Ann Arbor, Michigan, United States; these stations are shown on the map below.
In [1]:
import matplotlib.pyplot as plt
import mplleaflet
import pandas as pd
import numpy as np
def leaflet_plot_stations(binsize, hashid):
    # Plot the locations of the stations whose observations fall into the given hash bin
    df = pd.read_csv('BinSize_d{}.csv'.format(binsize))
    station_locations_by_hash = df[df['hash'] == hashid]

    lons = station_locations_by_hash['LONGITUDE'].tolist()
    lats = station_locations_by_hash['LATITUDE'].tolist()

    plt.figure(figsize=(8,8))
    plt.scatter(lons, lats, c='r', alpha=0.7, s=200)

    return mplleaflet.display()
leaflet_plot_stations(400,'fb441e62df2d58994928907a91895ec62c2c42e6cd075c2700843b89')
Out[1]:
In [2]:
df = pd.read_csv("fb441e62df2d58994928907a91895ec62c2c42e6cd075c2700843b89.csv")
In [3]:
df.head()
Out[3]:
In [4]:
print("Unique days in the dataset: {}".format(len(df["Date"].unique())))
In [5]:
print("Observations in the dataset: {}".format(df.shape[0]))
It seems that our data contains 165085 observations for 4017 days, so there are several observations per day. Let's see how the data looks when sorted by date.
In [6]:
df.sort_values("Date").head()
Out[6]:
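As a quick sanity check (a sketch using only the columns already in the dataframe), the number of observations per calendar day can be summarized; multiple stations each report a TMAX and a TMIN for a given date, which is why there are several rows per day:

# Sketch: how many observations fall on each date?
obs_per_day = df.groupby("Date").size()
print(obs_per_day.describe())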
In [7]:
# Add a Month-Day column so we can find record-breaking days in 2015
df["Month-Day"] = df.apply(lambda row: "-".join(row["Date"].split("-")[1:]), axis=1)
In [8]:
df[ df["Month-Day"] == "01-01" ].min()
Out[8]:
Before going further, drop leap days (February 29) from the dataset, as instructed.
In [9]:
df = df[ df.Date.str.contains("-02-29") == False ]
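To confirm the filter worked, a quick check (a sketch) is that no February 29 dates remain:

# Sanity check (sketch): should print 0 after removing leap days
print(df["Date"].str.contains("-02-29").sum())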
Let's now convert the dates from strings to datetime objects.
In [10]:
df.dtypes
Out[10]:
In [11]:
df["Date"] = pd.to_datetime(df["Date"])
In [12]:
df.dtypes
Out[12]:
Let's now separate the minimum (TMIN) and maximum (TMAX) temperature observations for easier handling.
In [13]:
tmax = df[ df["Element"] == "TMAX" ]
tmin = df[ df["Element"] == "TMIN" ]
In [14]:
tmin.sort_values("Date").head()
Out[14]:
In [15]:
tmax.sort_values("Date").head()
Out[15]:
Split the data into a 2005-2014 dataset and a 2015 dataset.
We also need to find the dates on which a record low or record high was broken in 2015. The strategy here is to group each dataset by month-day so that only the high/low for each day is kept, merge the 2005-2014 and 2015 data frames on 'Month-Day', and keep the 2015 values only if they are lower or higher than the corresponding observation for the period 2005-2014.
In [16]:
# Take years earlier than 2015 (2005-2014)
df_period_min = tmin[ tmin["Date"] < "2015-01-01" ]
df_period_max = tmax[ tmax["Date"] < "2015-01-01" ]
# Separate 2015
df_2015_min = tmin[ tmin["Date"] >= "2015-01-01" ]
df_2015_max = tmax[ tmax["Date"] >= "2015-01-01" ]
In [17]:
tmax_grouped = df_period_max.groupby("Month-Day").max()
tmin_grouped = df_period_min.groupby("Month-Day").min()
tmax_15_grouped = df_2015_max.groupby("Month-Day").max()
tmin_15_grouped = df_2015_min.groupby("Month-Day").min()
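After grouping, each frame should have one row per calendar day; with leap days removed that is 365 rows, assuming every day of the year appears in the data (a small sketch to check):

# Sanity check (sketch): expect 365 rows in each grouped frame
print(len(tmax_grouped), len(tmin_grouped), len(tmax_15_grouped), len(tmin_15_grouped))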
In [18]:
tmin_15_grouped = tmin_15_grouped.reset_index()
df_min= tmin_grouped.reset_index().merge(tmin_15_grouped, on="Month-Day").set_index("Month-Day")
tmax_15_grouped = tmax_15_grouped.reset_index()
df_max= tmax_grouped.reset_index().merge(tmax_15_grouped, on="Month-Day").set_index("Month-Day")
df_max.rename(columns={"Data_Value_y":"Data_Max_2015", "Data_Value_x" : "Data_Value"}, inplace=True)
df_min.rename(columns={"Data_Value_y":"Data_Min_2015", "Data_Value_x" : "Data_Value"}, inplace=True)
In [19]:
df_max.head()
Out[19]:
In [20]:
df_min.head()
Out[20]:
For the 2015 columns, keep a value only if it is higher (for the maxima) or lower (for the minima) than the corresponding 2005-2014 value. Setting the remaining values to np.NaN is a good way of marking them as missing, since NaN values are not included in the plot or in any further calculations.
In [21]:
df_min["Data_Min_2015"] = df_min.apply(lambda row: row["Data_Min_2015"] if (row["Data_Min_2015"] < row["Data_Value"]) else np.NaN , axis=1)
df_max["Data_Max_2015"] = df_max.apply(lambda row: row["Data_Max_2015"] if (row["Data_Max_2015"] > row["Data_Value"]) else np.NaN , axis=1)
In [22]:
df_min = df_min.sort_index()
df_max = df_max.sort_index()
In [23]:
%matplotlib notebook
In [24]:
# Reset the indexes so the day-of-year position (0-364) can be used as the plotting x-axis
df_min = df_min.reset_index()
df_max = df_max.reset_index()
In [25]:
import datetime
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
days = mdates.DayLocator() # every day
years = mdates.YearLocator() # every year
months = mdates.MonthLocator() # every month
fmt = mdates.DateFormatter('%m-%d')
fig, ax = plt.subplots()
ax.plot(df_min.index, df_min["Data_Value"], label="Low")
ax.plot(df_max.index, df_max["Data_Value"], label="High")
# format the ticks
#ax.xaxis.set_major_locator(years)
ax.xaxis.set_major_formatter(fmt)
#ax.xaxis.set_minor_locator(months)
ax.scatter(df_min.index, df_min["Data_Min_2015"], s=10, c="blue", alpha=0.5, label="Low (2015)", zorder=10)
ax.scatter(df_max.index, df_max["Data_Max_2015"], s=10, c="red", alpha=0.5, label="High (2015)", zorder=10)
ax.fill_between(df_min.index, df_min["Data_Value"], df_max["Data_Value"], facecolor='cyan')
# Simplify graph
plt.legend(loc=9, bbox_to_anchor=(0.5, -0.25), ncol=4, frameon = False)
ax.tick_params(top='off', bottom='on', left='off', right='off', labelleft='on', labelbottom='on')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
# set labels and title
plt.xlabel("Date")
plt.ylabel("Temperature, Tenths of Degrees C")
plt.title("Daily Minimum and Maximum\nAnn Arbor, Michigan")
ax.grid(False)
plt.xticks(np.arange(1, 365, 30), rotation=45)
# make some room for the legend
fig.subplots_adjust(bottom=0.3)
plt.show()
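One optional refinement, not done above: the Data_Value columns are in tenths of degrees Celsius, so dividing by 10 before plotting would let the y-axis read directly in degrees C (a sketch; the y-axis label would then change accordingly):

# Optional (sketch): convert tenths of degrees C to degrees C before plotting
df_min[["Data_Value", "Data_Min_2015"]] = df_min[["Data_Value", "Data_Min_2015"]] / 10.0
df_max[["Data_Value", "Data_Max_2015"]] = df_max[["Data_Value", "Data_Max_2015"]] / 10.0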