In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import folium
import seaborn as sns
import datetime
from bcycle_lib.utils import *
%matplotlib inline
plt.rc('xtick', labelsize=14)
plt.rc('ytick', labelsize=14)
# for auto-reloading external modules
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2
I used Weather Underground to download a CSV with daily weather information from Austin's Camp Mabry station (KATT). This includes the following data fields:
The load_weather
function includes a lot of cleaning and pre-processing to get the raw CSV into a good state for the rest of the analysis.
In [2]:
weather_df = load_weather()
weather_df.head(6)
Out[2]:
In [3]:
weather_df.describe()
Out[3]:
The summary above shows descriptive statistics for each of the numeric columns in the table. There is a good range of weather conditions in there, including:
This should give a good distribution of data to work from. But we have a wide mix of units in each column (MPH, °F, percentages, and weather conditions), so we may have to use some feature normalization to give good results later on.
In [4]:
fig, ax = plt.subplots(1,1, figsize=(18,10))
ax = weather_df.plot(y=['max_temp', 'min_temp'], ax=ax)
ax.legend(fontsize=13)
xtick = pd.date_range( start=weather_df.index.min( ), end=weather_df.index.max( ), freq='D' )
ax.set_xticks( xtick )
# ax.set_xticklabels(weather_df.index.strftime('%a %b %d'))
ax.set_xlabel('Date', fontdict={'size' : 14})
ax.set_ylabel('Temperature (°F)', fontdict={'size' : 14})
ax.set_title('Austin Minimum and Maximum Temperatures during April and May 2016', fontdict={'size' : 16})
# fig.autofmt_xdate(rotation=90)
ttl = ax.title
ttl.set_position([.5, 1.02])
ax.legend(['Max Temp', 'Min Temp'], fontsize=14, loc=1)
ax.tick_params(axis='x', labelsize=14)
ax.tick_params(axis='y', labelsize=14)
The plot above shows the trends in minimum and maximum temperature during April and May 2016. The overall trend is an increase in both min and max temperatures, with a lot of variation in the changes in temperature. For example, around the 2nd May, the maximum temperature was less than the minimum temperature a few days earlier!
In [5]:
fig, ax = plt.subplots(1,2, figsize=(12,6))
# ax[0] = weather_df['min_temp'].plot.hist(ax=ax[0]) # sns.distplot(weather_df['min_temp'], ax=ax[0])
# ax[1] = weather_df['max_temp'].plot.hist(ax=ax[1]) # sns.distplot(weather_df['max_temp'], ax=ax[1])
ax[0] = sns.distplot(weather_df['min_temp'], ax=ax[0])
ax[1] = sns.distplot(weather_df['max_temp'], ax=ax[1])
for axis in ax:
axis.set_xlabel('Temperature (°F)', fontdict={'size' : 14})
axis.set_ylabel('Density', fontdict={'size' : 14})
ax[0].set_title('Minimum Temperature Distribution', fontdict={'size' : 16})
ax[1].set_title('Maximum Temperature Distribution', fontdict={'size' : 16})
Out[5]:
In [6]:
g = sns.pairplot(data=weather_df[['min_temp', 'max_temp']], kind='reg',size=4)
The pair plots show there's a reasonable correlation between the maximum and minimum temperatures.
In [7]:
fig, ax = plt.subplots(1,1, figsize=(18,10))
ax = weather_df.plot(y=['max_pressure', 'min_pressure'], ax=ax)
ax.legend(fontsize=13)
xtick = pd.date_range( start=weather_df.index.min( ), end=weather_df.index.max( ), freq='D' )
ax.set_xticks( xtick )
# ax.set_xticklabels(weather_df.index.strftime('%a %b %d'))
ax.set_xlabel('Date', fontdict={'size' : 14})
ax.set_ylabel('Pressure (inches)', fontdict={'size' : 14})
ax.set_title('Min and Max Pressure', fontdict={'size' : 18})
# fig.autofmt_xdate(rotation=90)
ax.tick_params(axis='x', labelsize=14)
ax.tick_params(axis='y', labelsize=14)
The plot shows both the max and min pressure as being highly correlated. There may also be correlations between the pressure and other more directly observable factors such as temperature and wind.
In [8]:
fig, ax = plt.subplots(1,1, figsize=(18,10))
ax = weather_df['precipitation'].plot.bar(ax=ax, legend=None)
ax.set_xticklabels(weather_df.index.strftime('%a %b %d'))
ax.set_xlabel('', fontdict={'size' : 14})
ax.set_ylabel('Precipitation (inches)', fontdict={'size' : 14})
ax.set_title('Austin Precipitation in April and May 2016', fontdict={'size' : 16})
ax.tick_params(axis='x', labelsize=13)
ax.tick_params(axis='y', labelsize=14)
ttl = ax.title
ttl.set_position([.5, 1.02])
The graph shows there was some serious rain in April and May. As well as some dry spells through early April and May, there were also individual days where over an inch of rain fell. I'd definitely not be tempted to take a bike ride in those conditions !
To see how the distribution of rainfall looks, let's plot out the histogram and Kernel Density Estimate below. Based on the daily plot above, you can see there will likely be a very right skewed distribution with a long tail. For this reason, I'll use the pandas histogram directly, instead of fitting a Kernel Density Estimate.
In [9]:
fig, ax = plt.subplots(1,1, figsize=(6,6))
ax = weather_df['precipitation'].plot.hist(ax=ax)
ax.set_xlabel('Precipitation (inches)', fontdict={'size' : 14})
ax.set_ylabel('Count', fontdict={'size' : 14})
ax.set_title('Precipitation distribution', fontdict={'size' : 16})
Out[9]:
This plot shows the majority of days had no rainfall at all. There were about 10 days with less than 0.5" of rain, and the count of days drops off steeply as the rainfall value increases. We may be able to transform this one-sided skewed distribution by setting a threshold, and converting to a boolean (above / below the threshold).
In [10]:
fig, ax = plt.subplots(1,1, figsize=(18,10))
ax = weather_df.plot(y=['max_wind', 'min_wind', 'max_gust'], ax=ax)
ax.legend(fontsize=13)
xtick = pd.date_range( start=weather_df.index.min( ), end=weather_df.index.max( ), freq='D' )
ax.set_xticks( xtick )
# ax.set_xticklabels(weather_df.index.strftime('%a %b %d'))
ax.set_xlabel('Date', fontdict={'size' : 14})
ax.set_ylabel('Wind speed (MPH)', fontdict={'size' : 14})
ax.set_title('Wind speeds', fontdict={'size' : 18})
# fig.autofmt_xdate(rotation=90)
ax.tick_params(axis='x', labelsize=14)
ax.tick_params(axis='y', labelsize=14)
The graph shows a close correlation between the min_wind
, max_wind
, and max_gust
speeds, as you'd expect. When building linear models, it's best to remove highly correlated values so we may just use the max_gust
of the three based on how correlated they are.
In [11]:
g = sns.pairplot(data=weather_df[['min_wind', 'max_wind', 'max_gust']], kind='reg',size=3.5)
This pairplot shows a high positive correlation between the max_wind
and max_gust
, as you'd expect. There is also a strong correlation between the minimum and maximum wind speeds. When building models, we probably need to take the max_wind
or max_gust
to avoid multiple correlated columns.
In [12]:
# weather_df[['thunderstorm', 'rain', 'fog']].plot.bar(figsize=(20,20))
heatmap_df = weather_df.copy()
heatmap_df = heatmap_df[['thunderstorm', 'rain', 'fog']]
heatmap_df = heatmap_df.reset_index()
heatmap_df['day'] = heatmap_df['date'].dt.dayofweek
heatmap_df['week'] = heatmap_df['date'].dt.week
heatmap_df = heatmap_df.pivot_table(values='thunderstorm', index='day', columns='week')
heatmap_df = heatmap_df.fillna(False)
# ['day'] = heatmap_df.index.dt.dayofweek
# Restore proper day and week-of-month labels.
heatmap_df.index = ['Mon', 'Tues', 'Wed', 'Thur', 'Fri', 'Sat', 'Sun']
weeks = heatmap_df.columns
weeks = ['2016-W' + str(week) for week in weeks] # Convert to '2016-Wxx'
weeks = [datetime.datetime.strptime(d + '-0', "%Y-W%W-%w").strftime('%b %d') for d in weeks]
heatmap_df.columns = weeks
fig, ax = plt.subplots(1,1, figsize=(8, 6))
sns.heatmap(data=heatmap_df, square=True, cmap='Blues', linewidth=2, cbar=False, linecolor='white', ax=ax)
ax.set_title('Thunderstorms by day and week', fontdict={'size' : 18})
ttl = ax.title
ttl.set_position([.5, 1.05])
ax.set_xlabel('Week ending (Sunday)', fontdict={'size' : 14})
ax.set_ylabel('')
ax.tick_params(axis='x', labelsize=13)
ax.tick_params(axis='y', labelsize=13)
The heatmap above shows which days had thunderstorms with the dark blue squares. Light blue squares are either days outside of April or May, or those in April and May which had thunderstorms. The plot shows there were more thunderstorms in May, where there were contiguous days of thunderstorms from 3 to 4 days long.