BCycle Austin models

This notebook analyzes the weather patterns during April and May 2016. The data has already been downloaded from Weather Underground and should be in ../input/weather.csv. Please check that this file exists, and unzip it if you need to.

Imports and data loading

Before getting started, let's import some useful libraries for visualization, and the bcycle utils library.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import folium
import seaborn as sns

import datetime

from bcycle_lib.utils import *

%matplotlib inline
plt.rc('xtick', labelsize=14) 
plt.rc('ytick', labelsize=14) 

# for auto-reloading external modules
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

Loading and cleaning weather data

I used Weather Underground to download a CSV with daily weather information from Austin's Camp Mabry station (KATT). This includes the following data fields:

  • Date
  • Min, mean, and max:
    • Temperature (degrees Fahrenheit)
    • Dew Point (degrees Fahrenheit)
    • Humidity (%)
    • Sea Level Pressure (inches)
    • Visibility (miles)
    • Wind speed (mph)
  • Max gust (mph)
  • Precipitation (inches)
  • Events (combinations of Fog, Rain, Thunderstorm)

The load_weather function includes a lot of cleaning and pre-processing to get the raw CSV into a good state for the rest of the analysis.
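
The source of load_weather lives in bcycle_lib and isn't shown in this notebook, so the following is only a sketch of one cleaning step it might perform: turning the raw Events text into the 0/1 thunderstorm, rain, and fog columns. The hyphen-separated input format, the sample rows, and the events_to_dummies helper are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical raw values. The hyphen-separated format is an assumption,
# not taken from the actual weather.csv file.
raw = pd.DataFrame({'events': ['Rain-Thunderstorm', None, 'Fog', 'Fog-Rain']})

def events_to_dummies(events):
    """Return one 0/1 column per event name found in a text column."""
    filled = events.fillna('').str.lower()  # days with no event become ''
    out = pd.DataFrame(index=events.index)
    for name in ['thunderstorm', 'rain', 'fog']:
        out[name] = filled.str.contains(name).astype(int)
    return out

dummies = events_to_dummies(raw['events'])
print(dummies)
```

Each row of dummies has a 1 wherever the event name appears in that day's text, and all zeros for days with no recorded event.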


In [2]:
weather_df = load_weather()
weather_df.head(6)


Out[2]:
max_temp min_temp max_humidity min_humidity max_pressure min_pressure max_wind min_wind max_gust precipitation cloud_pct thunderstorm rain fog
date
2016-04-01 66 51 72 42 30.17 29.76 21 8 37 0.34 62.5 1 1 0
2016-04-02 74 45 76 23 30.32 30.18 13 5 20 0.00 0.0 0 0 0
2016-04-03 79 44 89 27 30.26 30.08 12 3 17 0.00 0.0 0 0 0
2016-04-04 83 53 66 30 30.21 30.10 12 4 18 0.00 0.0 0 0 0
2016-04-05 82 53 66 33 30.22 30.03 15 5 25 0.00 0.0 0 0 0
2016-04-06 82 55 90 21 30.12 29.93 15 6 23 0.00 37.5 0 0 0

In [3]:
weather_df.describe()


Out[3]:
max_temp min_temp max_humidity min_humidity max_pressure min_pressure max_wind min_wind max_gust precipitation cloud_pct thunderstorm rain fog
count 61.000000 61.000000 61.000000 61.000000 61.000000 61.000000 61.000000 61.000000 61.000000 61.000000 61.000000 61.000000 61.000000 61.000000
mean 81.901639 62.360656 90.721311 53.475410 30.045246 29.867213 13.737705 5.295082 22.770492 0.235410 55.532787 0.393443 0.491803 0.065574
std 5.682443 7.122810 8.374826 17.527699 0.121197 0.134872 3.224395 1.676745 5.643266 0.424349 34.348442 0.492568 0.504082 0.249590
min 66.000000 44.000000 66.000000 15.000000 29.760000 29.540000 8.000000 2.000000 13.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 78.000000 58.000000 87.000000 40.000000 29.980000 29.780000 12.000000 4.000000 19.000000 0.000000 37.500000 0.000000 0.000000 0.000000
50% 83.000000 63.000000 93.000000 56.000000 30.070000 29.900000 14.000000 5.000000 22.000000 0.010000 62.500000 0.000000 0.000000 0.000000
75% 87.000000 67.000000 97.000000 67.000000 30.120000 29.960000 16.000000 7.000000 26.000000 0.310000 87.500000 1.000000 1.000000 0.000000
max 91.000000 77.000000 100.000000 84.000000 30.320000 30.180000 21.000000 9.000000 37.000000 2.250000 100.000000 1.000000 1.000000 1.000000

The summary above shows descriptive statistics for each of the numeric columns in the table. There is a good range of weather conditions in there, including:

  • Min and max temperatures ranging from 44°F to 91°F.
  • Wind speeds ranging from 2 MPH to 21 MPH, with individual gusts up to 37 MPH!
  • Maximum precipitation of 2.25 inches.
  • Weather events including fog, thunderstorms, and rain. These appear in the summary as 0/1 dummy columns, so each mean is the fraction of days with that event (for example, rain on about 49% of days).

This should give a good distribution of data to work from. But we have a wide mix of units in each column (MPH, °F, percentages, and weather conditions), so we may have to use some feature normalization to give good results later on.
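
As a minimal sketch of what feature normalization could look like, the snippet below standardizes each column to zero mean and unit standard deviation (z-scores). The values are the first six rows of three columns copied from the head(6) output above. The standardize helper is a name invented for this example, and z-scoring is only one of several possible normalization schemes.

```python
import pandas as pd

# First six rows of three columns, copied from the weather_df.head(6) output.
sample = pd.DataFrame({
    'max_temp': [66, 74, 79, 83, 82, 82],
    'max_wind': [21, 13, 12, 12, 15, 15],
    'precipitation': [0.34, 0.0, 0.0, 0.0, 0.0, 0.0],
})

def standardize(df):
    """Rescale each column to zero mean and unit (sample) standard deviation."""
    return (df - df.mean()) / df.std()

scaled = standardize(sample)
print(scaled.round(2))
```

After scaling, columns measured in °F, MPH, and inches are all expressed in standard deviations from their own mean, so no single unit dominates a distance-based or regularized model.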

Visualizing weather in April/May 2016

Now that we have the weather information in a convenient dataframe, we can make some plots to visualize the conditions during April and May.

Temperature plots

Let's see how the minimum and maximum temperatures varied.


In [4]:
fig, ax = plt.subplots(1,1, figsize=(18,10))
ax = weather_df.plot(y=['max_temp', 'min_temp'], ax=ax)
ax.legend(fontsize=13)
xtick = pd.date_range( start=weather_df.index.min( ), end=weather_df.index.max( ), freq='D' )
ax.set_xticks( xtick )
# ax.set_xticklabels(weather_df.index.strftime('%a %b %d'))
ax.set_xlabel('Date', fontdict={'size' : 14})
ax.set_ylabel('Temperature (°F)', fontdict={'size' : 14})
ax.set_title('Austin Minimum and Maximum Temperatures during April and May 2016', fontdict={'size' : 16}) 
# fig.autofmt_xdate(rotation=90)
ttl = ax.title
ttl.set_position([.5, 1.02])
ax.legend(['Max Temp', 'Min Temp'], fontsize=14, loc=1)



ax.tick_params(axis='x', labelsize=14)
ax.tick_params(axis='y', labelsize=14)


The plot above shows the trends in minimum and maximum temperature during April and May 2016. Both trend upward overall, with a lot of day-to-day variation. For example, around May 2, the maximum temperature was lower than the minimum temperature a few days earlier!

Temperature distributions

Now that we have an idea of how the temperature changed over time, we can check the distribution of min and max temperatures. Some of the models we'll be using expect features to be normally distributed, so we may need to transform the values if they aren't.
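
A numeric complement to distribution plots is the sample skewness, which is close to zero for a symmetric distribution and positive for a long right tail. The sketch below uses toy columns made up for illustration (not values from weather_df). The skew_report helper and the 0.5 cutoff are assumptions, a rough rule of thumb and not a formal normality test.

```python
import pandas as pd

def skew_report(df, threshold=0.5):
    """Tabulate each column's skewness and flag those below the cutoff."""
    skew = df.skew()
    return pd.DataFrame({
        'skew': skew,
        'roughly_symmetric': skew.abs() < threshold,  # assumed rule of thumb
    })

# Toy data for illustration only, not taken from the weather dataset.
toy = pd.DataFrame({
    'symmetric': [1, 2, 3, 4, 5, 6, 7],
    'right_skewed': [0, 0, 0, 0, 0, 1, 10],
})
report = skew_report(toy)
print(report)
```

On the real data, this would be called as skew_report(weather_df[['min_temp', 'max_temp']]). Columns flagged as skewed are candidates for a transformation before modelling.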


In [5]:
fig, ax = plt.subplots(1,2, figsize=(12,6))

# ax[0] = weather_df['min_temp'].plot.hist(ax=ax[0]) # sns.distplot(weather_df['min_temp'], ax=ax[0])
# ax[1] = weather_df['max_temp'].plot.hist(ax=ax[1]) # sns.distplot(weather_df['max_temp'], ax=ax[1])

ax[0] = sns.distplot(weather_df['min_temp'], ax=ax[0])
ax[1] = sns.distplot(weather_df['max_temp'], ax=ax[1])

for axis in ax:
    axis.set_xlabel('Temperature (°F)', fontdict={'size' : 14})
    axis.set_ylabel('Density', fontdict={'size' : 14})

ax[0].set_title('Minimum Temperature Distribution', fontdict={'size' : 16}) 
ax[1].set_title('Maximum Temperature Distribution', fontdict={'size' : 16})


Out[5]:
<matplotlib.text.Text at 0x7f9e6b93f630>

Temperature pair plots

To see how the temperatures are correlated, let's use a pairplot.


In [6]:
g = sns.pairplot(data=weather_df[['min_temp', 'max_temp']], kind='reg',size=4)


The pair plots show there's a reasonable correlation between the maximum and minimum temperatures.
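
The strength of the relationship in the pair plot can be put into a number with Pearson's correlation coefficient. A minimal sketch, using only the six min_temp and max_temp values shown in the head(6) output earlier, so the result describes that small sample and not the full two months:

```python
import pandas as pd

# Six rows copied from the weather_df.head(6) output.
sample = pd.DataFrame({
    'min_temp': [51, 45, 44, 53, 53, 55],
    'max_temp': [66, 74, 79, 83, 82, 82],
})

# Series.corr defaults to the Pearson coefficient, which lies in [-1, 1].
r = sample['min_temp'].corr(sample['max_temp'])
print('Pearson r = {:.2f}'.format(r))
```

Running the same call on the full weather_df columns would give the figure that backs up the visual impression from the pair plot.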

Pressure

Let's check the pressure difference in April and May. We don't perceive pressure as directly as temperature, precipitation, or thunderstorms. But there may be some interesting trends.


In [7]:
fig, ax = plt.subplots(1,1, figsize=(18,10))
ax = weather_df.plot(y=['max_pressure', 'min_pressure'], ax=ax)
ax.legend(fontsize=13)
xtick = pd.date_range( start=weather_df.index.min( ), end=weather_df.index.max( ), freq='D' )
ax.set_xticks( xtick )
# ax.set_xticklabels(weather_df.index.strftime('%a %b %d'))
ax.set_xlabel('Date', fontdict={'size' : 14})
ax.set_ylabel('Pressure (inches)', fontdict={'size' : 14})
ax.set_title('Min and Max Pressure', fontdict={'size' : 18}) 
# fig.autofmt_xdate(rotation=90)

ax.tick_params(axis='x', labelsize=14)
ax.tick_params(axis='y', labelsize=14)


The plot shows that the max and min pressure are highly correlated. There may also be correlations between the pressure and other more directly observable factors such as temperature and wind.

Precipitation

Let's take a look at the precipitation, to see how much it rained during the data collection phase.


In [8]:
fig, ax = plt.subplots(1,1, figsize=(18,10))
ax = weather_df['precipitation'].plot.bar(ax=ax, legend=None)
ax.set_xticklabels(weather_df.index.strftime('%a %b %d'))
ax.set_xlabel('', fontdict={'size' : 14})
ax.set_ylabel('Precipitation (inches)', fontdict={'size' : 14})
ax.set_title('Austin Precipitation in April and May 2016', fontdict={'size' : 16})
ax.tick_params(axis='x', labelsize=13)
ax.tick_params(axis='y', labelsize=14)
ttl = ax.title
ttl.set_position([.5, 1.02])


The graph shows there was some serious rain in April and May. As well as some dry spells through early April and May, there were also individual days where over an inch of rain fell. I'd definitely not be tempted to take a bike ride in those conditions!

Precipitation histogram

To see how the distribution of rainfall looks, let's plot a histogram below. Based on the daily plot above, the distribution will likely be very right-skewed with a long tail. For this reason, I'll use the pandas histogram directly, instead of fitting a Kernel Density Estimate.


In [9]:
fig, ax = plt.subplots(1,1, figsize=(6,6))
ax = weather_df['precipitation'].plot.hist(ax=ax)
ax.set_xlabel('Precipitation (inches)', fontdict={'size' : 14})
ax.set_ylabel('Count', fontdict={'size' : 14})
ax.set_title('Precipitation distribution', fontdict={'size' : 16})


Out[9]:
<matplotlib.text.Text at 0x7f9e6998ce48>

This plot shows that most days had little or no rainfall (the median in the summary table is just 0.01 inches). There were about 10 days with less than 0.5 inches of rain, and the count of days drops off steeply as the rainfall value increases. We may be able to transform this one-sided skewed distribution by setting a threshold, and converting to a boolean (above / below the threshold).
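
The thresholding idea can be sketched in a couple of lines. The precipitation values below are the first six rows from the head(6) output. The 0.1 inch cutoff and the RAIN_THRESHOLD_IN name are assumptions chosen for illustration, and a suitable threshold would need to be picked by looking at how rentals respond to rainfall.

```python
import pandas as pd

# First six precipitation values (inches), copied from weather_df.head(6).
precip = pd.Series([0.34, 0.0, 0.0, 0.0, 0.0, 0.0], name='precipitation')

RAIN_THRESHOLD_IN = 0.1  # assumed cutoff in inches, for illustration only

# 1 where the day's rainfall exceeds the threshold, 0 otherwise.
rainy = (precip > RAIN_THRESHOLD_IN).astype(int)
print(rainy.tolist())
```

The resulting 0/1 column has no long tail, at the cost of discarding how much it rained on wet days.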

Windspeed

The wind speed is likely to play a role in the number of bike rentals too. I've plotted the minimum, maximum, and gust speeds in the line graph below.


In [10]:
fig, ax = plt.subplots(1,1, figsize=(18,10))
ax = weather_df.plot(y=['max_wind', 'min_wind', 'max_gust'], ax=ax)
ax.legend(fontsize=13)
xtick = pd.date_range( start=weather_df.index.min( ), end=weather_df.index.max( ), freq='D' )
ax.set_xticks( xtick )
# ax.set_xticklabels(weather_df.index.strftime('%a %b %d'))
ax.set_xlabel('Date', fontdict={'size' : 14})
ax.set_ylabel('Wind speed (MPH)', fontdict={'size' : 14})
ax.set_title('Wind speeds', fontdict={'size' : 18}) 
# fig.autofmt_xdate(rotation=90)

ax.tick_params(axis='x', labelsize=14)
ax.tick_params(axis='y', labelsize=14)


The graph shows a close correlation between the min_wind, max_wind, and max_gust speeds, as you'd expect. When building linear models, it's best to remove highly correlated features, so we may keep just max_gust out of the three, depending on how correlated they are.

Wind speed distributions

As I suspect the wind speeds are very correlated, let's use a pairplot to see the correlations as well as individual distributions.


In [11]:
g = sns.pairplot(data=weather_df[['min_wind', 'max_wind', 'max_gust']], kind='reg',size=3.5)


This pairplot shows a high positive correlation between max_wind and max_gust, as you'd expect. There is also a strong correlation between the minimum and maximum wind speeds. When building models, we probably need to keep only one of max_wind or max_gust to avoid multiple correlated columns.
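
One way to prune correlated columns automatically is to scan the upper triangle of the absolute correlation matrix and drop the later column of any pair above a cutoff. This is a sketch under assumptions: the drop_correlated helper and the 0.8 threshold are made up for this example, and the sample is just the six wind rows from head(6), so the outcome on the full data may differ.

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.8):
    """Drop the later column of every pair whose |Pearson r| exceeds threshold."""
    corr = df.corr().abs()
    # Keep only entries above the diagonal so each pair is checked once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

# Six rows copied from the weather_df.head(6) output.
wind_sample = pd.DataFrame({
    'max_gust': [37, 20, 17, 18, 25, 23],
    'max_wind': [21, 13, 12, 12, 15, 15],
    'min_wind': [8, 5, 3, 4, 5, 6],
})
reduced = drop_correlated(wind_sample)
print(list(reduced.columns))
```

Because the earlier column of each pair survives, column order decides what is kept. Here max_gust is listed first so that it is the one retained if the others correlate strongly with it.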

Weather events

As well as the numeric weather values, there are three dummy variables for the events on each day. These are thunderstorm, rain, and fog. Let's plot the thunderstorm days below.


In [12]:
# weather_df[['thunderstorm', 'rain', 'fog']].plot.bar(figsize=(20,20))
heatmap_df = weather_df.copy()
heatmap_df = heatmap_df[['thunderstorm', 'rain', 'fog']]
heatmap_df = heatmap_df.reset_index()
heatmap_df['day'] = heatmap_df['date'].dt.dayofweek
heatmap_df['week'] = heatmap_df['date'].dt.isocalendar().week  # .dt.week was removed in pandas 2.0
heatmap_df = heatmap_df.pivot_table(values='thunderstorm', index='day', columns='week')
heatmap_df = heatmap_df.fillna(False)
# ['day'] = heatmap_df.index.dt.dayofweek

# Restore proper day and week-of-month labels. 
heatmap_df.index = ['Mon', 'Tues', 'Wed', 'Thur', 'Fri', 'Sat', 'Sun']
weeks = heatmap_df.columns
weeks = ['2016-W' + str(week) for week in weeks] # Convert to '2016-Wxx'
weeks = [datetime.datetime.strptime(d + '-0', "%Y-W%W-%w").strftime('%b %d') for d in weeks]
heatmap_df.columns = weeks

fig, ax = plt.subplots(1,1, figsize=(8, 6))
sns.heatmap(data=heatmap_df, square=True, cmap='Blues', linewidth=2, cbar=False, linecolor='white', ax=ax)
ax.set_title('Thunderstorms by day and week', fontdict={'size' : 18})
ttl = ax.title
ttl.set_position([.5, 1.05])
ax.set_xlabel('Week ending (Sunday)', fontdict={'size' : 14})
ax.set_ylabel('')
ax.tick_params(axis='x', labelsize=13)
ax.tick_params(axis='y', labelsize=13)


The heatmap above shows which days had thunderstorms with the dark blue squares. Light blue squares are either days outside of April and May, or days in April and May which had no thunderstorms. The plot shows there were more thunderstorms in May, including runs of 3 to 4 consecutive thunderstorm days.
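
Runs of consecutive thunderstorm days can also be measured directly instead of being read off the heatmap. The sketch below labels each stretch of identical values with a run id and sums the flags within each run. The run_lengths helper and the toy series (including its dates) are invented for illustration and are not the real thunderstorm column.

```python
import pandas as pd

def run_lengths(flags):
    """Return the length of every run of consecutive 1s in a 0/1 Series."""
    flags = flags.astype(int)
    # The run id increases by one each time the value changes.
    run_id = (flags != flags.shift()).cumsum()
    lengths = flags.groupby(run_id).sum()
    return lengths[lengths > 0].tolist()  # runs of 0s sum to zero and are dropped

# Toy 0/1 flags for illustration only, not taken from weather_df.
toy = pd.Series([0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0],
                index=pd.date_range('2016-05-01', periods=11, freq='D'))
runs = run_lengths(toy)
print(runs)  # [3, 4]
```

On the real data, run_lengths(weather_df['thunderstorm']) would list every stormy spell, and max() of that list gives the longest one.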