Before running any statistical tests or building a prediction model, let's explore the characteristics of the ridership and weather data. I'm examining it from the following perspectives. Please note that I follow the nomenclature of the data set and refer to hourly entries and exits, but the numbers actually represent 4-hour differences from the previous regular reading.
In [16]:
#libraries for analytical computing
import numpy as np
import pandas as pd
import datetime as d
#libraries for plotting
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
#libraries for inline plotting
#from IPython.html.widgets import interact
#import IPython
%matplotlib inline
#initialize global turnstile dataframe for reuse
df = pd.read_csv('turnstile_weather_v2.csv', parse_dates=['datetime'])
#print df.describe()
You can uncomment the line "print df.describe()" above to see the count, mean, std, min, max, etc. of each numeric column. I'm more interested in the key stats of both the numerical and categorical columns.
From the queries below, we can see that turnstile data is provided for every day of May 2011. It consists of a total of 42,649 rows. The turnstiles were read every 4 hours (12am, 4am, 8am, 12pm, 4pm, 8pm) for a total of 207 stations and 240 remote units. Interestingly, total entries for May 2011 are almost double the total exits. Deeper investigation might be needed, but given the scope of this project I'll be using hourly entries (ENTRIESn_hourly) as the indicator of ridership. Compared with New York's average weather, May 2011 had slightly less rainfall, while its temperature stayed within the normal range.
In [17]:
print 'Row count: {:,}'.format(len(df))
date_unique = df['DATEn'].unique()
print 'Date range of data: {0} to {1}'.format(date_unique.min(), date_unique.max())
print 'Days of data: {}'.format(len(date_unique))
list_hours = list(df['hour'].unique())
list_hours.sort()
print 'Sample hours: {}'.format(list_hours)
print 'Number of stations: {}'.format(len(df['station'].unique()))
print 'Number of remotes: {}'.format(len(df['UNIT'].unique()))
print 'Total entries: {:,.0f}'.format(df['ENTRIESn_hourly'].sum())
print 'Total exits: {:,.0f}'.format(df['EXITSn_hourly'].sum())
print 'Rain days: {}'.format(df.groupby('DATEn')['rain'].max().sum())
print 'Mean precipitation: {:.5f} (inches)'.format(np.mean(df['meanprecipi']))
print 'Mean temperature: {:.5f} (F)'.format(np.mean(df['meantempi']))
First I'm plotting histograms of the frequencies of the entry/exit readings. From the histograms we can see both entries and exits are right-skewed (the mass of the distribution is concentrated on the left), and for both the median is lower than the mean. This may indicate that overall turnstile planning did not match commuter patterns; a more normal (bell-curve) distribution would be preferable, as it would indicate more even usage. In section 1 I will double-check the normality with a statistical test.
I'm trying to make the code as reusable as possible. Below are two functions used to calculate stats for and plot comparison histograms. Going forward I'll omit explanations like this.
In [18]:
'''
Display count, total, median and mean for 2 side-by-side factored histograms.
'''
def print_stats(title0, array0, title1, array1):
    print "{0}: Count = {1:,.0f}, Total = {2:,.0f}, Median = {3:,.0f}, Mean = {4:,.2f}" \
        .format(title0, len(array0), np.sum(array0), np.median(array0), np.mean(array0))
    print "{0}: Count = {1:,.0f}, Total = {2:,.0f}, Median = {3:,.0f}, Mean = {4:,.2f}" \
        .format(title1, len(array1), np.sum(array1), np.median(array1), np.mean(array1))
In [19]:
'''
Plot 2 side-by-side factored histograms. I'm making them side-by-side
for visual clarity.
'''
def plot_histograms(title0, array0, title1, array1, bin_size=None, normed=False):
    f, axes = plt.subplots(1, 2, figsize=(25, 9), sharex=True, sharey=True)
    max_data = np.r_[array0, array1].max()
    if bin_size is None:
        bin_size = int(np.log10(max_data)) * 20
    bins = np.linspace(0, max_data, bin_size)
    if normed:
        ylabel = "Probability density"
    else:
        ylabel = "Frequency"
    axes[0].set(xlabel="{0}, Max: {1:,.0f}, Bins: {2}".format(title0, np.max(array0), bin_size),
                ylabel=ylabel, title="Histogram of {}".format(title0))
    axes[0].hist(array0, bins=bins, normed=normed)
    axes[1].set(xlabel="{0}, Max: {1:,.0f}, Bins: {2}".format(title1, np.max(array1), bin_size),
                ylabel=ylabel, title="Histogram of {}".format(title1))
    axes[1].hist(array1, bins=bins, normed=normed, color='green')
In [20]:
ent_hr = list(df["ENTRIESn_hourly"])
ext_hr = list(df["EXITSn_hourly"])
print_stats('Hourly entries', ent_hr, 'Hourly exits', ext_hr)
plot_histograms('Hourly entries', ent_hr, 'Hourly exits', ext_hr)
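As a rough preview of the normality check planned for section 1, here is a sketch using SciPy's D'Agostino-Pearson test (scipy.stats.normaltest). This is only an illustration of the kind of check I have in mind; the actual test is chosen and interpreted in section 1.
In [ ]:
#Sketch only: a possible normality check (D'Agostino-Pearson) on hourly entries
from scipy import stats
k2, p_value = stats.normaltest(df['ENTRIESn_hourly'])
print 'Normality test statistic: {:,.2f}, p-value: {:.4g}'.format(k2, p_value)
#A tiny p-value (below e.g. 0.05) would suggest the data is not normally distributed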
If we take "rain" into consideration, is the ridership distribution still right-skewed? Here I'm plotting the histograms for rainy and non-rainy days. Again, both are right-skewed. From the medians and means we can tell ridership was a bit higher when it rained. But what do the differences mean? Did rain really affect ridership? I'll test the statistical significance in section 1.
In [21]:
ent_hr_rain = list(df[df['rain'] == 1]['ENTRIESn_hourly'])
ent_hr_no_rain = list(df[df['rain'] == 0]['ENTRIESn_hourly'])
print_stats('Hourly entries - rain', ent_hr_rain, 'Hourly entries - no rain', ent_hr_no_rain)
plot_histograms('Hourly entries - rain', ent_hr_rain, 'Hourly entries - no rain', ent_hr_no_rain)
plot_histograms('Hourly entries - rain', ent_hr_rain, 'Hourly entries - no rain', ent_hr_no_rain, normed=True)
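Similarly, here is a sketch of the kind of significance test I plan to run in section 1 for the rain vs. no-rain comparison, a Mann-Whitney U test (scipy.stats.mannwhitneyu), which doesn't assume normality. The exact test, its direction and its interpretation are handled in section 1.
In [ ]:
#Sketch only: compare rain vs. no-rain entries without assuming normality
#Note: older SciPy versions return a one-sided p-value by default
from scipy import stats
u, p_value = stats.mannwhitneyu(ent_hr_rain, ent_hr_no_rain)
print 'Mann-Whitney U = {:,.0f}, p-value = {:.4g}'.format(u, p_value)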
Since there were only 10 rainy days in May 2011, a better comparison when factoring in rain is the average ridership in each 4-hour window. The chart below shows the average entries for each 4-hour window with and without rain. It seems that when it rained, more people chose to take the subway, but the overall pattern doesn't look very different between rainy and dry periods. Again, I'll test the statistical significance in section 1.
In [22]:
'''
plot the bar chart of average entries of every 4-hour for 2 categorical factors
'''
def plot_4_hour_average_entries_factor(data, x, y, series, factor, kind):
    g = sns.factorplot(x, y, series, data=data, kind=kind, size=6)
    g.set_axis_labels('Hour', 'Mean of 4-hour entries')
    titles = ['Average 4-hour entries by {}'.format(factor)]
    for ax, title in zip(g.axes.flat, titles):
        ax.set_title(title)
        ax.set_ylim(bottom=0)
In [23]:
#copy so that adding the 'Rained' column doesn't trigger a SettingWithCopyWarning
hour_rain_group = df[['hour', 'ENTRIESn_hourly', 'rain']].copy()
hour_rain_group['Rained'] = np.where(hour_rain_group['rain'] == 1, '1 Yes', '2 No')
plot_4_hour_average_entries_factor(hour_rain_group, 'hour', 'ENTRIESn_hourly', 'Rained', 'Rain', 'bar')
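For reference, the averages behind the chart above can also be computed directly with a groupby; this is just a sanity check on the plotted values.
In [ ]:
#Sanity check: mean entries per 4-hour block, split by rain (same numbers as the chart above)
print df.groupby(['hour', 'rain'])['ENTRIESn_hourly'].mean().unstack('rain')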
Besides rain, another indicator could be the weather condition. I'm taking a quick look at conds because the alternative would be to include multiple columns like 'precipi', 'pressurei', 'tempi' and 'wspdi' all together. Plotting average ridership by weather condition, we can see the pattern varies more. This can be another candidate prediction feature.
In [24]:
hour_conds_group = df[['hour', 'ENTRIESn_hourly', 'conds']]
plot_4_hour_average_entries_factor(hour_conds_group, 'hour', 'ENTRIESn_hourly', 'conds', 'Weather Condition', 'point')
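If conds does become a prediction feature, one simple way to feed a categorical column into a regression model is dummy (one-hot) encoding. Below is a sketch using pandas.get_dummies; whether and how conds is actually used is decided in the prediction section, and the variable name here is my own.
In [ ]:
#Sketch only: dummy (one-hot) encode the weather condition for possible use as a model feature
conds_dummies = pd.get_dummies(df['conds'], prefix='conds')
print conds_dummies.columns.tolist()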
Throughout May 2011, did ridership vary by date? From the chart below we can tell it followed a repeating pattern that seems aligned with the day of the week; I labeled the date of every Sunday. It looks like the NYC Subway had lower ridership on Fridays and Saturdays. One exception is Memorial Day (05-30-11), which shows an unusually low average. We may consider using holidays as a candidate prediction feature. There were only 10 rainy days in May 2011, so it wouldn't make much sense to split the dates into rain vs. no rain.
In [25]:
'''
plot the bar chart of average entries of every 4-hour for a categorical factor
'''
def plot_4_hour_average_entries_bar(x, y, factor, xticks=None, xticklabels=None):
    f, ax = plt.subplots()
    ax = sns.barplot(x=x, y=y)
    ax.set_xlabel(factor)
    ax.set_ylabel('Mean of 4-hour entries')
    ax.set_title('Average 4-hour entries by {}'.format(factor))
    if xticks is not None:
        ax.set_xticks(xticks)
    if xticklabels is not None:
        ax.set_xticklabels(xticklabels, rotation=35, ha='right')
In [26]:
sundays_dates = df[df['day_week'] == 0]['DATEn'].unique()
sundays_count = len(sundays_dates)
first_sunday_index = (d.datetime.strptime(sundays_dates[0], '%m-%d-%y') - d.datetime(2011, 5, 1, 0, 0, 0)).days
sunday_indices = np.linspace(first_sunday_index, 7 * (sundays_count-1) + first_sunday_index, sundays_count, dtype=np.float64)
date_group = df.groupby(['DATEn'], as_index=False)['DATEn', 'ENTRIESn_hourly'].aggregate(np.mean)
plot_4_hour_average_entries_bar(date_group['DATEn'], date_group['ENTRIESn_hourly'], 'Date', xticks=sunday_indices, xticklabels=sundays_dates)
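Since Memorial Day (05-30-11) is the only holiday in this data set, a holiday feature would reduce to a single flag for that date. Below is a sketch of such a flag; the name is_holiday is my own and the feature is only a candidate at this point.
In [ ]:
#Sketch only: a holiday indicator; in May 2011 this flags Memorial Day (05-30-11) only
is_holiday = (df['DATEn'] == '05-30-11').astype(int)
print 'Rows falling on a holiday: {:,}'.format(int(is_holiday.sum()))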
Let me group the entries by day of the week to double-check the repeating pattern. From the plot we can confirm that, yes, the NYC Subway had lower ridership on Fridays and Saturdays. We may consider using day of the week as a candidate prediction feature.
In [27]:
day_week_group = df.groupby(['day_week'])[['day_week', 'ENTRIESn_hourly']].aggregate(np.mean)
week_days = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']
plot_4_hour_average_entries_bar(day_week_group['day_week'], day_week_group['ENTRIESn_hourly'], 'Day of the Week', xticklabels=week_days)
From the histograms above we already know that in reality many units sat idle. Let's check how usage varies from the busiest unit to the least busy one. Below is the plot of average 4-hour entries from the most used unit to the least used unit. Some units are much busier than others, so unit may be a candidate prediction feature.
In [28]:
'''
Plot basic connected data points sorted from high to low
'''
def plot_4_hour_average_entries_desc(data, factor):
    #Series.order (older pandas) sorts the values; sort_values in newer versions
    data = data.order(ascending=False)
    data_len = len(data)
    f, ax = plt.subplots()
    plt.plot(data.values)
    ax.set_xlabel('Total of {0} {1}s (index: 0 - {2})'.format(data_len, factor, data_len - 1))
    ax.set_ylabel('Mean of 4-hour entries')
    ax.set_title('Average 4-hour entries by {}, from high to low'.format(factor))
In [29]:
unit_average = df.groupby(['UNIT'], as_index=False)['ENTRIESn_hourly'].aggregate(np.mean)['ENTRIESn_hourly']
plot_4_hour_average_entries_desc(unit_average, 'Unit')
Let's also double-check how usage varies from the busiest station to the least busy one. Below is the plot of average 4-hour entries from the most used station to the least used station. Some stations are much busier than others, so station may also be a candidate prediction feature.
In [30]:
station_average = df.groupby(['station'], as_index=False)['ENTRIESn_hourly'].aggregate(np.mean)['ENTRIESn_hourly']
plot_4_hour_average_entries_desc(station_average, 'Station')
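Finally, if unit or station makes it into the model, it would also need dummy encoding, and with 240 units that means 240 extra feature columns, which matters for model size. Below is a sketch; the variable name is my own and this is only a candidate feature for now.
In [ ]:
#Sketch only: dummy-encode UNIT; this creates one feature column per remote unit
unit_dummies = pd.get_dummies(df['UNIT'], prefix='unit')
print 'Dummy feature columns created: {}'.format(unit_dummies.shape[1])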