This project look at the NYC Subway data and figure out if more people ride the subway when it is raining versus when it is not raining.
The data can be found at:
In [1]:
%matplotlib inline
import numpy as np
import pandas
import matplotlib.pyplot as plt
try:
import seaborn as sns
sns.set_palette("deep", desat=.6)
sns.set_context(rc={"figure.figsize": (10, 6)})
except ImportError:
sns = None
In [2]:
df = pandas.read_csv('turnstile_data_master_with_weather.csv')
I split up the data into two category, with rain and without rain. I then summarize and plot the distribution for this two samples.
In [3]:
gb = df.groupby('rain')
gb.get_group(1).ENTRIESn_hourly
Out[3]:
In [4]:
with_rain_df = df['ENTRIESn_hourly'][df['rain'] == 1]
without_rain_df = df['ENTRIESn_hourly'][df['rain'] == 0]
# Let's take a look at the summary of these two samples
summary = pandas.concat([with_rain_df.describe(), without_rain_df.describe()], axis=1)
summary.columns = ['With rain summary', 'Without rain summary']
print summary
In [5]:
plot_description = ('This is the histogram of the ENTRIESn_hourly when it is raining vs when it is not raining.\n'
'In both instances, they are not normally distributed and are positively skewed.')
figure, plot_list = plt.subplots(1, 2, sharey=True, figsize=(16, 8))
figure.subplots_adjust(wspace=0.08)
figure.text(0.08, 0.5, 'Frequency', va='center', rotation='vertical')
figure.text(0.5, 0.95, 'Histogram of ENTRIESn_hourly', ha='center')
figure.text(0.5, 0.01, plot_description, ha='center')
df_list = [(with_rain_df, 'Rain'), (without_rain_df, 'No rain')]
for idx, plot in enumerate(plot_list):
hist, bins = np.histogram(df_list[idx][0], bins=60)
width = 0.8 * (bins[1] - bins[0])
center = (bins[:-1] + bins[1:]) / 2
plot.bar(center, hist, align='center', width=width)
plot.set_title(df_list[idx][1])
plot.set_xlabel('ENTRIESn_hourly')
I perform Mann-Whitney U test on the two samples. We will be performing a two-tail P value with .05 p-critical value, with the following hypothesis:
The null hypothesis is:
The distributions of both groups are identical, so that there is a 50% probability that an observation from a value randomly selected from one population exceeds an observation randomly selected from the other population.
P(x > y) = 0.5
Alternative hypothesis:
The distributions of both groups are not identical, the probability that an observation from a value randomly selected from one population exceeds an observation randomly selected from the other population is not 50%.
P(x > y) != 0.5
I am using this test because the samples is non-normal. Generally, we use Mann-Whitney U test, if it follow several assumptions:
Below the computed result:
In [6]:
import scipy.stats
U, p = scipy.stats.mannwhitneyu(with_rain_df, without_rain_df)
print "U :", U
print "two-tail p :", p * 2 # the module returns one-tail value, we need to multiply by 2 to get two-tail value
We can see that the p < .05.
Now let's build a model so we can make predictions.
To compute the coefficients theta and produce prediction for ENTRIESn_hourly in the regression model I choose to use OLS using Statsmodels.
I added 3 helper function here:
In [7]:
import statsmodels.api as sm
def linear_regression(features, values):
"""
Perform linear regression given a data set with an arbitrary number of features.
"""
features = sm.add_constant(features)
model = sm.OLS(values, features)
results = model.fit()
intercept, params = results.params[0], results.params[1:]
return intercept, params
def compute_r_squared(data, predictions):
'''
Compute r_squared given a data set and the predictions.
'''
data_mean = np.mean(data)
ss_res = ((data - predictions)**2).sum()
ss_tot = ((data - data_mean)**2).sum()
r_squared = 1 - (ss_res/ss_tot)
return r_squared
def calculate_predictions(dataframe, features):
'''
Generate prediction given dataframe and features.
'''
# Add UNIT to features using dummy variables
dummy_units = pandas.get_dummies(dataframe['UNIT'], prefix='unit')
features = features.join(dummy_units)
# Values
values = dataframe['ENTRIESn_hourly']
# Get the numpy arrays
#features_array = features.values
#values_array = values.values
# Perform linear regression
intercept, params = linear_regression(features, values)
predictions = intercept + np.dot(features, params)
return predictions, intercept, params
I will try to compute the prediction for several possible features. Please check the code below for the features. I aim to have R^2 score of 0.40 or better.
The features of interest are listed below along with the reason:
I will create combinations out of these feature, compute the predictions and get the R^2 value.
In addition to this, I have added dummy variables 'UNIT' for the features.
In [8]:
from itertools import combinations
# Build our features list combination
# feature_of_interest = ['rain', 'meantempi', 'fog', 'meanwindspdi', 'precipi', 'Hour']
# features_list = []
# for count in range(1, len(feature_of_interest) + 1):
# for item in combinations(feature_of_interest, count):
# features_list.append(list(item))
features_list = [['rain', 'meantempi', 'fog', 'meanwindspdi', 'precipi', 'Hour']]
best_score = {'score': 0}
for features in features_list:
predictions, intercept, params = calculate_predictions(df, df[features])
r2_score = compute_r_squared(df['ENTRIESn_hourly'], predictions)
# Save the best score
if r2_score > best_score['score']:
best_score['score'] = r2_score
best_score['feature_variables'] = features
best_score['intercept'] = intercept
best_score['param'] = params
best_score['predictions_values'] = predictions
print "\n\n"
print "The highest score :", best_score['score']
print "Feature variables for the highest score:", ', '.join(best_score['feature_variables'])
print "Coefficient for the highest score"
print " Intercept:", best_score['intercept']
print " Non-dummy Parameters:", best_score['param'][:len(best_score['feature_variables'])]
In [9]:
figure = plt.figure()
plt.title('Histogram of residual value')
plt.xlabel('Residual value')
plt.ylabel('Relative frequency')
(df['ENTRIESn_hourly'] - best_score['predictions_values']).hist(bins=100)
plot_description = 'Residual value histogram of the model. The plot shows long tails.'
figure.text(0.5, 0, plot_description, ha='center')
Out[9]:
The histogram has long tails, which suggests that there are some very large residuals. To make sure I plot normal probablity plot.
In [10]:
import scipy.stats
figure = plt.figure()
prob_plot = scipy.stats.probplot(df['ENTRIESn_hourly'] - best_score['predictions_values'], plot=plt)
plot_description = ('Normal probability plot of the residual values.\n'
'The curve which starts below the normal line,\n'
'bends to follow it, and ends above it indicates long tails.')
figure.text(0.5, -0.02, plot_description, ha='center')
Out[10]:
The plot confirms that we have long tails problem which means we are seeing more variance than we would expect in a normal distribution.
In fact if we simply plot the residual values we got:
In [11]:
figure = plt.figure()
plt.title('Residual value plot')
plt.xlabel('Number of data')
plt.ylabel('Residual value (data - prediction)')
plt.plot(df['ENTRIESn_hourly'] - best_score['predictions_values'])
plot_description = 'The plot shows that the residuals follow a cyclical pattern.'
figure.text(0.5, 0, plot_description, ha='center')
Out[11]:
It shows that there are some non-linearity in the data. And this should be addressed by designing a non linear model.
In conclusion, we achieve a R^2 value that we set (> 0.40), but on further inspection we find out that the linear model is not appropriate to predict ridership.
Let's use some visualization to ask get some more answer from the data.
First, let's see ridership by day of week.
In [12]:
import datetime
from ggplot import *
df_with_weekday = df.copy()
df_with_weekday['weekday'] = df_with_weekday['DATEn'].apply(lambda x: datetime.datetime.strptime(x, '%Y-%m-%d').weekday())
days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
df_with_weekday = df_with_weekday[['weekday', 'ENTRIESn_hourly', 'EXITSn_hourly']].groupby('weekday').sum().reset_index()
plot = ggplot(aes(x='weekday'), data=df_with_weekday) + \
geom_bar(aes(x='weekday', y='ENTRIESn_hourly'), stat='bar') + \
xlab('Day of the week') + \
ylab('Total entries') + \
scale_x_continuous(breaks=range(7), labels=days) + \
ggtitle('Ridership by day of the week')
print plot
Now, let's try to see the top 20 station with the highest total entry.
In [13]:
unit_entries_df = df[['UNIT', 'ENTRIESn_hourly']].groupby(['UNIT']).sum().sort(['ENTRIESn_hourly'], ascending =[0]).reset_index()[:20]
plot= ggplot(aes(x='UNIT', y='ENTRIESn_hourly'), data=unit_entries_df) + \
geom_bar(aes(x='UNIT', weight='ENTRIESn_hourly'), stat='bar') + \
xlab('Station') + \
ylab('Total entries') + \
ggtitle('Top 20 ridership per station')
print plot
Another visualization. This time using matplotlib. Showing total ridership by week day and hour.
In [14]:
df_with_weekday = df.copy()
df_with_weekday['weekday'] = df_with_weekday['DATEn'].apply(lambda x: datetime.datetime.strptime(x, '%Y-%m-%d').weekday())
weekday = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
figure, plot_list = plt.subplots(1, 7, sharey=True, figsize=(16, 8))
figure.text(0.04, 0.5, 'Total entries', va='center', rotation='vertical')
figure.text(0.5, 0.95, 'Ridership throughout the day.', ha='center')
figure.subplots_adjust(wspace=0.08)
for day, item in enumerate(plot_list):
total_per_hour = []
for i in range(23):
total_per_hour.append(df_with_weekday['ENTRIESn_hourly'][(df_with_weekday['weekday'] == day) & (df_with_weekday['Hour'] == i)].sum())
item.plot(range(23), total_per_hour)
item.set_xlim(0, 23)
item.set_xlabel('Time of day')
item.set_title(weekday[day])
plot_description = ('This plot shows ridership throughout the day. We see more ridership on weekdays vs on weekends.\n'
'The peak ridership are around 11AM and 8PM on weekdays and around 3PM and 8PM on weekends.')
figure.text(0.5, 0.01, plot_description, ha='center')
Out[14]:
Looks like there is more entry on weekdays versus on weekends. And it is pretty similar routine on weekdays. A peak at around 10AM and another one around 8PM.
Our statistical analysis shows that there is significant difference in ridership when it is raining vs when it is not raining.
We develop a model to predict the ridership. We used linear model (OLS) and achieve R^2 value that we set as the goal. However after analyzing the residual, we find that the model is not appropriate to predict the ridership.
The dataset although big (131951 records) only covers 30 days of data. The data was collected from 2011-05-01 to 2011-05-30 This is probably not enough. It will be interesting to see data from multiple months. If for instance on April it rains a lot, can we see that the ridership on April is significantly different then the ridership on May.
We used linear regression method to make our models. But the resulting model does not seems to be appropriate to predict the ridership of NYC subway. We need to address the non-linearity in the data.
Initially I thought maybe if I remove the Hour from the model features list (looking at figure 3.2 I conclude that this is what causes the cycle) the model will work, but I got the same result.