These answers to the assignment questions have been prepared in a Jupyter (formerly IPython) notebook. This format was chosen for clarity of working, for reproducibility, because it should be suitable and useful for the target audience, and because it can be converted to HTML easily. In general, the code necessary for each question is included below that question, although some blocks of necessary code fall between questions.
In [1]:
# Imports
# Numeric Packages
from __future__ import division
import numpy as np
import pandas as pd
import scipy.stats as sps
# Plotting packages
import matplotlib.pyplot as plt
from matplotlib import ticker
import seaborn as sns
%matplotlib inline
sns.set_style('whitegrid')
sns.set_context('talk')
# Other
from datetime import datetime, timedelta
import statsmodels.api as sm
In [2]:
# Import turnstile data and convert datetime column to datetime python objects
df = pd.read_csv('turnstile_weather_v2.csv')
df['datetime'] = pd.to_datetime(df['datetime'])
The objective of this project, as described in the project details, is to figure out if more people ride the subway when it is raining versus when it is not raining.
To evaluate this question through statistical testing, a hypothesis test is used. To perform such a test, two opposing hypotheses are constructed: the null hypothesis and the alternative hypothesis. A hypothesis test considers a sample of data to determine whether there is sufficient evidence to reject the null hypothesis for the entire population from which the sample came; that is, whether the two underlying populations differ with statistical significance. The test is performed to a 'significance level', which sets the probability of a Type 1 error occurring, where a Type 1 error is the incorrect rejection of the null hypothesis; a false positive.
The null hypothesis is constructed to represent the status quo, where the treatment has no effect on the population; it is framed this way because the test controls only for Type 1 error. In the context of this assignment, the null hypothesis is that, on average, no more people ride the subway when it is raining than when it is not; i.e. 'ridership' is the population and 'rain' is the treatment.
i.e. $H_0: \alpha_{raining} \leq \alpha_{not\_raining}$
where $\alpha$ represents the average ridership of the subway.
Consequently, the alternative hypothesis is given by:
$H_1: \alpha_{raining} > \alpha_{not\_raining}$.
Because the hypothesis is framed so that we are only asking whether ridership increases during rain, a single-tailed test is required: only a test statistic showing an increase in ridership can lead us to reject the null hypothesis.
A significance level of 0.05 has been chosen for this test, as it is the most commonly used value.
The Mann-Whitney U test was chosen for the hypothesis testing as it is agnostic to the underlying distribution. The entry values are definitely not normally distributed, as illustrated below both graphically and using the Shapiro-Wilk test.
In [3]:
W, p = sps.shapiro(df.ENTRIESn_hourly.tolist())
print 'Shapiro-Wilk p-value (null hypothesis: data are normally distributed): {:.3f}'.format(p)
plt.figure(figsize=[8,5])
sns.distplot(df.ENTRIESn_hourly.tolist(), bins=np.arange(0,10001,500), kde=False)
plt.xlim(0,10000)
plt.yticks(np.arange(0,16001,4000))
plt.title('Histogram of Entry Count')
plt.show()
In [4]:
raindata = np.array(df[df.rain==1].ENTRIESn_hourly.tolist())
noraindata = np.array(df[df.rain==0].ENTRIESn_hourly.tolist())
U, p = sps.mannwhitneyu(raindata, noraindata)
print 'Results'
print '-------'
print 'p-value: {:.2f}'.format(p) # Note that p value calculated by scipy is single-tailed
print 'Mean with rain: {:.0f}'.format(raindata.mean())
print 'Mean without rain: {:.0f}'.format(noraindata.mean())
Given the p-value < 0.05, we can reject the null hypothesis that average ridership is no greater when it is raining; hence we can accept the alternative hypothesis that average ridership is greater when it rains.
In [5]:
# Because the hour '0' is actually the entries from 20:00 to 24:00, it makes more sense to label it 24 when plotting data
df.datetime -= timedelta(seconds=1)
df['day']= df.datetime.apply(lambda x: x.day)
df['hour'] = df.datetime.apply(lambda x: x.hour+1)
df['weekday'] = df.datetime.apply(lambda x: not bool(x.weekday()//5))
df['day_week'] = df.datetime.apply(lambda x: x.weekday())
# The dataset includes the Memorial Day public holiday, which makes more sense to classify as a weekend.
df.loc[df['day']==30,'weekday'] = False
The final model includes multiple components, two of which use the custom input stall_num2, described later:
ENTRIESn_hourly ~ rain:C(hour) + stall_num2*C(hour) + stall_num2*weekday
- stall_num2: includes the effect of the stall (unit) number;
- C(hour): the hour, included using dummy variables, since entries vary across hours in a highly nonlinear way;
- weekday: true/false value for whether it is a weekday;
- rain:C(hour): rain is included as the focus of the study, combined with the time of day;
- stall_num2 * C(hour): interaction between the stall number and the time of day; and
- stall_num2 * weekday: interaction between the stall number and whether it is a weekday.
Additionally, an intercept was included in the model; statsmodels appears to automatically create N-1 dummies when an intercept is included, as the small check below illustrates.
The variable stall_num2 was created as a substitute for using the UNIT column as a dummy variable. It was clear early on that using UNIT has a large impact on model accuracy, which is intuitive given that the relative popularity of stalls will be important for predicting their entry counts. However, with 240 stalls, a lot of dummy variables are created, which makes interactions between UNIT and other variables impractical. Additionally, so many dummy variables throw away information about the similar response of units of similar popularity.
stall_num2 was constructed by calculating the number of entries that passed through each stall as a proportion of total entries for the entire period of the data. These results were then normalised to have μ=0 and σ=1 (although they are not normally distributed) to make the solution matrix well behaved; i.e. to keep the condition number within normal bounds.
In [6]:
# Create a new column, stall_num2, representing the proportion of entries through a stall across the entire period.
total_patrons = df.ENTRIESn_hourly.sum()
# Dataframe with the units, and total passing through each unit across the time period
total_by_stall = pd.DataFrame(df.groupby('UNIT').ENTRIESn_hourly.sum())
# Create new variable = proportion of total entries
total_by_stall['stall_num2'] = total_by_stall.ENTRIESn_hourly/total_patrons
# Normalise by mean and standard deviation... fixes orders of magnitude errors in the output
total_stall_mean = total_by_stall.stall_num2.mean()
total_stall_stddev = total_by_stall.stall_num2.std()
total_by_stall.stall_num2 = (
(total_by_stall.stall_num2 - total_stall_mean)
/ total_stall_stddev
)
# Map the new variable back on the original dataframe
df['stall_num2'] = df.UNIT.apply(lambda x: total_by_stall.stall_num2[x])
The first step was to qualitatively assess which parameters may be useful for the model. This begins with looking at a list of the fields that have been captured, printed as follows.
In [7]:
for i in df.columns.tolist(): print i,
Some parameters, such as the station unit, the hour, and whether it is a weekday, are clearly going to be important. Additionally, rain needed to be included as a feature, as it is the focus of the overall investigation.
Beyond these parameters, I selected a set of numeric features which may have an impact on the result, and initially computed and plotted the correlations between features in an effort to screen out some multicollinearity prior to linear regression. The results of this correlation matrix indicated moderately strong correlations between some of the candidate features.
There are also a handful of weaker environmental correlations, such as precipitation and fog, rain and precipitation, and rain and temperature.
In [8]:
plt.figure(figsize=[8,6])
corr = df[['ENTRIESn_hourly',
'EXITSn_hourly',
'day_week', # Day of the week (0-6)
'weekday', # Whether it is a weekday or not
'day', # Day of the month
'hour', # In set [4, 8, 12, 16, 20, 24]
'fog',
'precipi',
'rain',
'tempi',
'wspdi']].corr()
sns.heatmap(corr)
plt.title('Correlation matrix between potential features')
plt.show()
The final selection of variables was determined through trial and error over rational combinations of variables. The station popularity was captured using the stall_num2 variable, since it appears to create a superior model compared with just using UNIT dummies, and because it allowed the creation of interaction terms. Combining the station with the hour was useful, which is intuitive since stations in the CBD will have the greatest patronage and the greatest entries in the evening peak hour. A similar logic applies to combining the station with whether it is a weekday.
Various combinations of environmental variables were trialled in the model, but none appeared to improve the model accuracy and they were subsequently discarded. Since rain is the focus of this study it was retained, but combined with the time of day. The predictive strength of the model was not really improved by the inclusion of a rain parameter; however, combining it with hour appears to improve its usefulness for providing insight, as will be discussed in section 4.
In [9]:
# Construct and fit the model
mod = sm.OLS.from_formula('ENTRIESn_hourly ~ rain:C(hour) + stall_num2*C(hour) + stall_num2*weekday', data=df)
res = mod.fit_regularized()
s = res.summary2()
Due to the use of several interaction terms, there are very few non-dummy features; their coefficients are illustrated below. Since stall_num2 is also used in several interactions, its individual coefficient is not very informative on its own.
In [10]:
s.tables[1].ix[['Intercept', 'stall_num2']]
Out[10]:
However, looking at all the interactions involving stall_num2 provides greater insight. Here we can see that activity is greater on weekdays, and greatest in the 16:00-20:00hrs block. It is lowest in the 00:00-04:00hrs block, which is not shown because the model dropped it in favour of the generic stall_num2 parameter; the other combinations are effectively referenced to the 00:00-04:00hrs block.
In [11]:
s.tables[1].ix[[i for i in s.tables[1].index if i[:5]=='stall']]
Out[11]:
Even more interesting are the coefficients for the rain interactions. These appear to indicate that patronage increases in the 08:00-12:00 and 16:00-20:00 blocks, corresponding to peak hour. Conversely, subway entries are lower at all other times. Could it be that subway usage increases if it is raining when people are travelling to and from work, but decreases otherwise because people prefer not to travel in the rain at all?
In [12]:
s.tables[1].ix[[i for i in s.tables[1].index if i[:4]=='rain']]
Out[12]:
In [13]:
print 'Model Coefficient of Determination (R-squared): {:.3f}'.format(res.rsquared)
The final R-squared value of 0.74 is much greater than that of earlier models that used UNIT as a dummy variable, which had R-squared values around 0.55.
To evaluate the goodness of fit, the residuals of the model have been examined in two ways. First, a histogram of the residuals is plotted below. The distribution of residuals is encouragingly symmetric. However, efforts to fit a normal distribution produced fits which underestimated the frequency at the mode and in the tails. Fitting a fat-tailed distribution, such as the Cauchy distribution below, was far more successful. I'm not sure if there's a good reason why it worked out this way (but would love to hear ideas as to why).
In [14]:
residuals = res.resid
sns.set_style('whitegrid')
sns.distplot(residuals,bins=np.arange(-10000,10001,200),
kde = False, # kde_kws={'kernel':'gau', 'gridsize':4000, 'bw':100},
fit=sps.cauchy, fit_kws={'gridsize':4000})
plt.xlim(-5000,5000)
plt.title('Distribution of Residuals\nwith fitted cauchy Distribution overlaid')
plt.show()
Secondly, a scatterplot of the residuals against the expected values is plotted. As expected, the largest residuals are associated with the cases where traffic is largest. In general the model appears to underpredict the traffic at the busiest units. Also clear on this plot is how individual stations form a 'streak' of points on the diagonal. This is because the model essentially makes one prediction per station per hour block for weekdays and one for weekends; the natural variation of the actual results around that prediction creates the run of points.
In [15]:
sns.set_style('whitegrid')
fig = plt.figure(figsize=[6,6])
plt.xlabel('ENTRIESn_hourly')
plt.ylabel('Residuals')
plt.scatter(df.ENTRIESn_hourly, residuals,
c=(df.stall_num2*total_stall_stddev+total_stall_mean)*100, # denormalise values
cmap='YlGnBu')
plt.colorbar(label='UNIT Relative Traffic (%)')
plt.plot([0,20000],[0,-20000], ls=':', c='0.7', lw=2) # Line to show negative prediction values (i.e. negative entries)
plt.xlim(xmin=0)
plt.ylim(-20000,25000)
plt.xticks(rotation='45')
plt.title('Model Residuals vs. Expected Value')
plt.show()
Additionally, note that the condition number for the final model is relatively low, hence there do not appear to be any collinearity issues with this model. By comparison, when UNIT was included as a dummy variable instead, the fit was weaker and the condition number was up around 220.
In [16]:
print 'Condition Number: {:.2f}'.format(res.condition_number)
In summary, it appears that this linear model has done a reasonable job of predicting ridership in this instance. Clearly some improvements are possible (like fixing the predictions of negative entries!), but given there will always be a degree of random variation, an R-squared value of 0.74 for a linear model seems quite reasonable. To be more confident of the model's suitability, the data should be split into training and test sets, as sketched below. Additionally, more data from extra months could prove beneficial.
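A minimal sketch of such a split follows; this was not performed in the original analysis, and it assumes the df and formula defined above, using a plain OLS fit rather than fit_regularized for simplicity:
# Hold out a random ~20% of rows, fit on the rest, and compute the
# out-of-sample R-squared on the held-out data.
np.random.seed(42)  # arbitrary seed, for reproducibility
mask = np.random.rand(len(df)) < 0.8
train, test = df[mask], df[~mask]
res_cv = sm.OLS.from_formula(
    'ENTRIESn_hourly ~ rain:C(hour) + stall_num2*C(hour) + stall_num2*weekday',
    data=train).fit()
pred = res_cv.predict(test)
ss_res = ((test.ENTRIESn_hourly - pred)**2).sum()
ss_tot = ((test.ENTRIESn_hourly - test.ENTRIESn_hourly.mean())**2).sum()
print 'Out-of-sample R-squared: {:.3f}'.format(1 - ss_res/ss_tot)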
In [17]:
sns.set_style('white')
sns.set_context('talk')
mydf = df.copy()
mydf['rain'] = mydf.rain.apply(lambda x: 'Raining' if x else 'Not Raining')
raindata = df[df.rain==1].ENTRIESn_hourly.tolist()
noraindata = df[df.rain==0].ENTRIESn_hourly.tolist()
fig = plt.figure(figsize=[9,6])
ax = fig.add_subplot(111)
plt.hist([raindata,noraindata],
normed=True,
bins=np.arange(0,11500,1000),
color=['dodgerblue', 'indianred'],
label=['Raining', 'Not Raining'],
align='right')
plt.legend()
sns.despine(left=True, bottom=True)
# http://stackoverflow.com/questions/9767241/setting-a-relative-frequency-in-a-matplotlib-histogram
def adjust_y_axis(x, pos):
return '{:.0%}'.format(x * 1000)
ax.yaxis.set_major_formatter(ticker.FuncFormatter(adjust_y_axis))
plt.title('Histogram of Subway Entries per 4 hour Block per Gate')
plt.ylabel('Proportion of Total Entries')
plt.xlim(500,10500)
plt.xticks(np.arange(1000,10001,1000))
plt.show()
Once both histograms are normalised, the distributions of subway entries when raining and when not raining are almost identical. No useful differentiation can be made between the two datasets here.
In [18]:
# Plot to illustrate the average riders per time block for each weekday.
# First we need to sum up the entries per hour (category) per weekday across all units.
# This is done for every day, whilst retaining the 'day_week' field for convenience. reset_index puts it back into a standard dataframe
# For the sake of illustration, memorial day has been excluded since it would incorrectly characterise the Monday ridership
mydf = df.copy()
mydf = mydf[mydf.day!=30].pivot_table(values='ENTRIESn_hourly', index=['day','day_week','hour'], aggfunc=np.sum).reset_index()
# The second pivot takes the daily summed data, and finds the mean for each weekday/hour block.
mydf = mydf.pivot_table(values='ENTRIESn_hourly', index='hour', columns='day_week', aggfunc=np.mean)
# Generate plot using the seaborn heatmap function.
fig = plt.figure(figsize=[9,6])
timelabels = ['Midnight - 4am','4am - 8am','8am - 12pm','12pm - 4pm','4pm - 8pm','8pm - Midnight']
weekdays = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
plot = sns.heatmap(mydf, yticklabels=timelabels, xticklabels=weekdays)
plt.xlabel('') # The axis ticks are descriptive enough to negate the need for axis labels
plt.ylabel('')
plot.tick_params(labelsize=14) # Make stuff bigger!
# Make heatmap ticks bigger http://stackoverflow.com/questions/27832054/change-tick-size-on-colorbar-of-seaborn-heatmap
cax = plt.gcf().axes[-1]
cax.tick_params(labelsize=14)
plt.title('Daily NYC Subway Ridership\n(Data from May 2011)', fontsize=20)
plt.show()
This plot illustrates the variation in subway ridership across the week. Ridership is very low in the very early morning, and there are two bold stripes where peak hour occurs at 8am-12pm and 4pm-8pm. The weekend is clearly far less busy; in fact, on weekends the 4am-8am block is the quietest of the day!
The statistical test performed in section 1 indicates that, on average, more people ride the subway when it is raining. However, the relationship between rain and ridership is possibly more complex: the results described in section 2.4 indicate that the impact of rain on ridership may depend on the time of day, with more people using the subway when it rains during peak hour, and fewer otherwise.
The statistical tests do indicate that on average ridership is greater when it rains. However, the difference in means is small. With linear regression, the effect of rain could be considered with other variables controlled for. When rain was included without an interaction with the time of day, there was no statistically significant result for the effect of rain; controlling for the time of day yields the more detailed result described previously. Although the p-values for each of the coefficients described previously are small, indicating statistical significance, the effect of rain is a second-order effect on patronage. In fact, the addition of rain with its time-of-day interaction didn't improve the accuracy of the model by an appreciable amount; if it weren't the focus of this study the variable would have been dropped.
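For reference, a minimal sketch of the plain-rain model described above; the exact formula used in that trial isn't recorded, so this is illustrative, assuming the df prepared earlier:
# Fit rain as a plain main effect (no hour interaction) and inspect the
# p-value of its coefficient.
res_plain = sm.OLS.from_formula(
    'ENTRIESn_hourly ~ rain + stall_num2*C(hour) + stall_num2*weekday',
    data=df).fit()
print 'p-value for the rain coefficient: {:.3f}'.format(res_plain.pvalues['rain'])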
The dataset was too short in duration, and therefore had too few rainy days, to draw detailed conclusions. Given this is publicly available data, and once the effort to wrangle the data has been made, it would probably be sensible to run the analyses over far larger timespans. Of course, running the analysis over many months would require adding variables to control for season/time of year, as sketched below.
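A hypothetical extension for such a multi-month dataset might control for season with a month factor; this is illustrative only, since the current data covers a single month and the factor would be degenerate here:
# Derive a month column and add it to the formula as a categorical control.
# (With only May 2011 in the data, C(month) has one level and cannot be fit.)
df['month'] = df.datetime.apply(lambda x: x.month)
seasonal_formula = ('ENTRIESn_hourly ~ rain:C(hour) + stall_num2*C(hour)'
                    ' + stall_num2*weekday + C(month)')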
The linear regression model worked better than expected, finishing with an R-squared value of 0.74. However, it has curious features, such as predicting negative entries in some cases, which is clearly impossible. I imagine there are superior approaches to modelling the system; however, this model can't be too far off the ultimate achievable accuracy, since the natural daily variation can't be captured by any (reasonable) model. For instance, when the Yankees play, the nearest station will see a large increase in patronage, but that won't be captured by any model which doesn't include the team's playing schedule; the same applies to all other large events around NYC.
One aspect of using a correctly constructed linear model is the simplicity of understanding the relative effect of each parameter, since each one is described by a single coefficient. More complex models may not provide the same level of simple insight that comparing coefficients can provide.
This is more to do with the linear model than the dataset: I tried eliminating the negative predictions by fitting the model to the square root of the entries, and then squaring the output predictions, as sketched below. This was successful in eliminating the negative predictions, and only reduced the predictive capacity from an R-squared value of 0.74 to approximately 0.71. Although this is one approach to eliminating the negative values (which didn't sit well with me, and which I wouldn't want to include in any published model as it would be ridiculed by the public), I'm curious to know if there are any better approaches to keeping the predictions positive.
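A minimal sketch of that transform, assuming the df and formula above (again using a plain OLS fit for simplicity):
# Fit the model to the square root of the entries; squaring the fitted
# values guarantees the predictions can never be negative.
df['sqrt_entries'] = np.sqrt(df.ENTRIESn_hourly)
res_sqrt = sm.OLS.from_formula(
    'sqrt_entries ~ rain:C(hour) + stall_num2*C(hour) + stall_num2*weekday',
    data=df).fit()
predictions = res_sqrt.fittedvalues**2  # always non-negative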