In [1]:
%matplotlib inline
import numpy
import pandas
import matplotlib.pyplot as plt
import statsmodels.api as sm
import scipy
import scipy.stats
from sklearn.metrics import r2_score
from sklearn import datasets, linear_model
from ggplot import *
plt.rcParams['figure.figsize'] = (16.0, 8.0)
In [2]:
path = 'turnstile_data_master_with_weather.csv'
turnstile_weather = pandas.read_csv(path)
Given random draws x from the population of people that ride the subuway when it rains and y from the population of people that ride the subway when it does not rain, the standard two-tailed hypotheses are as follows:
$H0: P(x \gt y) = 0.5$
$H1: P(x \gt y) \neq 0.5$
The test used is Mann-Whitney U-statistic, and a two-tail P value is used.
The p-critical value is 0.05.
In [30]:
print ggplot(turnstile_weather, aes(x='ENTRIESn_hourly')) +\
geom_histogram(binwidth=1000,position="identity") +\
scale_x_continuous(breaks=range(0, 60001, 10000), labels = range(0, 60001, 10000))+\
facet_grid("rain")+\
ggtitle('Distribution of ENTRIESn_hourly in non-rainy days (0.0) and rainy days(1.0)')
In [53]:
### YOUR CODE HERE ###
df_with_rain = turnstile_weather[turnstile_weather['rain']==1]
df_without_rain = turnstile_weather[turnstile_weather['rain']==0]
with_rain_mean = df_with_rain['ENTRIESn_hourly'].mean()
without_rain_mean = df_without_rain['ENTRIESn_hourly'].mean()
U, p = scipy.stats.mannwhitneyu(df_with_rain['ENTRIESn_hourly'], df_without_rain['ENTRIESn_hourly'])
print "mean_with_rain=%f mean_without_rain=%f p-value=%.8f" %(with_rain_mean, without_rain_mean, p*2)
The p-value is below the significance value ($\alpha = 0.05$). Thus, the results obtained reject the null hipothesis with a significance level of 0.05. This means that the number of passengers in rainy days is different than the number observed in non-rainy days.
The following statistics support our test:
In [59]:
print "Descriptive statistics for the ridership in rainy days"
df_with_rain['ENTRIESn_hourly'].describe()
Out[59]:
In [60]:
print "Descriptive statistics for the ridership in non-rainy days"
df_without_rain['ENTRIESn_hourly'].describe()
Out[60]:
In [61]:
def linear_regression(features, values):
"""
Perform linear regression given a data set with an arbitrary number of features.
This can be the same code as in the lesson #3 exercise.
"""
regr = linear_model.LinearRegression()
# Train the model using the training sets
regr.fit(features, values)
return regr.intercept_, regr.coef_
def predictions(dataframe):
'''
The NYC turnstile data is stored in a pandas dataframe called weather_turnstile.
Using the information stored in the dataframe, let's predict the ridership of
the NYC subway using linear regression with ordinary least squares.
You can download the complete turnstile weather dataframe here:
https://www.dropbox.com/s/meyki2wl9xfa7yk/turnstile_data_master_with_weather.csv
Your prediction should have a R^2 value of 0.40 or better.
You need to experiment using various input features contained in the dataframe.
We recommend that you don't use the EXITSn_hourly feature as an input to the
linear model because we cannot use it as a predictor: we cannot use exits
counts as a way to predict entry counts.
Note: Due to the memory and CPU limitation of our Amazon EC2 instance, we will
give you a random subet (~10%) of the data contained in
turnstile_data_master_with_weather.csv. You are encouraged to experiment with
this exercise on your own computer, locally. If you do, you may want to complete Exercise
8 using gradient descent, or limit your number of features to 10 or so, since ordinary
least squares can be very slow for a large number of features.
If you receive a "server has encountered an error" message, that means you are
hitting the 30-second limit that's placed on running your program. Try using a
smaller number of features.
'''
################################ MODIFY THIS SECTION #####################################
# Select features. You should modify this section to try different features! #
# We've selected rain, precipi, Hour, meantempi, and UNIT (as a dummy) to start you off. #
# See this page for more info about dummy variables: #
# http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html #
##########################################################################################
features = dataframe[['rain', 'precipi', 'Hour', 'meantempi', 'fog']]
dummy_units = pandas.get_dummies(dataframe['UNIT'], prefix='unit')
features = features.join(dummy_units)
# Values
values = dataframe['ENTRIESn_hourly']
# Perform linear regression
intercept, params = linear_regression(features, values)
predictions = intercept + numpy.dot(features, params)
return predictions, intercept, params
In [6]:
predicted, intercept, params = predictions(turnstile_weather)
values = turnstile_weather['ENTRIESn_hourly']
In [7]:
(turnstile_weather['ENTRIESn_hourly'] - predicted).hist(bins=20)
Out[7]:
In [8]:
print "R2 Score=%f"%r2_score(values, predicted)
I have used rain, precipi, Hour, meantempi and UNIT. UNIT was transformed into dummy variables.
In [ ]:
the selected features will contribute to the predictive power of your model. Your reasons might be based on intuition. For example, response for fog might be: “I decided to use fog because I thought that when it is very foggy outside people might decide to use the subway more often.” Your reasons might also be based on data exploration and experimentation, for example: “I used feature X because as soon as I included it in my model, it drastically improved my R2 value.”
We know that weather, namely precipitation, affects the $\mu_{passengers}$. Thus I have included rain, precipi, meantempi and fog. From the correlation analysis below we can also see that Hour is the most correlated valid feature. For this reason Hour was also included in the input features.
In [23]:
print "Correlation analysis"
turnstile_weather.corr()['ENTRIESn_hourly'].sort_values(inplace=False)
Out[23]:
In [32]:
# plt.rcParams['figure.figsize'] = (12.0, 3.0)
# dtypes = turnstile_weather.dtypes
# for column in turnstile_weather.columns:
# if dtypes[column] in ['int64', 'float64']:
# plt.figure()
# turnstile_weather[column].hist(bins=20)
# #turnstile_weather.plot(kind='kde', x=column)
# plt.title(column)
# plt.rcParams['figure.figsize'] = (16.0, 8.0)
In [10]:
features=['rain', 'precipi', 'Hour', 'meantempi', 'fog']
print "== Non-dummy features coefficients =="
for i in range(5):
output_str = ("%s:"%features[i]).ljust(12)
output_str += "%.3f"%(params[i])
print output_str
In [11]:
r_squared = 1 - ((values-predicted)**2).sum()/((values-values.mean())**2).sum()
assert(r_squared == r2_score(values, predicted))
print "R2 Score=%f"%r_squared
When the coefficient of determination, $R^2$, give us the correlation between the predictor features and the independent variable Entries per hour.
When $R^2$ is close to 1, it means that the model has very good fitness, while when it is close to 0, the model does not fit at all. We have an $R^2$ of 0.46 which means that 0.46 of the variance of data is explained in the regression model. In addition, we should be evaluating our model with data that was not used to train the model. Even if we get a good score, our model might be overfiting.
If we look at our coefficients we can see that rain and meantempi have a negative impact in Entries per hour, while precipi, Hour and Fog have a positive impact. This means that 0.46 of the variance of the data is explained with a negative impact of rain.
Please include two visualizations that show the relationships between two or more variables in the NYC subway data. Remember to add appropriate titles and axes labels to your plots. Also, please add a short description below each figure commenting on the key insights depicted in the figure.
3.1 One visualization should contain two histograms: one of ENTRIESn_hourly for rainy days and one of ENTRIESn_hourly for non-rainy days.
You can combine the two histograms in a single plot or you can use two separate plots.
If you decide to use to two separate plots for the two histograms, please ensure that the x-axis limits for both of the plots are identical. It is much easier to compare the two in that case.
For the histograms, you should have intervals representing the volume of ridership (value of ENTRIESn_hourly) on the x-axis and the frequency of occurrence on the y-axis. For example, each interval (along the x-axis), the height of the bar for this interval will represent the number of records (rows in our data) that have ENTRIESn_hourly that falls in this interval.
Remember to increase the number of bins in the histogram (by having larger number of bars). The default bin width is not sufficient to capture the variability in the two samples.
R:
The following visualization has 2 histograms combined in a single plot. The histogram in red shows the ridership per hour distribution for non-rainy days, while the histogram in blue shows for rainy days. We can see that non-rainy have bigger bars for ENTRIESn_hourly below 10000. This doesn't mean rainy days have less passengers. I just means that we have less data for rainy days, which is natural since we have less rainy days.
In [42]:
print ggplot(aes(x='ENTRIESn_hourly', fill='rain'), data=turnstile_weather) +\
geom_histogram(binwidth=1000) +\
ggtitle('Ridership per hour distribution for rainy and non-rainy days') +\
ylab('Number of tuples')
print "ENTRIESn_hourly max value: %d"%turnstile_weather['ENTRIESn_hourly'].max()
Although the maximum value of ENTRIESn_hourly is above 50000, from the histogram we see that most values are below 10000. Thus, let's generate a histogram limited to 10000 entries.
In [43]:
print ggplot(aes(x='ENTRIESn_hourly', fill='rain'), data=turnstile_weather) +\
geom_histogram(binwidth=100) +\
xlim(0, 10000)+\
ggtitle('Ridership per hour distribution for rainy and non-rainy days limited to 10000') +\
ylab('Number of tuples')
In [51]:
# print ggplot(aes(x='ENTRIESn_hourly', color='rain'), data=turnstile_weather) +\
# geom_density() +\
# ggtitle('Ridership per hour distribution for rainy and non-rainy days limited to 10000') +\
# ylab('Number of tuples')
3.2 One visualization can be more freeform. You should feel free to implement something that we discussed in class (e.g., scatter plots, line plots) or attempt to implement something more advanced if you'd like. Some suggestions are:
R:
The following plot shows the average number of passengers per hour in our dataset. We can see that in average 8pm, 12pm, and 4pm are the times of day with most passengers.
In [19]:
print ggplot(turnstile_weather, aes(x='Hour', y='ENTRIESn_hourly'))+geom_bar(stat = "summary", fun_y=numpy.mean, fill='lightblue')+ggtitle('Average ridership by time-of-day')
Please address the following questions in detail. Your answers should be 1-2 paragraphs long.
The number of people that ride NYC in raining or non-raining days is different, but the analysis made shows that is not clear which days have more ridership.
The Mann-Whitney U-statistic was able to reject the null hypothesis with a significance level of 0.05. When we look at the distributions, we see that the maximum value of ridership per hour is much higher on rainy days (51839 against 43199). The histograms are not able to produce a good visualization to compare distributions since, there are more tuples for non-rainy days. Perhaps, some normalization will help for further analysis.
Nevertheless, when we look at our linear regression model with $R^2=0.46$, the coefficient for rain has a negative value (-39.307), which means that the number of ridership is inversely proportional with the existence of rain. This might happen due to existent correlation or causality between rain and other features. E.g., rain might have some correlation with fog which might also affect ridership.
Please address the following questions in detail. Your answers should be 1-2 paragraphs long.
Regarding the linear regression, this method is not robust against correlated features. The use of correlated features might be reducing the quality of our model and conclusions.
Although our test rejected the null hypothesis, we can't assume that there is a causality between rain and ridership. There is the possibility of having another condition that affects both features.
In [77]:
print pandas.to_datetime(turnstile_weather['DATEn']).describe()
Regarding our dataset, we can see from the descriptive statistics above that data was collected from May 1st until May 30th. In order to make a conclusion of ridership, perhaps it would be more reliable to use data from all seasons.
In [ ]: