Note: Sample sizes differ. "Rain" has many fewer samples. These histograms are primarily to compare the distributions of each.
Also the x-axis in this example image has been truncated at 6,000 cutting off outliers in the long tail which extends beyond 50,000.
Before we perform any analysis, it might be useful to take a look at the data we're hoping to analyze. More specifically, let's
NYC subway data
and turnstile_weather
under the ['ENTRIESn_hourly'] column.Let's plot two histograms on the same axes to show hourly entries when raining vs. when not raining.
Here's an example on how to plot histograms with pandas and matplotlib:
turnstile_weather['column_to_graph'].hist()
Your histogram may look similar to bar graph in the instructor notes below.
You can read a bit about using matplotlib and pandas to plot histograms here: http://pandas.pydata.org/pandas-docs/stable/visualization.html#histograms
You can see the information contained within the turnstile weather data here: https://s3.amazonaws.com/content.udacity-data.com/courses/ud359/turnstile_data_master_with_weather.csv
In [ ]:
import numpy as np
import pandas
import matplotlib.pyplot as plt
def entries_histogram(turnstile_weather):
'''
Task description is above.
'''
plt.figure()
turnstile_weather['ENTRIESn_hourly'].loc[turnstile_weather['rain'] == 1].hist() # your code here to plot a historgram for hourly entries when it is raining
turnstile_weather['ENTRIESn_hourly'].loc[turnstile_weather['rain'] == 0].hist() # your code here to plot a historgram for hourly entries when it is not raining
plt.title('Histogram of ENTRIESn_hourly')
plt.legend(['No Rain', 'Rain'])
return plt
This function will consume the turnstile_weather
dataframe containing
our final turnstile weather data.
You will want to take the means and run the Mann Whitney U-test on the
ENTRIESn_hourly
column in the turnstile_weather dataframe.
This function should return: 1) the mean of entries with rain 2) the mean of entries without rain 3) the Mann-Whitney U-statistic and p-value comparing the number of entries with rain and the number of entries without rain
You should feel free to use scipy's Mann-Whitney implementation, and you might also find it useful to use numpy's mean function.
Here are the functions' documentation: http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mannwhitneyu.html http://docs.scipy.org/doc/numpy/reference/generated/numpy.mean.html
You can look at the final turnstile weather data at the link below: https://s3.amazonaws.com/content.udacity-data.com/courses/ud359/turnstile_data_master_with_weather.csv
In [ ]:
import numpy as np
import scipy
import scipy.stats
import pandas
def mann_whitney_plus_means(turnstile_weather):
'''
Task description is above.
'''
### YOUR CODE HERE ###
# Extract rain data through hourly in turnstile_data
rain_data = turnstile_weather['ENTRIESn_hourly'].loc[turnstile_weather['rain'] == 1]
non_rain_data = turnstile_weather['ENTRIESn_hourly'].loc[turnstile_weather['rain'] == 0]
# Compute mean of `rain_data` and `non_rain_data` using numpy's mean function
with_rain_mean = np.mean(rain_data)
without_rain_mean = np.mean(non_rain_data)
# Run Mann Whitney U-test on `ENTRIESn_hourly` column
[U, p] = scipy.stats.mannwhitneyu(rain_data, non_rain_data)
return with_rain_mean, without_rain_mean, U, p # leave this line for the grader
Result: (1105.4463767458733, 1090.278780151855, 1924409167.0, 0.024940392294493356)
Q: Is the distribution of the number of entries statistically different between rainy and non rainy days? A: Yes
For more intuitions about this test: https://www.graphpad.com/guides/prism/7/statistics/how_the_mann-whitney_test_works.htm?toc=0&printWindow
In this question, you need to:
compute_cost()
and gradient_descent()
proceduresNote:
Plotting your cost history
will help you convince yourself that gradient descent has converged to the minimum cost, but it's more of a learning tool than data to include in a report.
In [25]:
import csv
import numpy as np
import pandas as pd
from ggplot import *
"""
In this question, you need to:
1) implement the compute_cost() and gradient_descent() procedures
2) Select features (in the predictions procedure) and make predictions.
"""
def normalize_features(df):
"""
Normalize the features in the data set.
"""
mu = df.mean()
sigma = df.std()
if (sigma == 0).any():
raise Exception("One or more features had the same value for all samples, and thus could " + \
"not be normalized. Please do not include features with only a single value " + \
"in your model.")
df_normalized = (df - df.mean()) / df.std()
return df_normalized, mu, sigma
def compute_cost(features, values, theta):
"""
Compute the cost function given a set of features / values,
and the values for our thetas.
This can be the same code as the compute_cost function in the lesson #3 exercises,
but feel free to implement your own.
"""
# your code here
m = len(values)
sum_of_square_errors = np.square(np.dot(features, theta) - values)
cost = sum_of_square_errors / (2*m)
return cost
def gradient_descent(features, values, theta, alpha, num_iterations):
"""
Perform gradient descent given a data set with an arbitrary number of features.
This can be the same gradient descent code as in the lesson #3 exercises,
but feel free to implement your own.
"""
m = len(values)
cost_history = []
for i in range(num_iterations):
# Append new computed cost of given list of thets to cost_history list
cost_history.append(compute_cost(features, values, theta))
# Compute
diff = np.dot(features.transpose(), values - np.dot(features, theta))
theta += (alpha/len(values))*diff
return theta, pandas.Series(cost_history)
def predictions(dataframe):
'''
The NYC turnstile data is stored in a pandas dataframe called weather_turnstile.
Using the information stored in the dataframe, let's predict the ridership of
the NYC subway using linear regression with gradient descent.
You can download the complete turnstile weather dataframe here:
https://www.dropbox.com/s/meyki2wl9xfa7yk/turnstile_data_master_with_weather.csv
Your prediction should have a R^2 value of 0.40 or better.
You need to experiment using various input features contained in the dataframe.
We recommend that you don't use the EXITSn_hourly feature as an input to the
linear model because we cannot use it as a predictor: we cannot use exits
counts as a way to predict entry counts.
Note: Due to the memory and CPU limitation of our Amazon EC2 instance, we will
give you a random subet (~15%) of the data contained in
turnstile_data_master_with_weather.csv. You are encouraged to experiment with
this computer on your own computer, locally.
If you'd like to view a plot of your cost history, uncomment the call to
plot_cost_history below. The slowdown from plotting is significant, so if you
are timing out, the first thing to do is to comment out the plot command again.
If you receive a "server has encountered an error" message, that means you are
hitting the 30-second limit that's placed on running your program. Try using a
smaller number for num_iterations if that's the case.
If you are using your own algorithm/models, see if you can optimize your code so
that it runs faster.
'''
# Select Features (try different features!)
features_list = ['fog', 'meandewpti', 'rain', 'precipi', 'Hour', 'meantempi']
features = dataframe[features_list]
# Add UNIT to features using dummy variables
dummy_units = pandas.get_dummies(dataframe['UNIT'], prefix='unit')
features = features.join(dummy_units)
# Values
values = dataframe['ENTRIESn_hourly']
m = len(values)
features, mu, sigma = normalize_features(features)
features['ones'] = np.ones(m) # Add a column of 1s (y intercept)
# Convert features and values to numpy arrays
features_array = np.array(features)
values_array = np.array(values)
# Set values for alpha, number of iterations.
alpha = 0.1 # please feel free to change this value
num_iterations = 75 # please feel free to change this value
# Initialize theta, perform gradient descent
theta_gradient_descent = np.zeros(len(features.columns))
theta_gradient_descent, cost_history = gradient_descent(features_array,
values_array,
theta_gradient_descent,
alpha,
num_iterations)
plot = None
# -------------------------------------------------
# Uncomment the next line to see your cost history
# -------------------------------------------------
plot = plot_cost_history(alpha, cost_history)
#
# Please note, there is a possibility that plotting
# this in addition to your calculation will exceed
# the 30 second limit on the compute servers.
predictions = np.dot(features_array, theta_gradient_descent)
return predictions, plot
def plot_cost_history(alpha, cost_history):
"""This function is for viewing the plot of your cost history.
You can run it by uncommenting this
plot_cost_history(alpha, cost_history)
call in predictions.
If you want to run this locally, you should print the return value
from this function.
"""
cost_df = pandas.DataFrame({
'Cost_History': cost_history,
'Iteration': range(len(cost_history))
})
return ggplot(cost_df, aes('Iteration', 'Cost_History')) + \
geom_point() + ggtitle('Cost History for alpha = %.3f' % alpha )
# -------------------------------------------------
# Locally run
# -------------------------------------------------
# data = pd.read_csv('./data/turnstile_data_master_with_weather.csv')
# predictions(data)
Using the same methods that we used to plot a histogram of entries per hour for our data, why don't you make a histogram of the residuals (that is, the difference between the original hourly entry data and the predicted values). Try different binwidths for your histogram.
Based on this residual histogram, do you have any insight into how our model performed? Reading a bit on this webpage might be useful: http://www.itl.nist.gov/div898/handbook/pri/section2/pri24.htm
In [ ]:
import numpy as np
import scipy
import matplotlib.pyplot as plt
def plot_residuals(turnstile_weather, predictions):
'''
Task description is above.
'''
plt.figure()
(turnstile_weather['ENTRIESn_hourly'] - predictions).hist(bins=100)
return plt
In exercise 5, we calculated the $R^2$ value for you. But why don't you try and and calculate the $R^2$ value yourself.
Given a list of original data points
, and also a list of predicted data points
,
numpy.mean()
and numpy.sum()
might both be useful here, but not necessary.
Documentation about numpy.mean()
and numpy.sum()
below:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.mean.html
http://docs.scipy.org/doc/numpy/reference/generated/numpy.sum.html
In [ ]:
import numpy as np
import scipy
import matplotlib.pyplot as plt
import sys
def compute_r_squared(data, predictions):
'''
Task description is above.
'''
# your code here
r_squared = 1 - sum((predictions - data)**2) / sum((data - np.mean(data))**2)
return r_squared
Before choosing to use Linear Regression, follow a checklist like this: http://www.graphpad.com/guides/prism/6/curve-fitting/index.htm?reg_analysischeck_linearreg.htm
[ ] Can the relationship between X and Y be graphed as a straight line? In many experiments the relationship between X and Y is curved, making linear regression inappropriate. It rarely helps to transform the data to force the relationship to be linear. Better, use nonlinear curve fitting.
[ ] Is the scatter of data around the line Gaussian (at least approximately)? Linear regression analysis assumes that the scatter of data around the best-fit line is Gaussian.
[ ] Is the variability the same everywhere? Linear regression assumes that scatter of points around the best-fit line has the same standard deviation all along the curve. The assumption is violated if the points with high or low X values tend to be further from the best-fit line. The assumption that the standard deviation is the same everywhere is termed homoscedasticity. (If the scatter goes up as Y goes up, you need to perform a weighted regression. Prism can't do this via the linear regression analysis. Instead, use nonlinear regression but choose to fit to a straight-line model.
[ ] Do you know the X values precisely? The linear regression model assumes that X values are exactly correct, and that experimental error or biological variability only affects the Y values. This is rarely the case, but it is sufficient to assume that any imprecision in measuring X is very small compared to the variability in Y.
[ ] Are the data points independent? Whether one point is above or below the line is a matter of chance, and does not influence whether another point is above or below the line.
[ ] Are the X and Y values intertwined? If the value of X is used to calculate Y (or the value of Y is used to calculate X) then linear regression calculations are invalid. One example is a Scatchard plot, where the Y value (bound/free) is calculated from the X value. Another example would be a graph of midterm exam scores (X) vs. total course grades(Y). Since the midterm exam score is a component of the total course grade, linear regression is not valid for these data.
Note: that if you are using statsmodels and your model does not include a constant statsmodels will calculate R^2 differently from the grader. See this link for more information: http://www.ats.ucla.edu/stat/mult_pkg/faq/general/noconstant.htm
In this optional exercise, you should complete the function called predictions(turnstile_weather). This function takes in our pandas turnstile weather dataframe, and returns a set of predicted ridership values, based on the other information in the dataframe.
In exercise 3.5 we used Gradient Descent
in order to compute the coefficients
theta used for the ridership prediction.
Here you should attempt to implement another way of computing the coeffcients theta.
You may also try using a reference implementation such as:
http://statsmodels.sourceforge.net/devel/generated/statsmodels.regression.linear_model.OLS.html
One of the advantages of the statsmodels implementation is that it gives you
easy access to the values of the coefficients theta
. This can help you infer relationships
between variables in the dataset.
You may also experiment with polynomial terms as part of the input variables.
The following links might be useful: http://en.wikipedia.org/wiki/Ordinary_least_squares http://en.wikipedia.org/w/index.php?title=Linear_least_squares_(mathematics) http://en.wikipedia.org/wiki/Polynomial_regression
This is your playground. Go wild!
How does your choice of linear regression compare to linear regression with gradient descent computed in Exercise 3.5?
You can look at the information contained in the turnstile_weather dataframe below: https://s3.amazonaws.com/content.udacity-data.com/courses/ud359/turnstile_data_master_with_weather.csv
Note: due to the memory and CPU limitation of our amazon EC2 instance, we will give you a random subset (~10%) of the data contained in turnstile_data_master_with_weather.csv
If you receive a "server has encountered an error" message, that means you are hitting the 30 second limit that's placed on running your program. See if you can optimize your code so it runs faster.
In [ ]:
# -*- coding: utf-8 -*-
import numpy as np
import pandas
import scipy
import statsmodels.api as sm
"""
"""
def predictions(weather_turnstile):
#
# Your implementation goes here. Feel free to write additional
# helper functions
#
return prediction
In [ ]: