This project analyzes the effect of a few different variables on the number of bike rentals at a bike sharing station in Washington, DC. The variables analyzed are the temperature and the weather condition, rated on a scale of 1-4. The goal is to identify trends in the number of rentals as the weather and temperature change.
In [178]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
import scipy.stats as ss
import seaborn as sns
import scipy.optimize
sns.set_style('whitegrid')
sns.set_context('notebook')
In [179]:
#the data is read in from the csv
all_data = pd.read_csv('Dengler_L_Project/train.csv')
all_data.info()
In [180]:
#The datetime column is converted to usable datetime objects instead of the strings they are read in as
dates = pd.to_datetime(all_data.datetimes)
We will now check for unusual data. The first check is whether any values are NaN or exceed a reasonable upper limit.
Each column is checked for bad values by verifying that every element is both valid (not NaN) and within a reasonable range for the quantity it describes.
In the following cell the number of bad values is counted, and if a column contains no bad values a message stating so is printed.
In [181]:
#Each column is flagged as bad where it is NaN or exceeds a loose sanity limit
#for the quantity it describes. (The limits are generous upper bounds, not
#tight physical limits.)
valid_limits = [('season', 4), ('holiday', 1), ('workingday', 1), ('weather', 4),
                ('temp', 50), ('atemp', 50), ('humidity', 100), ('windspeed', 70),
                ('casual', 1000), ('registered', 1000), ('counts', 1000)]
for column, limit in valid_limits:
    values = all_data[column]
    bad = np.logical_or(np.isnan(values), values > limit)
    total = bad.sum()
    if total == 0:
        print("None of the '{}' data is invalid".format(column))

print("\n\nAll of these messages printing shows that there is no missing or out-of-range data.")
In [182]:
#the figure is created before plotting so the figure size applies to this plot
plt.figure(figsize=(15,5))
plt.plot(dates[0:168], all_data.casual[0:168], 'b', label = "casual")
plt.plot(dates[0:168], all_data.registered[0:168], 'r', label = "registered")
plt.plot(dates[0:168], all_data.counts[0:168], 'g', label = "total")
plt.legend(loc = 'upper left')
plt.xlabel('Date and Time')
plt.ylabel('Counts')
plt.title("Counts for the first 7 days of data", fontsize = 20)
plt.gcf().autofmt_xdate()
plt.show()
The $isDaytime$ array will hold $True$ if the hour falls strictly between 7 AM and 9 PM (hours 8 through 20) and $False$ otherwise.
Knowing which positions in the array fall during the daytime lets us build arrays that contain only the daytime values, kept in order.
After this cell is run, slicing a data array by $isDaytime$ will remove all of the late-night data points.
In [183]:
#extracting the hour from each datetime
hours = []
for date in dates:
    hours.append(date.hour)

#True when the hour falls strictly between 7 AM and 9 PM
isDaytime = []
for i in hours:
    if(7 < i and i < 21):
        isDaytime.append(True)
    else:
        isDaytime.append(False)
In [184]:
totalRents = sum(all_data.counts)
reducedRents = sum(all_data.counts[isDaytime])
print("Total Rentals: {} \n Reduced Rentals: {}".format(totalRents, reducedRents))
print("We maintained {:.4}% of the rentals.".format(reducedRents/totalRents*100))
Using the 1-4 weather scale (see Appendix B) we will compare the weather to the number of rentals.
A violin plot is used to show the distribution of the y-values that correspond to a given x-value. The "violin" at each x-value is wider where there are many points and very thin where there are few. This style of plot works best when there are a limited number of discrete x-values, each with a large number of y-values.
We will plot the weather values on the x-axis and the number of rentals on the y-axis.
In [185]:
xvals = all_data.weather[isDaytime]
yvals = all_data.counts[isDaytime]
fig = sns.violinplot(xvals, yvals)
plt.gca().set_xticklabels(['Nice', 'Mild', 'Bad', 'Severe'])
plt.xlabel("Weather")
plt.ylabel("Number of Bikes rented")
plt.title("Weather and Rental Counts", fontsize = 20)
plt.ylim([0, 1200])
Out[185]:
In [186]:
#ss.spearmanr returns (correlation, p-value); we keep the p-value
cor_weatherAndCounts = ss.spearmanr(xvals, yvals)[1]
print("P-value: {:.3}".format(cor_weatherAndCounts))
Since this p-value is less than 0.05, we reject the null hypothesis and conclude that there is a correlation between the weather and the number of rentals.
We can now try to make a regression for this relationship. The relation between these two variables appears to be roughly linear.
Starting with a linear model we will optimize this equation: $$ R = \beta_0 * W + \beta_1 $$ Where: $$R = Number\ of\ Rentals,\ \ W = Weather\ Condition\ Number $$
The defined function takes in values for $\beta_0$ and $\beta_1$ and returns the sum of the squares of the residuals.
The optimize function will then find the values of $\beta_0$ and $\beta_1$ that minimize this total.
In [187]:
def bestFit(x, b0, b1):
    '''Uses the beta values and an x array to compute the predicted (yHat) values.'''
    return x*b0 + b1

def SSR(args):
    '''Given an array of 2 betas, the sum of the squares of the residuals is calculated
    by using the previous function for yHat and summing the squared differences
    between yHat and the observed counts.'''
    b0 = args[0]
    b1 = args[1]
    yhat = bestFit(xvals, b0, b1)
    ssr = np.sum((yhat - yvals)**2)
    return ssr
optResult = scipy.optimize.minimize(SSR, x0=[-75, 300])
print("B0 = {:.4} B1 = {:.4}".format(optResult.x[0], optResult.x[1]))
In [188]:
fig = sns.violinplot(xvals, yvals)
plt.plot([0, 1, 2, 3], bestFit(np.array([1, 2, 3, 4]), optResult.x[0], optResult.x[1]), 'bo-')
#the fitted values are computed at weather values 1-4 but drawn at x positions 0-3 so the line aligns with the violin plot's 0-indexed categories; this shift has no effect on the calculations
plt.gca().set_xticklabels(['Nice', 'Mild', 'Bad', 'Severe'])
plt.xlabel("Weather")
plt.ylabel("Number of Bikes rented")
plt.title("Weather and Rental Counts With Estimation", fontsize = 20)
plt.ylim([0, 1200])
Out[188]:
Subtracting the expected counts from the actual rental counts for each weather value gives us the residuals, which we will plot.
We will also run a Shapiro-Wilk test to see whether the residuals are normally distributed.
In [189]:
resids = yvals - bestFit(xvals, optResult.x[0], optResult.x[1])
plt.hist(resids)
plt.title("Residuals of the Linear Best Fit", fontsize = 20)
plt.xlabel("Difference in estimated and actual")
plt.ylabel("Count")
Out[189]:
In [190]:
normality = ss.shapiro(resids)
print("P-value:", normality[1])
In [191]:
xvals2 = all_data.temp[isDaytime]
yvals2 = all_data.counts[isDaytime]
#fig = sns.violinplot(all_data.temp[isDaytime], all_data.counts[isDaytime])
plt.plot(xvals2, yvals2, 'o')
plt.xlabel("Temperature (Degrees Celsius)")
plt.ylabel("Number of Rentals")
plt.title("Temperatures and Counts Plot", fontsize = 20)
plt.ylim([0, 1100])
Out[191]:
In [192]:
cor_tempAndCount = ss.spearmanr(xvals2, yvals2)[1]
print("P-value: {:.4}".format(cor_tempAndCount))
Since this p-value is less than 0.05, we reject the null hypothesis and conclude that there is a correlation between the temperature and the number of rentals.
We can now try to make a regression for this relationship. The relation here could be a downward-opening quadratic. This also makes sense in context: as the temperature moves too far from a comfortable range, the willingness to ride a bike decreases.
Starting with a quadratic model we will optimize this equation: $$ R = \beta_0 *T^2 + \beta_1 * T + \beta_2 $$ Where: $$R = Number\ of\ Rentals,\ \ T = Temperature\ in\ C^\circ$$
This time the function takes in values for $\beta_0$, $\beta_1$, and $\beta_2$ and returns the sum of the squares of the residuals.
Basin hopping is also used for this fit to minimize the sum of squared residuals. This method avoids getting caught in local minima by "hopping" out of "basins".
In [193]:
def bestFit2(x, b0, b1, b2):
    '''Takes in 3 betas and returns the quadratic yHat for the temperature data.'''
    return b0*x**2 + x*b1 + b2

def SSR2(args):
    '''Uses the same method as in the first example to calculate the sum of the
    squares of the residuals.'''
    b0 = args[0]
    b1 = args[1]
    b2 = args[2]
    yhat = bestFit2(xvals2, b0, b1, b2)
    ssr = np.sum((yhat - yvals2)**2)
    return ssr

#quick sanity check of bestFit2 with hand-picked betas
print(bestFit2(10, -5, 100, 0))
#optResult2 = scipy.optimize.minimize(SSR2, x0=[0, 0, 0])
optResult2 = scipy.optimize.basinhopping(SSR2, x0 = [-5, 25, -50], niter = 5000)
print(optResult2)
beta0 = optResult2.x[0]
beta1 = optResult2.x[1]
beta2 = optResult2.x[2]
print("B0 = {:.4} B1 = {:.4} B2 = {:.4}".format(optResult2.x[0], optResult2.x[1], optResult2.x[2]))
In [194]:
plt.plot(xvals2, yvals2, 'o')
plt.plot(np.linspace(0, 45,100) , bestFit2(np.linspace(0, 45,100), beta0, beta1, beta2), 'r-')
plt.xlabel("Temperature (Degrees Celsius)")
plt.ylabel("Number of Bikes rented")
plt.title("Temperature and Rental Counts With Estimation", fontsize = 20)
plt.ylim([0, 1200])
Out[194]:
Again we will compute the residuals: the actual values minus the expected values from our newly optimized function.
We will also run a Shapiro-Wilk test to see whether these residuals are normally distributed.
In [195]:
resids2 = yvals2 - bestFit2(xvals2, optResult2.x[0], optResult2.x[1], optResult2.x[2])
plt.hist(resids2)
plt.title("Residuals of the Linear Best Fit for Temperature", fontsize = 16)
plt.xlabel("Difference in estimated and actual")
plt.ylabel("Count")
Out[195]:
In [196]:
normality2 = ss.shapiro(resids2)
print("P-value:", normality2[1])
While neither of these regressions ended up being a strong fit for the data, they did give us insight into how we could further analyze the data.
Both histograms have a very similar shape: a longer left tail and a very short right tail. This clearly shows that the error in our regression lines is not normally distributed. This occurs because the regression line cannot represent how the data has a few observations with a great number of riders and many observations with few or even zero rentals. The high rental numbers end up pulling the fitted values above the bulk of the observations, which sit down near 0 rentals. Perhaps finding another way to remove some of the consistently low values that are caused by other factors would allow for a better estimate of the values that we are interested in.
Another possible explanation for this error is that we are not looking at enough of the factors. Perhaps we should look at more factors simultaneously to estimate the number of riders; a rough sketch of that idea follows.
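This sketch is only an illustration, not part of the original analysis: it combines the weather value and the temperature into one least-squares fit of the hypothetical form $R = \beta_0 W + \beta_1 T + \beta_2 T^2 + \beta_3$, reusing the daytime arrays and the scipy.optimize approach from above. The model form and the starting guess are assumptions.
In [ ]:
#Sketch: a single least-squares fit using the weather value and the temperature
#together, fit to the daytime data defined earlier.
def bestFitBoth(w, t, b0, b1, b2, b3):
    '''Predicted counts from both the weather value and the temperature.'''
    return b0*w + b1*t + b2*t**2 + b3

def SSRBoth(args):
    '''Sum of the squares of the residuals for the two-factor model.'''
    yhat = bestFitBoth(xvals, xvals2, args[0], args[1], args[2], args[3])
    return np.sum((yhat - yvals)**2)

#the starting guess is arbitrary
optBoth = scipy.optimize.minimize(SSRBoth, x0=[-50, 25, -0.5, 100])
print("B0 = {:.4} B1 = {:.4} B2 = {:.4} B3 = {:.4}".format(*optBoth.x))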
Based on some literature, the last consideration should be the nature of the data. It is possible that the data's "random error" or "noise" is not normal but instead follows a binomial or Poisson distribution. This would make sense because there is a large population of people who could use these bikes, and each person has some probability of renting a bike at a given time. With this in mind we would expect the number of rentals to be Poisson distributed, which would lead to a Poisson-like distribution of error that more closely matches the residuals that were found. A minimal sketch of this idea follows.
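This sketch assumes the daytime weather and count arrays (xvals, yvals) defined above and fits $R = e^{\beta_0 + \beta_1 W}$ by minimizing the negative Poisson log-likelihood with the same scipy.optimize tools used earlier. The model form and the starting guess are illustrative assumptions rather than part of the original analysis.
In [ ]:
#Sketch: Poisson regression of the daytime counts on the weather value,
#with mean mu = exp(b0 + b1*W). The constant log(y!) term is dropped
#because it does not depend on the betas.
def negLogLikelihood(args):
    '''Negative Poisson log-likelihood for mu = exp(b0 + b1 * W).'''
    b0 = args[0]
    b1 = args[1]
    log_mu = b0 + b1 * xvals
    return -np.sum(yvals * log_mu - np.exp(log_mu))

#the starting guess is arbitrary
poissonResult = scipy.optimize.minimize(negLogLikelihood, x0=[5.0, -0.1])
print("B0 = {:.4} B1 = {:.4}".format(poissonResult.x[0], poissonResult.x[1]))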
In [197]:
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(xvals, xvals2, yvals, zdir='z', s=20, c='b', depthshade=True)
ax.set_xlabel('Weather')
ax.set_ylabel('Temperature (C)')
ax.set_zlabel('Rental Count')
plt.show()
We can see the 4 distinct weather values, and as the weather gets harsher the high rental counts become less frequent while the low rental counts stay fairly dense. This shows that, as expected, people are less likely to rent a bike in harsh weather.
We can also see the effect of temperature across all 4 weather values. When the temperature is low, there are no data points with a high number of rentals.
The last very interesting thing visible in this plot, which could not be seen in the earlier plots, is that as the weather gets harsher the distribution of temperatures changes drastically. The harsher weather clusters have a much lower average temperature because they contain far fewer high-temperature data points.
Let's look at some data with a known relationship. We will plot the "feels like" temperature as a function of temperature and wind speed. We know these values are related because the "feels like" temperature is determined by the real temperature and the wind speed. Due to the sheer number of data points, only every $10^{th}$ point is plotted to make the data more visible. We have also made the axes disproportionate to show the shape of the data a little better.
In [198]:
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(all_data.temp[::10], all_data.windspeed[::10], all_data.atemp[::10], zdir='z', s=20, c='b', depthshade=True)
ax.set_xlim([0, 60])
ax.set_ylim([0, 30])
ax.set_zlim([0, 100])
ax.set_xlabel('Real Temperature (C)')
ax.set_ylabel('WindSpeed (km/hr)')
ax.set_zlabel('"Feels Like" Temp (C)')
plt.show()
This plot shows that as the temperature increases so does the "feels like" temperature. It is hard to see how the wind affects the "feels like" temperature because of the nature of the graphic and the small effect of the wind speed. Perhaps selecting a different visual encoding for the points would make this more visible; one option is sketched below.
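As an illustrative sketch (not part of the original analysis), one option is to drop to two spatial dimensions and let the point color carry the wind speed, reusing the same every-10th-point subsampling as above.
In [ ]:
#Sketch: color each point by wind speed so the wind's effect on the
#"feels like" temperature is easier to see than in the 3D view.
plt.figure()
sc = plt.scatter(all_data.temp[::10], all_data.atemp[::10],
                 c=all_data.windspeed[::10], s=20)
cbar = plt.colorbar(sc)
cbar.set_label('Wind Speed (km/hr)')
plt.xlabel('Real Temperature (C)')
plt.ylabel('"Feels Like" Temp (C)')
plt.title('"Feels Like" Temperature Colored by Wind Speed', fontsize=16)
plt.show()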
Fanaee-T, Hadi, and Gama, Joao. "Event labeling combining ensemble detectors and background knowledge." Progress in Artificial Intelligence (2013): pp. 1-15. Springer Berlin Heidelberg.
Dataset compiled by Hadi Fanaee-T using data from Capital Bikeshare, 28 May 2014.
https://www.kaggle.com/c/bike-sharing-demand/data?sampleSubmission.csv
Data Fields
weather - weather condition on a scale of 1 (nice) to 4 (severe)
temp - temperature in Celsius
R Data Analysis Examples: Poisson Regression. UCLA: Statistical Consulting Group. Accessed 4/28/16. http://www.ats.ucla.edu/stat/r/dae/poissonreg.htm
Rodríguez, Germán. Princeton University Course Notes, Fall 2015. http://data.princeton.edu/wws509/notes/c4.pdf