Estimate a regression using the Capital Bikeshare data
We'll be working with a dataset from Capital Bikeshare that was used in a Kaggle competition (data dictionary).
Bike sharing systems are a means of renting bicycles where the process of obtaining membership, rental, and bike return is automated via a network of kiosk locations throughout a city. Using these systems, people are able to rent a bike from one location and return it to a different place on an as-needed basis. Currently, there are over 500 bike-sharing programs around the world.
The data generated by these systems makes them attractive for researchers because the duration of travel, departure location, arrival location, and time elapsed is explicitly recorded. Bike sharing systems therefore function as a sensor network, which can be used for studying mobility in a city. In this competition, participants are asked to combine historical usage patterns with weather data in order to forecast bike rental demand in the Capital Bikeshare program in Washington, D.C.
In [1]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
# read the data and set the datetime as the index
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/bikeshare.csv'
bikes = pd.read_csv(url, index_col='datetime', parse_dates=True)
# "count" is a method, so it's best to name that column something else
bikes.rename(columns={'count':'total'}, inplace=True)
bikes.head()
Out[1]:
In [2]:
bikes.shape
Out[2]:
In [4]:
# Pandas scatter plot
bikes.plot(kind='scatter', x='temp', y='total', alpha=0.2)
Out[4]:
In [5]:
feature_cols = ['temp']
X1 = bikes[feature_cols]
Y1 = bikes.total
In [6]:
from sklearn.linear_model import LinearRegression
clf1 = LinearRegression()
clf1.fit(X1, Y1)
clf1.predict(X1)
Out[6]:
In [7]:
print(clf1.coef_)
print(clf1.intercept_)
The relationship between temperature and total rentals is directly proportional. The first plot shows that as the temperature rises, the total number of bikes rented rises as well; this is confirmed by the coefficient of the linear regression model (B1), which, being positive, indicates that when the X variable (temp) increases, the Y variable (total) increases too.
If the temperature increases by 1 unit, the total number of bikes rented increases by about 9 units.
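As a quick sanity check (a minimal sketch, assuming clf1 fitted as above), predictions one degree apart should differ by exactly the slope coefficient:
In [ ]:
# predictions one degree apart differ by exactly the slope coefficient
p20 = clf1.predict(pd.DataFrame({'temp': [20.0]}))
p21 = clf1.predict(pd.DataFrame({'temp': [21.0]}))
print(p21 - p20)  # matches clf1.coef_, roughly 9 rentals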
In [8]:
# manual prediction for temp = 31 from the fitted intercept and slope
prediction = clf1.intercept_ + (clf1.coef_ * 31)
prediction
Out[8]:
The total number of bikes rented when the temperature is 31º is about 290.
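The same number can be obtained directly from the model's predict method rather than by hand (a minimal sketch; the 'temp' column name matches the training data):
In [ ]:
# equivalent prediction using the fitted model directly
clf1.predict(pd.DataFrame({'temp': [31]}))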
In [9]:
Y1_pred=clf1.predict(X1)
Y1_pred
Out[9]:
In [60]:
from sklearn import metrics
print('MSE:', metrics.mean_squared_error(Y1, Y1_pred))
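MSE is in squared units (rentals²), so its square root, the RMSE, is often easier to interpret; a minimal sketch:
In [ ]:
# RMSE is in the same units as the target (bike rentals)
print('RMSE:', np.sqrt(metrics.mean_squared_error(Y1, Y1_pred)))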
In [11]:
bikes["temp_conv"]=(bikes.temp*(9/5))+32
bikes.head()
Out[11]:
In [12]:
feature_cols2 = ['temp_conv']
X2 = bikes[feature_cols2]
Y2 = bikes.total
In [13]:
clf2 = LinearRegression()
clf2.fit(X2, Y2)
Y2_pred=clf2.predict(X2)
Y2_pred
Out[13]:
In [14]:
# difference between Celsius-based and Fahrenheit-based predictions
Y1_pred - Y2_pred
Out[14]:
As the difference between the predictions of the regression with temperature in degrees Celsius and the one in degrees Fahrenheit shows, it is essentially zero. That is, even though the temperature is on a different scale, the predictions do not change at all: a linear rescaling of a feature leaves a linear regression's fitted values unchanged.
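The algebra behind this: with total = b0 + b1·C and C = (F − 32)·5/9, the Fahrenheit slope must be the Celsius slope times 5/9, and the intercept shifts to compensate. A minimal sketch checking this against the two fitted models:
In [ ]:
# the coefficients rescale to compensate for the unit change
print(clf2.coef_, clf1.coef_ * 5 / 9)
print(clf2.intercept_, clf1.intercept_ - clf1.coef_[0] * 32 * 5 / 9)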
In [15]:
# add a squared-temperature feature
bikes['temp2'] = bikes.temp ** 2
bikes.head()
Out[15]:
In [16]:
feature_cols3 = ['temp', 'temp2']
X3 = bikes[feature_cols3]
Y3 = bikes.total
In [17]:
clf3 = LinearRegression()
clf3.fit(X3, Y3)
clf3.coef_,clf3.intercept_
Out[17]:
In [18]:
Y3_pred=clf3.predict(X3)
Y3_pred
Out[18]:
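To see what the quadratic term adds, a sketch (assuming the variables defined above) overlaying the fitted curve on the temperature scatter plot:
In [ ]:
# overlay the quadratic fit on the scatter plot of temp vs. total
bikes.plot(kind='scatter', x='temp', y='total', alpha=0.2)
grid = pd.DataFrame({'temp': np.linspace(bikes.temp.min(), bikes.temp.max(), 100)})
grid['temp2'] = grid.temp ** 2
plt.plot(grid.temp, clf3.predict(grid[['temp', 'temp2']]), color='red')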
In [19]:
# explore more features
feature_cols = ['temp', 'season', 'weather', 'humidity']
In [20]:
# multiple scatter plots in Pandas
fig, axs = plt.subplots(1, len(feature_cols), sharey=True)
for index, feature in enumerate(feature_cols):
    bikes.plot(kind='scatter', x=feature, y='total', ax=axs[index], figsize=(16, 3))
Are you seeing anything that you did not expect?
Look at the seasons:
In [24]:
# pivot table of season and month
month = bikes.index.month
pd.pivot_table(bikes, index='season', columns=month, values='temp', aggfunc=np.count_nonzero).fillna(0)
Out[24]:
In [25]:
# box plot of rentals, grouped by season
bikes.boxplot(column='total', by='season')
Out[25]:
In [26]:
# line plot of rentals
bikes.total.plot()
Out[26]:
The unexpected behavior is that more bikes are rented in winter. One would expect the other seasons to have higher rentals, but the plots say otherwise. The box plots show that the highest average belongs to "winter", and the previous line plot shows that October, which falls in winter under this coding, is when the most bikes are rented. Another reason to call this unexpected: the first part of this exercise showed that temperature and total rentals are directly proportional, so higher temperatures should mean more rentals, and that pattern is not visible in these plots.
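A quick numeric complement to the box plot (a minimal sketch using a pandas groupby):
In [ ]:
# mean and median rentals per season
bikes.groupby('season').total.agg(['mean', 'median'])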
In [27]:
feature_cols4 = ['temp', 'season', 'weather', 'humidity']
X4 = bikes[feature_cols4]
Y4 = bikes.total
In [28]:
clf4 = LinearRegression()
clf4.fit(X4, Y4)
clf4.coef_,clf4.intercept_
Out[28]:
In [30]:
# R² on the training data: single-feature model vs. four-feature model
clf1.score(X1, Y1), clf4.score(X4, Y4)
Out[30]:
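For a regressor, score() returns the coefficient of determination R²; a minimal sketch verifying this against metrics.r2_score:
In [ ]:
# score() on a LinearRegression is the R² of the training predictions
print(metrics.r2_score(Y4, clf4.predict(X4)))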
In [55]:
Y4_pred = clf4.predict(X4)
Y4_pred
Out[55]:
In [58]:
print('MSE (temp only):', metrics.mean_squared_error(Y1, Y1_pred))
In [59]:
print('MSE (temp, season, weather, humidity):', metrics.mean_squared_error(Y4, Y4_pred))
In [84]:
from sklearn.model_selection import train_test_split
X4_train, X4_test, Y4_train, Y4_test = train_test_split(X4, Y4, test_size=0.35, random_state=666)
print(Y4_train.shape, Y4_test.shape)
In [85]:
feature_cols5 = ['temp', 'season', 'weather']
X5 = bikes[feature_cols5]
Y5 = bikes.total
In [86]:
X5_train, X5_test, Y5_train, Y5_test = train_test_split(X5, Y5, test_size=0.35, random_state=666)
print(Y5_train.shape, Y5_test.shape)
In [87]:
feature_cols6 = ['temp', 'season', 'humidity']
X6 = bikes[feature_cols6]
Y6 = bikes.total
In [83]:
# note: this split uses a different random_state than the previous two
X6_train, X6_test, Y6_train, Y6_test = train_test_split(X6, Y6, test_size=0.35, random_state=333)
print(Y6_train.shape, Y6_test.shape)
In [89]:
clf4 = LinearRegression()
clf4.fit(X4_train, Y4_train)
Out[89]:
In [90]:
clf5 = LinearRegression()
clf5.fit(X5_train, Y5_train)
Out[90]:
In [91]:
clf6 = LinearRegression()
clf6.fit(X6_train, Y6_train)
Out[91]:
In [101]:
Y4_pred = clf4.predict(X4_test)
# exact equality is essentially never true for a continuous target,
# so this fraction is not a meaningful accuracy measure for regression
(Y4_test == Y4_pred).mean()
Out[101]:
In [106]:
Y4_test
Out[106]:
In [107]:
Y4_pred
Out[107]:
In [96]:
Y5_pred = clf5.predict(X5_test)
(Y5_test == Y5_pred).mean()  # again essentially zero for a continuous target
Out[96]:
In [98]:
Y6_pred = clf6.predict(X6_test)
(Y6_test == Y6_pred).mean()  # again essentially zero for a continuous target
Out[98]:
In [100]:
print('MSE clf4 (temp, season, weather, humidity):', metrics.mean_squared_error(Y4_test, Y4_pred))
print('MSE clf5 (temp, season, weather):', metrics.mean_squared_error(Y5_test, Y5_pred))
print('MSE clf6 (temp, season, humidity):', metrics.mean_squared_error(Y6_test, Y6_pred))
The lowest MSE turned out to be that of the (temp, season, humidity) model, so it is the best model according to the train/test method.
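A single train/test split can be sensitive to the particular random split (and clf6 even used a different random_state than the other two), so a more robust comparison averages the error over several folds. A minimal sketch with scikit-learn's cross_val_score, assuming the variables defined above:
In [ ]:
from sklearn.model_selection import cross_val_score

# 10-fold cross-validated MSE for each candidate feature set
# (scikit-learn reports negated MSE, so flip the sign)
for name, X in [('temp, season, weather, humidity', X4),
                ('temp, season, weather', X5),
                ('temp, season, humidity', X6)]:
    scores = cross_val_score(LinearRegression(), X, bikes.total,
                             cv=10, scoring='neg_mean_squared_error')
    print(name, '->', -scores.mean())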