Exercise 04

Estimate a regression using the Capital Bikeshare data

Forecast use of a city bikeshare system

We'll be working with a dataset from Capital Bikeshare that was used in a Kaggle competition (data dictionary).

Get started on this competition through Kaggle Scripts

Bike sharing systems are a means of renting bicycles where the process of obtaining membership, rental, and bike return is automated via a network of kiosk locations throughout a city. Using these systems, people are able rent a bike from a one location and return it to a different place on an as-needed basis. Currently, there are over 500 bike-sharing programs around the world.

The data generated by these systems makes them attractive for researchers because the duration of travel, departure location, arrival location, and time elapsed is explicitly recorded. Bike sharing systems therefore function as a sensor network, which can be used for studying mobility in a city. In this competition, participants are asked to combine historical usage patterns with weather data in order to forecast bike rental demand in the Capital Bikeshare program in Washington, D.C.



In [1]:

    
import pandas as pd
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt

# read the data and set the datetime as the index
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/bikeshare.csv'
bikes = pd.read_csv(url, index_col='datetime', parse_dates=True)

# "count" is a method, so it's best to name that column something else
bikes.rename(columns={'count':'total'}, inplace=True)

bikes.head()









    Out[1]:






  
    
      
      season
      holiday
      workingday
      weather
      temp
      atemp
      humidity
      windspeed
      casual
      registered
      total
    
    
      datetime
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      2011-01-01 00:00:00
      1
      0
      0
      1
      9.84
      14.395
      81
      0
      3
      13
      16
    
    
      2011-01-01 01:00:00
      1
      0
      0
      1
      9.02
      13.635
      80
      0
      8
      32
      40
    
    
      2011-01-01 02:00:00
      1
      0
      0
      1
      9.02
      13.635
      80
      0
      5
      27
      32
    
    
      2011-01-01 03:00:00
      1
      0
      0
      1
      9.84
      14.395
      75
      0
      3
      10
      13
    
    
      2011-01-01 04:00:00
      1
      0
      0
      1
      9.84
      14.395
      75
      0
      0
      1
      1

datetime - hourly date + timestamp
season -
- 1 = spring
- 2 = summer
- 3 = fall
- 4 = winter
holiday - whether the day is considered a holiday
workingday - whether the day is neither a weekend nor holiday
weather -
- 1: Clear, Few clouds, Partly cloudy, Partly cloudy
- 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
- 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
- 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
temp - temperature in Celsius
atemp - "feels like" temperature in Celsius
humidity - relative humidity
windspeed - wind speed
casual - number of non-registered user rentals initiated
registered - number of registered user rentals initiated
total - number of total rentals



In [2]:

    
bikes.shape









    Out[2]:





(10886, 11)

Exercise 4.1

What is the relation between the temperature and total?

For a one percent increase in temperature how much the bikes shares increases?

Using sklearn estimate a linear regression and predict the total bikes share when the temperature is 31 degrees



In [4]:

    
# Pandas scatter plot
bikes.plot(kind='scatter', x='temp', y='total', alpha=0.2)









    Out[4]:





<matplotlib.axes._subplots.AxesSubplot at 0xaf41f98>



In [5]:

    
feature_cols = ['temp']
X1 = bikes[feature_cols]
Y1 = bikes.total



In [6]:

    
from sklearn.linear_model import SGDRegressor
from sklearn.linear_model import LinearRegression

clf1 = LinearRegression()
clf1.fit(X1, Y1)
clf1.predict(X1)









    Out[6]:





array([  96.2843313 ,   88.7644881 ,   88.7644881 , ...,  133.88354727,
        133.88354727,  126.36370408])



In [7]:

    
print(clf1.coef_)
print(clf1.intercept_)









    



[ 9.17054048]
6.04621295962

La relación entre la temperatura y el total es directamente proporcional. Podemos observar en la primer grafica que a medida que aumenta la temperatura, el numero total de bicicletas rentadas aumenta también, esto lo podemos corroborar con el coeficiente del modelo lineal de regresion (B1) que al ser positivo indica que cuando la variable X (Temp) aumenta, la variable Y (Total) lo hace también.

Si la temperatura aumenta en 1 unidad, tenemos un incremento de 9 unidades en el total de ciclas rentadas.



In [8]:

    
prediction=clf1.intercept_+(clf1.coef_*31)
prediction









    Out[8]:





array([ 290.33296788])

El total de bibiclitas rentadas cuando la temperatura es de 31º es de 290 unds.

Exercise 04.2

Evaluate the model using the MSE



In [9]:

    
Y1_pred=clf1.predict(X1)
Y1_pred









    Out[9]:





array([  96.2843313 ,   88.7644881 ,   88.7644881 , ...,  133.88354727,
        133.88354727,  126.36370408])



In [60]:

    
from sklearn import metrics
import numpy as np
print('MSE:', metrics.mean_squared_error(Y1, Y1_pred))









    



MSE: 27705.2238053

Exercise 04.3

Does the scale of the features matter?

Let's say that temperature was measured in Fahrenheit, rather than Celsius. How would that affect the model?



In [11]:

    
bikes["temp_conv"]=(bikes.temp*(9/5))+32
bikes.head()









    Out[11]:






  
    
      
      season
      holiday
      workingday
      weather
      temp
      atemp
      humidity
      windspeed
      casual
      registered
      total
      temp_conv
    
    
      datetime
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      2011-01-01 00:00:00
      1
      0
      0
      1
      9.84
      14.395
      81
      0
      3
      13
      16
      49.712
    
    
      2011-01-01 01:00:00
      1
      0
      0
      1
      9.02
      13.635
      80
      0
      8
      32
      40
      48.236
    
    
      2011-01-01 02:00:00
      1
      0
      0
      1
      9.02
      13.635
      80
      0
      5
      27
      32
      48.236
    
    
      2011-01-01 03:00:00
      1
      0
      0
      1
      9.84
      14.395
      75
      0
      3
      10
      13
      49.712
    
    
      2011-01-01 04:00:00
      1
      0
      0
      1
      9.84
      14.395
      75
      0
      0
      1
      1
      49.712



In [12]:

    
feature_cols2 = ['temp_conv']
X2 = bikes[feature_cols]
Y2= bikes.total



In [13]:

    
clf2 = LinearRegression()
clf2.fit(X2, Y2)
Y2_pred=clf2.predict(X2)
Y2_pred









    Out[13]:





array([  96.2843313 ,   88.7644881 ,   88.7644881 , ...,  133.88354727,
        133.88354727,  126.36370408])



In [14]:

    
Y1_pred-Y2_pred









    Out[14]:





array([  7.10542736e-14,   7.10542736e-14,   7.10542736e-14, ...,
         5.68434189e-14,   5.68434189e-14,   7.10542736e-14])

Cómo se ve la diferencia entre las predicciones o valores estimados de la regresión con la temperatura en grados Celsius y los grados Fahrenheit, es muy cercana a cero casi nula. Es decir, que a pesar que la escala de la temperatura esté diferente y la escala cambie, las predicciones no van a tener ninguna variación.

Exercise 04.4

Run a regression model using as features the temperature and temperature$^2$ using the OLS equations



In [15]:

    
bikes['temp2']=bikes.temp**2
bikes.head()









    Out[15]:






  
    
      
      season
      holiday
      workingday
      weather
      temp
      atemp
      humidity
      windspeed
      casual
      registered
      total
      temp_conv
      temp2
    
    
      datetime
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      2011-01-01 00:00:00
      1
      0
      0
      1
      9.84
      14.395
      81
      0
      3
      13
      16
      49.712
      96.8256
    
    
      2011-01-01 01:00:00
      1
      0
      0
      1
      9.02
      13.635
      80
      0
      8
      32
      40
      48.236
      81.3604
    
    
      2011-01-01 02:00:00
      1
      0
      0
      1
      9.02
      13.635
      80
      0
      5
      27
      32
      48.236
      81.3604
    
    
      2011-01-01 03:00:00
      1
      0
      0
      1
      9.84
      14.395
      75
      0
      3
      10
      13
      49.712
      96.8256
    
    
      2011-01-01 04:00:00
      1
      0
      0
      1
      9.84
      14.395
      75
      0
      0
      1
      1
      49.712
      96.8256



In [16]:

    
feature_cols3 = ['temp','temp2']
X3 = bikes[feature_cols]
Y3 = bikes.total



In [17]:

    
clf3 = LinearRegression()
clf3.fit(X3, Y3)
clf3.coef_,clf3.intercept_









    Out[17]:





(array([ 6.82614372,  0.05789996]), 26.262915296862019)



In [18]:

    
Y3_pred=clf3.predict(X3)
Y3_pred









    Out[18]:





array([  99.03836803,   92.54549569,   92.54549569, ...,  132.67068773,
        132.67068773,  125.78849605])

Exercise 04.5

Data visualization.

What behavior is unexpected?



In [19]:

    
# explore more features
feature_cols = ['temp', 'season', 'weather', 'humidity']



In [20]:

    
# multiple scatter plots in Pandas
fig, axs = plt.subplots(1, len(feature_cols), sharey=True)
for index, feature in enumerate(feature_cols):
    bikes.plot(kind='scatter', x=feature, y='total', ax=axs[index], figsize=(16, 3))

Are you seeing anything that you did not expect?

seasons:

1 = spring
2 = summer
3 = fall
4 = winter



In [24]:

    
# pivot table of season and month
month = bikes.index.month
pd.pivot_table(bikes, index='season', columns=month, values='temp', aggfunc=np.count_nonzero).fillna(0)



In [25]:

    
# box plot of rentals, grouped by season
bikes.boxplot(column='total', by='season')









    Out[25]:





<matplotlib.axes._subplots.AxesSubplot at 0xca47eb8>



In [26]:

    
# line plot of rentals
bikes.total.plot()









    Out[26]:





<matplotlib.axes._subplots.AxesSubplot at 0xc9a6278>

El comportamiento inesperado es ver que se rentan más bicicletas en invierno. Uno pensaría que en las demás estaciones es cuando más rentan bicicletas, pero las gráficas nos dicen lo contrario. Podemos ver en los Boxplot que el promedio más alto es de "winter" y podemos ver en la gráfica anterior que en el mes de Octubre es cuando mas se rentan bicicletas y este mes corresponde a invierno. Otra razon que tenemos para decir que es un comportamiento inesperado es que en el primer punto de este taller nos pudimos dar cuenta que la relacion de la temperatura con el total de bicicletas rentadas es directamente proporcional entonces a más temperatura más bicicletas se rentan y este comportamiento no se ve gráficamente.

Exercise 04.6

Estimate a regression using more features ['temp', 'season', 'weather', 'humidity'].

How is the performance compared to using only the temperature?



In [27]:

    
feature_cols4 = ['temp', 'season', 'weather', 'humidity']
X4 = bikes[feature_cols]
Y4 = bikes.total



In [28]:

    
clf4 = LinearRegression()
clf4.fit(X4, Y4)
clf4.coef_,clf4.intercept_









    Out[28]:





(array([  7.86482499,  22.53875753,   6.67030204,  -3.11887338]),
 159.52068786129925)



In [30]:

    
clf1.score(X1,Y1,sample_weight=None), clf4.score(X4,Y4,sample_weight=None)









    Out[30]:





(0.15559367802794855, 0.25829758327282126)



In [55]:

    
Y4_pred=clf4.predict(bikes[['temp', 'season', 'weather', 'humidity']])
Y4_pred









    Out[55]:





array([  13.49088138,   10.16059827,   10.16059827, ...,  175.7304041 ,
        175.7304041 ,  153.68688069])



In [58]:

    
print('MSE:', metrics.mean_squared_error(Y1, Y1_pred))









    



MSE: 27705.2238053



In [59]:

    
print('MSE:', metrics.mean_squared_error(Y4, Y4_pred))









    



MSE: 24335.4779775

Ya que el MSE del segundo modelo (temp, season, weather, humidity) es mas pequeño, podemos afirmar que es mejor que el modelo que tiene solo a la temperatura como variable independiente.

Exercise 04.7 (3 points)

Split randomly the data in train and test

Which of the following models is the best in the testing set?

['temp', 'season', 'weather', 'humidity']
['temp', 'season', 'weather']
['temp', 'season', 'humidity']



In [84]:

    
import numpy as np
from sklearn.cross_validation import train_test_split
X4_train, X4_test, Y4_train, Y4_test = train_test_split(X4, Y4, test_size=0.35, random_state=666)
print(Y4_train.shape, Y4_test.shape)









    



(7075,) (3811,)



In [85]:

    
feature_cols = ['temp', 'season', 'weather']
X5 = bikes[feature_cols]
Y5 = bikes.total



In [86]:

    
import numpy as np
from sklearn.cross_validation import train_test_split
X5_train, X5_test, Y5_train, Y5_test = train_test_split(X5, Y5, test_size=0.35, random_state=666)
print(Y5_train.shape, Y5_test.shape)









    



(7075,) (3811,)



In [87]:

    
feature_cols = ['temp', 'season', 'humidity']
X6 = bikes[feature_cols]
Y6 = bikes.total



In [83]:

    
import numpy as np
from sklearn.cross_validation import train_test_split
X6_train, X6_test, Y6_train, Y6_test = train_test_split(X6, Y6, test_size=0.35, random_state=333)
print(Y6_train.shape, Y6_test.shape)









    



(7075,) (3811,)



In [89]:

    
clf4 = LinearRegression()
clf4.fit(X4_train, Y4_train)









    Out[89]:





LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)



In [90]:

    
clf5 = LinearRegression()
clf5.fit(X5_train, Y5_train)









    Out[90]:





LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)



In [91]:

    
clf6 = LinearRegression()
clf6.fit(X6_train, Y6_train)









    Out[91]:





LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)



In [101]:

    
Y4_pred=clf4.predict(X4_test)
(Y4_test == Y4_pred).mean()









    Out[101]:





0.0



In [106]:

    
Y4_test









    Out[106]:





datetime
2012-03-16 21:00:00    152
2011-07-08 13:00:00    229
2011-08-18 00:00:00     56
2011-01-17 12:00:00     80
2011-04-05 01:00:00     15
2012-02-17 01:00:00     18
2012-11-11 00:00:00    124
2012-01-04 08:00:00    315
2011-07-12 06:00:00    115
2011-07-14 20:00:00    348
2012-11-14 12:00:00    240
2011-07-08 02:00:00     22
2011-05-07 01:00:00     58
2012-08-08 00:00:00     58
2011-09-09 07:00:00    108
2011-02-01 00:00:00      8
2012-10-14 02:00:00     66
2011-07-10 12:00:00    377
2011-06-09 03:00:00      2
2012-12-04 22:00:00    181
2012-12-03 12:00:00    268
2012-03-16 05:00:00     32
2011-08-05 00:00:00     54
2012-11-16 19:00:00    332
2011-09-09 20:00:00    210
2011-03-08 20:00:00     76
2011-04-10 14:00:00    281
2011-10-04 07:00:00    309
2011-11-13 15:00:00    310
2012-02-13 21:00:00    147
                      ... 
2012-09-02 08:00:00    129
2011-03-06 00:00:00     52
2012-07-03 01:00:00     14
2011-07-11 21:00:00    130
2012-03-04 03:00:00     26
2012-05-07 21:00:00    247
2012-12-19 11:00:00    200
2011-11-05 12:00:00    372
2012-04-06 21:00:00    190
2012-04-07 06:00:00     19
2012-01-05 15:00:00    119
2012-04-05 17:00:00    822
2011-03-15 07:00:00    119
2012-11-09 20:00:00    255
2012-05-02 12:00:00    245
2012-12-08 18:00:00    304
2011-09-13 21:00:00    245
2012-01-06 15:00:00    222
2011-07-19 23:00:00     92
2012-09-19 20:00:00    409
2012-01-06 20:00:00    177
2012-01-10 18:00:00    385
2012-10-19 07:00:00    154
2012-06-08 21:00:00    339
2011-03-06 08:00:00      9
2011-01-13 21:00:00     40
2012-04-16 14:00:00    288
2012-10-08 02:00:00     15
2012-08-19 19:00:00    341
2011-04-06 10:00:00     69
Name: total, dtype: int64



In [107]:

    
Y4_pred









    Out[107]:





array([  45.54118706,  288.11052454,  252.37525776, ...,  111.88364762,
        220.11900439,  226.42690061])



In [96]:

    
Y5_pred=clf5.predict(X5_test)
(Y5_test == Y5_pred).mean()









    Out[96]:





0.0



In [98]:

    
Y6_pred=clf6.predict(X6_test)
(Y6_test == Y6_pred).mean()









    Out[98]:





0.0



In [100]:

    
print('MSE:', metrics.mean_squared_error(Y4_test, Y4_pred)),
print('MSE:', metrics.mean_squared_error(Y5_test, Y5_pred)),
print('MSE:', metrics.mean_squared_error(Y6_test, Y6_pred))









    



MSE: 25072.9136442
MSE: 27901.1120978
MSE: 24442.3617874

El menor MSE resultó ser el del modelo (temp, season, humidity) así que este es el mejor modelo usando el metodo de train-test.

	1	2	3	4	5	6	7	8	9	10	11	12
season
1	884	901	901	0	0	0	0	0	0	0	0	0
2	0	0	0	909	912	912	0	0	0	0	0	0
3	0	0	0	0	0	0	912	912	909	0	0	0
4	0	0	0	0	0	0	0	0	0	911	911	912

	season	holiday	workingday	weather	temp	atemp	humidity	windspeed	casual	registered	total
datetime
2011-01-01 00:00:00	1	0	0	1	9.84	14.395	81	0	3	13	16
2011-01-01 01:00:00	1	0	0	1	9.02	13.635	80	0	8	32	40
2011-01-01 02:00:00	1	0	0	1	9.02	13.635	80	0	5	27	32
2011-01-01 03:00:00	1	0	0	1	9.84	14.395	75	0	3	10	13
2011-01-01 04:00:00	1	0	0	1	9.84	14.395	75	0	0	1	1