Exercise 7

Capital Bikeshare data

Introduction

  • Capital Bikeshare dataset from Kaggle: data, data dictionary
  • Each observation represents the bikeshare rentals initiated during a given hour of a given day

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor, export_graphviz

In [2]:
# read the data and set "datetime" as the index
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/bikeshare.csv'
bikes = pd.read_csv(url, index_col='datetime', parse_dates=True)

In [3]:
# "count" is a method, so it's best to rename that column
bikes.rename(columns={'count':'total'}, inplace=True)
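
A minimal sketch of why the rename helps, using a small made-up frame (`demo` and its values are illustrative and not taken from the dataset): attribute access on a column named "count" returns the DataFrame method rather than the column.

```python
import pandas as pd

# Hypothetical three-row frame with a column called "count"
demo = pd.DataFrame({'count': [16, 40, 32]})

attr = demo.count      # the DataFrame.count method, not the column
col = demo['count']    # bracket access still returns the column

is_method = callable(attr) and not isinstance(attr, pd.Series)
print(is_method)

# After renaming, attribute access refers to the column as expected
renamed = demo.rename(columns={'count': 'total'})
print(renamed.total.mean())
```

Bracket access works either way, so the rename mainly matters for the `bikes.total` style used in the rest of this notebook.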

In [4]:
# create "hour" as its own feature
bikes['hour'] = bikes.index.hour

In [5]:
bikes.head()


Out[5]:
                     season  holiday  workingday  weather   temp   atemp  humidity  windspeed  casual  registered  total  hour
datetime
2011-01-01 00:00:00       1        0           0        1   9.84  14.395        81        0.0       3          13     16     0
2011-01-01 01:00:00       1        0           0        1   9.02  13.635        80        0.0       8          32     40     1
2011-01-01 02:00:00       1        0           0        1   9.02  13.635        80        0.0       5          27     32     2
2011-01-01 03:00:00       1        0           0        1   9.84  14.395        75        0.0       3          10     13     3
2011-01-01 04:00:00       1        0           0        1   9.84  14.395        75        0.0       0           1      1     4

In [6]:
bikes.tail()


Out[6]:
                     season  holiday  workingday  weather   temp   atemp  humidity  windspeed  casual  registered  total  hour
datetime
2012-12-19 19:00:00       4        0           1        1  15.58  19.695        50    26.0027       7         329    336    19
2012-12-19 20:00:00       4        0           1        1  14.76  17.425        57    15.0013      10         231    241    20
2012-12-19 21:00:00       4        0           1        1  13.94  15.910        61    15.0013       4         164    168    21
2012-12-19 22:00:00       4        0           1        1  13.94  17.425        61     6.0032      12         117    129    22
2012-12-19 23:00:00       4        0           1        1  13.12  16.665        66     8.9981       4          84     88    23

  • hour ranges from 0 (midnight) through 23 (11pm)
  • workingday is either 0 (weekend or holiday) or 1 (non-holiday weekday)

Exercise 7.1

Run these two groupby statements and figure out what they tell you about the data.


In [7]:
# mean rentals for each value of "workingday"
bikes.groupby('workingday').total.mean()


Out[7]:
workingday
0    188.506621
1    193.011873
Name: total, dtype: float64

In [8]:
# mean rentals for each value of "hour"
bikes.groupby('hour').total.mean()


Out[8]:
hour
0      55.138462
1      33.859031
2      22.899554
3      11.757506
4       6.407240
5      19.767699
6      76.259341
7     213.116484
8     362.769231
9     221.780220
10    175.092308
11    210.674725
12    256.508772
13    257.787281
14    243.442982
15    254.298246
16    316.372807
17    468.765351
18    430.859649
19    315.278509
20    228.517544
21    173.370614
22    133.576754
23     89.508772
Name: total, dtype: float64

Exercise 7.2

Run this plotting code and make sure you understand the output. Then split it into two plots conditioned on "workingday": one showing the hourly trend for "workingday=0" and the other showing the hourly trend for "workingday=1".


In [9]:
# mean rentals for each value of "hour"
bikes.groupby('hour').total.mean().plot()


Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c8fbac82b0>

Plot for workingday == 0 and workingday == 1


In [10]:
# hourly rental trend for "workingday=0"
bikes[bikes.workingday==0].groupby('hour').total.mean().plot()


Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c8fa8976a0>

In [11]:
# hourly rental trend for "workingday=1"
bikes[bikes.workingday==1].groupby('hour').total.mean().plot()


Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c8fb9ca7f0>

In [12]:
# combine the two plots
bikes.groupby(['hour', 'workingday']).total.mean().unstack().plot()


Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c8fbce9ac8>

Write about your findings

Exercise 7.3

Fit a linear regression model to the entire dataset, using "total" as the response and "hour" and "workingday" as the only features. Then, print the coefficients and interpret them. What are the limitations of linear regression in this instance?
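
One possible way to set this up is sketched below. It runs on a small synthetic stand-in (`demo`, an assumption made so the snippet is self-contained, with a deliberately non-linear hour effect) rather than on the real data; on the notebook's data the same steps would presumably use `bikes[feature_cols]` and `bikes.total`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the bikeshare data (an assumption, NOT the real dataset)
rng = np.random.RandomState(0)
demo = pd.DataFrame({
    'hour': np.tile(np.arange(24), 20),       # 20 "days" of 24 hours
    'workingday': np.repeat([0, 1], 240),     # half non-working, half working
})
# Rentals peak near 17:00, so the relationship with hour is not a straight line
demo['total'] = (400 - (demo['hour'] - 17) ** 2 + 10 * demo['workingday']
                 + rng.normal(0, 5, size=len(demo)))

feature_cols = ['hour', 'workingday']
X = demo[feature_cols]
y = demo['total']

linreg = LinearRegression()
linreg.fit(X, y)

# Pair each feature name with its fitted coefficient
coefficients = dict(zip(feature_cols, linreg.coef_))
print(linreg.intercept_, coefficients)
```

Each coefficient is the change in predicted rentals for a one-unit increase in that feature with the other held fixed. A single slope for "hour" cannot follow the rise and fall visible in the hourly means above, which is one limitation worth discussing.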


In [ ]:

Exercise 7.4

Create a Decision Tree to forecast "total" by manually iterating over the features "hour" and "workingday". The tree must have at least 6 end nodes (leaves).
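
A rough sketch of one way to do the manual search, in plain Python. The function names (`sse`, `best_split`, `grow_tree`), the squared-error criterion, the greedy "split the leaf with the largest gain" rule, and the toy `rows` are all assumptions made for illustration; the exercise does not prescribe them. With the real data, the rows could be built with `bikes[['hour', 'workingday', 'total']].to_dict('records')`.

```python
def sse(values):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(rows, features, target):
    """Try every feature and threshold; return (sse, feature, threshold) or None."""
    best = None
    for feature in features:
        # Skipping the smallest value keeps both sides of the split non-empty
        for threshold in sorted({row[feature] for row in rows})[1:]:
            left = [r[target] for r in rows if r[feature] < threshold]
            right = [r[target] for r in rows if r[feature] >= threshold]
            score = sse(left) + sse(right)
            if best is None or score < best[0]:
                best = (score, feature, threshold)
    return best


def grow_tree(rows, features, target, min_leaves=6):
    """Repeatedly split the leaf with the largest SSE reduction."""
    leaves = [{'rules': [], 'rows': rows}]
    while len(leaves) < min_leaves:
        candidates = []
        for i, leaf in enumerate(leaves):
            split = best_split(leaf['rows'], features, target)
            if split is not None:
                gain = sse([r[target] for r in leaf['rows']]) - split[0]
                candidates.append((gain, i, split))
        if not candidates:
            break  # no leaf can be split any further
        _, i, (_, feature, threshold) = max(candidates, key=lambda c: c[0])
        leaf = leaves.pop(i)
        for op, keep in (('<', lambda v: v < threshold),
                         ('>=', lambda v: v >= threshold)):
            leaves.append({
                'rules': leaf['rules'] + ['%s %s %s' % (feature, op, threshold)],
                'rows': [r for r in leaf['rows'] if keep(r[feature])],
            })
    return leaves


# Toy data (made up): commuter peaks on working days, a midday peak otherwise
rows = []
for workingday in (0, 1):
    for hour in range(24):
        if workingday:
            total = 400 if hour in (8, 17, 18) else 150 if 7 <= hour <= 20 else 20
        else:
            total = 300 if 11 <= hour <= 16 else 100 if 8 <= hour <= 20 else 30
        rows.append({'hour': hour, 'workingday': workingday, 'total': total})

leaves = grow_tree(rows, ['hour', 'workingday'], 'total', min_leaves=6)
for leaf in leaves:
    values = [r['total'] for r in leaf['rows']]
    print(leaf['rules'], len(values), sum(values) / len(values))
```

Each leaf's prediction is the mean of "total" over the rows that satisfy its rules, and the printed rules can be read directly as the branches of the tree.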


In [ ]:

Exercise 7.5

Train a Decision Tree using scikit-learn. Comment on the performance of the models.
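
A possible starting point is sketched below, again on synthetic stand-in data so that it runs on its own. The helper `cv_rmse`, the choice of `max_depth=7`, 10-fold cross-validation, and RMSE as the metric are assumptions for illustration, not requirements of the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the bikeshare data (an assumption, NOT the real dataset)
rng = np.random.RandomState(1)
demo = pd.DataFrame({
    'hour': rng.randint(0, 24, size=500),
    'workingday': rng.randint(0, 2, size=500),
})
demo['total'] = (400 - (demo['hour'] - 17) ** 2 + 10 * demo['workingday']
                 + rng.normal(0, 5, size=500))

X = demo[['hour', 'workingday']]
y = demo['total']


def cv_rmse(model, X, y, cv=10):
    """Average root-mean-squared error over `cv` cross-validation folds."""
    scores = cross_val_score(model, X, y, cv=cv,
                             scoring='neg_mean_squared_error')
    return np.sqrt(-scores).mean()


treereg = DecisionTreeRegressor(max_depth=7, random_state=1)
tree_rmse = cv_rmse(treereg, X, y)
linreg_rmse = cv_rmse(LinearRegression(), X, y)
print('tree RMSE:', tree_rmse, 'linear RMSE:', linreg_rmse)
```

Scoring both models with the same cross-validation helper puts the comparison on equal footing; trying several `max_depth` values is a natural next step when commenting on performance.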


In [ ]: