A notebook that explores the Washington DC bikeshare rentals


In [26]:
# Data wrangling packages
import pandas as pd
import numpy as np

# ML packages
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor 
from sklearn.ensemble import RandomForestRegressor

# Visualization packages
import matplotlib.pyplot as plt
import seaborn as sns


%matplotlib inline

 Exploring the data


In [3]:
bike_rentals = pd.read_csv('bike_rental_hour.csv')

In [4]:
bike_rentals.head()


Out[4]:
   instant      dteday  season  yr  mnth  hr  holiday  weekday  workingday  weathersit  temp   atemp   hum  windspeed  casual  registered  cnt
0        1  2011-01-01       1   0     1   0        0        6           0           1  0.24  0.2879  0.81        0.0       3          13   16
1        2  2011-01-01       1   0     1   1        0        6           0           1  0.22  0.2727  0.80        0.0       8          32   40
2        3  2011-01-01       1   0     1   2        0        6           0           1  0.22  0.2727  0.80        0.0       5          27   32
3        4  2011-01-01       1   0     1   3        0        6           0           1  0.24  0.2879  0.75        0.0       3          10   13
4        5  2011-01-01       1   0     1   4        0        6           0           1  0.24  0.2879  0.75        0.0       0           1    1
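
Before going further, a quick structural check can confirm the frame loaded cleanly (a sketch, not part of the original run):

# Dimensions and a count of missing values across the whole frame
print(bike_rentals.shape)
print(bike_rentals.isnull().sum().sum(), 'missing values in total')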

Exploring the cnt column (the target)


In [7]:
bike_rentals.cnt.hist()


Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f50d236bf60>

In [10]:
sns.distplot(bike_rentals['cnt'])


Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f50d2383c50>
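
Both plots suggest a right-skewed target; a numeric summary makes that concrete (a sketch using the same column):

# Summary statistics of the target; a mean well above the median
# would indicate right skew
print(bike_rentals['cnt'].describe())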

 Exploring the data correlations


In [16]:
# Keep only the lower triangle of the correlation matrix
# (the upper triangle is a mirror image)
data = bike_rentals.corr()
df = pd.DataFrame(np.tril(data), index=data.index,
                  columns=data.columns)
sns.heatmap(df)


Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f50bc5a6b00>
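
The heatmap gives the overview; a numeric ranking of each feature's correlation with the target is a useful complement (a sketch; casual and registered are excluded because they sum to cnt):

# Correlation of each candidate feature with the target, strongest first
corr_with_cnt = (bike_rentals.corr()['cnt']
                 .drop(['cnt', 'casual', 'registered'])
                 .sort_values(ascending=False))
print(corr_with_cnt)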

 Feature engineering


In [5]:
def assign_label(hour):
    """Map an hour of the day to one of four six-hour segments."""
    hour_segments = {1: range(6, 12),
                     2: range(12, 18),
                     3: range(18, 24),
                     4: range(0, 6)}
    for label, hours in hour_segments.items():
        if hour in hours:
            return int(label)

In [6]:
bike_rentals['time_label'] = bike_rentals['hr'].apply(
    assign_label)
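
A quick sanity check on the new feature (a sketch, not in the original run): every hour should map to a segment, and the four segments should each cover six hours of the day:

# No hour should be left unlabelled, and each of the four
# segments should account for six hours of the day
assert bike_rentals['time_label'].notnull().all()
print(bike_rentals['time_label'].value_counts().sort_index())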

Choosing the metric

Since this is a regression problem, RMSE (root mean squared error) is a reasonable first choice of metric. Another metric could be substituted later if a better alternative appears.


In [7]:
def rmse(pred, true):
    return np.sqrt(mean_squared_error(pred, true))
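
As a quick sanity check on the helper (a hedged example with arbitrary values), a constant error of 3 should yield an RMSE of exactly 3:

# A prediction that is off by 3 everywhere has RMSE 3
assert rmse(np.array([5.0, 8.0]), np.array([2.0, 5.0])) == 3.0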

 Train/Test split


In [8]:
feature_columns = bike_rentals.columns.drop(['cnt', 'casual', 'registered',
                                             'dteday'])

In [10]:
features = bike_rentals[feature_columns]
target = bike_rentals['cnt']

In [11]:
f_train, f_test, t_train, t_test = train_test_split(features,
                                                    target,
                                                    train_size=0.8,
                                                    random_state=314)
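
A quick check that the split came out as intended (a sketch, not in the original run):

# The train set should hold about four times as many rows as the test set
print('train:', f_train.shape, ' test:', f_test.shape)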

Try different models

 Linear model


In [18]:
linear_model = LinearRegression()
linear_model.fit(f_train, t_train)
rmse(linear_model.predict(f_test), t_test)


Out[18]:
134.50592946052785

=> The linear model performs poorly: the decision to rent a bike depends on strongly non-linear effects (e.g. the hour of day), which a linear fit cannot capture.
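
One way to see this non-linearity directly (a sketch, not part of the original notebook): the average rentals per hour of day rise and fall with two commuting peaks, a pattern a straight line cannot fit:

# Mean rentals per hour of day: two peaks, clearly not linear
bike_rentals.groupby('hr')['cnt'].mean().plot(kind='bar')
plt.xlabel('hour of day')
plt.ylabel('mean cnt')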

 Decision trees


In [23]:
decision_trees = DecisionTreeRegressor(random_state=314159265, 
                                       min_samples_leaf=3)
decision_trees.fit(f_train, t_train)
rmse(decision_trees.predict(f_test), t_test)


Out[23]:
54.198566604028734

=> The decision tree performs much better than the linear model, since trees can capture non-linear relationships between the features and the target.
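
To gauge how much the tree overfits, train and test RMSE can be compared on the fitted model (a sketch; a large gap would confirm overfitting):

# A train RMSE far below the test RMSE signals overfitting
print('train RMSE:', rmse(decision_trees.predict(f_train), t_train))
print('test RMSE: ', rmse(decision_trees.predict(f_test), t_test))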

 Random forests


In [27]:
random_forests = RandomForestRegressor(random_state=314159265, 
                                       min_samples_leaf=3)
random_forests.fit(f_train, t_train)
rmse(random_forests.predict(f_test), t_test)


Out[27]:
44.983179435408545

=> The random forest performs best of the three models. Like a single decision tree it handles non-linearities, but averaging many trees reduces the overfitting that a lone tree is prone to.
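
As a follow-up, the forest's feature importances show which inputs drive its predictions (a sketch reusing the fitted model and feature_columns from above):

# Relative importance of each feature in the fitted forest
importances = pd.Series(random_forests.feature_importances_,
                        index=feature_columns)
print(importances.sort_values(ascending=False))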