Daily bike rental ridership prediction using an artificial neural network in Keras
Supervised Learning. Regression
Based on the first neural network project from Udacity's Deep Learning Nanodegree Foundation
Click here to see my original NumPy solution
In [1]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import helper
import keras
helper.info_gpu()
sns.set_palette("Reds")
helper.reproducible(seed=0)  # set up reproducible results from run to run with Keras
%matplotlib inline
%load_ext autoreload
%autoreload 2
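The source of helper is not included here. helper.reproducible presumably fixes all the random seeds Keras depends on; a minimal sketch of such a function (the actual implementation may differ) could be:
import random
import tensorflow as tf

def reproducible(seed=0):
    # Hypothetical sketch: fix the seeds of Python, NumPy and TensorFlow
    os.environ['PYTHONHASHSEED'] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    tf.set_random_seed(seed)  # tf.random.set_seed(seed) in TensorFlow 2.x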
In [2]:
data_path = 'data/Bike-Sharing-Dataset/hour.csv'
target = ['cnt', 'casual', 'registered']
df = pd.read_csv(data_path)
In [3]:
helper.info_data(df, target)
In [4]:
df.head(3)
Out[4]:
In [5]:
df.describe(percentiles=[0.5])
Out[5]:
In [6]:
helper.missing(df);
In [7]:
ind = pd.to_datetime(df['dteday'])
df['day'] = pd.DatetimeIndex(ind).day  # extract the day of the month from the date
In [8]:
droplist = ['atemp', 'day', 'instant']  # 'atemp' is highly correlated with 'temp'; 'instant' is just a record index
df.drop(droplist, axis='columns', inplace=True)
In [9]:
numerical = ['temp', 'hum', 'windspeed', 'cnt', 'casual', 'registered']
df = helper.classify_data(df, target, numerical)
helper.get_types(df)
Out[9]:
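helper.classify_data is not shown; presumably it enforces the column types, casting the numerical columns to floats and the remaining features to the 'category' dtype, roughly:
def classify_data(df, target, numerical):
    # Hypothetical sketch: numerical (and target) columns as floats, the rest as categories
    df = df.copy()
    for col in df.columns:
        if col in numerical:
            df[col] = df[col].astype('float32')
        elif col != 'dteday':  # keep the raw date string
            df[col] = df[col].astype('category')
    return df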
This dataset contains the number of riders for each hour of each day from January 1, 2011 to December 31, 2012. The number of riders is split between casual and registered, and summed in the cnt column. We also have information about temperature, humidity, and wind speed, all of which likely affect the number of riders. Below is a plot of the hourly rentals over the first 10 days in the dataset. Weekends have lower overall ridership, and there are spikes during the week when people bike to and from work.
In [10]:
df[:24 * 10].plot(x='dteday', y='cnt');
In [11]:
helper.show_categorical(df[['holiday', 'workingday', 'weathersit']], sharey=True)
In [12]:
helper.show_target_vs_categorical(
    df.drop(['dteday', 'season'], axis='columns'), target, ncols=7)
In [13]:
g = sns.PairGrid(
    df, y_vars='casual', x_vars='registered', size=5, aspect=9 / 4, hue='weathersit')
g.map(sns.regplot, fit_reg=False).add_legend()
g.axes[0, 0].set_ylim(0, 350)
Out[13]:
This plot shows the differences between the number of registered and casual riders for the different weather situations. In very bad weather (weathersit=4), most of the riders are registered.
In [14]:
helper.show_numerical(df, kde=True, ncols=6)
In [15]:
helper.show_target_vs_numerical(df, target, jitter=0.05)
In [16]:
helper.correlation(df, target, figsize=(10,4))
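helper.correlation presumably displays the correlation between the features and the targets; an equivalent sketch using pandas and seaborn:
# Hypothetical equivalent: correlation of the numeric columns with the targets
corr = df[numerical].corr()[target]
fig, ax = plt.subplots(figsize=(10, 4))
sns.heatmap(corr.T, annot=True, fmt='.2f', ax=ax)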
In [17]:
droplist = ['dteday']  # features to drop from the model
# use a copy ('data') for the model, keeping the original 'df' intact
data = df.copy()
data.drop(droplist, axis='columns', inplace=True)
data.head(3)
Out[17]:
In [18]:
data, scale_param = helper.scale(data)
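helper.scale standardizes the numeric columns and returns the scaling parameters, which are needed later to convert the predictions back to rider counts; a sketch, assuming a dict of (mean, std) tuples per column:
def scale(df):
    # Hypothetical sketch: standardize numeric columns to zero mean and unit variance
    df = df.copy()
    scale_param = {}
    for col in df.select_dtypes(include=[np.number]).columns:
        mean, std = df[col].mean(), df[col].std()
        df[col] = (df[col] - mean) / std
        scale_param[col] = (mean, std)
    return df, scale_param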
In [19]:
data, dict_dummies = helper.replace_by_dummies(data, target)
model_features = [f for f in data if f not in target]  # neural network input features (all columns except the targets)
data.head(3)
Out[19]:
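helper.replace_by_dummies presumably one-hot encodes every categorical feature except the targets, returning the mapping of created dummy columns; roughly:
def replace_by_dummies(df, target):
    # Hypothetical sketch: one-hot encode the categorical (non-target) columns
    dummies = {}
    for col in df.select_dtypes(include=['category']).columns:
        if col not in target:
            new = pd.get_dummies(df[col], prefix=col)
            dummies[col] = list(new.columns)
            df = pd.concat([df.drop(col, axis='columns'), new], axis='columns')
    return df, dummies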
In [20]:
# Save the last 21 days as a test set
test = data[-21 * 24:]
train = data[:-21 * 24]
# Hold out the last 60 days of the remaining data as a validation set
val = train[-60 * 24:]
train = train[:-60 * 24]
# Separate the data into features(x) and targets(y)
x_train, y_train = train.drop(target, axis=1).values, train[target].values
x_val, y_val = val.drop(target, axis=1).values, val[target].values
x_test, y_test = test.drop(target, axis=1).values, test[target].values
print("train size \t X:{} \t Y:{}".format(x_train.shape, y_train.shape))
print("val size \t X:{} \t Y:{}".format(x_val.shape, y_val.shape))
print("test size \t X:{} \t Y:{} ".format(x_test.shape, y_test.shape))
In [21]:
model = helper.build_nn_reg(
    x_train.shape[1], y_train.shape[1], hidden_layers=2, dropout=0.2, summary=True)
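build_nn_reg is defined in helper; a minimal Keras sketch consistent with the call above (two hidden ReLU layers with dropout and a linear output trained on mean squared error; the layer width is an assumption):
from keras.models import Sequential
from keras.layers import Dense, Dropout

def build_nn_reg(input_size, output_size, hidden_layers=2, dropout=0.2, summary=False):
    # Hypothetical sketch: fully connected regression network
    model = Sequential()
    model.add(Dense(64, activation='relu', input_shape=(input_size,)))  # 64 units assumed
    model.add(Dropout(dropout))
    for _ in range(hidden_layers - 1):
        model.add(Dense(64, activation='relu'))
        model.add(Dropout(dropout))
    model.add(Dense(output_size))  # linear activation for regression
    model.compile(loss='mse', optimizer='adam')
    if summary:
        model.summary()
    return model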
In [23]:
model_path = os.path.join("models", "bike_rental.h5")
model = None
model = helper.build_nn_reg(
    x_train.shape[1], y_train.shape[1], hidden_layers=2, dropout=0.2, summary=True)
callbacks = [keras.callbacks.EarlyStopping(monitor='val_loss', patience=2, verbose=0)]
helper.train_nn(
    model,
    x_train,
    y_train,
    validation_data=[x_val, y_val],
    path=model_path,
    epochs=100,
    batch_size=2048,
    callbacks=callbacks)
from sklearn.metrics import r2_score
ypred_train = model.predict(x_train)
ypred_val = model.predict(x_val)
print('\nTraining R2-score: \t{:.3f}'.format(r2_score(y_train, ypred_train)))
print('Validation R2-score: \t{:.3f}'.format(r2_score(y_val, ypred_val)))
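helper.train_nn presumably wraps model.fit, saving the best weights to path via a checkpoint; something along these lines:
def train_nn(model, x, y, validation_data=None, path=None,
             epochs=100, batch_size=2048, callbacks=None):
    # Hypothetical sketch: fit with a checkpoint keeping the best validation loss
    callbacks = list(callbacks) if callbacks else []
    if path:
        callbacks.append(
            keras.callbacks.ModelCheckpoint(path, monitor='val_loss', save_best_only=True))
    return model.fit(x, y, validation_data=tuple(validation_data),
                     epochs=epochs, batch_size=batch_size,
                     callbacks=callbacks, verbose=0)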
In [24]:
y_pred_test = model.predict(x_test, verbose=0)
helper.regression_scores(y_test, y_pred_test, return_dataframe=True, index="DNN")
Out[24]:
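helper.regression_scores presumably gathers the usual regression metrics into a one-row dataframe; an equivalent sketch with scikit-learn:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_scores(y_true, y_pred, return_dataframe=False, index='model'):
    # Hypothetical sketch: R2, MAE and RMSE in a single row
    scores = {'R2': r2_score(y_true, y_pred),
              'MAE': mean_absolute_error(y_true, y_pred),
              'RMSE': np.sqrt(mean_squared_error(y_true, y_pred))}
    return pd.DataFrame(scores, index=[index]) if return_dataframe else scores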
In [25]:
fig, ax = plt.subplots(figsize=(14, 5))
mean, std = scale_param['cnt']
predictions = y_pred_test * std + mean
ax.plot(predictions[:, 0], label='Prediction')
ax.plot((test['cnt'] * std + mean).values, label='Data')
ax.set_xlim(right=len(predictions))
ax.legend()
dates = pd.to_datetime(df.iloc[test.index]['dteday'])
dates = dates.apply(lambda d: d.strftime('%b %d'))
ax.set_xticks(np.arange(len(dates))[12::24])
_ = ax.set_xticklabels(dates[12::24], rotation=45)
The model seems quite accurate considering that only two years of data were available.
It fails on the last 10 days of December, where it predicts more riders than actually appeared.
The model was not trained to predict this drop: the training set covered December 22 to December 31 of one year only (2011), which is not enough. An exploratory analysis and some tests with different models led me to the following conclusions:
Adding more features from the dataset has a negligible impact on the accuracy of the model; it only increases the size of the neural network.
Removing or replacing the current features makes the model worse.
The training period December 22 to December 31 of 2011 had more registered riders (mean = 73.6) than the same period in the 2012 test set (mean = 58.3). A ridership drop around Christmas 2012 can be predicted from the weather (worse than in 2011), but not the large decline actually observed. Adding new features, such as the number of active registrations or a Christmas indicator, could help.
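The December means quoted above can be checked directly on the raw dataframe (df keeps the unscaled values); for example:
# Mean hourly registered riders during December 22-31, per year
dates = pd.to_datetime(df['dteday'])
mask = (dates.dt.month == 12) & (dates.dt.day >= 22)
df[mask].groupby(dates[mask].dt.year)['registered'].mean()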
In [26]:
# restore the full training set (train + validation) for the scikit-learn models
x_train = np.vstack([x_train, x_val])
y_train = np.vstack([y_train, y_val])
In [27]:
helper.ml_regression(x_train, y_train[:,0], x_test, y_test[:,0])
Out[27]:
In [28]:
from sklearn.ensemble import RandomForestRegressor
random_forest = RandomForestRegressor(
    n_jobs=-1, n_estimators=100, random_state=0).fit(x_train, np.ravel(y_train[:, 0]))
y_pred = random_forest.predict(x_test)
helper.regression_scores(y_test[:, 0], y_pred, return_dataframe=True, index="Random Forest")
Out[28]:
In [29]:
results = helper.feature_importances(model_features, random_forest)
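helper.feature_importances presumably pairs random_forest.feature_importances_ with the feature names and returns them sorted; for instance:
def feature_importances(features, estimator):
    # Hypothetical sketch: feature importances sorted in descending order
    imp = pd.Series(estimator.feature_importances_, index=features).sort_values(ascending=False)
    imp.plot(kind='bar', figsize=(12, 4))
    return imp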