Bike Rental

Daily bike rental ridership prediction using an artificial neural network in Keras

Supervised Learning: Regression

Based on the first neural network project of Udacity's Deep Learning Nanodegree Foundation

Click here to check my original solution in NumPy


In [1]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import helper
import keras

helper.info_gpu()
sns.set_palette("Reds")
helper.reproducible(seed=0)  # set up reproducible results from run to run using Keras

%matplotlib inline
%load_ext autoreload
%autoreload 2


Using TensorFlow backend.
/device:GPU:0
Keras		v2.1.4
TensorFlow	v1.4.1
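helper.reproducible is part of the project's helper module and is not shown here. A minimal sketch of what such a seeding routine typically does with the TensorFlow backend (an assumption, not the actual helper code):

import os
import random
import numpy as np
import tensorflow as tf

def reproducible(seed=0):
    """Seed the main sources of randomness (sketch; the real helper may differ)."""
    os.environ['PYTHONHASHSEED'] = str(seed)  # hash-based operations
    random.seed(seed)                         # Python built-in RNG
    np.random.seed(seed)                      # NumPy RNG used by Keras initializers
    tf.set_random_seed(seed)                  # TensorFlow graph-level seed (TF 1.x API)

Even with all seeds fixed, results on a GPU may still vary slightly between runs due to non-deterministic kernels.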

1. Data Processing and Exploratory Data Analysis


In [2]:
data_path = 'data/Bike-Sharing-Dataset/hour.csv'
target = ['cnt', 'casual', 'registered']

df = pd.read_csv(data_path)

Explore the data


In [3]:
helper.info_data(df, target)


Samples: 	17379 
Features: 	14
Target: 	cnt
Target: 	casual
Target: 	registered

In [4]:
df.head(3)


Out[4]:
instant dteday season yr mnth hr holiday weekday workingday weathersit temp atemp hum windspeed casual registered cnt
0 1 2011-01-01 1 0 1 0 0 6 0 1 0.24 0.2879 0.81 0.0 3 13 16
1 2 2011-01-01 1 0 1 1 0 6 0 1 0.22 0.2727 0.80 0.0 8 32 40
2 3 2011-01-01 1 0 1 2 0 6 0 1 0.22 0.2727 0.80 0.0 5 27 32

Numerical Data


In [5]:
df.describe(percentiles=[0.5])


Out[5]:
instant season yr mnth hr holiday weekday workingday weathersit temp atemp hum windspeed casual registered cnt
count 17379.0000 17379.000000 17379.000000 17379.000000 17379.000000 17379.000000 17379.000000 17379.000000 17379.000000 17379.000000 17379.000000 17379.000000 17379.000000 17379.000000 17379.000000 17379.000000
mean 8690.0000 2.501640 0.502561 6.537775 11.546752 0.028770 3.003683 0.682721 1.425283 0.496987 0.475775 0.627229 0.190098 35.676218 153.786869 189.463088
std 5017.0295 1.106918 0.500008 3.438776 6.914405 0.167165 2.005771 0.465431 0.639357 0.192556 0.171850 0.192930 0.122340 49.305030 151.357286 181.387599
min 1.0000 1.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.020000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000
50% 8690.0000 3.000000 1.000000 7.000000 12.000000 0.000000 3.000000 1.000000 1.000000 0.500000 0.484800 0.630000 0.194000 17.000000 115.000000 142.000000
max 17379.0000 4.000000 1.000000 12.000000 23.000000 1.000000 6.000000 1.000000 4.000000 1.000000 1.000000 1.000000 0.850700 367.000000 886.000000 977.000000

Missing values


In [6]:
helper.missing(df);


No missing values found

Transform the data

Add new features


In [7]:
ind = pd.to_datetime(df['dteday'])
df['day'] = pd.DatetimeIndex(ind).day

Remove irrelevant features


In [8]:
droplist = ['atemp', 'day', 'instant']
df.drop(droplist, axis='columns', inplace=True)

Classify variables


In [9]:
numerical = ['temp', 'hum', 'windspeed', 'cnt', 'casual', 'registered']

df = helper.classify_data(df, target, numerical)

helper.get_types(df)


Numerical features: 	3
Categorical features: 	9
Target: 		cnt (float32)
Target: 		casual (float32)
Target: 		registered (float32)
Out[9]:
temp hum windspeed dteday season yr mnth hr holiday weekday workingday weathersit cnt casual registered
Type float32 float32 float32 category category category category category category category category category float32 float32 float32
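helper.classify_data is project-specific code; a rough pandas equivalent consistent with the output above (assumed behaviour: cast the targets and the listed numerical columns to float32 and everything else to category) could look like:

def classify_data(df, target, numerical):
    """Assign dtypes: float32 for numerical/target columns, category otherwise (sketch)."""
    df = df.copy()
    for col in df.columns:
        if col in numerical or col in target:
            df[col] = df[col].astype('float32')
        else:
            df[col] = df[col].astype('category')
    return df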

Visualize the data

This dataset contains the number of riders for each hour of each day from January 1, 2011 to December 31, 2012. The number of riders is split between casual and registered, and summed up in the cnt column. We also have information about temperature, humidity, and wind speed, all of which likely affect the number of riders. Below is a plot showing the hourly rentals over the first 10 days in the dataset. Weekends have lower overall ridership, and there are spikes when people are biking to and from work during the week.


In [10]:
df[:24 * 10].plot(x='dteday', y='cnt');


Categorical features


In [11]:
helper.show_categorical(df[['holiday', 'workingday', 'weathersit']], sharey=True)


Target vs Categorical features


In [12]:
helper.show_target_vs_categorical(
    df.drop(['dteday', 'season'], axis='columns'), target, ncols=7)



In [13]:
g = sns.PairGrid(
    df, y_vars='casual', x_vars='registered', size=5, aspect=9 / 4, hue='weathersit')
g.map(sns.regplot, fit_reg=False).add_legend()
g.axes[0, 0].set_ylim(0, 350)


Out[13]:
(0, 350)

This plot shows the difference between the numbers of registered and casual riders for the different weather situations. Under very bad weather (weathersit = 4), most of the riders are registered.

Numerical features


In [14]:
helper.show_numerical(df, kde=True, ncols=6)


Target vs numerical features


In [15]:
helper.show_target_vs_numerical(df, target, jitter=0.05)


Correlation between numerical features and target


In [16]:
helper.correlation(df, target, figsize=(10,4))
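helper.correlation is not reproduced here; a minimal version that plots the Pearson correlation of each numerical feature with the targets (a sketch, assuming a seaborn heatmap) might be:

import matplotlib.pyplot as plt
import seaborn as sns

def correlation(df, target, figsize=(10, 4)):
    """Heatmap of correlations between the numerical columns and the targets (sketch)."""
    corr = df.corr()  # non-numeric (categorical) columns are ignored
    plt.figure(figsize=figsize)
    sns.heatmap(corr.drop(target).loc[:, target], annot=True, cmap='Reds')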


2. Neural Network model

Select the features


In [17]:
droplist = ['dteday'] # features to drop from the model

# for the model, use 'data' (a copy of 'df')
data = df.copy()
data.drop(droplist, axis='columns', inplace=True)
data.head(3)


Out[17]:
temp hum windspeed season yr mnth hr holiday weekday workingday weathersit cnt casual registered
0 0.24 0.81 0.0 1 0 1 0 0 6 0 1 16.0 3.0 13.0
1 0.22 0.80 0.0 1 0 1 1 0 6 0 1 40.0 8.0 32.0
2 0.22 0.80 0.0 1 0 1 2 0 6 0 1 32.0 5.0 27.0

Scale numerical variables

Shift and scale the numerical variables to zero mean and unit variance. The scaling factors are saved so they can be applied again when making predictions.


In [18]:
data, scale_param = helper.scale(data)
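helper.scale is not shown; a simple standardization that also returns the per-column mean and standard deviation, needed later to un-scale the predictions, might look like this (a sketch, not the actual helper):

def scale(data):
    """Standardize the float columns and keep the scaling factors (sketch)."""
    data = data.copy()
    scale_param = {}
    for col in data.select_dtypes(include=['float32', 'float64']).columns:
        mean, std = data[col].mean(), data[col].std()
        data[col] = (data[col] - mean) / std
        scale_param[col] = (mean, std)
    return data, scale_param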

Create dummy features

Replace the categorical features (excluding the targets) with dummy (one-hot encoded) features


In [19]:
data, dict_dummies = helper.replace_by_dummies(data, target)

model_features = [f for f in data if f not in target] # sorted neural network inputs

data.head(3)


Out[19]:
temp hum windspeed cnt casual registered season_1 season_2 season_3 season_4 ... weekday_3 weekday_4 weekday_5 weekday_6 workingday_0 workingday_1 weathersit_1 weathersit_2 weathersit_3 weathersit_4
0 -1.334609 0.947345 -1.553844 -0.956312 -0.662736 -0.930162 1 0 0 0 ... 0 0 0 1 1 0 1 0 0 0
1 -1.438475 0.895513 -1.553844 -0.823998 -0.561327 -0.804632 1 0 0 0 ... 0 0 0 1 1 0 1 0 0 0
2 -1.438475 0.895513 -1.553844 -0.868103 -0.622172 -0.837666 1 0 0 0 ... 0 0 0 1 1 0 1 0 0 0

3 rows × 63 columns
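helper.replace_by_dummies presumably wraps pd.get_dummies, one-hot encoding every categorical column while leaving the targets untouched. A hedged sketch consistent with the output above:

import pandas as pd

def replace_by_dummies(data, target):
    """One-hot encode the categorical (non-target) columns (sketch)."""
    categorical = [c for c in data.select_dtypes(include=['category']).columns
                   if c not in target]
    dummies = pd.get_dummies(data[categorical])
    data = pd.concat([data.drop(categorical, axis='columns'), dummies], axis=1)
    # remember which dummy columns belong to each original feature
    dict_dummies = {c: [d for d in dummies.columns if d.startswith(c + '_')]
                    for c in categorical}
    return data, dict_dummies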

Split the data into training, validation, and test sets

Note on data leakage: the test set is hidden while training the model, but it was already seen when preprocessing the dataset (scaling and dummy encoding were applied to the full data).


In [20]:
# Save the last 21 days as a test set
test = data[-21 * 24:]
train = data[:-21 * 24]

# Hold out the last 60 days of the remaining data as a validation set
val = train[-60 * 24:]
train = train[:-60 * 24]

# Separate the data into features(x) and targets(y)
x_train, y_train = train.drop(target, axis=1).values, train[target].values
x_val, y_val = val.drop(target, axis=1).values, val[target].values
x_test, y_test = test.drop(target, axis=1).values, test[target].values

print("train size \t X:{} \t Y:{}".format(x_train.shape, y_train.shape))
print("val size \t X:{} \t Y:{}".format(x_val.shape, y_val.shape))
print("test size  \t X:{} \t Y:{} ".format(x_test.shape, y_test.shape))


train size 	 X:(15435, 60) 	 Y:(15435, 3)
val size 	 X:(1440, 60) 	 Y:(1440, 3)
test size  	 X:(504, 60) 	 Y:(504, 3) 

Build the Neural Network


In [21]:
model = helper.build_nn_reg(
    x_train.shape[1], y_train.shape[1], hidden_layers=2, dropout=0.2, summary=True)


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_1 (Dense)              (None, 60)                3660      
_________________________________________________________________
dropout_1 (Dropout)          (None, 60)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 61)                3721      
_________________________________________________________________
dropout_2 (Dropout)          (None, 61)                0         
_________________________________________________________________
dense_3 (Dense)              (None, 3)                 186       
=================================================================
Total params: 7,567
Trainable params: 7,567
Non-trainable params: 0
_________________________________________________________________
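helper.build_nn_reg is project code not reproduced here. Judging from the summary above, an equivalent Keras model would be the following sketch; the layer sizes (60 and 61 units) are taken from the printout, while the activations, loss, and optimizer are assumptions:

from keras.models import Sequential
from keras.layers import Dense, Dropout

def build_nn_reg(input_size, output_size, hidden_units=(60, 61), dropout=0.2, summary=False):
    """Fully connected regression network (sketch matching the summary above)."""
    model = Sequential()
    model.add(Dense(hidden_units[0], activation='relu', input_shape=(input_size,)))
    model.add(Dropout(dropout))
    for units in hidden_units[1:]:
        model.add(Dense(units, activation='relu'))
        model.add(Dropout(dropout))
    model.add(Dense(output_size))  # linear output for regression
    model.compile(loss='mse', optimizer='adam')
    if summary:
        model.summary()
    return model

Note that the real helper takes a hidden_layers argument rather than explicit unit counts; the signature above is only illustrative.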

Train the Neural Network


In [23]:
model_path = os.path.join("models", "bike_rental.h5")

model = None
model = helper.build_nn_reg(
    x_train.shape[1], y_train.shape[1], hidden_layers=2, dropout=0.2, summary=True)

callbacks = [keras.callbacks.EarlyStopping(monitor='val_loss', patience=2, verbose=0)]

helper.train_nn(
    model,
    x_train,
    y_train,
    validation_data=[x_val, y_val],
    path=model_path,
    epochs=100,
    batch_size=2048,
    callbacks=callbacks)

from sklearn.metrics import r2_score

ypred_train = model.predict(x_train)
ypred_val = model.predict(x_val)
print('\nTraining   R2-score: \t{:.3f}'.format(r2_score(y_train, ypred_train)))
print('Validation R2-score: \t{:.3f}'.format(r2_score(y_val, ypred_val)))


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_7 (Dense)              (None, 60)                3660      
_________________________________________________________________
dropout_5 (Dropout)          (None, 60)                0         
_________________________________________________________________
dense_8 (Dense)              (None, 61)                3721      
_________________________________________________________________
dropout_6 (Dropout)          (None, 61)                0         
_________________________________________________________________
dense_9 (Dense)              (None, 3)                 186       
=================================================================
Total params: 7,567
Trainable params: 7,567
Non-trainable params: 0
_________________________________________________________________
Training ....
time: 	 3.3 s
Training loss:  	0.1275
Validation loss: 	0.1668

Model saved at models/bike_rental.h5

Training   R2-score: 	0.915
Validation R2-score: 	0.867
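helper.train_nn wraps the Keras training loop and saves the best model to the given path. A minimal sketch (assuming a ModelCheckpoint on the validation loss; the real helper may also time the run and plot learning curves):

from keras.callbacks import ModelCheckpoint

def train_nn(model, x_train, y_train, validation_data, path,
             epochs=100, batch_size=2048, callbacks=None):
    """Fit the model and checkpoint the weights with the lowest validation loss (sketch)."""
    callbacks = list(callbacks or [])
    callbacks.append(ModelCheckpoint(path, monitor='val_loss', save_best_only=True))
    history = model.fit(x_train, y_train,
                        validation_data=validation_data,
                        epochs=epochs, batch_size=batch_size,
                        callbacks=callbacks, verbose=0)
    print('Training loss:  \t{:.4f}'.format(history.history['loss'][-1]))
    print('Validation loss:\t{:.4f}'.format(history.history['val_loss'][-1]))
    return history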

Evaluate the model


In [24]:
y_pred_test = model.predict(x_test, verbose=0)
helper.regression_scores(y_test, y_pred_test, return_dataframe=True, index="DNN")


Out[24]:
Loss R2 Score
DNN 0.12 0.72
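helper.regression_scores presumably reports the MSE loss (computed on the scaled targets) and the R2 score. An equivalent computation with scikit-learn, using a simplified signature, would be:

import pandas as pd
from sklearn.metrics import mean_squared_error, r2_score

def regression_scores(y_true, y_pred, index='model'):
    """MSE loss and R2 score as a one-row DataFrame (sketch of the helper's output)."""
    return pd.DataFrame({'Loss': [mean_squared_error(y_true, y_pred)],
                         'R2 Score': [r2_score(y_true, y_pred)]}, index=[index])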

Make predictions


In [25]:
fig, ax = plt.subplots(figsize=(14, 5))

mean, std = scale_param['cnt']
predictions = y_pred_test * std + mean
ax.plot(predictions[:, 0], label='Prediction')
ax.plot((test['cnt'] * std + mean).values, label='Data')
ax.set_xlim(right=len(predictions))
ax.legend()

dates = pd.to_datetime(df.iloc[test.index]['dteday'])
dates = dates.apply(lambda d: d.strftime('%b %d'))
ax.set_xticks(np.arange(len(dates))[12::24])
_ = ax.set_xticklabels(dates[12::24], rotation=45)


The model seems quite accurate considering that only two years of data were available.

It fails on the last 10 days of December, where it predicts more bike riders than actually showed up.

The model was not trained to anticipate this drop: the training set only included data from December 22 to December 31 of a single year (2011), which is not enough. An exploratory analysis and some tests with different models led me to the following results and conclusions:

  • Adding more features from the dataset has a negligible impact on the accuracy of the model, only increasing the size of the neural network.

  • Removing or replacing the current features makes the model worse.

  • The training period December 22 to December 31 of 2011 had more registered riders (mean = 73.6) than the same period in the 2012 test set (mean = 58.3). Part of the ridership drop around Christmas 2012 can be predicted from the weather (worse than in 2011), but not the large decline actually recorded. Adding new features, such as the number of active registrations or a Christmas-holiday indicator, could help address this issue.

Compare with classical ML


In [26]:
# restore the full training set (merge the validation set back in)
x_train = np.vstack([x_train, x_val])
y_train = np.vstack([y_train, y_val])

In [27]:
helper.ml_regression(x_train, y_train[:,0], x_test, y_test[:,0])


Linear
Bayesian Ridge
Decision Tree
KNeighbors
AdaBoost
Random Forest
Out[27]:
Time (s) Test loss Test R2 score
Random Forest 8.4 0.16 0.76
KNeighbors 0.1 0.31 0.54
Bayesian Ridge 0.0 0.35 0.47
Linear 0.0 0.35 0.46
Decision Tree 0.2 0.66 0.00
AdaBoost 0.9 0.72 -0.09
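helper.ml_regression fits a battery of classical scikit-learn regressors and scores them on the test set. A possible sketch with default settings (the actual model configurations used by the helper are unknown):

import time
import pandas as pd
from sklearn.linear_model import LinearRegression, BayesianRidge
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

def ml_regression(x_train, y_train, x_test, y_test):
    """Train several classical regressors and compare them on the test set (sketch)."""
    models = {'Linear': LinearRegression(),
              'Bayesian Ridge': BayesianRidge(),
              'Decision Tree': DecisionTreeRegressor(random_state=0),
              'KNeighbors': KNeighborsRegressor(),
              'AdaBoost': AdaBoostRegressor(random_state=0),
              'Random Forest': RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=0)}
    rows = {}
    for name, reg in models.items():
        print(name)
        start = time.time()
        reg.fit(x_train, y_train)
        y_pred = reg.predict(x_test)
        rows[name] = {'Time (s)': round(time.time() - start, 1),
                      'Test loss': mean_squared_error(y_test, y_pred),
                      'Test R2 score': r2_score(y_test, y_pred)}
    return pd.DataFrame(rows).T.sort_values('Test R2 score', ascending=False)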

Best tree-based model


In [28]:
from sklearn.ensemble import RandomForestRegressor

random_forest = RandomForestRegressor(
    n_jobs=-1, n_estimators=100, random_state=0).fit(x_train, np.ravel(y_train[:, 0]))

y_pred = random_forest.predict(x_test)

helper.regression_scores(y_test[:, 0], y_pred, return_dataframe=True, index="Random Forest")


Out[28]:
Loss R2 Score
Random Forest 0.16 0.76

Feature importances


In [29]:
results = helper.feature_importances(model_features, random_forest)
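helper.feature_importances presumably reads the fitted estimator's feature_importances_ attribute and ranks the model inputs. A simple equivalent (sketch):

import pandas as pd

def feature_importances(features, estimator, top=10):
    """Rank the input features by the estimator's importances and plot the top ones (sketch)."""
    importances = pd.Series(estimator.feature_importances_, index=features)
    importances = importances.sort_values(ascending=False)
    importances.head(top).plot(kind='barh')
    return importances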