Daily bike rental ridership prediction using an artificial neural network in Keras
Supervised Learning. Regression
Based on the first neural network project from Udacity's Deep Learning Nanodegree Foundation
Click here to see my original NumPy solution
In [1]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import helper
import keras
helper.info_gpu()
sns.set_palette("Reds")
helper.reproducible(seed=0)  # set up reproducible results from run to run with Keras
%matplotlib inline
%load_ext autoreload
%autoreload 2
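The source of helper is not included here. helper.reproducible presumably fixes all the random seeds Keras depends on; a minimal sketch of such a function (the actual implementation may differ) could be:
import random
import tensorflow as tf

def reproducible(seed=0):
    # Hypothetical sketch: fix the seeds of Python, NumPy and TensorFlow
    os.environ['PYTHONHASHSEED'] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    tf.set_random_seed(seed)  # tf.random.set_seed(seed) in TensorFlow 2.x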
In [2]:
data_path = 'data/Bike-Sharing-Dataset/hour.csv'
target = ['cnt', 'casual', 'registered']
df = pd.read_csv(data_path)
In [3]:
helper.info_data(df, target)
In [4]:
df.head(3)
Out[4]:
In [5]:
df.describe(percentiles=[0.5])
Out[5]:
In [6]:
helper.missing(df);
In [7]:
ind = pd.to_datetime(df['dteday'])
df['day'] = pd.DatetimeIndex(ind).day  # extract the day of the month from the date
In [8]:
droplist = ['atemp', 'day', 'instant']  # 'atemp' is highly correlated with 'temp'; 'instant' is just a record index
df.drop(droplist, axis='columns', inplace=True)
In [9]:
numerical = ['temp', 'hum', 'windspeed', 'cnt', 'casual', 'registered']
df = helper.classify_data(df, target, numerical)
helper.get_types(df)
Out[9]:
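helper.classify_data is not shown; presumably it enforces the column types, casting the numerical columns to floats and the remaining features to the 'category' dtype, roughly:
def classify_data(df, target, numerical):
    # Hypothetical sketch: numerical (and target) columns as floats, the rest as categories
    df = df.copy()
    for col in df.columns:
        if col in numerical:
            df[col] = df[col].astype('float32')
        elif col != 'dteday':  # keep the raw date string
            df[col] = df[col].astype('category')
    return df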
This dataset contains the number of riders for each hour of each day from January 1, 2011 to December 31, 2012. The number of riders is split between casual and registered, and summed in the cnt column. We also have information about temperature, humidity, and wind speed, all of which likely affect the number of riders. Below is a plot of the hourly rentals over the first 10 days in the dataset. Weekends have lower overall ridership, and there are spikes during the week when people bike to and from work.
In [10]:
df[:24 * 10].plot(x='dteday', y='cnt');
In [11]:
helper.show_categorical(df[['holiday', 'workingday', 'weathersit']], sharey=True)
In [12]:
helper.show_target_vs_categorical(
    df.drop(['dteday', 'season'], axis='columns'), target, ncols=7)
In [13]:
g = sns.PairGrid(
    df, y_vars='casual', x_vars='registered', size=5, aspect=9 / 4, hue='weathersit')
g.map(sns.regplot, fit_reg=False).add_legend()
g.axes[0, 0].set_ylim(0, 350)
Out[13]:
This plot shows the differences between the number of registered and casual riders for the different weather situations. In very bad weather (weathersit=4), most of the riders are registered.
In [14]:
helper.show_numerical(df, kde=True, ncols=6)
In [15]:
helper.show_target_vs_numerical(df, target, jitter=0.05)
In [16]:
helper.correlation(df, target, figsize=(10,4))
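helper.correlation presumably displays the correlation between the features and the targets; an equivalent sketch using pandas and seaborn:
# Hypothetical equivalent: correlation of the numeric columns with the targets
corr = df[numerical].corr()[target]
fig, ax = plt.subplots(figsize=(10, 4))
sns.heatmap(corr.T, annot=True, fmt='.2f', ax=ax)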
In [17]:
droplist = ['dteday']  # features to drop from the model
# use a copy ('data') for the model, keeping the original 'df' intact
data = df.copy()
data.drop(droplist, axis='columns', inplace=True)
data.head(3)
Out[17]:
In [18]:
data, scale_param = helper.scale(data)
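helper.scale standardizes the numeric columns and returns the scaling parameters, which are needed later to convert the predictions back to rider counts; a sketch, assuming a dict of (mean, std) tuples per column:
def scale(df):
    # Hypothetical sketch: standardize numeric columns to zero mean and unit variance
    df = df.copy()
    scale_param = {}
    for col in df.select_dtypes(include=[np.number]).columns:
        mean, std = df[col].mean(), df[col].std()
        df[col] = (df[col] - mean) / std
        scale_param[col] = (mean, std)
    return df, scale_param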
In [19]:
data, dict_dummies = helper.replace_by_dummies(data, target)
model_features = [f for f in data if f not in target]  # neural network input features (all columns except the targets)
data.head(3)
Out[19]:
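helper.replace_by_dummies presumably one-hot encodes every categorical feature except the targets, returning the mapping of created dummy columns; roughly:
def replace_by_dummies(df, target):
    # Hypothetical sketch: one-hot encode the categorical (non-target) columns
    dummies = {}
    for col in df.select_dtypes(include=['category']).columns:
        if col not in target:
            new = pd.get_dummies(df[col], prefix=col)
            dummies[col] = list(new.columns)
            df = pd.concat([df.drop(col, axis='columns'), new], axis='columns')
    return df, dummies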
In [20]:
# Save the last 21 days as a test set
test = data[-21 * 24:]
train = data[:-21 * 24]
# Hold out the last 60 days of the remaining data as a validation set
val = train[-60 * 24:]
train = train[:-60 * 24]
# Separate the data into features(x) and targets(y)
x_train, y_train = train.drop(target, axis=1).values, train[target].values
x_val, y_val = val.drop(target, axis=1).values, val[target].values
x_test, y_test = test.drop(target, axis=1).values, test[target].values
print("train size \t X:{} \t Y:{}".format(x_train.shape, y_train.shape))
print("val size \t X:{} \t Y:{}".format(x_val.shape, y_val.shape))
print("test size \t X:{} \t Y:{} ".format(x_test.shape, y_test.shape))
In [21]:
model = helper.build_nn_reg(
    x_train.shape[1], y_train.shape[1], hidden_layers=2, dropout=0.2, summary=True)
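build_nn_reg is defined in helper; a minimal Keras sketch consistent with the call above (two hidden ReLU layers with dropout and a linear output trained on mean squared error; the layer width is an assumption):
from keras.models import Sequential
from keras.layers import Dense, Dropout

def build_nn_reg(input_size, output_size, hidden_layers=2, dropout=0.2, summary=False):
    # Hypothetical sketch: fully connected regression network
    model = Sequential()
    model.add(Dense(64, activation='relu', input_shape=(input_size,)))  # 64 units assumed
    model.add(Dropout(dropout))
    for _ in range(hidden_layers - 1):
        model.add(Dense(64, activation='relu'))
        model.add(Dropout(dropout))
    model.add(Dense(output_size))  # linear activation for regression
    model.compile(loss='mse', optimizer='adam')
    if summary:
        model.summary()
    return model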
In [23]:
model_path = os.path.join("models", "bike_rental.h5")
model = None
model = helper.build_nn_reg(
    x_train.shape[1], y_train.shape[1], hidden_layers=2, dropout=0.2, summary=True)
callbacks = [keras.callbacks.EarlyStopping(monitor='val_loss', patience=2, verbose=0)]
helper.train_nn(
    model,
    x_train,
    y_train,
    validation_data=[x_val, y_val],
    path=model_path,
    epochs=100,
    batch_size=2048,
    callbacks=callbacks)
from sklearn.metrics import r2_score
ypred_train = model.predict(x_train)
ypred_val = model.predict(x_val)
print('\nTraining R2-score: \t{:.3f}'.format(r2_score(y_train, ypred_train)))
print('Validation R2-score: \t{:.3f}'.format(r2_score(y_val, ypred_val)))
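helper.train_nn presumably wraps model.fit, saving the best weights to path via a checkpoint; something along these lines:
def train_nn(model, x, y, validation_data=None, path=None,
             epochs=100, batch_size=2048, callbacks=None):
    # Hypothetical sketch: fit with a checkpoint keeping the best validation loss
    callbacks = list(callbacks) if callbacks else []
    if path:
        callbacks.append(
            keras.callbacks.ModelCheckpoint(path, monitor='val_loss', save_best_only=True))
    return model.fit(x, y, validation_data=tuple(validation_data),
                     epochs=epochs, batch_size=batch_size,
                     callbacks=callbacks, verbose=0)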
In [24]:
y_pred_test = model.predict(x_test, verbose=0)
helper.regression_scores(y_test, y_pred_test, return_dataframe=True, index="DNN")
Out[24]:
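helper.regression_scores presumably gathers the usual regression metrics into a one-row dataframe; an equivalent sketch with scikit-learn:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_scores(y_true, y_pred, return_dataframe=False, index='model'):
    # Hypothetical sketch: R2, MAE and RMSE in a single row
    scores = {'R2': r2_score(y_true, y_pred),
              'MAE': mean_absolute_error(y_true, y_pred),
              'RMSE': np.sqrt(mean_squared_error(y_true, y_pred))}
    return pd.DataFrame(scores, index=[index]) if return_dataframe else scores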
In [25]:
fig, ax = plt.subplots(figsize=(14, 5))
mean, std = scale_param['cnt']
predictions = y_pred_test * std + mean
ax.plot(predictions[:, 0], label='Prediction')
ax.plot((test['cnt'] * std + mean).values, label='Data')
ax.set_xlim(right=len(predictions))
ax.legend()
dates = pd.to_datetime(df.iloc[test.index]['dteday'])
dates = dates.apply(lambda d: d.strftime('%b %d'))
ax.set_xticks(np.arange(len(dates))[12::24])
_ = ax.set_xticklabels(dates[12::24], rotation=45)
The model seems quite accurate considering that only two years of data were available.
It fails on the last 10 days of December, where it predicts more riders than actually appeared.
The model was not trained to predict this drop: the training set covered December 22 to December 31 of one year only (2011), which is not enough. An exploratory analysis and some tests with different models led me to the following conclusions:
Adding more features from the dataset has a negligible impact on the accuracy of the model; it only increases the size of the neural network.
Removing or replacing the current features makes the model worse.
The training period December 22 to December 31 of 2011 had more registered riders (mean = 73.6) than the same period in the 2012 test set (mean = 58.3). A ridership drop around Christmas 2012 can be predicted from the weather (worse than in 2011), but not the large decline actually observed. Adding new features, such as the number of active registrations or a Christmas indicator, could help.
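The December means quoted above can be checked directly on the raw dataframe (df keeps the unscaled values); for example:
# Mean hourly registered riders during December 22-31, per year
dates = pd.to_datetime(df['dteday'])
mask = (dates.dt.month == 12) & (dates.dt.day >= 22)
df[mask].groupby(dates[mask].dt.year)['registered'].mean()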
In [26]:
# restore the full training set (train + validation) for the scikit-learn models
x_train = np.vstack([x_train, x_val])
y_train = np.vstack([y_train, y_val])
In [27]:
helper.ml_regression(x_train, y_train[:,0], x_test, y_test[:,0])
Out[27]:
In [28]:
from sklearn.ensemble import RandomForestRegressor
random_forest = RandomForestRegressor(
    n_jobs=-1, n_estimators=100, random_state=0).fit(x_train, np.ravel(y_train[:, 0]))
y_pred = random_forest.predict(x_test)
helper.regression_scores(y_test[:, 0], y_pred, return_dataframe=True, index="Random Forest")
Out[28]:
In [29]:
results = helper.feature_importances(model_features, random_forest)
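helper.feature_importances presumably pairs random_forest.feature_importances_ with the feature names and returns them sorted; for instance:
def feature_importances(features, estimator):
    # Hypothetical sketch: feature importances sorted in descending order
    imp = pd.Series(estimator.feature_importances_, index=features).sort_values(ascending=False)
    imp.plot(kind='bar', figsize=(12, 4))
    return imp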