In [ ]:
"""This area sets up the Jupyter environment.
Please do not modify anything in this cell.
"""
import os
import sys
import time
# Add project to PYTHONPATH for future use
sys.path.insert(1, os.path.join(sys.path[0], '..'))
# Import miscellaneous modules
from IPython.core.display import display, HTML
# Set CSS styling
with open('../admin/custom.css', 'r') as f:
style = """<style>\n{}\n</style>""".format(f.read())
display(HTML(style))
Public bike-sharing systems are a new generation of traditional bike rentals in which the whole process, from membership to rental and return, has become automatic. Through these systems, a user can easily rent a bicycle at one location and return it at another. Currently, there are about 500 bike-sharing systems around the world, comprising over 500 thousand bicycles. Today, there is great interest in these systems due to their important role in traffic, environmental, and health issues.
Apart from the interesting real-world applications of bike-sharing systems, the data they generate makes them attractive for research as well. As opposed to other transport services, such as buses or the subway, the duration of travel and the departure and arrival positions are explicitly recorded. This feature turns a bike-sharing system into a virtual sensor network that can be used for sensing mobility in a city. Hence, it is expected that significant events in a city could be detected by monitoring these data.
The bike-sharing rental process is highly correlated with environmental and seasonal settings. For instance, weather conditions, precipitation, day of the week, season, and hour of the day can all affect rental behaviour. The core dataset is a two-year historical log (2011 and 2012) from the Capital Bikeshare system (Washington D.C., USA), which is publicly available at http://capitalbikeshare.com/system-data. The data was aggregated both hourly and daily and then combined with weather and seasonal information; the weather information was extracted from http://www.freemeteo.com.
We have already standardised some of the features, i.e. zero mean and unit variance.
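For reference, standardising a feature $x$ with mean $\mu$ and standard deviation $\sigma$ amounts to:
$$ \begin{equation*} x' = \frac{x - \mu}{\sigma} \end{equation*} $$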
Predict the daily bicycle rental count based on the environmental and seasonal settings.
day.csv - Bike-sharing counts aggregated on a daily basis (731 days)
Features:
- instance: record index
- dteday : date
- season : season (1: spring, 2: summer, 3: fall, 4: winter)
- yr : year (0: 2011, 1:2012)
- mnth : month ( 1 to 12)
- holiday : whether the day is a holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule)
- weekday : day of the week
- workingday : 1 if the day is neither a weekend nor a holiday, otherwise 0
- weathersit : weather situation
  - 1: Clear, Few clouds, Partly cloudy
  - 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
  - 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
  - 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
- temp : Normalized temperature in Celsius. The values are divided by 41 (max)
- atemp : Normalized feeling temperature in Celsius. The values are divided by 50 (max)
- hum : Normalized humidity. The values are divided by 100 (max)
- windspeed : Normalized wind speed. The values are divided by 67 (max)
- casual: count of casual users
- registered: count of registered users
- cnt: count of total rental bikes including both casual and registered
This dataset was created and preprocessed in:
[1] Fanaee-T, Hadi, and Gama, Joao, "Event labeling combining ensemble detectors and background knowledge", Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg, doi:10.1007/s13748-013-0040-3.
In [ ]:
# Plots will be shown inside the notebook
%matplotlib notebook
import matplotlib.pyplot as plt
# High-level package for creating and training artificial neural networks
import keras
# NumPy is a package for manipulating N-dimensional array objects
import numpy as np
# Pandas is a data analysis package
import pandas as pd
import admin.tools as tools
import problem_unittests as tests
Load features for training:
In [ ]:
train_features = tools.load_csv_with_dates('resources/bike_training_features.csv', 'dteday')
Load targets for training:
In [ ]:
train_targets = tools.load_csv_with_dates('resources/bike_training_targets.csv', 'dteday')
Load features for testing:
In [ ]:
test_features = tools.load_csv_with_dates('resources/bike_test_features.csv', 'dteday')
Load targets for testing:
In [ ]:
test_targets = tools.load_csv_with_dates('resources/bike_test_targets.csv', 'dteday')
test_dates = test_targets.index.strftime('%b %d')
print('\n', test_targets.head(n=5))
Unpack the Pandas DataFrames to NumPy arrays:
In [ ]:
# Unpack features
X_train = train_features.values
X_test = test_features.values
# Unpack targets
y_train = train_targets['cnt'].values
y_test = test_targets['cnt'].values
# Record number of inputs and outputs
nb_features = X_train.shape[1]
nb_outputs = 1
Now, using Keras, we will build a multivariate regression model. Remember, these kinds of models can be represented as artificial neural networks, which is why we can implement them using Keras.
The model, an artificial neural network, will consist of a $d$-dimensional input that is fully- or densely-connected to a single output neuron.
The model will be made using the Keras functional API, which allows us to create complex models with an arbitrary number of input and output neurons. Below is some example code for how to set up a simple model using this API with 32 inputs and 4 outputs:
from keras.models import Model
from keras.layers import Input, Dense
a = Input(shape=(32,))
b = Dense(4)(a)  # the Dense layer is called on the input tensor
model = Model(inputs=a, outputs=b)
Notice how this is the same setup we used in the previous notebook on linear regression. Make sure to revisit that notebook if you have trouble understanding the basic usage of this API.
In [ ]:
# Import what we need
from keras.layers import (Input, Dense)
from keras.models import Model
def simple_model(nb_inputs, nb_outputs):
    """Return a Keras Model with nb_inputs inputs densely connected to nb_outputs outputs.
    """
    # TODO: build and return the model using Input, Dense, and Model
    model = None
    return model
### Do *not* modify the following line ###
# Test and see that the model has been created correctly
tests.test_simple_model(simple_model)
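If you get stuck, below is a minimal sketch of one possible completion, using the same functional API pattern as the example above; the name `simple_model_sketch` is illustrative, and your own solution may differ.
def simple_model_sketch(nb_inputs, nb_outputs):
    """A possible completion (sketch): one dense layer mapping inputs to outputs."""
    # Define an input tensor with nb_inputs features
    inputs = Input(shape=(nb_inputs,))
    # Densely connect the input to nb_outputs output neurons
    outputs = Dense(nb_outputs)(inputs)
    # Wrap the input and output tensors in a Model
    return Model(inputs=inputs, outputs=outputs)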
As opposed to standard model parameters, such as the weights in a linear model, hyperparameters are user-specified parameters not learned by the training process, i.e. they are specified a priori. In the following section we will look at how we can define and evaluate a few different hyperparameters relevant to our previously defined model. The hyperparameters we will take a look at are: the learning rate, the number of epochs, and the batch size.
One of the ultimate goals of machine learning is for our models to generalise well. That is, we would like the performance of our model on the data we have trained on, i.e. the in-sample error, to be representative of the performance of our model on the data we are attempting to model, i.e. the out-of-sample error. Unfortunately, for most problems we are unable to test our model on all possible data that we have not trained on. This might be due to difficulties gathering new data or simply because the amount of possible data is very large.
For this reason, we have to settle for a different solution when we want to evaluate our trained models. The go-to solution is to gather a second set of data, in addition to the training set, called a test set. For the test set to be useful it is important that it is representative of the data we have not trained on. In other words, the error we get on the test set should be close to the out-of-sample error.
Selecting appropriate hyperparameters can be seen as a sort of meta-optimisation task on top of the learning task. Now, we could train a model several times, alter some hyperparameters each time, and record the final performance on the test set; however, this will likely yield errors that are overly optimistic. This is because looking at the test set when making learning choices, i.e. selecting hyperparameters, introduces bias and causes the estimated out-of-sample error to diverge from the true out-of-sample error. Remember, this is the reason why we have a test set in the first place.
The solution to this problem is to create a third set: the validation set. This is typically a partition of the training set, although there exist several cross-validation methodologies for how to create and use validation sets efficiently. By having this third set we can: (i) use the training set to train the trainable model parameters, (ii) use the validation set to select hyperparameters, and (iii) use the test set to estimate the out-of-sample error. This split ensures that the test set remains unbiased. A concrete holdout split is sketched below.
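As an illustration, here is a minimal sketch of a holdout split into training, validation, and test partitions, assuming NumPy arrays `X` and `y` of equal length (the fractions and function name are arbitrary examples, and `np` is the NumPy alias imported above). Note that Keras can carve out the validation partition for us via the `validation_split` argument to `fit()`, which is what we use later in this notebook.
def three_way_split(X, y, val_frac=0.1, test_frac=0.2, seed=0):
    """Shuffle and partition data into train/validation/test sets (sketch)."""
    rng = np.random.RandomState(seed)
    idx = rng.permutation(len(X))
    nb_test = int(len(X) * test_frac)
    nb_val = int(len(X) * val_frac)
    test_idx = idx[:nb_test]
    val_idx = idx[nb_test:nb_test + nb_val]
    train_idx = idx[nb_test + nb_val:]
    return ((X[train_idx], y[train_idx]),
            (X[val_idx], y[val_idx]),
            (X[test_idx], y[test_idx]))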
As we saw in the previous notebook, the learning rate is an important hyperparameter that decides how large a step we take in the negative gradient direction during gradient descent-based optimisation.
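Concretely, for trainable weights $\mathbf{w}$, error $E$, and learning rate $\eta$, a single gradient descent update takes the form:
$$ \begin{equation*} \mathbf{w} \leftarrow \mathbf{w} - \eta \nabla_{\mathbf{w}} E(\mathbf{w}) \end{equation*} $$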
In order to select a good learning rate it is paramount that we track the error / loss / cost during training after each application of the gradient descent update rule. Below is a cartoon diagram illustrating the loss over the course of training. The shape of the error curve as training progresses can give a good indication as to what constitutes a good learning rate.
Validation error refers to the error taken over a validation set on the current model.
In artificial neural network terminology, one epoch typically means that every example in the training set has been seen once by the learning algorithm. It is generally preferable to track the number of epochs as opposed to the number of iterations, i.e. applications of an update rule, because the latter depends on the batch size.
In the literature, iteration is sometimes used synonymously with epoch.
As we saw in the previous notebook, we typically sum over multiple examples for a single application of an update rule. The number of examples we include is the batch size.
The batch size allows us to control how much memory we need during training because we only need to sample the examples for a single batch at a time. This is important when the entire dataset cannot fit in memory. The important thing to keep in mind when it comes to batch size is that the smaller the batch size, the less accurate the estimate of the gradient over the training set will be. In other words, the moves made by the update rule in the space of all trainable parameters become noisier the smaller the batch size is. The arithmetic relating epochs, iterations, and batch size is sketched below.
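To make the relationship concrete, here is a small sketch; the number of training examples is a made-up figure for illustration only:
import math

nb_examples = 600  # hypothetical training set size (illustration only)
batch_size = 10
# One iteration = one application of the update rule on a single batch;
# one epoch = one full pass over the training set.
iterations_per_epoch = math.ceil(nb_examples / batch_size)
print(iterations_per_epoch)  # 60 update steps per epoch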
Make sure you understand most of the code below before you continue.
In [ ]:
"""Do not modify the following code. It is to be used as a refence for future tasks.
"""
# Create a simple model
model = simple_model(nb_features, nb_outputs)
#
# Define hyperparameters
#
lr = 0.2
nb_epochs = 10
batch_size = 10
# Fraction of the training data held as a validation set
validation_split = 0.1
# Define optimiser
optimizer = keras.optimizers.SGD(lr=lr)
# Compile model, use mean squared error
model.compile(loss='mean_squared_error', optimizer=optimizer)
# Print model
model.summary()
# Train and record history
logs = model.fit(X_train, y_train,
batch_size=batch_size,
epochs=nb_epochs,
validation_split=validation_split,
verbose=2)
# Plot the error
fig, ax = plt.subplots(1,1)
pd.DataFrame(logs.history).plot(ax=ax)
ax.grid(linestyle='dotted')
ax.legend()
plt.show()
# Estimation on unseen data can be done using the `predict()` function, e.g.:
_y = model.predict(X_test)
In this task you will get the opportunity to play with the hyperparameters we discussed in the previous section.
In [ ]:
# Create a simple model
model = None
#
# Define hyperparameters
#
lr = 0.2
nb_epochs = 10
batch_size = 10
# Fraction of the training data held as a validation set
validation_split = 0.1
# Define optimiser
# Compile model, use mean squared error
### Do *not* modify the following lines ###
# Print model
model.summary()
# Train our network and do live plots of loss
tools.assess_multivariate_model(model, X_train, y_train, X_test, y_test,
test_dates, nb_epochs, batch_size,
validation_split
)
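If you are unsure how to fill in the missing lines, the reference cell above shows the pattern; a minimal sketch, using the same hyperparameter names, might look like the following (choosing good values is the actual exercise):
# Create the model (sketch)
model = simple_model(nb_features, nb_outputs)
# Define optimiser (sketch)
optimizer = keras.optimizers.SGD(lr=lr)
# Compile model, use mean squared error (sketch)
model.compile(loss='mean_squared_error', optimizer=optimizer)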
Regularisation is any modification made to a learning algorithm intended to reduce the generalisation error, i.e. the expected value of the error on an unseen example, but not the training error. Typically, this is interpreted as adjusting the complexity of the model by adding a regularisation term, or regulariser, to the error function that we minimise:
$$ \begin{equation*} \min_{h}\sum_{i=1}^{N}E(h(\mathbf{x}_i), y_i) + \lambda R(h) \end{equation*} $$
where $h$ is a hypothesis, $E$ is an error function, $R$ is the regulariser, and $\lambda$ is a parameter for controlling the strength of the aforementioned regulariser. There are other ways to control the model complexity as well, such as noise injection, data augmentation, and early stopping, but in this notebook we will focus on the type above.
In case you want to review regularisation, you can refer to the following material:
$L^2$ regularisation, otherwise known as weight decay, ridge regression, or Tikhonov regularisation, is a popular form of regularisation that penalises the squared norm of the model parameters. This is done by letting $R(h) = \frac{1}{2}\lVert\mathbf{w}\rVert_{2}^{2}$, which drives the weights towards the origin. Any point could be selected as the target, but the origin is a good choice if we do not know the correct value. Multiplying by a factor of $\frac{1}{2}$ simplifies the gradient of $R(h)$.
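In Keras, an $L^2$ penalty can be attached to a layer's weights through the `kernel_regularizer` argument; below is a minimal sketch, where the input dimension and regularisation factor are arbitrary example values:
from keras import regularizers
from keras.layers import Input, Dense
from keras.models import Model

inputs = Input(shape=(10,))  # 10 input features (example value)
# Penalise the squared L2 norm of this layer's weights
outputs = Dense(1, kernel_regularizer=regularizers.l2(0.01))(inputs)
model_l2 = Model(inputs=inputs, outputs=outputs)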
In [ ]:
# Import what we need
from keras import regularizers
def simple_model_l2(nb_inputs, nb_outputs, reg_factor):
    """Return an L2-regularised Keras Model.
    """
    # TODO: build and return the model, attaching an L2 regulariser to the Dense layer
    model = None
    return model
### Do *not* modify the following line ###
# Test and see that the model has been created correctly
tests.test_simple_model_regularized(simple_model_l2)
Now, with this model, let's try to optimise the regularisation factor $\lambda$. This adjusts the strength of the regulariser.
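Regularisation factors are usually explored on a logarithmic scale; the range below is only an example, not a recommendation:
# Candidate lambda values spaced evenly on a log scale (example range)
candidate_factors = np.logspace(-5, -1, num=5)  # 1e-05, 1e-04, ..., 1e-01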
In [ ]:
# Regularization factor (lambda)
reg_factor = 0.005
# Create a simple model
model = None
#
# Define hyperparameters
#
lr = 0.0005
nb_epochs = 100
batch_size = 128
# Fraction of the training data held as a validation set
validation_split = 0.1
# Define optimiser
# Compile model, use mean squared error
### Do *not* modify the following lines ###
# Print model
model.summary()
# Train our network and do live plots of loss
tools.assess_multivariate_model(model, X_train, y_train, X_test, y_test,
test_dates, nb_epochs, batch_size,
validation_split
)